Hardly a day goes by without impressive new robotic platforms emerging from academic labs and commercial startups worldwide. Humanoid robots in particular look increasingly capable of assisting us in factories and eventually in homes and hospitals. Yet, for these machines to be truly useful, they need sophisticated “brains” to control their robotic bodies. Traditionally, programming robots involves experts spending countless hours meticulously scripting complex behaviors and exhaustively tuning parameters, such as controller gains or motion-planning weights, to achieve desired performance. While machine learning (ML) techniques show promise, robots that need to learn complex new behaviors still require substantial human oversight and reengineering. At Google DeepMind, we asked ourselves: How do we enable robots to learn and adapt more holistically and continuously, reducing the bottleneck of expert intervention for every significant improvement or new skill?
This question has been a driving force behind our robotics research. We are exploring paradigms where two robotic agents playing against each other can achieve a greater degree of autonomous self-improvement, moving beyond systems that are merely preprogrammed with fixed or narrowly adaptive ML models toward agents that can learn a broad range of skills on the job. Building on our previous work in ML with systems like AlphaGo and AlphaFold, we turned our attention to the demanding sport of table tennis as a testbed.
We chose table tennis precisely because it encapsulates many of the hardest challenges in robotics within a constrained, yet highly dynamic, environment. Table tennis requires a robot to master a confluence of difficult skills: Beyond just perception, it demands exceptionally precise control to intercept the ball at the correct angle and velocity and involves strategic decision-making to outmaneuver an opponent. These elements make it an ideal domain for developing and evaluating robust learning algorithms that can handle real-time interaction, complex physics, high-level reasoning and the need for adaptive strategies—capabilities that are directly transferable to applications like manufacturing and even potentially unstructured home settings.
The Self-Improvement Challenge
Standard machine learning approaches often fall short when it comes to enabling continuous, autonomous learning. Imitation learning, where a robot learns by mimicking an expert, typically requires us to provide vast numbers of human demonstrations for every skill or variation; this reliance on expert data collection becomes a significant bottleneck if we want the robot to continually learn new tasks or refine its performance over time. Similarly, reinforcement learning, which trains agents through trial-and-error guided by rewards or punishments, often necessitates that human designers meticulously engineer complex mathematical reward functions to precisely capture desired behaviors for multifaceted tasks, and then adapt them as the robot needs to improve or learn new skills, limiting scalability. In essence, both of these well-established methods traditionally involve substantial human involvement, especially if the goal is for the robot to continually self-improve beyond its initial programming. Therefore, we posed a direct challenge to our team: Can robots learn and enhance their skills with minimal or no human intervention during the learning-and-improvement loop?
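To make the reward-engineering bottleneck concrete, here is a minimal sketch of the kind of hand-shaped reward a designer might write for a single table-tennis return. The terms and weights are illustrative assumptions, not ones used in this project; each new skill or behavior change would typically demand its own terms and fresh tuning, which is exactly the human effort we want to avoid.

```python
def return_shot_reward(ball_landed_in: bool, miss_distance: float,
                       paddle_ball_distance: float, joint_effort: float) -> float:
    """Illustrative hand-shaped reward for returning one ball.
    All weights below are assumptions for the sake of example."""
    reward = 0.0
    reward += 10.0 if ball_landed_in else -1.0   # sparse success bonus
    reward += -2.0 * miss_distance               # shaping: land near the target spot
    reward += -0.5 * paddle_ball_distance        # shaping: get the paddle to the ball
    reward += -0.01 * joint_effort               # discourage jerky, high-effort motion
    return reward
```

Every coefficient here encodes a designer's judgment, and asking the robot to learn a new skill (say, a serve instead of a return) means rewriting and re-tuning the whole function.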
Learning Through Competition: Robot vs. Robot
One innovative approach we explored mirrors the strategy used for AlphaGo: Have agents learn by competing against themselves. We experimented with having two robot arms play table tennis against each other, an idea that is simple yet powerful. As one robot discovers a better strategy, its opponent is forced to adapt and improve, creating a cycle of escalating skill levels.
To enable the extensive training needed for these paradigms, we engineered a fully autonomous table-tennis environment. The setup supported continuous operation, with automated ball collection as well as remote monitoring and control, letting us run experiments for extended periods without direct human involvement. As a first step, we successfully trained a robot agent (replicated independently on both robots) using reinforcement learning in simulation to play cooperative rallies. We fine-tuned the agent for a few hours in the real-world robot-versus-robot setup, resulting in a policy capable of holding long rallies. We then switched to tackling competitive robot-versus-robot play.
Out of the box, the cooperative agent didn’t work well in competitive play. This was expected: in cooperative play, rallies settle into a narrow zone, limiting the distribution of balls the agent learns to hit back. Our hypothesis was that if we continued training with competitive play, this distribution would slowly expand as we rewarded each robot for beating its opponent. While promising, training systems through competitive self-play in the real world presented significant hurdles. The expansion of the shot distribution turned out to be rather drastic given the limited model size. Essentially, the model struggled to handle the new shots without forgetting the old ones, and we quickly hit a local minimum in training where, after a short rally, one robot would hit an easy winner that the other could not return.
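For intuition, here is a hedged sketch of that reward switch, contrasting a cooperative keep-the-rally-alive signal with a zero-sum competitive one. The exact reward terms used in the project are not public, so treat these as assumptions.

```python
def cooperative_reward(ball_returned: bool) -> float:
    """Both robots are rewarded simply for keeping the rally going,
    which lets play collapse into a narrow, comfortable shot distribution."""
    return 1.0 if ball_returned else 0.0

def competitive_reward(won_point: bool, lost_point: bool) -> float:
    """Zero-sum signal: each robot is rewarded only for beating its opponent,
    which pushes it to produce shots outside the cooperative distribution."""
    if won_point:
        return 1.0
    if lost_point:
        return -1.0
    return 0.0  # rally still in progress
```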
While robot-on-robot competitive play has remained a tough nut to crack, our team also investigated how the robot could play against humans competitively. In the early stages of training, humans did a better job of keeping the ball in play, thus increasing the distribution of shots the robot could learn from. We still had to develop a policy architecture consisting of low-level controllers, each with a detailed skill descriptor, and a high-level controller that chooses among them (sketched below), along with zero-shot sim-to-real techniques that let our system adapt to unseen opponents in real time. In a user study, while the robot lost all of its matches against the most advanced players, it won all of its matches against beginners and about half of its matches against intermediate players, demonstrating solidly amateur human-level performance. Equipped with these innovations, plus a better starting point than cooperative play, we are in a great position to return to robot-versus-robot competitive training and continue scaling rapidly.
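Conceptually, that architecture can be pictured as a high-level selector ranking low-level skills by their descriptors. The class names, descriptor fields, and scoring rule below are hypothetical stand-ins for illustration, not the actual system.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SkillDescriptor:
    """Metadata a low-level skill exposes; fields here are assumed examples."""
    name: str
    hit_rate: float               # estimated probability of returning the ball
    preferred_ball_speed: float   # incoming speed (m/s) this skill handles well

class LowLevelSkill:
    """A trained controller (e.g. forehand topspin) mapping ball state to joint commands."""
    def __init__(self, descriptor: SkillDescriptor):
        self.descriptor = descriptor

    def act(self, ball_state: np.ndarray) -> np.ndarray:
        # Placeholder: a real skill would be a learned policy network.
        return np.zeros(7)  # joint commands for an assumed 7-DoF arm

class HighLevelController:
    """Chooses which low-level skill to run for the incoming ball."""
    def __init__(self, skills: list[LowLevelSkill]):
        self.skills = skills

    def select(self, ball_state: np.ndarray) -> LowLevelSkill:
        ball_speed = float(np.linalg.norm(ball_state[3:6]))
        # Score each skill by its descriptor; a learned selector could also
        # adapt these scores online to exploit an opponent's weaknesses.
        def score(skill: LowLevelSkill) -> float:
            d = skill.descriptor
            return d.hit_rate - abs(d.preferred_ball_speed - ball_speed)
        return max(self.skills, key=score)

# Example: pick a skill for a fast incoming ball (position x, y, z and velocity vx, vy, vz).
skills = [
    LowLevelSkill(SkillDescriptor("forehand_loop", hit_rate=0.9, preferred_ball_speed=5.0)),
    LowLevelSkill(SkillDescriptor("backhand_block", hit_rate=0.8, preferred_ball_speed=8.0)),
]
controller = HighLevelController(skills)
ball = np.array([0.5, 0.0, 0.3, -7.5, 0.0, 1.0])
chosen = controller.select(ball)
print(chosen.descriptor.name, chosen.act(ball))
```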
The AI Coach: VLMs Enter the Game
A second intriguing idea we investigated leverages the power of vision language models (VLMs), like Gemini. Could a VLM act as a coach, observing a robot player and providing guidance for improvement?
An important insight of this project is that VLMs can be leveraged for explainable robot policy search. Based on this insight, we developed the SAS Prompt (summarize, analyze, synthesize), a single prompt that enables iterative learning and adaptation of robot behavior by leveraging the VLM’s ability to retrieve, reason, and optimize to synthesize new behavior. Our approach can be regarded as an early example of a new family of explainable policy-search methods that are entirely implemented within an LLM. Also, there is no reward function—the VLM infers the reward directly from the observations given in the task description. The VLM can thus become a coach that constantly analyzes the performance of the student and provides suggestions for how to get better.
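A rough sketch of what such a summarize-analyze-synthesize coaching loop could look like is shown below. The prompt wording, the call_vlm helper, and the parameter format are assumptions for illustration, not the published SAS Prompt or the Gemini API.

```python
# Hypothetical SAS-style coaching loop: the VLM summarizes recent episodes,
# analyzes them against the task description (no explicit reward function),
# and synthesizes updated behavior parameters.

SAS_PROMPT = """You are coaching a table-tennis robot.
Task: return as many balls as possible onto the opponent's side of the table.

1. SUMMARIZE the robot's recent episodes from the observations below.
2. ANALYZE what limited performance, judging success directly from the task description.
3. SYNTHESIZE new behavior parameters as JSON, e.g. {{"paddle_angle_deg": ..., "swing_speed": ...}}.

Observations:
{observations}

Current parameters:
{parameters}
"""

def call_vlm(prompt: str) -> dict:
    """Placeholder for a VLM call (e.g. a request to Gemini).
    A real implementation would send the prompt plus video frames and parse
    the JSON the model returns; here we just echo fixed example values."""
    return {"paddle_angle_deg": 25.0, "swing_speed": 3.0}

def coaching_loop(run_episodes, parameters: dict, iterations: int = 5) -> dict:
    """Iteratively refine behavior parameters with the VLM acting as coach."""
    for _ in range(iterations):
        observations = run_episodes(parameters)  # rollouts summarized as text or video
        prompt = SAS_PROMPT.format(observations=observations, parameters=parameters)
        parameters = call_vlm(prompt)             # VLM proposes the next parameters
    return parameters

# Example with a dummy rollout function standing in for the real robot.
print(coaching_loop(lambda p: f"robot missed 6 of 10 balls with {p}",
                    {"paddle_angle_deg": 15.0, "swing_speed": 2.0}))
```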
Toward Truly Learned Robotics: An Optimistic Outlook
Moving beyond the limitations of traditional programming and ML techniques is essential for the future of robotics. Methods enabling autonomous self-improvement, like those we are developing, reduce the reliance on painstaking human effort. Our table-tennis projects explore pathways toward robots that can acquire and refine complex skills more autonomously. While significant challenges persist—stabilizing robot-versus-robot learning and scaling VLM-based coaching are formidable tasks—these approaches offer a unique opportunity. We are optimistic that continued research in this direction will lead to more capable, adaptable machines that can learn the diverse skills needed to operate effectively and safely in our unstructured world. The journey is complex, but the potential payoff of truly intelligent and helpful robotic partners makes it worth pursuing.
The authors express their deepest appreciation to the Google DeepMind Robotics team and in particular David B. D’Ambrosio, Saminda Abeyruwan, Laura Graesser, Atil Iscen, Alex Bewley, and Krista Reymann for their invaluable contributions to the development and refinement of this work.

The post “DeepMind’s Quest for Self-Improving Table Tennis Agents” by Pannag Sanketi was published on 07/21/2025 by spectrum.ieee.org