Boston Dynamics can be forgiven, I think, for the relative lack of acrobatic prowess displayed by the new version of Atlas in (most of) its latest videos. In fact, if you look at this Atlas video from late last year, and compare it to Atlas’ most recent video, it’s doing what looks to be more or less the same logistics-y stuff—all of which is far less visually exciting than backflips.
But I would argue that the relatively dull tasks Atlas is working on now, moving car parts and totes and whatnot, are just as impressive. Making a humanoid that can consistently and economically and safely do useful things over the long term could very well be the hardest problem in robotics right now, and Boston Dynamics is taking it seriously.
Last October, Boston Dynamics announced a partnership with Toyota Research Institute with the goal of general-purpose-izing Atlas. We’re now starting to see the results of that partnership, and Boston Dynamics’ vice president of robotics research, Scott Kuindersma, takes us through the progress they’ve made.
Building AI Generalist Robots
While the context of this work is “building AI generalist robots,” I’m not sure that anyone really knows what a “generalist robot” would actually look like, or even how we’ll know when someone has achieved it. Humans are generalists, sort of—we can potentially do a lot of things, and we’re fairly adaptable and flexible in many situations, but we still require training for most tasks. I bring this up just to try and contextualize expectations, because I think a successful humanoid robot doesn’t have to actually be a generalist, but instead just has to be capable of doing several different kinds of tasks, and to be adaptable and flexible in the context of those tasks. And that’s already difficult enough.
The approach that the two companies are taking is to leverage large behavior models (LBMs), which combine more general world knowledge with specific task knowledge to help Atlas with that adaptability and flexibility thing. As Boston Dynamics points out in a recent blog post, “the field is steadily accumulating evidence that policies trained on a large corpus of diverse task data can generalize and recover better than specialist policies that are trained to solve one or a small number of tasks.” Essentially, the goal is to develop a foundational policy that covers things like movement and manipulation, and then add more specific training (provided by humans) on top of that for specific tasks. This video below shows how that’s going so far.
What the video doesn’t show is the training system that Boston Dynamics uses to teach Atlas to do these tasks. The approach is essentially imitation learning: an operator wearing a motion-tracking system teleoperates Atlas through motion and manipulation tasks. There’s a one-to-one mapping between the operator and the robot, making it fairly intuitive, although as anyone who has tried to teleoperate a robot with a surfeit of degrees of freedom can attest, it takes some practice to do it well.
A motion tracking system provides high-quality task training data for Atlas. Boston Dynamics
This interface provides very high-quality demonstration data for Atlas, but it’s not the easiest to scale—just one of the challenges of deploying a multipurpose (different than generalist!) humanoid.
For more about what’s going on behind the scenes in this video and Boston Dynamics’ strategy with Atlas, IEEE Spectrum spoke with Kuindersma.
In a video from last October just as your partnership with Toyota Research Institute was beginning, Atlas was shown moving parts around and performing whole-body manipulation. What’s the key difference between that demonstration and what we’re seeing in the new video?
Scott Kuindersma: The big difference is how we programmed the behavior. The previous system was a more traditional robotics stack involving a combination of model-based controllers, planners, and machine learning models for perception all architected together to do end-to-end manipulation. Programming a new task on that system generally required roboticists or system integrators to touch code and tell the robot what to do.
For this new video, we replaced most of that system with a single neural network that was trained on demonstration data. This is much more flexible because there’s no task-specific programming or other open-ended creative engineering required. Basically, if you can teleoperate the robot to do a task, you can train the network to reproduce that behavior. This approach is more flexible and scalable because it allows people without advanced degrees in robotics to “program” the robot.
We’re talking about a large behavior model (LBM) here, right? What would you call the kind of learning that this model does?
Kuindersma: It is a kind of imitation learning. We collect many teleoperation demonstrations and train a neural network to reproduce the input-output behaviors in the data. The inputs are things like raw robot camera images, natural language descriptions of the task, and proprioception, and the outputs are the same teleop commands sent by the human interface.
What makes it a large behavior model is that we collect data from many different tasks and, in some cases, many different robot embodiments, using all of that as training data for the robot to end up with a single policy that knows how to do many things. The idea is that by training the network on a much wider variety of data and tasks and robots, its ability to generalize will be better. As a field, we are still in the early days of gathering evidence that this is actually the case (our [Toyota Research Institute] collaborators are among those leading the charge), but we expect it is true based on the empirical trends we see in robotics and other AI domains.
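The training recipe Kuindersma describes—supervised learning on (observation, teleop command) pairs—can be caricatured in a few lines. The sketch below fits a toy linear policy to demonstrated actions by minimizing the mean squared error against recorded commands; it is the same imitation-learning objective, minus the large network, camera images, and language inputs that the real system uses, and every dimension and dataset here is a made-up stand-in:

```python
import numpy as np

# Toy behavior-cloning sketch (illustrative only; Atlas uses a large
# neural network, not a linear policy). Observations stand in for
# camera features + proprioception; targets are recorded teleop commands.
rng = np.random.default_rng(0)

obs_dim, act_dim, n_demos = 16, 4, 512
expert_w = rng.normal(size=(obs_dim, act_dim))      # hypothetical expert mapping
observations = rng.normal(size=(n_demos, obs_dim))  # stand-in sensor features
teleop_actions = observations @ expert_w            # demonstrated commands

W = np.zeros((obs_dim, act_dim))  # policy parameters, trained from scratch
lr = 0.05

def imitation_loss(w):
    # Mean squared error between policy output and demonstrated actions.
    err = observations @ w - teleop_actions
    return float(np.mean(err ** 2))

loss_before = imitation_loss(W)
for _ in range(300):
    # Plain gradient descent on the imitation (behavior-cloning) loss.
    grad = observations.T @ (observations @ W - teleop_actions) / n_demos
    W -= lr * grad
loss_after = imitation_loss(W)
```

After a few hundred steps the policy reproduces the demonstrated input-output behavior almost exactly—on data it has seen. The generalization question the interview raises is precisely what happens off that distribution.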
So the idea with the behavior model is that it will be more generalizable, more adaptable, or require less training because it will have a baseline understanding of how things work?
Kuindersma: Exactly, that’s the idea. At a certain scale, once the model has seen enough through its training data, it should have some ability to take what it’s learned from one set of tasks and apply those learnings to new tasks. One of the things that makes these models flexible is that they are conditioned on language. We collect teleop demonstrations and then post-annotate that data with language, having humans or language models describing in English what is happening. The network then learns to associate these language prompts with the robot’s behaviors. Then, you can tell the model what to do in English, and it has a chance of actually doing it. At a certain scale, we hope it won’t take hundreds of demonstrations for the robot to do a task; maybe only a couple, and maybe way in the future, you might be able to just tell the robot what to do in English, and it will know how to do it, even if the task requires dexterity beyond simple object pick-and-place.
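The language-conditioning idea can be sketched very roughly: a prompt is embedded, and the policy’s output depends on both the current observation and that embedding. Everything below—the bag-of-words embedding, the skill annotations, the nearest-annotation routing—is a hypothetical stand-in for the learned text encoders and end-to-end networks an actual LBM uses:

```python
import numpy as np

VOCAB = ["pick", "place", "part", "bin", "walk", "to"]

def embed(prompt: str) -> np.ndarray:
    # Toy bag-of-words embedding; real systems use learned text encoders.
    words = prompt.lower().split()
    return np.array([words.count(w) for w in VOCAB], dtype=float)

class LanguageConditionedPolicy:
    def __init__(self, skills):
        # skills: {language annotation: action_fn(observation)},
        # mimicking post-annotated demonstration data.
        self.keys = [embed(d) for d in skills]
        self.fns = list(skills.values())

    def act(self, observation, prompt):
        # Route to the skill whose annotation best matches the prompt,
        # then compute the action from the current observation.
        sims = [float(k @ embed(prompt)) for k in self.keys]
        return self.fns[int(np.argmax(sims))](observation)

policy = LanguageConditionedPolicy({
    "pick part": lambda obs: ("grasp", obs["part_pose"]),
    "place part in bin": lambda obs: ("release", obs["bin_pose"]),
    "walk to bin": lambda obs: ("step", obs["bin_pose"]),
})

obs = {"part_pose": (0.4, 0.1), "bin_pose": (1.2, 0.0)}
action = policy.act(obs, "pick the part")  # English in, behavior out
```

The point of the sketch is only the interface: an English instruction selects among behaviors the model has associated with language during training, while the observation determines the details of the motion.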
There are a lot of robot videos out there of robots doing stuff that might look similar to what we’re seeing here. Can you tell me how what Boston Dynamics and Toyota Research Institute are doing is unique?
Kuindersma: Many groups are using AI tools for robot demos, but there are some differences in our strategic approach. From our perspective, it’s crucial for the robot to perform the full breadth of humanoid manipulation tasks. That means, if you use a data-driven approach, you need to somehow funnel those embodied experiences into the dataset you’re using to train the model. We spent a lot of time building a highly expressive teleop interface for Atlas, which allows operators to move the robot around quickly, take steps, balance on one foot, reach the floor and high shelves, throw and catch things, and so on.
The ability to directly mirror a human body in real time is vital for Atlas to act like a real humanoid laborer. If you’re just standing in front of a table and moving things around, sure, you can do that with a humanoid, but you can do it with much cheaper and simpler robots, too. If you instead want to, say, bend down and pick up something from between your legs, you have to make careful adjustments to the entire body while doing manipulation. Our work with Atlas over the last couple of months has focused more on collecting this type of data, and we’re committed to making these AI models extremely performant so the motions are smooth, fast, beautiful, and fully cover what humanoids can do.
Is it a constraint that you’re using imitation learning, given that Atlas is built to move in ways that humans can’t? How do you expand the operating envelope with this kind of training?
Kuindersma: That’s a great question. There are a few ways to think about it:
- Atlas can certainly do things like continuous joint rotation that people can’t. While those capabilities might offer efficiency benefits, I would argue that if Atlas only behaved exactly like a competent human, that would be amazing, and we would be very happy with that.
- We could extend our teleop interface to make available types of motions the robot can do but a person can’t. The downside is this would probably make teleoperation less intuitive, requiring a more highly trained expert, which reduces scalability.
- We may be able to co-train our large behavior models with data sources that are not just teleoperation-based. For example, in simulation, you could use rollouts from reinforcement learning policies or programmatic planners as augmented demonstrations that include these high-range-of-motion capabilities. The LBM can then learn to leverage that in conjunction with teleop demonstrations. This is not just hypothetical: we’ve actually found that co-training with simulation data has improved performance on the real robot, which is quite promising.
Can you tell me what Atlas was directed to do in the video? Is it primarily trying to mirror its human-based training, or does it have some capacity to make decisions?
Kuindersma: In this case, Atlas is responding primarily to visual and language cues to perform the task. At our current scale and with the model’s training, there’s a limited ability to completely innovate behaviors. However, you can see a lot of variety and responsiveness in the details of the motion, such as where specific parts are in the bin or where the bin itself is. As long as those experiences are reflected somewhere in the training data, the robot uses its real-time sensor observations to produce the right type of response.
So, if the bin was too far away for the robot to reach, without specific training, would it move itself to the bin?
Kuindersma: We haven’t done that experiment, but if the bin was too far away, I think it might take a step forward because we varied the initial conditions of the bin when we collected data, which sometimes required the operator to walk the robot to the bin. So there is a good chance that it would step forward, but there is also a small chance that it might try to reach and not succeed. It can be hard to make confident predictions about model behavior without running experiments, which is one of the fun features of working with models like this.
It’s interesting how a large behavior model, which provides world knowledge and flexibility, interacts with this instance of imitation learning, where the robot tries to mimic specific human actions. How much flexibility can the system take on when it’s operating based on human imitation?
Kuindersma: It’s primarily a question of scale. A large behavior model is essentially imitation learning at scale, similar to a large language model. The hypothesis with large behavior models is that as they scale, generalization capabilities improve, allowing them to handle more real-world corner cases and require less training data for new tasks. Currently, the generalization of these models is limited, but we’re addressing that by gathering more data not only through teleoperating robots but also by exploring other scaling bets like non-teleop human demonstrations and sim/synthetic data. These other sources might have more of an “embodiment gap” to the robot, but the model’s ability to assimilate and translate between data sources could lead to better generalization.
How much skill or experience does it take to effectively train Atlas through teleoperation?
Kuindersma: We’ve had people on day tours jump in and do some teleop, moving the robot and picking things up. This ease of entry is thanks to our teams building a really nice interface: The user wears a VR headset, where they’re looking at a re-projection of the robot’s stereo RGB cameras, which are aligned to provide a 3D sense of vision, and there are built-in visual augmentations like desired hand locations and what the robot is actually doing to give people situational awareness.
So while novice users can do things fairly easily, they’re probably not generating the highest-quality motions for training policies. To generate high-quality data, and to do that consistently over a period of several hours, it typically takes a couple of weeks of onboarding. We usually start with manipulation tasks and then progress to tasks involving repositioning the entire robot. It’s not trivial, but it’s doable. The people doing it now are not roboticists; we have a team of ‘robot teachers’ who are hired for this, and they’re awesome. It gives us a lot of hope for scaling up the operation as we build more robots.
How is what you’re doing different from other companies that might lean much harder on scaling through simulation? Are you focusing more on how humans do things?
Kuindersma: Many groups are doing similar things, with differences in technical approach, platform, and data strategy. You can characterize the strategies people are taking by thinking about a “data pyramid,” where the top of the pyramid is the highest quality, hardest-to-get data, which is typically teleoperation on the robot you’re working with. The middle of the pyramid might be egocentric data collected on people (e.g., by wearing sensorized gloves), simulation data, or other synthetic world models. And the bottom of the pyramid is data from YouTube or the rest of the Internet.
Different groups allocate finite resources to different distributions of these data sources. For us, we believe it’s really important to have as large a baseline of actual on-robot data (at the top of the pyramid) as possible. Simulation and synthetic data are almost certainly part of the puzzle, and we’re investing resources there, but we’re taking a somewhat balanced data strategy rather than throwing all of our eggs in one basket.
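One simple way to picture that balanced data strategy is as a weighted sampler over the pyramid’s layers when assembling training batches. The source names and ratios below are purely illustrative assumptions for the sketch, not Boston Dynamics’ actual mix:

```python
import random

# Illustrative "data pyramid" mixing weights: each training example is
# drawn from a source with fixed probability. Ratios are assumptions.
PYRAMID = {
    "on_robot_teleop": 0.5,   # top: scarce, highest-quality data
    "human_egocentric": 0.2,  # middle: e.g., sensorized-glove data
    "simulation": 0.2,        # middle: synthetic rollouts
    "web_video": 0.1,         # bottom: plentiful, weakest supervision
}

def sample_batch(batch_size, rng=random):
    # Draw the source label for each example in a batch according
    # to the pyramid weights.
    sources = list(PYRAMID)
    weights = [PYRAMID[s] for s in sources]
    return [rng.choices(sources, weights=weights)[0] for _ in range(batch_size)]

batch = sample_batch(1000, random.Random(0))
counts = {s: batch.count(s) for s in PYRAMID}
```

The open research question Kuindersma raises—what the right ratio is between real robot data and everything else—amounts to tuning these weights, and nobody yet knows the answer.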
Ideally you want the top of the pyramid to be as big as possible, right?
Kuindersma: Ideally, yes. But you won’t get to the scale you need by just doing that. You need the whole pyramid, but having as much high-quality data at the top as possible only helps.
But it’s not like you can just have a super large bottom to the pyramid and not need the top?
Kuindersma: I don’t think so. I believe there needs to be enough high-quality data for these models to effectively translate into the specific embodiment that they are executing on. There needs to be enough of that “top” data for the translation to happen, but no one knows the exact distribution, like whether you need 5 percent real robot data and 95 percent simulation, or some other ratio.
Is that a box of ‘Puny-os’ on the shelf in the video?
Part of this self-balancing robot. Boston Dynamics
Kuindersma: Yeah! Alex Alspach from [Toyota Research Institute] brought it in to put in the background as an easter egg.
What’s next for Atlas?
Kuindersma: We’re really focused on maximizing the performance of manipulation behaviors. I think one of the things that we’re uniquely positioned to do well is reaching the full behavioral envelope of humanoids, including mobile bimanual manipulation, repetitive tasks, and strength, and getting the robot to move smoothly and dynamically using these models. We’re also developing repeatable processes to climb the robustness curve for these policies—we think reinforcement learning may play a key role in achieving this.
We’re also looking at other types of scaling bets around these systems. Yes, it’s going to be very important that we have a lot of high-quality, on-robot, on-task data that we’re using as part of training these models. But we also think there are real opportunities in being able to leverage other data sources, whether that’s observing or instrumenting human workers or scaling up synthetic and simulation data, and understanding how those things can mix together to improve the performance of our models.

The post “Large Behavior Models Are Helping Atlas Get to Work” by Evan Ackerman was published on 09/07/2025 by spectrum.ieee.org