Smart Glasses Assist in Training Versatile Robots

General-purpose robots are hard to train. The dream is to have a robot like the Jetsons' Rosie that can perform a range of household tasks, like tidying up or folding laundry. But for that to happen, the robot needs to learn from a large amount of data that matches real-world conditions, and that data can be difficult to collect. Currently, most training data is collected from multiple static cameras that have to be carefully set up to gather useful information. But what if robots could learn from the everyday interactions we already have with the physical world?

That’s a question that the General-purpose Robotics and AI Lab at NYU, led by assistant professor Lerrel Pinto, hopes to answer with EgoZero, a smart-glasses system that aids robot learning by collecting data with a souped-up version of Meta’s glasses.

In a recent preprint, which serves as a proof of concept for the approach, the researchers trained a robot to complete seven manipulation tasks, such as picking up a piece of bread and placing it on a nearby plate. For each task, they collected 20 minutes of data from humans performing these tasks while recording their actions with glasses from Meta's Project Aria. (These sensor-laden glasses are used exclusively for research purposes.) When the system was then deployed on a robot to complete the tasks autonomously, it achieved a 70 percent success rate.

The Advantage of Egocentric Data

The “ego” part of EgoZero refers to the “egocentric” nature of the data, meaning that it is collected from the perspective of the person performing a task. “The camera sort of moves with you,” like how our eyes move with us, says Raunaq Bhirangi, a postdoctoral researcher at the NYU lab.

This has two main advantages: First, the setup is more portable than external cameras. Second, the glasses are more likely to capture the information needed because wearers will make sure they—and thus the camera—can see what’s needed to perform a task. “For instance, say I had something hooked under a table and I want to unhook it. I would bend down, look at that hook and then unhook it, as opposed to a third-person camera, which is not active,” says Bhirangi. “With this egocentric perspective, you get that information baked into your data for free.”

The second half of EgoZero’s name refers to the fact that the system is trained without any robot data, which can be costly and difficult to collect; human data alone is enough for the robot to learn a new task. This is enabled by a framework developed by Pinto’s lab that tracks points in space, rather than full images. When training robots on image-based data, “the mismatch is too large between what human hands look like and what robot arms look like,” says Bhirangi. This framework instead tracks points on the hand, which are mapped onto points on the robot.
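To make the idea concrete, here is a minimal sketch of that kind of point-based retargeting. The keypoint names, the two-fingertip reduction, and the gripper mapping are illustrative assumptions, not the lab's actual code:

```python
import numpy as np

def hand_to_action_points(hand_keypoints: dict[str, np.ndarray]) -> np.ndarray:
    """Reduce a detected human hand pose to the few 3D points that
    stand in for a two-finger gripper (an assumed keypoint layout)."""
    thumb = hand_keypoints["thumb_tip"]  # (3,) point in the world frame
    index = hand_keypoints["index_tip"]  # (3,) point in the world frame
    return np.stack([thumb, index])      # shape (2, 3)

def robot_targets(action_points: np.ndarray) -> dict[str, np.ndarray]:
    """Map the human action points onto robot gripper targets.

    Because both sides are just 3D points, the same representation
    works regardless of what the hand or the arm looks like.
    """
    center = action_points.mean(axis=0)                          # gripper position
    width = np.linalg.norm(action_points[0] - action_points[1])  # gripper opening
    return {"position": center, "gripper_width": np.array([width])}
```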

[Figure] EgoZero localizes object points via triangulation over the camera trajectory, and computes action points via Aria MPS hand pose and a hand estimation model. The EgoZero system takes data from humans wearing smart glasses and turns it into usable 3D data for robots to do general manipulation tasks. Credit: Vincent Liu, Ademi Adeniji, Haotian Zhan et al.
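Because the glasses move with the wearer, a single camera sees each object point from many poses along its trajectory, which is enough to triangulate the point in 3D. A minimal sketch of that idea using two observations and standard linear (DLT) triangulation (variable names are illustrative, and this is a generic method rather than EgoZero's exact pipeline):

```python
import numpy as np

def triangulate_point(P1: np.ndarray, P2: np.ndarray,
                      uv1: np.ndarray, uv2: np.ndarray) -> np.ndarray:
    """Linear (DLT) triangulation of one 3D point from two views.

    P1, P2 : 3x4 camera projection matrices at two poses along the
             glasses' trajectory (intrinsics times world-to-camera pose).
    uv1, uv2 : the point's pixel coordinates in each view.
    """
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)   # least-squares solution of A @ X = 0
    X = vt[-1]
    return X[:3] / X[3]           # homogeneous -> Euclidean 3D point
```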

Reducing the image to points in 3D space means the model can track movement the same way, regardless of the specific robotic appendage. “As long as the robot points move relative to the object in the same way that the human points move, we’re good,” says Bhirangi.
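One way to read that claim (a hedged sketch of the idea, not the paper's implementation) is to express each action as the motion of the action points relative to the object points, so the same trajectory transfers to any embodiment and any object placement:

```python
import numpy as np

def object_relative_action(action_pts: np.ndarray,
                           object_pts: np.ndarray) -> np.ndarray:
    """Express action points in a frame anchored at the object centroid."""
    return action_pts - object_pts.mean(axis=0)

def replay_on_robot(rel_action: np.ndarray,
                    object_pts_now: np.ndarray) -> np.ndarray:
    """Turn an object-relative action back into world-frame robot targets,
    for wherever the object happens to be at execution time."""
    return rel_action + object_pts_now.mean(axis=0)
```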

All of this leads to a generalizable model that would otherwise require a lot of diverse robot data to train. If the robot was trained on data of someone picking up one piece of bread (say, a deli roll), it can generalize that information to pick up a piece of ciabatta in a new environment.

A Scalable Solution

In addition to EgoZero, the research group is working on several projects to help make general-purpose robots a reality, including open-source robot designs, flexible touch sensors, and additional methods of collecting real-world training data.

For example, as an alternative to EgoZero, the researchers have also designed a setup with a 3D-printed handheld gripper that more closely resembles most robot “hands.” A smartphone attached to the gripper captures video, which is processed with the same point-space method used in EgoZero. By letting people collect data without having to bring a robot into their homes, both approaches could provide a more scalable way to gather training data.

That scalability is ultimately the researchers’ goal. Large language models can harness the entire Internet, but there is no Internet equivalent for the physical world. Tapping into everyday interactions with smart glasses could help fill that gap.
