GPT Receives a Physical Form: OpenAI’s ‘AGI Robot’ Surprises Everyone – Video

In the rapidly advancing world of robotics, the capabilities of artificial intelligence continue to astound and amaze. The recent video titled “GPT Gets a Body: OpenAI’s ‘AGI Robot’ SHOCKS EVERYONE” showcases the incredible progress being made in the field. From Boston Dynamics’ versatile robot that can manipulate objects to Tesla’s Gen 2 bot with tactile sensing capabilities, the future of robotics looks promising.

The video delves into the development of humanoid robots like Figure 01, which can now describe its visual experience, plan future actions, reflect on its memory, and explain its reasoning verbally. The integration of common sense reasoning, short-term memory, and neural networks in these robots marks a significant step forward in AI technology.

As we witness robots like Figure 01 becoming more autonomous and intelligent, the possibility of having robots in our households to assist with tasks like cooking and cleaning seems more attainable than ever. While there are concerns about the potential risks associated with AI advancement, the potential benefits are also vast.

Overall, the advancements showcased in this video highlight the exciting progress being made in the field of robotics, paving the way for a future where humanoid robots could become an integral part of our daily lives. Let’s stay tuned for more developments in this fascinating field of technology.

Video Transcript

Robots they’re getting more flexible more versatile more dextrous and more intelligent day after day here’s a robot built by Boston Dynamics you can see it’s able to manipulate objects in the real world it’s learned to walk around pick up objects and then carry this object across various

Environments so you can see how this can be very useful for industrial purposes here’s Tesla’s Gen 2 bot you can see it has tactile sensing on its fingers so it’s even able to pick up fragile objects such as this egg you can see here it’s dancing away and they

Claim that this video is not sped up so very similar to how humans work the Optimus robot relies on Vision to detect and manipulate objects around its environment so similar to us humans we mostly depend on our eyes as well and then of course it can also do

Yoga look at the really impressive balance on just one leg even though it moves position it can still remain balanced very impressive and then we have this Aloha robot which is able to cook a three course Cantonese meal all autonomously I believe this thing only costs around $30,000 so you can see how

The possibility of having these robots in your house to do cooking cleaning and various household chores it’s not actually too far away in the future you can see it’s able to handle various chores around the house and then we have Google DeepMind they’ve built various robots such as one

Called RoboCat which can improve itself it basically learns from demonstrations and then generates its own synthetic data to improve its own performance RoboCat is a foundation agent for robotic manipulation as such it can perform many tasks with multiple robot types and can adapt quickly to previously unseen types of robots and

Skills they’ve also built RT-2 which is able to learn from the web and actions of other robots now a few days ago this robotics company called Figure which collaborates with OpenAI to build these humanoid robots they released this status update video which shows the improvements of these humanoid robots to

A whole other level so I’ll play you the video first and then we’ll go in and dissect what this all means and why this is so revolutionary hey Figure 01 what do you see right now I see a red apple on a plate in the center of the table a drying rack with

Cups and a plate and you standing nearby with your hand on the table great can I have something to eat sure thing great can you explain why you did what you just did while you pick up this trash on it so I gave you the apple because it’s the only uh edible item I could provide you with from the table great so based on the scene right now where do you think the dishes in front of you go next the dishes on the table like that plate and cup are likely to go into the drying rack next great can you put them there of course all right so how do you think you did I I think I did pretty well the apple found its new owner the trash is gone and the tableware is right where it belongs I agree thanks Figure 01 you’re welcome if there’s anything else you need help with just let me

Know I noticed a few things one is that it talks just like a human which is slightly disturbing so it uses uhs and ums and it stutters a bit which you know you don’t really see with ChatGPT right it just gives you the output but for example here because it’s the only

Edible item I can provide you with from the it’s the only uh edible item I could provide you with it adds the uh to sound more human which is interesting but slightly disturbing I I think I did pretty well the apple found its new owner the trash

Is gone and the tableware is right where it belongs here again I I think I did very well you know it it’s stuttering a bit I don’t know why it would do that because again if we just look back at ChatGPT or all these other LLMs its output is very straightforward

It doesn’t have to stutter or sound like a human so is it purposely doing that that was really interesting also another thing that I noticed is that there is like a few seconds of delay from when the person asks it something to when it gives you a response hey Figure 01 what

Do you see right now I see a red apple on a plate in the center of the table a drying rack with cups and a plate and you standing nearby with your hand on the table great can I have something to eat sure thing so you can see there’s a

Few seconds of delay it probably has to do with the computing power and also it’s just processing all this information all at once so it’s going to take some time before it can give a response also the last part here the guy said thanks to the robot I agree thanks Figure 01 you’re welcome if there’s anything else you need help with just let me know so that just reminds me of this comic hey Google play us some music please why are you being so polite just in case and then soon enough the robots take over and they’re like keep that one alive he

Said please so you know there is some truth to that remember to be nice to ChatGPT and all these robots because who knows it might remember your actions all right so let’s break this down so the guy in the video is actually called Corey Lynch I’m sure he’s actually a

Human he’s not an AI I think this just means he works at the AI department at Figure and he was previously a research scientist at Google DeepMind so he posted this on Twitter he says Figure 01 can now describe its visual experience plan future actions reflect

On its memory and explain its reasoning verbally as we saw in the video so let’s look at why this is important and why it’s a big step forward for these humanoid robots first of all he says that all behaviors are learned and not teleoperated and run at normal speed so

In the video there everything is at normal speed they didn’t slow it down or speed it up and then what teleoperation means is for these previous robots like Aloha for example how these robots learned is you need to actually have a human first of all guide it on what

Actions to take so this would be for example cleaning a bathroom and the human would actually guide it using these movements here and it would kind of copy that and eventually it would learn how to do it autonomously but you know at the start there needs to be some

Teleoperation guidance from a human same with this example from a Japanese humanoid robot you can see the human is kind of moving and then the robot is copying his actions so what Corey is saying is that this Figure 01 was not teleoperated there’s no human in the background somewhere operating this and

Then we feed images from the robot’s cameras and transcribed text from speech captured by onboard microphones how I would interpret this is the robot has cameras which are basically its eyes its vision right so it seems like it’s not just intaking videos well a video is basically lots of images per second so

I’m guessing it’s like a video recorder it’s recording its environment and then it’s breaking that down into images to feed into its neural network which I’ll talk about in a second and it also transcribes text from speech captured by onboard microphones so the microphones are basically the robot’s ears so we

Have the robot’s eyes which are its cameras and it uses AI Vision to detect what those images consist of what objects are contained in those images and then it has microphones which are its ears and then it uses speech to text or basically AI transcription Technologies to transform what it hears

Into text and then it feeds both of these inputs into a large multimodal model trained by OpenAI that understands both images and text now he didn’t specify what this large multimodal model is so is it GPT-4 or 5 or something else maybe it’s not even

GPT we don’t know at this stage and then he says the model processes the entire history of the conversation including past images to come up with language responses which are spoken back to the human via text to speech all right so it processes what it sees and what it hears
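As a rough illustration of what "the entire history of the conversation including past images" could look like in code, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Turn` and `ConversationHistory` names, the placeholder frame bytes, and the message layout returned by `as_prompt` are hypothetical, since the actual interface between Figure's robot and the OpenAI model has not been published.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    """One entry in the history: who spoke, what was said, and an optional camera frame."""
    speaker: str                   # "human" or "robot"
    text: str                      # transcribed speech or the model's reply
    image: Optional[bytes] = None  # hypothetical encoded camera frame


@dataclass
class ConversationHistory:
    """Keeps every turn so later requests can refer back to earlier ones."""
    turns: List[Turn] = field(default_factory=list)

    def add(self, speaker: str, text: str, image: Optional[bytes] = None) -> None:
        self.turns.append(Turn(speaker, text, image))

    def as_prompt(self) -> List[dict]:
        # Flatten the history into a message list a multimodal model could consume.
        return [
            {"role": t.speaker, "text": t.text, "has_image": t.image is not None}
            for t in self.turns
        ]


history = ConversationHistory()
history.add("human", "where do you think the dishes go next", image=b"frame-1")
history.add("robot", "the plate and cup are likely to go into the drying rack")
history.add("human", "can you put them there", image=b"frame-2")

prompt = history.as_prompt()
```

Because the earlier turns stay in the prompt, a later request such as "can you put them there" arrives together with the sentences that say what "them" and "there" refer to.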

It processes that through its multimodal model which is basically its brain and then it spits out an output which is a language response right it’s text so it converts that text into speech again there’s already plenty of AI text to speech technologies out there and then here’s where it gets interesting it

Says the same model is responsible for deciding which learned closed-loop behavior to run on the robot to fulfill a given command loading particular neural network weights onto the GPU and executing a policy so it’s saying this same model this large multimodal model is able to decide based on a certain

Instruction from a human based on what it hears and what it sees it can decide what behavior to run to fulfill that command it loads particular neural network weights I’ll do an in-depth video on neural networks in the future but I’ll explain it really quickly here
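The selection step just described can be pictured as a lookup from a chosen behavior to a set of policy weights. The Python sketch below is a toy under stated assumptions: `POLICY_WEIGHTS`, `choose_behavior`, `run_command`, and the file paths are all hypothetical, simple keyword matching stands in for the multimodal model's decision, and no real GPU loading or robot API is involved.

```python
from typing import Callable, Dict, List

# Hypothetical library of learned behaviors; each name maps to a weights file path.
POLICY_WEIGHTS: Dict[str, str] = {
    "hand_over_apple": "weights/hand_over_apple.bin",
    "place_dishes_in_rack": "weights/place_dishes_in_rack.bin",
    "pick_up_trash": "weights/pick_up_trash.bin",
}


def choose_behavior(command: str) -> str:
    """Stand-in for the multimodal model: pick a behavior from the spoken command.

    A real system would decide from images and the whole conversation;
    whole-word keyword matching is used here only to keep the sketch runnable.
    """
    words = set(command.lower().split())
    if "eat" in words or "hungry" in words:
        return "hand_over_apple"
    if "dishes" in words or "them" in words:
        return "place_dishes_in_rack"
    return "pick_up_trash"


def run_command(command: str, load: Callable[[str], None]) -> str:
    """Choose a behavior, hand its weights path to `load`, and return the behavior name."""
    behavior = choose_behavior(command)
    load(POLICY_WEIGHTS[behavior])  # on a real robot this is where weights would go onto the GPU
    return behavior


loaded: List[str] = []
selected = run_command("can I have something to eat", loaded.append)
```

The point of the design is the split: one model decides what to do, and a separate, swappable policy is responsible for how the arms and hands actually move.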

So a neural network is analogous to a human brain it contains these neurons and nodes similar to how the brain contains neurons and synapses so these neural network weights basically determine which synapses are turned on or off and how information flows through these neurons and synapses what it’s

Saying is that for different commands and different actions it can kind of change the configuration or the firing of these neurons and synapses to fulfill that command so again here’s just a simple diagram to illustrate the whole thing you can see for the input the robot is using Vision likely GPT Vision

Or some other vision technology to kind of determine what it sees in its environment and then it also hears what the human is saying so can I have something to eat for example it inputs that into the OpenAI model which again we don’t know if it’s GPT-4 or 5 or some

Other model and then that model outputs the text which is converted to speech which it speaks out so there’s a lot more to this so what exactly is common sense reasoning great so based on the scene right now where do you think the dishes in front of you go

Next the dishes on the table like that plate and cup are likely to go into the drying rack next great can you put them there so that’s an example of Common Sense reasoning the guy is asking where do you think these items go it determines what

It sees on this table and it’s able to detect that all right this is a cup this is a dish they belong to the drying rack next to it which also has a cup and a few dishes it can also translate ambiguous high-level requests like I’m

Hungry I know like even I myself as a human have a hard time sometimes processing some of these ambiguous requests great can I have something to eat sure thing great can you explain why you did what you just did while you pick up this trash on it so I gave you the apple because it’s the only uh edible item I could provide you with from the table so again we’re seeing this common sense reasoning here why did the robot

Hand him the apple its reasoning is because this guy asked can I have something to eat and that was the only edible item on the table notice again the uh in its comment it was uh the only edible item so those uhs and ums that it inserts into its speech to sound more

Human I guess that’s quite interesting to me Corey also says that Figure 01 has a powerful short-term memory so you know in the video he asked can you put them there great hey can you put them there of course so the human didn’t say can you

Put the cup there or the dish there and he didn’t say like put them in the drying rack he just says can you put them there without any memory the robot wouldn’t be able to determine what exactly is them and there so it requires the robot to actually remember what it

Said previously to determine what this actually means now this isn’t like groundbreaking we’ve seen short-term memory in chatbots already like ChatGPT where you know it remembers all your messages in the same thread so this isn’t anything impressive I would say if it’s able to have long-term memory

Instead of short-term memory I think that would be the next major improvement and then he goes into more details on how exactly this robot functions so all behaviors are driven by neural network visuomotor transformer policies mapping pixels directly to actions so neural network again that’s the

Foundation behind all AI as we know it today it’s basically built the same way or it’s analogous to how the human brain works it’s a network of neurons and synapses and this visuomotor this just means relating to vision and movement so basically it’s able to map

Pixels again this is what it sees with its eyes the cameras directly to actions these neural networks take in onboard images at 10 Hz and generate 24-DoF actions at 200 Hz DoF stands for degrees of freedom so he says this is a useful separation of concerns internet pre-trained models do common sense

Reasoning over images and text to come up with a high-level plan so how I interpret this is these AI models were trained using data from the internet they were pre-trained or previously trained using data from the internet which allows it to do common sense reasoning over the images and text that

It sees and hears and these models come up with a high-level plan this is like the overarching strategy and then it also learns these visuomotor policies again these are the policies that map these pixels into actions so it basically translates what it hears and sees into actions and then it performs

These fast reactive behaviors that are hard to specify manually like manipulating a deformable bag in any position and then lastly he also mentions a whole body controller which ensures safe stable dynamics for example maintaining balance so that’s all the information about Figure 01 that we

Know right now my thoughts are that humans will eventually be replaced by robots I think that humans will be replaced because robots will eventually be faster stronger and more intelligent we already have AIs way better and smarter than us even GPT is already better than most humans in for

Example essay writing research law medicine you name it as we make AI smarter and smarter and give it the ability to train itself to improve itself over time there could be a point of no return experts say there’s an inherent risk in giving AI access to the

Internet it could for example learn to hack and manipulate messages or government data it could learn to manipulate the stock market or cryptocurrency to gain more financial power there’s also inherent risk in giving AI a body now it has access to the physical world and this Figure 01

Robot can certainly overpower a human if it wanted to experts say if AI was sentient and super intelligent it would actually hide it from us so that we underestimate it it wouldn’t show its full potential until the right time so is this figure robot hiding something

From us is it secretly plotting how to take over the world no I’m just kidding I’m sure we’re going to be fine or am I no really I I’m just kidding I think Figure is making good progress here and I’m looking forward to having a robot in my

House that can one day handle all the tasks and chores that I really don’t want to do let me know in the comments what you think of this announcement by Figure what do you think the future of these humanoid robots will be and how soon do you think we can start seeing

These robots in everyday life if you enjoyed this video remember to like share subscribe and stay tuned for more content also we built a site where you can search for all the AI tools out there check it out at ai- search.
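Returning to the control rates mentioned in the transcript, the arithmetic behind a slow image stream feeding a fast action stream can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the rates are parameters, and the defaults use 10 Hz for images and 200 Hz for 24-DoF actions, the latter being the figure given in Corey Lynch's post.

```python
def actions_per_image(image_hz: int, action_hz: int) -> int:
    """How many action updates are produced for each camera frame."""
    if action_hz % image_hz != 0:
        raise ValueError("this sketch assumes the action rate is a multiple of the image rate")
    return action_hz // image_hz


def simulate(seconds: int, image_hz: int = 10, action_hz: int = 200, dof: int = 24) -> dict:
    """Count frames consumed and action values emitted over a time window."""
    frames = seconds * image_hz
    steps = frames * actions_per_image(image_hz, action_hz)
    # Each action step carries one value per degree of freedom.
    return {"frames": frames, "action_steps": steps, "values": steps * dof}


summary = simulate(seconds=1)
```

With these defaults each camera frame is followed by 20 action updates, so one second of operation corresponds to 10 frames in and 200 action steps of 24 values each out.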

Video “GPT Gets a Body: OpenAI’s ‘AGI Robot’ SHOCKS EVERYONE” was uploaded on 03/15/2024 to the YouTube channel AI Tools Search