Nvidia’s NEW “AI AGENT” Will Change The WORLD! (Jim Fan)
Nvidia’s NEW “AI AGENT” Will Change The WORLD! is a video presenting a TED Talk by Jim Fan, a senior research scientist at Nvidia, in which he explores the future of AI agents and the “foundation agent” he believes will reshape the tech industry. Jim Fan discusses the concept of the foundation agent, a single model that operates seamlessly across virtual and physical worlds, touching everything from video games and the metaverse to drones and humanoid robots. He distinguishes this agent from AGI, emphasizing that the foundation agent is designed as a versatile, multi-functional AI that can operate in many realities and master skills across different domains. The video also offers insight into a private discussion with Jim Fan, delving into the research papers he has contributed toward the development of foundation agents. The discussion covers Voyager, an AI agent capable of playing Minecraft professionally, explaining its coding-as-action approach, self-reflection mechanism, and lifelong-learning abilities. Jim Fan’s work demonstrates the potential of AI agents to evolve, discover new skills, and operate indefinitely, setting the stage for a coming revolution in artificial intelligence. This video is a must-watch for anyone interested in staying updated on the cutting edge of AI technology.
Watch the video by TheAIGRID
Video Transcript
So there was a recent TED Talk about AI agents, a very fascinating one presented by the senior research scientist at Nvidia and lead of the AI agents initiative: this is Jim Fan, senior research scientist at Nvidia AI.

In this fascinating talk he gives us a breakdown of where the future is headed with AI agents. He talks about something called the foundation agent, which would seamlessly operate across both the virtual and physical worlds, and he explains how this technology could fundamentally change our lives, permeating everything from video games and the metaverse to drones and humanoid robots, and he explores how a single model could master skills across different realities. Now, this foundation agent is not to be confused with AGI itself, because AGI refers to a level of artificial intelligence where a machine can understand, learn, and apply its intelligence to solve any problem in a manner comparable to a human across a wide range of domains. The idea of the foundation agent is about creating a versatile, multi-functional AI that can operate in both virtual and physical environments, mastering skills in various realities. In this video I was lucky enough to be part of a private discussion with Jim Fan himself, where he discussed the real future of foundation agents and some of the research papers he worked on, which will help contribute to the future research and development of foundation agents and the industry as a whole. So I'm going to show you just a few seconds from his TED Talk, because it is one that shouldn't be missed, especially if you want to stay up to date on where everything is headed in AI, and then I'll share the conversation we had about AI agents and some of the papers Jim Fan worked on himself.

As we progress through this map, we will eventually get to the upper right corner, which is a single agent that generalizes across all three axes, and that is the foundation agent. I believe training the foundation agent will be very similar to ChatGPT. All language tasks can be expressed as text in and text out, be it writing poetry, translating English to Spanish, or coding Python; it's all the same, and ChatGPT simply scales this up massively across lots and lots of data. It's the same principle: the foundation agent takes as input an embodiment prompt and a task prompt and outputs actions, and we train it by simply scaling it up massively across lots and lots of realities.

Yes, so the first work I want to cover is Voyager. Voyager was the first LLM-powered AI agent that can play Minecraft professionally. I suppose most of you are familiar with Minecraft; it's got around 140 million active players, more than twice the population of the UK, so it's an insanely popular and beloved game, and it's open-ended: there's no fixed storyline, and you can do whatever your heart desires in the game. So we want an AI to have the same capabilities, and when we set Voyager loose in Minecraft, it's able to play the game for hours on end without any human intervention. The video here shows snippets from a single episode of Voyager; this is just a single run that lasted four to five hours, and we took some of the segments out and made this montage.
So you see that Voyager explores the terrain, mines all kinds of materials, fights monsters, crafts hundreds of recipes, and is able to unlock an ever-expanding tree of skills. What is the magic behind it? The key insight is coding as action. Minecraft is a 3D world, but our most powerful LLM, at least at the time of Voyager's writing, was GPT-4, and it was text-only, so we needed a way to convert the 3D world into a textual representation. Thanks to the very enthusiastic Minecraft community, there is an open-source JavaScript API we can use called Mineflayer. So we use this code API, and Voyager is an algorithm designed on top of GPT-4. The way it works is to invoke GPT-4 to generate a code snippet in JavaScript, where each snippet is an executable skill in the game. Once it writes the code, the code is run in the actual game runtime, and just like human engineers, the program Voyager writes isn't always correct, so we have a self-reflection mechanism to help it improve. More specifically, there are three different sources of self-reflection: JavaScript execution errors, the agent's current state, like hunger, health, and inventory, and the world state, like the landscape, resources, and enemies nearby. Given this state, the agent takes an action, observes the consequence of the action on the world and on itself, reflects on how it could do better, tries out more actions, and rinses and repeats. Once a skill becomes mature, Voyager stores the program into a skill library so that it can quickly recall it in the future. You can think of it as a code base authored entirely by GPT-4, and in this way Voyager is able to bootstrap its own capabilities recursively as it explores and experiments in Minecraft.
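The loop described above, propose a task, generate code, run it, reflect on failures, and store mature skills, can be sketched in miniature. This is only an illustration of the control flow: the rule-based stubs below are hypothetical stand-ins for GPT-4 calls, and the real Voyager generates JavaScript against the Mineflayer API rather than Python.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    inventory: set = field(default_factory=set)
    skill_library: dict = field(default_factory=dict)  # skill name -> code string

def propose_task(agent):
    """Stub curriculum: suggest the next unseen item, conditioned on what
    the agent already has (in Voyager this proposal comes from GPT-4)."""
    for item in ["wood", "stone", "iron"]:
        if item not in agent.inventory:
            return f"mine_{item}"
    return None  # nothing novel left in this toy world

def generate_skill(task):
    """Stub code generation: in Voyager this is a GPT-4 call that can
    compose previously stored skills from the library."""
    item = task.split("_", 1)[1]
    return f"def {task}(agent):\n    agent.inventory.add({item!r})\n"

def learn(agent, max_iters=10):
    for _ in range(max_iters):
        task = propose_task(agent)
        if task is None:
            break
        code = generate_skill(task)
        namespace = {}
        try:
            exec(code, namespace)         # compile the generated program
            namespace[task](agent)        # run the skill in the "game"
        except Exception:
            continue  # self-reflection hook: the real agent feeds the
                      # traceback and game state back to GPT-4 and retries
        agent.skill_library[task] = code  # store the mature skill for reuse

agent = Agent()
learn(agent)
print(sorted(agent.inventory))   # ['iron', 'stone', 'wood']
print(len(agent.skill_library))  # 3
```

The skill library here is just a dict of source strings; the property it illustrates is that later generated programs can call earlier stored ones, which is what makes coding-as-action compositional.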
And because we're now talking about coding, coding is compositional: Voyager can write a bunch of functions, and future functions can compose some of the older functions into more and more complex skills and programs. So let's go through a worked example together. The agent in Minecraft finds its hunger bar dropping to 1 out of 20, so it knows it needs to find food. It senses four entities nearby: a cat, a villager, a pig, and some wheat seeds. Now it starts an inner monologue: do I kill the cat or the villager for food? That sounds like a bad idea. How about the wheat seeds? I could grow a farm, but that would take a very long time. So, sorry piggy, you are the chosen one. Then Voyager checks its inventory, retrieves an old skill from the library to craft a sword, and starts to learn a new skill called hunt pig. That's a working example of how Voyager goes through this loop. The question still remains: how does Voyager keep exploring indefinitely? All we did was give Voyager a high-level directive, obtain as many unique items as possible, and Voyager implements a curriculum by itself to find progressively harder and novel challenges to solve. I want to highlight that none of this is hardcoded; this progression of skills is discovered by Voyager itself as it explores, and the curriculum Voyager proposes is conditioned on its current capabilities. If you only know how to use wooden tools, you probably shouldn't propose to solve tasks that would require diamond tools; there's a progression to it, and Voyager is able to find this curriculum automatically. Putting all of this together, Voyager is able to not only master but also discover new skills along the way, and we didn't pre-program any of these; it's all Voyager's idea. We simply took some snapshots from its playing session, and that's what is shown here. We call this process lifelong learning, where the agent is forever curious and forever pursuing new adventures.

Have you guys considered putting more than one agent in the same server together and seeing if they can learn to interact with each other and complete tasks
cooperatively? That's a great idea. We thought about it, but back then the framework we implemented didn't quite support multi-agent; it is on our list of future work. It is a very interesting question, and I do think multi-agent would have new emergent properties. Yeah, my whole thought process was that long term we could see maybe 30-plus agents all in a world building villages together and things like that; we could really see how they develop different ideals or goals over time, and see what separates them. I just thought that was interesting; thanks for answering. It is very interesting; it's actually a great
idea. I remember in your TED Talk you mentioned that the foundation agent is the way to go. From what I understand, Voyager is very successful thanks to MineDojo, so how are you and other Nvidia researchers going to overcome the dataset-curation barrier and enable a foundation agent to play in ten thousand other simulated realities, in maybe Terraria, per se? Yes, so I think there are a couple of dimensions here. In my TED Talk I talk about three axes. The first axis is skills, the number of skills the agent can master. The second is the number of embodiments it can control, and by embodiment I mean things like robot bodies: you can have a humanoid form factor, or a robot dog, or agents in Minecraft; there are different bodies you can control, and that's what we call embodiment. The third axis is realities, basically the number of simulations the agent can master.

For Voyager we only tried Minecraft, because it is an open-ended world. It is one simulation, but it's kind of a meta-simulation: in this one simulation you can do so many different things, in fact an infinite number of creative things, and we have seen humans doing crazy things in this world as well. Someone actually built a functioning CPU circuit within Minecraft, because Minecraft supports something called redstone circuits, which apparently makes the game Turing complete; it's like a programmable game. And Minecraft is just one simulated reality, but there are thousands of games out there: there's Legend of Zelda, there's Elden Ring, all the open-ended games, and there are also simulated realities for robots, and we also have the real world itself, our OG reality. The way I see the future of foundation models for agents is that we need to scale across all three axes I just talked about: we need to scale the number of skills and the embodiments a single model can control, so that a single model can control all the robot bodies, and then it can master all kinds of different rules, mechanics, and physics in different worlds, virtual and physical alike. The idea is that if a model is able to master, let's say, 100,000 different simulated realities, then our real physical world could simply be the 100,001st reality. Some of you might have heard of the simulation hypothesis, the idea that our real world is actually a simulation. We can talk about metaphysics and philosophy all day, but I actually think that idea is great for building AI, because for AI our real world is just another simulation; we can use this principle to guide the design of our next generation of embodied AI systems. That's a quick recap of the main idea, the foundation agent, from my TED Talk. Does that answer your question?

Yeah, I was more curious about how it's going, because data is going to be key, and so is how it's going to learn skills. I remember MineDojo, or I forgot which one, relies on YouTube to learn all the Minecraft movements or skills, so does it basically have to rely on pre-existing data, the whole data-creation problem, or will you have the agent simulate or naturally learn skills
by itself in the future? Yes, let me switch to the MineDojo slide and reshare it. I think you're right: we need some data to bootstrap the process, and for Minecraft specifically, this game is one of the most, maybe the most, streamed games on YouTube, so there are hundreds of thousands, if not millions, of hours of Minecraft play videos online. In MineDojo we explored this dataset. We collected a lot of YouTube videos where the gamer is both playing the game and narrating what they're doing, and these are real segments from a tutorial video. Let's say, in video clip 3: as I raise my axe in front of this pig, there's only one thing you know is going to happen. A YouTuber actually said this, and we put it in the dataset. The way we use this data is to train something called MineCLIP, and to skip the technical details, what the model does is learn the association between a video and the transcript that describes the actions in the video. For this example, the transcript "I'm going to go around gathering a little bit more wood from these trees" aligns very well with the activity in the video, so the score will be close to one, while a part talking about pigs is not aligned with this video, so the score will be close to zero. The score is always between zero and one: one means a perfect description and zero means the text is irrelevant, and you can treat this as a reward function. Concretely, how you use it is you have an agent in simulation and a task prompt asking it to shear sheep to obtain wool. As the agent explores, it generates a video snippet, and this snippet can be compared to the language embedding to output a score. You want to maximize this score, because that means your behavior is aligned with what the task prompt wants you to do, and this becomes a reinforcement learning loop. It's actually RLHF if you squint at it: reinforcement learning from human feedback in Minecraft, except the human feedback is not obtained by annotating the dataset manually but from the transcripts and videos on YouTube. So that's how, in the MineDojo paper, we were able to leverage this YouTube video dataset.
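The reward construction described here can be sketched with a toy stand-in for MineCLIP. Real MineCLIP embeds video frames and transcripts with learned encoders; the bag-of-words "embeddings" below are a hypothetical simplification that only illustrates how a video-text alignment score becomes an RL reward.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts (MineCLIP uses learned encoders)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def reward(task_prompt, clip_caption):
    """Score in [0, 1]: close to 1 when the behavior clip matches the
    task prompt, close to 0 when the text is irrelevant."""
    return cosine(embed(task_prompt), embed(clip_caption))

task = "shear the sheep to obtain wool"
on_task  = reward(task, "the agent shears a sheep and collects wool")
off_task = reward(task, "the agent mines stone underground")
print(on_task > off_task)  # True
```

An RL loop would then maximize this score over the agent's generated video snippets, which is what makes it RLHF-like: the "preference signal" comes from YouTube transcripts rather than manual annotation.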
And moving forward, there are also other ways we can use the video. I briefly mentioned a few things in the slides as well: for example, you can learn encodings of visual representations from video, work that has been applied in robotics but can also be used for things like Minecraft, and you can even directly learn behaviors from video by pseudo-labeling the actions. So there are many ways to use videos to bootstrap embodied agents, and MineDojo is one very particular way of doing it. Thanks for that, Jim. Daniel, I know you have a question. Yeah, the action space was human-annotated from different YouTube clips; I think you guys set up something like a Label Studio setup and were labeling "this is mining something, this is doing XYZ", but in Voyager those actions were extracted by GPT-4 and then saved in a database. So my question is: did you notice any actions found by the AI, kind of like AlphaGo in one of your recent tweets, where it found and stored moves that a human wouldn't make? I guess this is sort of an aside, because I'm now realizing that the video data was all human actions, so I'm guessing
that might not be the case. Yeah, so a few things to note. One is that in MineDojo, the labeling part is about curating a set of tasks that could be possible in Minecraft, and we curated that set of tasks from YouTube videos, but those are not the actions and are not used to train the model. We only train the model using transcripts from the wild, and the manual curation is only to say these are the interesting tasks that can be done; we did not use that as actions. And for Voyager, coming back to your question: it's able to learn all the skills necessary to survive and to find new objects because we gave it, let me bring up this slide, a high-level directive, which is to maximize the number of novel objects it can obtain. So what Voyager does is try to meet that unsupervised objective. We are not telling it that it needs to find diamond, or that it needs to find stone before iron, or iron before diamond; we just say it needs to find as many novel objects as possible, and we actually have a way to measure that. We can look at its inventory and count the number of diverse items it is able to obtain over its lifespan, so we can measure it quantitatively. Let me show you a figure here: we have a comparison with some prior works, baselines like ReAct, Reflexion, and AutoGPT, and this is Voyager, where the blue curve is Voyager without the skill library. In this figure the x-axis is the number of prompting iterations and the y-axis is the number of distinct objects it's able to uncover or craft; whenever we see a new object in the inventory, we count it toward progress. So it's got this high-level objective programmed into it, and the skills are mostly ones a human would be able to do. Voyager is not able to build crazy things yet, because that would require vision, and the original Voyager did not have computer vision: it's not doing the task from pixels, it's converting the world to text, and that's a limitation. If you want to build a castle, you've got to see what you're building; otherwise it's really hard to reason about the 3D coordinates in your head, even for a human. So Voyager doesn't do building tasks, because we didn't ask it to, and also because it's not quite capable of them, given the
limitation of its perception module. What is the strategic value of a corpus like YouTube for training these kinds of open-ended embodied agents? Are these agents going to be able to make sense of the different rules of the world as they vary across simulations versus real-world data? Physics, for example, varies drastically; what are your thoughts? So, I think one of the components for building a foundation agent is really good video models that can understand not just Minecraft videos but videos from many different games, or even videos of the real world, of people doing different tasks. We want to train on as many videos as possible, because what videos encode is something we technically call intuitive physics. When humans go about our daily tasks, we don't solve physics equations in our heads. If you drop a cup on the floor, your brain cannot compute exactly where the water is going to spill or how the cup is going to break; you cannot simulate all of that, but you roughly know you're going to make a mess: the water is going to spill, and if it's a glass cup it's mostly going to break. You have a rough common sense of where things are going, and that is the predictive model in our brain, what we call intuitive physics. It's not physics, it's intuitive; we cannot compute every trajectory. I think current embodied agents lack this common sense: they can't really predict what's going on next, they don't have this intuitive physics built into their brains, and to learn intuitive physics I believe the best way is to learn on lots and lots of videos. Once you have that common-sense model, it's still not enough: you can predict what's going on next, but you still don't know how to act. It's just like watching tennis champions play tennis: you can watch all day and know what's going to happen next, you have a predictive model in your brain, but can you play tennis as well as the best players? You still need a lot of practice to actually ground the common sense you learn from the videos, and that's where I see the simulations coming into play. You need both the videos, with a lot of pre-training, and the simulations, be it Minecraft or a physics sim or some other games, to really ground the knowledge through trial and error. That's how I believe we should build the next embodied systems. I hope that answers your question. Yeah, it does. Is that how you see Omniverse
fitting into all of this? Tesla has noisy data at scale, but we're going to need synthetic training data, or open-ended agents trying stuff in simulated environments, too. Yes. How about this: let me share my screen. This is Eureka, a five-finger robot hand that's able to do pen-spinning tricks in Nvidia's simulation, and we are able to train this using something called Isaac Sim, which is built on top of Omniverse. In terms of abstraction levels, you can think of Omniverse as the base-level graphics engine: it runs on the latest GPUs, gets hardware-native acceleration, and does the rendering, physics, and all of that. Isaac Sim is a library built on top of Omniverse specifically for robotics, so it's able to import things like robot hand models and objects and compute the contact physics of the fingers with the pen here. Most importantly, probably the most unique feature of Isaac Sim is its scalability: you can run 10,000 environments in parallel on a single GPU, which means you're basically speeding up reality by 10,000x. In the real world you're bound by real physics; you simply cannot collect data with this level of throughput, but in simulation you can if you throw compute at it, and with parallel computing you can simulate 10,000 robot hands doing pen spinning at the same time. In this way you scale up the data stream and can train very complex policies, like pen spinning, that would otherwise take maybe a decade of real-world time if you wanted to do it directly on a physical robot; it's very slow. So that's how I see simulation coming into play for embodied agents. And since we're talking about Eureka, I'll just quickly cover this work. How is Eureka
trained? Basically, I see Eureka as two loops. The outer loop is a language model, here GPT-4, writing code against a physics-simulation API, and this code becomes the reward function; reinforcement learning requires a reward function so that you have something to maximize, something to work towards. The second, inner loop is that, given a reward function, we do reinforcement learning to train another neural network that controls the robot hand. This dual-loop system is what makes Eureka quite unique. You can think of it as System 1 and System 2 thinking, from the book Thinking, Fast and Slow. The left loop is the System 2 loop, because it's doing high-level reasoning: it looks at how the hand is performing and then proposes changes in code, so it's deliberate, slow reasoning. The loop on the right is the System 1 loop: it's fast and unconscious, like how you don't reason explicitly while you're spinning a pen; it's more of a feeling, it's muscle memory. So the loop on the right-hand side is System 1, where we have a smaller neural network that runs at a much higher frequency than the left loop and is able to control the hand to do very dexterous tasks. And we are able to do not just pen spinning but a few other manipulation tasks for the robot; I don't think I'm showing this here, but basically this method is general and not limited to just pen spinning. Okay, I'll open it up for about five minutes of questions. Thank you very much.
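The dual-loop structure can be sketched schematically. The candidate reward strings below are hypothetical stand-ins for GPT-4's proposals against Nvidia's physics API, and the inner "training run" is reduced to scoring the proposed reward on a single probe state; both are loose simplifications of Eureka's actual pipeline, shown only to make the outer/inner structure concrete.

```python
candidate_rewards = [
    # code the outer-loop LLM might propose (here: fixed example strings)
    "def reward(state): return -abs(state['pen_angle_error'])",
    "def reward(state): return -abs(state['pen_angle_error']) - 0.1 * state['energy']",
]

def inner_loop_train(reward_code):
    """Stand-in for a full RL training session to convergence: compile the
    proposed reward and report a scalar fitness back to the outer loop."""
    namespace = {}
    exec(reward_code, namespace)
    # evaluate the resulting "policy" on one probe state
    probe = {"pen_angle_error": 0.2, "energy": 1.0}
    return namespace["reward"](probe)

best_code, best_score = None, float("-inf")
for code in candidate_rewards:      # outer loop: propose, train, reflect
    score = inner_loop_train(code)  # inner loop: train and measure
    if score > best_score:
        best_code, best_score = code, score
# Eureka additionally feeds the performance metrics back to GPT-4 so the
# next batch of reward proposals improves iteratively.

print(best_score)  # -0.2
```

The key design point this mirrors is the interface between the loops: the only things crossing it are a reward function (code) going in and a performance metric coming back.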
I'll try to make this quick. I think the paper says that the reward functions can essentially be updated in near real time; is that correct, that you don't have to retrain the entire model? So for the reward function, it's updated every time the inner loop, the loop on the right-hand side, has finished. You can think of that loop as a full reinforcement-learning training session: we train it to convergence, and then it has a performance metric which it can report back to GPT-4, and GPT-4 will then propose the next reward function. Okay, so the future I'm seeing with this is that we could have a bot that actually exists in the real world, and with a similar architecture we could potentially train a bot by
actually showing it an example, and then it practices that same example itself. I'm just wondering, since you guys seem to be very focused on robotics, is this the future you're looking towards? Yeah, I think there are many ways we can scale Eureka even more. One is to scale across simulated skills: here we are learning one skill at a time, this pen-spinning skill is one Eureka run, but you can imagine doing maybe a thousand different skills in parallel if we throw a lot of GPUs at it, and that is something we are thinking about doing. Actually, in this video you can see a lot of other tasks we tried, but each task is a separate neural network; we're not training a single multitasking one, but that is an obvious next step we can take. The other thing is to actually make it work in the real world, and that involves sim-to-real: how do we transfer the neural network learned in simulation to the real world? There are many techniques for it. One is called domain randomization, which is basically the simulation hypothesis I just mentioned: if you're able to master 10,000 different simulated realities, or different physical configurations in sim, let's say you can work with Earth's gravity and also the moon's gravity and Mars's gravity in simulation, 10,000 of them, then you are very likely able to generalize to the real world, which will be very complex and not quite the same as the simulation. The simulation is always going to be an inaccurate portrayal of the real world, but that's how we can overcome the sim-to-real gap.
I feel like Eureka is a very underrated research paper from last year; it's probably my favorite. And is it the first of its kind, the first fully LLM-trained robot? If so, is there a bridge being built right now where skills learned in Isaac Gym can be applied to real-world robots? Yeah, so first, thanks for your kind words. There are a few works on combining LLMs and robotics, including some from Berkeley, from Stanford, from some of the universities, but I think Eureka, at least to my knowledge, should be the first to use an LLM to instruct how to train a robot. You can see Eureka as automating the development of robotics, because typically the reward functions are written by human engineers, robot developers, and not every developer can write a reward function; it's actually very specialized. You've got to have domain expertise in how to use a physics simulation, and you've got to be familiar with the whole framework to be able to do it; it's not easy even for a programmer to do this without training. But here we found that GPT-4 is so good at zero-shot understanding of documentation that we simply feed Nvidia's physics API documentation to GPT-4, and it writes these functions, and it can actually write them better than the human developers. So we see Eureka as a first step toward automating the development of robotics itself. If you think about robotics, ultimately it's just a bunch of code, so can an entire robot stack be programmed not by us but by GPT-4, or whatever is coming next, and can it do it iteratively? That is a fascinating question. So would it be reasonable to describe it as the first AI-agent-trained robot in a simulation, the first LLM-instructed AI agent? Yeah, the first LLM-trained agent concept
there for robotics. Yeah. So, sorry, go ahead. Have you heard of the Mamba architecture? Is there a possibility of Transformers being replaced by Mamba in robot sims or robot learning, like in VIMA? Sorry, that might be off topic. VIMA, yes. I think it's an orthogonal question, because the architecture will be useful, but it is not the pain point of robotics research; we haven't even exhausted the potential of Transformers yet. I think the hard part for robotics is data: how do you get data for it? The data can come from internet videos, as we just covered, and it can also come from scaling up simulation, and simulation is a little bit special because the data is generated by the agent itself; it's actively collected data, versus internet data, which is passively collected. So data is the bottleneck; we can use whatever architecture we want, and if Mamba replaces Transformers in the future, we're happy to switch, but it's not the pain point right now. Got it, thank you so much.
Video “Nvidia’s NEW “AI AGENT” Will Change The WORLD! (Jim Fan)” was uploaded on 02/01/2024 to YouTube Channel TheAIGRID