OpenAI’s “Sora” Simulates REALITY – AGI, Emergent Capabilities, and Simulation Theory
OpenAI has recently unveiled its newest text-to-video product, Sora, and the results have left the entire industry stunned. Sora is an AI model that can create one-minute-long videos with objects that stay consistent for the entire duration: objects within the video do not transform or change, and new objects do not appear out of nowhere. This level of realism and consistency is unprecedented in AI video generation.
Sora’s impressive capabilities have significant implications for the future of media and content creation. Its ability to generate complex scenes with multiple characters, specific types of motion, and accurate details of both subject and background is revolutionary. Sora is also far cheaper than traditional rendering pipelines such as Unreal Engine, because it generates the whole video end to end rather than explicitly computing every pixel, texture, and occlusion.
The potential applications of Sora are vast, ranging from dynamic video game environments to personalized videos generated from descriptive prompts. OpenAI’s technical report on Sora highlights how markedly video quality improves as training compute is scaled up. Sora’s use of diffusion transformers and its ability to sample videos at different sizes and aspect ratios demonstrate its versatility and scalability.
Ultimately, Sora’s emergence as a groundbreaking video creation tool signifies a major advancement in AI technology and has far-reaching implications for the future of media and content creation.
Watch the video by Matthew Berman
Video Transcript
OpenAI’s new text-to-video product, Sora, has shocked the entire industry. Nobody saw it coming, and nothing will be the same; I can genuinely use all of the clickbaity titles that you all really dislike. Sora is incredible. I’m going to show you a bunch of examples, and I’m going to explain how this has the potential to really change a lot of how media works today. As a content creator, I’m both nervous and excited about it, and I’m going to show you some of the best examples and funniest mistakes that I found with it.
So here’s the blog post announcement from OpenAI: creating video from text. Sora is an AI model that can create realistic and imaginative scenes from text instructions. From all of the videos that I’ve seen of Sora, the thing that really sticks out to me is its ability to create one-minute-long videos with consistent objects throughout the entirety of the video. I’ve never seen anything like this before. Typically, when you’re looking at AI video, the objects within the video transform over its duration, or new objects get created out of nowhere. But with Sora, I’ve seen some of the most incredible examples I’ve ever seen.
So here’s an example: this is a woman walking down a street in Tokyo. Her hands look real, her walking looks real, the scene behind her looks real, and all of the signage in the background stays consistent throughout the entire video.
When they finally zoom in to her face, you can see every little texture on it, and the way the hair and the dress move looks correct. It is extremely, extremely impressive. Here it says Sora is able to generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background; the model understands not only what the user has asked for in the prompt, but also how those things exist in the physical world. There’s a Chinese Lunar New Year celebration video with a Chinese dragon, and every single person in it looks good. As you can tell, the dragon passes over people in the background, and when those people become visible again, they don’t change. That is the most amazing part to me.
Here’s another example that blew my mind more than any other I’ve seen. This is a scene of somebody inside a train holding a smartphone, taking a video out the window, and you can see the entire world with all the buildings passing by: thousands of buildings, all consistent throughout the video. Then, when the train goes through a tunnel, you can all of a sudden see the reflection of the person taking the video in the window, and it looks flawless. I am so impressed by this; I could watch it over and over again, because it seems to actually know that there’s a person inside the train filming the outside, shows that person briefly, and then goes back to showing what’s outside the train.
Here’s an example of a bunch of puppies playing around in the snow, and it is hyper-detailed. The puppies look real, you can see the individual pieces of fur on them, and the snow looks great. I’m going to show you some more great examples, and I’m also going to show you some funny-looking and mesmerizing mistakes Sora has made, so stick around for that.
Sora doesn’t actually have an understanding of the objects in the image, and that can be a benefit or a drawback. The benefit is that it’s able to generate these incredible videos really inexpensively compared to something like Unreal Engine, which is extremely compute-heavy because it has to understand every single detail of every single pixel in the video. It’s computing where each pixel is going to go, and when you factor in things like textures, light, and occlusion, that becomes very, very expensive. With this method, however, it’s generating the end-to-end video, and everything in it, all in one go, and it doesn’t actually have an understanding of each individual object within the video.
So let’s talk about video games for a second. Take a look at this video: this is Minecraft, but it’s not Minecraft. It was created by Sora, and everything looks flawless. It is creating this environment dynamically for you. Not only that, but if you think about it, we could use large language models to generate the actual actions occurring and then use Sora to generate the video. With both of those things together, this could be the future of video games: completely dynamic, built for one, with the story and the graphics both tailored to exactly what you want them to be.
Look at this; this looks so real. It’s something that looks like a Range Rover driving through the mountains. There are a bunch of trees around, it’s on a dirt road, and the dirt looks so real. This looks better than any video game I’ve ever played. Maybe the best-performing video game might look something similar to this, but this is still better. And the thing is, the compute cost to run this video is exponentially less than it would be if we were using Unreal Engine.
OpenAI put out a technical report about how they were able to achieve these incredible results with Sora, and one thing I want to show is the difference in compute: the more they scaled up compute, the better the results were, so there seems to be a very direct relationship between compute and quality. Here’s the base compute on the left side, and as you can see, it does not look good. Everything’s blurry, nothing really makes sense, and the puppy has multiple faces or no face at all. Here’s one with four times the compute, and now it’s starting to look better; we can actually tell the individual objects in the video apart, and there’s a person in the background, but things still look blurry, especially in the background. Now here’s 32 times the compute, and this looks like 4K, like perfect quality. Everything in the image looks good: the puppy, the hat, the fur, the grass, the snow, the trees blurred in the background. Everything looks great.
Sora is using diffusion transformers; that is the tech behind it. Sora is a diffusion model: given input noisy patches (that’s what we’re seeing on the left here) and conditioning information like text prompts, it’s trained to predict the original clean patches (that’s what we’re seeing on the right side). Importantly, Sora is a diffusion transformer, and transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling, computer vision, and image generation.
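To make that concrete, here’s a minimal sketch, assuming a toy transformer and a simple linear noising schedule, of what “predict the original clean patches from noisy patches plus a text prompt” can look like in code. None of this is OpenAI’s actual architecture or implementation; the module names, shapes, and schedule are all illustrative assumptions.

```python
# A minimal sketch of one denoising training step for a diffusion model
# over video patches. Everything here (module names, shapes, the linear
# noising schedule) is an illustrative assumption, not OpenAI's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchDenoiser(nn.Module):
    """Toy transformer that predicts clean patches from noisy ones."""
    def __init__(self, patch_dim=256, text_dim=256, n_heads=8, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=patch_dim, nhead=n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.text_proj = nn.Linear(text_dim, patch_dim)

    def forward(self, noisy_patches, text_emb):
        # Condition on the prompt by prepending one projected text token.
        cond = self.text_proj(text_emb).unsqueeze(1)   # [B, 1, D]
        h = torch.cat([cond, noisy_patches], dim=1)    # [B, 1+N, D]
        return self.backbone(h)[:, 1:]                 # predicted clean patches

model = PatchDenoiser()
clean = torch.randn(2, 64, 256)   # [batch, spacetime patches, patch dim]
text = torch.randn(2, 256)        # pooled prompt embedding (stand-in)
t = torch.rand(2, 1, 1)           # random noise level per sample
noisy = (1 - t) * clean + t * torch.randn_like(clean)  # corrupt the patches
loss = F.mse_loss(model(noisy, text), clean)  # regress toward clean patches
loss.backward()  # one gradient step of "learn to denoise"
```

Trained on a huge number of such (noisy patches, prompt, clean patches) examples, a model like this learns to denoise; at sampling time it starts from pure noise and refines it into a video.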
Sora can also sample different sizes. Here’s a vertical video, here’s a square video, and here’s a wide video: it can do 1920x1080 widescreen and also 1080x1920 vertical.
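What makes that variable sizing possible is that videos are cut into spacetime patches that become transformer tokens, so different resolutions and durations simply yield different numbers of tokens for the same model. Here’s a toy illustration; the patch dimensions and clip shapes are made up for the demo.

```python
# Toy illustration of spacetime patches: videos of different shapes and
# durations become different numbers of tokens for the same transformer.
# The patch size (4 frames x 16 x 16 pixels) is made up for this demo.
import torch

def patchify(video, pt=4, ph=16, pw=16):
    """Split a [T, H, W, C] video into flat spacetime patch tokens."""
    T, H, W, C = video.shape
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.permute(0, 2, 4, 1, 3, 5, 6)        # group the patch grid first
    return x.reshape(-1, pt * ph * pw * C)    # [n_tokens, token dim]

wide = torch.randn(16, 144, 256, 3)    # toy landscape clip
tall = torch.randn(16, 256, 144, 3)    # toy vertical clip
short = torch.randn(8, 192, 192, 3)    # toy square clip, half as long
for clip in (wide, tall, short):
    print(patchify(clip).shape)        # token count varies with shape
```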
And very similar to DALL·E, it is translating your prompt into a very descriptive prompt using ChatGPT, so it’s not just taking your prompt and putting it directly into Sora. The same way DALL·E works, it’s taking your prompt, trying to understand the intention behind it, and translating that into a much better prompt.
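As an aside, here’s a hedged sketch of what that expansion step could look like using the OpenAI Python client. The system prompt, the model choice, and the function name are illustrative assumptions; OpenAI hasn’t published the actual rewriter, and there is no public Sora API being called here.

```python
# Illustrative sketch of prompt expansion before video generation,
# analogous to how DALL-E 3 rewrites prompts with ChatGPT. The system
# prompt and model choice are assumptions; no Sora API exists here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def expand_prompt(user_prompt: str) -> str:
    """Turn a terse user prompt into a richly descriptive scene prompt."""
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in model name for the rewriter
        messages=[
            {
                "role": "system",
                "content": (
                    "Rewrite the user's video idea as a single, highly "
                    "detailed scene description covering subjects, motion, "
                    "camera, lighting, and background."
                ),
            },
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

print(expand_prompt("a woman walking down a street in Tokyo"))
```

The expanded description, rather than your terse original, would then be what conditions the video model.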
Let’s take a look at a few more incredible examples. Here’s one of a drone flying through the Colosseum; it’s as if a camera is behind the drone, and everything within the Colosseum is extremely consistent. That is what sets Sora apart from anything I’ve seen before, and if I were Runway ML, I’d be a little nervous right now, because this is better than anything I’ve seen them produce. It’s really the consistency that blows me away, and the way they achieved it is by calculating the entire duration of the one-minute video all in one go, rather than trying to predict the next frame from the previous frame. That allows all of the objects within the video to stay really consistent.
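Here’s a toy contrast between the two strategies. Both “models” below are hypothetical stand-ins (just noise and scaling), not anyone’s real API; the point is the difference in control flow.

```python
# Toy contrast: frame-by-frame prediction vs. refining the whole clip.
# Both "models" are fake stand-ins; only the control flow matters here.
import torch

T, H, W, C = 60, 64, 64, 3  # a short toy clip

def predict_next_frame(prev):
    # Autoregressive stand-in: each frame only sees the previous one,
    # so per-step errors accumulate and objects drift over time.
    return prev + 0.05 * torch.randn_like(prev)

def denoise_step(x):
    # Joint stand-in: every frame is refined together, so frame 1 and
    # frame 60 can stay consistent with each other.
    return 0.9 * x

# Strategy 1: frame by frame (how many earlier video models worked).
frames = [torch.randn(H, W, C)]
for _ in range(T - 1):
    frames.append(predict_next_frame(frames[-1]))

# Strategy 2: start from noise over the full duration and refine it all
# at once, which is the "all in one go" idea described above.
video = torch.randn(T, H, W, C)
for _ in range(50):
    video = denoise_step(video)
```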
Now here’s a very similar video, except the drone turns into a butterfly and the Colosseum turns into an underwater world. Really, really cool. And again, everything in this video was generated anew compared to the other one; it’s not just editing the previous video, it’s actually generating something new.
And here’s one of my favorites: an old western California Gold Rush town that looks like it was filmed in the 1800s. The potential for this to change movies, television, and all media is enormous. I’ve been talking about having media catered to an audience of one for a while now, and this really shows what’s going to be possible. You can just describe the TV show that you want to see, and you will get it exactly how you want it. So this idea of having a central production company creating one product and serving it to lots of people might soon be a thing of the past.
Thanks to the sponsor of this video, Nero Studio. Nero Studio empowers users to transform text into compelling video in over 120 languages thanks to AI-driven capabilities. I’ve tested the platform myself, and Nero Studio is packed with unique capabilities, starting with its very simple interface: simply input your text, select a voice from over 140 languages and moods, and Nero Studio does the rest. Create videos with authentic and relatable voices and emotions to engage audiences worldwide. They even have an awesome whisper feature with which you can have avatars speak gently and smoothly. Use Nero Studio’s pre-made avatars, voice conversation, lip sync, and many more features; it goes on and on. You can create user-generated videos to promote your brand, and create both vertical and horizontal videos depending on where you want to put that content. Product demos, business, just for fun: Nero does it all, from beginners to seasoned creators. So check out Nero Studio; I’ll drop a link in the description below, along with the promo code Berman, which will give you 50% off any of Nero Studio’s paid plans. Thanks again to Nero Studio; now back to the video.
Sora is also capable of generating images. I don’t know if they’re thinking about this as a replacement for DALL·E, but Sora is leaps and bounds better at videos than DALL·E is at images. Take a look at these Sora images: all four of these were created by Sora, not DALL·E, and they look beautiful. But here’s the super fascinating (slash scary) part: this could be the technology that brings us full circle with simulation theory. Let’s take a look at what it says here: “We find that video models exhibit a number of interesting emergent capabilities when trained at scale. These capabilities enable Sora to simulate some aspects of people, animals, and environments from the physical world. These properties emerge without any explicit inductive biases for 3D objects, etc.; they are purely phenomena of scale.”
So, one: 3D consistency. Sora can generate videos with dynamic camera motion; as the camera shifts and rotates, people and scene elements move consistently through three-dimensional space. Again, it doesn’t actually have an understanding of the individual objects in the 3D space, but it’s able to generate consistent 3D objects within it.
Then there’s long-range coherence and object permanence; we’ve already talked about this. Look at this example right here: we have people walking in front of a window which has a puppy leaning out of it, and as the people pass over it, completely blocking the puppy, the movement of the puppy in the background stays consistent, even though you can’t see it while it’s moving.
It also has a good understanding of interacting with the world. Sora can sometimes simulate actions that affect the state of the world in simple ways; for example, a painter can leave new strokes on a canvas that persist over time, or a man can eat a burger and leave bite marks. Look at how cool that is. And as I already mentioned, simulating digital worlds is probably the biggest potential in my mind: video games could be created dynamically in real time, dependent on exactly what you do in the world.
But it also makes mistakes; it’s not perfect. Check out this video right here: it shows a glass being picked up, and the liquid falls through the glass incorrectly. One of the stated limitations is that it does not accurately model the physics of many basic interactions, like glass shattering; other interactions, like eating food, do not always yield correct changes in object state. They enumerate other common failure modes of the model, such as incoherencies that develop in long-duration samples or spontaneous appearances of objects. Now, this was a huge problem for previous text-to-video models; it seems to be much less of a problem for Sora, but still a problem. Let’s look at some of these examples.
Here is my favorite mistake video. This is a team of people pulling what look to be plastic chairs out of the sand. The chair itself is moving by itself, and at the beginning of the video there is no second chair; they pull out what looks to be sand that transforms into a chair, and another guy pulls a second piece off the chair and seems to be just holding plastic. So everything is wrong about this video, but it is still mesmerizing to watch. It is really, really cool, but Sora is clearly still capable of making mistakes.
Here’s another example: a very scary-looking grandma blowing out candles, except when she goes to blow them out, nothing actually gets blown out.
The candles are still there. But look at all the people in the background: they all look great, they’re all very consistent, so it still looks really good. This could be a scary movie. Here’s another example with some wolf puppies that seem to be getting generated out of nowhere, and their bodies start colliding, so there are definitely a lot of mistakes in this video. And here’s one of people relaxing at the beach when all of a sudden a shark jumps out at them: the shark’s fin is not there, then the shark’s fin swims up, and then the lady’s head performs a 180 and completely switches around to look at the camera, and that is terrifying. Also, in the corner, a guy’s foot or hand or something is just kind of hanging and waving there; really, really odd, and kind of gross, but funny to look at.
So I’m going to play a few more awesome examples of Sora video right now. If you liked this video, please consider giving it a like and subscribing, and I’ll see you in the next one.
Video “OpenAI’s “Sora” Simulates REALITY – AGI, Emergent Capabilities, and Simulation Theory” was uploaded on 02/18/2024 to the YouTube channel Matthew Berman.