Sora – Full Analysis (with new details)
The emergence of Sora, the text-to-video model from OpenAI, has generated both excitement and concern within the AI community. The release of the technical report for Sora, along with new demos and details, has provided deeper insight into the capabilities of this groundbreaking model. While the demos showcasing Sora’s abilities are undeniably impressive, it is important to remain cautious in our assessment of its true understanding of the world.
OpenAI admits that Sora has weaknesses, such as struggling to accurately simulate the physics of complex scenes and often mixing up left and right. Despite these limitations, Sora’s ability to generate high-resolution videos up to 1080p is remarkable. The model was trained at massive scale, deriving billions of patterns from the world, but it still lacks the ability to reason about those patterns effectively.
One of the key innovations of Sora is its training on video data, which inadvertently solves image generation as well. The possibilities are endless, from animating photos of loved ones to creating dynamic movie trailers with fast cuts. The potential business applications of Sora are vast, with the ability to bring static images to life in ways we have never seen before.
In conclusion, Sora represents a significant leap forward in AI technology, with the potential to revolutionize the way we interact with visual media. While there are still challenges to overcome, the future looks bright for Sora and the endless possibilities it offers. Keep an eye on Sora as it continues to push the boundaries of what AI can achieve.
Watch the video by AI Explained
Video Transcript
Sora, the text-to-video model from OpenAI, is here, and it appears to be exciting people and worrying them in equal measure. There is something visceral about actually seeing the rate of progress in AI that hits different than leaderboards or benchmarks. In just the last 18 hours, the technical report for Sora has come out, and more demos and details have been released. I’m going to try to unpack what Sora is, what it means, and what comes next.

Before getting into any details, though, we just have to admit that some of the demos are frankly astonishing. This one, a tour of an art gallery, is jaw-dropping to me. But that doesn’t mean we have to get completely carried away with OpenAI’s marketing material: that the model “understands what the user asks for” and “understands how those things exist in the physical world.”
I don’t even think the authors of Sora would have signed off on that statement, and I know it might seem like I’m being pedantic, but these kinds of edge-case failures are what’s held back self-driving for a decade. Yes, Sora has been trained at an immense scale, but I wouldn’t say that it understands the world. It has derived billions and trillions of patterns from the world, but it can’t yet reason about those patterns, hence anomalies like the video you can see. Later on in the release notes, OpenAI says this: the current model has weaknesses. It may struggle with accurately simulating the physics of a complex scene; it doesn’t quite get cause and effect; it also mixes up left and right, and objects appear spontaneously and disappear for no reason. It’s a bit like GPT-4 in that it’s breathtaking and intelligent, but if you probe a bit too closely, things fall apart a little. To be clear, I am stunned by Sora just as much as everyone else; I just want it put in a little bit of context. That being said, if and when models crack reasoning itself, I will try to be among the first to let you know.

It’s time for more details. Sora can generate videos up to a full minute long, at up to 1080p, and it was trained on, and can output, different aspect ratios and resolutions. Speaking of high resolution, this demo was amongst the most shocking; it is incredible. Just look at the consistent reflections. In terms of how they made it, they say “model and implementation details are not included in this report,” but later on they give hints in terms of the papers they cite in the appendices. Almost all of them, funnily enough, come from Google: we have Vision Transformers, and adaptable aspect-ratio and resolution Vision Transformers, also from Google DeepMind, which we saw being implemented with Sora. Many other papers from Facebook and Google were cited, which even led one Google DeepMinder to jokingly say this: “You’re welcome, OpenAI. I’ll share my home address in DM if you want to send us flowers and chocolate.”
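Those adaptable-aspect-ratio Vision Transformer papers hint at one concrete piece of the recipe: a video of any resolution can be cut into fixed-size “spacetime patches” and fed to a transformer as a token sequence. Here is a minimal, hypothetical sketch of that patchify step; the dimensions and patch sizes are chosen for illustration, not taken from the report:

```python
import torch

def patchify(video: torch.Tensor, p: int = 16, pt: int = 2) -> torch.Tensor:
    """Cut a video of shape (frames, channels, height, width) into flattened
    spacetime patches of size (pt, p, p); works for any H, W divisible by p."""
    f, c, h, w = video.shape
    patches = video.reshape(f // pt, pt, c, h // p, p, w // p, p)
    patches = patches.permute(0, 3, 5, 1, 2, 4, 6)  # group by patch location
    return patches.reshape(-1, pt * c * p * p)      # one token per patch

video = torch.randn(8, 3, 128, 96)   # any aspect ratio, not just square
tokens = patchify(video)             # (num_patches, patch_dim) token sequence
print(tokens.shape)                  # torch.Size([192, 1536])
```

Because the token count simply grows or shrinks with the frame size, the same model can, in principle, train on and emit different aspect ratios and resolutions, exactly the property the report highlights.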
By the way, my 30-second summary of how it’s done would be this. Just think to yourself about the task of predicting the next word: it’s easy to imagine how you’d test yourself. You’d cover the next word, make a prediction, and check. But how would you do that for images, or frames of a video? If all you did was cover the entire image, it would be pretty impossible to guess, say, a video frame of a monkey playing chess. So how would you bridge that gap? Well, as you can see below, how about adding some noise, a little bit of cloudiness, to the image? You can still see most of the image, but now you have to infer patches here and there, with, say, a text caption to help you out. That’s more manageable, right? And now it’s just a matter of scale: scale up the number of images, or frames of images from a video, that you train on. Ultimately, you could go from a highly descriptive text caption to the full image from scratch, especially if the captions are particularly descriptive, as they are for Sora.
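To make that denoising idea concrete, here is a minimal toy sketch of the training loop just described: add noise to (stand-in) image patches, have a small network guess the noise from the noisy patches plus a caption embedding, and check. Every name, shape, and noise schedule here is an illustrative assumption, not Sora’s actual architecture:

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Toy stand-in for a diffusion model: given noisy patches, a caption
    embedding, and the noise level, predict the noise that was added."""
    def __init__(self, patch_dim=768, text_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(patch_dim + text_dim + 1, 1024),
            nn.SiLU(),
            nn.Linear(1024, patch_dim),
        )

    def forward(self, noisy, text_emb, t):
        return self.net(torch.cat([noisy, text_emb, t], dim=-1))

model = TinyDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
patches = torch.randn(32, 768)    # clean image/video patches (random stand-ins)
text_emb = torch.randn(32, 512)   # caption embeddings (random stand-ins)

for step in range(100):
    t = torch.rand(32, 1)                    # random noise level per sample
    noise = torch.randn_like(patches)
    noisy = (1 - t) * patches + t * noise    # "add a little cloudiness"
    loss = ((model(noisy, text_emb, t) - noise) ** 2).mean()  # guess, then check
    opt.zero_grad(); loss.backward(); opt.step()
```

At generation time the same process runs in reverse: start from pure noise and repeatedly subtract the predicted noise, guided by the caption, until an image (or a sequence of frames) emerges.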
Need to do is find a sugar daddy to invest $13 billion into you and boom you’re there of course I’m being a little bit factious it builds on years of work including bu notable contributors from open aai they pioneered the autoc captioning of images with highly descriptive language using those synthetic captions massively
Optimized the training process when I mentioned scale by the way look at the difference that more compute makes when I say compute think of arrays of gpus in a data somewhere in America when you forx the compute you get this and if you 16 exit you get that more images more
Training more compute better results now I know what you’re thinking just 100x the compute there’s definitely enough data I did a back of the envelope calculation that there are quadrillions of frames just on YouTube definitely easier to access if you’re Google by the way but I will caveat that as we’ve seen
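For what it’s worth, here is one way to reproduce that back-of-the-envelope estimate. The upload rate, frame rate, and time span below are my own assumptions (the ~500 hours-per-minute figure is widely cited for YouTube), not numbers from the video:

```python
# Rough order-of-magnitude estimate of frames on YouTube.
HOURS_UPLOADED_PER_MINUTE = 500   # assumption: widely cited upload rate
FPS = 30                          # assumption: typical frame rate
YEARS = 15                        # assumption: years of large-scale uploads

minutes_per_year = 60 * 24 * 365
hours_uploaded = HOURS_UPLOADED_PER_MINUTE * minutes_per_year * YEARS
total_frames = hours_uploaded * 3600 * FPS

print(f"{total_frames:.1e} frames")  # ~4e14, approaching a quadrillion
```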
But I will caveat that, as we’ve seen with GPT-4, scale doesn’t get you all the way to reasoning, so you’ll still get weird breaches of the laws of physics until other innovations get thrown in.

But then we get to something big that I don’t think enough people are talking about: by training on video, you’re inadvertently solving images. An image, after all, is just a single frame of a video. The images from Sora go up to 2K by 2K pixels, and of course they could be scaled up further with a tool like Magnific. I tried that for this image, and honestly there was nothing I could see that would tell me that this isn’t just a photo. I’d almost ask the question of whether this means there won’t be a DALL-E 4, because Sora supersedes it.

Take animating an image. This example is just incredible: a Shiba Inu dog wearing a beret and black turtleneck. That’s the image on the left, and it being animated on the right. You can imagine the business use cases of this, where people bring to life photos of themselves, friends and family, or maybe even deceased loved ones. Or how about every page of what would otherwise be a static children’s book being animated on demand? You just click, and the characters get animated. Honestly, the more I think about it, the more I think Sora is going to make OpenAI billions and billions of dollars; the number of other companies and apps that it just subsumes within it is innumerable.
I’ll come back to that point, but meanwhile here is a handful of other incredible demos. This is a movie trailer, and notice how Sora is picking quite fast cuts, obviously all automatically: it gets that a cinematic trailer is going to be pretty dynamic and fast-paced. Likewise, this is a single video generated by Sora, not a compilation, and if you ignore some text-spelling issues, it is astonishing.

And here is another one that I’m going to have to spend some time on; the implications of this feature alone are astonishing. All three videos that you can see are going to end with the exact same frame. Even that final frame of the cable car crashing into the sign was generated by Sora, including the minor misspelling at the top. Just think of the implications: you could have a photo with your friends and imagine a hundred different ways you could have got there, to that final photo. Or maybe you have your own website, and every user gets a unique voyage to your landing page. And of course, when we scale this up, we could put the ending of a movie in, and Sora 2 or Sora 3 would calculate all the different types of movies that could have led to that point. You could have daily variations on your favorite movie ending. As a side note, this also allows you to create these funky loops where the starting and finishing frames are identical.
I could just let this play for a few minutes until people got really confused, but I won’t do that to you. And here is yet another feature that I was truly blown away by. The video you can see on screen was not generated by Sora, and now I’m going to switch to another video which was also not generated by Sora. But what Sora can do is interpolate between those videos to come up with a unique creation. This time I’m not even going to list the potential applications, because again, they are innumerable.
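As a hedged sketch of the general idea (not Sora’s actual method or API), interpolation can be pictured as blending two videos’ latent representations, with the mixing weight sweeping from one video to the other over time. The tensors below are toy stand-ins for whatever latents a real encoder would produce:

```python
import torch

def morph(latent_a: torch.Tensor, latent_b: torch.Tensor) -> torch.Tensor:
    """Blend two video latents of shape (frames, channels, height, width),
    starting as pure A and ending as pure B."""
    n_frames = latent_a.shape[0]
    alphas = torch.linspace(0, 1, n_frames).view(-1, 1, 1, 1)
    return (1 - alphas) * latent_a + alphas * latent_b

# Toy latents standing in for two encoded source videos.
a = torch.randn(16, 4, 32, 32)
b = torch.randn(16, 4, 32, 32)
blended = morph(a, b)  # a real system would decode this back to pixels
```

In a real diffusion system, the blended latent would additionally be re-noised and re-denoised so the model can resolve the mixture into a coherent scene rather than a simple crossfade.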
What I will do, though, is give you one more example that I thought of when I saw this. Another demo that OpenAI used was mixing together this chameleon and this funky-looking bird (I’m not sure of its name) to create this wild mixture. Now, we all know that OpenAI are not going to allow you to do this with human images, but an open-source version of Sora will be following close behind. So imagine putting in a video of you and your partner and creating this freaky hybrid video, or maybe of you and your pet.

Now, the best results you’re going to get from Sora are inevitably when there’s not as much movement going on: the less movement, the fewer problems with things like object permanence. Mind you, even when there is quite a lot going on, the results can still be pretty incredible. Look at how Sora handles object permanence here, with the dog fully covered and then emerging looking exactly the same. Likewise, this video of a man eating a burger: because he’s moving in slow motion, it’s much more high-fidelity, and aside from the bokeh effect, it could almost be real. And then we get this gorgeous video, where you’d almost have to convince me it’s from Sora; look at how the paint marks stay on the page. And then we get simulated gaming, where again, if you ignore some of the physics and the rule-breaking, the visuals alone are just incredible. Obviously they trained Sora on thousands of hours of Minecraft videos; I mean, look how accurate some of the boxes are. I bet some of you watching this think I simply replaced a Sora video with an actual Minecraft video, but no, I didn’t.

That has been quite a few hype demos, so time for some anti-hype ones. Here is Sora clearly not understanding the world around it: just as ChatGPT’s understanding can sometimes be paper-thin, so can Sora’s.
It doesn’t get the physics of the cup, the ice, or the spill. I can’t forget to mention, though, that you can also change the style of a video. Here is the input video, presumably from a game; now, with one prompt, you can change the background to a jungle. Or maybe you prefer to play the game in the 1920s: you can see how the wheels aren’t moving properly, but the overall effect is incredible. Well, actually, this time I want to play the game underwater. How about that? Job done. Or maybe I’m high and I want the game to look like a rainbow, or maybe I prefer the old-fashioned days of pixel art.

I’ve noticed a lot of people, by the way, speculating about where OpenAI got all the data to train Sora. I think many people have forgotten that they did a deal back in July with Shutterstock. In case you don’t know, Shutterstock has 32 million stock videos, and most of them are high-resolution. They probably also used millions of hours of video game frames, would be my guess.

One more thing you might be wondering: don’t these worlds just disappear the moment you move on to the next prompt? Well, with video-to-3D, that might not always be the case. This is from Luma AI, and imagine a world generated at first by Sora, then turned into a universally shareable 3D landscape that you can interact with.
Effectively, you and your friends could inhabit a world generated by Sora. And yes, ultimately, with scale, you could generate your own high-fidelity video game. And given that you can indefinitely extend clips, I am sure many people will be creating their own short movies, perhaps voiced by AI. Here’s an ElevenLabs voice giving you a snippet of the caption to this video: “An adorable happy otter confidently stands on a surfboard wearing a yellow lifejacket, riding along turquoise tropical waters near lush tropical islands.” Well, how about hooking Sora up to the Apple Vision Pro or Meta Quest, especially for those who can’t travel? That could be an incredible way of exploring the world. Of course, being real here, the most common use case might be children using it to make cartoons and play games, but still, that counts as a valid use case to me.

But underneath all of these use cases are some serious points. In a since-deleted tweet, one OpenAI employee said this: “We are very intentionally not sharing it widely yet. The hope is that a mini public demo kicks a social response into gear.” I’m not really sure what social response people are supposed to give, though. However, it’s not responsible to let people just panic, which is why I’ve given the caveats I have throughout this video. I believe, as with language and self-driving, that the edge cases will still take a number of years to solve; that’s at least my best guess.
But it seems to me that when reasoning is solved, and therefore even long videos actually make sense, a lot more jobs than just videographers’ might be under threat. As the creator of GitHub Copilot put it: “If OpenAI is going to continue to eat AI startups sector by sector, they should go public. Building the new economy where only 500 people benefit is a dodgy future.” And the founder of Stability AI tweeted out this image. It does seem to be the best of times and the worst of times to be an AI startup: you never know when OpenAI or Google are going to drop a model that massively changes and affects your business. It’s not just Sora whacking Pika Labs, Runway ML, and maybe Midjourney. If you make the chips that OpenAI uses, they want to make them instead (I’m going to be doing a separate video about all of that). When you use the ChatGPT app on a phone, they want to make the phone you’re using. You come up with Character AI, and OpenAI comes out with the GPT Store; I bet OpenAI are even cooking up an open-world game with GPT-powered NPCs. Don’t forget that they acquired Global Illumination, the makers of this Minecraft clone. If you make agents, we learned last week that OpenAI want to create an agent that operates your entire device (again, I’ve got more on that coming soon). Or what about if you’re making a search engine powered by a GPT model? That’s the case, of course, with Perplexity, and I will be interviewing the CEO and founder of Perplexity for AI Insiders next week; insiders can submit questions, and of course do feel free to join on Patreon. But fitting with the trend, we learned less than 48 hours ago that OpenAI is developing a web search product. I’m not necessarily critiquing any of this, but you’re starting to see the theme: OpenAI will have no qualms about eating your lunch.

And of course, there’s one more implication that’s a bit more long-term. Two lead authors of Sora both retweeted this video from Berkeley.
You’re seeing a humanoid transformer robot trained with large-scale reinforcement learning in simulation and deployed to the real world zero-shot. In other words, it learned to move like this by watching and acting in simulations. If you want to learn more about learning from simulations, do check out my Eureka video and my interview with Jim Fan. TL;DR: better simulations mean better robotics.

Two final demos to end this video with. First, a monkey playing chess in a park. This demo kind of sums up Sora: it looks gorgeous, and I was astounded like everyone else. However, if you look a bit closer, the piece positions and board don’t make any sense. Sora doesn’t understand the world, but it is drawing upon billions and billions of patterns. And then there’s the obligatory comparison: the Will Smith spaghetti video (and I wonder what source they originally got some of those images from). You could say this was around state-of-the-art just 11 months ago, and now here’s Sora. Not perfect (look at the paws), but honestly remarkable. Indeed, I would call Sora a milestone human achievement. But now I want to thank you for watching this video all the way to the end, and no, despite what many people think, it isn’t generated by an AI. Have a wonderful day.
The video “Sora – Full Analysis (with new details)” was uploaded on 02/16/2024 to the YouTube channel AI Explained.