OpenAI’s Sora is revolutionizing the world of AI-generated videos with its incredible capabilities. From creating stunning videos from text prompts to simulating fluid dynamics and physics, Sora is pushing the boundaries of what AI can achieve. This closer look at Sora reveals its ability to create infinitely looping videos, generate high-quality images, and simulate complex scenes with long-term coherence.
Sora’s innovative approach to video generation involves learning the underlying rules of the world it is simulating, resulting in realistic and captivating visuals. By utilizing a diffusion-based transformer model, Sora can refine multiple image sequences simultaneously to maintain temporal coherence and eliminate flickering effects.
As we continue to increase computational power, Sora will only become more advanced and sophisticated. The power of human ingenuity and research is on full display with Sora, showcasing the endless possibilities of AI technology. Subscribe to keep up with the latest advancements in AI and witness the future of video generation unfold before your eyes.
Watch the video by Two Minute Papers
Video Transcript
This is a closer look at Sora, OpenAI’s amazing text to video AI. We already know that it can create amazing videos from your text prompts with unprecedented quality. It is a huge leap in capabilities.
But it can do so much more. We know that it can take a still image and extend it forward into a video. But get this, it can also do the same backward. And this one comes with a twist: we prescribe how the video should end, and it writes several possible ways of getting there. And they all feel completely natural to me. That is awesome. I love it. Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér. And they really know how to make me happy, because I am a light transport simulation researcher by trade. That is ray tracing, if you will, so looking at the glossy reflections on the road and how they change over time makes me really, really happy. Just look at that! Not just the quality, but the long-term coherence as well. I am used to looking at papers that can create videos that are at most 5 seconds long, and this particular one is a 20-second snippet. That is amazing, but I read that it can go up to 60 seconds as well. And look at those beautiful refractions; even little specks of dust and grease marks on the glass are synthesized. Wow.
Now, clearly there are lots of mistakes here, we can all see that, but this is also a remarkable sign of understanding of the world. And it can generate a full scene in one go. We don’t need to splice many results together; it can really give us a full scene without any cuts. Absolutely incredible.
But it gets better. This result puts even the 60-second one to shame. Let’s look at this together. And if we do it for a while, wait, so when does this end exactly? Well, did you notice? Yes, you are seeing correctly. It never ends. It can also create infinitely looping videos.
And it can also perform limited physics simulations. We are going to get back to that in a moment because that is not only beautiful, I mean, just look at that, but this has a profoundly important story behind it.
Now, it can also create still images. Yes, I hear you asking, Károly, why is that interesting? We just saw videos; those are way more difficult. Well, these are 2048×2048 images. Those are huge, with tons of detail. And remember, DALL-E 3 appeared approximately 6 months ago. That technique is a specialist in creating high-quality images, and this one even beats DALL-E 3, the king, at its own game. And it does all this as an almost completely unintentional side effect. Wow.
And large language models think in tokens, batches of letters. Sora is for video, and it does not think in tokens, at least not in the way the language models do. It extends the concept of tokens to visual content. They call them patches. Then, when the neural network sees a raw video, it compresses it down into a latent space where it is easier to generate similar new ones. Now wait, what is that? A latent space looks something like this. This is my paper where you can walk around in this 2D latent space, and each point in this space represents a material for a virtual world. The key here is that nearby points generate similar materials, making it excellent for exploring and generating new ones. The link to the paper is available in the video description. Now imagine the same concept, but for video.
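To make the idea of patches a bit more concrete, here is a minimal sketch in Python. It is not OpenAI’s code: the tensor shapes, the patch sizes, and the assumption that some encoder has already compressed the raw video into a latent tensor are all made up for illustration. It only shows how a latent video can be cut into space-time blocks that play the role tokens play for text.

```python
import numpy as np

# Minimal sketch of the "spacetime patch" idea, not OpenAI's actual code.
# Assume a (hypothetical) encoder has already compressed a raw video into a
# latent tensor of shape (frames, height, width, channels).
latent_video = np.random.randn(16, 32, 32, 4)  # stand-in for a real latent

def to_spacetime_patches(latent, pt=4, ph=8, pw=8):
    """Cut a latent video into non-overlapping space-time patches.

    Each patch covers `pt` frames and a `ph` x `pw` spatial region and is
    flattened into one vector -- the video analogue of a text token.
    """
    t, h, w, c = latent.shape
    patches = (
        latent.reshape(t // pt, pt, h // ph, ph, w // pw, pw, c)
              .transpose(0, 2, 4, 1, 3, 5, 6)   # group the patch axes together
              .reshape(-1, pt * ph * pw * c)    # one flat vector per patch
    )
    return patches

tokens = to_spacetime_patches(latent_video)
print(tokens.shape)  # (64, 1024): 64 patch "tokens", each a 1024-dim vector
```

Each row of the result is one patch “token”, and a transformer would operate on this sequence much like a language model operates on a sequence of text tokens.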
Now, our Scholarly question is as follows: what did this neural network learn, exactly? To be able to understand that, we have to look back to GPT-2. GPT-2 was an early technique that read lots and lots of product reviews on Amazon, and then it was given a new review, split in half. And then it was told, little AI, continue it. Then the AI recognizes that this is not gibberish text; it has structure. It is in English, so to be able to continue this, it has to learn English. But it gets better: it also needed to know whether this is a positive or a negative review. Did the user like the product? If it seems like it from the first half, you need to know that to continue it. So it learned sentiment detection as well. And the incredible thing is that no one told it what it has to do. It learned all this by itself.
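If you want to feel what that continuation game is like, the openly released GPT-2 weights are enough. The snippet below uses the Hugging Face transformers library with a generic half-finished review; it is just an illustration of “little AI, continue it”, not the original Amazon-review experiment, and the prompt and sampling settings are my own.

```python
from transformers import pipeline, set_seed

# Load the small, openly released GPT-2 checkpoint via the text-generation pipeline.
generator = pipeline("text-generation", model="gpt2")
set_seed(42)  # make the sampled continuations reproducible

# A half-finished review; the model's only job is to continue the text.
half_review = "I ordered these headphones last week and honestly, the sound is"
outputs = generator(half_review, max_new_tokens=30, num_return_sequences=2)

for out in outputs:
    print(out["generated_text"])
    print("---")
```

To produce a plausible continuation, the model has to have picked up grammar, topic, and the reviewer’s sentiment purely from predicting the next token, which is exactly the point of the story above.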
So now, if we give it an image, it has to extend it forward as a video. That is simulation. To be able to simulate the world, it has to understand the world. And for that, it has to learn the underlying rules. The world contains plenty of liquids and smoke, so it had to learn fluid dynamics as well. And once again, it learns physics as a completely unintentional side effect. It learns it because it has to.
The architecture they use for this is called a diffusion-based transformer model. When you do text to image, the model starts out from a bunch of noise and, over time, reorganizes this noise to make it resemble your text prompt. However, here we are talking about video, not just one image. So, easy, just do it many times after each other, right? No, not at all. You see, if you do that, you get something like this. You get this flickering effect because the neural network does not remember well enough what images it made previously, and the differences show up as flickering. You can’t do that. What you need to do is create not just one bunch of noise, but many, many bunches, and refine them at the same time, while taking into consideration not just their neighbors, but every single image. This is how you get long-term temporal coherence. And OpenAI Sora just nailed it. Bravo.
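Here is a toy sketch of that last idea, refining all the frames at once rather than one after another. It is nowhere near a real diffusion transformer: the “denoiser” below is an untrained stand-in with no learned weights or noise schedule, and it exists only to show that when each refinement step sees every frame, the frames stay consistent with one another instead of flickering.

```python
import numpy as np

# Toy sketch of joint refinement across frames, not OpenAI's implementation.
# A real diffusion transformer would use a trained network and a proper noise
# schedule; here `denoise_step` is a stand-in that nudges the frames toward
# their shared mean so the data flow is visible.

rng = np.random.default_rng(0)
frames, h, w = 8, 16, 16          # a tiny "video" of 8 latent frames

def denoise_step(noisy_frames, strength=0.2):
    """One refinement step that sees EVERY frame at once.

    Because the update for each frame depends on all the others (here, via
    their mean), the frames stay consistent with each other -- the property
    the video calls long-term temporal coherence.
    """
    shared = noisy_frames.mean(axis=0, keepdims=True)   # information from all frames
    return noisy_frames + strength * (shared - noisy_frames)

# Start from pure noise for ALL frames, then refine them together.
video = rng.standard_normal((frames, h, w))
for step in range(50):
    video = denoise_step(video)

# Denoising each frame in its own independent loop would leave every frame
# with its own leftover noise, and those differences show up as flicker.
frame_to_frame_diff = np.abs(np.diff(video, axis=0)).mean()
print(f"mean frame-to-frame difference after joint refinement: {frame_to_frame_diff:.4f}")
```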
And don’t forget, as we add more and more compute, it gets better and better. It only comes alive in the presence of huge computational power. So I hope this gives a closer look at what it can do, and how. Not so long ago, this was the best video we could do, and now we have this. Can you believe this? This is the power of human ingenuity, and the power of research. And just imagine what we will be capable of just two more papers down the line. My goodness. Subscribe and hit the bell icon if you don’t want to miss out on it when it comes. You can count on me being here and flipping out.
The video “OpenAI Sora: A Closer Look!” was uploaded on 02/24/2024 to the YouTube channel Two Minute Papers.