Google's New Text To Video BEATS EVERYTHING (LUMIERE)
Google's new text-to-video technology, Lumiere, has set a new standard in video generation. By utilizing a unique Space-Time U-Net architecture, Lumiere is able to generate full-frame-rate videos with coherent and realistic motion. Unlike traditional models, Lumiere creates the entire temporal duration of the video in one go, leading to impressive results.
Lumiere also leverages pre-trained text-to-image diffusion models, adapting them for video generation to handle the complexities of video data. One of the biggest challenges in video generation is maintaining global temporal consistency, which Lumiere successfully addresses, ensuring the generated videos exhibit realistic motion throughout.
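The temporal downsampling and upsampling mentioned above is the core trick: the model processes the video at a reduced frame rate, where long-range motion is cheaper to reason about, and then upsamples back to full frame rate. Lumiere's code is not public, so the following numpy sketch only illustrates the idea, not the paper's implementation (the function names and the simple average/repeat operators are assumptions):

```python
import numpy as np

def temporal_downsample(video, factor=2):
    # Average each group of `factor` consecutive frames,
    # halving the temporal resolution (a coarse motion pass).
    t = video.shape[0] - video.shape[0] % factor
    grouped = video[:t].reshape(t // factor, factor, *video.shape[1:])
    return grouped.mean(axis=1)

def temporal_upsample(video, factor=2):
    # Nearest-neighbor in time: repeat each frame to restore
    # the original frame rate.
    return np.repeat(video, factor, axis=0)

# Toy "video": 16 frames of 8x8 RGB, shape (T, H, W, C).
clip = np.random.rand(16, 8, 8, 3)
coarse = temporal_downsample(clip)    # shape (8, 8, 8, 3)
restored = temporal_upsample(coarse)  # shape (16, 8, 8, 3)
```

In the real model the down/upsampling is learned and interleaved with attention over space and time, but the shape bookkeeping is the same.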
The demos provided by Lumiere showcase the incredible capabilities of this new technology, from a rotating Lamborghini to beer being poured into a glass with incredible detail and realism. With the potential to revolutionize the field of text-to-video generation, Lumiere's advancements have set a new gold standard for video models, surpassing previous benchmarks in terms of quality and performance.
Watch the video by TheAIGRID
Video Transcript
So Google Research recently released a stunning paper in which they show off a very, very state-of-the-art text-to-video generator, and by far this is likely going to be the very best text-to-video generator you've seen. I want you guys to take a look at the video demo that they've shown us, because it's fascinating, and after that I'll dive into why this is state-of-the-art and just how good it is. Now, one of the most shocking things from Lumiere (and I'm not sure if that's exactly how you pronounce it) was of course the consistency in the videos and how certain things are rendered. There are a bunch more examples that they didn't actually showcase in this small video, so I will be showing you those from the actual web page, but it is far better than anything we've seen before, and some studies that they did do confirm this. For example, in their user study, what they found was that their method was preferred by users in both text-to-video and image-to-video generation. In one of the benchmarks that they did (I'm not sure what the quality score was), you can see that theirs, which is of course Lumiere, actually performed a lot better than ImagenVideo, a
lot better than Pika Labs, a lot better than ZeroScope, and a lot better than Gen-2, which is Runway's model. So Gen-2, if you don't know, is Runway's video model, and Runway actually did recently launch a bunch of stuff. But if we look at text alignment as well, we can see that across all the different video models this is the winner. Then of course on image-to-video video quality, you can see that against Pika it wins a lot of the time, and against Stable Video Diffusion (I'm pretty sure that's what that is). Then we can see for image-to-video that it wins against Pika Labs and wins against Gen-2. I'm not sure if this is Stable Video Diffusion too, but if you haven't seen that, it's actually something that is really good as well. So overall, we do know that right now this is actually the gold standard in text-to-video, which is a very good benchmark, because many people have been discussing how 2024 is likely to be the year for text-to-video. Now, what I want to talk about before I dive into some of the crazier examples is of course the new architecture, because what exactly is making this so good? As you know, it looks fascinating in terms of everything it can do, and when I show you some more examples, you're going to see exactly why this is even better than you thought.
Essentially, the first thing that they do is utilize the Space-Time U-Net architecture. Unlike traditional video generation models that create keyframes and then fill in the gaps, Lumiere generates the entire temporal duration of the video in one go, and this is achieved through a unique Space-Time U-Net architecture which efficiently handles both the spatial and temporal aspects of the video data. They also use temporal downsampling and upsampling: Lumiere incorporates both spatial and temporal downsampling and upsampling in its architecture. This approach allows the model to process and generate full-frame-rate videos much more effectively, leading to more coherent and realistic motion in the generated content. Of course, what they also did was leverage pre-trained text-to-image diffusion models; the research is built upon existing text-to-image diffusion models, adapting them for video generation, and this allows the model to benefit from the strong generative capabilities of these pre-trained models while extending them to handle the complexities of video data. Now, one of the significant challenges in video generation is of course maintaining global temporal consistency, and Lumiere's architecture and training approach are specifically designed to address this, ensuring that the generated videos exhibit coherent and realistic motion throughout their
duration. Now, this is Lumiere's GitHub page, and this is by far one of the very best things I've ever seen, because I want to show you guys some of these examples just to show you how advanced this really is. One of the clips I want you to pay attention to (and I'm going to zoom in here) is of course this Lamborghini, because this actually shows us how crazy this technology is. We can see that the Lamborghini is driving along, and then as it rotates we can actually see the Lamborghini's wheel not only moving, but also the other angles of that Lamborghini too. So I would say that, if we compare it to some of the other video models, one of the things they struggle with is of course the motion and the rotation, but seemingly they've managed to solve this with this new architecture, and we can see that things like the Lamborghini and rotations, which are a real struggle for video, aren't going to be a problem. Another one of my favorite examples was of course beer being poured into a glass. If we take a look at this, it is absolutely incredible, because we can see that the glass is just being filled up and it looks so good and realistic. I mean, we have the foam, we have the beer actually moving up, we also have the bubbles, and we have things just looking really realistic. If someone were to say "this is just a low-FPS video of me pouring liquid into a glass," I would honestly believe them, and even if you don't think it's realistic, I think we can all agree that this is very, very good for text-to-video. And if you just hover over it, you can see the input. Some of these as well are just really good showcases of how good it is at rotations, because I've seen some of the other video
models, and this is something that we've only recently (literally yesterday I saw a preview) managed to solve a little bit. So if we take a look at the bottom left, we can see that the sushi is rotating, and it doesn't look as AI-generated as many other videos. The one issue that AI-generated videos do suffer from is of course low resolution and low frames per second, but I think that is going to be solved very soon. And with what we have here as well, if we look at "The confident teddy bear surfer rides waves in the tropics," if we look at how the water ripples every single time the surfboard makes impact with the water, I think we can say that it does look very realistic. Then of course we have the chocolate muffin video clip; this one right here also looks super temporally consistent. I mean, just the way that it rotates looks like nothing we've ever seen before. And of course this wolf one, a silhouette of a wolf against a twilight sky, also looks very accurate and very good. So these text-to-video demos, I would say, are just absolutely outstanding. This one right here, the fireworks that we're looking at, is definitely something that I've seen done by other models before, but it does go to show how good it is, and this one right here, "camera moving through dry grass at an autumn morning," also goes to show just how good it is now.
With regards to walking and legs and stuff like that, there is still a bit of a small issue there, and there are some other things I want to discuss about this entire project, because I'm pretty sure this entire project is a collaboration of some other AI projects that Google has done before, and I can't wait to see if Google manages to finally release this. Some of the other ones that are my favorites: of course, the chocolate syrup pouring on vanilla ice cream looks really good, and then this clip of the skywalking doesn't look too bad. And I think that when we take a look at certain videos that are very subtle in nature (for example, the blooming cherry tree in the garden looks pretty subtle, and the aurora borealis one looks pretty subtle too), a lot of these videos, I think, are just absolutely the best. And of course we do need to take a look at stylized generation, because this is something that is really important for generating certain styles of videos, but Google's Lumiere does it really well. Another thing that I did also see (because I stay up to date with pretty much all of Google's AI research) is that this stylized generation right here is definitely taking the research from another Google paper that was called StyleDrop, and I'll show you guys that in a moment. But I think it just goes to show that when Google combines all of their stuff, they're probably building some very comprehensive video system for the future, and whenever they do release it, it's going to be absolutely incredible. Because if we look at this, it's just one reference image, and then we can see all of these kinds of videos that we get; this is going to be very useful for people who are trying to create certain styles for certain things. And of course we can see that this is some kind of 3D-animation style, and the videos from that actually look very good too. So this is what I'm talking about when I say
StyleDrop. So I'm going to show you guys that page now. Google previously did release this research paper (this was sometime last year), and you can see that this was essentially based on similar stuff. Now, I'm not sure how much they've changed the architecture, but you can see that it's a text-to-image maker, and essentially what it does when it generates the images is use the reference image as a style, and you can see just how good that stuff looks. I mean, if we take a look at this Vincent van Gogh style, and then of course the other images, they just look absolutely incredible. And of course we have the same exact one here in the StyleDrop paper as videos, and I think this is really important, because it looks like Google has managed to combine everything from their previous research, like MAGVIT and VideoPoet, all into one unique thing, and I think this is going to be super effective. People are wondering, and one of the questions has been: why no code, why no model, no open-source weights? Are you going to release this, though? I think the reason Google has chosen not to release this model, the weights, or the code is because I'm pretty sure they are going to be building on this to release it into perhaps Gemini or a later version of another Google system. Now, I could be completely wrong; Google has been known in the past to just build things and just sit on them. But I think with how competitive things are, and the fact that this is state-of-the-art, and the fact that there aren't any other models that seem to be competing in this area, this is an area that Google could easily dominate. And since Google did lose before to ChatGPT in terms of the AI race, I'm sure Google will try to stay ahead now that they've seemingly got the lead. So I don't know, they may do that, they may not; Google has previously just sat
on things before. But I do think that maybe they might just polish the model and then release it; I think it would be really cool if they did, and I really do hope they do, because it would make other things even more competitive. One of the key things here as well was the video stylization, and I don't think you understand just how good this is. Like the "made of flowers" one right here is just absolutely incredible; I mean, look at that, that looks like CGI, honestly. If I saw that, I would be like, wow, that's some really cool CGI. Other styles aren't as aesthetic or as good, but the Lego one, for example: if we take a look at this Lego car, that one doesn't look AI-generated; it actually looks like a Lego car. And then of course the one with flowers. I'm not sure why, but I think it's because of the way AI generates these images; with flowers, they just look very fine and detailed and intricate, so that's why it doesn't look that bad, and that one does look really cool. So yeah, I think what we've seen here in terms of the video stylization shows us just how good of a model this is. Now, with the cinemagraphs, I do
think that this is also another fascinating piece of the paper, because this is where the model is able to animate the content of an image within a specific user-provided region, and I do think this is really effective. What was fascinating was that a couple of days ago Runway actually did release their ability to do this; if you haven't seen it before, I'm going to show it to you now. Essentially, Runway has a brush where you can select specific parts of an image, adjust the movement of these brushes, and then animate a specific character. Now, I know this isn't a Runway video, but it just goes to show that this is a new feature that is being rolled out to video models across different companies. So I think that in the future, since video models sometimes aren't the best at animating certain things, we're going to have a lot more customization, and that's what we're seeing here with Lumiere, because of course the fire looks really good, the butterfly here also looks really cool, the water here looks like it's moving realistically, and this smoke train also looks very effective. There weren't that many demos of this,
but it was enough to show us that it was really good. Now, video inpainting was something that we did look at; I think it was either VideoPoet or MAGVIT that showed us this, but at the time it honestly wasn't as good. I mean, it was decent, but this is a completely different level. Imagine having just half of a video and then being able to just say, fill in the rest. Basically, if you don't know what this is, this is basically just generative fill for video, and I think having this is just pretty crazy, because you're able to just say, okay, fill it in with a text prompt. I mean, just look at the way the chocolate falls on this one; it's definitely really effective at doing that. So I think this one is definitely going to have some wide-scale uses, and of course this one is probably going to have the most, because you can change different things. You can literally just say "wearing a red scarf,"
"wearing a purple tie," "sitting on a stool," "wearing boots," or "wearing a bathrobe." I think a lot of this stuff is most certainly fascinating. Another thing that we also didn't take a look at yet was of course the image-to-video, and I think this is really good as well, because some of the models don't always generate the best images, and if you want to generate certain images yourself, you're going to want to be able to animate those specifically. So I think that the image-to-video section of the model is rather effective, and I always find it funny that for some reason all of these video models decide to use a teddy bear running in New York as some kind of benchmark, but this one definitely does look better than previous iterations. I do think that for some reason the text-to-video model is better than the image-to-video model, just simply based on how things are done. But for example, things like the ocean waves and the way that the giraffe is eating grass: I know that they definitely did train this on a huge amount of data, because if you've ever seen giraffes eating grass, they do eat it exactly like that; it's not a weird AI-generated mouth. Also, if you look at the waves, waves look exactly like that, and fire moves exactly like that too. So there is a really big level of understanding, a huge level of understanding, of what's being done here. And even if we look at a happy elephant, like this one right here, "a happy elephant wearing a birthday hat under the sea," when you hover over it you can see the original image. So this is what the original image looks like, and this is what the generated video is, and we can see that it's kicking up the water as it's moving underwater, which is, I don't know, kind of weird,
but it also does look pretty cool, if you ask me. And then there's that notable image of soldiers raising the United States flag on a windy day, and we can see that it is moving. So overall (and of course we got this very famous painting, and even more waves), I think in certain scenarios it works pretty well: for example with liquids, with water, and I think fireworks and, for some reason, rotating objects now work really well too. But I think the main question that is going to come out of this is: is Google going to release this? Are they going to build it into a bigger project, or are they waiting for something to be more polished? I mean, currently it is state-of-the-art, so I guess we're going to have to wait to hear from Google themselves. But I do note that one thing that is a bit different with larger companies is that there is a difference between getting AI research done and just having it out there versus actually having a product that people are going to use. Because it's all well and good being able to do something that is fascinating, astounding, and really good, but translating that into a product that people can then use, and that is actually effective, is another issue. So I don't know if they're going to do that soon, but I will be looking out for it, because I do want to be able to use this and test it to see just how well it does against certain prompts, against things like Runway, Pika Labs, and of course Stable Video Diffusion. So what do you think about this? Let me know what your favorite feature is going to be. My favorite feature is of course just the text-to-video, because I'm just going to use that once it does come out (if it does ever come out). But other than that, I think this is an exciting project, I think there are a lot more things to be done in this space, and if things continue to move at this pace, I really do wonder where we will be at the end of the year.
Video “Googles New Text To Video BEATS EVERYTHING (LUMIERE)” was uploaded on 01/25/2024 to Youtube Channel TheAIGRID