Google's New VIDEO AI 'VideoPoet' Surprises Everyone!
Google’s new Video AI, VideoPoet, has surprised everyone with its capabilities. In a deep dive into Google’s research paper on this text-to-video model, we look at VideoPoet’s potentially game-changing features. The model, capable of zero-shot video generation, has already shown impressive results in creating detailed videos from text prompts.
While the quality of the generated videos isn’t perfect yet, VideoPoet’s accuracy and capabilities are truly impressive. The model can output high-motion, variable-length videos and even generate audio from video. With the ability to extend short input clips into videos of arbitrary length, VideoPoet preserves object identity far better than previous models.
Google’s VideoPoet also offers interactive video editing, stylization, controllable video effects, and camera motion customization, making it a versatile tool for various creative applications. Through comparisons with other software, VideoPoet showcases its unique architecture and potential for future developments in AI video generation.
As we await Google’s decision on making VideoPoet available to the public, the possibilities for this innovative AI technology are endless. Stay tuned for more updates on VideoPoet and other AI advancements on The AI Grid.
Watch the video by TheAIGRID
Video Transcript
Okay, so Google just released an amazing research paper discussing their new text-to-video model, called VideoPoet. It's really incredible because of some of the features they show us, and in this video we're going to do a deep dive on why this is truly game-changing if Google actually manages to turn it into a full-fledged product, because often, when Google makes these crazy breakthroughs, for some reason, and I honestly have no idea why, they never make them available to the public or develop them into a fully fledged product. So here you can see VideoPoet, a large language model for zero-shot video generation. The "zero-shot" in zero-shot video generation just means the large language model can complete the task without having received any training examples for that task.
As you can see, these are some of their first examples, and later on I'll show you why, although the quality might not be as good as you'd expect, it's really, really good. We've got "a dog listening to music with headphones, highly detailed, 8K," then "a large paint blob exploding," which actually does look pretty good, and "a robot cat eating spaghetti." Now, I'm going to go through each of these examples and show you why they're actually really good in terms of accuracy, and I'm also going to show you how this compares to some of the other popular text-to-video models out there, because honestly, they're really, really great.
Essentially, one of the first things we have is that VideoPoet can output high-motion, variable-length videos given a text prompt, and one of the crazy things is that it also does video-to-audio, which is, I guess you could say, kind of unusual, because it's not something we've really seen before. We only saw video-to-audio in, I think, one model, and that was CoDi, an any-to-any multimodal model where essentially any modality could be converted to another: audio to text, text to video, video to audio, and vice versa. Anyway, this is the paper I'm talking about, just for reference, so Google integrating some ideas from it would be really interesting. But anyhow, back to Google. Let's actually test how good this video-to-audio model is, because although this is something new, I think the abilities are going to be surprising. This one is a dog eating popcorn in the cinema, so let's put that audio on. I mean, the audio definitely sounds a little bit raspy, like it needs a bit more, I guess you could say, clarity.
But I'm going to go through these one by one and then discuss them after. This one is a teddy bear with a cap and leather jacket playing the drums. Then we have a teddy bear in a leather jacket, baseball cap, and sunglasses playing guitar in front of a waterfall; then a pink cat playing piano in the forest; then the Orient Express driving through a fantasy landscape on an oil canvas; then a dragon breathing fire, cinematic. So the audio isn't actually that bad, and I think if this gets really good, it could give us another way to access a different kind of modality, where these two modalities are combined and help us produce much better content.
And essentially, they actually made a short film with this generative model. It says: "To showcase VideoPoet's capabilities, we have produced a short movie composed of many short clips generated by the model. For the script, we asked Bard to write a series of prompts to detail a short story about a traveling raccoon. We then generated video clips for each prompt and stitched together all resulting clips to produce the final YouTube Short below." So take a look at this clip, and like I said, although the initial results might not seem that crazy, watch the video until the end to see why this is kind of incredible. The narration goes: In the Amazon raccoon forest lives Rookie the raccoon and his family. Rookie loves to play soccer with his friends, collect acorns from the trees, swim in the river, and catch fish, but what he loves most is exploring new places and sniffing new smells. Except he's never made it out of the vast forest. Every night he dreams of seeing exotic places like Times Square and the Golden Gate Bridge. One day he sees a portal of yellow light, and his curiosity makes him step inside. Suddenly he's falling through a long tunnel, which takes him to Paris. "I've never seen anything like it," he says, and then to the Great Barrier Reef, and even the South Pole, where he makes some new friends, before visiting the ISS and ultimately the planet Androida, where the locals didn't speak raccoon, which makes him miss the raccoon forest and especially his family. Rookie quickly finds his way back home. He's so happy to be back with his friends, but also grateful for his adventures. Guess he's not such a rookie anymore. So like we said, that was an impressive demo.
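As an aside, the workflow they describe (an LLM writes the prompt list, the video model renders one clip per prompt, and the clips get stitched together) is easy to picture in code. Here's a minimal sketch of the stitching stage, assuming each prompt has already been rendered to a local mp4 file; the prompts and file names are illustrative, and only the moviepy calls are a real library API.

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

# Prompts like these would come from an LLM (Bard, in Google's demo).
story_prompts = [
    "Rookie the raccoon plays soccer with his friends in the forest",
    "Rookie steps into a portal of yellow light",
    "Rookie falls through a long tunnel and lands in Paris",
]

# Assumption: each prompt was already rendered to a local mp4 by the video model.
clip_paths = [f"clip_{i}.mp4" for i in range(len(story_prompts))]

# Stitch the per-prompt clips into one continuous short film.
movie = concatenate_videoclips([VideoFileClip(p) for p in clip_paths])
movie.write_videofile("rookie_short.mp4")
```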
But let's get into some of the technical stuff so you can understand what makes this really impressive and why you should be paying attention. It says VideoPoet is a simple modeling method that can convert any autoregressive language model, or large language model, into a high-quality video generator. It contains a few simple components: a pre-trained MAGVIT-v2 video tokenizer and a SoundStream audio tokenizer transform images, video, and audio clips of variable lengths into a sequence of discrete codes in a unified vocabulary.
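To make that concrete, here's a toy sketch of the pipeline as the paper describes it: modality-specific tokenizers map text, video, and audio into one shared discrete vocabulary, a single autoregressive transformer continues the token sequence, and a decoder maps the new tokens back to pixels. Every function below is a made-up placeholder for illustration, not VideoPoet's actual API.

```python
from typing import List

VOCAB_SIZE = 8192  # one shared discrete vocabulary for all modalities

def text_encode(prompt: str) -> List[int]:
    """Placeholder text tokenizer."""
    return [ord(c) % VOCAB_SIZE for c in prompt]

def magvit_v2_encode(frames: List[bytes]) -> List[int]:
    """Placeholder for the MAGVIT-v2 video tokenizer: pixels -> discrete codes."""
    return [hash(f) % VOCAB_SIZE for f in frames]

def soundstream_encode(samples: List[float]) -> List[int]:
    """Placeholder for the SoundStream audio tokenizer: waveform -> codes."""
    return [int(abs(s) * 1000) % VOCAB_SIZE for s in samples]

def llm_continue(tokens: List[int], n_new: int) -> List[int]:
    """Placeholder autoregressive LM: emit n_new next tokens given the
    sequence so far (a meaningless toy rule stands in for the transformer)."""
    seq = list(tokens)
    for _ in range(n_new):
        seq.append((seq[-1] * 31 + 7) % VOCAB_SIZE)
    return seq[len(tokens):]

# Text tokens (optionally plus video/audio tokens) form one sequence; the LM
# continues it with video codes a MAGVIT-v2 decoder would turn back into frames.
prompt_tokens = text_encode("a robot cat eating spaghetti")
video_codes = llm_continue(prompt_tokens, n_new=16)
print(video_codes[:8])
```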
Now, we actually did a video on MAGVIT before, and essentially MAGVIT was really cool in terms of its ability to do some new things. That video is from a couple of months ago, but essentially MAGVIT could outpaint a video to five times its width, giving a panoramic view, so a single video taken in portrait mode could be converted to a very long landscape one. There was also a smart-remover feature: if a box in the video was masked out, the MAGVIT system could replace that box, and you can see here it fills in the masked region with exactly what was there. Then of course we had some more examples. There was also AutoFlip, which uncrops any size of video; it also had image-to-animation; and there was something really cool called frame interpolation, which takes two images and animates what happens between them. Now, here they give us an overview of the VideoPoet model. It says the model is capable of multitasking on a variety of video-centric inputs and outputs, and the LLM can optionally take text as input to guide generation for text-to-video, image-to-video, stylization, and outpainting tasks. You can see right here it takes text and then it can do text-to-video, so it's got this whole system that works cohesively to produce these good videos. Of course, the only thing that's not as good is the quality, but anyway, let's move on to visual narratives, which is quite the thing.
You can see right here they have this initial video of a walking figure made out of water, but then they can extend that video and use a prompt to change it. One thing I want you to pay attention to here is the accuracy of VideoPoet, because a lot of people might think this is low quality or the FPS isn't great, but you really need to see how accurate these prompts are with regard to what was asked. So this is "a walking figure made out of water; lightning flashes in the background," and we can see that lightning flash, and then "purple smoke emits from the figure of water," and that's exactly what happens. When I compared this to some of the other video models, they actually aren't as accurate. Then of course we have "two raccoons on motorbikes on a mountain road surrounded by pine trees, 8K," which is pretty decent, and then the extended video of the two raccoons where "a meteor shower falls behind them and the meteors impact the Earth and explode," and the explosion actually looks pretty decent.
Now, I think the biggest thing about Google's VideoPoet, which most people might miss, is that it can produce very long videos. It says that by default VideoPoet outputs 2-second videos, but the model is also capable of long video generation by predicting 1 second of video output given an input of a 1-second video clip. This process can be repeated indefinitely to produce a video of any duration, and despite the short input context, the model shows strong object identity preservation not seen in prior works, as demonstrated in these longer clips. The clips here are around 10 seconds long or so when I open them in a new tab, and essentially what they're talking about is that the object remains consistent throughout.
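The "predict the next second from the last second" loop they describe is simple to sketch. Here is a minimal toy version, with a hypothetical generate_next_second standing in for the model call and integers standing in for frames:

```python
def generate_next_second(last_second):
    """Placeholder for the model: condition on ~1 s of video and predict
    the next second (here, frames are just integers counted onward)."""
    return [f + len(last_second) for f in last_second]

def extend_video(seed_clip, total_seconds, fps=8):
    assert len(seed_clip) == fps, "seed is assumed to be exactly one second"
    video = list(seed_clip)
    while len(video) < total_seconds * fps:
        # Short context: the model only ever sees the most recent second.
        video += generate_next_second(video[-fps:])
    return video

seed = list(range(8))                  # a toy 1-second clip at 8 fps
clip = extend_video(seed, total_seconds=10)
print(len(clip), "frames")             # 80 frames ~= 10 s at 8 fps
```

The interesting claim isn't the loop itself but that identity survives it: with only one second of context, earlier models tended to drift, while VideoPoet's longer clips keep the subject consistent.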
Then of course we have some of these examples, and depending on what you think, I'd say this one and the walking teddy bears are the best: "drone footage of a very sharp elven city of stone in the jungle with a brilliant blue river, waterfall, and large steep vertical cliff faces"; "teddy bears holding hands and walking down rainy 5th Avenue"; and "FPV drone footage entering a cyberpunk city at night with many neon lights and reflective surfaces." And then this one of an ancient city in autumn actually looks pretty cool, although it is quite trippy.
Now, something I also found cool about VideoPoet was the interactive video editing. It says interactive video editing is also possible: extending input videos a short duration and selecting from a list of examples. By selecting the best video from a list of candidates, we can finely control the types of desired motion from a larger generated video. Here, they generate three samples without text conditioning and a final one with text conditioning. So you can see that they first prompted the model, and this is what they got out; then, based on those three samples, they added this text prompt, and you can see that adding the text prompt makes the result much more accurate. This one here does look very accurate in terms of what we see.
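That workflow is essentially sample-then-select: draw a few candidate extensions, pick the best one (by eye, in the demo), and regenerate from it with a text condition. A hedged sketch, where sample_extension and score are made-up stand-ins for the model and for the human judgment:

```python
import random

def sample_extension(clip, text=None, seed=0):
    """Placeholder model call: extend a clip by four frames, optionally
    conditioned on a text prompt."""
    rng = random.Random(seed)
    tag = f"|{text}" if text else ""
    return clip + [f"frame{rng.randint(0, 99)}{tag}" for _ in range(4)]

def score(candidate):
    """Stand-in for choosing the best candidate (a human did this in the demo)."""
    return random.random()

base_clip = ["f0", "f1"]
candidates = [sample_extension(base_clip, seed=s) for s in range(3)]  # no text
best = max(candidates, key=score)
refined = sample_extension(best, text="purple smoke emits from the figure", seed=7)
print(refined)
```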
They also talk about controllable video editing, where the VideoPoet model can edit a subject to follow different motions, such as dance styles, and here we have a raccoon doing very different dance styles. I don't think these dance styles are that accurate, because the video model itself isn't that accurate yet, but I do think the fact that VideoPoet has longer video generation, where you can generate clips of any duration, definitely has some deeper applications. In addition, something many people may have missed is the stylization aspect of VideoPoet. They say VideoPoet is also capable of stylizing input videos guided by a text prompt, and it demonstrates stylistically pleasing prompt adherence. You can see we have the inputs, and the stylized versions of these inputs are very, very incredible. This is very similar to what we got from Runway's Gen-2, where you could put a style onto a given video, although I do think this somehow seems more accurate. I don't know about you guys, but this input of a lion transformed into a metal lion looks really cool; then this one of a wombat, changed into a wombat wearing sunglasses holding a beach ball, looks so cool; and of course there's this one right here, a magical snowy forest covered in dense pine trees. So I think this stylization is very effective.
Then of course we have applying visual styles and effects. Styles and effects can easily be composed in text-to-video generation: they start with a base prompt and append a style to it. The prompt they use is "an astronaut riding a horse in a lush forest," and you can see the different styles they apply. Out of the prompts they show, these are the ones that look most effective: the photorealistic one definitely looks really cool, if only it had a higher frame rate; then the digital art style of the astronaut sitting on a horse; then the pencil art style and the ink wash style; then the double exposure style, which looks absolutely incredible; and the small world style looks even more incredible. So the visual styles and effects you can use with VideoPoet are absolutely insane.
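Notably, the styling here is plain prompt composition: a fixed base prompt with a style clause appended. The base prompt and style names below are from the demo; the exact suffix wording is my assumption.

```python
base = "An astronaut riding a horse in a lush forest"
styles = ["photorealistic", "digital art", "pencil art",
          "ink wash", "double exposure", "small world"]

# Each composed prompt would be sent to the text-to-video model as-is.
for style in styles:
    print(f"{base}, {style} style")
```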
But like I said before, will Google actually give us this to use and do things with? Now, of course, as we previously discussed in the MAGVIT video, they also have inpainting and outpainting. VideoPoet can add detail in masked-out portions of a video, optionally guided by text. You can see here how the outpainted video turns out, and you can see it also has inpainting, like we discussed before: you can mask a piece of an image and then essentially replace it with anything. This is what I'm saying about the applications of this. Think about it: if you have, for example, a video of someone doing something, you can put a mask in and then literally put anything in there. For example, we have this person riding a surfboard in the ocean; put a mask on it, and you can instantly put a shark in there. So I think the applications for this, once it gets really good, are absolutely outstanding.
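Conceptually, the inpainting step is: mask a region of every frame, then have the model regenerate only that region, optionally guided by text. A toy numpy illustration with the model call faked:

```python
import numpy as np

frames = np.random.rand(16, 64, 64, 3)   # a 16-frame RGB clip
mask = np.zeros((64, 64), dtype=bool)
mask[20:44, 20:44] = True                # region to replace in every frame

def inpaint(frames, mask, prompt):
    """Placeholder: a real model would regenerate only the masked region,
    conditioned on the prompt and the unmasked context."""
    out = frames.copy()
    out[:, mask] = 0.5                   # toy fill; imagine the shark appearing here
    return out

edited = inpaint(frames, mask, prompt="a shark leaping out of the water")
print(edited.shape)                      # (16, 64, 64, 3)
```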
In addition, one of the last things I want to discuss is the comparisons to other software. We have image-to-video, and I actually did test this against some other tools, and it's interesting, because Google's VideoPoet does seem like it might be a bit better in some areas than the other software. The one I'm going to compare it to is Pika Labs, and it goes to show that they clearly have different architectures. What we're showing here is image-to-video generation, and of course you can test this yourself by going to the VideoPoet page, taking the images, and inputting them into other tools. The image-to-video here does look more decent, partly because the image quality is just that much better. For example, this image of "a ship navigating the rough seas with several passengers on board, thunderstorm, lightning, animated oil on canvas" looks really good, and "a green man riding on a green horse with the wind blowing" also looks pretty good, and this one also looks really cool because of the way the flag is waving. I did actually try some of these. For example, right here you can see "white milk splashing in a ring; a drop above the ring falls down, making a splash." I tried this, and you can see it didn't seem to get exactly what I said on the first try, so I did try to reprompt it. You can see it's "white milk splashing in a ring; a drop above the ring falls down, making a splash," and it does produce some kind of splash here, but it doesn't recognize exactly what this is and then make that motion. And this was the second try I did with Pika Labs, and you can see that, well, nothing really happens. Now, this isn't to say that Pika Labs is terrible or bad; it's just to say that these different architectures will result in different outputs. I think it's like how Claude is mainly used for really long-form documents and creative writing: just as we have different large language models for different purposes, we're definitely going to have different video models for different purposes too.
Now, one of the last things that was really cool from VideoPoet was the zero-shot controllable camera motion. One emergent property of VideoPoet's pre-training is that a large degree of high-quality camera motion customization is possible by specifying the type of camera shot in the text prompt. The prompt is "adventure game concept art of a sunrise over a snowy mountain by a crystal-clear river," and you can see that from this single prompt they were able to generate different kinds of camera moves: a zoom out, a dolly zoom, a pan left, an arc shot, a crane shot, and an FPV drone shot. I think this goes to show that their video model is clearly really diverse and able to do a lot of stuff.
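Like the style suffixes earlier, camera control is just more prompt text: one scene, six shot types. The shot names are from the demo; the suffix format is an assumption.

```python
base = ("Adventure game concept art of a sunrise over a snowy mountain "
        "by a crystal clear river")
shots = ["zoom out", "dolly zoom", "pan left",
         "arc shot", "crane shot", "FPV drone shot"]

# Only the appended camera text changes between generations.
for shot in shots:
    print(f"{base}, camera: {shot}")
```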
The only things I would like to see now are, number one, Google finally releasing this, and then, of course, them fine-tuning it to a higher quality in terms of pixel density and resolution, and then more frames per second. I know that's not easy to do, but these are some of the steps we need to take if we're going to get text-to-video that is actually usable.
Video “Google's New VIDEO AI 'VideoPoet' Surprises Everyone!” was uploaded on 12/28/2023 to Youtube Channel TheAIGRID