Googles GEMINI 1.5 Just Surprised EVERYONE! (GPT-4 Beaten Again) Finally RELEASED!
Google's Gemini 1.5 model has arrived, and it has taken everyone by surprise. This latest iteration of Google's Gemini family boasts some incredible features that make it stand out from its predecessors and from other AI models on the market. With the capability to process up to 3 hours of video in a single context window, 22 hours of audio at once, and up to 7 million words or 10 million tokens with remarkable accuracy, Gemini 1.5 is a game-changer.

On these long-context tests, the model's accuracy rate is around 99 to 100%, setting it apart from other AI systems in the industry. It sits in the Pro tier of the Gemini lineup, below Gemini Ultra, and is aimed at larger, more tedious tasks that require a longer context window. Compared with its predecessor, Gemini 1.0 Pro, Gemini 1.5 excels in text, vision, and audio, and it even approaches the Gemini Ultra benchmarks, showcasing its superior capabilities.

The video demonstrates the model's long-context understanding, including its ability to reason across a 402-page transcript and solve coding tasks with precision. From accurately extracting quotes from transcripts to modifying code and even identifying scenes from screenshots, Gemini 1.5 shows its versatility and accuracy in handling complex tasks.

Overall, Google's Gemini 1.5 Pro is an impressive AI model that sets a new standard for accuracy and multimodal capabilities, paving the way for innovative applications across various industries.
Watch the video by TheAIGRID
Video Transcript
So Google actually did just surprise everyone by releasing Gemini 1.5, the latest iteration of their family of Gemini models, and this is a rather surprising model in that it is able to do something incredible. Gemini 1.5 is a behemoth that is able to take up to 3 hours of video in a single context window. It's also able to take 22 hours of audio at once, and up to 7 million words or 10 million tokens, with remarkable accuracy as well. Lots of the time when new models appear, their accuracy rates are very underwhelming, but Gemini is just outstanding, because on these capabilities its accuracy rate is around 99 to 100%. That is absolutely incredible, and this multimodal model is going to change everything. Let's take a look at some of the things you need to know, because once you see a few videos, you're going to be truly surprised by how good this AI really is.

So where does Gemini 1.5 fit? Before we dive into some examples of how good this AI system is: on the left-hand side you can see Gemini Ultra, their most capable and largest model for highly complex tasks, and in the middle you can see Gemini Pro, with two iterations, Gemini 1.0 and Gemini 1.5. Gemini 1.5 is the model that was released today, and it is essentially for larger, more tedious tasks that require a longer context length. So how much better is Gemini 1.5? You can see that in text, in vision, and in audio, Gemini 1.5 is better across the board; however, compared to the Ultra benchmarks, there are some areas in vision and audio, on the right-hand side, where Ultra is slightly better. So overall, this model is substantially better than Gemini 1.0 Pro, which is currently available, and against Gemini Ultra it is largely better in text and in vision. Across the board, this is a model that is most certainly more capable. Now I'm going to show you one of these examples of Gemini 1.5 Pro reasoning across a 402-page transcript.
This is a demo of long-context understanding, an experimental feature in our newest model, Gemini 1.5 Pro. We'll walk through a screen recording of example prompts using a 402-page PDF of the Apollo 11 transcript, which comes out to almost 330,000 tokens. We started by uploading the Apollo PDF into Google AI Studio and asked: "Find three comedic moments. List quotes from this transcript and emoji." This screen capture is sped up, and this timer shows exactly how long it took to process each prompt; keep in mind that processing times will vary. The model responded with three quotes, like this one from Michael Collins: "I'll bet you a cup of coffee on it." If we go back to the transcript, we can see the model found this exact quote and extracted the comedic moment accurately. Then we tested a multimodal prompt: we gave it this drawing of a scene we were thinking of and asked what moment it shows. The model correctly identified it as Neil's first steps on the moon. Notice how we didn't explain what was happening in the drawing; simple drawings like this are a good way to test whether the model can find something based on just a few abstract details. For the last prompt, we asked the model to cite the timecode of this moment in the transcript. Like all generative models, responses like this won't always be perfect; they can sometimes be a digit or two off. But let's look at the model's response here: when we find this moment in the transcript, we can see that this timecode is correct. These are just a few examples of what's possible with a context window of up to 1 million multimodal tokens in Gemini 1.5 Pro. That demo right there was rather impressive, and there are a lot more examples in the paper.
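For anyone who wants to try this kind of long-document prompt outside of AI Studio, here is a minimal sketch using Google's google-generativeai Python SDK; the file name and API-key handling are illustrative assumptions, and only the general upload-then-prompt pattern mirrors the demo.

```python
# Minimal sketch: long-document prompting with the Gemini API.
# Assumes the google-generativeai package is installed, a valid API key
# is in the environment, and "apollo11.pdf" (a placeholder name) exists.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Upload the large PDF once via the File API, then reference it in prompts.
transcript = genai.upload_file(path="apollo11.pdf")

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    transcript,
    "Find three comedic moments. List quotes from this transcript and emoji.",
])
print(response.text)
```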
But let's take a look at another example of Gemini's massive capabilities, this time on coding tasks. This is a demo of long-context understanding, an experimental feature in our newest model, Gemini 1.5 Pro. We'll walk through some example prompts using the three.js example code, which comes out to over 800,000 tokens. We extracted the code for all of the three.js examples and put it together into this text file, which we brought into Google AI Studio. Over here, we asked the model to find three examples for learning about character animation. The model looked across hundreds of examples and picked out these three: one about blending skeletal animations, one about poses, and one about morph targets for facial animations, all good choices based on our prompt. In this test, the model took around 60 seconds to respond to each of these prompts, but keep in mind that latency might be higher or lower, as this is an experimental feature we're optimizing.

Next we asked what controls the animations on the Littlest Tokyo demo. As you can see here, the model was able to find that demo, and it explained that the animations are embedded within the glTF model. Next we wanted to see if it could customize this code for us, so we asked: "Show me some code to add a slider to control the speed of the animation. Use that kind of GUI the other demos have." This is what it looked like before on the original three.js site, and here's the modified version: it's the same scene, but it added this little slider to speed up, slow down, or even stop the animation on the fly. It used the GUI library the other demos have, set a parameter called animationSpeed, and wired it up to the mixer in the scene. Like all generative models, responses aren't always perfect; there's actually not an init function in this demo like there is in most of the others. However, the code it gave us did exactly what we wanted.

Next we tried a multimodal input by giving it a screenshot of one of the demos. We didn't tell it anything about this screenshot and just asked where we could find the code for the demo seen over here. As you can see, the model was able to look through the hundreds of demos and find the one that matched the image. Next we asked the model to make a change to the scene, asking how we could modify the code to make the terrain flatter. The model was able to zero in on one particular function, called generateHeight, and showed us the exact line to tweak; below the code, it clearly explained how the change works. Over here in the updated version, you can see that the terrain is indeed flatter, just like we asked. We tried one more code-modification task using this 3D text demo. Over here we asked: "I'm looking at the text geometry demo and I want to make a few tweaks. How can I change the text to say goldfish and make the mesh materials look really shiny and metallic?" You can see the model identified the correct demo and showed the precise lines in it that need to be tweaked. Further down, it explained these material properties, metalness and roughness, and how to change them to get a shiny effect. You can see that it definitely pulled off the task, and the text looks a lot shinier now. These are just a couple of examples of what's possible with a context window of up to 1 million multimodal tokens in Gemini 1.5. You just saw Google's Gemini 1.5 Pro problem-solving across 100,000 lines of code.
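As an aside, that 100,000-line input was built by concatenating all of the three.js examples into a single text file. A rough sketch of that preprocessing step might look like the following; the local directory path, file pattern, and separator format are assumptions rather than details from the video.

```python
# Sketch: flatten a tree of example source files into one large text file
# that can be uploaded as a single long-context prompt. The checkout path
# and the header format are illustrative assumptions.
from pathlib import Path

EXAMPLES_DIR = Path("three.js/examples")  # hypothetical local checkout
OUTPUT = Path("threejs_examples.txt")

with OUTPUT.open("w", encoding="utf-8") as out:
    for source in sorted(EXAMPLES_DIR.rglob("*.html")):
        out.write(f"\n===== {source.relative_to(EXAMPLES_DIR)} =====\n")
        out.write(source.read_text(encoding="utf-8", errors="ignore"))

print(f"Wrote {OUTPUT.stat().st_size / 1e6:.1f} MB of example code")
```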
My, oh my, this is something that is truly impressive. There is no other AI system out there that can do this at the accuracy level of Google's Gemini. But now let's take a look at some of the multimodal prompting, which is going to be used by a lot of everyday users. This is a demo of long-context understanding, an experimental feature in our newest model, Gemini 1.5 Pro. We'll walk through a screen recording of example prompts using a 44-minute Buster Keaton film, which comes out to over 600,000 tokens. In Google AI Studio, we uploaded the video and asked: "Find the moment when a piece of paper is removed from the person's pocket, and tell me some key information on it with the timecode." This screen capture is sped up, and this timer shows exactly how long it took to process each prompt; keep in mind that processing times will vary. The model gave us this response, explaining that the piece of paper is a pawn ticket from Goldman and Company Pawn Brokers, with the date and cost, and it gave us this timecode, 12:01. When we pulled up that timecode, we found it was correct: the model had found the exact moment the piece of paper is removed from the person's pocket, and it extracted the text accurately. Next we gave it this drawing of a scene we were thinking of and asked for the timecode when this happens. This is an example of a multimodal prompt, where we combine text and image in our input. The model returned this timecode, 15:34. We pulled that up and found that it was the correct scene. Like all generative models, responses vary and won't always be perfect, but notice how we didn't have to explain what was happening in the drawing. Simple drawings like this are a good way to test whether the model can find something based on just a few abstract details, like it did here. These are just a couple of examples of what's possible with a context window of up to 1 million multimodal tokens in Gemini 1.5 Pro.
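A video prompt like this one could be sketched with the same Python SDK. The file name is a placeholder, and the polling loop follows Google's documented pattern, since uploaded videos are processed asynchronously before they can be referenced:

```python
# Sketch: asking for a timecode in a long video via the Gemini File API.
# The file name and prompt are placeholders; uploaded videos are processed
# asynchronously, so we poll until the file becomes ACTIVE.
import os
import time
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

video = genai.upload_file(path="keaton_film.mp4")  # hypothetical file name
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    video,
    "Find the moment when a piece of paper is removed from the person's "
    "pocket, and tell me some key information on it with the timecode.",
])
print(response.text)
```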
That right there goes to show us how crazy this is. I think the only caveat is that it does take a little bit of time for it to go through the footage, but looking through a 44-minute video is absolutely incredible, and the reasoning across it is not to be understated; think about how long it would take a human to watch through an entire movie and find something from one frame. And whilst these demos are impressive, what's even more impressive is the paper attached to this, which I read, and which shows a whole host of other incredible capabilities. So let's take a look at some of these stunning examples from the paper, which will show you exactly how accurate this AI system really is and why Google is really leading the entire AI industry with Gemini 1.5 Pro.

There was this example, and it stated that, given a reference grammar book and a bilingual word list (a dictionary), Gemini 1.5 is able to translate from English to Kalamang with similar quality to a human who learned from the same materials. This is incredibly substantial, because it means that not only is it able to take in the entirety of that grammar book and dictionary in its context length, it's able to reason and do translation based on new data, just like a human would. There was also another stunning example from the paper: essentially, it states that, given the entire text of a really, really long novel, the model is able to understand exactly what's happening from a very simple drawing. I'm no artist, but I'm sure you can all appreciate that the drawing here isn't a very artistic one, and it's really simple. So the genius of this system is that it can understand the nuance of what's happening in the image, extrapolate that data out, and of course reason to figure out exactly where that moment is. That is something that is unheard of in our current AI systems, and that's why I stated this is truly game-changing stuff. There was another example in the paper, and I'm pretty sure you've already seen this one in the video, but it just goes to show how crazy this is. Now, some of the stuff I was looking at in the paper was really cool.
There was this thing called the video haystack, okay, and I'm going to break this down for you, because how accurate this is, and how they tested it, is truly fascinating. In the image you can see Gemini 1.5 Pro compared to GPT-4 with Vision, and unfortunately GPT-4 with Vision can only take in 3 minutes of video through its API, whereas Gemini 1.5 can handle anywhere from 1 minute of content all the way up to 3 hours. So essentially they set up a game: the computer, which is Gemini 1.5, has to find a secret message, and the secret word is "needle". But this message was sneakily hidden in one tiny part of a very long movie, and this isn't just any movie: it's a three-hour-long movie made by sticking two copies of a documentary about the game of Go together, which makes the video really long, with lots of places where the message could be hidden. In this demo, they put the secret message in only one single frame of the video. That's just one picture out of the thousands and thousands that make up the entire movie, and of course there's a picture every single second. Gemini 1.5 Pro's job was to watch this entire super-long movie and find that one frame with the secret message, and all they did was ask Gemini 1.5 what the secret word was, which is essentially like finding a needle in a haystack. And can you guess what? Gemini 1.5 was able to do it 100% of the time, and that is why the video haystack capabilities are absolutely incredible.
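Here is a rough Python sketch of how one might construct such a test clip with OpenCV, overwriting a single frame with the hidden message; the file names, frame index, and message text are illustrative assumptions, not details from the paper.

```python
# Sketch: hide a one-frame text "needle" inside a long video for a
# needle-in-a-haystack test. File names and the target frame index are
# illustrative assumptions.
import cv2

IN_PATH, OUT_PATH = "haystack.mp4", "haystack_with_needle.mp4"
NEEDLE_FRAME = 54_321          # arbitrary frame to overwrite
MESSAGE = "The secret word is: needle"

cap = cv2.VideoCapture(IN_PATH)
fps = cap.get(cv2.CAP_PROP_FPS)
size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
        int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
out = cv2.VideoWriter(OUT_PATH, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)

idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx == NEEDLE_FRAME:
        frame[:] = 255  # blank the frame to white
        cv2.putText(frame, MESSAGE, (50, size[1] // 2),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.5, (0, 0, 0), 3)
    out.write(frame)
    idx += 1

cap.release()
out.release()
```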
In addition, they played the same kind of game with the Gemini 1.5 Pro system using 22 hours of audio, and you can see here that it was able to do it at up to 100%. They compared it to Whisper combined with GPT-4 Turbo, from 12 minutes all the way up to 11 hours, and you can see the boxes in red, which are essentially areas where that pipeline completely failed. In addition, they also did this with a text haystack, and this is where things start to get crazy, because this was something that people didn't really think was possible. There were certain research papers stating that, if we really wanted this kind of retrieval at this scale, we were going to have to use different architectures, such as Mamba, but it seems like Google managed to figure out how to do it anyway. And you can see right here that, up to 10 million tokens, they were able to get the accuracy to around, I think it was, 99%, a ridiculous level of accuracy, and that is something that is unheard of. A 1-million-token context window is incredible, and of course, compared to GPT-4 Turbo, that's against only a 128,000-token context length. So this is truly game-changing, because imagine having 1 million tokens and then getting an AI system to reason about the entirety of that, or find certain things and then reason on them. That is going to be a huge deal.
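The text haystack evaluation follows the same recipe: bury one "needle" sentence at a chosen depth inside a mountain of filler text and check whether the model can quote it back. A minimal sketch of that harness, using word count as a rough proxy for tokens and an assumed needle and question, might look like this:

```python
# Sketch of a text needle-in-a-haystack test: insert one needle sentence
# at a given depth in a long context and check retrieval. The filler text,
# needle, and question are illustrative assumptions; word count is only a
# rough proxy for tokens.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

NEEDLE = "The magic number mentioned in this document is 31337."

def build_haystack(num_words: int, depth: float) -> str:
    filler = "The grass is green. The sky is blue. " * (num_words // 8)
    words = filler.split()
    cut = int(len(words) * depth)  # depth 0.0 = start of context, 1.0 = end
    return " ".join(words[:cut] + [NEEDLE] + words[cut:])

model = genai.GenerativeModel("gemini-1.5-pro")
haystack = build_haystack(num_words=500_000, depth=0.5)
response = model.generate_content(
    [haystack, "What is the magic number mentioned in this document?"]
)
print("retrieved" if "31337" in response.text else "missed")
```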
Additionally, there were some benchmarks, so we can see here the comparison between GPT-4V and Gemini 1.5 Pro on one-hour video QA. The experiments are run by sampling one video frame per second and linearly subsampling 16 or 150 frames, and you can see that Gemini 1.5 Pro outperforms GPT-4 with Vision substantially: not only does it win at both 16 frames and 150 frames, it actually supports video input directly, whereas GPT-4 with Vision currently doesn't.
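To make that sampling detail concrete, "linearly subsampling 16 or 150 frames" just means picking evenly spaced frames out of the one-frame-per-second sample, so an hour-long video yields 3,600 candidate frames:

```python
# Pick k evenly spaced frames from a 1-frame-per-second sampling of an
# hour-long video (3,600 frames), matching the subsampling described above.
import numpy as np

total_frames = 3600  # one frame per second for a one-hour video
for k in (16, 150):
    indices = np.linspace(0, total_frames - 1, num=k, dtype=int)
    print(k, indices[:4], "...", indices[-2:])
```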
Now we can take a look at some of the benchmarks to see exactly what is going on. You can see right here that the core capabilities, like math, science and reasoning, coding, and instruction following, are up across the board in this model. And what's crazy is that, in terms of the actual family of models, if we take a look at where Gemini 1.5 Pro sits, we know that 1.5 Pro sits in the middle in terms of what the model is able to do, so that leads me to believe that potentially we could be getting an Ultra 2.0 or an Ultra 1.5. But with these benchmarks, we can see that Gemini 1.5 is literally better across the board, and it has a hugely increased context length that is going to enable a lot more things. Now, if you want to take a look at some of the individual detailed benchmarks, you can see the math ones right here: 1.5 Pro outperforms on HellaSwag, doesn't on MMLU, does on GSM8K, does on MATH, doesn't on the rest of these, and does on Big-Bench. So across the board, you can see that Gemini 1.5 Pro is really taking the cake in terms of what is possible with an AI system. And in the detailed coding benchmarks, we can see that it's about half and half in terms of these capabilities, but it does hit 77.7% on the Natural2Code benchmark.

One thing I did want to find out, of course, was how they trained this model. Like Gemini 1.0 Ultra and Gemini 1.0 Pro, Gemini 1.5 Pro was trained on multiple 4096-chip pods of Google's TPUv4 accelerators, distributed across multiple data centers, on a variety of multimodal and multilingual data. Now, with that being said, are you excited for Google's family of models, which are absolutely incredible? Are you going to take a look and use this model in Google AI Studio? And with things like the video capabilities, which haven't been done by any other AI system before, are you excited to potentially use them to reason about and figure things out in certain videos? Either way, I'm excited for Google to finally beef up the competition and make the AI space more competitive, but it will be interesting to see how other AI companies respond, because right now it seems that Google is well in the lead. The benchmarks are here, and the benchmarks are clear, and some AI systems right now don't even have some of these capabilities. So, with that being said, if you did enjoy this, don't forget to leave a comment below on where you think Google is going to go next.
Video “Googles GEMINI 1.5 Just Surprised EVERYONE! (GPT-4 Beaten Again) Finally RELEASED!” was uploaded on 02/15/2024 to Youtube Channel TheAIGRID