GPT-4 Vision API: 10 NEW MINDBLOWING Abilities + Examples
The GPT-4 Vision API is an incredible tool that is changing the way we interact with images and text. Because it can answer questions about images, the API opens up a wide range of possibilities for innovative applications. One of its most mind-blowing abilities is its integration with text-to-speech, which allows for AI-generated narrations of videos. This has led to the creation of AI sports narrators and game commentators with impressive accuracy.
Furthermore, the Vision API has been used to automate tasks like creating product walkthrough voiceovers, generating fashion advice, counting calories from meals, and even judging outfits in a humorous way. The API’s ability to analyze images and provide insightful responses is truly game-changing.
While the cost of using the API may be prohibitive for some, the potential for creative applications is limitless. The future of work and interaction with computers is bound to be transformed by the capabilities of the GPT-4 Vision API. With more multimodal services on the horizon, we can expect to see even more groundbreaking uses of this technology in the near future.
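To put the cost in perspective, the football narration discussed in the transcript involved a clip of 1,131 frames, of which every tenth was sent, at a quoted total of $30. The sketch below is only rough arithmetic on those figures. The function names are hypothetical, and spreading the cost evenly across frames is an assumption made for illustration, since actual billing depends on token usage rather than a flat per-frame rate.

```python
"""Back-of-the-envelope arithmetic for the football narration clip.

The frame count, sampling step, and total cost are the figures quoted in the
transcript. The even per-frame split is an illustrative assumption, not
OpenAI's actual pricing formula.
"""

TOTAL_FRAMES = 1131      # clip length quoted in the transcript
FRAME_STEP = 10          # "every 10th frame"
TOTAL_COST_USD = 30.0    # total cost quoted in the transcript


def sample_frame_indices(total_frames: int, step: int) -> list[int]:
    """Return the indices of every `step`-th frame, starting at frame 0."""
    if total_frames < 0 or step <= 0:
        raise ValueError("total_frames must be >= 0 and step must be > 0")
    return list(range(0, total_frames, step))


def average_cost_per_frame(total_cost: float, frames_sent: int) -> float:
    """Spread the total cost evenly over the frames sent (a simplification)."""
    if frames_sent <= 0:
        raise ValueError("frames_sent must be > 0")
    return total_cost / frames_sent


if __name__ == "__main__":
    indices = sample_frame_indices(TOTAL_FRAMES, FRAME_STEP)
    per_frame = average_cost_per_frame(TOTAL_COST_USD, len(indices))
    print(f"frames sent: {len(indices)}")
    print(f"average cost per frame: ${per_frame:.2f}")
```

Under these assumptions, sampling more sparsely (a larger step) is the most direct lever for reducing cost, at the price of the narrator seeing less of the action.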
Watch the video by TheAIGRID
Video Transcript
So GPT-4 with Vision is currently one of the most incredible things that you probably did miss. I know at the OpenAI Dev Day many things were overshadowed simply because of GPT-4 Turbo, and of course the introduction of GPTs from OpenAI completely blew everything out of the water because of the sheer customization.
But there was one thing, GPT-4 Vision, especially the API: what people have been doing with this software is absolutely incredible. Essentially, what the API does is allow you to take images and answer questions about them. Now what’s good about the API is that it can take in multiple images really quickly, which means we do have some very, very interesting applications for this that you’re going to want to see, because the examples and creativity that’s already there are really, really interesting.
You can also see that there are multiple image inputs, and of course there are some limitations, such as the cost being quite high. From what I’ve heard, this is something that is actually very expensive, but at the same time, when you see the examples, you’ll see our future is about to get very, very crazy really
really quickly. So you can see that this tweet here by Josh Bickett said, “Our team discovered you can use GPT-4 Vision to create a self-operating computer. By looking at the user interface, GPT-4 decides which series of click or type events are required to accomplish an objective. Here it is writing a poem in Apple Notes.”
So essentially what you can see here is that, since we have the GPT-4 Vision API, it’s able to take the screenshot of your computer and then figure out what to do. You can see right here that we have the self-operating computer: ask a computer to do absolutely anything. When we click play, you can see that if you ask the computer to write a poem about a self-operating computer in Apple Notes, the computer is then able to do that, and it is able to write this really, really quickly.
Now for those of you who are thinking, “Okay, I get that it does take the image, but how does this actually work? How is the LLM embedded into the system?”, it says, “How do you automate clicking? GPT-4 with Vision decides on a window to click based on the objective and estimates the X and Y location in percentage, which can be evaluated to pixels in Python, and it does rather okay at this estimation.”
So understand that this isn’t exactly a fine-tuned version, and the reason I’m saying that is because this isn’t the main purpose of GPT-4 with Vision. GPT-4 with Vision has broad use cases, which means it can be used on people and it can be used on systems. But what if in the future OpenAI decides to release an open agent or some kind of vision model that can fully browse and do anything that you need it to do on your computer? Imagine needing to write an email, send something to your boss, or just do some general research, and you simply input that prompt, and then GPT-5 or GPT-4.5 is simply able to go into your emails and do everything, and when you come back in like half an hour or something, maybe you went outside, you can see that all the work is done.
This is why I say it’s going to be a very interesting future. If we can already do this kind of stuff with a bit of coding, using vision models which aren’t specifically trained for this purpose, how is the future of work and interacting with computers going to look? That’s why I say the future is going to be very, very interesting.
So here we have GPT-4 with
Vision and text-to-speech equals an AI sports narrator. One of the things that OpenAI did actually release was a text-to-speech API. Now this API doesn’t sound as realistic as many others, but here’s the kicker: OpenAI have decided that their new API for this text-to-speech model is going to be vastly cheaper than absolutely any other, which actually makes it a viable option for situations like this.
So essentially you can see: “I passed every frame of a football video to GPT-4 Vision preview and, with some simple prompting, asked to generate a narration. There were no edits, and this came out from the model from scratch,” which means that this could be so much better. Now I’m going to play this clip, and hopefully it doesn’t get a copyright claim, but if it does, there will be a link to this tweet in the description.
“Like a magician on the field, dodging one, two, three. Unstoppable! Look at him go! The crowd is roaring. Can you believe this? He’s taken on the whole defense. He’s a one-man show, ladies and gentlemen. He shoots! Go, Messi, Messi, Messi! Unbelievable! What a goal! What a goal!”
Now I do think that the implications of this, as I was saying before, are absolutely incredible. I mean, this is just what we have in the first 24 to 48 hours, so what do you think people are building behind closed doors? Now I would have made that full screen, but like I said, football clips are notorious for their copyright, and I’d rather include this in the video than have you go off to another website to view it. But like I stated, this is just what we have in the first day, and this is really interesting. Of course, what people also say is that you could then have translations in real time in other languages, which is definitely something that could be done.
But of course, here is the kicker. For those of you who are thinking, “Okay, I do want to build something with this,” please take a look at this tweet right here. It asks, “How much does this cost, and did you use your whole 100 requests for the day?” and the reply says, “The video is 1,131 frames long, but I only passed every 10th frame, all of them together in a dictionary. I think this only counts as one request, but a ton of tokens. This was not cheap. This was $30.”
So that is pretty crazy, guys: that clip right there was $30, which is why I say that although the GPT-4 with Vision model is really, really good, you have to understand that if you are using it for video cases, then it is likely to be very, very expensive. Now I don’t think this should discourage you from what you are seeing here, because there are many, many different multimodal models that are actually being released. There are tons and tons of research papers, so stuff like this with other models is going to be readily available very, very soon. So I wouldn’t stress about the cost for now. I would more so think about a creative application if you are trying to build a business from this.
Then of course we have another GPT-4 with Vision example. GPT-4 with Vision is in many
cases being combined with the new text-to-speech demo. So you can see here that Jason Zhou said, “GPT-4 Vision and text-to-speech demo to automatically generate product walkthrough voiceover video, just from a screen recording.” Essentially what we have is a screen recording, and all you do is take that screen recording, input it, click generate, and then of course we get a video as output. So I’m going to show you that video clip now.
“Welcome to the product demo. Let’s quickly create a custom tool. Navigate to Tools, click on Create Tool, select Knowledge Q&A, name your input, add a step, choose LLM, and voilà, your tool is ready.”
And I think once again the use cases are pretty stunning. Now of course this isn’t something really groundbreaking in terms of what’s possible, but I still think that it is an absolutely incredible and creative way of using the vision model, because these are things that should be automated away. They are going to be useful for people who want to create tutorials very quickly and very easily, and they also make a lot more tutorials accessible. Then of
course we have another one, which is another narrator, and I think this one is a little bit better of an example. At the same time, it is quite shocking how good it is, because like I said, it’s not been that long, and if people are already creating this, it’s quite shocking. Essentially it’s a narrator of a League of Legends game, and when you take a look at the clip, you’re going to realize how insane this actually is.
“Yo, what’s up guys. So I made this quick script where it’s generating commentary for this League game. Basically what it’s doing is taking screenshots of the game every 15 seconds and then using the new Vision API to commentate it. So here’s the end result.”
“And we’ll look to extend their lead as the next objective comes up. And here we are in a tension-filled moment, T1 holding a slight gold lead over LNG as the vision game intensifies around the dragon pit. T1’s positioning is crucial here as they look to establish dominance and secure the next objective. Can LNG find a pick, or will T1 extend their lead? The next move could be pivotal. And as the action unfolds on the top lane, we see LNG’s Zika cautiously holding back, aware of T1’s Zeus, who has just disappeared into that river brush. The vision game is critical here, with neither wanting to overstep and risk a gank with the dragon timer counting down.”
As someone who has played League of Legends before, I can say that this is actually pretty good. If you haven’t played the game, it might be confusing, but trust me when I say this is actually pretty decent, and I think the applications are going to be pretty cool. Now of course, like I said before, the only thing we need to do is get the cost down, and over time we know that OpenAI is
always going to be working on that, so I wonder how many different applications are going to be made with this technology.
Then of course we had fashion advice for the clothes you are currently wearing. It says, “I tried creating this by combining GPT-4’s Vision API and DALL-E 3. It was really quick, so I think more multimodal services will come out in the future.” That was the translation, and it might not be 100% correct, so take that translation with a grain of salt.
The long story short is that this guy essentially hooked up the Vision API so that it takes an image, and then, because of DALL-E 3 in ChatGPT, it can analyze your fashion choice, provide a suggestion, and then step number three is of course to generate a best choice for what you could wear. That comes in a couple of seconds, and then we get this infographic where we can see many different things that the user can wear. They suggest maybe a scarf, maybe a change in the bottoms, a different scarf. They also suggest some potential accessories like glasses, a hoodie, maybe some different shoes.
It’s actually really, really interesting. Those without a fashion sense, or those who aren’t that well versed in that area, are going to receive plenty of information that can help guide their choices, and I think this just goes to show how crazy this Vision API really is. Then of course we
have WebcamGPT, which is using the new GPT-4 Vision API to actively recognize what’s happening in real time. This is a live web demo, and of course you can try this with the link in the comments. Now remember, this does cost a lot, so if you do actually have access, remember that if you are continually sending requests, it’s going to add up to quite a decent amount. So just be careful and be conscious of that when you’re doing this.
Like I said, I do think this is pretty cool: this user is able to get data in real time on exactly what’s going on. Now I don’t know if some people are going to hook this up to maybe CCTV outside their homes, or maybe their Ring camera, and use a system that talks to them and figures out exactly what’s going on. I don’t know how this is all going to change, but like I said, the implications are absolutely incredible, and there are just so many applications that I feel this technology is transforming at such an insane rate.
Then of course we have a tweet
here by Dy tweet share, and it says, “GPT-4 Vision API is pretty insane. I was messing around and built a quick tool for visually counting calories. Kind of fun to just keep using it on everything I see.” So it’s pick to calories.com, and essentially what you do is upload a picture of your meal, and it gives you the calorie count of that meal.
Now as someone who knows about the fitness industry, do you not think that this is absolutely insane? So many people struggle with counting calories, and if you are literally able to just take a picture, the AI is able to recognize exactly what is in that picture and, over the entire day, say, “Look, maybe you’ve gone over your calorie count. This is a certain amount of calories.”
This is a really, really big use case, because if you’re unaware, in fitness, if you want to lose weight, you have to eat fewer calories, and tracking calories is the only way to know how much you’ve eaten. And with this, the old-fashioned way of looking at the wrapper and thinking, “Okay, how many calories are in this?” is going to be completely abolished, because now we simply upload a picture and it’s going to say, “Okay, that’s good to go, you’ve got 600 calories left.” I mean, this is just a game changer, and this tweet doesn’t really have that many likes. It only has like 31 likes and three
retweets, but I think this is something that is going to really change the game.
So this next one is really, really cool, and I think this is going to change how people screenshot and interact with browsers, because this is “screenshot and ask questions about anything”. With GPT-4 Vision, they’ve sort of merged this API into a browser, and essentially what this user is doing is going around and screenshotting stuff, and then I think with text-to-speech they’re basically saying exactly what is going on. So this is really, really cool. I mean, it’s just crazy, as a student, you know.
“Let me show you a demo which is going to change the way you think about the internet. So I’m here on a medical website and I have a picture of the joint. I’m just going to drag and select the part I’m interested in, and GPT-4 is going to answer me. What is it? It is the hip joint region. And what about this part? What is it? I’m not even giving it any context; it just knows this is the Schrödinger equation. Let’s try this part. What is it? Potential energy term.
And let’s say I’m really into cars, but I just don’t know what this orange stick is. What is this orange stick? Engine oil dipstick. And it’s going to help me out no matter where I am, even if I’m just looking at pictures on some blog and wondering, what is this yellow thing? It’s a true needle. Or if I’m just wondering, what is this style? It looks beautiful. And the cool thing is that it also has the context of the page, so it sees the text here and uses it to answer the question even better. It’s transitional kitchen design style. Nice.
And the last question: is it going to be able to interpret the table and answer me, am I doing any good in business? Have increasing net income. Good. So guys, if you like this, I have left a link for the beta below. It’s going to be available very soon. Follow me on Twitter to learn more. Stay hungry, stay foolish.”
Right, if you are struggling with certain things, just being able to, you know, screenshot this and say, hm,
what is that, and it’s able to tell you exactly what that is. That is absolutely crazy.
Now for the last one, what I’m going to do is introduce this. It says, “For a bit of fun, this person and I built a hot or wats using the newly announced Vision API, and GPT-4 Vision analyzes a picture of you and shares a light roast. Check it out at hotor what.” So I think this is really, really interesting, and it provides a comical response on whether or not this person is hot. And then of course we have one about Elon Musk:
“Well, well, well, isn’t it I’m-too-cool-to-look-at-the-camera. I see you there, trying to pull off that distinguished gentleman”
And I think, like I said, there were many more. Now there is actually one more that
I did forget, and I think this one is potentially the biggest one. This one was really, really hard to find, but this is insane. Essentially what we have here is a tweet from Rob, and it says, “Just integrated OpenAI’s GPT-4 into the metaverse to be able to have agents with sight,” and of course the first thing that they made was a Roast Master 9000 that judges you for your metaverse outfit choices.
“Oh, look at this mysterious glowy figure. Seems like someone can’t decide if they’re trying to be a ghost or the Hulk. Whoops, it’s just an avatar caught in mid photon blast. Hey there, luminous being, don’t be shy. Float on over here and grace us with your fluorescent presence.”
“Oh, look who’s strolling through witchy pixelated park. It’s you, with that hat so pointy it could get its own Wi-Fi signal. The green face gives a whole new meaning to eco-friendly. You’ve mixed so many colors in that outfit, I’m not sure if you’re a walking fashion statement or a colorblind artist’s palette.”
“Well, hello there, Iron Giant’s less successful cousin. Did you wander off the set of a low-budget sci-fi flick, or are you just lost on your way to the clunky armor convention? Let’s be real, those knees must have their own weather system with all that bulk. Come a little closer. I promise I only roast with words, not laser beams.”
So I think this is really cool, because previously we’ve seen what happens when we put AI into NPCs, but what if we give these AI NPCs vision? Things are about to get super, super interesting. I mean, are these AI NPCs going to become conscious? Are they going to become insane? Are they going to go crazy? Are they going to go ahead and live their own little lives? I would honestly be so intrigued to see what would happen if we put like 10 AI NPCs together, told them they were conscious, gave them vision, and just gave them complete freedom, just to see what they would do.
But yeah, go check out this video, because it’s absolutely incredible. For links to all of this, we’re going to do a Twitter thread that links to every single one of these. It will be in a link in the description, so if you want to check this out, don’t forget to follow us, because the thread will be there.
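One detail from the self-operating computer demo near the start of the transcript lends itself to a short illustration: the model estimates the click location as X and Y percentages, which are then converted to pixels in Python. Here is a minimal sketch of that conversion, assuming the percentages arrive as numbers between 0 and 100 and the screen size is known. The function name and the example resolution are hypothetical and are not taken from the project’s code.

```python
def percent_to_pixels(x_percent: float, y_percent: float,
                      screen_width: int, screen_height: int) -> tuple[int, int]:
    """Convert a click position given as screen percentages into pixel coordinates.

    Assumes percentages in the range 0-100 (an assumption; the transcript does
    not specify the exact format the demo uses).
    """
    if screen_width <= 0 or screen_height <= 0:
        raise ValueError("screen dimensions must be positive")
    if not (0 <= x_percent <= 100 and 0 <= y_percent <= 100):
        raise ValueError("percentages must be between 0 and 100")
    # Clamp to the last valid pixel so that 100% does not fall off-screen.
    x = min(int(screen_width * x_percent / 100), screen_width - 1)
    y = min(int(screen_height * y_percent / 100), screen_height - 1)
    return x, y


if __name__ == "__main__":
    # Hypothetical 1920x1080 display; the model estimates 50% across, 25% down.
    print(percent_to_pixels(50, 25, 1920, 1080))
```

The resulting pixel pair could then be handed to whatever mouse-automation library is in use. Because the model only estimates the position, a real tool would likely need some tolerance for near misses.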
Video “GPT-4 Vision API: 10 NEW MINDBLOWING Abilities + Examples” was uploaded on 11/09/2023 to the YouTube channel TheAIGRID