Microsoft’s New AI Orca-2 Just Changed EVERYTHING! (Synthetic Data Breakthrough)
Microsoft has recently released a groundbreaking research paper introducing Orca 2, a follow-up to their previous model Orca which took the AI community by storm. Orca 2 focuses on teaching small language models how to reason, surpassing models of similar size and even larger models in performance. With only 7 billion parameters, Orca 2 achieves remarkable feats comparable to much larger models like GPT-3. The key breakthrough in Orca 2 lies in the use of synthetic data for training, allowing for diverse reasoning techniques and optimal solutions for various tasks. This approach accelerates model training, improves reasoning specialization, and sets the stage for future advancements in AI. The use of tailored synthetic data showcases the potential for smaller models to outperform larger ones, ushering in a new era of AI research focused on synthetic data to overcome data bottlenecks and achieve exponential growth. Learn more about Microsoft’s Orca 2 and the implications of synthetic data in shaping the future of AI in our latest video. Subscribe now for more insights into the cutting-edge developments in artificial intelligence.
Watch the video by TheAIGRID
Video Transcript
So Microsoft recently released a new research paper in which they present something called Orca 2. This is a follow-up to Orca, which they released about five months ago and which we covered in a previous video. Orca was a large language model that was impressively capable, took everyone by surprise, and managed to match the capabilities of ChatGPT. Now they're back with Orca 2, and this one is about teaching small language models how to reason. There's one section of this paper where I think we genuinely get a breakthrough in the methods used to train these large language models, and given the discussions online about this kind of training, I want to propose to you why I think this is going to change everything.
So here's the thing about Orca 2 that sets it apart from other models: when you read the paper, you see they did something different. The model is trained on a set of reasoning techniques, step-by-step, recall then generate, recall-reason-generate, direct answer, and so on, and it learns to use different techniques for different kinds of prompts. One of the reasons Orca 2 is so good (and I'll get into some of the deeper reasons later) is that it significantly surpasses models of similar size and attains performance levels similar to or better than those of models five to ten times larger. What we've seen from Microsoft Research here is a focus on language models that are small but really capable. Orca 2 is only 7 billion parameters, which is tiny in the scheme of things; ChatGPT (GPT-3.5) is reported to be around 175 billion parameters, yet Orca 2, at 7 billion, manages to accomplish many of the feats that ChatGPT does, which is why this model is truly interesting.
Interestingly enough, this model is open source, so you can use it right now. If you want the weights, you can go over to the Hugging Face page, where you'll find both a 13-billion-parameter version (Orca-2-13b) and a 7-billion-parameter version (Orca-2-7b). I've seen people testing it on Twitter, and it's actually really good.
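If you want to try it yourself, here is a minimal sketch of loading and prompting the model locally. This is my own illustration, not code from Microsoft: it assumes the Hugging Face transformers library (plus accelerate for device_map) and the public microsoft/Orca-2-7b checkpoint, and the ChatML-style prompt template below follows the model card as far as I recall, so double-check the card for the exact format.

```python
# Minimal sketch: load Orca 2 and ask it a question locally.
# Assumes: pip install torch transformers accelerate sentencepiece (and enough GPU/CPU memory).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Orca-2-7b"  # a larger variant also exists: "microsoft/Orca-2-13b"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision so the 7B model fits more easily on one GPU
    device_map="auto",          # requires the accelerate package
)

# ChatML-style prompt (system + user turns), as described on the model card.
system = "You are a helpful assistant that reasons step by step."
question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"
prompt = (
    f"<|im_start|>system\n{system}<|im_end|>\n"
    f"<|im_start|>user\n{question}<|im_end|>\n"
    f"<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens so we only print the model's reply.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```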
So why is this so interesting? Because the way they trained this model introduces some techniques we haven't seen much of before, and we're going to get into that. Now, what is Orca 2 based on? Orca 2 is built on the Llama 2 model family, and of course it retains some of Llama 2's limitations as well as the limitations common to all LLMs. I think it's really interesting that it's based on Llama 2.
When you compare it to other models, you're going to see that it has reasoning capabilities comparable to those of much larger language models. To evaluate Orca 2, the researchers used a comprehensive set of 15 diverse benchmarks corresponding to approximately 100 tasks and more than 36,000 unique test cases, all in zero-shot settings. The benchmarks cover a variety of aspects, including language understanding, common-sense reasoning, multi-step reasoning, math problem solving, reading comprehension, summarization, groundedness, truthfulness, and toxic content generation and identification.
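To make the zero-shot part concrete, here is an illustrative accuracy loop over multiple-choice items; this is not the harness used in the paper, and both the item format and the generate_answer callable are hypothetical placeholders (for example, a wrapper around the loading snippet above).

```python
# Illustrative zero-shot accuracy loop; not the benchmark harness used in the paper.
def zero_shot_accuracy(items, generate_answer):
    """items: list of dicts like {"question": str, "choices": [str, ...], "answer": "B"}
    generate_answer: any callable that maps a prompt string to the model's reply."""
    correct = 0
    for item in items:
        # Zero-shot: the prompt states the task only; no few-shot demonstrations.
        choices = "\n".join(
            f"{letter}. {text}" for letter, text in zip("ABCDEFGH", item["choices"])
        )
        prompt = f"{item['question']}\n{choices}\nAnswer with a single letter."
        prediction = generate_answer(prompt).strip()[:1].upper()
        correct += prediction == item["answer"]
    return correct / len(items)
```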
On these benchmarks you can see that Orca 2 surpasses even LLaMA-2-Chat, which it's built on, and it also surpasses WizardLM. It would have been interesting to see what else they compared it against, because they only compared it against LLaMA-2-Chat and WizardLM; it would be nice to see many of the other models. I'm not saying they cherry-picked, I'm pretty sure they didn't, but some of the findings are really cool, because remember, those other models are a lot bigger. If they're able to make these smaller models more efficient and better reasoners, then we have a serious future on our hands. Now, this is where it starts to get interesting, and this is where I need you to pay attention, because this is where the real substance comes through. The paper says Orca 2's performance significantly surpasses models of similar size and attains performance levels similar to or better than those of models five to ten times larger, which of course includes models in the 70-billion-parameter range, showcasing the potential of equipping smaller models with better reasoning capabilities, though Orca 2 models do still exhibit limitations.
What's also interesting is that the way they trained Orca 2 could be applied to different base models. They only used Llama 2, and the reason I'm talking about different base models is that Llama 2 is good, but there are other open-source base models that are far superior, which I'm going to talk about later. If you take a look at the open-source large language model leaderboard right now, you can see there's a new open-source king, a model called Yi, at 34 billion parameters, and it absolutely crushes the benchmarks. That's what I'm saying: imagine applying the way Orca 2 was trained to some of these other open-source models; could we get something even more effective?
A lot of these models are trained in many different ways, and it's intriguing to think about how we could do this. Essentially, the main breakthrough behind Orca 2 is the synthetic dataset. The paper says Orca 2 is trained with an expanded, highly tailored synthetic dataset. The training data was generated such that it teaches the model various reasoning techniques, such as step-by-step processing, recall then generate, recall-reason-generate, extract-generate, and direct-answer methods, while also teaching it to choose a different solution strategy for each task.
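To illustrate that idea, here is a loose sketch of how a tailored synthetic training example could be built: the teacher model is prompted with a detailed, task-specific instruction that selects a reasoning strategy, while the stored example pairs the teacher's answer with only a generic instruction, so the smaller student model has to learn to pick the right strategy itself. The strategy names, the example format, and the teacher_generate callable are all hypothetical, not taken from the paper.

```python
# Loose sketch of building a tailored synthetic training example.
# teacher_generate is a placeholder for a call to whatever strong teacher model you use.
STRATEGIES = {
    "math_word_problem": "Solve the problem step by step, showing your working.",
    "reading_comprehension": "First recall the relevant facts from the passage, then generate the answer.",
    "simple_factoid": "Give the direct answer only, with no explanation.",
}

GENERIC_INSTRUCTION = "You are a helpful assistant. Answer the question as well as you can."

def make_training_example(task_type: str, question: str, teacher_generate) -> dict:
    # The teacher sees the detailed, task-specific strategy instruction...
    teacher_answer = teacher_generate(system=STRATEGIES[task_type], user=question)
    # ...but the stored example pairs the answer with only a generic instruction,
    # so the smaller student model must learn to choose the strategy itself.
    return {"system": GENERIC_INSTRUCTION, "user": question, "assistant": teacher_answer}
```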
The reason this synthetic data is such a big breakthrough is that it lets us achieve scale on a level we haven't seen before, and I can almost guarantee you that this is one of the big stepping stones we need to explore when it comes to AGI. You see, the problem with the datasets we collect is that you need a lot of data, and human-generated data is very, very hard to collect. When we look at what earlier models were trained on, I think GPT-3's dataset started from roughly 45 terabytes of raw web text that was filtered down to a few hundred gigabytes, and GPT-2 was trained on around 40 gigabytes. Which essentially means that if one of the main bottlenecks is data, then Microsoft Research showing us that training on a new synthetic dataset produces these outstanding results tells us that this is where we need to look.
And Microsoft Research aren't the only ones looking into synthetic datasets, which, as you know, are data created and generated by AI systems themselves. The reason this is really interesting is that others in the field have started saying the same thing, and many people are pointing back to the lessons we've talked about before. Dr. Jim Fan says it's pretty obvious that synthetic data will provide the next trillion high-quality training tokens, that he bets most serious LLM groups know this, and that the key question is how to sustain the quality and avoid plateauing too soon. He adds that The Bitter Lesson by Richard Sutton continues to guide AI development: there are only two paradigms that scale indefinitely with compute, learning and search; it was true in 2019, it's true today, and he bets it will be true until the day we solve AGI.
Essentially, what he means is that if we want another trillion high-quality tokens, we need to be able to generate high-quality synthetic data. I don't remember whether it was for GPT-3 or GPT-4, but I'm pretty sure they've already exhausted a lot of the highest-quality data available, and it was Sam Altman or Greg Brockman who talked about the fact that for later models like GPT-5 and further iterations, they'll have collected every single data source on planet Earth, and trust me when I tell you, as plain text that really isn't very much; you could fit it all on a single drive. What we need now is the ability to generate a continuous stream of new data that we can use to train these models in different ways, and I'm going to show you later why this is crazy.
Now, The Bitter Lesson is a really key insight into why we need high-quality synthetic data, and when you see it, you're going to think, okay, this is so true. The Bitter Lesson essentially breaks down into a few parts. First, general methods are effective: the primary lesson from 70 years of AI research is that general methods that leverage computation outperform other approaches, due to the exponentially decreasing cost of computation. Then it talks about the transition and the case studies, and this is the key part; I'm going to show you a GIF that basically illustrates why synthetic data will be what changes everything. Historically, AI research often assumed constant computation availability and focused on leveraging human knowledge (the key phrase here being human knowledge); however, as computational power increased, the focus shifted to leveraging that growing computational capability. Before, it was all about human knowledge and human effort, but remember that over time compute has increased: we get faster GPUs, faster processors, more RAM; everything keeps getting more efficient, which of course is attributed to Moore's law. This is where we have to shift over and make sure we can utilize that compute, because it's going to be more effective, and part of that is extensive search and computational power.
The thing about this is that there's a case study showing that training on data the AI creates itself essentially creates a feedback loop of self-improvement, to the point where we can reach that next level in AI, and that's why there's such a focus on synthetic data. So let's take a look at this GIF. At zero days, AlphaGo Zero has no prior knowledge of the game and only the basic rules as input. In three days, it surpasses the abilities of AlphaGo Lee, the version that beat world champion Lee Sedol in four out of five games in 2016. In 21 days, it reaches the level of AlphaGo Master, the version that beat top professionals and the world champion three games to zero. And in 40 days, AlphaGo Zero surpasses all previous versions of AlphaGo and becomes the best Go player in the world, and it does this entirely from self-play, with no human intervention and zero historical data. The point of this GIF is that in 0 to 40 days, with synthetic data generated by the AI playing itself, we went from something like a minus-2,000 Elo rating all the way to the best in the world. That's 40 days, which is pretty crazy if you think about it.
In a month and ten days it went from zero to hero, and I think that's absolutely incredible. It shows us that while it's good to train on human data and human experts, the problem is that a large language model, or whatever AI system you build, is only going to be as good as the best inputs you give it, which means there's a cap. But if we're able to get these large language models to train themselves, essentially without supervision, and if that actually turns out to be effective, then we could have a system like this, with an explosion in growth and a continual increase in capabilities, where we don't need human data anymore. And if we somehow manage to not need human feedback either, then we could get situations where synthetic data provides the key steps forward in terms of reasoning.
So I think this is something they're going to be looking at; I'm pretty sure that's what they'll be doing for GPT-5 and GPT-6, because eventually we are going to run out of data. There was a video by AI Explained where he talked about how data is all you need; you can go check that video out, I'll leave a link to it, but it basically argues that GPT-5 will be all about data, because there are several bottlenecks around data, including the fact that humans would need to produce the high-quality data and there's only so much of it in existence. And of course we have this tweet from Elon Musk saying it's a little sad that you can fit the text of every book ever written by humans on one hard drive, and that synthetic data will exceed that by a zillion; Jim Fan responds that lots of synthetic data will come from embodied agents, for example Tesla Optimus, if we can simulate them at massive scale.
So it shows us that we're moving toward a future where synthetic data will be the key that lets us get the kind of exponential growth we saw with AlphaGo. Now, I'm not sure exactly how this is going to work, what kind of pipeline is going to produce the synthetic data, but I'm sure the engineers and researchers at OpenAI know exactly what they're doing, as do any startups working on this; there are companies that literally just provide data to OpenAI, and they might be looking into generating synthetic data as well. You have to understand that these tokens are the backbone of these LLMs, and there are papers showing that high-quality data is essentially the backbone of a large language model. For example, Microsoft's phi-1 was a paper where they essentially fed the model only the very highest-quality data about code, and it outperformed ChatGPT on code generation despite being a fraction of the size, which shows us where we want to go. So synthetic data has its advantages.
Synthetic data can be designed to encapsulate a wide range of scenarios, including those that are rare or difficult to encounter in real-world datasets. This allows the creation of diverse and complex training environments, which is essential for developing models that can generalize well across different contexts, a crucial aspect of AGI. For example, certain things don't happen very often, but when they do, you want the model to have already seen something like 10,000 examples of that situation, so that the one time it does happen, it knows exactly how to respond and handle it. In addition, synthetic data has some other benefits: the data is less likely to be biased, because we can deliberately make sure it's balanced, and it enables safe exploration of edge cases. Synthetic data can safely represent edge cases or sensitive scenarios where real-world data collection would be impractical or unethical; I'm not going to name any of those in this video because I don't want it to get demonetized, but you can use your imagination. Training on that data ensures the models are prepared for, like I said, a wide range of scenarios, which is essential for AGI: if we're going to have an AGI system, it needs to be able to perform in every single scenario at a general level above humans, and it's only going to do that if it has seen pretty much everything and anything there is to see. And as we discussed, the fourth point is that creating synthetic data allows for rapid iteration and scalability in model training: researchers can quickly generate and modify datasets to test different hypotheses or improve model capabilities, accelerating the path toward AGI by enabling faster advancements in model training and evaluation.
One thing I do remember from the video I made on the GPT-5 prediction date is that for GPT-4 they reportedly spent around six months just collecting data, and I believe they spent around the same amount of time collecting data for GPT-5. Now imagine they didn't need six months; imagine they could collect all the data for GPT-5 in around two months, or even one month, or 30 days. That would be incredible, because think about how much it shortens the timeline for new models: they'd be able to test many different iterations of GPT-5 with many different datasets. Essentially, this is why I'm saying that synthetic data could speed up the timeline.
Of course, there are some negatives that come with synthetic data, but I do believe data is the final bottleneck, and this is how we get past it. To conclude on Orca 2, the paper essentially says that by strategically training these models with tailored synthetic data, they have achieved performance levels that rival or surpass those of larger models, particularly on zero-shot reasoning tasks. Orca 2's success lies in its application of diverse reasoning techniques and the identification of optimal solutions for various tasks. While it has several limitations, including limitations inherited from its base models and those common to other language models, Orca 2's potential for future advancements is evident, especially in improved reasoning, specialization, control, and safety of smaller models. The use of carefully filtered synthetic data for post-training emerges as a key strategy in these improvements.
So the one thing I want to know now is: are we about to see the next evolution in AI, where everyone building these large language models starts focusing more on synthetic data because of the bottleneck of human data? Once you've captured all the data you can from humans, and it all fits on that one drive, you're going to need more data to make these models even better. Of course, there are different reasoning techniques and different ways to train a large language model, and people are trying different things, but I definitely believe that if we do manage to get into some kind of loop, just like we did with AlphaGo and self-play, that could present us with a very, very different scenario in terms of where we go in the future. So, what are your thoughts on this? I'll be doing another video entirely on synthetic data, because I do believe that's likely where we're headed.
Before ChatGPT, we weren't really able to generate coherent pieces of text easily, but now that we have systems like GPT-4 that can generate entire paragraphs, entire pages, of high-quality coherent text, I would love to see experiments feeding that data back into GPT-4 to see whether it improves its reasoning even further and accelerates the process even faster.
Video “Microsoft’s New AI Orca-2 Just Changed EVERYTHING! (Synthetic Data Breakthrough)” was uploaded on 11/27/2023 to Youtube Channel TheAIGRID