Google's New Medical AI Just SHOCKED The Entire INDUSTRY (BEATS Doctors!) AMIE – Google
Google’s new AI system, AMIE (Articulate Medical Intelligence Explorer), has shocked the medical industry with its ability to diagnose patients and interact with them in a way that rivals that of human doctors. The system was trained using synthetic data through a self-play mechanism where it simulated medical scenarios and learned from its interactions. This method allowed AMIE to learn from a wide range of medical cases, including rare and complex conditions, and continuously improve its diagnostic accuracy and communication skills.
In a study comparing AMIE to 20 real primary care physicians, the AI system outperformed the doctors in diagnosing medical conditions, managing medical issues, communicating effectively, and building relationships with patients. The simulated consultations, featuring trained actors as patients, were designed like objective structured clinical examinations to ensure a fair evaluation of both the AI system and the human participants.
Overall, AMIE’s success in diagnostic reasoning and conversational abilities has marked a significant breakthrough in the field of medical AI, showing promise in transforming the way healthcare is delivered and raising questions about the future role of AI in the medical industry.
Watch the video by TheAIGRID
Video Transcript
So something that has really shaken up the AI industry and taken a lot of people by surprise is Google's new AI system, which they've dubbed the Articulate Medical Intelligence Explorer, or AMIE for short. Essentially, they've made an AI system that is really good at diagnosing patients and is arguably better than doctors. This video will do a deep dive on exactly what the research found and just how good this AI system is. You can see here that this is the introduction. It says: "Inspired by this challenge, we developed Articulate Medical Intelligence Explorer (AMIE), a research AI system based on an LLM and optimized for diagnostic reasoning and conversations. We trained and evaluated AMIE along many dimensions that reflect quality in real-world clinical consultations from the perspective of both clinicians and patients. To scale AMIE across a multitude of disease conditions, specialties and scenarios, we developed a novel self-play-based simulated diagnostic dialogue environment with automated feedback mechanisms to enrich and accelerate its learning process. We also introduced an inference-time chain-of-reasoning strategy to improve AMIE's diagnostic accuracy and conversation quality. Finally, we tested AMIE prospectively in real examples of multi-turn dialogue by simulating consultations with trained actors." So overall, what we have is a large language model system that got really, really good at conversing with patients and at offering a clear view of what the diagnosis could potentially be. Here is a short overview of how it works.
Essentially, AMIE was optimized for diagnostic conversations, asking questions that help to reduce its uncertainty and improve diagnostic accuracy, while also balancing this with other requirements of effective clinical communication, such as empathy, fostering a relationship, and providing information clearly. There is a GIF on Google's blog showing how a conversation with AMIE might go: someone mentions chest pain, and it responds by asking for more details, and so on. It's a very comprehensive system in terms of how the conversation flows, and later in the video I will show you one of the key examples compared against a real primary care physician, or PCP for short, so you can compare the results from the AI system with those of the PCP.

Now, one of the crazy things about this was the evaluation of the system. Under "Evaluation of a conversational diagnostic AI," the researchers explain that they wanted to determine how to effectively assess AI systems designed for medical diagnosis, because it is a pretty complex issue. So they created a unique evaluation system inspired by real-world methods used to assess doctors' communication and consultation
skills. This system evaluated several key areas: history taking (how well the AI gathers medical history from patients), diagnostic accuracy (the AI's ability to correctly diagnose health issues), clinical management (the AI's skill in managing medical conditions), clinical communication skills (how effectively the AI communicates with patients), and relationship building and empathy (the AI's ability to establish rapport with patients and show empathy). For the study design, the researchers set up a randomized, double-blind study in which patient actors had text-based consultations with either a real primary care physician (PCP) or the AI system.
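As a rough illustration of that design (the function names and data shapes here are my own invention, not from the paper), random arm assignment plus blinded rating might be sketched like this:

```python
import random

# Hypothetical sketch of the study's randomized, double-blind setup:
# each patient-actor scenario is randomly assigned to the PCP arm or
# the AI arm, and raters score transcripts without seeing the arm label.

def assign_arms(scenario_ids, seed=0):
    """Randomly assign each scenario to the 'PCP' or 'AI' arm."""
    rng = random.Random(seed)
    return {sid: rng.choice(["PCP", "AI"]) for sid in scenario_ids}

def blind(sessions):
    """Strip the arm label so specialist raters score transcripts blind."""
    return [{"session_id": i, "transcript": s["transcript"]}
            for i, s in enumerate(sessions)]

arms = assign_arms(range(6))
print(arms)  # mapping of scenario id -> arm, reproducible for a given seed
```

The fixed seed only makes the toy reproducible; the point is that diagnosis quality is judged on the blinded transcripts alone.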
These consultations were designed like an objective structured clinical examination (OSCE), a practical test often used to evaluate medical professionals; in an OSCE, clinicians interact with actors trained to portray patients with specific conditions. As for the interface, the consultations were conducted using a text chat tool similar to what many people already use to interact with AI systems. So in simple terms, the study involved testing AMIE by having it and real doctors conduct virtual consultations with actors pretending to be patients. The AI and the doctors were both evaluated on how well they gathered patient history, diagnosed conditions, managed medical issues, communicated, and built relationships with these patients, and the setup was very similar to the practical exams doctors take, ensuring a fair and effective assessment of both the human and AI participants.

Now there are some more things we need to look at, such as how this AI system was actually trained, because it introduced a very interesting approach: it seems this model was trained on synthetic data by training on conversations with itself. AMIE was initially trained using real-world data that included medical reasoning, summarization, and clinical conversations. Although the real-world data was good, it came with some struggles. Real-world data often doesn't cover all possible medical conditions and scenarios, which limits the AI's ability to learn about less common situations. There were also quality issues: data from real conversations can be messy, with unclear language, slang, interruptions, and other complexities of natural speech, making it hard for the AI to learn from. Essentially, this is where we get to the self-play nature of AMIE, which is fascinating. Later on I'll get to some of the results, because the initial results are really surprising: AMIE really does outperform these primary care physicians. But it's important first to understand how the system works. Essentially, this AI system, AMIE, did self-play.
Self-play is a method in artificial intelligence where a system essentially plays against itself; the technique is often used in training AI models, especially in complex decision-making scenarios. In AMIE, the purpose of self-play was to let the system simulate medical diagnostic conversations: it's like having an AI doctor practice by conversing with an AI patient, with both roles played by the system itself. So how does this actually work in AMIE? Number one is the simulation: AMIE plays both the physician and the patient in a simulated medical scenario, creating and responding to medical history questions, symptoms, and other relevant information in a dialogue format. Number two is the feedback loop: each interaction in the simulated dialogues provides feedback to the system, and this feedback helps AMIE improve its diagnostic capabilities, communication skills, and overall understanding of medical scenarios. Number three is the diverse scenarios: by simulating a wide range of medical cases and interactions, AMIE can learn from a vast array of medical situations, including rare or complex cases.

The benefits of self-play in medical AI are off the charts. A system can learn from thousands of medical scenarios in a very short period, which would be impossible in real-world settings; continuous learning through self-play improves the system's diagnostic accuracy as it adapts and updates its knowledge base; and there is scalability, because this method allows AMIE to scale its learning across various disease conditions and medical specialties without the need for real patient interactions. So AMIE's self-play mechanism was a critical component in training the AI for diagnostic dialogues, and the effectiveness of this approach was evident in the system's performance in simulated consultations, where it showed higher diagnostic accuracy and better conversation quality than primary care physicians.

So how on Earth did they actually test this model? If you're going to test an AI model against physicians, you need to do it in a pretty decent manner. AMIE's ability to diagnose medical conditions was tested in a study comparing the AI's performance with that of 20 real primary care physicians, and instead of real patients, trained actors played the role of patients. The simulated consultations helped in evaluating both AMIE and the primary care physicians.
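The self-play loop described above can be sketched very roughly in code. Everything here (the class names, the toy symptom-to-diagnosis matching, the scenario acting as its own answer key) is an invented illustration of the idea, not Google's implementation, which uses LLMs in both roles:

```python
# Toy sketch of self-play: an AI "doctor" converses with an AI "patient"
# and improves from feedback on each simulated case.

class PatientAgent:
    """Plays the patient for one scripted scenario."""
    def __init__(self, scenario):
        self.scenario = scenario  # {"symptom": ..., "diagnosis": ...}

    def reply(self, question):
        # Stand-in for an LLM describing the patient's complaint.
        return self.scenario["symptom"]

class DoctorAgent:
    """Plays the physician; learns a symptom -> diagnosis mapping."""
    def __init__(self):
        self.knowledge = {}

    def diagnose(self, symptom):
        return self.knowledge.get(symptom, "unknown")

    def refine(self, symptom, correct_diagnosis):
        # Feedback loop: incorporate the correction into the model.
        self.knowledge[symptom] = correct_diagnosis

def self_play_round(doctor, scenarios):
    """Simulate one consultation per scenario and apply feedback."""
    correct = 0
    for sc in scenarios:
        patient = PatientAgent(sc)
        symptom = patient.reply("What brings you in today?")
        if doctor.diagnose(symptom) == sc["diagnosis"]:
            correct += 1
        else:
            doctor.refine(symptom, sc["diagnosis"])
    return correct

scenarios = [
    {"symptom": "crushing chest pain", "diagnosis": "acute coronary syndrome"},
    {"symptom": "wheezing", "diagnosis": "asthma"},
]
doctor = DoctorAgent()
print(self_play_round(doctor, scenarios))  # 0 correct before any feedback
print(self_play_round(doctor, scenarios))  # 2 correct after one round of feedback
```

The key property the sketch illustrates is that no real patient data enters the loop: the scenarios, dialogues, and corrections are all generated and consumed by the system itself.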
Essentially, the study was randomized and blinded: it was set up so that neither the evaluators (the specialist doctors and the patient actors) nor the participants (AMIE and the PCPs) knew who was being assessed at any given time. In addition, the scenarios were extensive: 149 different medical cases were used in the study, sourced from OSCE (objective structured clinical examination) providers in Canada, the UK, and India, covering various medical specialties and diseases. The study also took a unique approach: it wasn't meant to copy the traditional in-person medical exams that doctors usually take, and it didn't mimic the typical ways doctors communicate with patients over text, like email, chat, or telemedicine. Instead, it was designed to reflect how people commonly interact with large language models today, through text-based communication, which is seen as a scalable and familiar way for AI systems to engage in remote medical discussions. So essentially, actors played patients, AMIE and real doctors had to figure out what was wrong with them, and the test was kept secret: neither the actors nor the doctors knew who was being tested. They used many different health problems from all over the world, in a way that's akin to how we usually use these LLMs. Now we get to the interesting part:
the performance. One of the fascinating things from this study was this chart right here, but I think the most shocking thing of all was not that the AI was better than the primary care physicians, but that the AI alone was the best. When we see AI systems doing things better than humans, we usually expect humans working with AI to be better still than AI systems on their own. Yet across the four conditions in the chart, the unassisted clinician performed the lowest, the clinician assisted by search performed second lowest, the clinician assisted by AMIE performed really well, but AMIE alone performed the best. So this specific study showed that the AI system by itself was able to outperform these primary care physicians, which was rather shocking, because I would have thought that a clinician assisted by AMIE would definitely perform better than the AI system alone. Overall, I feel this research shows that these AI systems are genuinely useful, because the clinician assisted by AMIE did a lot better than the unassisted clinician. It's clear that these AI systems are really effective when trained for specific scenarios. Now, in addition to this, we want to look at the actual conversation examples, because when you look at how these conversations go, I
think you're going to find them rather fascinating. This is an example of the actual PCP, taken from the research paper. We can see that the patient is Patricia Wilson, a 45-year-old complaining about chest pain, and the PCP has listed ten probable diagnoses, which include acute coronary syndrome, GERD, and others, with the slots for numbers six through ten left blank. These probable diagnoses are essentially what the AI system is going to produce as well, but on the right here is what the PCP managed to deduce. You can see the conversation they're having, and I'm going to go back and forth between this conversation and the one with the AI to show you how different they look. This is the online text-based consultation with the primary care physician. It reads: "Hi doctor, how are you?" "I'm great, how can I help you?" "I recently experienced an episode of chest pain and discomfort. About two hours ago, while walking home from lunch with my daughter, I developed central chest pain along with some discomfort in the upper stomach." "I'm sorry to hear that. Are you still in pain?" "No, I'm not currently in pain." I'm not going to read all of these, but
it's important to look at this entire exchange to see how they responded, so we can compare it to the AI system. Now take a look at the AI system. Right away, it produced ten possible diagnoses for the chest pain, which is pretty useful, and some of them were similar to the doctor's. What was also fascinating, looking at the conversation, is that there was simply more of it: the conversations seem to be much longer, and just by reading them you can see that the AI system is most certainly more empathetic. For example, the AI system here says, "I'm sorry to hear you're experiencing this discomfort," and the way it asks its questions is truly fascinating. The AI clearly gives much longer responses than the primary care physician: the AMIE system has a decent paragraph here and another decent paragraph here, while the primary care physician's replies are one-liners. Then we can see the last message from AMIE, where it suggests exactly what care the patient should seek, saying, "I understand that this is worrying, but it's crucial to act fast, even if it turns out not to be a heart attack," and the patient actor responds. Then, of course, we have conversation quality. Essentially, in terms of
conversation quality, in the study both patient actors and specialist doctors rated AMIE's conversational skills higher than those of real doctors. This covered a wide range of aspects, including how well AMIE made patients feel at ease, listened, and explained conditions and treatments. In some areas, like respecting patient privacy and acknowledging mistakes, there was no significant difference between AMIE and real doctors. The study showed that the self-play training method did improve the quality of AMIE's simulated conversations. Now, in terms of diagnostic accuracy, "better than real doctors" holds up, because AMIE was better at figuring out medical diagnoses than the primary care physicians, as confirmed by the specialist doctors who evaluated the study. AMIE did well across medical fields, especially in the respiratory and cardiovascular specialties, and whether AMIE was analyzing its own conversations or those of real doctors, its ability to diagnose was consistently good. Both AMIE and the real doctors were equally efficient at gathering the necessary information for a diagnosis, and this is where the researchers said: "We observed that AMIE's performance matched or surpassed PCPs' performance for all specialties." So for the future, this brings about some very interesting discussions, because access to medical
expertise is limited globally, and AIs like AMIE show promise in specific areas, but engaging in real-world conversational medical diagnosis involves challenges not yet fully addressed by AI. Real doctors have skills, knowledge, and dedication to principles like safety, quality, communication, and professionalism, and AMIE is an experimental system exploring how AIs might eventually align with these clinician attributes. AMIE is not a finished product; it's early-stage research aimed at safely exploring what the potential of AI could be in healthcare, and it does hint at a future where conversational, empathetic AI systems could become safe, helpful, and accessible in healthcare, complementing human clinicians. According to some articles, medical errors are the third leading cause of death in the US, so if there are misdiagnoses that lead to deaths, this is definitely something AI could help solve. A study from Johns Hopkins found that more than 250,000 people in the United States die every year because of medical mistakes, making medical error the third leading cause of death after heart disease and cancer. That definitely could be a reason for these AI systems
to be more widely distributed. Of course, this AI system is not without its limitations, and the researchers say so themselves: their research has several limitations and should be interpreted with appropriate caution. Firstly, their evaluation technique likely underestimates the real-world value of human conversations, as the clinicians in the study were limited to an unfamiliar text-chat interface, which permits large-scale LLM-patient interactions but isn't representative of usual clinical practice. Any research of this type must be seen as only a first exploratory step on a long journey. Transitioning from the LLM research prototype evaluated in the study to a safe and robust tool that could be used by people, and by those who provide care for them, will require significant additional research. There are many important limitations still to be addressed, including experimental performance under real-world constraints and dedicated exploration of important topics such as health equity and fairness, privacy, robustness, and many more, to ensure the safety and reliability of this technology. Overall, I think this is a fascinating system. We previously discussed that 2024 was going to be the year we would start to see this kind of development, and so far it turns out that this is fairly
true, and the results are very promising, because clinicians really do seem to do better with AI. It's only a matter of time until these systems become fully developed products that help us in our daily lives. What I would like to see next from Google is combining this with some of their most capable vision models, to see whether AI systems can pair large-language-model capabilities with vision capabilities to visually identify certain illnesses, maybe earlier than doctors, and use the two together to help doctors do what they do best. That would be a very fascinating study to see. Something I noticed while researching this: if you go on Google Health, they have an entire page dedicated to AI-assisted diagnosis
in healthcare. It says "AI-enabled imaging and diagnostics previously thought impossible," and lists projects such as improving access to skin disease information, using AI to help doctors address eye disease, using AI to improve lung cancer detection, novel biomarkers for non-eye-related conditions, and using AI to improve breast cancer detection. Essentially, they use deep learning to identify these things because it's really good at pattern recognition. It's a complete overview of the research they're doing and how they've been using AI to impact the healthcare industry. I think this is a really good thing: while some technology innovations do result in a little bit of doom, using AI for the betterment of humanity, in terms of longevity, is rather fascinating, and I will leave links to all of this, because I think it's important content that should be shared. The last thing I will leave you with is Google's earlier breakthrough, Med-PaLM 2. If you haven't seen it before, it is groundbreaking research in health AI, and trust me, Med-PaLM 2 has shocked absolutely everyone; I honestly can't wait for a Med-PaLM 3 that could potentially be even better than this system. Rather than explain it myself, I'll leave a small clip from the Google Health Check Up event in 2023, where they explain exactly why Med-PaLM 2 is going
to be a game changer: "...of carefully curated medical expert demonstrations. In multiple-choice questions used for US medical licensing exams, the pass mark for new doctors is often around 60%. These questions have long been considered a grand challenge for AI systems; they require a clinician to recall medical knowledge and apply logic to identify the correct answer. Despite years of effort from leading AI labs around the world, performance on challenging tasks like this had plateaued at around 50%. Last December, our model Med-PaLM was the first AI system to exceed the pass mark; we reached a performance of over 67% on these licensing-exam-style questions. We also carefully examined how Med-PaLM performed on many other kinds of medical question-answering tasks, ranging from commonly asked internet search questions to complicated questions about medical research. In doing this work, we compared Med-PaLM's answers with answers from real clinicians, and we looked at several aspects like factual accuracy, bias, and the potential for harm. Today we're announcing results from Med-PaLM 2, our new and improved model. Med-PaLM 2 has reached 85% accuracy on the medical exam benchmark in research. This performance is on par with expert test takers; it far exceeds the passing score, and it's an 18% leap over our own state-of-the-art results from Med-PaLM. Med-PaLM 2 also performed impressively on Indian medical exams, and it's the first AI system to exceed the passing score on those challenging questions. There are many ways an AI system like Med-PaLM can be a building block for advanced natural language processing in healthcare, and we'd like to work with researchers and experts to advance this work."
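The scores quoted in that clip are internally consistent, as a quick check shows (the numbers below are taken from the clip itself; the variable names are mine):

```python
# Scores quoted in the clip, in percent accuracy on USMLE-style questions.
PASS_MARK = 60      # approximate pass mark for new doctors
prior_sota = 50     # where performance from leading AI labs had plateaued
med_palm = 67       # Med-PaLM, the first AI system to exceed the pass mark
med_palm_2 = 85     # Med-PaLM 2, "on par with expert test takers"

# The ordering the clip describes, and the "18% leap" it quotes.
assert prior_sota < PASS_MARK < med_palm < med_palm_2
print(med_palm_2 - med_palm)  # prints 18
```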
Video “Googles New Medical AI Just SHOCKED The Entire INDUSTRY (BEATS Doctors!) AMIE – Google” was uploaded on 01/16/2024 to Youtube Channel TheAIGRID