Medical AI can’t yet handle complex cases, but multimodal large language models like GPT-4o are changing that. The future of medicine is closely tied to artificial intelligence (AI). Although this revolution has been in the works for years, recent months have seen significant progress as AI moves from specialized labs into our everyday lives.
This revolution has picked up speed as major tech companies release their multimodal large language models, promising they will soon be accessible to everyone. The latest major development is OpenAI’s announcement of GPT-4o. The model is described as “natively multimodal,” a feature also claimed for Google’s Gemini at its launch. However, while general users still don’t have access to Gemini’s multimodal features, GPT-4o’s multimodality is already available, in limited form, even to free ChatGPT accounts.
So, how did we get here, and why does it matter? Let’s look at the journey of the past 18 months and at what the future holds to understand the significance!
The public debut of large language models (LLMs) has been hugely successful: ChatGPT became the fastest-growing consumer application ever. LLMs are machine learning models trained on vast amounts of text data, enabling them to understand and generate human-like text based on learned patterns and structures. They differ significantly from previous deep learning methods in their scale, capabilities, and potential impact.
Large language models will soon be used in everyday clinical settings, driven by the global shortage of healthcare personnel: AI can take over tasks that do not require skilled medical professionals. But even before that happens, and before we establish a robust regulatory framework, we are already seeing this new technology being used in daily life.
To better understand what lies ahead, let’s explore another key concept that will significantly transform medicine: multimodality.
Doctors and Nurses Are Supercomputers, Medical AI Is a Calculator
A multimodal system can process and interpret multiple types of input data, such as text, images, audio, and video, simultaneously. Current medical AIs process only one type of data, for example, text or X-ray images.
However, medicine is inherently multimodal, as are humans. To diagnose and treat a patient, a healthcare professional listens to the patient, reads their health files, looks at medical images, and interprets laboratory results. This is far beyond what any AI can do today.
The difference between the two is like comparing a runner to a pentathlete. A runner excels in one discipline, whereas a pentathlete must excel in multiple disciplines to succeed.
Most current LLMs are like runners: they are unimodal, meaning they can only analyze text. GPT-4 can analyze images and understand voice commands in the mobile app, as can GPT-4o, and both can also generate images. The rest of their multimodal capabilities are not yet available to everyday users. Other widely used LLMs, such as Google’s Gemini or Anthropic’s Claude, can interpret image prompts (such as a chart) but cannot yet generate image responses. Meanwhile, Google is reportedly pioneering the medical large language model arena with a range of models, including the latest: Med-Gemini.
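To make the contrast with unimodal systems concrete, here is a minimal sketch of a mixed text-plus-image prompt using OpenAI’s Python SDK, the kind of request GPT-4o already accepts. The image URL is a placeholder, and this is an illustration only, not clinical software:

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# One request mixing two modalities: a text question and an image.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what you see in this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/scan.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```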
From The Medical Futurist’s perspective, it’s clear that fully functional multimodal LLMs (M-LLMs) will arrive soon. Without them, AI cannot meaningfully contribute to the inherently multimodal nature of medicine and care. These systems will considerably reduce the workload of, but not replace, human healthcare professionals.
The Future Is M-LLMs
The development of M-LLMs will have at least three significant consequences:
- AI Will Handle Multiple Types of Content, from Images to Audio
An M-LLM will be able to process and interpret various kinds of content, which is crucial for comprehensive analysis in medicine. We could list hundreds of benefits of such a system, but here are a few, in five categories:
- Text analysis: M-LLMs will handle many administrative, clinical, educational, and marketing tasks, from updating electronic medical records to solving case studies.
- Image analysis: This broad area includes reading handwritten notes and analyzing medical images across specialties (radiology, ophthalmology, neurology, pathology, etc.).
- Sound analysis: M-LLMs will eventually monitor diseases, for example by checking heart and lung sounds for abnormalities, to support early detection (see the sketch after this list). Sounds can also provide valuable information in mental health and rehabilitation applications.
- Video analysis: An advanced algorithm will guide a medical student through virtual reality surgery training, showing how to aim precisely and move correctly. Video could also help detect neurological conditions or support patients who communicate in sign language.
- Complex document analysis: This includes assistance in literature reviews and research, analysis of medical guidelines for clinical decision-making, and clinical coding, among many other uses.
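As a sketch of what the sound-analysis item could look like at the feature level, here is a toy Python pipeline. The file name and the stand-in classifier are assumptions for illustration, not a validated diagnostic method:

```python
import librosa
import numpy as np

# Hypothetical pipeline: extract features from a heart-sound recording
# and pass them to a placeholder abnormality classifier.
y, sr = librosa.load("heart_sounds.wav", sr=4000)  # heart sounds are low-bandwidth
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # compact spectral summary
features = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def classifier(feature_vector: np.ndarray) -> str:
    """Placeholder: a trained model would go here, not this heuristic."""
    return "normal" if feature_vector.mean() < 0 else "refer for review"

# In an M-LLM setting, such features (or the raw audio itself) would be
# just one modality among many.
print(classifier(features))
```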
- It Will Break Language Barriers
These M-LLMs will facilitate communication between healthcare providers and patients who speak different languages by translating between them in real time, just as we’ve seen with GPT-4o’s live translation capabilities. The potential for removing language barriers during medical appointments is clear, as this exchange shows (a rough sketch of the translation step follows the dialogue):
- Specialist: “Can you please point to where it hurts?”
- M-LLM (Translating for Patient): “¿Puede señalar dónde le duele?”
- The patient points to the lower abdomen.
- M-LLM (Translating for Specialist): “The patient is pointing to the lower abdomen.”
- Specialist: “On a scale from 1 to 10, how would you rate your pain?”
- M-LLM (Translating for Patient): “En una escala del 1 al 10, ¿cómo calificaría su dolor?”
- Patient: “Es un 8.”
- M-LLM (Translating for Specialist): “It is an 8.”
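The translation step of such an exchange could look like the following sketch, again using OpenAI’s Python SDK as a stand-in for whichever M-LLM a clinic actually deploys. Real-time speech-to-text and text-to-speech are omitted, and the system prompt wording is an assumption:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def interpret(utterance: str, target_language: str) -> str:
    """Relay one conversational turn into the target language.

    A real clinical interpreter would wrap speech-to-text and
    text-to-speech around this call; this covers only the translation.
    """
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a medical interpreter. Translate the user's "
                    f"message into {target_language}, preserving clinical "
                    "meaning exactly. Reply with the translation only."
                ),
            },
            {"role": "user", "content": utterance},
        ],
    )
    return response.choices[0].message.content

# One turn of the dialogue above:
print(interpret("Can you please point to where it hurts?", "Spanish"))
```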
- Finally, It Will Bring Interoperability, Connecting and Harmonizing Various Hospital Systems
An M-LLM could serve as a central hub, facilitating access to the various unimodal AIs used in a hospital, such as radiology software, insurance-handling software, and Electronic Medical Records (EMR) systems. The situation today is as follows:
One company makes the radiology department’s software, which uses a particular AI format in daily work. Another company’s algorithm works with the hospital’s electronic medical records, and yet another third-party supplier builds AI to compile insurance reports. However, doctors typically have access only to the system for their own field: a radiologist can use the radiology AI, but a cardiologist cannot. And these algorithms don’t communicate with each other. If the cardiology department used an algorithm that analyzed heart and lung sounds, gastroenterologists or psychiatrists likely wouldn’t have access to it, even though its findings might be useful for their diagnoses as well.
The significant step will come when M-LLMs become capable of understanding the languages and formats of all these applications and helping people communicate with them. An average doctor will then be able to work easily with the radiology AI, the AI managing the EMRs, and whichever other AI systems the hospital uses.
This potential is essential because such a breakthrough won’t come about any other way: no single company will build such software, because none has access to the proprietary AI systems the others have developed. An M-LLM, however, will be able to communicate with each of these systems individually and, as a central hub, give doctors a tool of immense importance.
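Here is a toy sketch of the hub idea. The departmental systems below are placeholder functions, not real vendor APIs, and simple keyword routing stands in for the model-driven routing and format translation an actual M-LLM would perform:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical departmental AI systems; each would really be a vendor
# API with its own format. These names are illustrative assumptions.
def radiology_ai(query: str) -> str:
    return f"[radiology system] report for: {query}"

def emr_ai(query: str) -> str:
    return f"[EMR system] records matching: {query}"

def insurance_ai(query: str) -> str:
    return f"[insurance system] draft report for: {query}"

@dataclass
class Hub:
    """Central hub: routes a clinician's request to the right unimodal AI."""
    routes: dict[str, Callable[[str], str]]

    def ask(self, request: str) -> str:
        # A real M-LLM would decide the route itself and translate
        # between each system's input/output formats.
        for keyword, system in self.routes.items():
            if keyword in request.lower():
                return system(request)
        return "No matching system; escalate to a human."

hub = Hub(routes={"x-ray": radiology_ai, "record": emr_ai, "insurance": insurance_ai})
print(hub.ask("Pull up the chest X-ray report for patient 0042"))
```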
The transition from unimodal to multimodal AI is a necessary step to fully harness AI’s potential in medicine. By developing M-LLMs that can process multiple types of content, break language barriers, and facilitate access to other AI applications, we can revolutionize how we practice medicine. The journey from being a calculator to matching the supercomputers we call doctors is challenging, but it is a revolution happening right before our eyes.
To register for our next masterclass, please click here: https://linktr.ee/docpreneur