The Vision, Debugged
From Notes to X-Rays: Google's Med-Gemini is here to Read, Reason & Assist Doctors

PLUS: Prompt Templates for stunning AI Product Photography

Tezan Sahu & Sandra Anil
May 21st, 2024

Howdy fellas!

Welcome to another edition of our newsletter. Spark & Trouble are here again, to drop some serious knowledge bombs about AI that'll leave you starry-eyed. So, let’s get started.

Here’s a sneak peek into this edition 👀

Capabilities, novelties & caveats of Med-Gemini, the latest AI dropped in the biomedical domain
3 Awesome AI Tools that you JUST can’t miss out on!
Prompt templates to create kickass product showcase photos using FREE AI tools that beat Stock Images

Time to jump in!😄

PS: Got thoughts on our content? Share 'em through a quick survey at the end of every edition. It helps us see how our product labs, insights & resources are landing, so we can make them even better.

Hot off the Wires 🔥

We're eavesdropping on the smartest minds in research. 🤫 Don't miss out on what they're cooking up! In this section, we dissect some of the juiciest tech research that holds the key to what's next in tech.⚡

Spark, just like Sheldon & the guys from Big Bang Theory (BBT fans in the house, make some noise!), is a die-hard fan of Star Trek. And for some reason, she’s obsessed with that tricorder capable of scanning a patient & providing a diagnosis almost instantly.

Medical tricorder from Star Trek, in use (source: bitrebels.com)

If you too are fascinated by this, you’re going to love this innovation by Google, which boasts not just state-of-the-art clinical reasoning but also the power to process new ‘medical’ modalities (think ECG signals, etc.) along with tons of information (in the form of images, long videos, numerous health records, etc.) to intelligently diagnose patients.

Who knows, it could serve as a solid foundation to build something even better than the beloved tricorder (well, only if the remaining snags are cleared up - that’s Trouble’s take on it).

So, what’s new?

The use of AI & LLMs in healthcare has always been a matter of great fascination (& debate, of course!). Models fine-tuned on medical data (like Med-PaLM 2, Flamingo-CXR & AMIE) have shown promising results (& often outperformed primary care physicians) in medical question answering, report generation & text-based consultations with patients.

However, there are some kinks to iron out…

Medical knowledge is on fast-forward: Medical AI needs to stay on top of the latest research, but current ‘trained’ models struggle to keep up.
Tricky medical terms can trip up AI: Words like "diabetic nephropathy" and "diabetes mellitus" sound similar but have different meanings. Confusing them could lead to big mistakes.
LLMs miss the whole picture: Most models can only focus on a small amount of information (limited context window), while medical applications require analyzing a lot of details.
Long & messy medical records: Medical records can be a jumble of checklists, scribbles, reports, and images. AI needs to sort through this mess to find what matters.

This is where Google chimes in, trying to exploit the capabilities of its powerful Gemini models by providing medical specialization in various forms (discussed next), thereby creating a family of highly capable Med-Gemini models.

Under the hood…

First things first - It’s important to clarify that Med-Gemini is NOT just 1 fine-tuned generalist model. Instead, it is a family of models, each leveraging the inherent capabilities of specific Gemini models, further optimized for different capabilities and application-specific scenarios.

From Gemini to Med-Gemini (source: created by authors)

During this process, these researchers introduce several novel training techniques & curated datasets. However, before understanding these, let’s equip ourselves with some jargon that might be helpful:

Fine-Tuning vs Prompting: Think of an LLM as a chef who knows many recipes (general knowledge). Fine-tuning is like sending him to a pizza school (specific task), while prompting is like giving him a detailed recipe (instructions) to get a desired dish.

Instruction Fine-Tuning: This is like training a language model by giving it lots of examples of instructions or commands, and their outcomes.

Self-Training: This training strategy is used especially when you have less training data, where the model improves iteratively by using its own outputs as new training data.

Chain-of-Thought (CoT): This involves prompting the LLM to first explain its reasoning step-by-step, to arrive at the final answer. This technique has shown significant improvements compared to generating answers directly.

Awesome! Now it’s time to unravel the mysteries behind Med-Gemini’s phenomenal capabilities…

Self-Training with Web Search

The process of self-training with Web Search (source: created by authors)

This novel technique is used to obtain Med-Gemini-L 1.0 from Gemini 1.0 Ultra. It basically involves the following:

First, Med-Gemini-L 1.0 crafts apt search queries for the questions & gets results from a web search API
Experts curate “seed” demonstrations with accurate clinical reasoning (picking relevant stuff from search results) to arrive at ground-truth answers
With these, Med-Gemini-L 1.0 is asked to generate CoT results, both w/ & w/o web search results
The generated CoTs are filtered (only the ones leading to ground-truth answers are retained) & used to fine-tune the model

This process is repeated, over training data spanning medical Q&A & medical note summarization, until the model aces the tasks.
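To make the filtering step concrete, here's a minimal Python sketch of one round of the loop described above. It is only an illustration under our own assumptions: `Example`, `collect_training_cots` & `toy_generate` are hypothetical names, and the toy generator stands in for the actual model (the fine-tuning step itself is not shown).

```python
from typing import Callable, List, NamedTuple, Tuple

class Example(NamedTuple):
    question: str
    search_results: str  # text retrieved from web search
    truth: str           # ground-truth answer

# Hypothetical model interface: (question, context) -> (reasoning, answer)
Generator = Callable[[str, str], Tuple[str, str]]

def collect_training_cots(examples: List[Example], generate: Generator) -> List[dict]:
    """Generate CoTs with & without search results; keep only the correct ones."""
    kept = []
    for ex in examples:
        for context in (ex.search_results, ""):  # w/ search, then w/o search
            reasoning, answer = generate(ex.question, context)
            if answer == ex.truth:  # filter: must reach the ground-truth answer
                kept.append({"question": ex.question, "context": context,
                             "cot": reasoning, "answer": answer})
    return kept

def toy_generate(question: str, context: str) -> Tuple[str, str]:
    # Stand-in for the model (assumption): it is only right when given context.
    if context:
        return ("The search results point to option B.", "B")
    return ("Answering from memory.", "A")

examples = [Example("Which option is correct?", "snippet supporting B", "B")]
kept = collect_training_cots(examples, toy_generate)
print(len(kept), kept[0]["cot"])
```

In a real setup, the retained CoTs would become the fine-tuning data for the next round, and the loop would repeat with the improved model.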

A sample question from the MedQA dataset, which was used to fine-tune Med-Gemini

Uncertainty-Guided Search (at Inference)

Answering a medical question through Med-Gemini-L 1.0 involves the following process:

Med-Gemini-L 1.0 explores multiple CoT reasoning paths to solve the medical question
It also calculates the overall "uncertainty" for the question based on the reasoning paths.
If this uncertainty is low, the majority vote is returned as the answer.
Else, it generates 3 relevant search queries to help resolve conflicts (& thus, reduce uncertainty)
After submitting these queries to a web search engine, the results are incorporated into the model’s input prompt for the next iteration

This iterative process with targeted web searches helps Med-Gemini-L 1.0 arrive at a more confident and accurate answer during inference.
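The loop above can be sketched in a few lines of Python. Treat this as a rough illustration, not Google's implementation: we assume uncertainty is measured as the Shannon entropy of the votes, and the threshold, the round limit, and the `sample_answers` & `search` callables (which stand in for the model's reasoning paths and for query generation plus web search) are all hypothetical.

```python
import math
from collections import Counter
from typing import Callable, List

def vote_entropy(answers: List[str]) -> float:
    """Shannon entropy (in bits) of the answer votes; 0 means full agreement."""
    total = len(answers)
    probs = [count / total for count in Counter(answers).values()]
    return sum(p * math.log2(1 / p) for p in probs)

def answer_with_search(question: str,
                       sample_answers: Callable[[str, str], List[str]],
                       search: Callable[[str, List[str]], str],
                       threshold: float = 0.5,   # assumed cut-off
                       max_rounds: int = 3) -> str:  # assumed iteration cap
    context = ""
    answers = sample_answers(question, context)  # multiple reasoning paths
    for _ in range(max_rounds):
        if vote_entropy(answers) <= threshold:
            break  # low uncertainty: accept the majority vote
        # High uncertainty: fetch search results & add them to the prompt
        context += search(question, answers) + "\n"
        answers = sample_answers(question, context)
    return Counter(answers).most_common(1)[0][0]

# Toy stand-ins (assumptions): the votes are split until a guideline is retrieved.
def toy_sample(question: str, context: str) -> List[str]:
    return ["B"] * 4 if "guideline" in context else ["A", "B", "A", "B"]

def toy_search(question: str, answers: List[str]) -> str:
    return "guideline snippet favouring B"

print(answer_with_search("Which option is correct?", toy_sample, toy_search))
```

The key design choice is that search is triggered only when the reasoning paths disagree, so confident questions skip the extra retrieval cost.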

Uncertainty-Guided Web Search (source: Med-Gemini paper)

Using this technique, Med-Gemini boasts 91% accuracy, beating the current SoTA, GPT-4. We acknowledge this feat. However, it’s critical to note that this number corresponds to just 1 benchmark (and a relatively easier one at that)! Although Med-Gemini beats SoTA models across several other benchmarks, the absolute accuracy numbers may not be too impressive.

Multimodal Finetuning

To specialize Gemini 1.5 Pro’s multimodal reasoning, it was instruction fine-tuned on a range of image-to-text datasets that included:

Visual Question Answering in radiology & pathology
Image captioning across diverse scans - CT, ultrasound, X-Ray, PET & MRI
Medical MCQ answering on dermatology & chest X-rays
Chest X-Ray report generation

The result was none other than Med-Gemini-M 1.5, which outperformed most of the SoTA models across these benchmarks & was capable of engaging in comprehensive multimodal medical dialogue. All this is awesome, but the actual accuracy numbers (55-95% depending on the benchmarks) indicate how far we are from deploying such models in real-world use cases.

Med-Gemini-M 1.5 demonstrates its ability to analyze a chest X-ray (CXR) & engage in a dialogue with a primary care physician (source: Med-Gemini paper)

Customized Encoders

This is an interesting proof-of-concept proposed by this research, motivated by the fact that apart from images & text, a bunch of biomedical signals can be leveraged if such models are to be used in wearable devices, etc.

To demonstrate this, Gemini 1.0 Nano (Google’s LLM for on-device AI experiences) was augmented with a special encoder to understand the 12-channel ECG waveform input (a new modality) & then instruction finetuned on question-answering tasks involving ECGs.

Question answering capability of Med-Gemini-S 1.0 using 12-lead ECG signal (source: Med-Gemini paper)

The results beat the current SoTA model, showing promise with such an approach. However, achieving a meagre 57.7% accuracy on this benchmark is underwhelming!

Chain-of-Reasoning (CoR)

Sifting through long electronic health records (EHR) & notes to dig out critical information can be time-consuming. If automated with good precision & recall, this could be a game-changer for clinicians. Motivated by this, researchers formulated a “needle-in-the-haystack” problem, to be solved by Med-Gemini using a 2-step chain-of-reasoning approach:

Problem formulation: Given a collection of unstructured EHR documents for a patient, retrieve the evidence for the presence/absence of a medical condition, where the condition is supported by only 1 evidence snippet in the entire collection
CoR Approach:
- Step 1: Prompt Med-Gemini-M 1.5 to retrieve all snippets of evidence related to the problem, given 1 example of how to do so (1-shot in-context learning)
- Step 2: Prompt Med-Gemini-M 1.5 to determine the presence of the problem based on evidence snippets
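Here's a minimal Python sketch of how such a 2-step prompt chain could be wired up. The prompt wording, the `chain_of_reasoning` helper & the keyword-matching `toy_llm` are all our own assumptions for illustration; the paper's actual prompts are not reproduced here.

```python
from typing import Callable, List, Tuple

STEP1 = (
    "Step 1: Quote every snippet in the notes that is evidence about: {condition}\n"
    "Example: {example}\n"
    "Notes:\n{notes}"
)
STEP2 = (
    "Step 2: Based only on the evidence, is {condition} present or absent?\n"
    "Evidence:\n{evidence}"
)

def chain_of_reasoning(condition: str, notes: List[str], example: str,
                       llm: Callable[[str], str]) -> Tuple[str, str]:
    # Step 1: retrieve evidence snippets (the single example makes it 1-shot)
    evidence = llm(STEP1.format(condition=condition, example=example,
                                notes="\n".join(notes)))
    # Step 2: decide presence/absence using only the retrieved evidence
    verdict = llm(STEP2.format(condition=condition, evidence=evidence))
    return evidence, verdict

def toy_llm(prompt: str) -> str:
    # Hypothetical stand-in for the model: plain keyword matching, no real reasoning.
    lines = prompt.split("\n")
    if prompt.startswith("Step 1"):
        note_lines = lines[lines.index("Notes:") + 1:]
        return "\n".join(n for n in note_lines if "levothyroxine" in n)
    evidence_lines = lines[lines.index("Evidence:") + 1:]
    return "present" if any(e.strip() for e in evidence_lines) else "absent"

notes = ["2019: seen for ankle sprain.",
         "2021: started levothyroxine for low thyroid function."]
evidence, verdict = chain_of_reasoning(
    "hypothyroidism", notes, "asthma -> 'uses salbutamol inhaler'", toy_llm)
print(evidence, "=>", verdict)
```

Splitting retrieval from the final decision means the second prompt only sees a handful of snippets instead of the full record, which also makes the verdict easier for a clinician to audit.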

Med-Gemini’s performance is on par with the manual baseline created by human raters for this task, which highlights its exceptional long-context processing capabilities & is pretty encouraging given the original motivation.

Why does this matter?

Med-Gemini’s prowess across a wide range of medical tasks can be considered a great leap in the direction of infusing AI in medicine. In fact, on certain long text generation tasks, clinicians preferred Med-Gemini’s responses over those of human experts by a wide margin. This could be a huge productivity boost for clinicians & doctors.

Clinicians preferred Med-Gemini’s responses over human experts on several occasions (source: Med-Gemini paper)

The multimodal & long-context processing capabilities of Med-Gemini enable it to ace tasks like video question answering and identifying segments in videos that match a text query. Beyond these benchmarks, researchers also demonstrate the following use cases:

Clinicians having conversations based on medical videos
Summarizing long EHR records
Biomedical literature review

Such capabilities could soon transform medical education & assistive aids for doctors into rich & immersive experiences.

And now, for the concerns…

Though Med-Gemini is a significant leap forward across a range of tasks, it is far away from being “usable”, and Med-Gemini researchers acknowledge it.

Accuracy concerns: The low absolute accuracy scores across several benchmarks are a testament to the fact that a lot more kinks need to be ironed out before any real-world deployment (well, if you observe, this has been a concluding statement in most of Google’s recent papers).

Limited performance: Med-Gemini, like most other LLMs, has scored really well on “closed” Q&A tasks (MCQs of sorts). Under “open” settings, allowing free-form responses, although it beats previous benchmarks, the absolute numbers are way too low!

Safety risks: The medical domain calls for a very low tolerance for errors. Under these settings, possible hallucinations by models like Med-Gemini could be life-threatening. In fact, before the announcement of Med-Gemini, some Indian researchers had published a paper exploring such problems with Gemini models.

Lack of transparency: Med-Gemini’s predecessor, Med-PaLM, was reportedly tested by Mayo Clinic in 2023, and since then there’s been no communication about the results. Even in the Med-Gemini paper, there is no mention of such clinical trials. This feels a little disturbing.

Spark & Trouble are super excited about this advancement of AI in the field of medicine and hope that such technology is meticulously tested by doctors & clinicians for improved reliability & safety, so that someday doctors could have most of their ‘mundane’ tasks automated by an AI assistant, allowing them more time to empathetically diagnose patients.

10x Your Workflow with AI 📈

Work smarter, not harder! In this section, you’ll find prompt templates 📜 & bleeding-edge AI tools ⚙️ to free up your time.

Fresh Prompt Alert!🚨

Wish to land a job at your dream company? Mock interviews are the best way to prepare. Along with actual mentors who can help you ace your interviews, you can now leverage AI at your fingertips to spin up a mock interview setting, curated for the role you aspire to!

Check out the prompt below 👇

Act as a strict interviewer for a [role] at [company], having the following job description:
[insert job description]

For this role, conduct a mock interview for me by asking a series of interview questions, relevant for a [type of round - technical/behavioral/AA] round.

Ensure that the questions are thought-provoking (they may require discussion). Also, ensure that the questions can all be answered within [time].

* Replace the content in brackets with your details

Pro Tip: Use this prompt on Microsoft Copilot, Claude or Gemini (basically anything with access to the web) to leverage the trends & questions from more recent actual interview experiences of candidates.

3 AI Tools You JUST Can't Miss 🤩

🖼️ Krea - The AI that transforms ideas into stunning visuals with real-time interaction
💼 CareerFlow - Land your dream job with your AI career copilot
🦜 PenParrot - Use AI anywhere on the web to magically write your emails, posts, tweets, articles, and texts

Spark 'n' Trouble Shenanigans 😜

Stock photos? Nah.
AI-Generated product images? Oh yeah! 🤩

Let's take a look at how you can develop custom, on-brand images to showcase your products in today’s AI era without any professional setup, using just a FREE tool like Microsoft Designer, like these…