Can AI Finally See the World As We Do? LLaVA-CoT says "Yes"!
PLUS: Never Camera-Ready? This AI Just Became Your New Best Friend
Howdy fellas!
Spark and Trouble are on a mission, exploring a tech frontier where machines are learning to reason like never before.
The pieces are coming together, and the picture they're forming is pure innovation. Let's see how it all fits!
Here's a sneak peek into today's edition
Meet LLaVA-CoT - the AI revolutionizing how machines "see" and think
Unlock five profitable online business ideas with this prompt - no fluff!
Discover 5 tools that'll blow your mind - promise!
Meet Pickle: The AI that attends meetings for you
Time to jump in!
PS: Got thoughts on our content? Share 'em through a quick survey at the end of every edition. It helps us see how our product labs, insights & resources are landing, so we can make them even better.
Hot off the Wires
We're eavesdropping on the smartest minds in research. Don't miss out on what they're cooking up! In this section, we dissect some of the juiciest tech research that holds the key to what's next in tech.
Imagine Sherlock Holmes solving a mystery. He doesn't jump straight to conclusions; he observes, analyzes step by step, and then deduces the truth.
Similarly, the ability to "see and reason" is a superpower we wish AI could perfect. While large language models (LLMs) excel at logical reasoning and deduction, Vision Language Models (VLMs) often fumble when asked to connect the dots between visuals and textual queries.
Why? Because true reasoning requires more than just processing; it demands structured, step-by-step thinking.
That's where LLaVA-CoT (inspired by reverse engineering OpenAI o1's reasoning mechanisms) steps in, setting a new benchmark for reasoning-intensive tasks across vision and language.
So, what's new?
Ever since OpenAI's o1 model showed the potential for systematic reasoning, AI has been chasing the dream of logic-rich problem-solving. While LLMs like GPT mastered text-based reasoning, adding the visual layer with VLMs has been tricky. Models often stumble when required to think systematically about images and text together.
Think of tasks like Visual Question Answering (VQA) - answering a question based on an image. For example, "How many red apples are left after removing two green ones?" Models must break the question into steps, understand the image, and synthesize the final answer. Sounds simple? For machines, it's been a massive challenge!
Why? Most VLMs either:
Jump to conclusions without structuring their reasoning.
Struggle to connect the dots between text and image due to "domain gaps."
Unlike its predecessors, LLaVA-CoT doesn't rush to conclusions. Instead, it operates like Sherlock Holmes, dividing reasoning into four clear stages. This novel approach allows it to outperform many larger open-source models and even some closed-source models!
Forging the fundamentals
Before we unravel the magic behind LLaVA-CoT, let's decode some key jargon:
Chain-of-Thought (CoT) Prompting: A method that guides models to think step by step instead of blurting out answers. While it's popular for text-based tasks, adapting it for visuals is a tougher nut to crack.
Structured Reasoning: Unlike free-form responses, this involves breaking down tasks into stages like summarizing, analyzing visuals, reasoning, and concluding.
Inference Time Scaling: Techniques that spend extra compute at inference time - generating and comparing multiple candidate responses - so the model is more likely to land on the correct answer within a reasonable time budget. Common approaches include majority voting and best-of-N search, each offering a different way to boost the model's reasoning (see the quick sketch below). To understand these in detail with examples, click here
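To make those scaling tricks concrete, here's a minimal, self-contained Python sketch of majority voting and best-of-N search. The `generate_answer` function is a hypothetical stand-in for one sampled VLM response (any model's sampling call could slot in here):

```python
from collections import Counter
import random

def generate_answer(question: str) -> tuple[str, float]:
    """Hypothetical stand-in for one sampled model response.
    Returns (answer, a score the selector can rank by)."""
    answer = random.choice(["4 apples", "4 apples", "5 apples"])
    return answer, random.random()

def majority_vote(question: str, n: int = 8) -> str:
    """Sample n answers and return the most common one (self-consistency)."""
    answers = [generate_answer(question)[0] for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def best_of_n(question: str, n: int = 8) -> str:
    """Sample n answers and keep the one with the highest score."""
    candidates = [generate_answer(question) for _ in range(n)]
    return max(candidates, key=lambda pair: pair[1])[0]

question = "How many red apples are left after removing two green ones?"
print(majority_vote(question), best_of_n(question))
```

Majority voting rewards answers the model keeps converging on, while best-of-N trusts a separate scoring signal. LLaVA-CoT's stage-level beam search (covered below) applies the same idea, but one reasoning stage at a time.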
Under the hood…
Now, let's understand the key innovations that enable LLaVA-CoT to work its magic…
A (much-needed) clarification…
You might be tempted to think that LLaVA-CoT is built using LLaVA (Large Language and Vision Assistant) as the base model (intuitive, right?).
But NO! It is actually built upon the Llama-3.2-11B-Vision-Instruct model.
Multistage Reasoning Workflow
Instead of jumping to conclusions, LLaVA-CoT works in four structured stages:
Summary: It starts by summarizing the question and pinpointing the main problem to tackle. Think of this as laying out the game plan.
Caption: If there's an image involved, it describes the relevant visual details to help connect the picture to the question.
Reasoning: Next, it uses logic and structured thinking to explore the question deeply and figure out a potential answer.
Conclusion: Finally, it wraps up with the answer, tailored to the user's needs: brief for quick answers or detailed for in-depth explanations. The earlier stages stay behind the scenes, forming the foundation for this response.
This granular approach mirrors human reasoning and minimizes errors.
Using supervised fine-tuning, the model learns to think systematically, marking each reasoning stage with tags like <SUMMARY>, <CAPTION>, <REASONING>, and <CONCLUSION>. These markers ensure clarity and consistency.
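To picture what that tagged output looks like, here's a small Python sketch that splits a LLaVA-CoT-style completion into its four stages. The response text is invented for illustration, and we're assuming each tag has a matching closing tag:

```python
import re

# A made-up response in the four-stage format LLaVA-CoT is trained to emit.
response = (
    "<SUMMARY>The question asks how many red apples remain.</SUMMARY>"
    "<CAPTION>The image shows 6 red apples and 2 green apples in a bowl.</CAPTION>"
    "<REASONING>Removing the 2 green apples does not change the red count, "
    "so 6 red apples remain.</REASONING>"
    "<CONCLUSION>6</CONCLUSION>"
)

def split_stages(text: str) -> dict[str, str]:
    """Extract each reasoning stage into a dict keyed by stage name."""
    stages = {}
    for tag in ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        stages[tag] = match.group(1).strip() if match else ""
    return stages

parsed = split_stages(response)
print(parsed["CONCLUSION"])   # only the conclusion is surfaced to the user
```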
Smarter Training Dataset
To teach LLaVA-CoT its new tricks, researchers built the LLaVA-CoT-100k dataset - a collection of annotated images, questions, and reasoning steps, using examples from benchmarks like MathVista and AI2D (these target both general-domain and science-related visual question answering).
Since no VLM had used such structured reasoning before, the researchers had to produce these reasoning steps synthetically. To do this, they leveraged GPT-4o to generate detailed reasoning processes and compiled them into the LLaVA-CoT-100k dataset.
Clever use of GPT-4o to augment the original VQA samples with detailed reasoning steps (source: LLaVA-CoT paper)
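For intuition, here's a rough sketch of what one such augmentation call could look like with the OpenAI Python SDK. The prompt wording, helper name, and overall flow are our own simplification for illustration, not the paper's exact pipeline, and it assumes an `OPENAI_API_KEY` is set in the environment:

```python
from openai import OpenAI

client = OpenAI()

STAGE_PROMPT = (
    "You will be given a visual question and its short ground-truth answer. "
    "Rewrite the solution as four tagged stages: <SUMMARY>, <CAPTION>, "
    "<REASONING>, and <CONCLUSION>. The <CONCLUSION> must match the answer."
)

def augment_sample(question: str, answer: str, image_url: str) -> str:
    """Ask GPT-4o to produce structured reasoning for one VQA sample."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": STAGE_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": f"Question: {question}\nAnswer: {answer}"},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
    )
    return completion.choices[0].message.content
```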
Stage-Level Beam Search for Inference Time Scaling
Stage-level beam search helps LLaVA-CoT pick the best possible reasoning at each step of its structured problem-solving process, ensuring more accurate and reliable outcomes.
At each of the 4 stages of the reasoning process, the following happens:
The model generates several possible responses (like brainstorming multiple ways to approach a problem). These responses are referred to as candidate outputs.
The model evaluates these candidates based on their quality and coherence. It selects the best-performing response that aligns with the task requirements.
The chosen response from the current stage serves as input for the next stage.
The process repeats for all four stages until the conclusion is reached.
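Put together, the loop looks roughly like the sketch below. Here `generate_candidates` and `pick_best` are hypothetical stand-ins for the model's sampling and candidate-scoring steps:

```python
STAGES = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")

def generate_candidates(context: str, stage: str, n: int = 4) -> list[str]:
    """Hypothetical: sample n candidate continuations for one stage."""
    return [f"<{stage}>candidate {i} given: {context[-40:]}</{stage}>" for i in range(n)]

def pick_best(candidates: list[str]) -> str:
    """Hypothetical: score the candidates and keep the strongest one."""
    return max(candidates, key=len)  # placeholder scoring rule

def stage_level_beam_search(question: str) -> str:
    """Build the answer stage by stage, keeping only the best candidate each time."""
    context = question
    for stage in STAGES:
        best = pick_best(generate_candidates(context, stage))
        context += best  # the winning stage becomes input for the next stage
    return context

print(stage_level_beam_search("How many red apples are left?"))
```

Because only the best candidate from each stage is carried forward, the extra compute grows with the number of stages rather than with rewriting entire answers from scratch.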
Here's how Stage-Level Beam Search stacks up against other inference-time scaling techniques:

| Aspect | Stage-Level Beam Search | Best-of-N Search | Sentence-Level Beam Search |
|---|---|---|---|
| Granularity | Operates at the stage level, considering the reasoning process in structured chunks. | Operates on complete outputs, generating multiple full answers and picking the best. | Operates on individual sentences, evaluating each one in isolation. |
| Flexibility | Allows fine-grained control at each reasoning stage, improving overall coherence. | Limited flexibility; it picks the best answer from pre-generated outputs. | Highly granular, often leading to fragmented or inconsistent reasoning. |
| Performance | Strikes a balance between quality and compute, achieving high accuracy with fewer errors. | Often less efficient, as generating full outputs can be computationally expensive. | Can introduce errors by focusing too narrowly on sentences rather than overall logic. |
| Scalability | Scales well with more candidates at each stage, enhancing performance further. | Not inherently scalable without significant compute costs. | Granularity can limit scalability for complex, open-ended tasks. |
An intuitive representation of Stage-level beam search, against other methods (source: LLaVA-CoT paper)
Why does this matter?
On six visual reasoning benchmarks, LLaVA-CoT delivered a +6.9% improvement over its base model. It even outperformed heavyweight closed-source models like GPT-4o-mini!
However, the implications of LLaVA-CoT stretch far beyond beating these benchmarks. Imagine an AI tutor that explains geometry diagrams step by step or a customer support bot that logically identifies issues from images. LLaVA-CoT makes such applications a reality, bridging the gap between vision and language reasoning.
Here's what makes it a game-changer:
Scalability: Need better accuracy? Scale up inference with stage-level beam search.
Accessibility: With the code and datasets publicly available, researchers can now train and finetune their own reasoning-focused VLMs.
Versatility: From solving math problems to visual debugging, the possibilities are endless.
Spark & Trouble are already imagining a future where AI not only sees the world but truly understands it.
Ready to explore LLaVA-CoT further?
➤ Check out the GitHub repository
➤ Dive into the research paper for more specifics
10x Your Workflow with AI
Work smarter, not harder! In this section, you'll find prompt templates & bleeding-edge AI tools to free up your time.
Fresh Prompt Alert!
Ever dreamed of ditching your 9-to-5 and launching an online business that actually pays the bills? Tired of scrolling through endless "get-rich-quick" schemes?
We've got your back!
This prompt is your secret weapon to uncovering not just one, but FIVE killer online business niches that could be your ticket to entrepreneurial freedom.
Buckle up, future business rockstar - your side hustle roadmap starts right here!
As an online business consultant, your task is to help an aspiring entrepreneur identify 5 profitable niches for starting an online business.
Consider factors such as market demand, competition, target audience, and potential for growth.
For each niche, provide a brief description of the products or services that could be offered, the target customer demographics, and the potential marketing strategies to reach them.
Suggest low-cost ways to validate the niche ideas and test their viability before investing significant time or money.
5 AI Tools You JUST Can't Miss
Starizon: Your browser assistant for seamless interaction, data extraction, and monitoring
Design Buddy: Your full-time AI-powered design assistant
Lovable: Your superhuman full-stack developer, converting ideas into apps in seconds
Gecko Security: AI-powered security engineer that finds and fixes vulnerabilities in your codebase
ElevenLabs: AI audio platform to create the most realistic speech
Spark 'n' Trouble Shenanigans
Ever felt camera-shy on Zoom? Or wish you could sip coffee in your pyjamas while your digital twin handles the call?
Buckle up, because Pickle just crashed the virtual meeting scene!
This AI wizardry lets you create a personalized digital clone that syncs perfectly with your voice - like magic, but techier.
Spark & Trouble are geeking out over the lip-sync precision (we're talking under 0.5 seconds!). It's privacy-first, camera-optional, and absolutely mind-blowing.
Ready to give your webcam the day off? Check out Pickle here, and thank us later!
Well, that's a wrap! Until then,