The Vision, Debugged
Chart Creation 📊 Just Got a Whole Lot Smarter with METAL
PLUS: Can AI Beat Super Mario? The Answer Will Shock You

Howdy, Vision Debuggers! 🕵
Spark and Trouble are back in the lab, aligning the pieces of a complex puzzle. The patterns are revealing themselves—are you ready to decode the bigger picture?
Here’s a sneak peek into today’s edition 👀
- Meet METAL – the AI multi-agent team that makes perfect charts. 
- Master any topic with this Feynman-inspired prompt 
- 5 powerful AI tools that will blow your mind 
- Can AI master Super Mario? The results are wild! 
Time to jump in!😄
PS: Got thoughts on our content? Share 'em through a quick survey at the end of every edition. It helps us see how our product labs, insights & resources are landing, so we can make them even better.

Hot off the Wires 🔥
We're eavesdropping on the smartest minds in research. 🤫 Don't miss out on what they're cooking up! In this section, we dissect some of the juiciest tech research that holds the key to what's next in tech.⚡
Remember those school projects where you spent hours trying to recreate a perfect bar chart or pie graph? While you struggled with alignment, colors, and proportions, you wished for a magical assistant who could just look at your sketch and instantly create a polished digital version.

Example of how LMMs assist scientists and researchers in understanding, interpreting and creating charts during the reading and writing of academic papers (source: ChartMimic website)
Well, that magical assistant is finally here – and it's getting smarter by the minute! Researchers have just unveiled METAL (Multi-agEnT frAmework with vision Language models for chart generation), a breakthrough approach that transforms the way AI creates charts. And yes, they definitely strained a few linguistic muscles creating that acronym!
Think of it like a well-coordinated team of designers, coders, and quality checkers working together to create the perfect chart. And the best part? It gets better with more computational power, thanks to its test-time scaling capabilities.
So, what’s new?
Traditional chart generation methods rely on monolithic models that attempt to handle everything, from visual interpretation to code generation, in one go. This often leads to errors in structure, color, and alignment, especially when dealing with complex reference charts.
The METAL framework tackles this challenge by breaking down the complex task of chart generation into smaller, specialized subtasks handled by different "agents" (AI models with specific roles). Instead of expecting a single system to handle everything perfectly, METAL creates a collaborative environment where multiple agents critique and refine each other's work – similar to how human teams might review and improve designs iteratively.
The cherry on top? METAL exhibits "test-time scaling" – we’ll discuss this in more detail later. For now, you can think of it as a fancy way of saying it gets better results when given more computational resources during testing, without requiring any additional training.
This is a big deal because it shows that the framework isn’t just effective—it’s scalable and future-proof.
Forging the fundamentals
Before we dive deeper, let’s break down some key concepts:
Vision-Language Models (VLMs): AI models that can understand and generate content based on both visual and textual inputs, allowing them to "see" and "read" simultaneously.
Chart-to-Code Generation: The process of converting a visual chart into code (e.g., Python) that can reproduce the chart programmatically.
Multi-Agent Framework: A system where multiple specialized AI agents collaborate to solve a complex task. Each agent has a specific role, like a team of specialists.
Test-time scaling: The phenomenon where a system's performance improves when given more computational resources during testing, without requiring additional training.
F1 Score: The harmonic mean of precision and recall. Precision is the share of predicted positives that are actually correct; recall is the share of actual positives the model catches. A high F1 score requires both to be high.
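To make the F1 definition concrete, here is a minimal sketch of the calculation from raw counts. The function name and the example counts are ours, purely for illustration; they are not taken from the METAL paper or its evaluation code.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when either is undefined."""
    predicted = true_positives + false_positives   # everything the model flagged
    actual = true_positives + false_negatives      # everything it should have flagged
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 chart elements matched, 2 spurious, 2 missed.
# Precision and recall are both 8 / 10 here.
example_f1 = f1_score(8, 2, 2)
print(round(example_f1, 4))
```

Because it is a harmonic mean, F1 drops sharply if either precision or recall is low, which is why it is a popular single-number summary.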
Under the hood…
METAL’s magic lies in its four specialized agents:
- Generation Agent (G): This agent takes the reference chart and generates an initial program (code) to reproduce it. Think of it as the architect who drafts the blueprint. 
- Visual Critique Agent (V): This agent compares the rendered chart with the reference and identifies visual discrepancies—like mismatched colors or misaligned axes. It’s the quality control inspector for visuals. 
- Code Critique Agent (C): This agent reviews the generated code for errors and inefficiencies. It’s the code reviewer who ensures the program is clean and functional. 
- Revision Agent (R): This agent integrates feedback from the critique agents and updates the code. It’s the editor who polishes the final draft. 

Overview of the METAL system for chart-to-code generation (source: METAL paper)
At the heart of this process is the Multi-Criteria Verifier, which ensures the generated chart meets strict quality standards. The verifier evaluates the chart based on three key metrics:
- Color (m1): Checks if the colors match the reference chart. 
- Text (m2): Ensures all labels, titles, and annotations are accurate. 
- Overall Structure (m3): Verifies the layout and alignment of chart elements. 
If the generated chart meets predefined thresholds for these metrics, the process stops early. If not, the agents keep refining until the thresholds are met or the maximum number of attempts is reached. This dynamic verification step helps METAL deliver high-quality outputs consistently.
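The early-stop check can be pictured as a simple threshold comparison. The sketch below is our own illustration: the class name, the threshold values, and the assumption that all three metrics must pass at once are hypothetical, not details confirmed by the paper.

```python
from dataclasses import dataclass


@dataclass
class ChartScores:
    color: float    # m1: how well colors match the reference
    text: float     # m2: how well labels, titles, and annotations match
    overall: float  # m3: how well layout and structure match


def passes_verifier(scores: ChartScores, thresholds: ChartScores) -> bool:
    """Return True when every metric meets its threshold (assumed rule)."""
    return (
        scores.color >= thresholds.color
        and scores.text >= thresholds.text
        and scores.overall >= thresholds.overall
    )


# Hypothetical thresholds and scores, for illustration only.
thresholds = ChartScores(color=0.9, text=0.9, overall=0.9)
first_try = ChartScores(color=0.95, text=0.70, overall=0.92)
after_revision = ChartScores(color=0.95, text=0.93, overall=0.92)

print(passes_verifier(first_try, thresholds))       # text falls short, keep refining
print(passes_verifier(after_revision, thresholds))  # all three pass, stop early
```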
What's clever is how this team is organized: The Generation Agent and Visual Critique Agent are powered by a VLM, while the Code Critique Agent and Revision Agent are purely text-based. This division of labor allows each agent to specialize in what it does best.
The process is iterative: the agents collaborate, critique, and refine until the generated chart meets a predefined quality threshold. This dynamic approach allows METAL to adapt to complex inputs and deliver highly accurate results.
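For the code-curious, the generate, critique, and revise cycle can be sketched as a plain loop. Everything here is a simplified stand-in: the function names, the signatures, and the default of five attempts are our assumptions, and the toy "agents" just manipulate strings. In the real system, these roles are played by model calls and the rendering step executes the generated plotting code.

```python
from typing import Callable, Tuple


def refine_chart_code(
    reference_chart: str,
    generate: Callable[[str], str],              # Generation Agent (G)
    render: Callable[[str], str],                # runs the code, returns a chart
    visual_critique: Callable[[str, str], str],  # Visual Critique Agent (V)
    code_critique: Callable[[str], str],         # Code Critique Agent (C)
    revise: Callable[[str, str, str], str],      # Revision Agent (R)
    is_good_enough: Callable[[str, str], bool],  # stand-in for the verifier
    max_attempts: int = 5,                       # hypothetical default
) -> Tuple[str, int]:
    """Return the final code and the attempt number at which the loop stopped."""
    code = generate(reference_chart)
    for attempt in range(1, max_attempts + 1):
        rendered = render(code)
        if is_good_enough(rendered, reference_chart):
            return code, attempt  # early stop: thresholds met
        visual_feedback = visual_critique(rendered, reference_chart)
        code_feedback = code_critique(code)
        code = revise(code, visual_feedback, code_feedback)
    return code, max_attempts  # budget exhausted, return the last revision


# Toy stand-ins: "charts" are strings and the only flaw is the bar color.
final_code, attempts = refine_chart_code(
    reference_chart="blue bars",
    generate=lambda ref: "red bars",
    render=lambda code: code,
    visual_critique=lambda rendered, ref: "bars should be blue",
    code_critique=lambda code: "no syntax issues",
    revise=lambda code, vis, cod: code.replace("red", "blue"),
    is_good_enough=lambda rendered, ref: rendered == ref,
)
print(final_code, attempts)
```

Passing the agents in as plain callables keeps the loop itself independent of any particular model, which mirrors the idea that each role can be filled by whichever model suits it best.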
What’s the Intrigue?
The researchers discovered two fascinating phenomena while testing METAL:

Graphs showing test-time-scaling capabilities of METAL across use of VLMs (source: METAL paper)
- The framework exhibits remarkable "test-time scaling": as the computational budget grew exponentially (from 2^9 to 2^13 tokens, that is, from 512 to 8,192), performance improved almost linearly. In other words, performance rose roughly in step with the logarithm of the budget. This suggests that simply giving the system more computational resources during inference can yield better results without requiring any additional training. 
- Separating the critique process into specialized agents for visual and code feedback significantly outperformed a unified critique approach. When researchers tried merging these functions into a single agent, performance actually decreased. It's a bit like people, who often do better work focusing on one task at a time rather than multitasking. 
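A quick way to see what "near-linear on a logarithmic budget" means: the token budgets double at each step, yet they are evenly spaced on a log scale. The snippet below only illustrates that arithmetic; the variable names are ours and it contains no performance numbers from the paper.

```python
import math

# Token budgets from 2^9 to 2^13: each one is double the previous.
budgets = [2 ** k for k in range(9, 14)]

# On a log2 scale, the same budgets are one unit apart.
log_budgets = [math.log2(b) for b in budgets]
steps = [later - earlier for earlier, later in zip(log_budgets, log_budgets[1:])]

print(budgets)  # [512, 1024, 2048, 4096, 8192]
print(steps)    # equal spacing on the log scale
```

So a near-linear trend against the log of the budget means each doubling of tokens buys roughly the same bump in quality, rather than each extra token doing so.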
Results speak louder than words
METAL was tested against several baselines, including Direct Prompting, Hint-Enhanced Prompting, and Best-of-N. Here’s how it fared:
- With LLaMA 3.2-11B as the base model, METAL achieved an 11.33% improvement in average F1 score over Direct Prompting. 
- With GPT-4o, METAL outperformed baselines by 5.2%, achieving an average F1 score of 86.46%. 
- The framework showed significant improvements in capturing text and layout details, proving its ability to handle both structural and fine-grained visual elements. 
These results highlight METAL’s robustness and generalizability across different base models and evaluation metrics.

A case study of the iterative refinements performed by METAL agents working together to recreate a reference chart (source: METAL paper)
So, how does it matter?
The implications of METAL are vast and transformative. Here’s where we might see it making waves:
- Financial platforms like Bloomberg and Yahoo Finance could automatically generate customized visualizations from existing charts in analyst reports. 
- Business intelligence tools like Tableau and Power BI might integrate METAL to allow users to simply upload reference images and generate interactive dashboards. 
- Research platforms like Overleaf and Jupyter could incorporate chart regeneration capabilities, helping researchers quickly reproduce visualizations from published papers. 
- Educational technology companies could develop tools that allow students to recreate complex scientific charts for better understanding and analysis. 
- Healthcare systems could benefit from easier visualization of medical data, enabling clinicians to quickly create custom views of patient information. 
The future of AI-powered visualization tools looks increasingly collaborative – not just between humans and AI, but between specialized AI agents working together as a team. Rather than building ever-larger monolithic models, METAL suggests that dividing complex tasks among specialized agents may be the key to more accurate and flexible results.
Wish to take METAL for a spin?
➤ Check out the detailed prompts for all the agents in the Appendix of the paper
➤ Play with their code implementation using their GitHub repository
What do you think about this multi-agent approach to chart generation? 
Would you trust an AI team to recreate your important visualizations? 
Share your thoughts with Spark & Trouble!

10x Your Workflow with AI 📈
Work smarter, not harder! In this section, you’ll find prompt templates 📜 & bleeding-edge AI tools ⚙️ to free up your time.
Fresh Prompt Alert!🚨
Ever tried explaining something only to realize you barely get it yourself? Yeah, us too.
That's where this week's Fresh Prompt Alert comes in!
With this Feynman Technique-inspired prompt, you'll break down tricky concepts like you're teaching a curious 5-year-old — clear, simple, and unforgettable. Whether you're cracking quantum physics or demystifying data pipelines, this prompt will sharpen your understanding like a pro. Ready to level up your learning game? Give it a go 👇
I would like to explore the concept of [insert specific topic or concept here] using the Feynman technique.
Please start by asking me a foundational question about this topic to assess my current understanding. Based on my response, guide me through a series of targeted questions that will help me clarify and deepen my knowledge. As we progress, please provide constructive feedback to help me identify any misconceptions or gaps in my understanding.
Additionally, encourage me to explain the concept in simple terms as if I were teaching it to someone else, so I can solidify my grasp on the material. Let's begin with the first question.
5 AI Tools You JUST Can't Miss 🤩
- Mochii AI: A personalized AI ecosystem with seamless multi-platform integration 
- Swatle: AI-powered project management platform 
- Promptize: A browser extension that gives anyone the powers of an expert prompt engineer 
- OpusClip AI Reframe: The easiest way to resize any video in one click 
- Pieces: Long-term memory agent that captures, preserves, and resurfaces historical workflow details, so you can pick up where you left off 

Spark 'n' Trouble Shenanigans 😜
Are we seriously benchmarking AI with Super Mario now?!
Yep, and the results are wild! Turns out even the smartest AIs can’t keep up when it’s time to dodge Goombas at warp speed. UC San Diego's Hao AI Lab put Anthropic’s Claude 3.7, GPT-4o, and Gemini 1.5 Pro through their paces using GamingAgent, and let’s just say… Claude 3.7 came out looking like Mario with a star power-up. Meanwhile, OpenAI’s fancy new reasoning model? Face-planted into a pit. 😆
Claude-3.7 was tested on Pokémon Red, but what about more real-time games like Super Mario 🍄🌟?
We threw AI gaming agents into LIVE Super Mario games and found Claude-3.7 outperformed other models with simple heuristics. 🤯
Claude-3.5 is also strong, but less capable of… x.com/i/web/status/1…
— Hao AI Lab (@haoailab)
7:33 PM • Feb 28, 2025
Spark thinks this is the perfect benchmark—real-time decision-making, quick reflexes, and chaos theory all rolled into one. Trouble’s just wondering when AI will finally master the art of speedrunning. 🤔
Karpathy’s talking about an “evaluation crisis” in AI—so maybe games are the ultimate test. After all, if your AI can’t handle Bowser, what hope does it have in the real world? 😋

Well, that’s a wrap! Until then,




