The Training That Makes AI Actually Useful (Post-Training Explained)

PLUS: Trouble's putting MCP to the test - in finance

Howdy, Vision Debuggers! 🕵️

Spark & Trouble have been overwhelmed by the flurry of jargon that's swept the LLM space over the last few months, so they spent the past week leveling up with some awesome resources and are now ready to share all their learnings with you…

Here's a sneak peek into today's edition 👀

  • Peek behind the curtain of AI's modern post-training methods

  • Unlock the power of AI with these 5 incredible tools

  • What if AI could take over half of your tasks? Try our Fresh Prompt

  • Model Context Protocol (MCP) in Action - Trouble's AI Agent Analyzes Financial Data

Time to jump in! 😄

PS: Got thoughts on our content? Share 'em through a quick survey at the end of every edition. It helps us see how our product labs, insights & resources are landing so we can make them even better.

Hot off the Wires 🔥

We're eavesdropping on the smartest minds in research. 🤫 Don't miss out on what they're cooking up! In this section, we dissect some of the juiciest tech research that holds the key to what's next in tech. ⚡

We talk a lot about how LLMs like GPT-4 can write poetry, code, and even answer legal questions, but have you ever wondered why they sometimes give you bizarre or factually incorrect answers? Why does an LLM sometimes ace a tricky math problem but fail miserably when asked to explain a simple concept?

The truth is, LLMs aren't inherently logical thinkers; they're just extremely good at predicting the next word based on statistical patterns.

And that's where post-training comes in. Post-training strategies have become the secret sauce behind the recent surge in LLM performance.

Did you know?

GPT-4, Claude 3.5 Sonnet, and other leading models don't emerge fully-formed from their initial training. They undergo extensive post-training refinement that can dramatically alter their capabilities. In fact, without these techniques, most modern LLMs would be significantly less helpful, accurate, and safe!

If you've heard buzzwords like fine-tuning, reinforcement learning, or test-time scaling but aren't quite sure what they mean or why they matter, you're not alone. Spark & Trouble were in the same boat until they came across this recent survey paper, which aims to comprehensively capture and organize the latest post-training techniques!

A glimpse of the array of LLM post-training techniques, categorized into 3 buckets, along with the actual LLMs leveraging those techniques (source: paper)

Now equipped with their findings, they're all set to break it down for you. So, let's dig in.

Forging the fundamentals

Before diving deeper, let's decode some essential terms:

Pre-training: The initial phase where models learn to predict the next token in a sequence from massive datasets, forming their base knowledge.

Post-training: All subsequent training that refines, specializes, or aligns the model after pre-training is complete.

Fine-tuning: Supervised learning on specific datasets to teach models particular skills or domain knowledge.

Reinforcement Learning: Training models by rewarding desired behaviors and penalizing undesired ones.

RL Policy: A playbook that guides an agent on what action to take in any given situation to get the best possible outcome.

Test-Time Scaling: Techniques applied during inference (when the model is actually being used) that don't change the model's parameters but improve its outputs.

Under the hood…

Post-training makes LLMs more useful and reliable: it's the difference between someone who knows a lot of words and someone who can hold a meaningful conversation.

Post-training strategies fall into three main buckets:

Fine-Tuning: Teaching LLMs New Tricks

Think of fine-tuning as personal training for a professional athlete. Pre-training makes the model strong, but fine-tuning makes it specialized and sharp. The following are some common fine-tuning techniques:

  • Supervised Fine-tuning (SFT): Think of this as sending your AI to a specialized school. If pre-training is general education, SFT is like medical school or law school: focused training on specific domains or tasks.

    • Example: A model pre-trained on general text being fine-tuned on medical literature and patient records to better assist with healthcare questions.

  • Parameter-Efficient Fine-tuning (PEFT): Imagine trying to remodel an entire skyscraper versus just updating the lobby. PEFT methods like LoRA (Low-Rank Adaptation) insert small, trainable modules into the frozen pre-trained model, allowing targeted learning without updating the whole model.

    • Example: Using LoRA to adapt a 70B-parameter model while training well under 1% of its parameters, drastically cutting the memory and compute needed for fine-tuning yet achieving results close to a full fine-tune (see the LoRA sketch after this list).

  • Instruction Fine-tuning: This is like teaching a child to follow directions. The model learns to respond appropriately to instructions rather than just continuing text or completing sentences.

    • Example: Training a model on pairs like "Instruction: Summarize this article" → "Response: [summary]" teaches it to follow diverse instructions.

  • Chain-of-Thought (CoT) Reasoning: Instead of jumping to conclusions, CoT teaches models to "show their work" like a good math student, breaking down complex problems into steps.

    • Example: Rather than answering "What's 17 × 24?" with just "408," the model learns to write "17 × 20 = 340, 17 × 4 = 68, 340 + 68 = 408."

  • Distillation-Based Fine-tuning: Picture a master chef teaching apprentices their signature recipes. Larger "teacher" models guide smaller "student" models to achieve similar results more efficiently.

    • Example: A 70B parameter model generating answers that a 7B model learns to imitate, allowing the smaller model to punch above its weight class.
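
To make the LoRA idea concrete, here's a minimal fine-tuning sketch using Hugging Face's transformers and peft libraries. The base model name, adapter rank, and target modules below are illustrative assumptions, not settings taken from the survey paper.

# Minimal LoRA fine-tuning sketch (model name and hyperparameters are
# illustrative assumptions, not recommendations from the paper).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# LoRA freezes the original weights and injects small trainable
# low-rank matrices into the chosen projection layers.
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # which layers receive adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all params

# From here, train `model` with an ordinary supervised fine-tuning loop
# (e.g. transformers' Trainer) on your domain-specific dataset.

Because only the small adapter matrices are trained and saved, fine-tuning with LoRA fits on far more modest hardware than a full fine-tune would.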

Reinforcement Learning: Teaching LLMs to Improve Themselves

If fine-tuning is like coaching, reinforcement learning is like letting the model practice and improve based on feedback. Here are some RL methods that have gained massive traction in the LLM world:

Flow diagrams depicting the comparison of RL techniques for LLMs (source: paper)

  • Proximal Policy Optimization (PPO): Think of PPO as a careful coaching system that helps models improve gradually. It adjusts the model's behavior based on reward signals while clipping each update so behavior never swings too far in one step. It's like training a gymnast with incremental challenges rather than immediately asking for Olympic-level routines.

    • Example: When the model generates helpful responses, it receives positive reinforcement; harmful outputs get negative feedback, slowly steering behavior in the desired direction.

  • Direct Preference Optimization (DPO): Imagine a cooking competition where judges directly compare two dishes. DPO trains on pairs of model outputs that humans have compared, turning clear "this is better than that" signals into updates to the model itself, without needing a separate reward model (a minimal sketch of the DPO loss follows this list).

    • Example: Given two possible responses to a question, human raters select the more helpful one, and the model learns to produce more outputs like the preferred option.

  • Group Relative Policy Optimization (GRPO): GRPO evaluates each output relative to a group of alternative responses sampled for the same prompt, making training more stable and efficient. This is like asking a focus group to evaluate answers and adjusting based on consensus. GRPO simplifies PPO by dropping the separate value model and using the group's average reward as the baseline.

    • Example: Instead of using an absolute scoring system, the model learns by comparing its outputs against a group of other potential responses it generated.
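
To make DPO less abstract, here's a minimal sketch of its loss function in PyTorch. It assumes you've already computed the summed log-probabilities of each chosen and rejected response under both the policy being trained and a frozen reference model; real implementations (e.g. trl's DPOTrainer) handle that bookkeeping for you.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss for a batch of preference pairs (chosen vs. rejected)."""
    # How much more the policy prefers each response than the reference does.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the margin between chosen and rejected responses to be positive.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example with dummy log-probabilities for two preference pairs:
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-15.0, -11.0]),
                torch.tensor([-13.0, -10.0]), torch.tensor([-14.0, -10.5]))
print(loss.item())

Notice there's no reward model in sight: the preference comparison itself supplies the training signal, which is exactly what makes DPO simpler to run than PPO.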

Test-Time Scaling: Smarter Reasoning on the Fly

Test-time scaling improves the model's output without retraining and is one of the latest buzzwords in town. Let's try to simplify some of the strategies so that you're never lost in a conversation when someone brings it up…

  • Beam Search: Picture a hiker exploring multiple paths simultaneously before committing to one. Beam search keeps the N most promising partial completions at each step, extending them in parallel before settling on the best full sequence.

    • Example: When generating a story ending, the model might explore five different endings simultaneously, ultimately selecting the most coherent option.

  • Best-of-N Search (BoN): This is like writing multiple drafts of an essay and picking the best one. The model generates several complete responses and selects the top performer.

    • Example: Generating 10 different worked solutions to a math problem and selecting the one with the highest confidence score.

  • Compute-Optimal Scaling (COS): Like an efficient manager allocating more resources to difficult projects and fewer to simple tasks, COS dynamically adjusts computational effort based on problem complexity.

    • Example: Using minimal compute for simple questions like "What's the capital of France?" but ramping up resources for complex reasoning tasks.

  • Self-Consistency Decoding: This resembles asking yourself the same question in multiple ways to verify your answer. If most approaches lead to the same result, it's probably correct (see the sketch after this list).

    • Example: Solving a math problem through several different methods and checking if they converge on the same answer.

  • Tree-of-Thoughts (ToT): Imagine a chess player thinking, "If I move here, then my opponent might do this or that..." ToT lets models explore branching decision paths and backtrack when needed.

    • Example: When solving a puzzle, the model can explore multiple solution strategies, abandon dead ends, and try alternative approaches.

  • Graph of Thoughts (GoT): Taking ToT further, GoT is like having a whiteboard where you can connect ideas in any direction, not just in a tree structure. It allows for more flexible reasoning with interconnected thoughts.

    • Example: When planning a complex project, ideas can connect and reference each other in a web-like structure rather than a strict hierarchy.
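
Here's a minimal self-consistency sketch in Python: sample several chain-of-thought answers at a non-zero temperature, extract the final answer from each, and keep the majority vote. The generate_with_cot callable is a hypothetical stand-in for whatever LLM API you use.

from collections import Counter
import re

def extract_final_answer(completion: str) -> str:
    """Pull the last number out of a chain-of-thought completion (toy heuristic)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else completion.strip()

def self_consistent_answer(question: str, generate_with_cot, n_samples: int = 10) -> str:
    """Sample several reasoning paths and return the answer most of them agree on."""
    completions = [generate_with_cot(question, temperature=0.7) for _ in range(n_samples)]
    answers = [extract_final_answer(c) for c in completions]
    return Counter(answers).most_common(1)[0][0]

The same sampling loop, with a scoring step instead of a majority vote, is essentially Best-of-N.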

Comparison of reasoning strategies in LLMs (source: paper)

Why does this matter?

These techniques aren't just academic exercises; they're transforming how AI systems work in the real world:

  • Instruction tuning and RLHF are why modern assistants (like the one you're reading this on) can follow complex instructions rather than just predicting text

  • Chain-of-thought approaches have dramatically improved reasoning capabilities, particularly in math and logic

  • Parameter-efficient methods like LoRA have democratized fine-tuning, allowing smaller organizations to customize models on modest hardware

  • Test-time scaling techniques enable models to tackle problems far beyond what their raw parameter count would suggest is possible

Choosing your post-training strategy

Facing this array of techniques, how should practitioners decide which to apply? Here are some actionable tips:

  • For task specialization:

    • Start with SFT on high-quality examples

    • If compute is limited, use PEFT methods like LoRA

    • For complex reasoning tasks, incorporate CoT fine-tuning

    • Extend into ToT/GoT if inference-time compute is available

  • For alignment with values:

    • DPO offers the best performance-to-complexity ratio for most use cases

    • PPO provides more control but requires significantly more infrastructure

    • GRPO balances efficiency and effectiveness for mid-sized deployments

  • For deployment optimization:

    • Self-consistency is the most reliable way to improve reasoning without retraining

    • For interactive applications, beam search offers the best speed-quality tradeoff

    • When quality matters most, Best-of-N with a reward model filter delivers top results (see the sketch below)
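
As a rough sketch of that last tip, Best-of-N with a reward-model filter boils down to "sample, score, keep the winner". The generate_response and reward_model_score callables below are hypothetical stand-ins for your generation API and trained reward model.

def best_of_n(prompt: str, generate_response, reward_model_score, n: int = 8) -> str:
    """Generate n candidate responses and return the one the reward model scores highest."""
    candidates = [generate_response(prompt, temperature=0.8) for _ in range(n)]
    scores = [reward_model_score(prompt, c) for c in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]

The trade-off is straightforward: quality scales with n, but so does your inference bill.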

Despite all these advances, challenges remain. The paper highlights several open problems, including better evaluation metrics, more efficient sampling techniques, and improved methods for handling complex reasoning and planning.

What do you think about these techniques?
Which would you prioritize when building your next AI application?

Share your thoughts with Spark & Trouble! ✨

10x Your Workflow with AI 📈

Work smarter, not harder! In this section, you'll find prompt templates 📜 & bleeding-edge AI tools ⚙️ to free up your time.

Fresh Prompt Alert! 🚨

Feel like your to-do list is sprinting ahead while you're still getting ready?

We totally get that. But what if you had a little secret weapon to help you out: an AI-powered productivity buddy?

This week's Fresh Prompt Alert is here to help you lighten the load, get organized, and automate tasks like a champ. Just toss your tasks in, and voilà: AI will show you what it can take care of, help with, or pass off to someone else, letting you focus on what really counts.

Ready to take back your time? Give it a shot! 👇

Here's a list of tasks that I wish to accomplish today:

[List of your tasks]

Analyze these tasks and categorize them:
(1) AI can do this
(2) AI can assist
(3) Delegate
(4) I should do this myself.

Explain why for each.

Also, for each task categorized as (1) & (2), create an action plan by suggesting relevant AI tools, prompts, workflows, etc. for the task.

* Replace the content in brackets with your details

5 AI Tools You JUST Can't Miss 🤩

  • 👨‍💻 Prepair: AI-driven platform for software engineers to practice technical interviews

  • 📽️ Kreado: Turn text, images, PowerPoint decks, and more into professional-quality videos

  • 🎸 DoMusic AI: Easily turn your text or lyrics into beautiful songs

  • 🍉 Watermelon: Automate your customer support with AI Agents

  • 📤 Sheetsy: Send personalized emails with AI at scale

Spark 'n' Trouble Shenanigans 😜

After Anthropic's recent workshop on the Model Context Protocol (MCP), Trouble got a little too inspired. So, naturally, he decided to build an AI agent using MCP, and not just any agent: one that analyzes financial data and helps make investment decisions.

The agent hooks up to the Financial Modeling Prep (FMP) API and can pull up company profiles, balance sheets, cash flow statements: basically, all the financial intel you'd need to invest like a pro.

And it works across different LLM providers like Azure OpenAI, OpenAI, and even Ollama (yes, you can use it with locally hosted LLMs as well). Spark was low-key impressed!

And hey, if you're curious about how MCP works or just want to see Trouble's agent in action, check out the full demo 👇

Well, thatā€™s a wrap!
Thanks for reading šŸ˜Š

See you next week with more mind-blowing tech insights 💻

Until then,
Stay Curious 🧠 Stay Awesome 🤩

PS: Do catch us on LinkedIn - Sandra & Tezan
