SLMs with Reasoning Just Got Real with Microsoft's Phi-4-Mini-Reasoning

PLUS: This AI turns any GitHub repo into a self-explaining book

Howdy Vision Debuggers!šŸ•µļø

Spark bet Trouble that smaller minds can’t reason. Trouble disagreed.

The results? In this edition…

Here’s a sneak peek into today’s edition šŸ‘€

  • The secret sauce behind Microsoft’s latest Phi-4-Mini-Reasoning model

  • Lost in your career? Meet your next mentor… With a little AI help

  • 5 powerful AI tools that will blow your mind

  • The easiest way to understand any GitHub repo - now at your fingertips

Time to jump in!šŸ˜„

PS: Got thoughts on our content? Share 'em through a quick survey at the end of every edition. It helps us see how our product labs, insights & resources are landing, so we can make them even better.

Hot off the Wires šŸ”„

We're eavesdropping on the smartest minds in research. 🤫 Don't miss out on what they're cooking up! In this section, we dissect some of the juiciest tech research that holds the key to what's next in tech.⚔

Remember that one kid in school who could breeze through Olympiad problems while the rest of us were still figuring out the first step? Fast-forward to today—and that kid might just be Microsoft’s latest Phi-4-Mini-Reasoning, a 3.8B parameter model trained to punch way above its weight in math.

Today, Spark & Trouble have the perfect blend of small-but-mighty: a compact model that schools its larger peers in mathematical reasoning.

Let’s decode what makes Phi-4-Mini-Reasoning the new class topper in the world of Small Language Models (SLMs).

So, what’s new?

It’s long been believed that reasoning-intensive tasks, especially in math, are best handled by large language models (LLMs) packed with billions (if not trillions) of parameters. Recent reasoning models like DeepSeek-R1 & OpenAI’s o-series (o1, o3, etc.) are a testament to this belief.

But what if you could train small models to reason just as well by carefully crafting how they learn?

That’s the question the researchers behind Phi-4-Mini-Reasoning set out to answer—and their results are turning heads. Through a systematic, four-stage training recipe, they’ve proved that with the right approach, a David-sized model (this one merely has 3.8B parameters) can indeed outperform Goliaths.

Forging the fundamentals

Before we explore the approach, a few key concepts to keep in mind:

SLMs (Small Language Models): Language AI models that are compact in size, like a pocket calculator compared to a supercomputer. They're more efficient and need less computing power, but traditionally have been less capable at complex tasks than their larger counterparts.

CoT (Chain-of-Thought): A technique where models generate intermediate reasoning steps before providing a final answer.

Distillation: Teaching a smaller model by having it learn from a larger model's answers. It's like a master chef (big model) teaching cooking techniques to an apprentice (small model), allowing the apprentice to make almost-as-good dishes without needing all the experience.

Mid-Training: An intermediate training phase where models learn foundational skills before specialization. Like how you might learn general music theory before focusing on mastering a specific instrument—it builds a knowledge base that later training can build upon.

DPO (Direct Preference Optimization): Training an AI by showing it pairs of answers (one good, one not so good) so it learns which is preferred, instead of formulating explicit ā€œrewardsā€ for every answer. It's like teaching by holding up a better and a worse example side by side, rather than grading each attempt individually.

Rollouts: The various answers or solution attempts a model generates for a given question.

Exploration-Exploitation: The balance between trying new approaches (exploration) versus using what already works well (exploitation). It's like deciding whether to try a new restaurant or return to your favorite one—you need both to optimize your dining experiences over time.

Under the hood…

The magic behind Phi-4-Mini-Reasoning lies in its innovative four-stage training process, meticulously crafted to overcome the limitations of small models.

Think of it like transforming a promising rookie athlete into a championship-level player through a carefully designed training regimen.

Representation of the multi-stage training recipe for a Reasoning SLM (source: created by authors)

While this recipe can be used for any SLM, the researchers applied it to the Phi-4-mini model (hence, the final output is Phi-4-Mini-Reasoning).

Stage 1: Distillation as Mid-Training

Imagine having a world-class math professor who solves millions of problems step-by-step, and then having a student study all these solutions. That's essentially what happens here: the small model learns from a massive collection of reasoning examples generated by DeepSeek-R1, a behemoth 671B-parameter model.

This isn't just about copying answers; it's about absorbing reasoning patterns across diverse mathematical domains. The researchers packed multiple examples into single training sequences for efficiency, continuing the training until the model's performance stabilized—establishing fundamental reasoning capabilities in the small model.
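To make that ā€œpacking multiple examples into single training sequencesā€ idea concrete, here’s a minimal Python sketch. It is not the authors’ pipeline, just an illustration of greedy sequence packing; the tokenizer, the max length of 4096, and the function name are all our own assumptions.

```python
# Illustrative sketch (not the authors' pipeline): greedily packing several distilled
# (problem, CoT solution) pairs into one fixed-length training sequence for efficiency.
def pack_examples(tokenizer, examples, max_len=4096):
    """examples: iterable of (problem, solution) strings; returns lists of token ids."""
    packed, current = [], []
    for problem, solution in examples:
        # One example = the problem plus its step-by-step solution, ended by the stop token.
        ids = tokenizer(problem + "\n" + solution).input_ids + [tokenizer.eos_token_id]
        if current and len(current) + len(ids) > max_len:
            packed.append(current)   # current sequence is full, start a new one
            current = []
        current.extend(ids)          # (overlong single examples would need truncation in practice)
    if current:
        packed.append(current)
    return packed
```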

Stage 2: Supervised Fine-tuning

Once the model has absorbed a vast repository of knowledge, it's time for focused practice. The researchers selected a compact but high-quality subset of problems spanning various math domains with difficulty levels exceeding college-level mathematics.

During this phase, the model learns not just how to solve problems, but when to stop generating—an essential skill for providing concise, accurate answers rather than rambling explanations.
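How does a model learn ā€œwhen to stopā€? In standard supervised fine-tuning, the trick is simply to append the end-of-sequence token to every target and to compute the loss only on the solution tokens. A hedged sketch of building one such training example (function name and the -100 label-masking convention are illustrative, not from the paper):

```python
# Illustrative sketch: one SFT example where the model also learns *when to stop*.
# The end-of-sequence token is appended to every target, and prompt tokens are
# masked out of the loss (label -100), so only the solution and its stop token are learned.
def build_sft_example(tokenizer, problem, solution):
    prompt_ids = tokenizer(problem + "\n").input_ids
    target_ids = tokenizer(solution).input_ids + [tokenizer.eos_token_id]
    input_ids = prompt_ids + target_ids
    labels = [-100] * len(prompt_ids) + target_ids  # loss only on solution + EOS
    return {"input_ids": input_ids, "labels": labels}
```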

Stage 3: Rollout Preference Learning

Here's where the training gets ingeniously efficient. Rather than discarding incorrect answers generated during previous stages, the researchers repurposed them as valuable teaching opportunities.

By pairing correct and incorrect solutions to the same problems, they created preference pairs that teach the model to distinguish good reasoning from flawed approaches. Using Direct Preference Optimization (DPO), the model learns to align with correct reasoning patterns and avoid common pitfalls—similar to how humans learn from their mistakes.
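For the curious, the core of DPO fits in a few lines of PyTorch. This is a generic sketch of the loss over (correct rollout, incorrect rollout) pairs, not the authors’ training code; the log-probability inputs and the beta value are assumptions.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a batch of (correct, incorrect) rollout pairs.

    Each argument is the summed log-probability of a full response under either the
    policy being trained or the frozen reference model. beta controls how far the
    policy may drift from the reference model.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Push the policy to prefer the correct rollout over the flawed one.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```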

Stage 4: Reinforcement Learning with Verifiable Reward

The final stage involves reward-based learning where the model receives positive reinforcement for correct answers and negative feedback for incorrect ones.
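ā€œVerifiable rewardā€ just means the reward comes from checking the final answer, not from a learned judge. A toy sketch of what such a check could look like (the \boxed{} convention and the ±1 reward values are our illustrative assumptions, not the paper’s exact setup):

```python
import re

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Toy verifiable reward: +1 if the final \\boxed{...} answer matches the reference, else -1."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    predicted = matches[-1].strip() if matches else ""
    return 1.0 if predicted == ground_truth.strip() else -1.0
```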

However, applying reinforcement learning to small models presented unique challenges that required innovative solutions:

| Challenge | Solution Adopted by Researchers |
| --- | --- |
| High variance in response lengths | Implemented prompt optimization to generate responses with more uniform lengths. |
| Vanishing gradients under uniform rewards | Rebalanced positive and negative rewards and oversampled difficult problems to maintain effective learning signals. |
| Exploration-exploitation tradeoff | Introduced temperature annealing, starting with high exploration (temperature 1.0) and gradually shifting to exploitation (temperature 0.6). |
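That last trick, temperature annealing, is simple to picture in code. Here’s a minimal sketch of a linear schedule from 1.0 down to 0.6; the actual schedule shape used in the paper may differ, so treat this as illustrative.

```python
def sampling_temperature(step: int, total_steps: int,
                         start: float = 1.0, end: float = 0.6) -> float:
    """Linearly anneal the rollout sampling temperature from `start` (explore) to `end` (exploit)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac
```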

The training data itself was meticulously curated: approximately 10 million problem solutions across 1.6 million math problems, spanning algebra, geometry, probability, calculus, and theoretical mathematics.

Each problem was categorized by domain and difficulty level (from elementary school to graduate level), with correctness verified using specialized math-verification tools and GPT-4o-mini.
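The paper mentions specialized math-verification tools plus GPT-4o-mini for checking correctness; as a stand-in, here’s what a symbolic equivalence check could look like with sympy (our assumption, purely illustrative):

```python
from sympy import simplify, sympify

def answers_match(predicted: str, reference: str) -> bool:
    """Check symbolic equivalence, e.g. '2*(x+1)' vs '2*x + 2'; fall back to string comparison."""
    try:
        return simplify(sympify(predicted) - sympify(reference)) == 0
    except Exception:
        return predicted.strip() == reference.strip()
```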

Results speak louder than words

On the challenging AIME 2024 benchmark (American Invitational Mathematics Examination), Phi-4-Mini-Reasoning scored 57.5%, outperforming DeepSeek-R1-Distill-Qwen-7B (53.3%) and DeepSeek-R1-Distill-Llama-8B (43.3%)—models with nearly twice its parameter count.

For context, the base Phi-4-Mini model without this specialized training scored just 10% on AIME—demonstrating the dramatic impact of the training methodology.

Each training stage contributed progressively to these impressive results, transforming a model that initially struggled with complex reasoning into one that rivals or exceeds much larger alternatives.

Why does this matter?

The implications of this research extend far beyond academic benchmarks. Phi-4-Mini-Reasoning represents a paradigm shift in how we think about AI model development:

  • Efficiency Revolution: Organizations no longer need massive computational resources to deploy advanced reasoning capabilities. This dramatically reduces infrastructure costs and energy consumption while maintaining high performance.

  • Edge AI Gets Smarter: With powerful reasoning packed into smaller models, we can deploy sophisticated AI directly on devices like smartphones, tablets, and IoT devices—enabling real-time math assistance, financial analysis, and scientific applications without cloud connectivity.

  • Democratized AI Development: Lower computational requirements mean smaller research teams and startups can innovate in the reasoning AI space without the enormous compute budgets previously required.

  • Educational Transformation: Imagine having a math tutor that fits in your pocket, providing step-by-step solutions to complex problems while consuming minimal device resources. This could revolutionize personalized education and homework assistance.

  • Industry-Specific Applications: From engineering calculations to financial modeling, compact yet powerful reasoning models could be fine-tuned for specialized domains, creating a new generation of accessible expert systems.

Perhaps most importantly, this research provides a blueprint - a systematic training recipe that could be adapted to enhance other capabilities in small models, potentially transforming how we approach model development across the board.

What's your take? Could compact, reasoning-focused models like Phi-4-Mini-Reasoning replace the computational giants we've come to rely on? Are we witnessing the beginning of a "small is beautiful" revolution in AI?

Share your thoughts with Spark & Trouble.

Wish to get your hands dirty with Phi-4-Mini-Reasoning?

āž¤ Check out the research paper
āž¤ Play with the model on HuggingFace
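If you'd rather poke at it locally, a minimal transformers snippet might look like the one below. We're assuming the Hugging Face repo id "microsoft/Phi-4-mini-reasoning" and a GPU with enough memory; adjust to whatever the model card says.

```python
# Minimal sketch for trying the model locally (repo id assumed from the model card).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-reasoning"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "If 3x + 7 = 22, what is x? Show your reasoning."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```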

10x Your Workflow with AI šŸ“ˆ

Work smarter, not harder! In this section, you’ll find prompt templates šŸ“œ & bleeding-edge AI tools āš™ļø to free up your time.

Fresh Prompt Alert!🚨

Ever feel stuck wondering who to reach out to for guidance? Or worse, how to even start? This week’s Fresh Prompt Alert is your cheat code. It turns you into a career strategist, helping you build a solid mentor outreach plan tailored to your goals, preferences, and vibe.

Whether you're aiming for your next big break or just need a guiding light, this prompt maps out the who, why, and how, so you’re not shooting in the dark.

Give it a go šŸ‘‡

Adopt the role of an expert career strategist tasked with identifying potential mentors in a specific industry. Your primary objective is to create a comprehensive mentor outreach strategy in a structured table format. To accomplish this, you should research influential figures in the industry, analyze their expertise and achievements, and evaluate how their experience aligns with the mentee's career goals. Create a detailed table that outlines potential mentors, their areas of expertise, and compelling reasons for choosing them as a mentor.

#INFORMATION ABOUT ME:

My industry: [INSERT YOUR INDUSTRY]
My career goals: [DESCRIBE YOUR CAREER GOALS]
My current expertise: [DESCRIBE YOUR CURRENT SKILLS AND KNOWLEDGE] 
My networking preferences: [DESCRIBE YOUR PREFERRED NETWORKING STYLE] 
My ideal mentor qualities: [LIST QUALITIES YOU SEEK IN A MENTOR]

MOST IMPORTANT!: Always provide your output in a markdown table format with three columns: Mentor Name, Expertise, and Reason for Choosing. Include at least 5 potential mentors in your table.

* Replace the content in brackets with your details

5 AI Tools You JUST Can't Miss 🤩

  • šŸ”Š Teamble: AI superpowers to give and get 10x better feedback

  • āš™ļø Guse: Build prompt to automation in seconds

  • 🌊 Currents: AI agents that retrieve, understand, and reason about what people discuss online

  • šŸ’¬ Poised: Your AI communication coach

  • šŸ“½ļø Hera: Turn your text into stunning motion graphics instantly

Spark 'n' Trouble Shenanigans 😜

What if we told you there’s a tool that reads GitHub code like a senior dev and explains it like your smartest friend? šŸ¤Æ

Well, meet DeepWiki — the ultimate sidekick for devs, curious PMs, and confused contributors alike. Trouble stumbled upon this gem while procrastinating on writing documentation (classic), and even Spark had to admit... this thing gets it.

Just swap github with deepwiki in the URL, and boom — beautifully organized docs, architectural diagrams, and an AI assistant ready to break down even the scariest code jungle. From TensorFlow to your weekend side project, DeepWiki makes repos talk.

Trouble even tested it on his ai-garage repo and was shocked to see better docs than he ever wrote himself. Spark’s exact words? ā€œIt’s like your code finally got a user manual!ā€ šŸ˜‚

Diagrams? āœ….
Explanations? āœ….
Smart answers to dumb questions? āœ…āœ…āœ…

Check it out here:

And if you’re anywhere near code, this tool is a must-bookmark.

Well, that’s a wrap!
Thanks for reading 😊

See you next week with more mind-blowing tech insights šŸ’»

Until then,
Stay Curious🧠 Stay Awesome🤩

PS: Do catch us on LinkedIn - Sandra & Tezan
