The Vision, Debugged;
TripCraft - A new benchmark that helps AI plan trips like a real expert

PLUS: Gemini 2.5 Pro’s new trick is blowing devs’ minds

Tezan Sahu & Sandra Anil
May 20th, 2025

Howdy Vision Debuggers!🕵️

Spark and Trouble are packing their bags—no, not for vacation (yet)—but to map out something that might just change how we plan every step of the journey.

Here’s a sneak peek into today’s edition 👀

Meet the benchmark that fixes broken AI trip planners
Today’s fresh prompt can make your products & services unmatchable, unforgettable, unstoppable
3 must-have AI tools that are redefining productivity
Google’s latest update might replace your next intern

Time to jump in!😄

PS: Got thoughts on our content? Share 'em through a quick survey at the end of every edition. It helps us see how our product labs, insights & resources are landing, so we can make them even better.

Hot off the Wires 🔥

We're eavesdropping on the smartest minds in research. 🤫 Don't miss out on what they're cooking up! In this section, we dissect some of the juiciest tech research that holds the key to what's next in tech.⚡

Remember those family vacations where dad insisted on having a detailed itinerary printed out, complete with color-coded highlights and military-precision timing? Or perhaps you're the friend who creates massive spreadsheets before trips, calculating exactly how long it takes to get from the Eiffel Tower to Notre Dame? Travel planning has always been part art, part science—and now, AI is stepping up to transform this age-old challenge.

That’s how daunting trip-planning can be 🥲

Enter TripCraft, a groundbreaking benchmark that's set to revolutionize how AI understands and creates travel itineraries. Recently accepted to the prestigious ACL 2025 conference, this collaborative effort from Microsoft and IIT Bhubaneswar might finally deliver the AI travel assistant we've all dreamed of—one that actually knows you shouldn't schedule a fancy dinner right after a 5-hour hike.

So, what’s new?

Current AI travel planners are like that well-meaning but clueless friend who suggests impossible itineraries—"Sure, you can see the Louvre, Versailles, and climb the Eiffel Tower all before lunch!" They look good on paper but fall apart in practice.

Traditional benchmarks like TravelPlanner+ rely heavily on semi-synthetic data, creating a fundamental disconnect between AI-generated plans and real-world feasibility. They simply don't account for crucial constraints like:

Realistic transit times between attractions
Actual operating hours for venues
Personal preferences that shape enjoyable experiences
Local events that might impact your plans

TripCraft addresses these critical gaps by grounding travel planning in reality. Instead of asking "Can an AI create any itinerary?" it asks "Can an AI create an itinerary that would actually work and be enjoyable for a specific person?"

Conceptual overview of TripCraft, with an example (source: TripCraft paper)

Under the hood…

TripCraft's architecture is built around three foundational pillars that make it uniquely powerful for evaluating AI travel planning capabilities:

1. Real-world data integration

Unlike previous benchmarks, TripCraft incorporates genuine transit schedules, up-to-date attraction information, and actual event timings. This means when an AI suggests taking the subway from Central Park to Times Square at 11 PM, TripCraft can verify if that's even possible based on NYC's transit schedule.

The dataset comprises 1,000 diverse travel queries spanning different trip durations (3-day, 5-day, and 7-day itineraries) and varying difficulty levels based on destination density. Each query comes paired with a human-annotated reference plan—the gold standard against which AI performance is measured.

2. Rich personalization framework

TripCraft introduces detailed user personas that go far beyond basic preferences. The benchmark can evaluate if an AI properly accommodates:

Traveler types (adventure seekers vs. relaxation enthusiasts)
Trip purposes (nature exploration, cultural immersion, leisure activities)
Budget constraints (luxury splurges vs. economic considerations)
Location preferences (beach lovers, city explorers, forest retreaters)

This means TripCraft can determine if an AI appropriately recommends a challenging hiking trail to an adventure-seeking nature enthusiast or if it correctly suggests Michelin-starred restaurants to a food-focused luxury traveler.
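
To make the persona idea concrete, here is a minimal sketch in Python. The four fields mirror the persona dimensions listed above, but the `Persona` class, the tag vocabulary, and the `persona_match` scoring rule are all hypothetical illustrations, not the benchmark's actual schema or Persona Score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    # Field names are assumptions that mirror the four dimensions above.
    traveler_type: str        # e.g. "adventure" or "relaxation"
    purpose: str              # e.g. "nature", "culture", "leisure"
    budget: str               # e.g. "luxury" or "economy"
    location_preference: str  # e.g. "beach", "city", "forest"


def persona_match(persona, activity_tags):
    """Fraction of activities sharing at least one tag with the persona.

    A toy stand-in for a persona score; budget is ignored for simplicity.
    """
    if not activity_tags:
        return 0.0
    wanted = {persona.traveler_type, persona.purpose, persona.location_preference}
    hits = sum(1 for tags in activity_tags if wanted & set(tags))
    return hits / len(activity_tags)


persona = Persona("adventure", "nature", "economy", "forest")
activities = [
    {"adventure", "forest"},  # challenging hiking trail: matches
    {"luxury", "city"},       # rooftop cocktail bar: no match
    {"nature"},               # botanical garden: matches
    {"museum"},               # modern art museum: no match
]
print(persona_match(persona, activities))  # 0.5
```

Two of the four activities overlap with what this traveler cares about, so the toy score is 0.5. A real persona metric would likely weigh the dimensions differently and compare richer descriptions than simple tags.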

3. Spatio-temporal reasoning evaluation

Perhaps most impressively, TripCraft assesses the logical flow of proposed itineraries across both space and time:

Are meals scheduled at reasonable intervals (minimum 4-hour gaps)?
Do attraction visit durations make sense for the venue type?
Is the route efficient, minimizing unnecessary backtracking?
Does the day's plan flow logically from morning to evening?
Are events scheduled only when they're actually happening?

For example, TripCraft can penalize an AI that suggests breakfast at 3 PM or schedules back-to-back museum visits with no travel time in between—mistakes that would derail a real vacation but might go undetected by simplistic evaluation metrics.
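
Two of these checks (meal spacing and travel time between stops) can be sketched in a few lines of Python. This is an illustrative toy, not TripCraft's evaluation code: the `Stop` class, both function names, the `travel_hours` lookup, and the choice to measure meal gaps from start time to start time are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Stop:
    name: str
    kind: str     # "meal" or "attraction"
    start: float  # hours since midnight, e.g. 14.5 means 2:30 PM
    end: float


def meal_gap_violations(stops, min_gap_hours=4.0):
    """Pairs of consecutive meals that start less than min_gap_hours apart."""
    meals = sorted((s for s in stops if s.kind == "meal"), key=lambda s: s.start)
    return [
        (a.name, b.name)
        for a, b in zip(meals, meals[1:])
        if b.start - a.start < min_gap_hours
    ]


def missing_travel_time(stops, travel_hours):
    """Consecutive stops whose gap is shorter than the travel time needed.

    travel_hours is a hypothetical lookup: (from_name, to_name) -> hours.
    Pairs missing from the lookup are assumed to need no travel time.
    """
    ordered = sorted(stops, key=lambda s: s.start)
    problems = []
    for a, b in zip(ordered, ordered[1:]):
        needed = travel_hours.get((a.name, b.name), 0.0)
        if b.start - a.end < needed:
            problems.append((a.name, b.name))
    return problems


day = [
    Stop("Breakfast", "meal", 8.0, 9.0),
    Stop("Museum A", "attraction", 9.5, 12.0),
    Stop("Museum B", "attraction", 12.0, 14.0),  # starts the minute A ends
    Stop("Lunch", "meal", 14.5, 15.5),
    Stop("Dinner", "meal", 17.0, 18.0),          # only 2.5 hours after lunch
]
travel_hours = {("Museum A", "Museum B"): 0.5}

print(meal_gap_violations(day))                # [('Lunch', 'Dinner')]
print(missing_travel_time(day, travel_hours))  # [('Museum A', 'Museum B')]
```

Both problems in this made-up day are flagged: dinner lands too soon after lunch, and the two museums are scheduled back to back even though getting between them takes half an hour.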

Benchmark Creation Pipeline

Here’s a representation of how this benchmark was created (source: TripCraft paper)

Behind the scenes, TripCraft was built through a meticulous three-stage process.

First, researchers web-scraped current data from OpenStreetMap and other sources, filtering to match the 140 cities with available public transit data.
Next, they constructed diverse queries using GPT-4o, with a clever scaling approach—3-day trips cover one city, while 7-day trips span three cities within a state.
Finally, 25 graduate students spent approximately 30 minutes per annotation (significantly longer than previous datasets), with domain experts conducting final reviews to ensure both feasibility and optimal planning quality.

What’s the intrigue?

The innovative aspect of TripCraft isn't just its comprehensive data—it's how it evaluates AI performance through five continuous metrics that provide nuanced scoring:

Metric	What It Measures
Temporal Meal Score	Are meals scheduled at realistic times?
Temporal Attraction Score	Are visit durations appropriate?
Spatial Score	Is the route efficient with minimal detours?
Ordering Score	Does the day's plan flow logically?
Persona Score	Does the itinerary match user preferences?

Unlike binary checks that simply verify if constraints were met, these continuous metrics provide gradient scores that can better guide LLM improvement. For instance, rather than simply failing an AI for scheduling dinner too early, TripCraft provides a scaled score based on how reasonable the timing is—enabling more targeted model refinement.
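
The difference between a binary check and a continuous score is easy to see in code. The sketch below is a toy illustration: the 6 PM to 9 PM dinner window, the 3-hour tolerance, and the linear decay are assumptions chosen for clarity, not the formula used by TripCraft.

```python
def binary_dinner_check(hour, earliest=18.0, latest=21.0):
    """Pass/fail: 1.0 if dinner starts inside the window, else 0.0."""
    return 1.0 if earliest <= hour <= latest else 0.0


def continuous_dinner_score(hour, earliest=18.0, latest=21.0, tolerance=3.0):
    """1.0 inside the window, falling linearly to 0.0 at `tolerance` hours outside it."""
    if earliest <= hour <= latest:
        return 1.0
    distance = earliest - hour if hour < earliest else hour - latest
    return max(0.0, 1.0 - distance / tolerance)


# Dinner at 4:30 PM is early, but not absurd.
print(binary_dinner_check(16.5))      # 0.0
print(continuous_dinner_score(16.5))  # 0.5
# Dinner at 3:00 PM is far enough off to score zero either way.
print(continuous_dinner_score(15.0))  # 0.0
```

The binary check treats a 4:30 PM dinner and a 3:00 PM dinner as equally wrong, while the continuous score separates "slightly early" from "way off", which is the kind of gradient signal that can guide model improvement.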

Initial benchmark results reveal significant gaps between current state-of-the-art LLMs and human performance, particularly in balancing multiple constraints simultaneously. While leading models excel at matching user preferences (high persona scores), they struggle with spatio-temporal reasoning, often creating logistically impossible days.

Why does this matter?

As AI increasingly integrates into our travel planning tools, TripCraft represents a crucial step toward creating assistants that truly understand the complexities of real-world travel.

The implications extend beyond just better vacation planning. TripCraft demonstrates a framework for teaching AI to handle multi-constraint optimization problems—balancing competing priorities like time, distance, cost, and personal preference—applicable across numerous domains from healthcare scheduling to supply chain management.

For travelers, this research signals a future where AI assistants could:

Create truly personalized itineraries that respect your unique preferences
Automatically adjust plans when real-world conditions change (like weather or transit delays)
Balance ambitious sightseeing with realistic pacing to avoid vacation burnout
Incorporate local events and seasonal factors into recommendations

For developers and businesses, TripCraft provides the tools to build the next generation of travel services. We're already seeing this technology beginning to transform how we plan trips. Remember Wanderboat, which we covered in a previous edition? Their AI travel planning assistant aims to eliminate the stress of vacation coordination. With benchmarks like TripCraft, tools like Wanderboat can potentially evolve from simple recommendation engines to sophisticated planning partners that truly understand real-world constraints.

Major players are investing heavily in this space too. Bing Travel has been enhancing its AI capabilities to create more personalized itineraries, while booking giants like MakeMyTrip and GoIbibo are exploring AI agents that can handle complex multi-city bookings with nuanced user preferences. Even Airbnb has hinted at developing AI tools that could match travelers with experiences based on their unique travel personas.

How soon do you think we'll see AI travel planners that truly understand the difference between a relaxing vacation and an exhausting march through tourist traps?
Are we finally approaching an era where AI can handle the complexities of real-world planning with all its messy constraints?

Share your thoughts with Spark & Trouble on how this technology might transform how we plan our adventures!

10x Your Workflow with AI 📈

Work smarter, not harder! In this section, you’ll find prompt templates 📜 & bleeding-edge AI tools ⚙️ to free up your time.

Fresh Prompt Alert!🚨

Tired of being just another fish in the sea? This week's prompt transforms your average Joe product into the undisputed heavyweight champion of its domain.

It's like giving your offering a secret sauce so irresistible, competitors can only watch in awe. Whether you're launching a SaaS tool or selling handcrafted soap, this 7-step metamorphosis will have customers saying "shut up and take my money!"

Ready to create your own mini-monopoly? Let's get started👇

Give me a 7-step checklist that turns a basic product or service into a "category of one" offer - so distinct, so uncopyable, it becomes immune to price wars, competition, or customer doubt. Think Rolls Royce meets red pill. Prestige baked in.

Here are some details about [your product/service]:
[Add details like description, pricing, audience, etc.]

Convert this into such a "category of one" offer, using the checklist created.

* Replace the content in brackets with your details & enable web search while submitting this.

3 AI Tools You JUST Can't Miss 🤩

📞 Hedy: Your AI meeting coach for every conversation
🪄 Playground: Design anything like a pro
🔬 Arize Phoenix: An open-source AI observability platform for experimentation & evaluation

Spark 'n' Trouble Shenanigans 😜

“Wait… did Google just teach an AI to watch a screen recording and build an app from it?”

Yup. And Trouble has been nervously re-reading the job description ever since. Spark, of course, is already screen-recording our Notion dashboard to turn it into a to-do list app with snacks integration.

Recently, Google dropped Gemini 2.5 Pro Preview I/O Edition (because “Gemini 3” was apparently too obvious). It’s fast, clever, and currently flexing at the top of the WebDev Arena leaderboard—yes, that’s a thing.

Imagine uploading a screen recording, whispering “Clone this”, and boom: working frontend.

Google basically gave all devs a turbocharged sidekick. It’s better at function calls, more fluent with UI-heavy apps, and shockingly decent at going from “idea in your brain” to “working prototype” without throwing a tantrum mid-sprint.

Is it perfect? Nah. It’s still a bit verbose and occasionally acts like it knows better than you.

But this video-to-app trick? Feels like magic. Try it before your PM does. 😎

Try it out on Google AI Studio

Well, that’s a wrap!
Thanks for reading 😊

See you next week with more mind-blowing tech insights 💻

Until then,
Stay Curious🧠 Stay Awesome🤩

PS: Do catch us on LinkedIn - Sandra & Tezan
