TripCraft - A new benchmark that helps AI plan trips like a real expert
PLUS: Gemini 2.5 Pro's new trick is blowing devs' minds

Howdy Vision Debuggers! 🕵️
Spark and Trouble are packing their bags, not for vacation (yet), but to map out something that might just change how we plan every step of the journey.
Here's a sneak peek into today's edition:
- Meet the benchmark that fixes broken AI trip planners
- Today's fresh prompt can make your products & services unmatchable, unforgettable, unstoppable
- 3 must-have AI tools that are redefining productivity
- Google's latest update might replace your next intern
Time to jump in!
PS: Got thoughts on our content? Share 'em through a quick survey at the end of every edition. It helps us see how our product labs, insights & resources are landing, so we can make them even better.

Hot off the Wires 🔥
We're eavesdropping on the smartest minds in research. 🤫 Don't miss out on what they're cooking up! In this section, we dissect some of the juiciest tech research that holds the key to what's next in tech. ⚡
Remember those family vacations where dad insisted on having a detailed itinerary printed out, complete with color-coded highlights and military-precision timing? Or perhaps you're the friend who creates massive spreadsheets before trips, calculating exactly how long it takes to get from the Eiffel Tower to Notre Dame? Travel planning has always been part art, part science, and now AI is stepping up to transform this age-old challenge.

That's how daunting trip planning can be 🥲
Enter TripCraft, a groundbreaking benchmark that's set to revolutionize how AI understands and creates travel itineraries. Recently accepted to the prestigious ACL 2025 conference, this collaborative effort from Microsoft and IIT Bhubaneswar might finally deliver the AI travel assistant we've all dreamed of: one that actually knows you shouldn't schedule a fancy dinner right after a 5-hour hike.
So, what's new?
Current AI travel planners are like that well-meaning but clueless friend who suggests impossible itineraries: "Sure, you can see the Louvre, Versailles, and climb the Eiffel Tower all before lunch!" They look good on paper but fall apart in practice.
Traditional benchmarks like TravelPlanner+ rely heavily on semi-synthetic data, creating a fundamental disconnect between AI-generated plans and real-world feasibility. They simply don't account for crucial constraints like:
- Realistic transit times between attractions
- Actual operating hours for venues
- Personal preferences that shape enjoyable experiences
- Local events that might impact your plans
TripCraft addresses these critical gaps by grounding travel planning in reality. Instead of asking "Can an AI create any itinerary?" it asks "Can an AI create an itinerary that would actually work and be enjoyable for a specific person?"

Conceptual overview of TripCraft, with an example (source: TripCraft paper)
Under the hood…
TripCraft's architecture is built around three foundational pillars that make it uniquely powerful for evaluating AI travel planning capabilities:
1. Real-world data integration
Unlike previous benchmarks, TripCraft incorporates genuine transit schedules, up-to-date attraction information, and actual event timings. This means when an AI suggests taking the subway from Central Park to Times Square at 11 PM, TripCraft can verify if that's even possible based on NYC's transit schedule.
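Here's a tiny sketch of what that kind of feasibility check could look like. Everything in it is hypothetical: the timetable, the departure times, and the `next_departure` helper are made up for illustration, and TripCraft's real data format and verification logic may look quite different.

```python
from bisect import bisect_left
from datetime import time

# Hypothetical timetable: departure times for one made-up route.
# Real transit feeds are far richer; this is for illustration only.
TIMETABLE = {
    ("Central Park", "Times Square"): [
        time(6, 0), time(12, 30), time(18, 0), time(22, 45),
    ],
}

def next_departure(origin, destination, desired):
    """Return the first departure at or after `desired`, or None if none is left today."""
    departures = TIMETABLE.get((origin, destination), [])
    idx = bisect_left(departures, desired)  # assumes departures are sorted
    return departures[idx] if idx < len(departures) else None

print(next_departure("Central Park", "Times Square", time(21, 0)))  # 22:45:00
print(next_departure("Central Park", "Times Square", time(23, 0)))  # None
```

In this toy timetable, a 9 PM request still catches the 22:45 departure, while an 11 PM request finds nothing, which is exactly the kind of plan a grounded benchmark can flag as infeasible.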
The dataset comprises 1,000 diverse travel queries spanning different trip durations (3-day, 5-day, and 7-day itineraries) and varying difficulty levels based on destination density. Each query comes paired with a human-annotated reference plan, the gold standard against which AI performance is measured.
2. Rich personalization framework
TripCraft introduces detailed user personas that go far beyond basic preferences. The benchmark can evaluate if an AI properly accommodates:
- Traveler types (adventure seekers vs. relaxation enthusiasts)
- Trip purposes (nature exploration, cultural immersion, leisure activities)
- Budget constraints (luxury splurges vs. economic considerations)
- Location preferences (beach lovers, city explorers, forest retreaters)
This means TripCraft can determine if an AI appropriately recommends a challenging hiking trail to an adventure-seeking nature enthusiast or if it correctly suggests Michelin-starred restaurants to a food-focused luxury traveler.
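One simple way to picture persona matching is as tag overlap between a traveler profile and the items in a plan. The sketch below is an assumption-heavy toy: the tags, the sample plan, and the `persona_match` function are invented here and are not how the paper computes its Persona Score.

```python
def persona_match(persona_tags, itinerary):
    """Fraction of itinerary items sharing at least one tag with the persona."""
    if not itinerary:
        return 0.0
    hits = sum(1 for item in itinerary if persona_tags & item["tags"])
    return hits / len(itinerary)

# Hypothetical persona and plan, for illustration only.
persona = {"adventure", "nature", "budget"}
plan = [
    {"name": "Ridge trail hike", "tags": {"adventure", "nature"}},
    {"name": "Street food market", "tags": {"budget", "food"}},
    {"name": "Fine-dining tasting menu", "tags": {"luxury", "food"}},
    {"name": "Kayak rental", "tags": {"adventure"}},
]

print(persona_match(persona, plan))  # 0.75
```

Three of the four items fit this budget-minded adventure seeker; the tasting menu is the odd one out.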
3. Spatio-temporal reasoning evaluation
Perhaps most impressively, TripCraft assesses the logical flow of proposed itineraries across both space and time:
- Are meals scheduled at reasonable intervals (minimum 4-hour gaps)?
- Do attraction visit durations make sense for the venue type?
- Is the route efficient, minimizing unnecessary backtracking?
- Does the day's plan flow logically from morning to evening?
- Are events scheduled only when they're actually happening?
For example, TripCraft can penalize an AI that suggests breakfast at 3 PM or schedules back-to-back museum visits with no travel time in between. These are mistakes that would derail a real vacation but might go undetected by simplistic evaluation metrics.
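Two of the checks above, meal spacing and travel time between visits, can be sketched in a few lines. This is a minimal illustration under our own assumptions: the data layout, the helper names, and the choice to measure meal gaps from start time to start time are ours, not the benchmark's.

```python
from datetime import datetime, timedelta

MIN_MEAL_GAP = timedelta(hours=4)  # the 4-hour minimum mentioned above

def t(hhmm):
    """Parse 'HH:MM' into a datetime on an arbitrary fixed day."""
    return datetime.strptime(hhmm, "%H:%M")

def meal_gap_violations(meals):
    """Return consecutive meal pairs whose start times are under 4 hours apart."""
    ordered = sorted(meals, key=lambda m: m["start"])
    return [
        (a["name"], b["name"])
        for a, b in zip(ordered, ordered[1:])
        if b["start"] - a["start"] < MIN_MEAL_GAP
    ]

def travel_time_violations(visits, travel_minutes):
    """Return consecutive visits whose gap is shorter than the travel time needed."""
    problems = []
    for a, b in zip(visits, visits[1:]):
        # Unknown pairs default to 0 minutes (an assumption of this sketch).
        needed = timedelta(minutes=travel_minutes.get((a["name"], b["name"]), 0))
        if b["start"] - a["end"] < needed:
            problems.append((a["name"], b["name"]))
    return problems

# Hypothetical day plan.
meals = [
    {"name": "breakfast", "start": t("08:00")},
    {"name": "lunch", "start": t("11:00")},
    {"name": "dinner", "start": t("19:00")},
]
visits = [
    {"name": "Museum A", "start": t("10:00"), "end": t("12:00")},
    {"name": "Museum B", "start": t("12:00"), "end": t("14:00")},
    {"name": "Park", "start": t("14:30"), "end": t("16:00")},
]
travel = {("Museum A", "Museum B"): 25, ("Museum B", "Park"): 20}

print(meal_gap_violations(meals))              # [('breakfast', 'lunch')]
print(travel_time_violations(visits, travel))  # [('Museum A', 'Museum B')]
```

The back-to-back museums get flagged because the plan leaves zero minutes for a 25-minute transfer, while the 30-minute gap before the park is enough.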
Benchmark Creation Pipeline

Here's a representation of how this benchmark was created (source: TripCraft paper)
Behind the scenes, TripCraft was built through a meticulous three-stage process.
First, researchers web-scraped current data from OpenStreetMap and other sources, filtering to match the 140 cities with available public transit data.
Next, they constructed diverse queries using GPT-4o, with a clever scaling approach: 3-day trips cover one city, while 7-day trips span three cities within a state.
Finally, 25 graduate students spent approximately 30 minutes per annotation (significantly longer than previous datasets), with domain experts conducting final reviews to ensure both feasibility and optimal planning quality.
What's the intrigue?
The innovative aspect of TripCraft isn't just its comprehensive data. It's how it evaluates AI performance through five continuous metrics that provide nuanced scoring:
Metric | What It Measures |
---|---|
Temporal Meal Score | Are meals scheduled at realistic times? |
Temporal Attraction Score | Are visit durations appropriate? |
Spatial Score | Is the route efficient with minimal detours? |
Ordering Score | Does the day's plan flow logically? |
Persona Score | Does the itinerary match user preferences? |
Unlike binary checks that simply verify if constraints were met, these continuous metrics provide gradient scores that can better guide LLM improvement. For instance, rather than simply failing an AI for scheduling dinner too early, TripCraft provides a scaled score based on how reasonable the timing is, enabling more targeted model refinement.
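To see the difference between a pass/fail check and a graded score, here's a small sketch for the dinner example. The bell-shaped curve, the 7:30 PM "ideal" time, and the 6 PM to 9 PM window are all our assumptions for illustration; the paper defines its own formulas for the temporal scores.

```python
import math

def binary_dinner_check(hour, earliest=18.0, latest=21.0):
    """Pass/fail: 1 if dinner starts inside the window, else 0."""
    return 1 if earliest <= hour <= latest else 0

def continuous_dinner_score(hour, ideal=19.5, spread=1.5):
    """Bell-shaped score in (0, 1]: 1.0 at the ideal time, decaying smoothly."""
    return math.exp(-((hour - ideal) ** 2) / (2 * spread ** 2))

# Hours are in 24-hour decimal form, e.g. 17.5 means 5:30 PM.
for hour in (19.5, 17.5, 15.0):
    print(hour, binary_dinner_check(hour), round(continuous_dinner_score(hour), 3))
```

Under these assumed parameters, a 5:30 PM dinner and a 3 PM dinner both fail the binary check, but the graded score tells them apart (roughly 0.41 versus 0.01), so a model gets credit for being "almost right".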
Initial benchmark results reveal significant gaps between current state-of-the-art LLMs and human performance, particularly in balancing multiple constraints simultaneously. While leading models excel at matching user preferences (high persona scores), they struggle with spatial-temporal reasoning, often creating logistically impossible days.
Why does this matter?
As AI increasingly integrates into our travel planning tools, TripCraft represents a crucial step toward creating assistants that truly understand the complexities of real-world travel.
The implications extend beyond just better vacation planning. TripCraft demonstrates a framework for teaching AI to handle multi-constraint optimization problems, balancing competing priorities like time, distance, cost, and personal preference. That framework applies across numerous domains, from healthcare scheduling to supply chain management.
For travelers, this research signals a future where AI assistants could:
- Create truly personalized itineraries that respect your unique preferences
- Automatically adjust plans when real-world conditions change (like weather or transit delays)
- Balance ambitious sightseeing with realistic pacing to avoid vacation burnout
- Incorporate local events and seasonal factors into recommendations
For developers and businesses, TripCraft provides the tools to build the next generation of travel services. We're already seeing this technology beginning to transform how we plan trips. Remember Wanderboat, which we covered in a previous edition? Their AI travel planning assistant aims to eliminate the stress of vacation coordination. With benchmarks like TripCraft, tools like Wanderboat can potentially evolve from simple recommendation engines to sophisticated planning partners that truly understand real-world constraints.
Major players are investing heavily in this space too. Bing Travel has been enhancing its AI capabilities to create more personalized itineraries, while booking giants like MakeMyTrip and GoIbibo are exploring AI agents that can handle complex multi-city bookings with nuanced user preferences. Even Airbnb has hinted at developing AI tools that could match travelers with experiences based on their unique travel personas.
How soon do you think we'll see AI travel planners that truly understand the difference between a relaxing vacation and an exhausting march through tourist traps?
Are we finally approaching an era where AI can handle the complexities of real-world planning with all its messy constraints?
Share your thoughts with Spark & Trouble on how this technology might transform how we plan our adventures!

10x Your Workflow with AI
Work smarter, not harder! In this section, you'll find prompt templates & bleeding-edge AI tools to free up your time.
Fresh Prompt Alert! 🚨
Tired of being just another fish in the sea? This week's prompt transforms your average Joe product into the undisputed heavyweight champion of its domain.
It's like giving your offering a secret sauce so irresistible, competitors can only watch in awe. Whether you're launching a SaaS tool or selling handcrafted soap, this 7-step metamorphosis will have customers saying "shut up and take my money!"
Ready to create your own mini-monopoly? Let's get started!
Give me a 7-step checklist that turns a basic product or service into a "category of one" offer - so distinct, so uncopyable, it becomes immune to price wars, competition, or customer doubt. Think Rolls Royce meets red pill. Prestige baked in.
Here are some details about [your product/service]:
[Add details like description, pricing, audience, etc.]
Convert this into such a "category of one" offer, using the checklist created.
3 AI Tools You JUST Can't Miss 🤩
- Hedy: Your AI meeting coach for every conversation
- Playground: Design anything like a pro
- Arize Phoenix: An open-source AI observability platform for experimentation & evaluation

Spark 'n' Trouble Shenanigans
"Wait… did Google just teach an AI to watch a screen recording and build an app from it?"
Yup. And Trouble has been nervously re-reading the job description ever since. Spark, of course, is already screen-recording our Notion dashboard to turn it into a to-do list app with snacks integration.
Recently, Google dropped Gemini 2.5 Pro Preview I/O Edition (because "Gemini 3" was apparently too obvious). It's fast, clever, and currently flexing at the top of the WebDev Arena leaderboard (yes, that's a thing).
Imagine uploading a screen recording, whispering "Clone this", and boom: working frontend.
Google basically gave all devs a turbocharged sidekick. It's better at function calls, more fluent with UI-heavy apps, and shockingly decent at going from "idea in your brain" to "working prototype" without throwing a tantrum mid-sprint.
Is it perfect? Nah. It's still a bit verbose and occasionally acts like it knows better than you.
But this video-to-app trick? Feels like magic. Try it before your PM does.
Try it out on Google AI Studio

Well, that's a wrap!
