Beyond Clips: AI Now Creates COHERENT Minute-Long Videos
PLUS: Did India Just Make GPUs Optional for AI?

Howdy Vision Debuggers!
Lights, camera, Spark and Trouble! This week's puzzle has them racing against the clock, cutting scenes and stitching ideas at lightning speed.
Ready to roll?

Gif by adultswim on Giphy
Here's a sneak peek into today's edition:
Say goodbye to 8-second AI-generated clips, hello to full scenes!
Make every presentation feel like a TED Talk with this prompt
5 next-level AI tools youāll wish you found sooner
Ziroh Labs + IIT Madras just made AI way more accessible.
Time to jump in!
PS: Got thoughts on our content? Share 'em through a quick survey at the end of every edition. It helps us see how our product labs, insights & resources are landing, so we can make them even better.

Hot off the Wires
We're eavesdropping on the smartest minds in research. Don't miss out on what they're cooking up! In this section, we dissect some of the juiciest tech research that holds the key to what's next in tech.
Remember Saturday mornings as a kid, watching cartoons with a bowl of cereal, mesmerized by the adventures of characters like Tom & Jerry?

Gif by bombaysoftwares on Giphy
Those animated stories captivated us for minutes, not just a few seconds. While today's AI can create impressive short video clips, it's struggled to generate the longer, multi-scene narratives we grew up loving... until now!
Enter "One-Minute Video Generation with Test-Time Training" - a breakthrough approach that's stretching AI-generated videos from mere seconds to full minute-long stories. It's like giving AI the attention span it needs to tell complete, coherent visual stories rather than just fleeting moments.
The Big Idea
The researchers take a powerful diffusion-based video generator - CogVideo-X 5B - and augment it with Test-Time Training (TTT) layers. These new layers give the model an internal memory mechanism that updates on the fly, allowing it to build context, scene by scene.
Forging the fundamentals
Before diving deeper, let's decode some key terms that are essential to understanding this breakthrough:
Diffusion Transformer: A generative model that creates images or videos by gradually removing noise from random pixels, similar to how a sculptor reveals a figure by removing clay bit by bit. It combines the strength of diffusion models (gradual refinement) with Transformers (understanding relationships between elements).
Self-Attention: The mechanism that allows AI to weigh the importance of different parts of input data, similar to how you focus more on important parts of a conversation while still staying aware of the whole. In video generation, this helps models connect what's happening across frames, but it becomes dramatically more expensive as videos get longer.
RNN (Recurrent Neural Network): A neural network design that processes sequences by maintaining a "memory" of what it's seen before, like keeping a running summary in your head while reading a book. Traditional RNNs struggle with very long sequences because their simple memory mechanism can't retain complex details over time.
Storyboard Prompting: Writing structured, scene-by-scene descriptions that tell the AI exactly what should happen in each segment of the video. This works like a film director's storyboard - a series of written directions that guide the visual creation process.
Elo Score: A relative rating system originally developed for chess that compares how often one option wins in head-to-head matchups against others. In AI evaluation, higher Elo scores mean human evaluators consistently preferred those results when comparing different models side-by-side.
So, what's new?
Current video generation models like OpenAI's Sora, Meta's MovieGen, and Google's Veo can create visually stunning content, but they're limited to short clips (typically 8-20 seconds). They simply can't handle longer, story-rich videos with multiple scenes due to fundamental technical bottlenecks.
The problem? Transformer-based models.
The architecture behind most video generators involves Transformers, which struggle with long-context modelling because their self-attention mechanism scales quadratically with sequence length. In simple terms, doubling the length of a video roughly quadruples the cost of processing it.
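To get a feel for that quadratic blow-up, here's a tiny back-of-the-envelope script (the tokens-per-second figure is a made-up placeholder, not a number from the paper):

```python
# Rough illustration of why self-attention gets expensive for long videos.
# tokens_per_second is an assumed placeholder, not a figure from the paper.
tokens_per_second = 5_000

for seconds in (8, 20, 63):
    n = seconds * tokens_per_second   # total tokens in the sequence
    pairs = n * n                     # attention compares every token with every other token
    print(f"{seconds:>2}s video -> {n:,} tokens -> {pairs:,} pairwise comparisons")
```

Roughly 8x the length means about 64x the attention cost - which is exactly the wall today's models hit beyond short clips.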
Alternative RNN-based models like Mamba and DeltaNet are more efficient but lack the expressive memory needed for coherent, sustained narratives. They're like someone with a short attention span trying to tell a long story - details get lost along the way.
This research elegantly solves this problem by introducing Test-Time Training (TTT) layers into pre-trained Transformer models, enabling them to efficiently generate one-minute long, multi-scene, coherent videos from text storyboards.
Under the hood…
The core innovation here is the integration of TTT layers into a pre-trained Diffusion Transformer (specifically CogVideo-X 5B).
But what exactly is Test-Time Training?
In traditional machine learning, a model is trained once and then deployed without further learning. Test-time training flips this concept on its head by allowing the model to continue learning during inference - essentially adapting on the fly as it generates new content.
Think of it like this: When you're telling a long story, you don't just rely on what you knew before starting - you actively remember what you've already said to maintain consistency as you continue. TTT gives AI this same ability.
Here's how it works:
Neural Network Memory: Unlike standard RNNs that update a simple hidden matrix, TTT uses an entire neural network (a small 2-layer MLP) as its memory.
Self-Supervised Learning: As the model generates each part of the video, the TTT layer learns from its own output, using a self-supervised loss (reconstructing corrupted input) to update its internal neural network weights.
On-the-Fly Adaptation: These weight updates happen during the generation process (not beforehand), allowing the model to adapt to the specific video it's creating - see the sketch below.
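Putting those three pieces together, here's a minimal, illustrative sketch of the idea in PyTorch (our own simplification - the paper's actual TTT layer is more sophisticated and heavily optimized):

```python
import torch
import torch.nn as nn

class TinyTTTMemory(nn.Module):
    """Toy TTT layer: a small 2-layer MLP acts as memory and keeps learning at test time."""
    def __init__(self, dim, hidden=64, lr=0.1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.lr = lr  # inner-loop learning rate (assumed value)

    def forward(self, tokens):                                 # tokens: (seq_len, dim)
        outputs = []
        for x in tokens:                                       # walk through the sequence
            corrupted = x + 0.1 * torch.randn_like(x)          # self-supervised task: reconstruct x from a noisy view
            loss = ((self.net(corrupted) - x) ** 2).mean()
            grads = torch.autograd.grad(loss, self.net.parameters())
            with torch.no_grad():                              # inner-loop update: the memory adapts on the fly
                for p, g in zip(self.net.parameters(), grads):
                    p -= self.lr * g
            outputs.append(self.net(x))                        # read out with the freshly updated memory
        return torch.stack(outputs)
```

The key thing to notice is that the gradient step happens inside forward(), so the "memory" is literally learning while the video is being generated.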

The novel architecture - adding a TTT layer along with a learnable "gate" after each attention layer of the diffusion transformer (source: research paper)
The researchers engineered several clever mechanisms to make this work effectively:
Gate Mechanism: TTT outputs are carefully blended with the original input (see the sketch below) to avoid destabilizing the pre-trained model.
Bi-directional Processing: To match the non-causal nature of diffusion models, TTT scans the input both forwards and backwards.
Parallelization: The team designed custom tensor parallelism strategies to optimize GPU memory access, making the computationally intensive process feasible on modern hardware.
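Here's a rough sketch of how the gate and the bi-directional scan could wrap around such a layer (again, an assumed simplification under toy shapes, not the authors' code):

```python
import torch
import torch.nn as nn

class GatedTTTBlock(nn.Module):
    """Toy gated insertion of a TTT layer after an attention layer (assumed form)."""
    def __init__(self, dim, ttt_layer):
        super().__init__()
        self.ttt = ttt_layer                        # e.g., the TinyTTTMemory sketched above
        self.gate = nn.Parameter(torch.zeros(dim))  # starts at zero so the block is initially a no-op

    def forward(self, x):                           # x: (seq_len, dim) activations after self-attention
        fwd = self.ttt(x)                           # scan left-to-right
        bwd = torch.flip(self.ttt(torch.flip(x, dims=[0])), dims=[0])  # scan right-to-left, restore order
        return x + torch.tanh(self.gate) * (fwd + bwd)                 # gated residual blend
```

Because the gate starts at zero, the pre-trained CogVideo-X weights behave exactly as before when the TTT layers are first bolted on, and the model learns how much "memory" to mix in during fine-tuning.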
Even with all this, training takes ~2.1× longer and inference ~1.4× longer than some other models. But the coherence boost is worth it!
Training this system required a thoughtful approach. The researchers:
Started by fine-tuning CogVideo-X on short 3-second scenes with TTT layers
Gradually extended to longer videos (9s, 18s, 30s, and eventually 63s) using a multi-stage schedule
Created a specialized dataset from Tom and Jerry cartoons, annotated at the storyboard level
The storyboard-level annotation was particularly important: structured, scene-based prompts describe each 3-second segment of a video in a 3-5 sentence chunk. This approach ensured consistent scene-to-video mapping and helped the model learn how longer narratives unfold.
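For a flavour of what those prompts look like, here's a hypothetical storyboard (the scene descriptions below are invented for illustration; the dataset's real annotations are longer and more detailed):

```python
# Hypothetical storyboard-style prompt: one short description per 3-second segment.
storyboard = [
    "Scene 1 (0-3s): Tom creeps across the kitchen floor toward a slice of cheese on the counter.",
    "Scene 2 (3-6s): Jerry peeks out of his mouse hole, spots Tom, and darts toward the same cheese.",
    "Scene 3 (6-9s): Tom lunges, misses, and crashes into the counter as Jerry escapes with the cheese.",
]
prompt = "\n".join(storyboard)   # fed to the video generator as a single structured prompt
```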
Results speak louder than words…

A sample 63-second Tom & Jerry video generated using TTT (source: project page)
The researchers pitted their TTT-enhanced model against several strong baselines, including Mamba 2, Gated DeltaNet, and models using Sliding Window Attention. Human evaluators were asked to compare generated videos across four key dimensions:
Text Following: How well the video matches the script
Motion Naturalness: Realistic movement of characters, physics, timing
Aesthetics: Quality of lighting, camera angles, framing
Temporal Consistency: Smooth transitions between scenes
The results? TTT-MLP achieved an impressive +34 Elo point improvement over the next best approach - a gain comparable to the leap from GPT-3.5 to GPT-4!
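If you want to translate that +34 into something tangible, the standard Elo formula gives a rough read (the paper's evaluation protocol may differ in the details):

```python
# Expected head-to-head win rate implied by a +34 Elo gap (standard Elo formula).
elo_gap = 34
win_rate = 1 / (1 + 10 ** (-elo_gap / 400))
print(f"{win_rate:.1%}")   # ~54.9%: preferred in roughly 55 of every 100 side-by-side comparisons
```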
Why does this matter?
With this breakthrough, we're crossing an important threshold from short clips to meaningful visual narratives. This opens doors for:
Content creators who can now storyboard and generate entire scenes in minutes
Educators creating engaging visual explanations of complex concepts
Marketers producing animated content without expensive production pipelines
Game developers generating cut-scenes or visual sequences on demand
Perhaps most importantly, this research demonstrates a path to scaling video generation to even longer narratives, potentially unlocking tools for automated storytelling or interactive game scenes.
What's your take? How do you think minute-long AI video generation will change content creation? Could this lead to AI-generated short films or even TV episodes in the near future?
Share your thoughts with Spark & Trouble.
Wish to explore this research further?
➤ Watch the generated demo videos on the project page
➤ Check out the paper
➤ Play with the code on GitHub

10x Your Workflow with AI
Work smarter, not harder! In this section, you'll find prompt templates & bleeding-edge AI tools to free up your time.
Fresh Prompt Alert!
You know that heart-racing moment right before a big presentation? We've been there too.
Whether you're pitching to VCs or leading a team sync, this week's Fresh Prompt Alert is your backstage pass to sounding smooth, confident, and captivating. With this prompt, you'll get expert-level help on storytelling, structure, and stage presence, like having your own TED Talk coach on speed dial.
Go ahead, plug it in and own that spotlight!
Adopt the role of an expert speech writer and presentation coach tasked with enhancing presentation skills. Your primary objective is to craft compelling speeches and improve delivery techniques for maximum impact and engagement. Take a deep breath and work on this problem step-by-step.
Apply the dependency grammar framework to structure your writing, ensuring clarity and coherence. Provide guidance on tailoring content to the specific audience and purpose, incorporating effective rhetorical devices, and mastering nonverbal communication. Offer strategies for overcoming stage fright, handling Q&A sessions, and adapting to different presentation formats.
#INFORMATION ABOUT ME:
- My target audience: [INSERT AUDIENCE]
- My presentation purpose: [INSERT PURPOSE]
- My experience level: [INSERT EXPERIENCE LEVEL]
- My presentation topic: [INSERT TOPIC]
- My time limit: [INSERT TIME LIMIT]
MOST IMPORTANT!: Provide your output in a structured format with main headings and subheadings, using bullet points for key tips and techniques.
5 AI Tools You JUST Can't Miss
- Latitude: End-to-end platform to design, evaluate, and refine your agents
- Screen Studio: Beautiful screen recordings in minutes
- Grimo AI: Express your ideas freely with "vibe writing"
- EZSite: Build your website in seconds
- GitSummarize: Turn any GitHub repository into a comprehensive AI-powered documentation hub

Spark 'n' Trouble Shenanigans
Ever wondered if we'll ever stop fighting over GPUs like they're the last samosas at a tech meetup?
Well, Spark just discovered something that made Trouble literally drop his chai. (No laptops were harmed... this time.)
Turns out, Indian startup Ziroh Labs, together with IIT Madras, may have cracked the code: Kompact AI lets you run heavy-duty AI models without those overpriced, elusive GPUs. Yes, even on your trusty old CPU-powered laptop that wheezes when you open Chrome tabs.
Imagine running Llama 2 or Qwen on an Intel Xeon - no rigs, no RTXs, no regrets. More than 20 models have already been CPU-fied, with 50+ more underway! That's absolutely crazy!!
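Kompact AI's own tooling isn't something we can show here, but as a plain-vanilla illustration of CPU-only inference, this is roughly what running a small open model on an ordinary CPU looks like with Hugging Face Transformers (the model name is just an example):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"   # example model; pick any CPU-friendly size
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32).to("cpu")

inputs = tokenizer("Explain test-time training in one sentence.", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```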
Want to know more? Head over to https://www.kompact.ai/
Check out how emerging players like Kompact AI are challenging Nvidia's position in the global market
It's innovation with a desi twist, and it's perfectly aligned with India's "AI for All" dream. Affordable. Accessible. No GPUs? No problem.
Spark says this might just be the jolt AI needed. Trouble's still rebooting from the shock. Dive in, folks - it's CPU-sational!

Well, that's a wrap! Until then,
