Epic Dance Duets or Street Fights — InterMask's 3D Character Interactions Handle It All
PLUS: 😱 We Let AI Create Our Portrait from Social Media Posts...
Howdy fellas!
Today, Spark and Trouble are weaving a web where words become actions, and actions form a symphony of human connection.
Intrigued by how the puzzle pieces fit? Let’s take a closer look!
Here’s a sneak peek into today’s edition 👀
🎮 This new AI makes game characters fight, dance, and interact naturally
♻️ Perfect prompt to whip up easy, fun DIY projects to unleash your creativity
🤖 5 awesome AI tools that you JUST can't miss
🧔🏼 [MUST SEE] AI's Brutally Honest Take on Your Social Media Profile
Time to jump in!😄
PS: Got thoughts on our content? Share 'em through a quick survey at the end of every edition. It helps us see how our product labs, insights & resources are landing, so we can make them even better.
Hot off the Wires 🔥
We're eavesdropping on the smartest minds in research. 🤫 Don't miss out on what they're cooking up! In this section, we dissect some of the juiciest tech research that holds the key to what's next in tech.⚡
Remember playing fighting games like Mortal Kombat or Street Fighter? Those epic battles where characters throw punches, dodge kicks, and perform perfectly choreographed moves?
If you've been following our newsletter, you might recall our recent coverage of NVIDIA's MaskedMimic - a breakthrough in generating single-character motion. But here's the thing: making two characters interact naturally? That's a whole different ball game!
Creating realistic interactions between two characters in virtual worlds is an art that's as complex as choreographing a real dance duet!
Think “Person A kicks towards Person B” or “Person A sneaks up behind Person B.” The complexity isn’t just in individual motion—it’s in the interplay, the reaction, and the subtle adjustments one makes based on the other.
Enter InterMask, a groundbreaking framework from the University of Alberta and Snapchat researchers that's taking character interaction generation to the next level!
Forging the fundamentals
Before we dive in, let's decode some key terms:
Vector Quantization (VQ): This is a way to simplify complex information by grouping similar data points into fixed categories, like sorting colors into a limited set of crayons. In the context of InterMask, think of this like creating a dictionary of common movements. Instead of describing every tiny motion, we reference pre-defined "motion words" from our dictionary.
Variational Autoencoder (VAE): It's a neural network that learns to compress data into a simpler form and then recreate it, like a student summarizing a book and then rewriting it.
Motion Token Map: A representation that breaks down motion (like how things move in a video) into small, meaningful pieces, making it easier for AI to understand and process.
Transformer: This is a kind of machine learning model that processes all parts of a text or image at once to understand the relationships and meanings, like solving a puzzle by seeing the whole picture.
Token Masking: This is a trick where parts of data are hidden, so a model can practice guessing the missing pieces, similar to solving a fill-in-the-blanks puzzle.
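If you prefer to see the idea in code, here's a tiny sketch (PyTorch, with made-up sizes) of the vector quantization trick from the first definition: snapping a continuous motion feature onto its nearest "motion word" in the codebook.

```python
import torch

# Hypothetical sizes: a codebook of 512 "motion words", each 256-dimensional
codebook = torch.randn(512, 256)       # learned during training
motion_feature = torch.randn(1, 256)   # one latent vector coming out of the encoder

# Vector quantization = look up the closest codebook entry (Euclidean distance)
distances = torch.cdist(motion_feature, codebook)   # (1, 512) distances
token_id = distances.argmin(dim=-1)                 # the discrete "motion word" index
quantized = codebook[token_id]                       # what the decoder actually sees
```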
So, what’s new?
While diffusion-based models like ComMDM and InterGen have already pushed two-person motion generation forward, getting these interactions right still presents unique hurdles:
Realism is elusive due to overlooked spatial dependencies.
High-quality results require computationally expensive sampling processes.
Existing methods struggle to fully capture the nuanced back-and-forth dynamics of interaction.
Previous methods often struggled with spatial awareness – leading to awkward moments like characters passing through each other (ouch!).
InterMask uses a two-stage approach to model interactions with superior spatial and temporal coherence, ensuring that they remain natural and physically plausible.
Under the hood…
InterMask works its magic in two stages:
Overview of InterMask technique (source: InterMask paper)
VQ-VAE for Motion Representation
First, it uses something called a VQ-VAE (Vector Quantized Variational Autoencoder) to transform complex motion sequences into a simpler format.
Here’s how this works:
Motion data, represented as pose sequences (joints with positions, velocities, and rotations), is fed into the VQ-VAE.
This data is compressed into a latent space, then mapped onto a learnable “codebook,” creating a Motion Token Map—a structured representation of movements over time.
2D convolutions are used to maintain the spatial-temporal richness of interactions.
The compression process balances spatial fidelity (how individuals occupy 3D space) with temporal flow (how they move through time), paving the way for realistic interactions.
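To picture the shapes involved, here's a rough sketch of that first stage (dimensions are purely illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

# Illustrative motion clip: 120 frames, 22 joints, 9 features per joint
# (positions, velocities, rotations)
pose_seq = torch.randn(1, 9, 120, 22)   # (batch, features, time, joints)

# A small 2D conv encoder downsamples time and joints while keeping their 2D structure
encoder = nn.Sequential(
    nn.Conv2d(9, 64, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 256, kernel_size=3, stride=2, padding=1),
)
latents = encoder(pose_seq)             # (1, 256, 30, 6): a grid of latent vectors
# Each of those 30 x 6 latent vectors is then snapped to its nearest codebook entry,
# giving the 2D "motion token map" over time and body parts
```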
During training, the model uses two main objectives:
Vector Quantization Loss: This helps the model accurately recreate motion and ensures the motion data is stored efficiently in a set of fixed codes.
Geometric Loss: This ensures the predicted motion follows natural joint movements and maintains realistic distances between connected joints.
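In spirit, those two objectives could look something like this (a simplified sketch; the exact loss terms, weights, and skeleton definition in the paper differ, and the bone list here is a placeholder):

```python
import torch
import torch.nn.functional as F

def vq_vae_losses(pred_pose, gt_pose, z_e, z_q, bones, beta=0.25):
    """Simplified stand-ins for the two training objectives.
    pred_pose, gt_pose: (frames, joints, 3) reconstructed vs. ground-truth joint positions
    z_e: encoder output before quantization; z_q: its quantized version
    bones: list of (parent, child) joint index pairs (placeholder skeleton)
    """
    # Vector quantization loss: keep encoder outputs close to their codebook entries
    vq_loss = F.mse_loss(z_q, z_e.detach()) + beta * F.mse_loss(z_q.detach(), z_e)

    # Geometric loss: reconstruct joint positions and preserve bone lengths
    bone_len = lambda pose: torch.stack(
        [(pose[:, child] - pose[:, parent]).norm(dim=-1) for parent, child in bones], dim=-1)
    geo_loss = F.mse_loss(pred_pose, gt_pose) + F.mse_loss(bone_len(pred_pose), bone_len(gt_pose))

    return vq_loss + geo_loss
```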
Inter-M Transformer for Interaction Modeling
Then comes the clever part: the Inter-M Transformer, which learns to predict the motion tokens produced by the VQ-VAE jointly for both characters.
It's like having an AI choreographer that understands how two people should move together.
Here, the researchers introduced innovations at two levels:
Innovative masking techniques:
Random Masking: Randomly hides parts of the motion tokens to train the model to fill gaps intelligently
Interaction Masking: Focuses entirely on one individual’s motion while masking the other, forcing the model to understand mutual influence
Step Unroll Masking: Iteratively re-masks the least confident predictions so they can be refined over successive passes
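Here's a minimal sketch of what those three masking schemes might look like on motion token tensors (the `MASK_ID` value and shapes are illustrative, not taken from the paper's code):

```python
import torch

MASK_ID = 1024  # hypothetical id reserved for the special [MASK] token

def random_masking(tokens, ratio=0.5):
    """Hide a random subset of a person's motion tokens."""
    hide = torch.rand(tokens.shape) < ratio
    return torch.where(hide, torch.full_like(tokens, MASK_ID), tokens)

def interaction_masking(tokens_a, tokens_b):
    """Keep person A fully visible and mask person B entirely,
    forcing the model to infer B's motion from A (and the text prompt)."""
    return tokens_a, torch.full_like(tokens_b, MASK_ID)

def step_unroll_remask(tokens, confidences, keep_ratio):
    """Re-mask the least confident predictions so the next pass can refine them."""
    n_keep = max(1, int(keep_ratio * confidences.numel()))
    threshold = confidences.flatten().topk(n_keep).values.min()
    return torch.where(confidences >= threshold, tokens, torch.full_like(tokens, MASK_ID))
```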
Drill-down into the Inter-M Transformer (source: InterMask paper)
Innovations in the Transformer blocks:
Shared Spatio-Temporal Attention: Ensures the model grasps how individuals share physical space over time.
Cross-Attention Layers: Captures how one person’s actions affect the other.
Text Conditioning: Interprets input prompts (like “Person A punches Person B”) using CLIP-based embedding for contextual realism.
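Put together, a heavily simplified Inter-M-style block might look like this (our sketch, not the paper's exact architecture; layer norms, residual connections, and feed-forward layers are omitted):

```python
import torch
import torch.nn as nn

class InteractionBlock(nn.Module):
    """Toy two-person transformer block: shared attention + cross-attention."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.shared_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, person_a, person_b, text_emb):
        # Shared spatio-temporal attention: both people's tokens attend jointly,
        # so the model sees how they occupy space over time
        joint = torch.cat([person_a, person_b], dim=1)
        joint, _ = self.shared_attn(joint, joint, joint)
        a, b = joint.chunk(2, dim=1)

        # Cross-attention: each person attends to the other person's tokens
        # plus the CLIP text embedding of the prompt
        ctx_a = torch.cat([b, text_emb], dim=1)
        ctx_b = torch.cat([a, text_emb], dim=1)
        a, _ = self.cross_attn(a, ctx_a, ctx_a)
        b, _ = self.cross_attn(b, ctx_b, ctx_b)
        return a, b
```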
Inferencing
The inference process begins with a fully masked sequence and a text prompt describing the interaction.
Inference process in InterMask (source: InterMask paper)
First, the Inter-M transformer generates all tokens for each individual over multiple iterations. Next, the tokens are dequantized (using the ‘codebook’ in reverse) and decoded to generate motion sequences using the VQ-VAE decoder.
This efficient approach keeps computational demands low:
Parameters: Only 74M, significantly lighter than most models in its league.
Inference time: Just 0.8 seconds, making it both fast and scalable.
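End to end, the generation loop might look roughly like this (function names and signatures are illustrative, not the authors' released API):

```python
import torch

@torch.no_grad()
def generate_interaction(transformer, decoder, text_emb, seq_len, n_iters=10, mask_id=1024):
    """Hypothetical sketch: start fully masked, fill in tokens over a few
    iterations, then decode the tokens back into two motion sequences."""
    tokens = torch.full((2, seq_len), mask_id)       # token maps for person A and person B

    for step in range(n_iters):
        logits = transformer(tokens, text_emb)       # (2, seq_len, codebook_size)
        conf, pred = logits.softmax(-1).max(-1)      # confidence + best token per position

        # Step-unroll masking at inference time: keep more tokens each iteration,
        # re-masking the least confident ones so the next pass can refine them
        n_keep = max(1, int((step + 1) / n_iters * seq_len))
        threshold = conf.topk(n_keep, dim=-1).values[:, -1:]
        tokens = torch.where(conf >= threshold, pred, torch.full_like(pred, mask_id))

    # Dequantize via the codebook and run the VQ-VAE decoder to get joint trajectories
    return decoder(tokens[0]), decoder(tokens[1])
```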
Why does this matter?
Here’s why InterMask is making waves:
State-of-the-art realism: It outperforms existing models on various datasets with fluid, natural motion
Human evaluation: Comprehensive user studies on Amazon Mechanical Turk reveal that users prefer InterMask’s outputs to prior methods
Robustness: Works well across body types and scales effectively with data
Check out some of these cool results 👇🏼
Text prompt: "Both play rock paper scissors with their right hands"
Text prompt: "Both are performing synchronized dance moves"
The key highlight of this approach is InterMask's ability to perform reaction generation, where one individual's motion is generated conditioned on a provided reference motion of the other, with or without a text description.
While occasional body penetration and motion biases remain challenges, the results are leaps ahead of existing methods.
InterMask is already showing promise across multiple industries:
Game developers like Epic Games and Ubisoft can use it to create more realistic character animations and interactions in video games
Animation studios like Pixar could speed up their character interaction animation workflows
VR companies like Meta could enhance social VR experiences with more natural avatars
Even robotics companies could use it to teach robots more natural human interaction patterns
Want to dive deeper into the world of InterMask?
➤ Check out the full research paper for all the technical details
➤ Stay tuned for the code, which should be available soon…
Next time you watch two characters interact seamlessly in a game or movie, remember - there might be a bit of InterMask magic making it all possible! 🎮✨
What do you think? Let Spark and Trouble know—they love hearing about what excites you in tech!
10x Your Workflow with AI 📈
Work smarter, not harder! In this section, you’ll find prompt templates 📜 & bleeding-edge AI tools ⚙️ to free up your time.
Fresh Prompt Alert!🚨
Ever feel like your inner DIY enthusiast is ready to shine, but you’re stuck recycling the same old ideas (literally)?
This week’s Fresh Prompt Alert is here to save the day! 🎨✨
Say hello to your personal DIY muse—this prompt whips up creative, skill-level-friendly projects using basic materials you already have. From beginner-friendly crafts to intermediate masterpieces, it’s got step-by-step magic to transform your Sunday afternoons into craft ventures.
Time to roll up those sleeves and get creating! 👇
Act as a creative DIY Project Idea Generator for [target audience]
Generate innovative DIY project ideas using [materials/tools, e.g., "recycled materials," "basic household items"]. The projects should be suitable for [skill level, e.g., "beginners," "intermediate crafters"] and take approximately [time duration, e.g., "1-2 hours"] to complete.
Provide step-by-step instructions, a list of required materials, and any safety precautions. Include variations or enhancements for each project to cater to different preferences or skill levels.
Present the ideas in a clear, engaging, and easy-to-follow format.
5 AI Tools You JUST Can't Miss 🤩
🔊 Vocera: Build production-ready voice agents 10 times faster
🤖 FullContext: Effortlessly engage and demo leads with AI-powered interactive tours
📄 PaperGen: Get fully-referenced, charted, long-form papers with one click
🎙️ Sona: Captures your conversations and provides insights that matter most to you
🎥 AI Studios: Realistic AI avatars, natural text-to-speech, and powerful AI video editing capabilities all in one platform
Spark 'n' Trouble Shenanigans 😜
Ever wondered what AI thinks you look like based on your social media persona? Well, Trouble couldn't resist putting this to the test!
After seeing a hilarious Reddit thread where ChatGPT roasted someone's appearance based on their posts (talk about AI throwing shade! 😆), our data-loving troublemaker decided to feed some of his LinkedIn posts to ChatGPT.
The result? A surprisingly on-point visualization that captured Trouble's vibe (though Spark insists the real Trouble is way cooler 😎).
Here is the image generated by ChatGPT based on Trouble’s social media presence
We're pretty sure adding that "reasoning" prompt helped avoid any AI sass – though we kind of missed out on the roast!
Ready to see if AI can nail your digital doppelganger or hilariously miss the mark?
Give it a shot and share your results with us! Refer to this sample below 👇🏼
Well, that’s a wrap! Until next time!