Microsoft's Latest Model Makes Faces Move and Talk Like Never Before!

Say Hello to the Future of Virtual Assistants

Howdy fellas!

Once again, Spark and Trouble with the tech & tips to keep you clicking! Buckle up for an insightful edition!

Here’s a sneak peek into this edition 👀

  • What’s the tech behind Microsoft’s powerful VASA-1 to generate top-notch digital avatars?

  • Ditch the vague aspirations & achieve laser focus with our SMART goal prompt

  • You can’t miss out on this week’s 3 top AI tools if you’re a working professional

Time to jump in!😄

PS: Got thoughts on our content? Share 'em through a quick survey at the end of every edition. It helps us see how our product labs, insights & resources are landing, so we can make them even better.

Hot off the Wires 🔥

We're eavesdropping on the smartest minds in research. 🤫 Don't miss out on what they're cooking up! In this section, we dissect some of the juiciest tech research that holds the key to what's next in tech.⚡

Imagine a time when digital avatars interact with us naturally, with a subtle & rich array of non-verbal facial & behavioral cues to articulate emotions, convey unspoken messages, and foster empathetic connections. AI's been inching us closer to this future, and lately it's on fire. Just in the past few months, there's been a buzz about...

  • Tools like Character AI that enable you to chat with bots that emulate celebrities like Steve Jobs (as well as characters like Steve Rogers)

  • Text2Speech tools like Speechify & ElevenLabs that can now replicate your voice & generate lifelike voiceovers

  • Cutting-edge models like Samsung AI’s MegaPortraits & NVIDIA’s Face-Vid2Vid are able to create pretty convincing talking head videos, given any facial image & a “driver” video to mimic

Amidst this hot landscape, researchers at Microsoft recently announced their latest innovation: VASA-1. Feed it a picture and some audio, and out pops a video of that person seemingly having a conversation!

Super-realistic talking head videos generated by VASA-1

So, what’s new with VASA-1?

We all know the human face is a powerful tool. A raised eyebrow can express skepticism, a furrowed brow concern, and a wide smile, well, you get the idea. That’s why creating realistic talking face videos has often been tricky.

Previously, creating talking face videos involved feeding the model a source video along with the source portrait. This approach, while achieving good lip-sync precision, often fell short in capturing the subtle nuances of human expressions and head movements. Moreover, these systems also demanded substantial computation power, making them unsuitable for interactive use cases.

That's the gap VASA-1 aims to bridge. It can generate high-quality (512×512 resolution), hyper-realistic talking face videos at extremely low latencies for real-time applications (40 frames/second in streaming mode, with a starting latency of just 170ms) - all from just a single portrait image & a speech audio clip.

Under the hood…

How does VASA-1 translate audio into these facial movements for the source image? To understand, let’s familiarize ourselves with a few interesting concepts:

  • Disentangled Face Latent Space

  • Audio-Conditioned Motion Latent Diffusion

That’s definitely a mouthful to say! But stick with us & we’ll make it intuitive for you…

Disentangled Face Latent Space

AI models represent information like text, images, etc. in an encoded format with numbers (often called ‘embeddings’) in high-dimensional space, where similar items are closer (for example, words “cat” & “dog” would be closer compared to “cat” & “soccer”). This high-dimensional space is known as latent space.

Visual representation of how stuff maps from input space to latent space (source: baeldung.com)
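The "closer in latent space" idea can be sketched with toy embeddings and cosine similarity. The vectors below are made up purely for illustration; real embeddings come from a trained model:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings (hypothetical values, just for illustration)
embeddings = {
    "cat":    np.array([0.9, 0.8, 0.1, 0.0]),
    "dog":    np.array([0.8, 0.9, 0.2, 0.1]),
    "soccer": np.array([0.1, 0.0, 0.9, 0.8]),
}

sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_soccer = cosine_similarity(embeddings["cat"], embeddings["soccer"])
print(sim_cat_dog > sim_cat_soccer)  # "cat" sits closer to "dog" than to "soccer"
```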

When considering applications involving faces, the latent space focuses on capturing the identity of a person. So, different pictures of the same person (smiling, frowning, with glasses) will have codes in this face latent space that are close together. Consider the following as important latent variables to construct a face latent space:

  • Appearance features (think of details like skin and eye colors)

  • Identity code (key facial landmarks like eye positions, nose shape, jawline)

  • Head pose features (these include rotation & translation of the head)

  • Facial dynamics code (blinking, lip motion, expression, etc.)
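One way to picture these four latent variables is as separate vectors bundled together per face. The dimensions below are hypothetical, chosen only to make the structure concrete:

```python
from dataclasses import dataclass
import numpy as np

# Dimensions here are made up for illustration; the paper doesn't tie us to them.
@dataclass
class FaceLatents:
    appearance: np.ndarray       # skin & eye color, other appearance details
    identity: np.ndarray         # facial landmarks: eye positions, nose shape, jawline
    head_pose: np.ndarray        # rotation & translation of the head
    facial_dynamics: np.ndarray  # blinking, lip motion, expression, etc.

latents = FaceLatents(
    appearance=np.zeros(128),
    identity=np.zeros(64),
    head_pose=np.zeros(6),        # e.g. 3 rotation + 3 translation parameters
    facial_dynamics=np.zeros(32),
)
print(latents.head_pose.shape)  # (6,)
```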

Disentanglement in this space means separating facial identity from other features like expressions & movement, so you can change a smile into a frown, or a head looking straight to one gazing right, without changing the person’s face.

VASA-1 builds on something known as a “3D-aid face reenactment framework” to construct such a disentangled face latent space. Simply put, it contains an encoder that takes portrait images & converts them into the latent variables, along with a decoder that tries to reconstruct the actual face given the latent variables. To achieve good disentanglement between these features, this encoder-decoder model is trained by swapping latent variables between different faces & trying to reconstruct faces with different appearances & poses with a high degree of accuracy.
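The latent-swapping idea behind that training can be sketched as follows. The encoder here is a stand-in stub (not the real 3D-aid reenactment network); the point is how one latent variable travels between faces while the others stay put:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(image):
    """Stub encoder: in VASA-1 this maps a portrait to disentangled latents."""
    return {
        "identity": rng.normal(size=64),
        "head_pose": rng.normal(size=6),
        "dynamics": rng.normal(size=32),
    }

def swap(latents_a, latents_b, key):
    """Swap one latent variable between two faces, keeping the rest intact."""
    out_a, out_b = dict(latents_a), dict(latents_b)
    out_a[key], out_b[key] = latents_b[key], latents_a[key]
    return out_a, out_b

face_a, face_b = encode("portrait_a.png"), encode("portrait_b.png")
swapped_a, swapped_b = swap(face_a, face_b, "head_pose")

# Identity stays put while head pose travels across faces:
print(np.allclose(swapped_a["identity"], face_a["identity"]))   # True
print(np.allclose(swapped_a["head_pose"], face_b["head_pose"])) # True
```

During training, the decoder must reconstruct these "mixed" faces accurately, which is what forces the latents to stay cleanly separated.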

Audio-Conditioned Motion Latent Diffusion

VASA-1 uses audio as a driver instead of video (used in earlier research). This is an interesting challenge because now one must deduce what potential motion & expressions accompany the corresponding audio, instead of directly relying on the motion & expressions depicted in the video frames. To appreciate how VASA-1 accomplishes this, let’s first understand text-guided image diffusion - the innovation that powers image generators like Dall-E 2 & Stable Diffusion.

Denoising Diffusion in action (source: learnopencv.com)

Diffusion is a process where data is gradually transformed into noise, while a model simultaneously learns to reverse that transformation to create new data samples. For image generation, the model learns this process by adding noise to real images and then training to remove it step-by-step (post training, the reverse ‘denoising’ pathway is used to generate images from pure noise). Text-guided image diffusion models like GLIDE and Dall-E use features from text prompts to guide the generation of images from noise, creating visuals that align with the described text.
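The forward "noising" half of diffusion can be sketched in a few lines. The linear schedule below is a toy choice for illustration, not the schedule used by any of these models:

```python
import numpy as np

def add_noise(x0, t, num_steps=1000):
    """Forward diffusion: blend clean data x0 with Gaussian noise at step t.
    alpha shrinks toward 0 as t grows, so the signal fades into pure noise."""
    alpha = 1.0 - t / num_steps          # toy linear schedule
    noise = np.random.default_rng(t).normal(size=x0.shape)
    return np.sqrt(alpha) * x0 + np.sqrt(1.0 - alpha) * noise

x0 = np.ones((8, 8))                     # a stand-in "image"
slightly_noisy = add_noise(x0, t=10)
very_noisy = add_noise(x0, t=990)

# Early steps stay close to the data; late steps are almost pure noise.
print(np.abs(slightly_noisy - x0).mean() < np.abs(very_noisy - x0).mean())
```

The generative model is trained to run this process backwards, predicting and removing the noise one step at a time.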

In the above process, replace the text features with audio features extracted from the audio clip using the Wav2Vec 2.0 model; instead of an image, you want to obtain motion sequences (a combination of the latent variables for head pose & facial dynamics) corresponding to that audio. This is exactly what audio-conditioned motion latent diffusion is all about. VASA-1 performs this using a transformer architecture.

Training pipeline for Motion Latent Diffusion (source: VASA-1 Paper)
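Schematically, the conditioning swap looks like this. Both functions below are stand-in stubs (for Wav2Vec 2.0 and the trained transformer denoiser, respectively); the shapes and step counts are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def extract_audio_features(audio_clip, num_frames=40):
    """Stub for Wav2Vec 2.0: one audio feature vector per video frame."""
    return rng.normal(size=(num_frames, 16))

def denoise_step(noisy_motion, audio_features, t):
    """Stub for the transformer denoiser: nudges the motion sequence toward
    something cleaner, conditioned on the audio features (t unused in this toy)."""
    return 0.9 * noisy_motion + 0.1 * audio_features.mean(axis=1, keepdims=True)

audio_features = extract_audio_features("speech.wav")
motion = rng.normal(size=(40, 38))   # head pose (6) + facial dynamics (32), per frame
for t in reversed(range(50)):        # run the reverse (denoising) chain
    motion = denoise_step(motion, audio_features, t)

print(motion.shape)  # one motion latent per video frame: (40, 38)
```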

For inference, VASA-1 works by first encoding the facial appearance & identity from the source portrait, obtaining the motion sequences from the audio clip using diffusion & using the trained decoder to generate the video frame by combining the appearance, identity & motion latent variables. Pretty cool, right?

Inference pipeline for Video Generation (source: VASA-1 Paper)
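Stitching the three stages together, the inference loop can be sketched as follows. Every function here is a stand-in for one of VASA-1's trained networks, with hypothetical latent sizes:

```python
import numpy as np

rng = np.random.default_rng(7)

def encode_portrait(image):
    """Stub: extract appearance & identity latents from the source portrait."""
    return rng.normal(size=128), rng.normal(size=64)

def diffuse_motion(audio_clip, num_frames):
    """Stub: audio-conditioned diffusion yields one motion latent per frame."""
    return rng.normal(size=(num_frames, 38))

def decode_frame(appearance, identity, motion_t):
    """Stub: the trained decoder renders one 512x512 RGB frame."""
    return np.zeros((512, 512, 3))

appearance, identity = encode_portrait("portrait.png")
motions = diffuse_motion("speech.wav", num_frames=40)   # ~1s of video at 40 fps
video = [decode_frame(appearance, identity, m) for m in motions]

print(len(video), video[0].shape)  # 40 frames of 512x512 video
```

Note how appearance & identity are encoded once from the portrait, while only the motion latents change frame to frame.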

Why does this matter?

The use of a highly disentangled face latent space is a game changer. The result? Expressive talking faces that move naturally, regardless of whether the source image is a real photo, a cartoon, or even a completely generated face!

VASA-1 gracefully handles portrait types that were not present during training

The use of audio-conditioned motion latent diffusion makes it possible for VASA-1 to generate videos even for non-English audio (although it was trained only on English audio).

It can also construct arbitrary-length videos by chunking audio clips into windows of fixed size & stitching the videos generated from these windowed audio clips.
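The chunk-and-stitch idea can be sketched as below. The window size is made up for illustration, and this toy version skips the smoothing between neighboring windows that keeps real stitched video seamless:

```python
def chunk_audio(num_samples, window=16000, hop=16000):
    """Split an audio clip into fixed-size (start, end) windows over its samples.
    The final window is simply clipped to the end of the clip."""
    windows = []
    start = 0
    while start < num_samples:
        windows.append((start, min(start + window, num_samples)))
        start += hop
    return windows

# A 3.5-second clip at 16 kHz -> four 1-second windows (the last one shorter)
windows = chunk_audio(num_samples=56000)
print(windows)
```

Each window is fed through the generation pipeline independently, and the resulting short clips are concatenated into one arbitrary-length video.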

Check out the real-time demo of VASA-1 in action

But the best part? It can do this in real-time - making it perfect for video calls, virtual assistants & customer-service avatars that come alive on screen, and even personalized educational tools with avatars that speak in a student’s native language or adjust their tone based on their learning needs. Content creation through visual storytelling, filmmaking & animation can also benefit tremendously from such technology.

Also, being announced just weeks after Google announced its VLOGGER model to transform static images into dynamic lifelike videos, looks like the AI Battle Royale in this space is ON!

Of course, the researchers at Microsoft understand the perils of such a model falling into the hands of bad actors (especially as many countries are heading towards their elections) & have decided NOT to release any demos, APIs or products based on this research until we have proper regulations for responsible use of such powerful tech.

Spark & Trouble are cautiously excited about this innovation! What are your thoughts?

Curious about the nitty-gritty? Check out the paper

10x Your Workflow with AI 📈

Work smarter, not harder! In this section, you’ll find prompt templates 📜 & bleeding-edge AI tools ⚙️ to free up your time.

Fresh Prompt Alert!🚨

You know, that feeling when you say "I wanna learn Python" but then six months later you're still stuck on print('Hello, World!')?
Yeah, us too.

That's why this week's Fresh Prompt Alert is all about setting SMART (Specific, Measurable, Achievable, Relevant, and Time-Bound) goals. This prompt will turn you into a goal-setting ninja, slicing through ambiguity and propelling you towards product (or career!) greatness.

So, do give it a try 👇

You are a visionary leader, with a solid understanding of ground-level execution of tasks. You understand the importance of setting SMART goals & are an expert in setting such goals to achieve immense success.

Turn this vision into a SMART goal: [insert vision].
Include the most important outcomes and deliverables.

* Replace the content in brackets with your details

3 AI Tools You JUST Can't Miss 🤩

  • 📹 Nvidia Broadcast - Turn your room into a studio with AI-powered voice & video effects including eye-tracking & auto-framing

  • ✍️ Rytr - Beat the writer's block & generate creative text that sounds like you (not a robot) in seconds

  • 🪄 Tome - Craft stunning presentations, complete with captivating visuals and a persuasive narrative with AI

Spark 'n' Trouble Shenanigans 😜

The Weekly Chuckle

No data scientists were harmed while creating this meme 🤣 (source: LinkedIn)

Well, that’s a wrap!
Thanks for reading 😊

See you next week with more mind-blowing tech insights 💻

Until then,
Stay Curious🧠 Stay Awesome🤩

PS: Do catch us on LinkedIn - Sandra & Tezan
