Is VideoLLaMA 2 the Sherlock Holmes of Video Analysis? 🔍

PLUS: Now robots can learn to be robots, thanks to Nvidia!

Howdy fellas!

Has this ever happened to you? You're watching a movie with a friend, and they catch subtle details you completely missed – like a character's slight change in expression or a meaningful background sound.



Well, imagine an AI that could do that, but for every video on the internet. Buckle up, because that's exactly what we're diving into today!

Here’s a sneak peek into today’s edition 👀

  • Alibaba’s new Video-LLM takes video understanding to a new level

  • A prompt that can make the lives of software devs a lot easier!

  • 3 awesome AI tools that you JUST can’t miss

  • The world’s leading robot companies are bolstering Nvidia’s Omniverse

Time to jump in!😄

PS: Got thoughts on our content? Share 'em through a quick survey at the end of every edition. It helps us see how our product labs, insights & resources are landing, so we can make them even better.

Hot off the Wires 🔥

We're eavesdropping on the smartest minds in research. 🤫 Don't miss out on what they're cooking up! In this section, we dissect some of the juiciest tech research that holds the key to what's next in tech.⚡

You might have come across the recently trending Instagram reels about how “background music changes everything”. Here’s an example:

Yeah, this rich combination of audio & visual cues is exactly what gives true meaning to videos.

Did you know? 

Users watch over 1 billion hours of YouTube videos daily, and online videos now make up more than 80% of all consumer internet traffic.

With this explosion of video content, AI that can truly understand videos is becoming increasingly crucial.

While we humans can easily understand the content, context, and nuances in these videos, teaching AI to do the same has been a colossal challenge – until now.

Researchers at Alibaba just dropped VideoLLaMA 2 - a new state-of-the-art Video-LLM which is pushing the boundaries of what AI can understand from videos.

Understanding the ‘Jargon’

Before we go further, let's decode some key terms that’ll help us along the way:

Spatial-Temporal Modeling: The model's ability to understand how objects are arranged in a video (spatial) and how they move and change over time (temporal).

Convolution: A mathematical operation used in neural networks to process grid-like data (e.g., images or video frames). Check out this amazing video to get a sense of what convolutions are all about.

Instruction Fine-tuning: Training an AI model on specific tasks by providing it with instructions and examples.
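To make “convolution” a bit more concrete, here’s a minimal NumPy sketch of a 2D convolution sliding a small edge-detecting kernel over a tiny image (all values here are illustrative, not from the paper):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D convolution: slide the kernel over the image,
    multiplying element-wise and summing at each position."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 5x5 "image" with a vertical edge, and a kernel that responds to vertical edges
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)
kernel = np.array([[-1, 0, 1]] * 3, dtype=float)
print(conv2d(image, kernel))  # largest responses (3.0) land where the edge sits
```

Neural networks stack many such learned kernels to pick up edges, textures, and eventually whole objects.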

So, what’s new?

In recent months, LLMs have started showing their prowess in video understanding. Early video-LLMs like mPLUG-Owl, VideoChat, and Video-ChatGPT primarily focused on analyzing static images extracted from videos or silent videos themselves. They could respond well to instructions related to what they "saw." But they were essentially watching movies on mute – missing out on crucial audio information.

Then came VideoLLaMA (the first version), which was like giving AI both eyes and ears. It introduced a dual-branch framework that consisted of separate vision-language and audio-language branches that fed into a central LLM for processing. This model could finally "hear" the videos it was watching, opening up a whole new dimension of understanding.

Note: For a deeper dive into the evolution of Video-LLMs, check out this insightful survey paper.

Comprehensive timeline showing development of Video-LLMs (source: survey paper)

Even with these advancements, Video-LLMs struggled to effectively process temporal dynamics – the way things change over time in a video. Another major challenge was handling the sheer volume of tokens a video produces; compressing them often led to significant information loss.

This is where VideoLLaMA 2 comes into the picture. It’s not just an upgrade - it borrows some parts of its architecture from the OG VideoLLaMA but reimagines the way spatial-temporal information is processed. At its core is a revolutionary feature called the Spatial-Temporal Convolution Connector (STC Connector).

Under the hood…

Building on VideoLLaMA's dual-branch framework, VideoLLaMA 2 incorporates several architectural improvements:

  • Vision-Language Branch: This branch utilizes a pre-trained CLIP visual encoder, allowing for flexibility in how frames are sampled from the video. The STC Connector then refines the encoded features for a more nuanced understanding.

  • Audio-Language Branch: Here, a pre-trained BEATs audio encoder, known for its exceptional ability to capture audio details and temporal dynamics, takes centre stage. The extracted features are then processed through a linear layer.

The LLM backbone can be customized – Alibaba’s researchers experimented with Mistral-Instruct and Mixtral-Instruct models for this purpose.

Dual-Branch Framework of VideoLLaMA 2 (source: VideoLLaMA 2 paper)

The core innovation in VideoLLaMA 2 lies in the Spatial-Temporal Convolution (STC) Connector & the cross-modal training strategy.

Imagine watching a movie – you don't perceive each frame in isolation. The STC Connector works similarly. It's designed to preserve the crucial order of information across different frames in a video. Additionally, it is remarkably efficient, reducing the number of tokens needed compared to previous methods like Q-Former (used in VideoLLaMA) while minimizing information loss.

Visual representation of STC Connector (source: VideoLLaMA 2 paper)

Here's a simplified breakdown of how the STC Connector works:

  1. Video frames are encoded by a visual encoder

  2. They undergo spatial interaction using a "RegStage" block (a fancy convolution layer)

  3. Then, a 3D convolution layer captures the spatial-temporal interactions - like how objects move across the screen

  4. Finally, another round of spatial interaction occurs, before feeding the processed information to the next stage.
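The four steps above can be sketched in PyTorch. This is a hypothetical, simplified stand-in for the paper's actual modules – plain 2D convolutions replace the RegStage blocks, and all layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class STCConnectorSketch(nn.Module):
    """Toy STC Connector pipeline: per-frame spatial interaction ->
    3D convolution (spatial-temporal downsampling) -> spatial interaction."""

    def __init__(self, dim=64):
        super().__init__()
        # Stand-in for the RegStage block: a 2D conv applied to each frame
        self.spatial_in = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        # 3D conv with stride 2 downsamples time AND space, cutting token count
        self.st_conv = nn.Conv3d(dim, dim, kernel_size=3, stride=2, padding=1)
        self.spatial_out = nn.Conv2d(dim, dim, kernel_size=3, padding=1)

    def _per_frame(self, x, conv):
        # Fold time into the batch dim so a 2D conv sees one frame at a time
        b, c, t, h, w = x.shape
        frames = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        frames = conv(frames)
        return frames.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)

    def forward(self, x):  # x: (batch, channels, time, height, width)
        x = self._per_frame(x, self.spatial_in)   # steps 1-2: spatial interaction
        x = self.st_conv(x)                       # step 3: spatial-temporal mixing
        return self._per_frame(x, self.spatial_out)  # step 4: final spatial pass

# 8 encoded frames of 16x16 patches become 4 frames of 8x8 -> 8x fewer tokens
x = torch.randn(1, 64, 8, 16, 16)
out = STCConnectorSketch()(x)
print(out.shape)  # torch.Size([1, 64, 4, 8, 8])
```

The key design choice to notice is the strided 3D convolution: it is what lets the connector shrink the token count while still mixing information across neighboring frames, rather than compressing each frame independently.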

The training process for VideoLLaMA 2 is multifaceted:

  • Video-language training: First, the vision-language branch is pre-trained on a massive dataset of image-text and video-text pairs, then fine-tuned on specific tasks like video captioning and visual question answering.

  • Audio-language training: The audio-language branch is pre-trained starting from the LLM obtained after video training, then fine-tuned on audio captioning and various audio-processing tasks.

  • Audio-visual joint training: The final step is where the model learns to understand the interplay between audio and visual cues.
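In code, the staged strategy roughly amounts to choosing which components are trainable in each phase. Here's a hypothetical sketch – the module and dataset names below are illustrative, not taken from the paper's code:

```python
# Highly simplified schedule mirroring the three training phases above
STAGES = [
    {"phase": "video-language", "trainable": ["vision_projector", "stc_connector"],
     "data": "image-text + video-text pairs"},
    {"phase": "audio-language", "trainable": ["audio_projector"],
     "data": "audio-captioning & audio-processing tasks"},
    {"phase": "audio-visual joint", "trainable": ["stc_connector", "audio_projector", "llm"],
     "data": "audio-visual instruction data"},
]

def describe_schedule(stages):
    """Return one human-readable line per training phase."""
    return [
        f"{s['phase']}: tune {', '.join(s['trainable'])} on {s['data']}"
        for s in stages
    ]

for line in describe_schedule(STAGES):
    print(line)
```

Building up complexity this way means each modality's adapter is already competent before the model has to juggle both at once.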

Cross-Modal Training Strategy used to train VideoLLaMA 2 (source: created by authors)

Why does this matter?

VideoLLaMA 2 doesn't just perform well; it excels - outperforming open-source models and giving proprietary models a run for their money in video understanding benchmarks.

Qualitative analyses show that VideoLLaMA 2 can not only describe scenes and track changes in the orientation of objects throughout a video, but also make logical inferences from its various components and pinpoint specific objects or events at exact moments. That’s really neat!

An example of how VideoLLaMA 2 correctly understands the change in orientation of the car along the video (source: VideoLLaMA 2 paper)

You can check it out right away on Hugging Face. Just upload a video of your choice & ask away any questions about it!

The potential applications of VideoLLaMA 2 are vast. Our buddies Spark & Trouble can’t help but imagine the possibilities:

  • E-commerce platforms could use this tech for advanced product showcases

  • Content moderation on social media could become more nuanced and accurate

  • In autonomous driving, it could enhance situational awareness by integrating visual and audio cues

  • For companies like Alibaba, it could revolutionize how customers interact with products in virtual shopping experiences & personalized recommendations

As we stand on the brink of this new era in AI video understanding, one thing is clear: The future of how we interact with and analyze video content is about to get a whole lot smarter. Who knows? The next time you ask your AI assistant about a video, it might just understand it better than you do!

Key Takeaways

(screenshot this!)

Multimodal Integration: Combining visual and audio processing can lead to more comprehensive understanding in AI models.

Innovative Architecture: The STC Connector shows how rethinking fundamental building blocks can yield significant improvements.

Staged Training Approach: The multi-stage training process demonstrates the value of building up complexity gradually.

Balancing Efficiency and Information Preservation: The focus on reducing tokens while minimizing information loss is a crucial consideration for large-scale models.

10x Your Workflow with AI 📈

Work smarter, not harder! In this section, you’ll find prompt templates 📜 & bleeding-edge AI tools ⚙️ to free up your time.

Fresh Prompt Alert!🚨

We've all been there - staring at a mountain of code, squinting to decipher the logic. But what if there was a way to visualize that tangled web, transforming it into a neat diagram?

That's where today's prompt comes in! With a few keystrokes, you can now convert any code into sleek UML diagrams, making it a breeze to understand the architecture and flow.

Following is a [language] code snippet:


[code snippet]


Based on this, write the PlantUML code to visualize this snippet as a UML diagram.

* Replace the content in brackets with your details

Trouble tried his hand at this prompt while visualizing the code for a library management system. Pasting the resulting code into a PlantUML viewer like PlantText produced:

UML diagram of a Library Management System (source: created by authors on PlantText)

Pretty neat, isn’t it?

We suppose it’s now time for you to start using this prompt template for some visual aid to make sense of the chaos…

3 AI Tools You JUST Can't Miss 🤩

  • ☑️ Applyish - Complete job application forms faster with AI

  • 🎨 Motiff - AI-powered professional UI design tool

  • 📲 Owl AI - Generate any type of logo for your idea using AI

Spark 'n' Trouble Shenanigans 😜

Last month, we saw a flurry of AI conferences & announcements by the top tech companies. While catching up with these, Trouble noticed something really cool!

Nvidia is building out Omniverse – a development platform for virtual world simulation that combines real-time physically based rendering, physics simulation, and generative AI.

“The era of robotics has arrived. Everything that moves will one day be autonomous.”

Jensen Huang (founder and CEO of NVIDIA)

In fact, BYD Electronics, Siemens, Teradyne Robotics, Intrinsic, and a dozen other robotics industry leaders worldwide are integrating NVIDIA Isaac accelerated libraries.

In Omniverse, robots can learn to be robots – minimizing the sim-to-real gap, and maximizing the transfer of learned behavior.

Sounds crazy, doesn’t it? Check it out for yourself 👇

Well, that’s a wrap!
Thanks for reading 😊

See you next week with more mind-blowing tech insights 💻

Until then,
Stay Curious🧠 Stay Awesome🤩

PS: Do catch us on LinkedIn - Sandra & Tezan

