InternLM’s New AI Breakthrough: What You Need to Know...

PLUS: 1000+ AI Agents Go Wild in Minecraft—And It’s Epic! 🌏

Howdy fellas!

Hold onto your headphones and screens, because Spark and Trouble just intercepted a breakthrough that's about to redefine how technology listens, watches, and responds!


Are you ready to follow the trail and see where it leads?

Here’s a sneak peek into today’s edition 👀

  • Learn how InternLM’s latest AI can process streaming video & audio, with memory

  • Create websites that convert - with this fast & easy prompt!

  • Discover 5 incredible AI tools you should be using

  • Altera’s AI agents are creating their own society in Minecraft—crazy, right?

Time to jump in!😄

PS: Got thoughts on our content? Share 'em through the quick survey at the end of every edition. It helps us see how our product labs, insights & resources are landing, so we can make them even better.

Hot off the Wires 🔥

We're eavesdropping on the smartest minds in research. 🤫 Don't miss out on what they're cooking up! In this section, we dissect some of the juiciest tech research that holds the key to what's next in tech.⚡

Remember those sci-fi movies where an AI assistant seamlessly tracks multiple information streams, remembers every detail, and responds with lightning-fast intelligence?

This is the future that InternLM’s latest release, InternLM-XComposer2.5-OmniLive (IXC2.5-OL) aims to create, blending human-like cognition with advanced AI capabilities.

Here’s a super cool demo that should get you all curious about this innovation!

IXC2.5-OL Demo Video (for the best experience, please keep the audio on while enjoying the video.)

Did you know?

InternLM is an open-source AI research initiative (developed mainly by the Shanghai AI Laboratory) doing pioneering work on state-of-the-art language models and toolchains. Their releases include:

  • InternLM-2.5 Series: Multilingual foundational models with advanced context handling

  • InternLM-Math: High-performing bilingual math reasoning LLMs

  • InternLM-XComposer: Vision-language models for text-image understanding and generation

  • Toolchain Suite: Comprehensive training, deployment, and evaluation tools

… and so much more!

Don’t miss out on their GitHub page!

So, what’s new?

The need for multimodal systems arises from our daily interactions with various forms of media—video, audio, and text. These systems must not only process these inputs but also reason about them in real time, much like how we remember and reflect on past experiences.

However, current AI systems struggle to process streaming video and audio simultaneously, often losing context over time. Here are some of the key reasons:

  • Existing state-of-the-art methods rely heavily on long context windows, which are impractical for real-time applications.

  • They often switch between perception and reasoning, leading to inefficiencies and a lack of fluidity in interactions.

To tackle this, IXC2.5-OL draws inspiration from human cognition and the concept of "specialized generalist AI," which allows it to perform distinct tasks (like perception, reasoning, and memory) simultaneously, enhancing its interactive capabilities.

Drawing inspiration from human-like cognition (source: IXC2.5-OL paper)

Forging the fundamentals

Before we dive deep, let's decode some key terms that make this system so revolutionary:

Multimodal Perception: Think of it as having multiple sensory superpowers - simultaneously processing video, audio, and text without breaking a digital sweat.

Streaming Memory: Not just remembering information, but compressing and retrieving it dynamically, like a brain that's constantly organizing its filing cabinet.

Specialized Generalist AI: A system that combines specialized models for different tasks while maintaining a generalist approach to handle various inputs.

Under the hood…

Instead of handling all tasks with a single engine, IXC2.5-OL uses three distinct modules working in harmony:

Streaming “Perception” Module

This module handles real-time video and audio. It uses the Whisper model for audio encoding, followed by a Qwen2-1.8B SLM that classifies & recognizes the speech within the audio.
For efficient processing of streaming video, it uses the OpenAI CLIP-L/14 model to extract visual semantics and pass them to the next module.
Thus, it doesn’t just watch and listen; it actively filters and stores relevant data.
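
To make that data flow concrete, here’s a minimal Python sketch of what a single perception step could look like. The class and method names (StreamingPerception, transcribe_if_speech, etc.) are our own illustrative stand-ins, not the actual IXC2.5-OL code; only the model choices (Whisper, Qwen2-1.8B, CLIP-L/14) come from the paper.

```python
# Illustrative sketch only; the wrapper classes and method names are hypothetical.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class PerceptionOutput:
    speech_text: str | None   # recognized speech, or None if no one spoke
    visual_features: list     # frame embeddings for the latest video frames

class StreamingPerception:
    def __init__(self, audio_encoder, audio_lm, vision_encoder):
        self.audio_encoder = audio_encoder    # e.g. Whisper (audio encoding)
        self.audio_lm = audio_lm              # e.g. Qwen2-1.8B SLM (speech classification/recognition)
        self.vision_encoder = vision_encoder  # e.g. OpenAI CLIP-L/14 (frame features)

    def step(self, audio_chunk, video_frames) -> PerceptionOutput:
        """Process one slice of the live stream and keep only what matters."""
        audio_emb = self.audio_encoder.encode(audio_chunk)
        speech = self.audio_lm.transcribe_if_speech(audio_emb)   # None when no speech is detected
        frame_feats = [self.vision_encoder.encode(f) for f in video_frames]
        return PerceptionOutput(speech_text=speech, visual_features=frame_feats)
```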

Multimodal Long “Memory” Module

This is the most critical module: it compresses short-term memory into efficient long-term storage. This allows the system to retain key insights while discarding irrelevant noise, much like our own brain prioritizing important experiences.

This module is trained on three main tasks:

  1. Video Clip Compression: It compresses features from video clips into a more manageable format, creating short-term and global memories for each clip.

  2. Memory Integration: It integrates short-term memories, which contain detailed information, into a more abstract long-term memory, allowing for a macro view of the video content.

  3. Video Clip Retrieval: When a user poses a question, the module retrieves the most relevant video clips and their associated short-term memories to assist the Reasoning Module in providing accurate responses.

This module utilizes Qwen2-1.8B SLM (fine-tuned on video captioning & video question-answering tasks) to process and integrate the multimodal data effectively, ensuring that the system can handle real-time interactions and maintain context over extended periods.
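
Here’s how Spark & Trouble picture those three tasks fitting together in code. Fair warning: this is a hypothetical sketch under our own naming (MultimodalLongMemory, the embedding attribute, and so on), not InternLM’s implementation.

```python
# Hypothetical sketch of the three memory tasks; class and method names are ours.
import numpy as np

class MultimodalLongMemory:
    def __init__(self, compressor):
        self.compressor = compressor   # in the paper, a fine-tuned Qwen2-1.8B SLM plays this role
        self.short_term = []           # detailed per-clip memories
        self.long_term = []            # abstract, integrated memories

    def compress_clip(self, clip_features):
        """Task 1: squeeze a video clip's features into a compact memory."""
        memory = self.compressor.compress(clip_features)
        self.short_term.append(memory)
        return memory

    def integrate(self):
        """Task 2: fold detailed short-term memories into an abstract long-term one."""
        if self.short_term:
            self.long_term.append(self.compressor.summarize(self.short_term))

    def retrieve(self, question_embedding, top_k=3):
        """Task 3: pull the clips most relevant to the user's question."""
        return sorted(
            self.short_term,
            key=lambda m: float(np.dot(m.embedding, question_embedding)),  # hypothetical .embedding field
            reverse=True,
        )[:top_k]
```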

“Reasoning” Module

This is the “brain” that connects the dots, answers queries, and makes decisions by pulling relevant memories and processing current inputs.

Here, the researchers used an improved version of their previously released InternLM-XComposer2.5 model, augmented with the visual, auditory & memory features extracted by the previous modules, to answer the user’s questions.
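
Put differently, the reasoning step conditions its answer on what the system is seeing and hearing right now plus what the memory module has stored away. A tiny illustrative sketch (again with our own names; reasoning_model stands in for the augmented InternLM-XComposer2.5):

```python
# Illustrative only: how the reasoning step might stitch everything together.
def answer_question(question, perception, memory, reasoning_model):
    # Embed the question and fetch the most relevant remembered clips.
    retrieved = memory.retrieve(reasoning_model.embed(question))

    # Condition the answer on current perception plus retrieved memories.
    return reasoning_model.generate(
        question=question,
        visual_context=perception.visual_features,
        speech_context=perception.speech_text,
        memories=retrieved,
    )
```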

Overall, here’s what this system looks like:

That’s what the “specialized generalist AI” for IXC2.5-OL looks like (source: IXC2.5-OL paper)

What’s the intrigue?

The model architecture is fascinating on its own, but the researchers went a step further and discussed how they have deployed this system to serve users…

The system is structured into three key parts: the Frontend, the SRS (Simple Realtime Server), and the Backend Server.

Each component plays a crucial role in processing and managing audio and video streams effectively:

A high-level representation of the system pipeline (source: created by authors)

The backend server does all the AI heavy lifting: a set of queues handles incoming audio and video frames and other LLM-related tasks before the final response is converted to speech using techniques like MeloTTS.
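
If you’re curious how such a queue-driven backend might be wired up, here’s a toy Python sketch. Every function in it (asr, encode_frame, generate_reply, speak) is a placeholder stub standing in for the Whisper/CLIP/LLM/MeloTTS stages; it illustrates the threads-and-queues pattern, not the actual server code.

```python
# Toy sketch of a queue-based backend; all stages below are placeholder stubs.
import queue
import threading
import time

# Placeholder stages; in the real system these would wrap Whisper, CLIP,
# the reasoning LLM, and MeloTTS respectively.
def asr(chunk): return chunk.get("speech")
def encode_frame(frame): return frame
def generate_reply(text): return f"(answer to) {text}"
def speak(text): print(f"[TTS] {text}")

memory = []                                      # stand-in for the memory module
audio_q, frame_q = queue.Queue(), queue.Queue()  # fed by the SRS stream in practice

def audio_loop():
    while True:
        text = asr(audio_q.get())                # speech-recognition branch
        if text:                                 # only call the LLM when the user spoke
            speak(generate_reply(text))          # reasoning -> text-to-speech

def frame_loop():
    while True:
        memory.append(encode_frame(frame_q.get()))   # vision branch feeds the memory module

threading.Thread(target=audio_loop, daemon=True).start()
threading.Thread(target=frame_loop, daemon=True).start()

audio_q.put({"speech": "What am I looking at?"})     # simulate one user utterance
time.sleep(0.5)                                      # give the worker a moment to respond
```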

Results speak louder than words!

IXC2.5-OL delivers state-of-the-art performance across benchmarks:

  • StreamingBench: Achieved 73.79%, making it the best open-source model for real-time video understanding.

  • MLVU & MVBench: Outperformed both open-source and closed-source models in complex multimodal reasoning tasks.

  • Audio Recognition: Demonstrated superior accuracy in speech recognition tasks for both English and Chinese datasets.

Why does this matter?

The advancements in IXC2.5-OL can benefit industries such as entertainment, education, and customer service. Imagine the possibilities:

  • Autonomous Vehicles: Cars that don’t just “see” but also remember traffic patterns and deduce potential hazards.

  • Customer Service: Chatbots that adapt to nuanced customer needs by recalling previous interactions.

  • Healthcare: Virtual assistants capable of understanding patient history through audio-visual interactions, offering more accurate diagnoses.

Companies like Tesla, Apple, or even Meta could leverage this for next-gen AI assistants, while academics might explore its potential in education or dynamic research aids.

Wish to try this out & experience the magic for yourself?

Check out the steps to set up the live demo.

InternLM-XComposer2.5-OmniLive is more than just a technological leap; it’s a paradigm shift in how we think about multimodal AI. By emulating human-like cognition, it’s setting the stage for applications that were once the stuff of science fiction.

So, Spark & Trouble want to know: How would you use such a system in your world?

Drop us your ideas and let’s spark a discussion!

10x Your Workflow with AI 📈

Work smarter, not harder! In this section, you’ll find prompt templates 📜 & bleeding-edge AI tools ⚙️ to free up your time.

Fresh Prompt Alert!🚨

Ever tried building a website but ended up with a blank page and existential dread?
We've been there.

This week's Fresh Prompt Alert is your digital saviour! It crafts a website outline, boosts conversions, and even throws in a witty headline to make your business shine online. Whether you're launching your dream startup or giving your side hustle a glow-up, this prompt is like having a web strategist in your pocket.

Ready to hit publish? Dive in below 👇

I have a business focusing on [enter your business details here, plus any additional info like location or services offered].

I am looking to create a website, can you suggest an outline for said website?

Please also offer suggestions on optimizing the page for more conversions.

Can you suggest a witty headline and a short introductory paragraph for the website?

* Replace the content in brackets with your details

5 AI Tools You JUST Can't Miss 🤩

  • 👩‍🏫 Playmaker: Transform unstructured data into actionable tables, instantly

  • ©️ Lyndium: Translate videos to any language using AI

  • ✍️ Exam Maker AI: Get higher grades faster by generating exams from any type of course content

  • 🎬 ClipGOAT YT Shorts Maker: Easily convert YouTube videos into Shorts with AI

  • 📅 Lean: Design a database with AI using just a prompt

Spark 'n' Trouble Shenanigans 😜

Ever wondered what happens when AI agents get together and decide to build their own civilization? 🤔 

Well, Spark and Trouble were mind-blown (and maybe a bit jealous) when we stumbled upon Altera AI’s Project SID—the first simulations of 1000+ truly autonomous AI agents collaborating in Minecraft, forming their own society!

From creating a working economy with gems as currency to debating laws under different political leaders (yep, a virtual Trump vs Kamala showdown!), these agents aren’t just playing games—they’re creating emergent cultures, governments, and even religions.

It’s hilarious, mind-boggling, and honestly, kind of inspiring.

If you’ve ever wanted to see AI collaborate (or squabble) in ways that look like a Minecraft episode gone wild, this is a must-see. Trust us, you don’t want to miss this one! 👇🏼

Well, that’s a wrap!
Thanks for reading 😊

See you next week with more mind-blowing tech insights 💻

Until then,
Stay Curious🧠 Stay Awesome🤩

PS: Do catch us on LinkedIn - Sandra & Tezan
