The Vision, Debugged;

Apple's Latest LLM Ushers in a New Era of Mobile Interaction

PLUS: Who's Your Biggest Threat? Find Out with a simple prompt...

Tezan Sahu & Sandra Anil
April 30th, 2024

Howdy fellas!

✨Big news ✨ Your favorite AI newsletter will now be available 2x/week (Tuesdays & Thursdays) 🤩 Ain't that awesome!?

Spark & Trouble are really excited about this and keener than ever to share exciting AI stuff with all of you amazing readers.

Well, here’s a sneak peek into this week’s edition 👀

Apple’s new MLLM is set to revolutionize our interaction with phones
Steal the ‘Prompt of this Week’ to make competitor analysis a breeze
How can you make Disney Pixar-style movie posters in under 2 minutes?

Time to jump in!😄

PS: Got thoughts on our content? Share 'em through a quick survey at the end of every edition. It helps us see how our product labs, insights & resources are landing, so we can make them even better.

Hot off the Wires 🔥

We're eavesdropping on the smartest minds in research. 🤫 Don't miss out on what they're cooking up! In this section, we dissect some of the juiciest tech research that holds the key to what's next in tech.⚡

Apple may have hit the snooze button on the Gen AI alarm, but they’re rolling out of bed with a bang in 2024. First, they unveiled the MM1 family of multimodal large language models (MLLMs), and now they're dropping a brand new MLLM called "Ferret-UI" that's got everyone buzzing.

Looks like the future of mobile interfaces just got a whole lot more fascinating!📱

Tweet about Ferret-UI by Zhe Gan, Staff Research Scientist at Apple (source: Twitter)

A bit of history (well…not that old though)

In October 2023, some brilliant minds at Apple (along with folks from Columbia University) publicly released a powerful MLLM called “Ferret”. It excelled at understanding the contents of any arbitrary selection in an input image (a concept called referring) and, given a description, could also pinpoint the specific locations in the image that correspond to it (another concept, known as grounding). Pretty neat, huh?

But how is ‘grounding’ different from usual ‘object detection’?
Well, while object detection focuses on finding things in an image and tagging them with labels from a fixed list of categories, grounding goes a step further: it starts from a free-form description and locates the region that the description refers to, which means understanding the meaning & context behind the labels.

The cool innovation, which had not been possible until then, was that Ferret was able to ‘refer’ content from free-form selections (not just rectangular bounding boxes or points). It was able to do this using a ‘Hybrid Region Representation’ and a ‘Spatial-Aware Visual Sampler’ (to understand the math, check out their paper), in conjunction with the pretrained Vicuna LLM.

Overall Architecture of Ferret MLLM (source: Ferret paper)

With Ferret’s state-of-the-art capabilities in finding anything, anywhere, just by you describing it, it certainly looks like the days of the frustrating CAPTCHAs asking you to “select all squares containing a traffic signal” might be numbered!

So, what’s new with Ferret-UI?

Here’s the thing: While Ferret achieved groundbreaking performance on ‘natural’ images, it wasn’t designed to deal with the world of apps & mobile interfaces - these UIs have an elongated aspect ratio and also contain a plethora of small icons, widgets & text (icons may occupy less than 0.1% of the entire screen).

That's where Ferret-UI comes in! This new MLLM is specifically designed to understand the collection of icons, text, and buttons that make up our phone screens (both Android & iOS). Moreover, it is advanced enough to engage in meaningful conversations based on the UI screen, describing what is on it & explaining what its elements do.

Under the hood…

Of course, Ferret-UI builds on top of the Ferret model. However, it introduces some pretty neat modifications to deal with the challenges mentioned previously.

Ferret-UI has a special trick called ‘anyres’ that lets it handle any aspect ratio thrown its way. It splits the (elongated) UI screenshot into 2 sub-images based on its aspect ratio (top & bottom halves for portrait screens, left & right halves for landscape ones), encodes each of them separately & feeds them, along with the resized original image, to the LLM for a more detailed analysis. This gives Ferret-UI a magnified view of the small UI elements, while also avoiding confusion caused by closely spaced text.

Overall Architecture of Ferret-UI MLLM with ‘anyres’ (source: Ferret-UI paper)
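
To make the ‘anyres’ splitting idea concrete, here is a minimal Python sketch. It only computes the crop boxes for the full screen and its 2 sub-images, assuming the cut is made across the longer side (so portrait screens give top & bottom halves). The function names `split_screen` and `anyres_views`, as well as the sample screen size, are made up for illustration and are not taken from Apple's code, which has not been released.

```python
from typing import List, Tuple

# A crop box is (left, top, right, bottom), in pixels.
Box = Tuple[int, int, int, int]


def split_screen(width: int, height: int) -> List[Box]:
    """Return crop boxes for the two sub-images of a screenshot.

    Assumption: portrait screens are cut into top and bottom halves,
    landscape screens into left and right halves.
    """
    if height >= width:
        mid = height // 2
        return [(0, 0, width, mid), (0, mid, width, height)]
    mid = width // 2
    return [(0, 0, mid, height), (mid, 0, width, height)]


def anyres_views(width: int, height: int) -> List[Tuple[str, Box]]:
    """List the three views: the full screen plus its two halves.

    Resizing each view to the image encoder's input size is left out,
    because it depends on the encoder being used.
    """
    first, second = split_screen(width, height)
    return [("full", (0, 0, width, height)), ("sub_1", first), ("sub_2", second)]


if __name__ == "__main__":
    # Hypothetical portrait screenshot of 1170 x 2532 pixels.
    for name, box in anyres_views(1170, 2532):
        print(name, box)
```

Each half of a tall screenshot is much closer to square than the original, so small icons & text lose less detail when the view is shrunk to the encoder's input size.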

The model is trained on a bunch of tasks, divided into 3 categories:

Referring Tasks: OCR, Icon Recognition, Widget Classification
Grounding Tasks: Widget Listing, Finding Text, Finding Icon, Finding Widget
Advanced Tasks: Detailed Description, Conversation Perception, Conversation Interaction, Function Inference

Examples of the various tasks that Ferret-UI was trained on (source: Ferret-UI paper)

Although the training data leaned more towards iPhone screenshots, the performance on the various tasks for Android was commendable.

Why does all this matter?

Ferret-UI paves the way for some incredible advancements:

Natural Language UI Navigation: Imagine simply telling your phone, "Open the settings menu and navigate to Wi-Fi settings." Ferret-UI could make this a reality!
Enhanced Accessibility: Think of voice assistants like Siri being able to describe the layout of an app for visually impaired users. Ferret-UI could be the key to unlocking a more accessible mobile experience.
Smart UI Personalization: Imagine apps that use Ferret-UI to analyze your interaction patterns and then automatically adjust their layout so that you can use them more efficiently.
Automated UI Testing: Ferret-UI may also streamline the app development process by having AI automatically test UI functionality and usability.

The model weights & code for Ferret-UI have not yet been released by Apple, but you can find more details in this paper.

Also, while some kinks need to be ironed out (Ferret-UI performed slightly worse than GPT-4V on the advanced tasks), Ferret-UI still has the potential to revolutionize the way we interact with our phones. And who knows, maybe it'll even find its way into Siri, making her a much more helpful assistant!

Spark and Trouble are definitely keeping a close eye on how this technology evolves!

Key Takeaways

(screenshot this)

Specialization is key: Ferret-UI achieves better performance than Ferret on mobile interfaces because it's specifically designed for that domain (UI with text, icons, buttons)

Break down the problem: Ferret-UI addresses the challenge of elongated aspect ratio and small UI elements by breaking down the UI screenshot into sub-images for better analysis

Training on diverse tasks: Ferret-UI's training on various tasks highlights the importance of comprehensive training for MLLMs to perform well on a broad range of functionalities

10x Your Workflow with AI 📈

Work smarter, not harder! In this section, you’ll find prompt templates 📜 & bleeding-edge AI tools ⚙️ to free up your time.

Fresh Prompt Alert!🚨

In the fast-paced world of business, knowing what your competitors are up to is like having a secret weapon. Whether you're a product wizard or a data nerd, this prompt helps you identify & analyze competitors with ease, while suggesting insightful recommendations for your strategy.

Get ready to outshine your rivals with our simple, powerful prompt!

Consider the business overview mentioned in <overview></overview> tags

<overview>
Description of Business: [description]
Key Differentiators: [key differentiators (if available)]
Revenue Model: [proposed revenue model of business (if available)]
Primary Regions of Operation: [Primary Regions of Operation (if available)]
</overview>

Identify the 3 biggest competitors of this business & describe their:
- Strengths
- Weaknesses
- GTM Strategies

Provide a highly detailed report based on the findings from this competitor analysis (include stats as well if available), and then make elaborate strategy recommendations as to how the business can outperform its competitors.

* Replace the content in brackets with your details | Use Microsoft Copilot / Gemini for best results
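
If you plan to reuse this prompt for several businesses, a tiny helper can fill in the brackets for you. The Python sketch below is just one way to do it: the function name `build_competitor_prompt` and its parameters are made up for illustration, and it assumes that optional fields you leave out should simply be dropped from the overview.

```python
from typing import Optional


def build_competitor_prompt(
    description: str,
    key_differentiators: Optional[str] = None,
    revenue_model: Optional[str] = None,
    regions: Optional[str] = None,
) -> str:
    """Fill the competitor-analysis prompt template with business details."""
    fields = [
        ("Description of Business", description),
        ("Key Differentiators", key_differentiators),
        ("Revenue Model", revenue_model),
        ("Primary Regions of Operation", regions),
    ]
    # Assumption: optional fields that were not provided are omitted entirely.
    overview = "\n".join(f"{label}: {value}" for label, value in fields if value)
    return (
        "Consider the business overview mentioned in <overview></overview> tags\n\n"
        f"<overview>\n{overview}\n</overview>\n\n"
        "Identify the 3 biggest competitors of this business & describe their:\n"
        "- Strengths\n"
        "- Weaknesses\n"
        "- GTM Strategies\n\n"
        "Provide a highly detailed report based on the findings from this "
        "competitor analysis (include stats as well if available), and then "
        "make elaborate strategy recommendations as to how the business can "
        "outperform its competitors."
    )


if __name__ == "__main__":
    # Hypothetical business details, only for demonstration.
    print(build_competitor_prompt(
        "Kitchen electronics brand",
        regions="Indian tier-1 & tier-2 cities",
    ))
```

Paste the printed text into the chat assistant of your choice, exactly as you would with the hand-filled template.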

Wanna check this out in action?
Here’s the result for a hypothetical “kitchen electronics” brand targeting Indian tier-1 & tier-2 cities.

3 AI Tools You JUST Can't Miss 🤩

🌐 Framer - Design & publish stunning websites without breaking a sweat
📈 Julius AI - Analyse your data & get insights at the speed of thought
🖼️ Ideogram - Craft captivating AI images that render text correctly

Spark 'n' Trouble Shenanigans 😜

Did Spark & Trouble just miss these movies announced by Disney Pixar!? 😮

Source: r/ai_disney_posters (Reddit)

Well, nah! This is the creativity of folks around the world, coming to life with the power of AI in their hands.

If you too are excited to create such captivating movie posters for ideas that you’ve been longing to explore (in less than 2 min🤩), check out this tutorial 👇

Well, that’s a wrap!
Thanks for reading 😊

See you next week with more mind-blowing tech insights 💻

Until then,
Stay Curious🧠 Stay Awesome🤩

PS: Do catch us on LinkedIn - Sandra & Tezan
