Apple's Latest LLM Ushers in a New Era of Mobile Interaction
PLUS: Who's Your Biggest Threat? Find Out with a simple prompt...
Howdy fellas!
✨Big news✨ Your favorite AI newsletter will now be available 2x/week (Tuesdays & Thursdays) 🤩 Ain't that awesome!?
Spark & Trouble are really excited about this and keener than ever to share exciting AI stuff with all of you amazing readers.
Well, here's a sneak peek into this week's edition:
Apple's new MLLM is set to revolutionize our interaction with phones
Steal the "Prompt of this Week" to make competitor analysis a breeze
How can you make Disney Pixar-style movie posters in under 2 minutes?
Time to jump in!
PS: Got thoughts on our content? Share 'em through a quick survey at the end of every edition. It helps us see how our product labs, insights & resources are landing, so we can make them even better.
Hot off the Wires 🔥
We're eavesdropping on the smartest minds in research. 🤫 Don't miss out on what they're cooking up! In this section, we dissect some of the juiciest tech research that holds the key to what's next in tech. ⚡
Apple may have hit the snooze button on the Gen AI alarm, but they're rolling out of bed with a bang in 2024. First, they unveiled the MM1 family of multimodal large language models (MLLM), and now they're dropping a brand new MLLM called "Ferret-UI" that's got everyone buzzing.
Looks like the future of mobile interfaces just got a whole lot more fascinating! 📱
Tweet about Ferret-UI by Zhe Gan, Staff Research Scientist at Apple (source: Twitter)
A bit of history (well… not that old though)
In October 2023, some brilliant minds at Apple (along with folks from Columbia University) publicly released a powerful MLLM, "Ferret", which excelled at understanding the contents of any arbitrary selection in an image input (a concept called referring) and, given a description, could also pinpoint specific locations in the output image corresponding to it (another concept, known as grounding). Pretty neat, huh?
But how is "grounding" different from usual "object detection"?
Well, while object detection focuses on just identifying things in an image (assigning labels), grounding goes a step further: it connects a natural-language description to the specific region of the image it refers to, which requires understanding the meaning & context behind those labels.
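Here's a toy sketch (entirely our own, not from the paper) of that difference. Detection just returns every label with its box, while grounding maps a free-form phrase onto the one region it refers to — the "scene" dictionary and the naive keyword-overlap matching below are purely illustrative:

```python
# Toy illustration of detection vs. grounding over a hand-coded "scene".
# Boxes are (left, top, right, bottom) pixel coordinates.
SCENE = {
    "wifi icon":     {"box": (20, 40, 60, 80),      "context": "settings toggle, top of screen"},
    "send button":   {"box": (300, 900, 380, 960),  "context": "blue button next to text field"},
    "traffic light": {"box": (100, 100, 160, 260),  "context": "street photo thumbnail"},
}

def detect(scene):
    # Object detection: every label with its bounding box, no notion of intent
    return [(label, obj["box"]) for label, obj in scene.items()]

def ground(scene, phrase):
    # Grounding (naively sketched as keyword overlap): interpret a free-form
    # description and return the region it refers to, using label AND context
    words = set(phrase.lower().split())
    best = max(
        scene,
        key=lambda lbl: len(words & set((lbl + " " + scene[lbl]["context"]).lower().split())),
    )
    return scene[best]["box"]
```

A real grounding model does this with learned visual-language alignment rather than keyword matching, but the input/output contract is the same: `ground(scene, "the blue button next to the text field")` picks out the send button's region, not just a list of labels.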
The cool innovation, which had not been possible until then, was that Ferret was able to "refer" content from free-form selections (not just rectangular bounding boxes or points). It was able to do this using a "Hybrid Region Representation" and a "Spatial-Aware Visual Sampler" (to understand the math, check out their paper), in conjunction with the pretrained Vicuna LLM.
Overall Architecture of Ferret MLLM (source: Ferret paper)
With Ferret's state-of-the-art capabilities in finding anything, anywhere, just by you describing it, it certainly looks like the days of the frustrating CAPTCHAs asking you to "select all squares containing a traffic signal" might be numbered!
So, what's new with Ferret-UI?
Here's the thing: while Ferret achieved groundbreaking performance on "natural" images, it wasn't designed for the world of apps & mobile interfaces - these UIs have an elongated aspect ratio and contain a plethora of small icons, widgets & text (an icon may occupy less than 0.1% of the entire screen).
That's where Ferret-UI comes in! This new MLLM is specifically designed to understand the collection of icons, text, and buttons that make up our phone screens (both Android & iOS). Moreover, it is advanced enough to engage in meaningful conversations about a UI screen, describing it and inferring the functionality of its elements.
Under the hood…
Of course, Ferret-UI builds on top of the Ferret model. However, it introduces some pretty neat modifications to deal with the challenges mentioned previously.
Ferret-UI has a special trick called "anyres" that lets it handle any aspect ratio thrown its way. It breaks down the (elongated) UI screenshot into 2 sub-images & feeds these, along with the resized original image, to the LLM for a more detailed analysis. This enables Ferret-UI to get a better view of the UI, while also avoiding confusion caused by closely spaced text.
Overall Architecture of Ferret-UI MLLM with "anyres" (source: Ferret-UI paper)
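As a rough, dependency-free sketch of the idea (the function name and the 336-pixel encoder input size below are our assumptions for illustration; the two-way split follows the paper's description):

```python
# Hypothetical helper sketching "anyres"-style preprocessing: given a
# screenshot's dimensions, compute the crop boxes for the two sub-images,
# plus the target size for the resized global view. Boxes are
# (left, top, right, bottom) pixel coordinates.
def anyres_regions(width: int, height: int, encoder_size: int = 336):
    if height >= width:  # portrait UI: split into top and bottom halves
        boxes = [(0, 0, width, height // 2),
                 (0, height // 2, width, height)]
    else:                # landscape UI: split into left and right halves
        boxes = [(0, 0, width // 2, height),
                 (width // 2, 0, width, height)]
    # Each sub-image and the full screenshot are then resized to the vision
    # encoder's square input, and all three views are fed to the LLM together.
    return {"global_size": (encoder_size, encoder_size), "sub_boxes": boxes}
```

The payoff is that a tiny icon on a tall phone screen occupies roughly twice as many encoder pixels in its sub-image as it would in the single squashed global view.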
The model is trained on a bunch of tasks, divided into 3 categories:
| Referring Tasks | Grounding Tasks | Advanced Tasks |
| --- | --- | --- |
| OCR | Widget Listing | Detailed Description |
| Icon Recognition | Finding Text | Conversation Perception |
| Widget Classification | Finding Icon | Conversation Interaction |
| | Finding Widget | Function Inference |
Examples of the various tasks that Ferret-UI was trained on (source: Ferret-UI paper)
Although the training data leaned more towards iPhone screenshots, the performance on the various tasks for Android was commendable.
Why does all this matter?
Ferret-UI paves the way for some incredible advancements:
Natural Language UI Navigation: Imagine simply telling your phone, "Open the settings menu and navigate to Wi-Fi settings." Ferret-UI could make this a reality!
Enhanced Accessibility: Think of voice assistants like Siri being able to describe the layout of an app for visually impaired users. Ferret-UI could be the key to unlocking a more accessible mobile experience.
Smart UI Personalization: Imagine apps that automatically adjust their layout for efficient usage by using Ferret-UI, after analyzing your interaction patterns.
Automated UI Testing: Ferret-UI may also streamline the app development process by having AI automatically test UI functionality and usability.
The model weights & code for Ferret-UI have not yet been released by Apple, but you can find more details in this paper.
Also, while some kinks need to be ironed out (Ferret-UI performed slightly worse than GPT-4V on the advanced tasks), Ferret-UI still has the potential to revolutionize the way we interact with our phones. And who knows, maybe it'll even find its way into Siri, making her a much more helpful assistant!
Spark and Trouble are definitely keeping a close eye on how this technology evolves!
Key Takeaways
(screenshot this)
Specialization is key: Ferret-UI achieves better performance than Ferret on mobile interfaces because it's specifically designed for that domain (UI with text, icons, buttons)
Break down the problem: Ferret-UI addresses the challenge of elongated aspect ratio and small UI elements by breaking down the UI screenshot into sub-images for better analysis
Training on diverse tasks: Ferret-UI's training on various tasks highlights the importance of comprehensive training for MLLMs to perform well on a broad range of functionalities
10x Your Workflow with AI
Work smarter, not harder! In this section, you'll find prompt templates & bleeding-edge AI tools to free up your time.
Fresh Prompt Alert! 🚨
In the fast-paced world of business, knowing what your competitors are up to is like having a secret weapon. Whether you're a product wizard or a data nerd, this prompt helps you identify & analyze competitors in a breeze, while suggesting insightful recommendations for your strategy.
Get ready to outshine your rivals with our simple, powerful prompt!
Consider the business overview mentioned in <overview></overview> tags
<overview>
Description of Business: [description]
Key Differentiators: [key differentiators (if available)]
Revenue Model: [proposed revenue model of business (if available)]
Primary Regions of Operation: [Primary Regions of Operation (if available)]
</overview>
Identify the 3 biggest competitors of this business & describe their:
- Strengths
- Weaknesses
- GTM Strategies
Provide a highly detailed report based on the findings from this competitor analysis (include stats as well if available), and then make elaborate strategy recommendations as to how the business can outperform its competitors.
Wanna check this out in action?
Here's the result for a hypothetical "kitchen electronics" brand targeting Indian tier-1 & tier-2 cities.
3 AI Tools You JUST Can't Miss 🤩
Spark 'n' Trouble Shenanigans
Did Spark & Trouble just miss these movies announced by Disney Pixar!? 😮
Well, nah! This is the creativity of folks around the world, coming to life with the power of AI in their hands.
If you too are excited to create such captivating movie posters for ideas that you've been longing to explore (in less than 2 min 🤩), check out this tutorial!
Well, that's a wrap! Until then,