The Vision, Debugged
What If AI Could Use Your Computer for You? Discover "Agent S"

PLUS: Ready to Unlock the Formula for Crafting a Successful Webinar with AI?

Tezan Sahu & Sandra Anil
October 22nd, 2024

Howdy fellas!

Spark and Trouble are at it again, connecting the dots between human-like thinking and machine-powered action. Is it possible for computers to think, act, and explore like one of us? The answer might be just a few clicks away in this edition's puzzle!

Here’s a sneak peek into today’s edition 👀

“Agent S” Can Perform Your Desktop Tasks—How Far Can It Go?
Want to Create a Webinar That Wows? Start with today’s Fresh Prompt!
5 AI Tools that you just can’t miss out on
A student’s AI creation is turning heads when it comes to real-time call screening

Time to jump in!😄

PS: Got thoughts on our content? Share 'em through a quick survey at the end of every edition. It helps us see how our product labs, insights & resources are landing, so we can make them even better.

Hot off the Wires 🔥

We're eavesdropping on the smartest minds in research. 🤫 Don't miss out on what they're cooking up! In this section, we dissect some of the juiciest tech research that holds the key to what's next in tech.⚡

Over the last few weeks, you’ve heard a lot from us about how AI agents are transforming specialized domains, like research, CLI commands, and more.

But today, Spark and Trouble have something new for you: meet Agent S, an open multimodal LLM-based agentic framework developed by the folks at Simular AI, designed to interact with computers through a Graphical User Interface (GUI).

In simpler terms? It's an AI that can use your computer just like you do – clicking, typing, and navigating its way through tasks with surprising efficiency.

But why is this such a big deal? Let's break it down.

So, what’s new?

Teaching AI to use a computer like a human is no small feat. Here are some of the hurdles:

Domain Knowledge: With countless apps and websites constantly evolving, how do you keep an AI up-to-date?
Long-term Planning: Many tasks require multiple steps in a specific order. Can an AI figure out and remember these sequences?
Dynamic Interfaces: GUIs change all the time. How can an AI adapt to these shifting landscapes?

Agent S tackles these challenges head-on with some seriously cool innovations.

A 30,000 ft view of what Agent-S can do (source: simular.ai)

Under the hood…

Agent S brings some cutting-edge solutions to the table, powered by three main innovations:

Overview of Agent-S Framework (source: Agent-S paper)

Experience-Augmented Hierarchical Planning

Agent S doesn’t just perform tasks—it thinks. At the core of this capability is a hierarchical approach to task management.

A “Manager” agent uses Perplexica (an open-source, AI-powered search engine) together with its local Narrative Memory of past experiences to retrieve the knowledge needed to solve a user-defined task & then breaks that task down into subtasks.

“Narrative memory” contains summaries of all tasks done in the past by Agent S.

These subtasks are passed to “Worker” agents, which use their own Episodic Memory to execute individual actions. Workers refer to previous results to optimize their strategy.

“Episodic memory” contains summarized results of the actions performed & of the previous subtasks carried out by the worker.

Each subtask is evaluated, and the results are fed back into the system, allowing both the Manager and Workers to learn continuously from their experiences.

Think of it like an experienced team where the manager hands out assignments, and workers continuously learn from their successes and failures. This hierarchical approach enhances the agent’s ability to plan over longer timeframes and tackle more complex tasks.
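
To make the Manager/Worker split a bit more concrete, here is a minimal Python sketch of the idea. It is not Agent S's actual code: the class names, the `plan` and `execute` methods, and the trick of splitting a task on the word "then" are all made up for illustration. In the real framework, planning and execution are driven by an LLM plus retrieval from memory.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Carries out subtasks and logs each outcome in its episodic memory."""
    episodic_memory: list = field(default_factory=list)

    def execute(self, subtask: str) -> bool:
        # Look up earlier attempts at the same subtask (stand-in for retrieval).
        earlier = [note for note in self.episodic_memory if note.startswith(subtask)]
        # Placeholder: a real worker would drive the GUI here and could fail.
        success = True
        self.episodic_memory.append(f"{subtask} | ok | earlier attempts: {len(earlier)}")
        return success


@dataclass
class Manager:
    """Splits a task into subtasks and records a summary in narrative memory."""
    narrative_memory: list = field(default_factory=list)

    def plan(self, task: str) -> list:
        # Placeholder planner: the real one combines web search, memory and an LLM.
        return [step.strip() for step in task.split(" then ")]

    def run(self, task: str, worker: Worker) -> bool:
        results = [worker.execute(subtask) for subtask in self.plan(task)]
        done = all(results)
        self.narrative_memory.append(f"{task} | {'completed' if done else 'incomplete'}")
        return done


manager, worker = Manager(), Worker()
manager.run("open the spreadsheet then sort column A", worker)
print(worker.episodic_memory)   # one note per subtask
print(manager.narrative_memory)  # one summary for the whole task
```

The point of the sketch is the shape of the loop: the manager plans, workers execute and take notes, and a task-level summary is written back once everything is done.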

Agent-Computer Interface (ACI)

Inspired by the rudimentary ACI introduced by researchers in May 2024 through the SWE-agent paper, the folks at Simular AI took things up a notch.

Agent S doesn't just look at your screen - it understands it. Using a combination of the accessibility tree (a hierarchical representation of UI elements) and optical character recognition (OCR), Agent S can "see" and interpret what's on your screen.

It then uses a technique called Set-of-Mark Prompting to finalize actions by keeping the choices constrained (like clicks, typing, or hotkeys) so they can be executed smoothly in the environment.

Set-of-Mark (SoM) is a Visual Prompting technique introduced by Microsoft that adds spatial and speakable marks to images, which are used as context with a textual prompt. This helps the AI better understand and interpret the visual content.

To get a sense of the what’s & how’s of this super-interesting technique, check out this page.
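
If you like to think in code, here is a rough Python sketch of how such an interface might combine the two sources and keep actions constrained. Everything in it is an assumption made for illustration: the `Element` type, the overlap (IoU) check used to drop OCR boxes that duplicate an accessibility-tree element, the 0.5 threshold, and the three-item action list. It only shows the general pattern of "number the elements, then let the agent pick from a small set of actions".

```python
from dataclasses import dataclass

# Assumed bounded action space, following the examples in the text.
ALLOWED_ACTIONS = {"click", "type", "hotkey"}


@dataclass(frozen=True)
class Element:
    label: str
    box: tuple  # (left, top, right, bottom) in pixels


def iou(a: tuple, b: tuple) -> float:
    """Intersection-over-union of two boxes."""
    width = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = width * height
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def merge_elements(tree: list, ocr: list, threshold: float = 0.5) -> dict:
    """Keep all tree elements, add OCR boxes that do not overlap them, assign IDs."""
    merged = list(tree)
    for candidate in ocr:
        if all(iou(candidate.box, known.box) < threshold for known in merged):
            merged.append(candidate)
    return dict(enumerate(merged))


def is_valid(action: str, elements: dict, element_id=None) -> bool:
    """An action must come from the bounded set; click/type need a known element ID."""
    if action not in ALLOWED_ACTIONS:
        return False
    return action == "hotkey" or element_id in elements


tree = [Element("Save", (0, 0, 100, 40))]
ocr = [Element("Save", (2, 2, 98, 38)), Element("Cancel", (120, 0, 220, 40))]
elements = merge_elements(tree, ocr)
print(elements)
print(is_valid("click", elements, 1), is_valid("drag", elements, 1))
```

Numbering the elements is what lets the model say "click element 1" instead of guessing pixel coordinates.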

Continual Memory Updates

Agent S doesn’t just perform and forget. It’s always learning, updating its memories based on the success or failure of actions it performs.

The Self Evaluator summarizes strategies used by workers and updates the Episodic Memory for future reference.
When a task is fully completed, the evaluator creates a summary of the entire process, which is stored in the Manager’s Narrative Memory for the next round.

What’s exciting is that this memory system was bootstrapped using examples inspired by OSWorld, a scalable environment benchmark that simulates real-world computer environments (Ubuntu, Windows, macOS). Even beyond initial bootstrapping, Agent S continues to learn as it encounters new, unseen tasks.
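
Here is a tiny Python sketch of the "write a summary, look it up later" loop described above. It is a toy under stated assumptions: the `Memory` class, the `self_evaluate` helper, and the word-overlap score are all hypothetical stand-ins (a real system would more likely use an LLM for summaries and embedding similarity for lookups).

```python
from dataclasses import dataclass, field
from typing import Optional


def _words(text: str) -> set:
    return set(text.lower().split())


@dataclass
class Memory:
    """A list of (key, summary) pairs with a naive similarity lookup."""
    entries: list = field(default_factory=list)

    def add(self, key: str, summary: str) -> None:
        self.entries.append((key, summary))

    def retrieve(self, query: str) -> Optional[str]:
        # Jaccard word overlap stands in for embedding similarity (assumption).
        best, best_score = None, 0.0
        query_words = _words(query)
        for key, summary in self.entries:
            key_words = _words(key)
            union = query_words | key_words
            score = len(query_words & key_words) / len(union) if union else 0.0
            if score > best_score:
                best, best_score = summary, score
        return best


def self_evaluate(subtask: str, actions: list, succeeded: bool) -> str:
    """Condense what a worker did into a one-line note."""
    outcome = "worked" if succeeded else "failed"
    return f"{subtask}: {' -> '.join(actions)} ({outcome})"


episodic = Memory()
episodic.add("rename file", self_evaluate("rename file", ["right-click", "type name"], True))
print(episodic.retrieve("rename a file"))
print(episodic.retrieve("adjust brightness"))  # nothing similar stored yet
```

The same structure could serve as the Manager's Narrative Memory, with whole-task summaries as entries instead of subtask notes.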

To get a sense of how all these components come together, tap on the flow diagram below for a fairly detailed overview.

Tap on the image to get a full view of how Agent-S works (created by authors)

Why does this matter?

So, how well does Agent S actually perform? Pretty impressively, it turns out!

Agent S excels in both system-level tasks and day-to-day desktop tasks. From adjusting photo brightness to automating work in Word or Excel, it’s proving itself to be a highly versatile tool.

Tasks from OSWorld benchmark, where Agent-S was tested (source: OSWorld project page)

Although initially designed for Linux and Mac, it’s shown surprising adaptability in Windows, with generalization capabilities surpassing any other existing agent.

While Agent-S works with several LLMs (like Claude-3.5 and Gemini-Pro-1.5), it shines the brightest when paired with GPT-4o.

Let’s talk numbers: Agent S delivers a 26% success rate on autonomous GUI tasks in the OSWorld benchmark. That might seem modest, but compared to previous baselines of around 11%, it marks a massive improvement of roughly 127%!
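
For the curious, the improvement figure is a relative gain: (new − baseline) / baseline. Here is a quick Python check using the rounded numbers quoted above. Those rounded inputs give about 136%, so the 127% in the text presumably comes from unrounded or slightly different underlying figures (that reading is our assumption).

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Percentage gain of `new` over `baseline`."""
    return (new - baseline) / baseline * 100


# Rounded success rates from the text: ~11% baseline, 26% for Agent S.
print(round(relative_improvement(11, 26)))  # prints 136
```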

Fun Fact:

If you search for "Agent S" on the web, you'll find a peppy squirrel villager from the Japanese Animal Crossing series with the catchphrase "You gotta put the pedal to the metal!"

While unrelated, it seems the folks at Simular AI took this advice to heart in creating their super-charged AI agent!

Agent S offers endless potential for industries that rely on complex, repetitive GUI interactions:

Software Testing: Companies like Microsoft or Adobe could use Agent S to automate UI testing across their suite of applications.
Customer Support: Tech support teams could leverage Agent S to remotely diagnose and fix common computer issues.
Accessibility: Organizations focused on assistive technologies could integrate Agent S to help users with motor impairments navigate computer interfaces.
Productivity: Businesses could use Agent S to automate repetitive tasks, freeing up employees for more creative work.
Education: Schools could employ Agent S to teach computer literacy skills, providing interactive guidance through various software applications.

The exciting part is that Agent S’s code is open source. Check out the repo!

For now, Linux and macOS users can dive in and experiment, with Windows support on the way. And who knows? If you're feeling adventurous, you could even start thinking about integrating Agent S into your daily workflows.

What Spark & Trouble would love to see next is speech recognition in Agent S, which could streamline workflows even further by offering hands-free control over desktop environments. Just sit back & talk to your computer to “do” stuff on your behalf! Powerful, ain’t it?

10x Your Workflow with AI 📈

Work smarter, not harder! In this section, you’ll find prompt templates 📜 & bleeding-edge AI tools ⚙️ to free up your time.

Fresh Prompt Alert!🚨

You know that moment when you're trying to craft the perfect webinar, but the words just won’t flow? This week’s Fresh Prompt Alert is here to help you, as a content creator or coach, nail that script effortlessly - this prompt will turn your webinar into a value-packed session that keeps your audience hooked.

Ready to shine?👇

Write a script for a webinar that aims to educate potential customers about [insert topic here].

Emphasize the main topics, like [benefits of topic, cost savings, environmental impact, etc.], and how these advantages can make a difference in their daily lives.

* Replace the content in brackets with your details

5 AI Tools You JUST Can't Miss 🤩

🪙 Kick: AI-powered accounting software that does the work for you
🤝 Stratify: A powerful AI-driven platform that turns your ideas into actionable strategies, stunning visuals, polished content, or a powerful SaaS
💬 Favie: Real Reddit Reviews for Amazon shopping
🧲 Lead Magnet Generator: Create eBooks & Whitepapers to Capture Leads in Minutes!
🧠 MindSmith: Accelerate your eLearning development with generative AI

Spark 'n' Trouble Shenanigans 😜

Ever wish you had a personal assistant to handle annoying spam calls and schedule meetings without lifting a finger? Well, Spark & Trouble have stumbled upon something super cool this week! 🧩✨

Garry Tan, CEO of Y Combinator, casually tweeted about needing an AI to screen his calls, and guess what? A student from UC California, Sarth Shah, actually built it in just 2 days and open-sourced the entire thing! 🤯 They call it Donna (of course, they must be huge fans of Suits, just like Trouble!)

GitHub - raviriley/donna: PearVC x OpenAI hackathon (built in 8 hours)

github.com/raviriley/donna

Donna uses Google Calendar to know when you're busy, Twilio to handle the calls, and OpenAI's language models to work her magic. She automatically rejects spam and sends calendar links to important callers. Talk about a tech-savvy duo!
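
Curious what the decision logic of such a call screener could look like? Here is a minimal Python sketch. It is not Donna's actual code: `screen_call`, the `Call` type, the keyword-based spam check, the "put through" branch, and the example.com booking link are all hypothetical stand-ins. In a real build, the busy flag would come from a calendar lookup and the spam check from a language model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Call:
    caller: str
    transcript: str


def screen_call(call: Call, is_spam: Callable[[str], bool],
                user_is_busy: bool, calendar_link: str) -> str:
    """Decide what to do with an incoming call."""
    if is_spam(call.transcript):
        return "reject"
    if user_is_busy:
        # Legitimate caller, but the calendar says busy: offer a slot instead.
        return f"send link: {calendar_link}"
    return "put through"  # assumed behaviour when the user is free


def keyword_spam_check(transcript: str) -> bool:
    # Toy stand-in for a language-model classifier.
    return "warranty" in transcript.lower()


link = "https://example.com/book"  # placeholder booking link
print(screen_call(Call("unknown", "Your car warranty is expiring"), keyword_spam_check, True, link))
print(screen_call(Call("investor", "Can we talk about the round?"), keyword_spam_check, True, link))
```

Passing the spam check in as a function keeps the routing logic testable without calling any external service.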

Check out Donna’s magic on GitHub, and see if it inspires your own shenanigans!

Well, that’s a wrap!
Thanks for reading 😊

See you next week with more mind-blowing tech insights 💻

Until then,
Stay Curious🧠 Stay Awesome🤩

PS: Do catch us on LinkedIn - Sandra & Tezan
