20x Cheaper AI Models with LLMLingua?

PLUS: Killer prompt to turbocharge your social media content

Howdy fellas!

Welcome to yet another edition of The Vision, Debugged. In this edition, we’ll unleash the power of "less is more" with Spark & Trouble! Explore the latest research on prompt compression and discover tools & prompts to boost your productivity and creativity.

Here’s a sneak peek into this week’s edition 👀

  • Discover how LLMLingua makes using LLMs way more affordable

  • Steal the ‘Prompt of this Week’ to supercharge your social media presence

  • 3 new AI tools to boost your productivity

  • A quick way to design app logos & icons for your business using AI

Time to jump in!😄

PS: Got thoughts on our content? Share 'em through a quick survey at the end of every edition. It helps us see how our product labs, insights & resources are landing, so we can make them even better.

Hot off the Wires 🔥

We're eavesdropping on the smartest minds in research. 🤫 Don't miss out on what they're cooking up! In this section, we dissect some of the juiciest tech research that holds the key to what's next in tech.⚡

Recently, Trouble had an idea: integrate a GPT-4-powered AI agent into The Vision, Debugged to engage our readers in deeper conversations & answer their questions about our newsletter content.

Sounds impressive, right? Well, not exactly for our wallets. A quick back-of-the-envelope analysis made us realize that for a small business, using GPT-4 for customer support could translate to a hefty bill of over $21,000 a month. Yikes!

Fun Fact: ChatGPT is estimated to cost over a whopping $700,000 per day to operate!

Running black box LLMs like ChatGPT & Claude for high-throughput applications can be incredibly expensive (source: created by authors)

This is exactly where techniques like LLMLingua come in as a lifesaver. Authored by researchers at Microsoft, the LLMLingua paper proposes a clever method to significantly reduce the cost of using black-box LLMs like ChatGPT & Claude through their APIs: shrink the size of the prompts you feed them, without sacrificing their capabilities. Think of it like putting your LLM on a diet – it gets leaner and meaner, without losing any brainpower.

Before we dive into LLMLingua's magic, let's address a couple of key concepts.

Prompt Compression: LLMs rely on prompts, essentially instructions that guide them towards the desired response. The longer and more detailed the prompt, the better the LLM’s response, but also the more processing power (& money) it requires. Prompt compression techniques shrink these prompts without sacrificing the quality of the output.

Perplexity: Imagine an LLM predicting the next word in a sentence. Perplexity tells you roughly how many guesses it would need on average, considering the probabilities of each word, to get it right. Alternatively, you can think of it as how surprised the model is by the next word, given the sequence of words it has seen so far. The higher the perplexity, the harder the next word is to predict, indicating the model is unsure about the context.
Check out this 2-minute guide to get a good intuition about perplexity.
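
If you prefer code to prose, here's a minimal sketch of computing perplexity with a small causal LM (GPT-2 via the Hugging Face transformers library, purely for illustration):

```python
# Minimal perplexity sketch using GPT-2 (illustrative only)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # Tokenize and ask the model to predict each next token
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids, the model returns the average
        # cross-entropy loss over next-token predictions
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    # Perplexity is just e^(average negative log-likelihood)
    return torch.exp(loss).item()

print(perplexity("The cat sat on the mat."))          # low: predictable
print(perplexity("Mat the on sat cat quantum the."))  # high: surprising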

Compression Rate: The ratio of the number of tokens in the compressed prompt to that in the original prompt (the inverse of the “compression ratio”). For example, compressing a 1,000-token prompt down to 200 tokens gives a compression rate of 0.2, i.e., a 5x compression ratio. Prompt compression techniques strive for low compression rates (higher compression ratios).

LLMLingua leverages perplexity to compress prompts fed to LLMs. The core idea is to remove unnecessary information from the prompt while ensuring the LLM can still understand the context and answer your questions accurately. The key observation here is that “removing tokens with lower perplexity has a relatively minor impact on the LLM’s comprehension of the context”.

Under the Hood

LLMLingua Overall Architecture (source: LLMLingua paper)

Provided with the actual prompt & a target compression ratio (or target token number), LLMLingua employs a two-pronged approach using a locally run LLM for prompt compression: coarse-grained and fine-grained compression.

Let’s break them down step by step.

Note: To narrow the gap between the distribution of words produced by the black-box LLM and that of the locally-run LLM used for prompt compression, the two distributions are aligned via instruction tuning on the Alpaca dataset.

Coarse-Grained Compression through Demonstration-Level Perplexity

Typical chain-of-thought (CoT) prompts, or those involving retrieval-augmented generation (RAG), are long & include 3 key components:

  • Instruction: Sets the background for the LLM about what to do & how to do it

  • Demonstrations: These are examples/context documents that guide the LLM (CoT examples or retrieved snippets)

  • Question: This is what you want the LLM to answer

While the instruction & question are critical for the LLM’s response, demonstrations often tend to be redundant (as indicated by their perplexity). LLMLingua smartly picks up on this & uses a Budget Controller to strategically allocate compression budgets across the various components, such that demonstrations with lower perplexity (likely containing less crucial information) can be compressed more aggressively.

The outputs include a target prompt containing the instruction & question, along with a reduced number of demonstrations, and the adjusted compression rates for each of the prompt components.

Sneak peek into the working of the Budget Controller for Coarse-Grained Compression (source: created by authors)
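
To make the intuition concrete, here's a heavily simplified sketch of the demonstration-selection step, reusing the perplexity() helper and tokenizer from the snippet above (the real Budget Controller also adjusts per-component compression rates dynamically, which we skip here):

```python
# Simplified sketch of coarse-grained (demonstration-level) selection
def select_demonstrations(demos: list[str], token_budget: int) -> list[str]:
    # Score each demonstration: higher perplexity = more surprising,
    # hence (per the paper's intuition) more informative to keep
    scored = sorted(demos, key=perplexity, reverse=True)

    kept, used = [], 0
    for demo in scored:
        n_tokens = len(tokenizer(demo)["input_ids"])
        if used + n_tokens > token_budget:
            break
        kept.append(demo)
        used += n_tokens
    return kept

# `demonstrations` is a placeholder for your own CoT examples / retrieved docs:
# e.g. keep at most ~300 tokens' worth of the most informative ones
compressed_demos = select_demonstrations(demonstrations, token_budget=300)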

Fine-Grained Compression through Token-Level Perplexity

The target prompt that survives coarse-grained compression may still contain tokens carrying redundant information. So LLMLingua now focuses on this text & uses a technique called Iterative Token-level Prompt Compression (ITPC), which leverages "token-level perplexity" to pinpoint individual words or phrases that can be removed without significantly impacting the meaning.

Working of the ITPC Algorithm for Fine-Grained Compression (source: created by authors)

Basically, LLMLingua breaks the target prompt from the budget controller into bite-sized segments and analyzes each token in each segment iteratively, based on how well the LLM understands the context so far (by computing the token’s perplexity). For each segment, a perplexity threshold is chosen based on the conditional probability of the segment and the adjusted compression rates. Only the tokens that clear the threshold are retained in the final compressed prompt.

For thoroughness, when working on a segment, token-level perplexities are computed based on the compressed previous segments & the current segment under compression.
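
Here's a simplified sketch of that inner loop, reusing the model & tokenizer from above. The real ITPC derives per-segment thresholds from the adjusted compression rates; we just use a fixed keep-ratio for illustration (and ignore tokenizer boundary effects at the prefix/segment seam):

```python
import torch.nn.functional as F

# Simplified sketch of Iterative Token-level Prompt Compression (ITPC)
def compress_segment(prefix: str, segment: str, keep_ratio: float = 0.5) -> str:
    # Assumes a non-empty prefix (e.g., the instruction), so segment
    # tokens are always conditioned on at least one preceding token
    ids = tokenizer(prefix + segment, return_tensors="pt")["input_ids"]
    prefix_len = len(tokenizer(prefix)["input_ids"])

    with torch.no_grad():
        logits = model(ids).logits
    # Negative log-likelihood of each token given everything before it
    nll = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")

    seg_nll = nll[prefix_len - 1:]   # NLLs of this segment's tokens
    seg_ids = ids[0, prefix_len:]

    # Keep only the most "surprising" (highest-perplexity) tokens
    threshold = torch.quantile(seg_nll, 1.0 - keep_ratio)
    return tokenizer.decode(seg_ids[seg_nll >= threshold])

# Walk the segments, conditioning each on whatever survived so far
# (`instruction` and `segments` are placeholders for your prompt's parts)
compressed = instruction
for seg in segments:
    compressed += compress_segment(compressed, seg)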

Real-World Impact: Making LLMs More Affordable

The results speak for themselves. LLMLingua can achieve compression ratios of 3x to 20x, meaning the compressed prompt can be up to 20 times shorter than the original! This translates to massive cost savings without sacrificing accuracy.

Evaluation metrics on various datasets show that black-box LLMs using compressed prompts performed just as well as, or slightly better than, those using the full-length versions.

LLMLingua in action, compressing a long CoT Prompt by 7.3x (source: created by authors)

Techniques like LLMLingua open doors for the wider adoption of LLMs across various industries. They are especially useful in scenarios where LLMs are used with very long prompts, such as chain-of-thought (CoT) prompting and in-context learning (ICL). Examples of such applications could be question-answering systems or creative text-generation tasks that benefit from extensive context.

Note: A slight hiccup here is that pushing compression ratios beyond 25x can lead to a noticeable drop in performance.

Beyond LLMLingua

LLMLingua paves the way for a more advanced technique called LongLLMLingua. This addresses the challenge of extremely long prompts (particularly in RAG applications) and a phenomenon known as the "Lost-in-the-Middle" problem.

Lost-in-the-Middle: Research from Stanford University suggests that LLMs often prioritize information at the beginning and end of a long prompt, neglecting crucial details if they're buried in the middle.

LongLLMLingua tackles this by incorporating "question awareness" into its compression strategies. Intuitively, it modifies the Budget Controller to select documents based on their “importance” to the question (computed as the perplexity of the question when it is concatenated after the document) and uses a “contrastive perplexity” in ITPC to distinguish tokens relevant to the question & improve key information density. For more details, interested folks can check out the LongLLMLingua paper.
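
For intuition, here's a rough sketch of that document-importance idea (not the paper's exact formulation): a document counts as more relevant when the question looks less surprising, i.e., has lower perplexity, after reading the document. It reuses the model, tokenizer & imports from the earlier snippets; `docs` and `question` are placeholder variables:

```python
# Rough sketch of question-aware document ranking (illustrative only)
def question_perplexity(document: str, question: str) -> float:
    ids = tokenizer(document + "\n" + question, return_tensors="pt")["input_ids"]
    doc_len = len(tokenizer(document + "\n")["input_ids"])
    with torch.no_grad():
        logits = model(ids).logits
    nll = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")
    # Average NLL (hence perplexity) over the question tokens only
    return torch.exp(nll[doc_len - 1:].mean()).item()

# Most relevant documents first: lowest question perplexity wins
ranked = sorted(docs, key=lambda d: question_perplexity(d, question))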

LongLLMLingua Overall Architecture - LLMLingua with “question-awareness” (source: LongLLMLingua paper)

How You Too Can Try Out LLMLingua

Here's the cool part: you can try it out yourself, right away! Dive into the code over at the LLMLingua GitHub repository or play around with a live demo on Hugging Face 🤗.

Using llmlingua package in Python for prompt compression (source: created by authors)
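
For reference, here's roughly what that looks like with the llmlingua package (pip install llmlingua). Treat the exact argument names as a sketch, since they may shift across versions; `context`, `instruction` & `question` are placeholders for your own prompt components:

```python
# Minimal sketch following the llmlingua repo's README (may vary by version)
from llmlingua import PromptCompressor

# Loads a locally-run small LM as the compression model by default
compressor = PromptCompressor()

result = compressor.compress_prompt(
    context,                   # placeholder: list of demos / retrieved docs
    instruction=instruction,   # placeholder: what the black-box LLM should do
    question=question,         # placeholder: what you want answered
    target_token=200,          # rough size of the compressed prompt
)

print(result["compressed_prompt"])  # send this to GPT-4 / Claude instead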

Developers, you're in luck too! LLMLingua is already integrated into libraries like LlamaIndex and LangChain (two widely used RAG frameworks), making it super easy to use in your projects.

Heads-Up: Make sure you don’t use a closed-source API model to compress the prompt, or you’ll just be repeating the same mistake of racking up token costs. The integrations offer LLaMA-7B as the default compression model for LLMLingua; feel free to stick with it or swap in some other small language model (SLM).

Spark and Trouble are keeping a close eye on these developments – after all, who wouldn't want a more affordable and efficient way to chat with a large language model?

10x Your Workflow with AI 📈

Work smarter, not harder! In this section, you’ll find prompt templates 📜 & bleeding-edge AI tools ⚙️ to free up your time.

Fresh Prompt Alert!🚨

Feeling overwhelmed by that ever-growing content calendar?
Wish you had a magic social media schedule that practically posts itself?

We got you! This prompt will be your social samurai, crafting a month-long plan for your business. Just fill in the blanks, post & watch your engagement soar! 💯

You are an expert social media manager. I want you to create a schedule for social media posts over one month, starting from [insert date that schedule will start].

The frequency of posting will be [daily/every two days/every weekday/weekly].

My business is called [insert name] and we sell [insert products or services].

For each post, include the day it will be published, a heading, body text, and relevant hashtags.

The tone of voice we use is [professional/casual/funny/friendly].

For each post, also include a suggestion for an image that we can use that could be found on a stock image service.

* Replace the content in brackets with your details

3 AI Tools You JUST Can't Miss 🤩

  • ✍️ Becca - Find the latest trends in your niche to create engaging posts that sound just like you

  • 🔮 Smartli - Generate SEO-friendly and high-quality product descriptions 10x faster

  • 📽️ Glato - Create short video ads that sell with AI creators.

Spark 'n' Trouble Shenanigans 😜

Stuck on logo ideas? This image prompt template sparks instant logo & icon inspiration for your brand or app, in any style! Try it out on Midjourney or click here to test it on Microsoft Designer.

simple [style] icon for an [describe the app] app, [primary color] colour with [background color] background, ui/ux, ui --s 75

Check out some of our awesome creations using this image prompt template 👇

simple 3d vector gradient icon for an AI meme generator app, red & golden colour with black background, ui/ux, ui --s 75

simple minimalistic icon for an online messaging app, green color with white background, ui/ux, ui --s 75

simple mascot icon for a food delivery app, blue and purple with silver background, ui/ux, ui --s 75

simple emblem icon for Harvard university app, usual harvard colors with white background, ui/ux, ui --s 75

Well, that’s a wrap!
Thanks for reading 😊

See you next week with more mind-blowing tech insights 💻

Until then,
Stay Curious🧠 Stay Awesome🤩

PS: Do catch us on LinkedIn - Sandra & Tezan
