TurboQuant: Redefining AI Efficiency with Extreme Compression

TL;DR

  • TurboQuant compresses the KV cache of large language models by 4.5x with no measurable accuracy loss. Push it to 2.5 bits per value and the compression exceeds 6x, with only minor quality loss.
  • It works in two stages: a random rotation makes the data easy to quantize, then a 1-bit error correction keeps the attention scores accurate.
  • Unlike conventional methods that need expensive codebooks, TurboQuant uses precomputed quantization levels with nearly zero overhead.
  • The mathematical building blocks go back decades, some to the late 1940s. The contribution is combining them for a modern problem.
  • For cloud providers this means serving more users on the same hardware. For the rest of us, it’s another step toward running capable AI on laptops and phones.

I’ve been running LLMs locally for a while now. The experience is always the same: things work fine at first, then the conversation gets longer, memory fills up, and at some point the model either slows to a crawl or crashes. The bottleneck is almost never the model itself. It’s the memory that grows with every message.

So when Google Research published TurboQuant at ICLR 2026, a method that attacks exactly this bottleneck by compressing the KV cache 4.5x without losing quality, I wanted to understand how it works. And whether it could actually matter for people running models outside of data centers.

The Problem: Why Memory Matters

When a language model generates text, it stores intermediate results for every token[1] in the conversation. These results are called keys and values, and together they form the KV cache.[2] The attention mechanism[3] uses them to figure out which parts of the conversation matter for the next word.

The KV cache grows with every token. Longer conversation, more memory. For cloud providers serving thousands of users at once, this is one of the biggest costs. For anyone running a model locally, it’s the wall you hit.
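To put numbers on that growth, here is a back-of-the-envelope calculation. The dimensions are my assumptions, roughly matching an 8B-parameter model with grouped-query attention (32 layers, 8 KV heads, head dimension 128, fp16), not values from the paper:

```python
# Rough KV cache size for an 8B-class model (assumed dimensions).
n_layers = 32        # transformer layers
n_kv_heads = 8       # KV heads (grouped-query attention)
head_dim = 128       # dimension per head
bytes_per_value = 2  # fp16

# Per token: one key vector and one value vector in every layer.
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
print(bytes_per_token // 1024, "KiB per token")   # 128 KiB

# A 32k-token conversation:
total_gib = 32_000 * bytes_per_token / 2**30
print(f"{total_gib:.1f} GiB")                     # ~3.9 GiB
print(f"{total_gib / 4.5:.1f} GiB at 4.5x")       # ~0.9 GiB
```

Almost 4 GB for the cache alone, on top of the model weights. A 4.5x reduction is the difference between a long conversation fitting on a consumer GPU and not fitting at all.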

The obvious fix: quantization.[4] Represent each number with fewer bits. Instead of 16-bit floats, use 4 or 3. But it’s not that simple. LLM activations[5] contain outliers.[6] One dimension in a vector might hold a large value while the rest sit near zero. Round everything to 3 bits, and the small values collapse. Information is lost.
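Here is the collapse in a toy sketch (my numbers, purely illustrative): uniform 3-bit quantization of a vector with a single outlier has to stretch its levels across the whole range, so everything else rounds to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 0.1, 128)   # 128 values, almost all near zero
x[0] = 50.0                   # one outlier dimension

# Uniform 3-bit quantization: 8 levels spanning the full observed range.
levels = 8
step = 2 * np.abs(x).max() / (levels - 1)
q = np.clip(np.round(x / step), -levels // 2, levels // 2 - 1) * step

# The outlier survives (coarsely), but every small value rounds to zero.
print(np.count_nonzero(q[1:]))    # 0 -- all 127 small values collapsed
print(q[0] != 0)                  # True
```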

The codebook problem

Conventional compression methods deal with outliers by learning codebooks: lookup tables that map groups of values to the best matching entry. These codebooks cluster the data and store a representative value for each cluster.

The problem: codebooks need their own storage, and they grow exponentially with the number of bits. They also take significant time to compute. The TurboQuant paper puts numbers on this: codebook-based methods need 37 to 494 seconds to build their codebooks, depending on the vector dimension. TurboQuant does the same job in 0.0007 to 0.002 seconds. Five orders of magnitude faster. And without the storage overhead.

How TurboQuant Works

TurboQuant takes a completely different approach. Two stages, both simple in principle.

Stage 1: Rotate, Then Quantize

Intuition: Why shuffling the load helps

Four workers carrying a sofa. One holds almost all the weight, the other three barely touch it. You can’t replace them with four average-strength workers; that one corner needs a specialist. After rotation, the weight is distributed evenly across all four. Now you can plan around average load: you know in advance what each worker needs to handle.

Quantization has the same problem. One oversized value forces you to design your scale around the worst case. After rotation, every dimension carries roughly the same load, and that load follows a known pattern you can prepare for.

If outliers make quantization hard, get rid of them first.

TurboQuant applies a random rotation to each vector before quantizing it. A rotation preserves two things: the length of each vector and the angle between any pair of vectors. Since the attention mechanism depends only on these two properties, the model produces the same output whether or not the vectors were rotated. But the way values are spread across dimensions changes.

Before rotation, the distribution of values depends on the input: one vector might have a single large coordinate, another might be spread evenly. After a random rotation, the coordinates of any vector follow approximately a normal distribution, regardless of what the original vector looked like. The input’s structure is erased; what remains is predictable.

Here’s what makes this practical: after rotation, the coordinates follow a known statistical distribution.[7] Known in advance, for any model. That means the optimal quantization levels can be precomputed once and reused everywhere. No calibration data, no fine-tuning, no per-model adjustments. This is the key difference from codebook methods.
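A minimal sketch of the rotation property, using a dense random orthogonal matrix (a real implementation would use a faster structured rotation, but the effect is the same): norms and dot products survive exactly, while the outlier’s energy gets spread across all coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256

# A random orthogonal matrix, here via QR decomposition of a Gaussian.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

x = np.zeros(d)
x[0] = 10.0              # extreme outlier: all energy in one dimension
y = rng.normal(size=d)   # an arbitrary second vector

xr, yr = Q @ x, Q @ y

# Lengths and dot products survive the rotation (what attention needs)...
assert np.isclose(np.linalg.norm(xr), np.linalg.norm(x))
assert np.isclose(xr @ yr, x @ y)

# ...but the outlier's energy is now spread over all 256 coordinates.
print(np.abs(x).max())   # 10.0
print(np.abs(xr).max())  # far smaller: roughly Gaussian-sized entries
```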

Stage 2: Fix the Bias

Intuition: Why one bit is enough

A car pulling slightly to the right doesn’t need a precise force measurement to fix. You just need to know which direction it’s pulling, then steer the other way. QJL stores exactly that: the direction of the quantization error. One bit per dimension. Enough to keep the result straight.

Stage 1 minimizes the rounding error for each vector individually. Good, but not enough. The errors aren’t random. They lean in a consistent direction.

Why that’s a problem: the attention mechanism computes dot products[8] between vectors to calculate how relevant one token is to another. A systematic bias in those dot products accumulates across thousands of tokens. The attention scores start drifting. Eventually, the model picks the wrong words.

This is where QJL comes in.[9]

QJL stands for Quantized Johnson-Lindenstrauss. The name sounds complicated, but the idea isn’t. The Johnson-Lindenstrauss lemma[10] is a mathematical result from 1984 that says: if you project data onto random directions, distances between points are approximately preserved. You lose dimensions, but the relationships survive.

QJL applies this idea to the quantization error: project the rounding error onto random directions and keep only the sign, +1 or -1. One bit per dimension. No magnitude, just direction. That’s enough to cancel the systematic part of the error in the dot products. The result is an unbiased estimator: across many tokens, the errors cancel out instead of accumulating.

The cost: one extra bit per dimension. The payoff: attention scores stay accurate and model quality is fully preserved. The TurboQuant paper shows that at 3.5 bits total (scalar quantization plus the 1-bit correction), there is zero measurable quality loss.
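The sign-bit estimator is worth seeing in isolation. The sketch below is my illustration, not the paper’s exact construction (in TurboQuant the bits are applied to the quantization residual, one per dimension, with the vector’s norm stored as a single scalar); many random directions are used here just to make the unbiasedness visible in one shot:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 64, 20_000   # vector dimension; many projections, for the demo

k = rng.normal(size=d)              # a "key" we want to compress
q = k + 0.5 * rng.normal(size=d)    # a correlated "query"

S = rng.normal(size=(m, d))         # random projection directions
bits = np.sign(S @ k)               # all we keep: 1 sign bit per direction

# Unbiased estimator: E[sign(s.k) * (s.q)] = sqrt(2/pi) * <k, q> / ||k||,
# so rescaling by sqrt(pi/2) * ||k|| recovers the dot product on average.
est = np.sqrt(np.pi / 2) * np.linalg.norm(k) * np.mean(bits * (S @ q))

true = k @ q
assert abs(est - true) / abs(true) < 0.1   # signs alone track the value
```

Each individual bit carries almost nothing; it is the average over many directions that cancels the error instead of letting it accumulate.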

How It Compares

Intuition: Why polar form helps

When you describe a location, you can give Cartesian coordinates (3 km east, 4 km north) or polar coordinates (5 km, heading 53°). Same point, different representation. The advantage: angles always fall into a fixed range (0° to 360°), no matter how far away the point is. Raw coordinates can be anything. PolarQuant exploits this by converting vectors to polar form, where the angles land in predictable ranges that are easy to compress.

Another recent approach is PolarQuant[11] (AISTATS 2026, same research group). PolarQuant uses a different strategy: it separates each vector into its length and direction, then quantizes them independently. It achieves over 4.2x compression on the KV cache.

TurboQuant goes further. On LongBench,[12] PolarQuant at 3.9 bits scores 49.78. TurboQuant scores 50.06 at 3.5 bits, matching the uncompressed baseline exactly despite using fewer bits.
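The magnitude/direction split behind PolarQuant can be sketched in two dimensions (my illustration, not the paper’s recursive transform): keep the length at full precision and quantize only the angle, whose fixed range makes uniform levels work well.

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=2) * 100      # a 2-D vector of arbitrary magnitude

# Split into magnitude (unbounded) and angle (always in [-pi, pi]).
r = np.linalg.norm(v)
theta = np.arctan2(v[1], v[0])

# The fixed range makes uniform quantization easy: 8 bits here.
levels = 256
step = 2 * np.pi / levels
theta_q = np.round(theta / step) * step

v_rec = r * np.array([np.cos(theta_q), np.sin(theta_q)])

# Reconstruction error is bounded by r * step / 2, whatever v looks like.
print(np.linalg.norm(v - v_rec) <= r * step / 2 + 1e-9)  # True
```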

Results

TurboQuant was tested on Llama-3.1-8B-Instruct and Ministral-7B-Instruct on a single NVIDIA A100 GPU:

  • At 3.5 bits per channel,[13] TurboQuant matches the uncompressed model. On LongBench, both score 50.06. No measurable quality difference.
  • At 2.5 bits per channel, there is minor quality loss. Llama scores 49.44 vs. 50.06, Ministral scores 49.62 vs. 49.89. Still usable.
  • On the Needle-in-a-Haystack test,[14] TurboQuant scores 0.997, matching the full-precision baseline.
  • At 2.5 bits, the KV cache is compressed by over 6x.
  • In nearest neighbor search, TurboQuant outperforms codebook-based methods in recall while reducing indexing time to nearly zero.

The paper also proves a lower bound on quantization error: no quantizer, no matter how complex or slow, can do better than a certain minimum. TurboQuant’s error is only about 2.7x above that theoretical floor. Even a perfect, not-yet-invented quantizer could improve on TurboQuant by at most that factor. For a method that needs no training and runs in microseconds, that’s remarkably close to optimal.

Old Tools, New Recipe

None of the individual pieces in TurboQuant are new.

Random rotations for spreading out values? Known since long before computers. Optimal quantizers for known distributions? Lloyd (1957) and Max (1960). Random projections that preserve distances? Johnson-Lindenstrauss (1984). The framework for measuring compression limits? Shannon (1948).

The contribution is the combination. Rotate vectors to make them quantizer-friendly. Apply scalar quantization per coordinate. Fix the leftover bias with a 1-bit sign projection. Each piece existed. The recipe is new.
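The whole recipe fits in a few lines of NumPy. This is a sketch under simplifying assumptions: uniform levels stand in for the paper’s precomputed optimal ones, and a dense rotation stands in for a structured one. The final assertion checks the key property, that the 1-bit correction estimates the leftover error without systematic bias.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32

# 1. Random rotation: Gaussianizes coordinates, preserves dot products.
R, _ = np.linalg.qr(rng.normal(size=(d, d)))
k = rng.normal(0, 0.2, d)
k[0] = 5.0                                   # a key vector with an outlier
q = k + rng.normal(size=d)                   # a correlated query
kr, qr = R @ k, R @ q

# 2. Per-coordinate scalar quantization: 3 bits spanning roughly 3 sigma.
sigma = np.linalg.norm(kr) / np.sqrt(d)
step = 6 * sigma / 7
k_hat = np.clip(np.round(kr / step), -4, 3) * step
e = kr - k_hat                               # leftover quantization error

# 3. QJL-style sign bits on the error. Averaged over fresh projections,
# the 1-bit correction recovers <e, q> with no systematic bias.
corrections = []
for _ in range(4000):
    S = rng.normal(size=(d, d))
    bits = np.sign(S @ e)                    # one extra bit per dimension
    corrections.append(np.sqrt(np.pi / 2) * np.linalg.norm(e) / d
                       * (bits @ (S @ qr)))

assert abs(np.mean(corrections) - e @ qr) < 0.2   # unbiased on average
```

In a real decode step the correction is computed once per key with a fixed projection; the averaging here only demonstrates that the estimator has no bias left to accumulate.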

I keep seeing this pattern in AI research. The transformer architecture builds on attention ideas from well before 2017. Reinforcement learning from human feedback draws on reward modeling from decades earlier. TurboQuant reaches into classical information theory and signal processing.

Progress doesn’t always mean inventing something new. Often it means finding the right combination of things that already exist.

What This Means in Practice

For cloud providers, the math is straightforward. 4.5x less KV cache memory without losing quality means more concurrent users on the same GPUs, or longer context windows without buying more hardware. At the scale of millions of requests, that’s a direct cost reduction.

For local AI, memory is the main constraint. I notice it every time I push a conversation past a few thousand tokens on my machine. KV cache compression helps make longer conversations possible within a fixed memory budget. That matters on consumer GPUs with 8 to 24 GB of VRAM.

For phones and mobile devices, honesty matters more than hype. Today’s on-device models are small, typically 1 to 3 billion parameters. At that size, the KV cache is not the bottleneck. The model weights are. Those need different compression techniques.

But the direction is worth watching. On-device models are getting bigger. Flagship phones with 12 GB or more of RAM can already run quantized 7B models. As these models grow and conversations get longer, the KV cache will start to matter. Techniques like TurboQuant won’t help much on a phone today. But they address exactly the bottleneck that’s coming.

Where This Goes

What I find most interesting about TurboQuant is not the compression ratio. It’s that the method needs zero training and zero data. Precomputed levels, a random rotation, a sign bit. No codebooks to learn, no model-specific tuning. That makes it cheap to deploy.

Whether this specific method ends up in the inference stacks we use remains to be seen. But the underlying principle, that you can get surprisingly far by combining well-understood math in new ways, is not going away. The best compression might not come from a new neural network. It might come from a theorem that’s been sitting in a textbook since 1948.

Footnotes

  1. A token is the smallest unit a language model works with, roughly ¾ of a word in English. Common words like “the” are a single token. Longer or rarer words take multiple tokens.

  2. The KV cache (Key-Value cache) stores the Key and Value vectors that the attention mechanism has already computed for previous tokens. Without it, the model would recompute these for the entire conversation history with every new token. For a detailed explanation, see Context Is All You Need.

  3. The attention mechanism lets each token look at all previous tokens and decide which ones are relevant for predicting the next word. It computes relevance scores as dot products between Query and Key vectors, then uses those scores to weight the Value vectors.

  4. Quantization means representing numbers with fewer bits. A 16-bit float can express very precise values; a 3-bit integer can only distinguish between 8 levels. The trade-off is precision for memory savings.

  5. In quantization, activations refer to all values computed at runtime, as opposed to the stored model weights. The Key and Value vectors in the KV cache are activations in this sense: they are computed fresh for every token by projecting the hidden states through learned weight matrices.

  6. In this context, outliers are individual values within a vector that are much larger than the rest. A vector might have 127 dimensions near zero and one dimension at 50. This uneven spread makes uniform quantization ineffective: the quantizer wastes precision covering the full range when most values are small.

  7. After rotation, the coordinates follow a Beta distribution that converges to a Gaussian in high dimensions (which is the case for LLM vectors). This predictability is what allows the quantization levels to be precomputed once for all models.

  8. A dot product between two vectors produces a single number that measures how similar they are in direction. The attention mechanism uses dot products to compute relevance scores between tokens.

  9. QJL (Quantized Johnson-Lindenstrauss) is a method from Zandieh et al. (2024) that uses random projections followed by sign-bit quantization to estimate dot products with zero storage overhead. Published at AAAI 2025.

  10. The Johnson-Lindenstrauss lemma (1984) proves that a set of points in high-dimensional space can be projected into a much lower-dimensional space while approximately preserving all pairwise distances. The projection uses random directions. QJL takes this one step further by quantizing the projected values to just their sign.

  11. PolarQuant (Zandieh et al., 2026) separates each vector into its magnitude and direction using a recursive polar transformation. The direction is decomposed into individual angles, each quantized independently with precomputed codebooks. The residual magnitudes are stored in 16-bit precision. Published at AISTATS 2026.

  12. LongBench is a benchmark for evaluating how well language models handle long-context tasks like summarization, question answering, and multi-document reasoning.

  13. A channel is one coordinate in a vector. If a Key vector has 128 dimensions, it has 128 channels. Bits per channel is how many bits are used to store each one. At 16 bits per channel, one coordinate can represent over 65,000 distinct values. At 3.5 bits, it’s roughly 11. The number is an average across the full scheme: in TurboQuant, 1 bit always goes to the QJL error correction, the rest to scalar quantization.

  14. The Needle-in-a-Haystack test hides a specific fact at a random position in a long document and asks the model to retrieve it. A score of 1.0 means perfect retrieval. It tests whether compression affects the model’s ability to access information across the full context.