TurboQuant

A KV cache optimization technique from Google Research (published April 2026, synthesizing three concurrent papers) that enables 4–6x more tokens in the same RAM footprint on consumer hardware. Being merged into llama.cpp.

Important: TurboQuant is NOT what makes Gemma 4 small

Google released TurboQuant as a research note alongside Gemma 4 in April 2026. The simultaneous timing has led to a common misreading that TurboQuant is the compression technique behind Gemma 4’s small footprint. It is not. Per Fireship’s Code Report: “TurboQuant is actually not the secret behind Gemma 4’s small models.”

Gemma 4’s small footprint comes from per-layer embeddings (PLE) — see per-layer-embeddings-ple-explained. TurboQuant is a separate, parallel research drop that compresses the KV cache specifically (not model weights), and uses a different mathematical approach: polar coordinates instead of XYZ Cartesian, plus the Johnson-Lindenstrauss transform to compress high-dimensional data down to ±1 sign bits while preserving distances between points. The two techniques are complementary — Gemma 4 + TurboQuant in llama.cpp would stack — but they solve different problems.

Background: The KV Cache Problem

Every local model maintains a KV cache — a running record of the conversation that the model references on each inference step. This cache grows with every message and competes for the same RAM as the model weights.

On modest consumer hardware (32 GB RAM, 8 GB GPU), a 7B model might support only 8K tokens of context because the KV cache fills the remaining memory. The model may theoretically support 200K tokens, but the hardware caps you at 8K in practice.

This gap between theoretical and practical context window is the primary limitation holding local models back from competitive cloud-like experiences.

What TurboQuant Does

TurboQuant compresses the KV cache to use 4x less memory at the same context length (F16 baseline). In practical terms:

Before TurboQuant	After TurboQuant
8K context on consumer GPU	32K context on same GPU
Long documents: not feasible	Long documents: trivial
Meeting transcripts: not feasible	Meeting transcripts: feasible
Large codebase context: not feasible	Large codebase context: feasible

Tim Carambat’s calibration: a 3-hour Lex Fridman podcast = ~48K tokens. At 8K, impossible. At 32K, trivial.

Why It’s a Step Function, Not a Marginal Improvement

Going from 8K to 32K context isn’t just “more room” — it crosses thresholds that unlock entirely new categories of tasks for local AI:

Document summarization of real-world documents (most exceed 8K)
Extended agent sessions with meaningful history
Code review across large files
Meeting and call summarization

Implementation Status

Being actively merged into llama.cpp — the most widely used local inference engine, underlying Ollama, LM Studio, and most local AI tools. Once merged, all tools built on llama.cpp benefit automatically.

Broader Context

DDR5 RAM prices have risen sharply; TurboQuant makes current hardware more capable without hardware upgrades
Does not replace cloud inference for million-token workloads — it expands the practical range of tasks suitable for local inference
Directly improves the viability of hybrid workflows: more tasks can stay local (private, free) while frontier models handle the hardest tasks

AI For Dev

Explorer

turboquant

TurboQuant

Important: TurboQuant is NOT what makes Gemma 4 small

Background: The KV Cache Problem

What TurboQuant Does

Why It’s a Step Function, Not a Marginal Improvement

Implementation Status

Broader Context

See Also

Graph View

Table of Contents

Backlinks