Google TurboQuant: Extreme KV-Cache Compression and the Reality Behind the 6x Claim

TL;DR

Data-oblivious KV-cache vector quantization to 2.5-4 bits, no calibration data required
Paper claims ~6x KV memory cut and up to 8x faster attention, near quality-neutral at 3.5-bit
Research only: no official Google code; community llama.cpp and vLLM forks; FP8 still the production default

Long context is a memory problem. The KV cache, the running store of key and value vectors that attention reads back on every token, balloons with sequence length and quietly eats your VRAM. Google Research's TurboQuant attacks exactly that: it squeezes the KV cache to 2.5 to 4 bits with no calibration data and near-optimal error. The community already wired it into llama.cpp and vLLM, so you can try it this afternoon. The headline numbers, though, need an asterisk or three.

What TurboQuant actually is

First, the thing people keep getting wrong: TurboQuant quantizes the KV cache, not the model weights. It does not replace GPTQ, AWQ, or your GGUF Q4 weights. It stacks on top of them. Weight quantization shrinks the model on disk and in memory; TurboQuant shrinks the per-token cache that grows as you generate. Different problem, complementary fix.
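
To put a number on "the per-token cache that grows as you generate", here is a minimal sizing sketch. The helper name is hypothetical, and the Llama-3.1-8B shape (32 layers, 8 KV heads, head dimension 128) comes from the published model config, not from the TurboQuant paper. It counts raw key and value payload only and ignores allocator overhead and any per-vector metadata.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bits_per_value=16):
    """Bytes for keys plus values across all layers, payload only."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = one key + one value
    return values_per_token * n_tokens * bits_per_value / 8


# Assumed shape for Llama-3.1-8B: 32 layers, 8 KV heads, head dimension 128.
per_token = kv_cache_bytes(32, 8, 128, n_tokens=1)
at_128k = kv_cache_bytes(32, 8, 128, n_tokens=128 * 1024)

print(f"{per_token / 1024:.0f} KiB per token")        # 128 KiB per token
print(f"{at_128k / 1024**3:.0f} GiB at 128K tokens")  # 16 GiB at 128K tokens
```

At FP16 that is already as much memory as the 8B weights themselves, which is why the cache, not the model, is what runs you out of VRAM at long context.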

It comes from a Google Research paper, "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate" (Zandieh, Daliri, Hadian, Mirrokni), posted to arXiv in April 2025 and accepted at ICLR 2026. Google blogged about it in March 2026. Two properties make it interesting for builders: it is online (compresses vectors as they are written, no separate pass) and it is data-oblivious (no calibration set, no tuning run). You point it at a model and go.

How it works

The trick is a random rotation. Quantization hates outliers, because a few large coordinates force a coarse scale on everything else. TurboQuant multiplies each key or value vector by a random orthogonal matrix first, which spreads outlier energy evenly across dimensions and pushes the per-coordinate distribution toward a predictable shape. After that rotation, each coordinate is close to independent, so a per-coordinate Lloyd-Max quantizer (the analytically optimal scalar quantizer for a known distribution) does the rest. Because the distribution is known ahead of time, there are no per-block scales or zero-points to store, which is where the bit savings come from.
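
The rotate-then-quantize idea fits in a few dozen lines of plain Python. Treat this as a sketch under stated assumptions, not the paper's implementation: every function name is made up for illustration, the codebook is fitted numerically with Lloyd iterations on Gaussian samples instead of being derived analytically, and the sketch keeps one norm per vector so the rotated coordinates have a known scale.

```python
import math
import random


def random_rotation(d, rng):
    """Random orthogonal d x d matrix: Gram-Schmidt on Gaussian rows."""
    rows = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:  # remove components along rows already accepted
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def nearest(codebook, value):
    """Index of the codebook level closest to value."""
    return min(range(len(codebook)), key=lambda j: abs(value - codebook[j]))


def lloyd_codebook(samples, levels, iters=25):
    """1-D Lloyd iterations: assign to nearest level, then move levels to cell means."""
    samples = sorted(samples)
    n = len(samples)
    centers = [samples[(2 * i + 1) * n // (2 * levels)] for i in range(levels)]
    for _ in range(iters):
        sums, counts = [0.0] * levels, [0] * levels
        for s in samples:
            k = nearest(centers, s)
            sums[k] += s
            counts[k] += 1
        centers = [sums[k] / counts[k] if counts[k] else centers[k]
                   for k in range(levels)]
    return centers


def quantize(x, rot, codebook):
    """Return (norm, indices): rotate the unit vector, then quantize per coordinate."""
    d = len(x)
    norm = math.sqrt(sum(a * a for a in x))
    rotated = [sum(a * b for a, b in zip(row, x)) / norm for row in rot]
    # A rotated unit vector has coordinates of variance about 1/d, so rescale
    # by sqrt(d) to match the unit-variance codebook.
    return norm, [nearest(codebook, c * math.sqrt(d)) for c in rotated]


def dequantize(norm, indices, rot, codebook):
    """Look up the levels, undo the rescaling, and rotate back."""
    d = len(indices)
    y = [codebook[i] / math.sqrt(d) for i in indices]
    # The inverse of an orthogonal matrix is its transpose.
    return [norm * sum(rot[r][c] * y[r] for r in range(d)) for c in range(d)]


rng = random.Random(0)
d, bits = 64, 3
rot = random_rotation(d, rng)
codebook = lloyd_codebook([rng.gauss(0.0, 1.0) for _ in range(4000)], 2 ** bits)

x = [rng.gauss(0.0, 0.1) for _ in range(d)]
x[0] = 10.0  # one large outlier coordinate
norm, indices = quantize(x, rot, codebook)
x_hat = dequantize(norm, indices, rot, codebook)

rel_err = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / sum(a * a for a in x)
print(f"{bits}-bit relative squared error: {rel_err:.4f}")
```

The codebook is built once from a synthetic distribution and never sees the vector being compressed. That is the data-oblivious property in miniature: the outlier in x[0] gets smeared across all 64 rotated coordinates before the scalar quantizer touches it.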

That is the MSE variant. There is a second variant tuned for attention. Attention does not care about reconstructing vectors; it cares about dot products. So TurboQuant runs the MSE quantizer at one bit lower, then applies a 1-bit Quantized Johnson-Lindenstrauss (QJL) transform to the residual error. At read time, an estimator combines both parts to recover an unbiased estimate of the inner product. The paper argues the whole scheme lands within a small constant factor of the information-theoretic distortion floor.
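
The 1-bit residual step is easier to trust once you see the estimator average out. The sketch below covers only the sign-projection part, using the standard identity E[(s·q)·sign(s·r)] = sqrt(2/pi)·<q, r>/||r|| for a Gaussian vector s. The function names, the dense Gaussian projection, and the choice of one projection per dimension are assumptions for illustration; whether TurboQuant's kernels match this in detail is not something the article's sources confirm. In the full scheme the estimate for the residual r would be added to the inner product with the MSE-quantized vector.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def qjl_encode(r, proj):
    """Keep one sign bit per Gaussian projection of r, plus the norm of r."""
    signs = [1.0 if dot(row, r) >= 0.0 else -1.0 for row in proj]
    return signs, math.sqrt(dot(r, r))


def qjl_inner_product(q, signs, r_norm, proj):
    """Estimate <q, r> from a full-precision query and the 1-bit code of r."""
    acc = sum(s * dot(row, q) for s, row in zip(signs, proj))
    return math.sqrt(math.pi / 2.0) * r_norm * acc / len(proj)


rng = random.Random(0)
d = 16
q = unit([rng.gauss(0.0, 1.0) for _ in range(d)])  # stands in for a query
r = unit([rng.gauss(0.0, 1.0) for _ in range(d)])  # stands in for a residual
true_value = dot(q, r)

# Average over many independent projections to check the estimator is unbiased.
trials = 2000
estimates = []
for _ in range(trials):
    proj = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]
    signs, r_norm = qjl_encode(r, proj)
    estimates.append(qjl_inner_product(q, signs, r_norm, proj))

mean_estimate = sum(estimates) / trials
print(f"true {true_value:+.4f}  mean of estimates {mean_estimate:+.4f}")
```

Any single estimate is noisy, but the mean converges on the true dot product. Unbiasedness is what matters for attention, where errors across many keys should cancel instead of drifting in one direction.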

The numbers, and the asterisks

Google's reported results are genuinely good. On Llama-3.1-8B-Instruct, long-context quality barely moves at 3.5 bits per channel.

Config	LongBench avg	Needle-in-a-Haystack	Source
FP16 baseline	50.16	0.997	paper
TurboQuant 3.5-bit	50.06	0.997	paper
TurboQuant 2.5-bit	49.44	~baseline	paper

Pair that with the memory math (roughly 6x smaller KV cache at 3-bit, and up to 8x faster attention at 4-bit on an H100) and it sounds like a free lunch. It is not, and the people who run inference for a living said so.
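
The ratio side of that memory math is easy to sanity-check against a 16-bit baseline. This is payload-only arithmetic with hypothetical helper names. It says nothing about how the reported figures were measured and ignores any per-vector metadata.

```python
def ideal_ratio(bits, baseline_bits=16):
    """Best-case compression ratio if every stored value costs exactly `bits`."""
    return baseline_bits / bits


def bits_for_ratio(ratio, baseline_bits=16):
    """Average bits per value needed to reach a given compression ratio."""
    return baseline_bits / ratio


for bits in (4, 3.5, 3, 2.5):
    print(f"{bits} bits -> {ideal_ratio(bits):.2f}x")
# 4 bits -> 4.00x
# 3.5 bits -> 4.57x
# 3 bits -> 5.33x
# 2.5 bits -> 6.40x

print(f"6x needs about {bits_for_ratio(6):.2f} bits per value")  # about 2.67
```

On this arithmetic a flat 3 bits per value gives 5.33x, and 6x sits between the 3-bit and 2.5-bit settings.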

The vLLM team published the first large independent study in May 2026, testing Llama-3.3-70B, Qwen3-30B-A3B, and others on H100s. Their verdict: plain FP8 KV cache is still the best default. Here is the shape of it.

Method	KV capacity	Throughput vs BF16	Accuracy cost	Source
FP8 KV	2.0x	~100%	~0	vLLM (community)
TurboQuant k8v4	2.4x	~75 to 80%	~0	vLLM (community)
TurboQuant 4bit-nc	up to 3.4x	~75%	~4 pts reasoning	vLLM (community)
TurboQuant 3bit-nc	highest	~66 to 73%	~20 pts reasoning, ~30% long-ctx retrieval at 256K	vLLM (community)

Read that bottom row twice. At aggressive 3-bit settings on a 256K context, long-context retrieval drops by about 30 percent and reasoning accuracy falls roughly 20 points. The compression is real, but so is the tax, and FP8 gets you 2x for free with no measurable loss. TurboQuant earns its keep when you are memory-bound and FP8's 2x is not enough, not when you are chasing raw throughput.

How to run it today

There is no official Google package. Every usable build is a community fork. Two are worth your time.

llama.cpp (local, Apple Silicon or a single GPU)

Community forks add tq3_0 (3-bit, about 4.9x KV compression) and tq4_0 (4-bit, about 3.8x) cache types. Build for your backend, then pass the cache type at runtime:

# Build (pick your backend)
cmake -B build -DGGML_METAL=ON && cmake --build build -j   # Apple Silicon
# cmake -B build -DGGML_CUDA=ON  && cmake --build build -j  # NVIDIA

# Run with a 4-bit TurboQuant KV cache
./build/bin/llama-cli -m model.gguf \
  --cache-type-k tq4_0 --cache-type-v tq4_0 \
  -c 32768 -p "Summarize this long document..."

vLLM (server, NVIDIA)

The community vLLM fork exposes TurboQuant as a kv-cache dtype. Build from source, then serve:

vllm serve /models/your-model \
  --tensor-parallel-size 4 \
  --attention-backend TRITON_ATTN \
  --kv-cache-dtype turboquant35 \
  --enable-turboquant

The under-10-minute move: take a model that currently runs out of memory at your target context, switch the KV cache to tq4_0 (or turboquant35 on vLLM), and see whether the longer context now fits. If it does and your eval holds, you win memory for almost nothing. If quality dips, step up the bit-width or fall back to FP8.
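
Before rebuilding anything, a back-of-envelope check tells you whether the switch can plausibly get you to your target context. This sketch takes the compression ratios quoted in this article (2.0x for FP8, about 3.8x for tq4_0, about 4.9x for tq3_0) at face value. The 4 GiB KV budget and the 128 KiB-per-token FP16 figure (a Llama-3.1-8B-shaped cache) are assumptions to replace with your own numbers.

```python
def max_context_tokens(kv_budget_bytes, fp16_bytes_per_token, compression_ratio):
    """Largest token count whose compressed KV cache fits in the budget."""
    return int(kv_budget_bytes * compression_ratio // fp16_bytes_per_token)


KV_BUDGET = 4 * 1024 ** 3   # hypothetical: 4 GiB left for the cache after weights
FP16_PER_TOKEN = 128 * 1024  # assumed: 128 KiB per token at FP16

# Ratios as reported in this article, not independently measured.
reported_ratios = {"fp16": 1.0, "fp8": 2.0, "tq4_0": 3.8, "tq3_0": 4.9}

fits = {name: max_context_tokens(KV_BUDGET, FP16_PER_TOKEN, ratio)
        for name, ratio in reported_ratios.items()}

for name, tokens in fits.items():
    print(f"{name:6s} fits about {tokens:,} tokens")
```

If your target context only fits under the 3-bit row, run your own eval before committing.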

The caveats nobody puts on the slide

Three things temper the hype.

No official code. Google published a paper and a blog post, not a library. TechCrunch called it a lab breakthrough, and that is accurate. Anything labeled "Google's TurboQuant library" is community code, maintained by individuals, not Google.

The RaBitQ dispute. The core move, random rotation before quantization, is the same idea behind RaBitQ (Gao and Long, 2024). Critics on OpenReview and elsewhere argue the paper mischaracterized RaBitQ's method and guarantees, and that the speed comparison pitted TurboQuant on an A100 GPU against RaBitQ on a single CPU core with multithreading switched off. One author acknowledged multiprocessing was disabled. Treat the head-to-head speed claims against RaBitQ as unsettled.

The famous numbers are narrow. The 8x is attention-logit computation only, not end-to-end inference. The "zero accuracy loss" applies to the 3.5-bit setting on long-context benchmarks; the paper never ran MMLU, GSM8K, or HumanEval, and the vLLM study found real drops once you push to 3-bit or stress reasoning.

Who should use it, and who should wait

Try it if you run long contexts on memory-bound hardware: a consumer GPU, a Mac, an edge box that hits out-of-memory before it hits a speed wall. A 4-bit KV cache can be the difference between a 32K context fitting or not. Models like Gemma 4, Qwen3.5, and DeepSeek-V4 are all popular community targets, and MiniMax's long-context models showed up in the vLLM test set too.

Wait, or just use FP8, if you are on datacenter GPUs optimizing for throughput, or running aggressive low-bit configs on reasoning-heavy workloads. In those cases FP8 KV cache gives you 2x with no throughput penalty and no accuracy hit, and that is hard to beat.

Sources and further reading

Tested on: not independently tested. The figures here are paper- or community-reported via the sources above, with the vLLM study and the RaBitQ dispute flagged as such.
Date checked: 2026-06-09