Quantization Formats Explained: NVFP4 vs MXFP4 vs FP8, and Whether a Model Fits Your GPU

TL;DR

FP8 is the safe 2x cut: near-lossless at scale, but needs a GPU with FP8 Tensor Cores (NVIDIA Ada, Hopper, or newer) to run fast
FP4 (MXFP4, NVFP4) is a 4x cut; NVFP4's two-level scaling keeps more accuracy, but FP4 speedup needs Blackwell
The format you pick decides whether a model fits your GPU and how fast it runs

OpenAI shipped gpt-oss in MXFP4. DeepSeek trained V3 in FP8. NVIDIA's Blackwell cards added native NVFP4, and vLLM, SGLang, and llama.cpp have spent early 2026 racing to support all of it. Underneath the acronym soup is one practical question: will the model fit on your GPU at all, and how fast will it run once it does? The quantization format you pick decides both. This is the evergreen map of that landscape, with the bit layouts, the memory math, and an honest "what runs fast on what hardware" verdict. Numbers here are vendor/spec- or community-reported and labeled inline.

The one formula that matters

Quantization stores each weight in fewer bits. Everything else follows from that. To a first approximation:

VRAM (GB) = params_in_billions * bits_per_weight / 8 * 1.2  # +20% overhead
# plus the KV cache, which grows with context length and batch size
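
As a quick sanity check, the formula above can be sketched in a few lines of Python. The function name and the `overhead` parameter are illustrative choices, not part of any library, and the result covers weights plus overhead only, not the KV cache.

```python
def estimate_weight_vram_gb(params_billions: float,
                            bits_per_weight: float,
                            overhead: float = 0.20) -> float:
    """First-order VRAM estimate in GB (1e9 bytes): weights plus a flat overhead."""
    # billions of parameters * bytes per parameter = gigabytes of weights
    weight_gb = params_billions * bits_per_weight / 8
    return weight_gb * (1 + overhead)


if __name__ == "__main__":
    for name, params in [("8B", 8), ("70B", 70), ("120B", 120)]:
        row = {fmt: round(estimate_weight_vram_gb(params, bits), 1)
               for fmt, bits in [("FP16", 16), ("FP8", 8), ("4-bit", 4)]}
        print(name, row)
```

At exactly 4 bits this gives about 42 GB for a 70B model and 72 GB for a 120B model. Reported 4-bit sizes tend to run higher because real checkpoints also store block scales and keep some layers at higher precision.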

So a format is not an abstract quality knob. It is the difference between a model loading and an out-of-memory error. Here is what the common formats do to three model sizes (weights plus the 20 percent overhead, before the KV cache).

Model	FP16 (16-bit)	FP8 (8-bit)	4-bit (INT4 / FP4)	Smallest GPU at 4-bit
8B (Llama 3.1)	~17-19 GB	~9-10 GB	~5 GB	8 GB consumer card
70B (Llama 3.3)	~168 GB	~84 GB	~46 GB	1x 48 GB (L40S) or 2x RTX 4090
120B (gpt-oss MoE)	~288 GB	~140 GB	~81 GB	1x H100 80 GB

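Figures are community-reported, rounded, and include roughly 20 percent overhead. The headline is stark: a 120B model that needs four H100s in FP16 squeezes onto a single 80 GB card at 4-bit, though at an estimated ~81 GB the fit is tight and depends on how much of that overhead a given runtime actually incurs. That is the entire reason this topic exists.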

The precision ladder

Modern formats step down a ladder: FP16/BF16 (the baseline), FP8 (half the bytes), then the 4-bit floats MXFP4 and NVFP4 (a quarter). A floating-point number splits its bits into a sign, an exponent (dynamic range), and a mantissa (precision). Fewer bits means less of both, so the trick is choosing where to spend them and how to rescale blocks of values to claw accuracy back.

Format	Sign/Exp/Mantissa	Block size	Scale format	Effective bits/weight
FP8 E4M3	1 / 4 / 3	per-tensor or per-channel	FP32/FP16	8
FP8 E5M2	1 / 5 / 2	per-tensor or per-channel	FP32/FP16	8
MXFP4	1 / 2 / 1 (E2M1)	32	E8M0 (power-of-2)	~4.25
NVFP4	1 / 2 / 1 (E2M1)	16	E4M3 + FP32 tensor scale	~4.5

All four are spec-defined. Note that MXFP4 and NVFP4 use the identical 4-bit element (E2M1); the entire difference between them is how they scale blocks, which we get to below.
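
The "effective bits/weight" column is just the 4-bit element plus its share of the block scale. A minimal sketch, with a hypothetical helper name, that reproduces the two FP4 rows:

```python
def effective_bits(element_bits: int, scale_bits: int, block_size: int) -> float:
    """Bits stored per weight once the per-block scale is amortized over the block."""
    return element_bits + scale_bits / block_size


mxfp4 = effective_bits(element_bits=4, scale_bits=8, block_size=32)  # 8-bit E8M0 scale per 32 weights
nvfp4 = effective_bits(element_bits=4, scale_bits=8, block_size=16)  # 8-bit E4M3 scale per 16 weights

print(f"MXFP4: {mxfp4} bits/weight, NVFP4: {nvfp4} bits/weight")
```

NVFP4's extra FP32 scale is stored once per tensor, so it adds only 32 bits in total and does not move the per-weight figure in any meaningful way.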

FP8: the safe 2x

FP8 is the least scary cut, and the one most production stacks reach for first. It comes in two flavors: E4M3 (4 exponent, 3 mantissa bits, max value 448) is used for weights and activations during inference; E5M2 (5 exponent, 2 mantissa, much wider range) is used for gradients during training. More mantissa for forward-pass precision, more exponent for gradient range.
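
To see where the range difference comes from, here is a small sketch that derives the largest finite value from each bit layout. It assumes the encoding rules of the OCP FP8 definitions, which this article does not spell out: E5M2 reserves its top exponent code for infinity and NaN in the IEEE style, while E4M3 has no infinity and gives up only one bit pattern to NaN. The function name is illustrative.

```python
def max_finite(exp_bits: int, man_bits: int, ieee_like: bool) -> float:
    """Largest finite magnitude for a small float format with the usual bias."""
    bias = 2 ** (exp_bits - 1) - 1
    all_ones_exp = 2 ** exp_bits - 1
    if ieee_like:
        # Assumed for E5M2: the all-ones exponent is reserved for inf/NaN.
        max_exp_field = all_ones_exp - 1
        max_man_field = 2 ** man_bits - 1
    else:
        # Assumed for E4M3: only (all-ones exponent, all-ones mantissa) is NaN,
        # so the top exponent is usable with the next-largest mantissa.
        max_exp_field = all_ones_exp
        max_man_field = 2 ** man_bits - 2
    return 2.0 ** (max_exp_field - bias) * (1 + max_man_field / 2 ** man_bits)


print("E4M3 max:", max_finite(4, 3, ieee_like=False))
print("E5M2 max:", max_finite(5, 2, ieee_like=True))
```

Under those assumptions E4M3 tops out at 448, matching the figure above, and E5M2 reaches 57344: one fewer mantissa bit buys roughly 128x more range.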

At 8B parameters and up, FP8 typically costs under 1 percent on MMLU versus BF16 (vendor-reported: Llama 3.1 8B drops from 68.8 to 68.3, Llama 3.3 70B holds at 82.0). It halves memory and, on the right hardware, runs at full Tensor Core speed. This is not just an inference trick: DeepSeek V3 was trained in FP8 at 671B scale with a relative loss error under 0.25 percent versus BF16. The catch is hardware, which we will hit shortly: FP8 needs a GPU with FP8 Tensor Cores (NVIDIA Ada, Hopper, or newer) to actually go fast.

The FP4 twins: MXFP4 vs NVFP4

Plain 4-bit floats are nearly useless on their own. The E2M1 element can represent only eight magnitudes (0, 0.5, 1, 1.5, 2, 3, 4, and 6) plus their negatives. With a single scale for the whole tensor, one outlier in a weight matrix forces a coarse scale and flushes small values to zero. The fix is microscaling: split the tensor into small blocks and give each block its own shared scale factor, so each block adapts to its own local range. Both modern FP4 formats do this; they differ in how.

Property	MXFP4 (OCP standard)	NVFP4 (NVIDIA)
Element	E2M1 (4-bit)	E2M1 (4-bit), identical
Block size	32 elements	16 elements
Block scale	E8M0 (powers of 2 only)	FP8 E4M3 (fractional allowed)
Second-level scale	None	FP32 per tensor
Standard	Open (Microsoft, AMD, Arm, Intel, Meta, NVIDIA, Qualcomm)	Proprietary
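
To make the outlier problem concrete, here is a toy sketch of E2M1 quantization with one scale per block. It is a simulation under stated assumptions, not either format's real kernel: the scale is an unrestricted Python float (neither E8M0 nor E4M3), the block size is 4 rather than 16 or 32 so the effect is visible on eight values, and all function names are made up for illustration.

```python
def e2m1_values() -> list[float]:
    """Decode every non-negative E2M1 code (2 exponent bits, 1 mantissa bit, bias 1)."""
    values = []
    for exp in range(4):
        for man in range(2):
            if exp == 0:
                values.append(man * 0.5)  # subnormal codes: 0 and 0.5
            else:
                values.append(2.0 ** (exp - 1) * (1 + man / 2))
    return values


GRID = e2m1_values()


def quantize(x: float, scale: float) -> float:
    """Round |x| / scale to the nearest E2M1 magnitude, then scale back."""
    mag = min(GRID, key=lambda g: abs(abs(x) / scale - g))
    return (mag if x >= 0 else -mag) * scale


def fake_quant(values: list[float], block_size: int) -> list[float]:
    """Quantize with one shared scale per block of `block_size` values."""
    out = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        # Map the block's largest magnitude onto 6, the largest E2M1 value.
        scale = max(abs(v) for v in block) / 6.0 or 1.0
        out.extend(quantize(v, scale) for v in block)
    return out


weights = [0.02, -0.03, 0.01, 0.03, 0.02, -0.01, 0.03, 12.0]  # one outlier at the end
print("grid:        ", GRID)
print("single scale:", fake_quant(weights, block_size=8))
print("blocks of 4: ", fake_quant(weights, block_size=4))
```

With a single scale, every small weight rounds to zero and only the outlier survives. With blocks of four, the damage is confined to the outlier's own block and the first block keeps its values.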

The crux is the scale format. MXFP4's E8M0 scale can only be a power of two, so neighboring representable scale values are 2x apart; if a block's ideal scale (its maximum divided by 6, the largest E2M1 value) falls between two powers of two, you either round down and clip the largest values or round up and waste part of the 4-bit range. NVFP4's smaller 16-element blocks and fractional E4M3 scales (plus a tensor-wide FP32 scale) track the real distribution far more tightly. It shows in the numbers: on Llama 3.1 8B, MXFP4 drops MMLU-Pro from 44.2 to 32.5, while NVFP4 only falls to 38.8 (community-reported). On large models the gap shrinks; NVFP4 on DeepSeek-R1 lands within about 1 percent of FP8 (vendor-reported).
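
The power-of-two restriction can be illustrated with a deliberately contrived block. This sketch compares a fractional scale against the two neighboring powers of two; it does not claim to reproduce the rounding rule either specification uses for choosing the scale, it ignores the 16 versus 32 block-size difference and the FP32 tensor scale, and the helper names are hypothetical. The block is chosen so its ideal scale is 1.5, a value E4M3 can store exactly.

```python
import math

GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes


def quantize_block(block: list[float], scale: float) -> list[float]:
    """Round each value to the nearest E2M1 magnitude under one shared scale."""
    out = []
    for x in block:
        # Anything above 6 * scale saturates at the top of the grid.
        mag = min(GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out


def mse(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def compare_scales(block: list[float]) -> dict[str, float]:
    ideal = max(abs(x) for x in block) / 6.0           # fractional: block max lands on 6
    pow2_down = 2.0 ** math.floor(math.log2(ideal))    # too small: largest values clip
    pow2_up = 2.0 ** math.ceil(math.log2(ideal))       # too large: top of the grid unused
    scales = {"fractional": ideal, "pow2_down": pow2_down, "pow2_up": pow2_up}
    return {name: mse(block, quantize_block(block, s)) for name, s in scales.items()}


block = [0.75, -1.5, 2.25, 3.0, -4.5, 6.0, 9.0, 0.0]
errors = compare_scales(block)
for name, err in errors.items():
    print(f"{name:>10}: MSE = {err:.6f}")
```

On this block the fractional scale reproduces every value, rounding the scale up to 2 costs a modest error, and rounding it down to 1 clips the 9.0 and costs the most. Real weight blocks are not this tidy, so treat the gap as directional rather than representative.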

MXFP4's win is openness and adoption: OpenAI's gpt-oss ships natively in MXFP4 (MoE weights only, with attention and routing kept at higher precision), which is how the 120B variant fits on a single 80 GB GPU. NVFP4's win is accuracy per bit, at the cost of being NVIDIA-only.

The formats you are probably already using

Most people running models locally have never touched FP4. They use GGUF K-quants through llama.cpp and Ollama, or AWQ/GPTQ through vLLM. These are INT-based and predate the new float formats, and they are still excellent.

GGUF format	Bits/weight	Size (8B model, GiB)	Quality
Q4_K_M	4.89	4.58 GB	The popular sweet spot
Q5_K_M	5.70	5.33 GB	Near-lossless on most models
Q6_K	6.56	6.14 GB	Almost lossless
Q8_0	8.50	7.95 GB	Effectively lossless

Bits-per-weight figures are from the official llama.cpp quantize README. AWQ and GPTQ produce INT4 (W4A16) Hugging Face checkpoints for vLLM; GGUF is the one that also runs on CPU and, through llama.cpp's Metal backend, on Apple Silicon (MLX is a separate Apple framework with its own 4-bit format, not a GGUF runtime). INT4 (AWQ) is competitive with FP4 on accuracy at 4-bit; the difference is that FP4 can also quantize activations and, on Blackwell, run them through dedicated hardware. For now, if you are on a Mac or an older GPU, GGUF Q4_K_M remains the default that just works.

"Runs" versus "runs fast": the hardware catch

This is the part the format names hide. A quantized model can load (memory savings) on almost anything, but it only runs fast where the GPU has dedicated hardware for that format. Mismatch the two and you get the file-size benefit with no speedup, because the runtime quietly upcasts to a precision the silicon understands.

Format	Blackwell (B200, RTX 50)	Hopper (H100/H200)	Ada (RTX 4090)	Ampere (A100, RTX 30)
FP16 / INT8 / INT4	Fast	Fast	Fast	Fast
FP8	Fast	Fast	Fast	No (INT8 max)
MXFP4	Fast	Emulated	Emulated	No
NVFP4	Fast	Emulated	Emulated	No

The practical rules: FP8 needs FP8 Tensor Cores, which arrived with Ada (RTX 40 series, L40S) and Hopper and carry over to Blackwell, though software support on Ada varies by runtime, so check that your stack ships FP8 kernels for it. The FP4 formats only accelerate on Blackwell; on a Hopper H100 an NVFP4 model fits in less memory but does not run faster. Ampere and older are stuck at INT8/INT4. Apple Silicon has none of these float formats in hardware and uses MLX or GGUF 4-bit over unified memory instead. Runtime examples once you have matched format to card:

# vLLM, FP8 on a Hopper or Blackwell card
vllm serve meta-llama/Llama-3.1-70B-Instruct --quantization fp8

# llama.cpp, a GGUF Q4_K_M model on anything (CPU, Apple Silicon, any GPU)
llama-cli -m Llama-3.1-8B-Q4_K_M.gguf -c 8192 -p "Explain quantization."

Accuracy versus compression

The rough ladder, for large models: 8-bit is essentially free (under 1 percent loss), 4-bit costs a small but real amount (roughly 1 to 3 percent), and sub-4-bit starts to hurt. Two patterns matter. First, bigger models tolerate aggressive quantization better; the same MXFP4 that mangles an 8B model barely dents a 70B-plus one (DeepSeek-R1 in MXFP4 retains above 99.5 percent of its baseline score on several benchmarks, vendor-reported). Second, quantizing activations as well as weights (W4A4) is far harsher than weights-only, which is why NVFP4's accuracy-preserving scaling matters most there. If you are running a small model, stay at Q5 or higher; if you are running a 70B, 4-bit is genuinely fine.

So which should you pick?

Your hardware	Best format	Why
Blackwell (RTX 50, B200)	NVFP4	Only arch with fast FP4; best accuracy per bit
Hopper (H100/H200)	FP8	Native, near-lossless, 2x memory
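Ada (RTX 40, L40S)	FP8, or INT4 (AWQ/GPTQ) / GGUF when memory is tight	Native FP8 but no FP4 hardware
Ampere (A100, RTX 30)	INT4 (AWQ/GPTQ) or GGUF	No FP8/FP4 hardware; INT is the fast path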
Apple Silicon	MLX 4-bit or GGUF Q4_K_M	Unified memory, no FP4/FP8 hardware
CPU / old GPU	GGUF Q4_K_M	Runs everywhere, sane quality

Weight quantization is only half the memory story. On long contexts the KV cache dominates, and that has its own quantization track; we covered one approach in our Google TurboQuant writeup. And if your target is a tiny device rather than a datacenter card, the calculus changes again, which we walked through in edge AI on a sensor node. Match the format to the silicon, do the memory math first, and you will know whether a model fits before you ever hit download.
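
For the KV cache half of that memory math, a rough sketch follows. The formula is the standard one for a transformer that caches keys and values per layer; the function name is illustrative, and the Llama 3.1 8B shape used in the example (32 layers, 8 KV heads, head dimension 128) is taken from the public model config rather than from anything measured here.

```python
def kv_cache_gib(num_layers: int,
                 num_kv_heads: int,
                 head_dim: int,
                 context_len: int,
                 batch_size: int = 1,
                 bytes_per_value: float = 2.0) -> float:
    """Approximate KV cache size in GiB for a decoder-only transformer."""
    # 2x for keys and values, stored per layer, per KV head, per token.
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * context_len * batch_size * bytes_per_value)
    return total_bytes / 2 ** 30


# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads (GQA), head_dim 128.
for ctx in (8_192, 131_072):
    fp16 = kv_cache_gib(32, 8, 128, ctx)
    int8 = kv_cache_gib(32, 8, 128, ctx, bytes_per_value=1.0)
    print(f"context {ctx:>7}: FP16 cache {fp16:.1f} GiB, 8-bit cache {int8:.1f} GiB")
```

At an 8K context the FP16 cache is about 1 GiB per sequence, small next to the weights. At 128K it reaches about 16 GiB, several times the size of the same model's 4-bit weights, and it scales linearly with batch size on top of that.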

Sources and further reading

Date checked: 2026-06-09. Format specs are from vendor and OCP documentation; accuracy and memory figures are vendor/spec- or community-reported and labeled inline.
