Newsletter image

Subscribe to the Newsletter

Join 10k+ people to get notified about new posts, news and tips.

Do not worry we don't spam!

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Search

GDPR Compliance

We use cookies to ensure you get the best experience on our website. By continuing to use our site, you accept our use of cookies, Privacy Policy, and Terms of Service.

SingularityByte - Hardware, Inference, Model Fine-Tuning

Quantization Formats Explained: NVFP4, MXFP4, and FP8

The quantization format you pick decides whether a model fits your GPU and how fast it runs. Here is the NVFP4, MXFP4, and FP8 landscape, demystified.

TL;DR
  • FP8 is the safe 2x cut: near-lossless at scale, but needs an NVIDIA Hopper GPU or newer to run fast
  • FP4 (MXFP4, NVFP4) is a 4x cut; NVFP4's two-level scaling keeps more accuracy, but FP4 speedup needs Blackwell
  • The format you pick decides whether a model fits your GPU and how fast it runs

OpenAI shipped gpt-oss in MXFP4. DeepSeek trained V3 in FP8. NVIDIA's Blackwell cards added native NVFP4, and vLLM, SGLang, and llama.cpp have spent early 2026 racing to support all of it. Underneath the acronym soup is one practical question: the quantization format you pick decides whether a model fits on your GPU at all, and how fast it runs once it does. This is the evergreen map of that landscape, with the bit layouts, the memory math, and an honest "what runs fast on what hardware" verdict. Numbers here are vendor/spec- or community-reported and labeled inline.

The one formula that matters

Quantization stores each weight in fewer bits. Everything else follows from that. To a first approximation:

VRAM (GB) = params_in_billions * bits_per_weight / 8 * 1.2  # +20% overhead
# plus the KV cache, which grows with context length and batch size

So a format is not an abstract quality knob. It is the difference between a model loading and an out-of-memory error. Here is what the common formats do to three model sizes (weights only, before the KV cache).

ModelFP16 (16-bit)FP8 (8-bit)4-bit (INT4 / FP4)Smallest GPU at 4-bit
8B (Llama 3.1)~17-19 GB~9-10 GB~5 GB8 GB consumer card
70B (Llama 3.3)~168 GB~84 GB~46 GB1x 48 GB (L40S) or 2x RTX 4090
120B (gpt-oss MoE)~288 GB~140 GB~81 GB1x H100 80 GB

Figures are community-reported, rounded, and include roughly 20 percent overhead. The headline is stark: a 120B model that needs four H100s in FP16 squeezes onto a single 80 GB card at 4-bit. That is the entire reason this topic exists.

The precision ladder

Modern formats step down a ladder: FP16/BF16 (the baseline), FP8 (half the bytes), then the 4-bit floats MXFP4 and NVFP4 (a quarter). A floating-point number splits its bits into a sign, an exponent (dynamic range), and a mantissa (precision). Fewer bits means less of both, so the trick is choosing where to spend them and how to rescale blocks of values to claw accuracy back.

FormatSign/Exp/MantissaBlock sizeScale formatEffective bits/weight
FP8 E4M31 / 4 / 3per-tensor or per-channelFP32/FP168
FP8 E5M21 / 5 / 2per-tensor or per-channelFP32/FP168
MXFP41 / 2 / 1 (E2M1)32E8M0 (power-of-2)~4.25
NVFP41 / 2 / 1 (E2M1)16E4M3 + FP32 tensor scale~4.5

All four are spec-defined. Note that MXFP4 and NVFP4 use the identical 4-bit element (E2M1); the entire difference between them is how they scale blocks, which we get to below.

FP8: the safe 2x

FP8 is the least scary cut, and the one most production stacks reach for first. It comes in two flavors: E4M3 (4 exponent, 3 mantissa bits, max value 448) is used for weights and activations during inference; E5M2 (5 exponent, 2 mantissa, much wider range) is used for gradients during training. More mantissa for forward-pass precision, more exponent for gradient range.

At 8B parameters and up, FP8 typically costs under 1 percent on MMLU versus BF16 (vendor-reported: Llama 3.1 8B drops 68.8 to 68.3, Llama 3.3 70B holds at 82.0). It halves memory and, on the right hardware, runs at full Tensor Core speed. This is not just an inference trick: DeepSeek V3 was trained in FP8 at 671B scale with under 0.25 percent relative error versus BF16. The catch is hardware, which we will hit shortly: FP8 needs an NVIDIA Hopper card or newer to actually go fast.

The FP4 twins: MXFP4 vs NVFP4

Plain 4-bit floats are nearly useless on their own. The E2M1 element can only represent values like 0, 0.5, 1, 1.5, 2, 3, 4, 6. One outlier in a weight matrix forces a coarse scale and flushes small values to zero. The fix is microscaling: split the tensor into small blocks and give each block its own shared scale factor, so each block adapts to its own local range. Both modern FP4 formats do this; they differ in how.

PropertyMXFP4 (OCP standard)NVFP4 (NVIDIA)
ElementE2M1 (4-bit)E2M1 (4-bit), identical
Block size32 elements16 elements
Block scaleE8M0 (powers of 2 only)FP8 E4M3 (fractional allowed)
Second-level scaleNoneFP32 per tensor
StandardOpen (Microsoft, AMD, Arm, Intel, Meta, NVIDIA, Qualcomm)Proprietary

The crux is the scale format. MXFP4's E8M0 scale can only be a power of two, so adjacent block scales are 2x apart; if a block's true maximum sits between two powers of two, you either overflow or waste precision. NVFP4's smaller 16-element blocks and fractional E4M3 scales (plus a tensor-wide FP32 scale) track the real distribution far more tightly. It shows in the numbers: on Llama 3.1 8B, MXFP4 drops MMLU-Pro from 44.2 to 32.5, while NVFP4 only falls to 38.8 (community-reported). On large models the gap shrinks; NVFP4 on DeepSeek-R1 lands within about 1 percent of FP8 (vendor-reported).

MXFP4's win is openness and adoption: OpenAI's gpt-oss ships natively in MXFP4 (MoE weights only, attention and routing kept higher), which is how the 120B variant fits on a single 80 GB GPU. NVFP4's win is accuracy per bit, at the cost of being NVIDIA-only.

The formats you are probably already using

Most people running models locally have never touched FP4. They use GGUF K-quants through llama.cpp and Ollama, or AWQ/GPTQ through vLLM. These are INT-based and predate the new float formats, and they are still excellent.

GGUF formatBits/weightSize (8B model)Quality
Q4_K_M4.894.58 GBThe popular sweet spot
Q5_K_M5.705.33 GBNear-lossless on most models
Q6_K6.566.14 GBAlmost lossless
Q8_08.507.95 GBEffectively lossless

Bits-per-weight figures are from the official llama.cpp quantize README. AWQ and GPTQ produce INT4 (W4A16) Hugging Face checkpoints for vLLM; GGUF is the one that also runs on CPU and, via MLX or Metal, on Apple Silicon. INT4 (AWQ) is competitive with FP4 on accuracy at 4-bit; the difference is that FP4 can also quantize activations and, on Blackwell, run them through dedicated hardware. For now, if you are on a Mac or an older GPU, GGUF Q4_K_M remains the default that just works.

"Runs" versus "runs fast": the hardware catch

This is the part the format names hide. A quantized model can load (memory savings) on almost anything, but it only runs fast where the GPU has dedicated hardware for that format. Mismatch the two and you get the file-size benefit with no speedup, because the runtime quietly upcasts to a precision the silicon understands.

FormatBlackwell (B200, RTX 50)Hopper (H100/H200)Ada (RTX 4090)Ampere (A100, RTX 30)
FP16 / INT8 / INT4FastFastFastFast
FP8FastFastLoads, falls back to BF16No (INT8 max)
MXFP4FastEmulatedEmulatedNo
NVFP4FastEmulatedEmulatedNo

The practical rules: FP8 needs Hopper or newer (the RTX 4090's Ada chip has the data type but no scaling hardware, so it silently runs BF16). The FP4 formats only accelerate on Blackwell; on a Hopper H100 an NVFP4 model fits in less memory but does not run faster. Ampere and older are stuck at INT8/INT4. Apple Silicon has none of these float formats in hardware and uses MLX or GGUF 4-bit over unified memory instead. Runtime examples once you have matched format to card:

# vLLM, FP8 on a Hopper or Blackwell card
vllm serve meta-llama/Llama-3.1-70B-Instruct --quantization fp8

# llama.cpp, a GGUF Q4_K_M model on anything (CPU, Apple Silicon, any GPU)
llama-cli -m Llama-3.1-8B-Q4_K_M.gguf -c 8192 -p "Explain quantization."

Accuracy versus compression

The rough ladder, for large models: 8-bit is essentially free (under 1 percent loss), 4-bit costs a small but real amount (roughly 1 to 3 percent), and sub-4-bit starts to hurt. Two patterns matter. First, bigger models tolerate aggressive quantization better; the same MXFP4 that mangles an 8B model barely dents a 70B-plus one (DeepSeek-R1 in MXFP4 holds above 99.5 percent on several benchmarks, vendor-reported). Second, quantizing activations as well as weights (W4A4) is far harsher than weights-only, which is why NVFP4's accuracy-preserving scaling matters most there. If you are running a small model, stay at Q5 or higher; if you are running a 70B, 4-bit is genuinely fine.

So which should you pick?

Your hardwareBest formatWhy
Blackwell (RTX 50, B200)NVFP4Only arch with fast FP4; best accuracy per bit
Hopper (H100/H200)FP8Native, near-lossless, 2x memory
Ada / Ampere GPUINT4 (AWQ/GPTQ) or GGUFNo FP8/FP4 hardware; INT is the fast path
Apple SiliconMLX 4-bit or GGUF Q4_K_MUnified memory, no FP4/FP8 hardware
CPU / old GPUGGUF Q4_K_MRuns everywhere, sane quality

Weight quantization is only half the memory story. On long contexts the KV cache dominates, and that has its own quantization track; we covered one approach in our Google TurboQuant writeup. And if your target is a tiny device rather than a datacenter card, the calculus changes again, which we walked through in edge AI on a sensor node. Match the format to the silicon, do the memory math first, and you will know whether a model fits before you ever hit download.

Sources and further reading

Date checked: 2026-06-09. Format specs are from vendor and OCP documentation; accuracy and memory figures are vendor/spec- or community-reported and labeled inline.

Prev Article
SLMs on Microcontrollers: Does Edge AI Fit on a Sensor Node?
Next Article
How to create Logos with Midjourney

Related to this topic: