TL;DR
- 8.4B-total / 760M-active Mixture-of-Experts reasoning model from Zyphra, Apache 2.0.
- First large MoE pretrained, midtrained, and fine-tuned end-to-end on AMD MI300X, no Nvidia in the loop.
- Strong single-pass math and code, weak agentic tool use (BFCL-v4 40.5); all benchmarks Zyphra-reported.
System Requirements
| RAM | 32GB |
| GPU | 48GB-class card for the reference weights |
| VRAM | >24GB (4090 OOMs) |
On May 6, 2026, Zyphra released ZAYA1-8B: an 8.4-billion-parameter Mixture-of-Experts reasoning model under Apache 2.0. The number that matters is not 8.4 billion. It is zero, as in zero Nvidia GPUs anywhere in the training loop. Zyphra pretrained, midtrained, and fine-tuned this model end-to-end on AMD Instinct MI300X hardware, the first time anyone has done that for a large MoE in public. Only 760 million of those parameters fire per token, the benchmarks lean hard on math and code, and every score you are about to read was run by Zyphra itself. So here is what actually changed, how the AMD stack held up, and what you can run today.
An open-source reasoning MoE trained entirely on AMD MI300X
For about fifteen years, training a serious model has meant buying Nvidia and writing CUDA. AMD's Instinct cards get pitched as inference hardware, and most "AMD AI" stories turn out to be a model that was trained on Nvidia and then ported to run on AMD. ZAYA1-8B is the opposite. Pretraining, the long-context midtraining phase, and supervised fine-tuning all ran on AMD silicon with the ROCm software stack, not a CUDA translation layer.
One definition, because the whole story rides on it. A Mixture-of-Experts (MoE) model splits its feed-forward layers into many small "expert" subnetworks and routes each token to just a few of them. ZAYA1-8B holds 8.4 billion total parameters across 16 experts per layer, but a top-1 router sends each token to a single expert, so only about 760 million parameters do work on any given token. You get some of the capacity of a bigger model at the compute cost of a small one, the same bargain that powers larger open MoEs like DeepSeek-V4. For a builder, that means cheaper inference. For AMD, it is a public test of whether the open ROCm stack can train something real, not just serve it.
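To make the routing concrete, here is a minimal sketch of top-1 expert selection in plain Python. The expert count (16) and the parameter totals come from the text above; the toy experts, the linear scoring function, and names like `route_top1` are illustrative assumptions, not Zyphra's implementation.

```python
import random

NUM_EXPERTS = 16  # per-layer expert count quoted in the article


def route_top1(router_scores):
    """Return the index of the single highest-scoring expert (top-1 routing)."""
    return max(range(len(router_scores)), key=lambda i: router_scores[i])


def moe_layer(token, experts, score_fn):
    """Run one token through exactly one expert, chosen by the router."""
    scores = score_fn(token)               # one score per expert
    chosen = route_top1(scores)
    return chosen, experts[chosen](token)  # the other experts do no work


def make_expert(scale):
    """Hypothetical stand-in for an expert network: it just scales its input."""
    return lambda x: [v * scale for v in x]


experts = [make_expert(k + 1) for k in range(NUM_EXPERTS)]

rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(NUM_EXPERTS)]


def linear_scores(token):
    """Toy router: dot product of the token with one weight row per expert."""
    return [sum(w * t for w, t in zip(row, token)) for row in weights]


chosen, output = moe_layer([0.5, -1.0, 2.0, 0.1], experts, linear_scores)
active_fraction = 760e6 / 8.4e9  # active / total parameters from the article
print(f"expert {chosen} handled the token; about {active_fraction:.0%} of weights active")
```

The last line is the whole bargain in one number: roughly 9 percent of the weights do work on any given token.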
Inside the architecture: compressed attention and a smarter router
Zyphra's pitch is "intelligence per parameter," and three architecture choices do the heavy lifting. None of them is a gimmick.
First, Compressed Convolutional Attention (CCA). Standard attention stores a key and value vector for every token in context, and that KV cache is what eats your VRAM on long inputs. CCA runs small 1D convolutions to compress the key and value projections before attention, which Zyphra reports cuts KV-cache memory by roughly 8x. That is the trick that lets a small model carry a long context without melting a GPU. It is a cousin of the compressed-KV idea DeepSeek uses, built for the same goal.
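A quick back-of-the-envelope shows why an 8x cut matters. The sketch below uses the standard KV-cache size formula with a made-up model shape; the layer count, head count, and head dimension are hypothetical placeholders, not ZAYA1-8B's published configuration, and the 8x factor is the Zyphra-reported figure.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Standard KV cache size: one key and one value vector per token, per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, NOT ZAYA1-8B's published configuration.
LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128
SEQ_LEN = 131_072     # the article's "131K", read here as 2**17 tokens
CCA_COMPRESSION = 8   # Zyphra-reported reduction factor

baseline = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)  # 2-byte (bf16) values
compressed = baseline / CCA_COMPRESSION

GIB = 2**30
print(f"uncompressed KV cache: {baseline / GIB:.1f} GiB")
print(f"with ~8x compression:  {compressed / GIB:.1f} GiB")
```

With these placeholder dimensions the cache drops from 20 GiB to 2.5 GiB at full context. Plug in the real shape from the model card before drawing conclusions about your own card.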
Second, the router is a small multi-layer perceptron (MLP) instead of the usual single linear layer. A smarter router assigns tokens to experts more cleanly, which matters a lot when each token only gets one expert.
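The difference between the two router styles fits in a few lines. This is a toy sketch only: the hidden width, the ReLU nonlinearity, and the random weights are assumptions for illustration, since the article does not give the router's exact shape.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def linear_router(token, w):
    """Conventional router: a single linear projection to one score per expert."""
    return matvec(w, token)


def mlp_router(token, w1, w2):
    """MLP router: linear layer, ReLU nonlinearity, then a second linear layer."""
    hidden = [max(0.0, h) for h in matvec(w1, token)]
    return matvec(w2, hidden)


def rand_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
D_MODEL, D_HIDDEN, NUM_EXPERTS = 8, 16, 16  # toy sizes; only the 16 experts is from the article
token = [rng.uniform(-1, 1) for _ in range(D_MODEL)]

lin_scores = linear_router(token, rand_matrix(NUM_EXPERTS, D_MODEL, rng))
mlp_scores = mlp_router(
    token,
    rand_matrix(D_HIDDEN, D_MODEL, rng),
    rand_matrix(NUM_EXPERTS, D_HIDDEN, rng),
)

# Either way, the token goes to the single best-scoring expert (top-1).
print("linear router picks expert", lin_scores.index(max(lin_scores)))
print("MLP router picks expert   ", mlp_scores.index(max(mlp_scores)))
```

The extra layer lets the router draw nonlinear boundaries between experts, at the cost of a few more parameters per layer.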
Third, learned residual scaling: one trainable scalar per block that tunes how much each layer adds to the residual stream. It is a cheap stability trick for a deep network.
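As a rough picture of what that scalar does, here is a sketch that assumes it multiplies each block's update before the update is added to the residual stream. The article does not spell out the exact placement, so treat the formula, the toy sublayer, and the alpha values as assumptions.

```python
def residual_block(x, sublayer, alpha):
    """Add the sublayer's output to the residual stream, scaled by one scalar."""
    update = sublayer(x)
    return [xi + alpha * ui for xi, ui in zip(x, update)]


def toy_sublayer(x):
    """Stand-in for attention or an expert MLP: doubles its input."""
    return [2.0 * v for v in x]


stream = [1.0, -0.5, 0.25]
# One scalar per block. In training these would be learned; here they are fixed.
alphas = [1.0, 0.5, 0.1]
for alpha in alphas:
    stream = residual_block(stream, toy_sublayer, alpha)

print(stream)
```

A small alpha lets a block contribute gently, which is one way a deep stack can avoid blowing up its residual stream.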
Context grows in stages: 4K tokens during base pretraining, 32K during midtraining, and 131K after fine-tuning. Treat the 131K as a ceiling, not a promise; community testers peg the comfortable working limit nearer 64K for agent loops. There is also a vision-language sibling, ZAYA1-VL-8B, if you need image input.
The real story: a frontier MoE on 1,024 AMD MI300X
Here is the part that made people stop scrolling. Zyphra trained ZAYA1 on 1,024 AMD Instinct MI300X GPUs, arranged as 128 nodes of 8, wired with AMD Infinity Fabric inside each node and AMD Pensando Pollara 400 network cards (400 Gbps) plus Pensando Ortano data-processing units (DPUs) between nodes, in a rails-only topology. IBM Cloud built and ran the cluster. Zyphra says it sustained more than 750 PFLOPs during training and pushed roughly 14.86 trillion tokens through the model across all phases.
One hardware fact does a lot of work. Each MI300X carries 192GB of HBM (high-bandwidth memory), versus 80GB on an Nvidia H100 or 141GB on an H200. More memory per GPU means fewer ways you are forced to split the model, which simplifies the parallelism and, per Zyphra, made checkpoint saves about 10x faster.
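The memory argument is easy to check with arithmetic. The sketch below assumes the common rule of thumb of about 16 bytes per parameter for mixed-precision Adam training (bf16 weights and gradients, plus fp32 master weights and two optimizer moments). Zyphra's actual optimizer setup may differ, and activations are ignored entirely.

```python
import math

PARAMS = 8.4e9                 # total parameters, from the article
BYTES_PER_PARAM_TRAINING = 16  # assumed rule of thumb, not a Zyphra-published figure

HBM_GB = {"MI300X": 192, "H200": 141, "H100": 80}  # per-GPU capacities quoted above

state_gb = PARAMS * BYTES_PER_PARAM_TRAINING / 1e9


def min_gpus_for_state(state_gb, hbm_gb):
    """Smallest GPU count whose combined HBM holds the training state (no activations)."""
    return math.ceil(state_gb / hbm_gb)


print(f"training state: about {state_gb:.0f} GB")
for name, capacity in HBM_GB.items():
    print(f"{name}: at least {min_gpus_for_state(state_gb, capacity)} GPU(s) to hold it")
```

Under that assumption the full training state fits inside a single MI300X, while an 80GB H100 has to shard it before a single activation is stored.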
The software is the real claim. This was not a CUDA program recompiled for AMD. Zyphra wrote its training kernels for ROCm, AMD's open compute stack, and ran a fork of Megatron tuned for the platform. One neat detail: a "router replay" scheme where the trainer reuses the expert-routing decisions that vLLM made during rollout, so training and inference never disagree about which expert saw which token.
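The router replay idea can be sketched with two toy routers that disagree by a rounding-sized amount. Everything here is hypothetical: the two-expert setup, the drift value, and the function names are illustrations of the concept, not Zyphra's or vLLM's code.

```python
def top1(scores):
    """Index of the highest-scoring expert."""
    return max(range(len(scores)), key=lambda i: scores[i])


def inference_router(token):
    """Toy two-expert router as the inference engine computes it."""
    return [token, 1.0 - token]


def trainer_router(token):
    """Same router in the trainer, with a tiny numerical drift on expert 0."""
    return [token + 1e-3, 1.0 - token]


def rollout(tokens):
    """Inference pass: route each token and record which expert was chosen."""
    return [top1(inference_router(t)) for t in tokens]


def training_routes(tokens, replay=None):
    """Training pass: reuse recorded routing if given, otherwise re-route."""
    if replay is not None:
        return list(replay)
    return [top1(trainer_router(t)) for t in tokens]


tokens = [0.2, 0.4999, 0.9]
recorded = rollout(tokens)

print("rollout routing:     ", recorded)
print("trainer, re-routed:  ", training_routes(tokens))
print("trainer, with replay:", training_routes(tokens, replay=recorded))
```

The middle token sits right on the decision boundary, so the re-routed trainer sends it to a different expert than the rollout did. Replaying the recorded choices removes that mismatch.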
Be clear about what Zyphra did not disclose: the exact ROCm version and the wall-clock training time. "We are thrilled to be the first company to demonstrate large-scale training on an AMD platform," said CEO Krithik Puthalath. The honest read is narrower than the press release: this is the first solid public proof that a thousand-GPU AMD cluster can train a competitive MoE end-to-end, which matters to anyone watching GPU supply and prices.
Four rounds of RL where most labs run one
Plenty of labs bolt one reinforcement-learning (RL) pass onto a base model and call it a reasoning model. Zyphra ran four, in sequence, and this cascade is the part of the release most worth studying.
Stage one is a reasoning warmup on verifiable puzzles, competition math, and logic problems, just to teach the model to think in long chains. Stage two is the interesting one: RLVE-Gym, a set of about 400 "verifiable environments," meaning tasks where a small program can check whether an answer is correct, so the reward is never a guess. A scheduler using Thompson sampling keeps nudging puzzle difficulty toward a 0.5 solve rate, the point where the model learns the most per step. Stage three pours on math and code RL with three home-grown task families: predicting a program's input and output, reconstructing code from its behavior, and writing adversarial tests to break a candidate solution. Stage four is behavioral RL for chat tone and instruction-following.
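The stage-two scheduler is the easiest piece to prototype. Here is a minimal sketch of Thompson sampling over difficulty levels, assuming a Beta posterior on each level's solve rate and a rule that picks the level whose sampled rate lands closest to 0.5. The class name, the five levels, and their "true" solve rates are invented for the demo; Zyphra's actual scheduler is not described at this level of detail.

```python
import random


class DifficultyScheduler:
    """Thompson-sampling sketch: choose the level whose sampled solve rate is nearest a target."""

    def __init__(self, num_levels, target=0.5, seed=0):
        self.target = target
        self.rng = random.Random(seed)
        # Beta(1, 1) prior per level, stored as (successes + 1, failures + 1).
        self.alpha = [1.0] * num_levels
        self.beta = [1.0] * num_levels

    def pick(self):
        """Sample a plausible solve rate per level, then take the one closest to target."""
        samples = [self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return min(range(len(samples)), key=lambda i: abs(samples[i] - self.target))

    def update(self, level, solved):
        """Fold one observed outcome into that level's posterior."""
        if solved:
            self.alpha[level] += 1
        else:
            self.beta[level] += 1


# Hypothetical true solve rates for five difficulty levels, easiest first.
TRUE_SOLVE_RATE = [0.95, 0.80, 0.50, 0.20, 0.05]

sched = DifficultyScheduler(num_levels=len(TRUE_SOLVE_RATE))
env_rng = random.Random(1)
picks = [0] * len(TRUE_SOLVE_RATE)
for _ in range(2000):
    level = sched.pick()
    solved = env_rng.random() < TRUE_SOLVE_RATE[level]  # simulated model attempt
    sched.update(level, solved)
    picks[level] += 1

print("times each level was chosen:", picks)
```

Early on, every level gets tried because the posteriors are wide. As evidence piles up, the scheduler spends most of its budget on the level the model solves about half the time.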
There is also a test-time trick called Markovian RSA (Recursive Self-Aggregation). It runs several reasoning attempts in parallel and folds them together while carrying only a short summary between rounds, so the model can reason for a long time without the context window exploding. Remember that name, because it is doing the heavy lifting behind the headline scores. If you have studied OpenThinker-32B or other open reasoning models, the curriculum here is more disciplined than the single-round norm.
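The shape of that loop, as described above, looks roughly like this. It is a sketch under heavy assumptions: `toy_attempt` and `toy_summarize` are stand-ins for real model calls, the majority-vote summary is a placeholder for whatever aggregation Zyphra uses, and the width and round counts are arbitrary.

```python
import random
import statistics


def markovian_aggregate(problem, attempt_fn, summarize_fn, width=4, rounds=3):
    """Run `width` attempts per round; carry only a short summary into the next round."""
    summary = ""  # the only state that crosses a round boundary
    for _ in range(rounds):
        attempts = [attempt_fn(problem, summary) for _ in range(width)]
        summary = summarize_fn(attempts)  # full traces are discarded here
    return summary


rng = random.Random(0)
context_sizes = []  # records how much text each simulated model call was given


def toy_attempt(problem, summary):
    """Stand-in for a model call: a long trace that ends in a candidate answer."""
    context_sizes.append(len(problem) + len(summary))
    answer = rng.choice(["42", "42", "41"])  # a noisy solver
    return "thinking... " * 200 + f"answer={answer}"


def toy_summarize(attempts):
    """Keep only the majority answer, not the traces that produced it."""
    answers = [a.rsplit("answer=", 1)[1] for a in attempts]
    return "majority=" + statistics.mode(answers)


PROBLEM = "What is 6 * 7?"
result = markovian_aggregate(PROBLEM, toy_attempt, toy_summarize)

print("final summary:", result)
print("largest context passed to any attempt:", max(context_sizes), "characters")
```

Each trace runs to thousands of characters, yet no attempt ever sees more than the problem plus a one-line summary. That bounded carry-over is what keeps long multi-round reasoning from filling the context window, and the many parallel traces are why it is expensive.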
Benchmarks, honestly
Now the numbers, with the asterisk attached up front: every ZAYA1 score below was produced by Zyphra, and none has been independently reproduced yet.
Read the single-pass column first. On a normal one-shot run, ZAYA1-8B reports AIME 2026 at 89.1, HMMT February 2026 at 71.6, LiveCodeBench v6 at 64.8, GPQA-Diamond at 71.0, and MMLU-Pro at 74.2. The math and code results are genuinely strong for a 760M-active model. The weak spots are just as real: MMLU-Pro at 74.2 trails dense models like Mistral Small 4 (81.6), because an 8.4B model simply stores fewer facts, and the agentic tool-use score (BFCL-v4) sits at 40.5. The "beats Claude and DeepSeek" claims live in the test-time-compute column, which uses that expensive Markovian RSA method, not a single pass. The table keeps the two apart on purpose.
| Model | Active params | License | AIME | LiveCodeBench | GPQA-Diamond | MMLU-Pro |
| ZAYA1-8B (base, single pass) | 760M | Apache 2.0 | 89.1 (AIME 2026) | 64.8 (v6) | 71.0 | 74.2 |
| ZAYA1-8B + Markovian RSA [1] | 760M | Apache 2.0 | 91.9 (AIME 2025) | n/p | n/p | n/p |
| DeepSeek-R1-0528 | ~37B | MIT | 87.5 (AIME 2025) | n/p | n/p | n/p |
| Mistral Small 4 | ~30B | Apache 2.0 | n/p | 57.9 | 71.2 | 81.6 |
| OLMo 3 32B Think | 32B dense | Apache 2.0 | n/p | n/p | n/p | n/p |
[1] Markovian RSA is an expensive multi-trace test-time-compute method. These are not single-pass numbers and are not comparable to the base row. The AIME year varies by row. "n/p" means not published in the sources we have. All ZAYA1-8B figures are Zyphra-reported and not independently reproduced.
Four things to keep straight before you quote any of this:
- The frontier-beating scores need expensive multi-trace test-time compute. On a single pass, ZAYA1-8B is a strong small model, not a frontier model.
- Knowledge (MMLU-Pro 74.2) and agentic tool use (BFCL-v4 40.5) are the weak spots. Both numbers are real.
- GSM8K is not published.
- Every ZAYA1 figure is Zyphra-reported and, as of now, not independently reproduced.
Limitations and gotchas
The launch deck is polished. Here is what a builder actually hits.
- The frontier scores need that expensive multi-pass test-time compute. On a single pass you have a sharp small reasoner, not a Claude replacement.
- Agentic tool use is weak (BFCL-v4 40.5), so do not drop this into an autonomous tool-calling loop and expect it to hold together.
- The practical context ceiling is community-reported at around 64K for agent work, well short of the 131K spec.
- There is no Ollama or llama.cpp support at launch. You run it through Zyphra's vLLM and transformers forks. Community GGUF quantizations exist (see lainlives/ZAYA1-8B-GGUF), but testers report it runs out of memory on a 24GB RTX 4090, and aggressive low-bit quantization visibly degrades the math and code output that is the whole point.
- Zyphra never published a GSM8K score, and the "2x intelligence per active parameter" line is a Zyphra marketing claim, not a measured number.
Who should use it, and what is next
Use it if you want a small, openly licensed reasoning model for math or code, you can run a custom vLLM build, or you just want to study the architecture and the RL cascade for your own work. The Apache 2.0 license means you can ship it commercially without asking anyone.
Skip it, for now, if you need an agentic tool-caller, a one-line Ollama pull, or broad world knowledge. For agent-heavy work, something like Qwen3.5 is a better fit today.
What to watch: a vision-language variant (ZAYA1-VL-8B) already exists, and stock Ollama or llama.cpp support is the obvious next community milestone. Zyphra is a roughly 65-person San Francisco lab that raised a Series A in October 2025 at a $1 billion valuation, led by Jaan Tallinn, and it has shipped real open models before (BlackMamba, the Zamba2 hybrids, Zonos text-to-speech). This release lands in a year when open weights keep arriving faster than anyone can test them, the same current that carried Gemma 3 and the open reasoning wave. It is not a one-off.
Run it in about 10 minutes
There is no clean "ollama run" for this yet, so be realistic about what "try it" means.
If you have a big GPU (think 32GB or more, because a 24GB card runs out of memory), the fastest real path is the community GGUF:
```shell
# No official Ollama image at launch. The community GGUF is the
# lowest-friction local path. Quality is community-reported, so
# expect rough edges, and a 24GB 4090 will OOM at usable quants.
huggingface-cli download lainlives/ZAYA1-8B-GGUF
# Load the .gguf in any llama.cpp-based runner you already use.
# Start high-bit (Q6 or Q8). Sub-Q4 wrecks the math and code output.
```
For the reference weights, use Zyphra's transformers fork and a 48GB-class card:
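```python
# Install Zyphra's fork first (see the model card for the exact branch).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Zyphra/ZAYA1-8B"
tok = AutoTokenizer.from_pretrained(model_id)
# torch_dtype="auto" keeps the checkpoint's stored precision (older transformers default to fp32).
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Find every integer solution of x^2 + y^2 = 2025."
ids = tok(prompt, return_tensors="pt").to(model.device)
# Reasoning traces run long, so leave generous room for new tokens.
output = model.generate(**ids, max_new_tokens=2048)
print(tok.decode(output[0], skip_special_tokens=True))
```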
That is the under-10-minute move if you have the hardware: pull the GGUF and throw a real AIME problem at it before you trust anyone's benchmark. If you do not have the hardware, spend the ten minutes in the tech report's post-training section instead. The four-stage RL cascade is the part of this release worth borrowing for your own models.
Sources and further reading
Tested on: not independently tested. ZAYA1-8B needs Zyphra's custom vLLM or transformers forks, or a high-VRAM community GGUF that runs out of memory on a 24GB 4090, and the MI300X training claims are infrastructure we cannot reproduce. Every figure here is Zyphra-reported, with community observations (GGUF behavior, VRAM, usable context) flagged as such. Sources linked above.
Date checked: 2026-06-19