Newsletter image

Subscribe to the Newsletter

Join 10k+ people to get notified about new posts, news and tips.

Do not worry we don't spam!

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Search

GDPR Compliance

We use cookies to ensure you get the best experience on our website. By continuing to use our site, you accept our use of cookies, Privacy Policy, and Terms of Service.

Z.ai - AI Coding, AI Agent, Reasoning

GLM-5.2

Z.ai's GLM-5.2 is the first MIT-licensed model you can self-host that beats GPT-5.5 on long-horizon agentic coding, with a real 1M-token context, at about one-sixth the API cost. What changed, the numbers with their asterisks, and how to run it.

License MIT
License MIT
TL;DR
  • Z.ai's GLM-5.2: a roughly 753B-parameter MoE (about 40B active) with a 1M-token context window, MIT-licensed.
  • Tops the Artificial Analysis open-weights Intelligence Index and edges GPT-5.5 on several agentic coding benchmarks (self-reported).
  • About one-sixth the API cost of GPT-5.5, but you need an 8x H200 node to self-host. Ollama cloud tag available.
System Requirements
RAM256GB+ (FP8 node)
GPU8x H200
VRAM~860GB BF16
✓ Ollama

Z.ai shipped GLM-5.2 on June 17, 2026, and the open-weights crowd has not stopped talking about it since. The headline that matters: it is the first MIT-licensed model you can download and self-host that beats GPT-5.5 on several long-horizon agentic coding benchmarks, with a real 1M-token context, at roughly one-sixth the API cost. Artificial Analysis put it at the top of its open-weights Intelligence Index. The catch, which we will get to, is that "download and run" assumes you own a small rack of H200s. Here is what changed, the numbers with their asterisks, and how to actually use it.

What Z.ai shipped

GLM-5.2 is the successor to GLM-5.1 from Z.ai (formerly Zhipu AI), the Beijing lab that has been the most consistent open-weights coding-model shop of the past year. It is a sparse Mixture-of-Experts (MoE) model: roughly 753 billion total parameters, but only about 40 billion fire per token.

One definition, because it is the whole cost story. An MoE splits the feed-forward layers into many "expert" subnetworks and routes each token to just a few of them. You pay storage for all 753B but compute for only about 40B per token, the same trick behind DeepSeek-V4 and most large open models now. GLM-5.2 is text only, and the marquee spec is the context window: 1,048,576 tokens, five times GLM-5.1's 200K.

Two architecture tweaks do the heavy lifting. "IndexShare" reuses a single attention indexer across every four sparse-attention layers, which cuts per-token compute at long context. And Multi-Token Prediction now drafts five tokens at once instead of three, which speeds generation when you pair it with speculative decoding. License: MIT, no regional limits, commercial use fine.

The benchmarks, and the asterisk

Every number in the table is Z.ai's own, and none has been independently reproduced at the raw level. Read them as a vendor claim. The independent signal is separate, and it is real: Artificial Analysis ranks GLM-5.2 first among open-weight models on its Intelligence Index (51), ahead of MiniMax M3, DeepSeek V4 Pro, and Kimi K2.6, and it is the only open model mixing with the OpenAI and Anthropic frontier on the LMArena agent leaderboard.

BenchmarkGLM-5.2GLM-5.1GPT-5.5
SWE-Bench Pro62.158.458.6
Terminal-Bench 2.181.062.0n/p
FrontierSWE74.4n/p72.6
GPQA-Diamond91.2n/pn/p
HLE (with tools)54.7n/p52.2

All GLM figures are self-reported by Z.ai and not independently reproduced. "n/p" means not published in our sources. The open-weights ranking (Artificial Analysis Intelligence Index, LMArena) is third-party.

Read SWE-Bench Pro first: 62.1 against GPT-5.5's 58.6 and GLM-5.1's 58.4. That is the claim that earned the headlines, a downloadable model edging a closed frontier model on a benchmark builders actually trust. Terminal-Bench jumping from 62 to 81 in one release is the other eye-catcher. The honest framing comes from Nathan Lambert, who called it the open model that "feels right inside real coding harnesses as a general agent" and compared the moment to DeepSeek R1's launch. Note where it still trails: Claude Opus 4.8 keeps the lead on Terminal-Bench (85) and on the brutal SWE-Marathon.

The cost angle is the real story

GLM-5.2 runs about $1.40 per million input tokens and $4.40 per million output, against roughly $5 and $30 for GPT-5.5. That is the one-sixth-the-cost line, and for anyone running a coding agent in a loop, token cost is the bill that actually hurts. Because it is MIT and downloadable, you also get the self-host option that closed models never give you: no rate limits, no regional restrictions, no provider reading your repo.

Limitations and gotchas

  • It is not a laptop model. All of the roughly 753B weights must be resident. The native FP8 checkpoint fits one 8x H200 node; full 1M context wants 8x B200. Community 2-bit quants land near 241GB, still a multi-GPU rig.
  • The headline benchmarks are self-reported. The independent rankings (Artificial Analysis, LMArena) corroborate the vibe, not the exact numbers.
  • Text only. No vision, no audio.
  • 1M context is the spec, not a free lunch. IndexShare helps, but long context still costs memory and latency.

Who should use it

If you run agentic coding workloads and care about cost or data control, GLM-5.2 is the strongest open option on the board right now. Most teams will hit it through the cheap API or a rented GPU node rather than self-hosting. If you need vision, or you are on a single consumer GPU, this is not your model; look at a smaller MoE or a dense mid-size model instead.

Run it in about 10 minutes

The lowest-friction path is the Ollama cloud tag or an FP8 deployment on rented GPUs. Local single-box inference means a multi-GPU server.

# Easiest: Ollama (serves Z.ai's hosted weights via the cloud tag)
ollama run glm-5.2

# Self-host FP8 on an 8x H200 node with vLLM (the data-center path)
vllm serve zai-org/GLM-5.2-FP8 \
  --tensor-parallel-size 8 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 5

Point your existing coding harness (an Aider, Cline, or Claude Code-style CLI, whatever you already run) at the endpoint and give it a real multi-file task. The whole pitch is that it holds together across a long agent loop, so test it on something with more than one step, not a toy snippet.

Sources and further reading

Tested on: not independently tested. GLM-5.2 is a roughly 753B MoE that needs an 8x H200-class node even at FP8, beyond our bench. Every benchmark here is Z.ai-reported; the open-weights ranking is from Artificial Analysis and LMArena, flagged as third-party. Sources linked above.
Date checked: 2026-06-26

Prev Article
Rio 3.5 Open 397B
Next Article
Mistral Large 3

Related to this topic: