
DeepSeek-V4

DeepSeek-V4 Preview ships two MIT-licensed MoE variants with native 1M context: V4-Pro at 1.6T params and V4-Flash at 284B, both built for agents.

License: MIT
TL;DR
  • Two MoE variants: V4-Pro 1.6T (49B active) and V4-Flash 284B (13B active)
  • Native 1M context; 27% of the FLOPs and 10% of the KV cache of V3.2 at 1M tokens
  • MIT-licensed Preview; leads open-weights models on the GDPval-AA agentic score at 1554
System Requirements
  • RAM: 96GB
  • GPU: H200 or 2x H100
  • VRAM: 80GB+ (Flash 4-bit)
  • Ollama: ✓ supported

On April 24, 2026, DeepSeek open-sourced DeepSeek-V4, a Preview release that ships two MIT-licensed Mixture-of-Experts variants and a native one-million-token context window. DeepSeek-V4-Pro packs 1.6 trillion total parameters (49 billion active) and posts 80.6% on SWE-Bench Verified. DeepSeek-V4-Flash brings the same long-context architecture down to 284 billion parameters (13 billion active) at API pricing that Simon Willison called "the cheapest of the small models." It is the loudest open-weight drop of the spring, and the first time DeepSeek has explicitly framed a release around agentic coding.

Two Variants, One Architecture

DeepSeek released both base and instruct checkpoints for the Pro and the Flash, plus a Technical Report PDF on the Hugging Face collection. Everything is under the MIT license, so commercial use, fine-tuning, and redistribution are all unrestricted.

| Model | Total params | Active params | Context | Focus |
|---|---|---|---|---|
| DeepSeek-V4-Pro | 1.6T | 49B | 1M tokens | Frontier-class reasoning, coding, agents |
| DeepSeek-V4-Flash | 284B | 13B | 1M tokens | Cost-efficient long-context inference |

The Pro is 61 layers deep with a fresh MoE topology, both variants train on roughly 32 trillion tokens with the Muon optimizer, and post-training runs a two-stage recipe (domain experts first, then unified consolidation). The mixed-precision recipe is aggressive: most weights in FP8, MoE expert weights in FP4, and RoPE dimensions kept in BF16.

Hybrid Attention: CSA Plus HCA

The architectural headline is a brand-new hybrid attention scheme. DeepSeek-V4 alternates two layer types:

  • CSA (Compressed Sparse Attention) compresses the KV cache 4x using softmax-gated pooling, then reads through a top-k FP4 "lightning indexer" to keep retrieval sparse and cheap.
  • HCA (Heavily Compressed Attention) compresses the KV cache 128x and runs dense attention over the compressed blocks, with a sliding-window branch on the side to preserve recency.

Layers 0 and 1 are HCA, layers 2 through 60 alternate CSA and HCA, and the final Multi-Token-Prediction block is sliding-window only. The MoE feed-forward blocks use DeepSeekMoE with manifold-constrained hyper-connections replacing standard residual connections. The Pro card also exposes three reasoning modes: Non-think, Think High, and Think Max (Think Max recommends a 384K-or-larger context window).
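
To make that layout concrete, here is a small Python sketch of the layer schedule (an illustration only, not DeepSeek's configuration format; the release text leaves the alternation's starting phase ambiguous, so CSA-on-even-layers below is an assumption):

# Sketch of the attention-layer schedule described above.
# Layers 0-1 are HCA, layers 2-60 alternate CSA and HCA (starting phase
# assumed), and the MTP block after layer 60 is sliding-window only.

def attention_kind(layer: int) -> str:
    if layer in (0, 1):
        return "HCA"
    return "CSA" if layer % 2 == 0 else "HCA"

schedule = [attention_kind(i) for i in range(61)]
print(schedule[:8])  # ['HCA', 'HCA', 'CSA', 'HCA', 'CSA', 'HCA', 'CSA', 'HCA']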

1M Context With 90% Less KV Cache

Long-context inference has been the painful trade-off of the 2025 generation. DeepSeek-V4 attacks it head-on. At 1M tokens, V4-Pro reports 27% of the per-token inference FLOPs and 10% of the KV cache memory compared to DeepSeek-V3.2. V4-Flash pushes the KV cache reduction even further (the Hugging Face blog cites 7%, the Flash model card says 10%; treat the gap as still settling).
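
A back-of-envelope sketch shows how those compression factors could add up to a cache that small. Only the 4x and 128x ratios and the layer split come from the release; the per-layer KV width and byte size below are placeholder assumptions:

# Rough KV-cache estimate at 1M tokens (illustrative numbers only).
TOKENS = 1_000_000
KV_BYTES_PER_TOKEN_PER_LAYER = 1024  # assumed width, 1 byte per value (FP8)
CSA_LAYERS, HCA_LAYERS = 30, 31      # layers 0-1 HCA, layers 2-60 alternating

dense = TOKENS * (CSA_LAYERS + HCA_LAYERS) * KV_BYTES_PER_TOKEN_PER_LAYER
compressed = TOKENS * KV_BYTES_PER_TOKEN_PER_LAYER * (
    CSA_LAYERS / 4 + HCA_LAYERS / 128  # 4x and 128x compression
)

print(f"dense:      {dense / 1e9:.1f} GB")
print(f"compressed: {compressed / 1e9:.1f} GB ({compressed / dense:.0%} of dense)")
# ~13% with these placeholder dims, in the same ballpark as the quoted ~10%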

The practical consequence is real. On long-context retrieval (MRCR-1M), V4-Pro scores 83.5 MMR. Eight-needle MRCR stays above 0.82 through 256K tokens and only drops to 0.59 at the full 1M. CorpusQA-1M lands at 62.0 ACC. Translation: the context window is not a marketing number; the model actually uses it.

Benchmarks: Frontier-Adjacent on Code and Agents

DeepSeek-V4-Pro lands inside the closed-frontier band on most agentic and coding benchmarks. All numbers below come from the official Pro model card:

| Benchmark | V4-Pro | Notes |
|---|---|---|
| SWE-Bench Verified | 80.6% | Statistical tie with Opus 4.6 Max (80.8) and Gemini 3.1 Pro (80.6) |
| SWE-Bench Pro | 55.4% | Real-world repo bug-fix benchmark |
| Terminal-Bench 2.0 | 67.9 | Trails GPT-5.4-xHigh (75.1) |
| Toolathlon | 51.8 | Beats K2.6 (50.0) and GLM-5.1 (40.7) |
| LiveCodeBench | 93.5% | Codeforces rating: 3206 |
| HMMT 2026 Feb | 95.2% | Competition math |
| GPQA Diamond | 90.1% | Graduate-level reasoning |
| MMLU-Pro | 87.5% | General knowledge |
| HLE | 37.7% | Humanity's Last Exam |

Third-party Artificial Analysis ranks V4-Pro-Max at 52 on its Intelligence Index, second among open-weights models behind Kimi K2.6 (54). On the GDPval-AA real-world agentic score, however, V4-Pro takes the open-weights crown at 1554, ahead of GLM-5.1 (1535) and Kimi K2.6 (1484). That dual reading (slightly behind on raw intelligence, ahead on agentic execution) matches the way DeepSeek positions the model: built for agents.

V4-Flash is no slouch either. The Flash model card lists 79.0% SWE-Bench Verified, 91.6% pass@1 on LiveCodeBench, 86.2% MMLU-Pro, and 88.1% GPQA Diamond. For a 13B-active model, those numbers are absurd.

Pricing: Cheapest Frontier-Class API

The DeepSeek platform pricing (per 1M tokens) is the other big story:

| Model | Input (cache miss) | Input (cache hit) | Output |
|---|---|---|---|
| V4-Flash | $0.14 | $0.0028 | $0.28 |
| V4-Pro (75% promo) | $0.435 | $0.003625 | $0.87 |

The 75% Pro promo extends through 2026-05-31 15:59 UTC. After that, expect roughly $1.74 input / $3.48 output per million tokens, still well below Claude Opus 4.7 territory. On Hacker News the cost-vs-Claude-Haiku-4.5 comparison was blunt: "3.3x cheaper input, 10x cheaper output." Third-party hosts (DeepInfra, Together, Fireworks, Novita, SiliconFlow) carry V4-Pro at $1.74 to $2.67 per blended million.
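
If you want to sanity-check the economics against your own workload, the arithmetic is straightforward. The sketch below multiplies the listed per-million-token prices by token counts for a hypothetical long-context agent run (the counts and the cache hit/miss split are made up for illustration):

# Cost estimate from the listed per-1M-token prices (promo-period Pro rate).
PRICES = {  # USD per 1M tokens: (input cache miss, input cache hit, output)
    "v4-flash": (0.14, 0.0028, 0.28),
    "v4-pro-promo": (0.435, 0.003625, 0.87),
}

def job_cost(model, miss_tokens, hit_tokens, output_tokens):
    miss, hit, out = PRICES[model]
    return (miss_tokens * miss + hit_tokens * hit + output_tokens * out) / 1e6

# Hypothetical agent loop: 800K fresh input, 4M cached re-reads, 200K output
print(f"Flash: ${job_cost('v4-flash', 800_000, 4_000_000, 200_000):.2f}")      # ~$0.18
print(f"Pro:   ${job_cost('v4-pro-promo', 800_000, 4_000_000, 200_000):.2f}")  # ~$0.54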

Run It Today

The fastest path is the hosted API. Switch your model name and the rest of your stack stays the same:

curl https://api.deepseek.com/v1/chat/completions \
  -H "Authorization: Bearer $DEEPSEEK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [{"role": "user", "content": "Refactor this code for 1M context."}]
  }'
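
The same request from Python goes through any OpenAI-compatible client pointed at the DeepSeek base URL (a minimal sketch; the model name comes from the curl example above and the exact endpoint path may differ for the Preview):

# Minimal OpenAI-compatible client call against the DeepSeek API.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com/v1",
)

resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Refactor this code for 1M context."}],
)
print(resp.choices[0].message.content)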

For self-hosting, vLLM and SGLang both shipped Day-0 recipes with native CSA plus HCA, FP4 MoE backends, and MTP speculative decoding. NVIDIA NIM offers official Blackwell endpoints for V4-Pro, and OpenRouter load-balances across multiple providers if you want one API key for everything. Local inference works through the Hugging Face transformers path or community quantizations:

# Pull a community Flash GGUF (preview-stage, expect breakage)
ollama pull deepseek-v4-flash:q4_k_m

# Or test the official weights via transformers
huggingface-cli download deepseek-ai/DeepSeek-V4-Flash
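
If you want to poke at the downloaded weights directly, a minimal transformers sketch looks like the following. Treat it as an outline only: the Preview repo will likely require trust_remote_code, far more memory than a consumer GPU offers, and (since no chat template ships, see below) a raw completion prompt rather than chat messages:

# Rough transformers outline for the Flash checkpoint (not a tested recipe).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V4-Flash"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # Preview repos typically ship custom modeling code
    device_map="auto",
    torch_dtype="auto",
)

inputs = tok("def fibonacci(n):", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))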

The Flash card already lists 23 community quantization variants. The Pro card lists six, but no aggressive sub-Q4 GGUF has landed yet; the 860 to 900 GB native footprint plus FP4-QAT experts leaves little headroom for further compression.
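
A quick estimate shows why: at 1.6 trillion parameters, even a weight file dominated by FP4 experts is enormous. The precision split below is an assumption for illustration, not a figure from the card:

# Back-of-envelope weight-file size for V4-Pro (assumed precision split).
TOTAL_PARAMS = 1.6e12
FP4_SHARE, FP8_SHARE = 0.90, 0.10  # assumed: most parameters sit in MoE experts

size_gb = TOTAL_PARAMS * (FP4_SHARE * 0.5 + FP8_SHARE * 1.0) / 1e9
print(f"~{size_gb:.0f} GB")  # ~880 GB, in line with the quoted 860-900 GB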

Where the Preview Tag Bites

This is officially a Preview, not a final release, and a few sharp edges show:

  • Hallucination rates on Artificial Analysis's AA-Omniscience evaluation come in at 94% for V4-Pro and 96% for V4-Flash, notably higher than peers. If your workflow is fact-heavy, ground the model with retrieval.
  • V4-Pro is consumer-unrunnable. Even at FP4 the model wants multi-node H100 or H200 servers. V4-Flash on a single H200 with quantization is the practical local play.
  • GGUF quality is unsettled. Community sub-Q4 quants of V4-Flash exist but quality is "unverified," and dynamic 1-2 bit V3-era tricks do not translate cleanly to V4's FP4-trained experts.
  • No Jinja chat template ships with the model; you need DeepSeek's Python encoder to build prompts correctly. Several day-one users tripped on this.
  • Frontier gap. Simon Willison's read of the technical report puts V4-Pro "marginally short of GPT-5.4 and Gemini-3.1-Pro," roughly three to six months behind the closed frontier on raw quality.

Why It Matters

For anyone building agents on open weights, DeepSeek-V4 is the new baseline to beat. It tops the open-weights GDPval-AA agentic leaderboard, it ships under MIT, it gives you a real 1M-token context with KV-cache numbers that make long-horizon agent loops actually affordable, and it does this at API pricing that undercuts every closed competitor. V4-Flash is the immediate winner for builders: fast, cheap, deployable, and frontier-adjacent on code. V4-Pro is the model to benchmark your stack against for the rest of 2026.

If you have spent the past year working around context limits with chunking heuristics, retrieval shims, or expensive prompt-cache games, spend an afternoon throwing a 500K-token job at V4-Flash. The economics of long-context agents just changed.

Tested on: DeepSeek API (V4-Flash and V4-Pro Preview) | Date tested: 2026-05-01

 
