DeepSeek V4
DeepSeek · Released April 24, 2026
What It Actually Is
Every few months, someone announces an open-weight model that’s supposed to match the closed frontier. Usually, the benchmarks tell a different story once independent evaluators get their hands on it. DeepSeek V4 might be the exception — not because it claims to beat everything (it doesn’t), but because it changes the economics so dramatically that “close enough” becomes “close enough and ten times cheaper.”
The numbers that matter aren’t the benchmark scores — they’re the efficiency numbers. Pro uses roughly 27% of the FLOPs and 10% of the KV cache that V3.2 needed at 1M context. That’s not an incremental improvement — that’s a different conversation about what hardware you need. Flash takes it further: 284B total parameters with only 13B active per token, making genuinely useful inference possible on mid-tier multi-GPU setups that would have been laughably inadequate a year ago.
Two variants tell you everything about DeepSeek’s strategy. Pro (1.6T/49B active) is the capability play — chase the closed frontier, offer it open-weight under MIT. Flash (284B/13B active) is the adoption play — make it run on hardware real people actually have. Both share the same architectural innovations, the same 1M context window, and the same MIT license. Pick your price point, keep your data.
Key Strengths
- Massive efficiency leap: Pro uses ~27% FLOPs and ~10% KV cache of V3.2 at 1M context. Flash is even leaner. This isn’t just a bigger model — it’s a fundamentally more efficient architecture that makes trillion-parameter inference practical on hardware that would have choked on V3.
- True 1M context: Not a marketing number — the efficiency gains make million-token inference actually usable. Load entire monorepos, full documentation sets, or days of conversation history without the model forgetting what it read.
- Two variants, two use cases: Pro (1.6T/49B active) for maximum capability on enterprise clusters. Flash (284B/13B active) for speed and cost on mid-tier hardware. Pick your tradeoff — both share the same architecture and MIT license.
- Rock-bottom API pricing: 3-7× cheaper than Claude Opus equivalents on early comparisons. If you don’t want to self-host, the API gives you near-frontier capability at prices that make closed-model pricing look like luxury tax.
- Hardware flexibility: Optimized for both NVIDIA GPUs and Huawei Ascend — a genuine differentiator for organizations in regions where chip access is a strategic concern. FP8/FP4 mixed precision support out of the box.
-
Architecture — 1.6T MoE / 49B active Hybrid attention (Compressed Sparse + Heavily Compressed) with 1M context. Only 49B params activate per token in Pro (13B in Flash), making it efficient despite the massive total parameter count.
-
Efficiency — ~73% FLOPs reduction Vs predecessor V3.2 at 1M context. Pro uses ~27% of prior FLOPs and ~10% of KV cache. Flash even lower. The architecture breakthrough that makes trillion-parameter local inference viable.
-
API Pricing — 3-7× cheaper Than Claude Opus equivalents in early comparisons. Self-hosted deployments pay only hardware costs. The economic argument for open-weight has never been stronger.
-
Reasoning — competitive with frontier DeepSeek reports Pro reasoning mode outperforms GPT-5.2 and Gemini 3.0 Pro, trails GPT-5.4 and Gemini 3.1 Pro 'marginally.' Independent verification pending.
Honest Limitations
- Preview release: Full independent benchmarks haven’t landed yet. DeepSeek’s own eval shows it trailing GPT-5.4 and Gemini 3.1 Pro ‘marginally’ on some reasoning tasks. Until Artificial Analysis and SWE-Bench teams post verified scores, treat headline numbers with appropriate skepticism.
- Hardware hunger (Pro): 1.6 trillion total parameters means you need enterprise multi-GPU clusters (4-8× H100/H200-class) for comfortable Pro inference. Flash + quantization brings it to mid-tier setups, but this is not a laptop model.
- No multimodal output: Text-focused. No native vision or image generation. Kimi K2.6 handles multimodal input (images, video) natively — DeepSeek V4 doesn’t.
- Chinese ecosystem maturity: Like other Chinese-origin models, English-language documentation and Western community tooling are growing but less mature than the Chinese ecosystem. Expect some rough edges in early adoption.
The Verdict: The open-weight model that finally makes the math work for local frontier AI. DeepSeek V4 doesn’t beat every closed model on every benchmark — and it’s honest enough to say so in its own tech report. But the combination of 1.6T-scale capability, 1M context, MIT license, 73% compute reduction, and API pricing that embarrasses the cloud labs makes it the most compelling self-hosting proposition in 2026. Pro for maximum capability, Flash for accessibility, and an efficiency architecture that turns ‘you need a data center’ into ‘you need a serious workstation.’ The preview caveats are real — wait for independent benchmarks before betting your pipeline. But if you care about running genuinely powerful AI on your own terms, DeepSeek just handed you the keys.