Gemma 4
By Google DeepMind
What It Actually Is
Here’s what makes Gemma 4 different from every other open model: it doesn’t just scale up, it scales down. While the AI industry obsesses over who can build the biggest model, Google DeepMind asked a different question: how smart can we make the smallest one?
The answer turns out to be “surprisingly smart.” The E4B — a model designed to run on your phone — scores 42.5% on AIME 2026, a competitive math exam that would have been science fiction for a model this size just a year ago. The E2B fits in 1.5 GB of RAM and still handles text, images, and live audio. These aren’t dumbed-down chatbots. They’re genuinely multimodal reasoning engines that happen to run without a cloud connection.
The bigger variants (26B MoE, 31B dense) compete with Gemma’s cloud-hosted siblings. The 31B ranks #3 among open models on Arena AI. The 26B MoE model is the efficiency play — 26 billion total parameters, but only 3.8 billion activate per token, delivering near-31B quality at a fraction of the compute cost.
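The MoE efficiency claim is easy to sanity-check with back-of-the-envelope arithmetic. A minimal sketch, assuming per-token forward-pass cost scales with active parameters (the "~2 FLOPs per active parameter" rule of thumb is a simplification, and only the parameter counts above come from the source):

```python
def fwd_flops_per_token(active_params: float) -> float:
    """Rough forward-pass cost: ~2 FLOPs per active parameter per token."""
    return 2 * active_params

dense_31b = fwd_flops_per_token(31e9)   # 31B dense: every parameter is active
moe_26b = fwd_flops_per_token(3.8e9)    # 26B MoE: only 3.8B active per token

print(f"MoE per-token cost vs 31B dense: {moe_26b / dense_31b:.1%}")
# → MoE per-token cost vs 31B dense: 12.3%
```

Under these assumptions, the 26B MoE does roughly an eighth of the dense model's per-token compute, which is what "near-31B quality at a fraction of the compute cost" cashes out to.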
All four models share the same Apache 2.0 license, support 140+ languages, and offer built-in thinking modes for complex reasoning tasks. Whether you’re building an offline translation app, a privacy-first health assistant, or an on-device photo analyzer, there’s a Gemma 4 model that fits.
Key Strengths
- Four models, one family: E2B (~1.5 GB quantized) for extreme edge, E4B for flagship phones, 26B MoE (3.8B active) for workstations, 31B dense for server-side. Pick the size that matches your hardware.
- E2B and E4B — real AI on real phones: Native multimodal input — text, images, and audio — running entirely on-device. The E4B scores 42.5% on AIME 2026, more than double the 20.8% posted by Gemma 3’s far larger 27B model. That’s competitive math reasoning on a smartphone.
- Apache 2.0 — genuinely open: No usage restrictions, no royalties, full commercial rights. Download from Hugging Face, Ollama, or Google AI Studio and use however you want.
- 140+ languages: The entire family is trained on a massive multilingual corpus. For local-first apps serving global users, this is significant.
- Built-in reasoning mode: Configurable “thinking” modes for multi-step planning and complex task decomposition — even on the edge models.
- AIME 2026 — E4B 42.5%, E2B 37.5%: Competitive math benchmark. The edge models more than double Gemma 3 27B’s 20.8%; the 31B dense model hits 89.2%.
- Arena AI — 31B #3, 26B MoE #6 (open models): Crowd-sourced comparison leaderboard. The 31B is top-tier among open models; the 26B MoE comes within 1–2% at a fraction of the compute.
- Architecture — Dense (E2B, E4B, 31B) + MoE (26B): Per-Layer Embeddings (PLE) maximize parameter efficiency on the edge models; the 26B MoE activates only 3.8B params per token for workstation efficiency.
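The “~1.5 GB quantized” footprint for the E2B can be sanity-checked with simple weight-size arithmetic. A rough sketch, assuming weights dominate memory — the 2B-parameter and 4-bit figures below are illustrative choices, not official numbers, and KV cache, activations, and PLE details are ignored:

```python
def quantized_weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB: params x bits / 8 bytes, no overhead."""
    return num_params * bits_per_weight / 8 / 1e9

# A ~2B-parameter model at 4-bit quantization:
print(f"{quantized_weight_gb(2e9, 4):.2f} GB")  # → 1.00 GB
```

At 4-bit, a 2B-parameter model’s weights alone come to about 1 GB; the rest of a ~1.5 GB budget goes to embeddings, KV cache, and runtime overhead, which this sketch deliberately leaves out.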
Honest Limitations
- Edge models are edge models: E2B and E4B aren’t going to match a 31B dense model on complex reasoning tasks. They’re optimized for quality-per-byte, not absolute quality.
- No video on edge: Video understanding is exclusive to the 26B and 31B variants. Edge models handle text, images, and audio only.
- Google-preferred tooling: Best supported through MediaPipe, LiteRT, and Google AI Studio. Works with Ollama and llama.cpp too, but the Google stack is the smoothest path.
- No agentic focus: Unlike GLM-5.1’s sustained autonomous sessions, Gemma 4 is built for single-turn and multi-turn inference — not marathon coding sprints.
The Verdict: Gemma 4 is the most practical open model family released this year. The 31B and 26B are impressive workstation models, sure — but the real story is E2B and E4B. Running genuine multimodal AI on a phone, understanding text, images, and spoken audio, with math reasoning that would have been frontier-tier two years ago? That’s not a gimmick. That’s the future of offline-first applications.