Gemma 4 Complete Guide (2026): Google's Open Model for Agentic AI on Android & Edge Devices
Google released Gemma 4 on April 2, 2026 — Apache 2.0 licensed, #3 among open models on the Arena AI leaderboard, 89.2% on AIME math, 80% on LiveCodeBench. Four variants from 2.3B (runs offline on your phone) to 31B (outperforms open models more than 10× its size). Every benchmark, every install command, and the EDGE Framework for choosing the right variant.
A model with 31 billion parameters should not rank third among all open models in the world, ahead of systems with 400 billion parameters and data-center infrastructure requirements. But that is where Gemma 4 sits on the Arena AI text leaderboard as of its release day, April 2, 2026.
The number that best captures what happened between Gemma 3 and Gemma 4 is not a leaderboard position. It is a Codeforces ELO score. Gemma 3 scored 110. Gemma 4 31B scores 2,150. That is not a percentage improvement. It is a jump from "barely functional at competitive programming" to "expert-level competitive programmer" — in a single model generation. No prior open-source model has made a larger inter-generational leap on that benchmark.
The AIME 2026 mathematics benchmark tells the same story differently: Gemma 3 27B scored 20.8%. Gemma 4 31B scores 89.2%. The 26B MoE — activating only 3.8 billion of its 25.2 billion parameters per token — scores 88.3%.
And then there is the τ2-bench agentic tool use benchmark: Gemma 3 scored 6.6%. Gemma 4 31B scores 86.4%. That number matters more than the others for most developers. It measures not whether a model is intelligent, but whether a model can reliably complete the multi-step tool-calling workflows that production agentic applications require. Gemma 3 essentially could not do it. Gemma 4 is among the best open models in the world at it. Also read: AI Agents Are Failing in Enterprise: Here Is the Real Data
Google released this as an open-weight model under Apache 2.0 — one of the most permissive commercial licenses in software — for the first time in Gemma's history. You can download the weights today, build a product, charge for it, redistribute it, and compete with Google using it. No agreements. No usage limits. No legal review required.
This guide covers all four variants, every verified benchmark, the production deployment decision matrix, the complete Ollama setup, and the honest competitive map — including where Gemma 4 is not the right answer.
What Gemma 4 Actually Is
The Lineage: Built From Gemini 3 Research
Gemma 4 is Google DeepMind's fourth-generation open-weight model family. The official framing: "Built from the same world-class research and technology as Gemini 3." Gemma is not Google's proprietary offering cut down and rebundled. It is the open-weight counterpart — built from the same research pipeline but packaged for self-hosted deployment. You own the weights. Google does not see your data. Your inference costs are your compute costs, not per-token fees to Google's API.
The Gemma family launched in February 2024 with 2B and 7B parameter variants. Gemma 2 (mid-2024) expanded to 9B and 27B. Gemma 3 (early 2025) introduced multimodal capabilities for the first time. Gemma 4, released April 2, 2026, introduces a four-variant architecture spanning edge devices to workstation GPUs, a shift to Apache 2.0, and benchmark results that are not incremental over Gemma 3 — they are categorically different. If you are building a comparison with Google's closed models, see our Complete Guide to Gemini Models 2026.
The community that built around the first three generations: over 400 million cumulative Gemma downloads and more than 100,000 fine-tuned community variants. Google calls this the "Gemmaverse." The INSAIT Bulgarian-first language model BgGPT was built on Gemma. Yale University used Gemma for cell biology research in cancer therapy pathway discovery. The combination of permissive licensing and open weights has produced applied research use cases that closed-weight models cannot enable.
The Four-Variant Architecture — Every Model Explained
Google released four distinct models on April 2, 2026. They are not a scaling continuum — they are four architecturally distinct systems targeting four different deployment scenarios.
Variant 1: E2B — The Smartphone Model
The E2B is Gemma 4's "effective 2 billion" parameter edge model. The "effective" designation matters: the E2B uses Per-Layer Embeddings (PLE), a technique where each decoder layer has its own small embedding for every token. This maximizes parameter efficiency for on-device deployment. The model fits on an 8GB RAM device and runs completely offline.
What makes E2B unique in the market: Native audio input for speech recognition. No competing open model at this size tier offers audio input natively at E2B's hardware footprint. Google co-developed E2B with Qualcomm Technologies and MediaTek specifically for Pixel, mid-range Android phones, and IoT devices.
Variant 2: E4B — The Premium Edge Model
The E4B is Gemma 4's "effective 4 billion" parameter edge model, also using PLE architecture. It produces meaningfully better reasoning quality than E2B while still fitting within the hardware envelope of a high-end phone or a laptop with limited VRAM. Still carries native audio input. Powers Android Studio's Agent Mode locally.
Variant 3: 26B MoE — The Production Default
The 26B Mixture of Experts model has 25.2 billion total parameters organized across 128 small expert networks plus one always-on shared expert. For each token, the routing mechanism selects 8 of the 128 experts to activate — totaling 3.8B active parameters per inference step. All 25.2B must reside in VRAM. The result: 97% of the 31B Dense model's benchmark quality at approximately 1/8 the compute per inference step.
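The efficiency claim is simple arithmetic on the figures above. A back-of-envelope sketch (the ratios only, not an official calculation):

```python
# Published Gemma 4 figures: all 25.2B MoE parameters must sit in VRAM,
# but only 8 of 128 experts (3.8B parameters) compute per token.
TOTAL_PARAMS_B = 25.2
ACTIVE_PARAMS_B = 3.8
DENSE_PARAMS_B = 30.7   # 31B Dense flagship: every parameter active per token

# Fraction of the MoE doing work on any given token (~15%):
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

# Compute per inference step relative to the 31B Dense (~1/8):
compute_ratio = ACTIVE_PARAMS_B / DENSE_PARAMS_B
print(f"active: {active_fraction:.1%}, vs dense: 1/{DENSE_PARAMS_B / ACTIVE_PARAMS_B:.0f}")
```

This is where the "97% of the quality at roughly 1/8 the compute" framing comes from: quality tracks total capacity, while per-token cost tracks active parameters.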
Variant 4: 31B Dense — The Flagship
The 31B Dense model is Gemma 4's flagship: all 30.7B parameters activate for every token, delivering the highest and most consistent reasoning quality. Dense models have no routing variance — every token sees the full network. Currently ranks #3 among all open models globally on the Arena AI text leaderboard.
Gemma 4 Model Variants — Which One Belongs in Your Stack?
Model Specification Summary
| Spec | E2B | E4B | 26B MoE | 31B Dense |
|---|---|---|---|---|
| Effective params | 2.3B | 4.5B | 3.8B active | 30.7B |
| Total params | 2.3B | 4.5B | 25.2B | 30.7B |
| Architecture | Dense + PLE | Dense + PLE | MoE (128 experts, 8 active) | Dense |
| Context window | 256K | 256K | 256K | 256K |
| Audio input | ✅ Native | ✅ Native | ❌ | ❌ |
| VRAM (Q4 quant) | ~5GB | ~8GB | ~24GB | ~24GB |
| Target hardware | Phone, Jetson Nano | Laptop, Jetson | Single workstation GPU | A100 / H100 |
| Arena AI rank | — | — | #6 open models | #3 open models |
| AIME 2026 | — | 42.5% | 88.3% | 89.2% |
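A rough sizing rule behind the VRAM column can be sketched as follows. This is an approximation, not an official formula: Q4 quantization stores about half a byte per parameter, and the table's figures run higher because they also budget KV cache for long contexts.

```python
def estimate_vram_gb(total_params_b: float, bytes_per_param: float = 0.5,
                     overhead_gb: float = 2.0) -> float:
    """Weights-only VRAM estimate for a Q4-quantized model.

    bytes_per_param ~0.5 approximates 4-bit weights; overhead_gb covers
    runtime buffers. KV cache for long contexts adds more on top.
    """
    return total_params_b * bytes_per_param + overhead_gb

# The 26B MoE and 31B Dense land in the same band because the MoE's
# full 25.2B parameters must be resident even though few activate:
for name, params in [("E2B", 2.3), ("E4B", 4.5),
                     ("26B MoE", 25.2), ("31B Dense", 30.7)]:
    print(f"{name}: ~{estimate_vram_gb(params):.0f} GB weights + buffers, before KV cache")
```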
The Benchmark Story — What the Numbers Actually Mean
AIME 2026: From 20.8% to 89.2%
AIME (American Invitational Mathematics Examination) is a high-school competition mathematics exam testing multi-step reasoning, algebraic manipulation, and mathematical intuition; the benchmark uses the 2026 problem set. Gemma 3 27B scored 20.8%. Gemma 4 31B scores 89.2%.
To put 89.2% in context: Llama 4 Scout, which has 109 billion total parameters, scores approximately 88% on AIME 2026 — Gemma 4 matches it with a model 3.5× smaller in total size.
Codeforces ELO 110 → 2,150: What It Measures and Why It Matters
Codeforces is a competitive programming platform where humans (and now models) compete to solve algorithmic problems under time pressure. A Codeforces ELO of 2,150 places a programmer in the "Master" tier, roughly the top 1–2% of all competitive programmers globally.
Gemma 3 27B had a Codeforces ELO of 110 — the equivalent of a beginner. The jump to 2,150 is not an incremental improvement; it is a change in category.
τ2-bench: The Agentic Tool Use Benchmark Most Articles Skip
τ2-bench measures an AI agent's ability to complete multi-step tasks requiring calling tools in sequence, interpreting tool outputs, and adapting based on intermediate results. It is the benchmark most directly predictive of whether a model will work reliably as a production agentic backbone.
Gemma 3 27B scored 6.6% on τ2-bench. Gemma 4 31B scores 86.4%. A 6.6% score means the model completes agentic tasks roughly 1 time in 15. An 86.4% score means roughly 6 times in 7, putting it within the reliability range production agentic systems can build on.
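Why 86.4% clears a practical bar that 6.6% does not becomes obvious once tasks are chained. A quick sketch, assuming independent task outcomes (a simplification):

```python
def pipeline_success(per_task: float, steps: int) -> float:
    """Probability that every step of a sequential agent pipeline succeeds,
    assuming independent outcomes per step."""
    return per_task ** steps

# A three-stage tool-calling pipeline:
p_gemma3 = pipeline_success(0.066, 3)   # effectively zero
p_gemma4 = pipeline_success(0.864, 3)   # ~0.64: workable once retries are added
print(f"Gemma 3: {p_gemma3:.4f}, Gemma 4: {p_gemma4:.2f}")
```

Real agent workloads are not literally independent per step, but the compounding is the point: below some per-task reliability, multi-step agents fail almost every time.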
This Is Not an Iteration. This Is a Category Change.
Four benchmarks that tell the same story: something fundamentally different happened in the training pipeline
The Apache 2.0 License — Why It Matters More Than Any Benchmark
The Prior License's Enterprise Problem
Gemma 1, 2, and 3 all shipped under a custom "Gemma Open License" that was not Apache 2.0. The custom license contained acceptable use restrictions requiring legal interpretation, Google-specific terms that raised questions about upstream control, no clear commercial redistribution path matching enterprise standards, and provisions creating uncertainty about scaling a product built on Gemma.
The practical result: companies building commercial products on Gemma 3 needed custom legal analysis. Many enterprise legal teams simply marked Gemma as "review required, probable block." Hugging Face CEO Clément Delangue called the Apache 2.0 switch "a huge milestone." That is not marketing language — the friction was real.
What Apache 2.0 Actually Permits
With Apache 2.0, you can:
- Build a commercial product using Gemma 4 and charge for it without any agreement with Google
- Fine-tune the model on your proprietary data and keep the fine-tuned weights
- Redistribute the base model or fine-tuned variants commercially
- Use Gemma 4 to build products that compete directly with Google's own offerings
- Run it on any infrastructure without usage reporting requirements
You must include the Apache 2.0 license text in any distribution. That is the complete obligation.
For enterprises that rejected Gemma 3 on legal grounds, Gemma 4 is a different product. The benchmark improvement matters. The license change may matter more for the adoption decision.
The Android & Edge AI Story
This is the part of Gemma 4 that is systematically underreported in benchmark-focused coverage. Google is not just releasing a smaller version of a server model. The E2B and E4B variants are the result of a coordinated multi-year effort with hardware manufacturers to build a complete on-device AI stack for Android.
The Hardware Partnership
The E2B and E4B were co-developed with Qualcomm Technologies and MediaTek. Both chipmakers provide the neural processing unit (NPU) silicon in most high-end and mid-range Android phones. Gemma 4's edge models are optimized for Qualcomm's Snapdragon and MediaTek's Dimensity platforms — not GPU-first models ported to mobile, but models designed from the training process for the inference hardware they will run on.
The practical difference: near-zero inference latency on supported devices. A Snapdragon 8 Gen 3 phone running Gemma 4 E2B can respond to prompts in under 100ms for short sequences — fast enough for real-time voice interaction without the network round-trip that cloud-based voice assistants require.
Gemma 4 Android Integration Stack
A complete edge AI stack, from prototype to production — not a feature list.
ML Kit GenAI Prompt API: a standardized interface for prompt-based AI features in Android apps. Model delivery is handled by ML Kit infrastructure, so there is no bundling of model weights in your APK, no cloud API keys, and no per-call costs. This is the recommended production integration path.
- ✓No model file bundling in APK
- ✓ML Kit handles model delivery & updates
- ✓No cloud API keys required
- ✓Zero per-call inference costs
The most underserved market for on-device AI is regulated industries. These sectors have data residency requirements, HIPAA/GDPR obligations, or security policies prohibiting cloud API data transmission. Gemma 4 E4B running via ML Kit GenAI Prompt API processes all data locally — zero API calls, zero compliance exposure.
The Complete Benchmark Table — Verified Data
Complete Benchmark Comparison
Verified data — including where Gemma 4 loses
| Benchmark | G4 E4B | G4 26B MoE | G4 31B | G3 27B | Llama 4 Scout | Qwen 3.5 27B |
|---|---|---|---|---|---|---|
| AIME 2026 (math) | 42.5% | 88.3% | 89.2% ✦ | 20.8% | ~88% | ~85% |
| LiveCodeBench v6 (coding) | — | 77.1% | 80.0% ✦ | 29.1% | — | ~72% |
| GPQA Diamond (reasoning) | — | 82.3% | 84.3% | 42.4% | 74.3% | 85.5% ✦ |
| MMLU Pro (knowledge) | — | ~84% | 85.2% | — | — | 86.1% ✦ |
| τ2-bench (agentic) | — | ~85% | 86.4% ✦ | 6.6% | — | — |
| Codeforces ELO (coding) | — | — | 2,150 ✦ | 110 | — | ~1,800 |
| Context window | 256K | 256K | 256K | 128K | 10M ✦✦ | 262K |
| Languages supported | 140 | 140 | 140 | 140 | 200+ | 201 ✦✦ |
| Audio input | ✅ Native ✦ | ❌ | ❌ | ❌ | ❌ | ❌ |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 ✦ | Custom ❌ | Community ⚠️ | Apache 2.0 |
Honest read: Gemma 4 31B wins on math (AIME), coding (LiveCodeBench, Codeforces ELO), agentic tool use (τ2-bench), and Arena AI ranking. Qwen 3.5 edges it on GPQA Diamond, MMLU Pro, language count, and context window. Llama 4 Scout's 10-million-token context is unmatched — if that is your requirement, Gemma 4 is not the answer.
The EDGE Framework — Choosing the Right Gemma 4 Variant
EDGE is the deployment selection methodology derived from the benchmark data, hardware constraints, and use case profiles across all four Gemma 4 variants.
EDGE: Environment Constraints → Deployment Scale → Gap Analysis → Ecosystem Fit
Before any feature comparison, hardware determines which variants are in scope. The environment is the first and non-negotiable filter.
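The Environment step lends itself to a small selection helper. The sketch below is illustrative only: the variant names, VRAM figures, and audio flags come from the specification table earlier in this guide, and the function itself is not an official tool.

```python
# Spec-table figures used for the Environment (E) filter; illustrative only.
VARIANTS = {
    "E2B":       {"vram_gb": 5,  "audio": True},
    "E4B":       {"vram_gb": 8,  "audio": True},
    "26B MoE":   {"vram_gb": 24, "audio": False},
    "31B Dense": {"vram_gb": 24, "audio": False},
}

def environment_filter(available_vram_gb: float, needs_audio: bool = False) -> list[str]:
    """E step of EDGE: drop every variant the hardware cannot host
    or that lacks a required native capability."""
    return [name for name, spec in VARIANTS.items()
            if spec["vram_gb"] <= available_vram_gb
            and (not needs_audio or spec["audio"])]

# An 8GB laptop that needs speech input is limited to the edge models:
print(environment_filter(8, needs_audio=True))   # → ['E2B', 'E4B']
```

Only after this filter do the remaining EDGE steps (deployment scale, gap analysis, ecosystem fit) compare the surviving candidates.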
A Critical Counterintuitive Finding on Throughput
The 26B MoE is slower than the 31B Dense on a single consumer GPU. MoE inference requires loading all 25.2B parameters into VRAM even though only 8 of the 128 experts (3.8B parameters) activate per token, and the routing adds overhead. On an RTX 4090 with Q4 quantization, the 31B Dense runs at ~25 tokens/second while the 26B MoE runs at ~11 tokens/second. The MoE's compute-efficiency advantage materializes on optimized server infrastructure (A100/H100 with dedicated MoE serving frameworks such as vLLM). For latency-sensitive local applications, test the 31B Dense before assuming the MoE is faster.
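The throughput gap translates directly into user-facing latency. A simple decode-time model (ignoring prompt prefill, which adds a further fixed cost):

```python
def decode_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to stream num_tokens at a steady decode rate.
    Prompt prefill adds a further fixed cost on top."""
    return num_tokens / tokens_per_second

# A typical 500-token response on an RTX 4090 at Q4, per the figures above:
dense_s = decode_seconds(500, 25.0)   # 20s for the 31B Dense
moe_s   = decode_seconds(500, 11.0)   # ~45s for the 26B MoE
print(f"31B Dense: {dense_s:.0f}s, 26B MoE: {moe_s:.0f}s")
```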
Installation Guide — Running Gemma 4 in Under Five Minutes
Via Ollama (Recommended for Local Development)
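A minimal sketch of that path, using only the Python standard library against Ollama's documented local REST endpoint. The model tag `gemma4:26b` is an assumption; confirm the actual published tag with `ollama list` after pulling.

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "gemma4:26b") -> dict:
    """Request body for Ollama's /api/generate endpoint (tag name assumed)."""
    return {"model": model, "prompt": prompt, "stream": False}

def ollama_generate(prompt: str, host: str = "http://localhost:11434") -> str:
    """Non-streaming completion via the local Ollama REST API."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{host}/api/generate", data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    # Requires `ollama serve` running and the model pulled first:
    #   ollama pull gemma4:26b   # tag name assumed — check `ollama list`
    print(ollama_generate("Name three on-device LLM use cases."))
```

Because inference stays on localhost, nothing in the prompt or response leaves the machine.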
Via Hugging Face Transformers (Python)
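A sketch of the Transformers path, assuming the checkpoint ships with a standard chat template. The id `google/gemma-4-26b-it` is a placeholder; use the id Google actually publishes on the Hugging Face Hub.

```python
MODEL_ID = "google/gemma-4-26b-it"  # placeholder id — check the actual Hub listing

def build_chat(user_message: str) -> list[dict]:
    """Messages in the format that apply_chat_template expects."""
    return [{"role": "user", "content": user_message}]

if __name__ == "__main__":
    # Imports kept inside the guard so the sketch can be read
    # without transformers/torch installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer.apply_chat_template(
        build_chat("Explain mixture-of-experts routing in two sentences."),
        add_generation_prompt=True, return_tensors="pt",
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=200)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```

`device_map="auto"` spreads the weights across whatever GPUs are visible; for the 26B MoE or 31B Dense, plan around the ~24GB Q4 figures from the spec table.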
Via vLLM (Production Server Deployment)
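A sketch of vLLM's offline batch API. The model id is again a placeholder, and the sampling defaults are a suggestion for tool-calling workloads, not figures from this guide.

```python
def sampling_config() -> dict:
    """Conservative sampling defaults suggested for agentic/tool-calling use."""
    return {"temperature": 0.2, "top_p": 0.95, "max_tokens": 512}

if __name__ == "__main__":
    # vLLM offline batch inference; the model id is a placeholder.
    from vllm import LLM, SamplingParams

    llm = LLM(model="google/gemma-4-26b-it", tensor_parallel_size=1)
    params = SamplingParams(**sampling_config())
    outputs = llm.generate(["Summarize the Apache 2.0 obligations."], params)
    for out in outputs:
        print(out.outputs[0].text)
```

For an OpenAI-compatible HTTP endpoint instead of offline batching, the CLI equivalent is `vllm serve <model-id>`.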
For Android Development (ML Kit GenAI Prompt API)
On Android there is no manual model download step: the ML Kit GenAI Prompt API described in the integration stack above handles model delivery and updates, so Gemma 4 E4B is integrated through the ML Kit dependency in your app rather than through a local model runtime.
Common Mistakes Developers Make With Gemma 4
Defaulting to the 31B Dense when the 26B MoE is sufficient. On server infrastructure, the 26B MoE achieves 97% of the 31B Dense's benchmark quality at 1/8 the active compute per inference step. For the vast majority of production workloads, the quality difference is imperceptible to users. The cost difference at scale is not.
Running the 26B MoE on a single consumer GPU and being surprised by throughput. The MoE architecture requires loading all 25.2B parameters into VRAM regardless of how many activate per token. On an RTX 4090, the routing overhead reduces throughput to approximately 11 tokens/second — slower than the 31B Dense at 25 tokens/second on the same hardware.
Treating the 256K context window as interchangeable with Llama 4 Scout's 10M. Gemma 4 cannot analyze a 500K-token codebase in a single context. Llama 4 Scout can. If your use case involves processing entire large codebases or multi-document archives in a single session, Gemma 4 is not the right model regardless of its other qualities.
Ignoring the τ2-bench score when evaluating for agentic applications. For developers evaluating models for agentic workflows, τ2-bench measures whether the model actually completes tool-calling workflows reliably — the number most relevant to production agents. It appears in fewer than 10% of comparison articles developers will find when searching. Gemma 4's 86.4% τ2-bench is the most important number for agentic deployments. See also: Agentic Development 2.0 & OpenAI Codex.
Deploying Gemma 3 workflows unchanged with Gemma 4. The agentic capabilities in Gemma 4 are qualitatively different from Gemma 3. Prompt patterns optimized for Gemma 3's limited tool-use abilities may under-utilize Gemma 4's native function-calling and structured JSON output. Existing Gemma 3 deployments should be re-evaluated to take advantage of Gemma 4's expanded agentic primitives.
Missing the Android integration path for on-device use cases. Developers building Android apps who encounter Gemma 4 in a server-model context may not realize that E4B runs locally through the ML Kit GenAI Prompt API — no model file bundling, no cloud API keys, no per-call costs. For regulated industries or privacy-sensitive applications, this path is available today.
Strategic Conclusion: The Intelligence-Per-Parameter Inflection Point
The most significant thing about Gemma 4 is not any individual benchmark. It is what the benchmark pattern says about where open-weight models are in 2026.
A 31-billion-parameter model ranking third among all open models in the world — beating systems with 400 billion parameters — is not the normal trajectory of open-source AI. The normal trajectory is "good enough for most tasks, close to but below frontier." Gemma 4 is not "good enough." It is the frontier, at a size that runs on a single consumer GPU.
The Codeforces ELO jump from 110 to 2,150 between Gemma 3 and Gemma 4 is the clearest signal: this is not an iteration. Something changed in the research-to-open-weight pipeline. The same techniques that produce Gemini 3's reasoning capability now transfer more completely into the open-weight release.
The Apache 2.0 license removes the last major enterprise adoption barrier. The edge models with native audio create a capability that neither Llama 4 nor Qwen 3.5 can match at phone-sized deployment. The τ2-bench jump from 6.6% to 86.4% opens agentic use cases that Gemma 3 could not reliably support.
The 26B MoE is the production default for the majority of agentic workloads in 2026. The 31B Dense is for applications where consistency and peak reasoning quality justify the additional compute. The E4B is the Android path. The E2B is the minimum viable on-device intelligence.
Gemma 4 has earned a default position in every developer's model evaluation shortlist. The only remaining question is which variant — and the EDGE Framework answers it.