April 15, 2026 · Complete Guide · Featured · open-source-ai

Gemma 4 Complete Guide (2026): Google's Open Model for Agentic AI on Android & Edge Devices

Google released Gemma 4 on April 2, 2026 — Apache 2.0 licensed, #3 on Arena AI globally, 89.2% on AIME math, 80% on LiveCodeBench. Four variants from 2.3B (runs offline on your phone) to 31B (beats models 20× its size). Every benchmark, every install command, and the EDGE Framework for choosing the right variant.

By Academia Pilot Strategy Team

TL;DR — Key Takeaways

  • Released April 2, 2026 under Apache 2.0 — the first clean commercial license in Gemma's history. No agreements, no usage limits, no legal review.
  • #3 globally on Arena AI open model text leaderboard (31B Dense) — beating models with 20× more parameters.
  • 89.2% on AIME 2026 math (up from Gemma 3's 20.8%) and Codeforces ELO 2,150 (up from 110) — the largest inter-generational reasoning jump of any open model.
  • τ2-bench agentic tool use: 86.4% (up from Gemma 3's 6.6%) — the reliability threshold that unlocks production agentic applications.
  • The 26B MoE activates only 3.8B parameters per token: 97% of 31B Dense quality at 1/8 the compute cost. The production default for most agentic workloads.
  • E2B/E4B edge models run entirely offline on phones, Raspberry Pi, and NVIDIA Jetson Orin Nano — with native audio input unavailable in any competitor at those sizes.

A model with 31 billion parameters should not rank third among all open models in the world, ahead of systems with 400 billion parameters and data center infrastructure requirements. But that is where Gemma 4 sits on the Arena AI text leaderboard, as of the day Google released it on April 2, 2026.

The number that best captures what happened between Gemma 3 and Gemma 4 is not a leaderboard position. It is a Codeforces ELO score. Gemma 3 scored 110. Gemma 4 31B scores 2,150. That is not a percentage improvement. It is a jump from "barely functional at competitive programming" to "expert-level competitive programmer" — in a single model generation. No prior open-source model has made a larger inter-generational leap on that benchmark.

The AIME 2026 mathematics benchmark tells the same story differently: Gemma 3 27B scored 20.8%. Gemma 4 31B scores 89.2%. The 26B MoE — activating only 3.8 billion of its 25.2 billion parameters per token — scores 88.3%.

And then there is the τ2-bench agentic tool use benchmark: Gemma 3 scored 6.6%. Gemma 4 31B scores 86.4%. That number matters more than the others for most developers. It measures not whether a model is intelligent, but whether a model can reliably complete the multi-step tool-calling workflows that production agentic applications require. Gemma 3 essentially could not do it. Gemma 4 is among the best open models in the world at it. Also read: AI Agents Are Failing in Enterprise: Here Is the Real Data

Google released this as an open-weight model under Apache 2.0 — the most permissive commercial license in AI — for the first time in Gemma's history. You can download the weights today, build a product, charge for it, redistribute it, and compete with Google using it. No agreements. No usage limits. No legal review required.

This guide covers all four variants, every verified benchmark, the production deployment decision matrix, the complete Ollama setup, and the honest competitive map — including where Gemma 4 is not the right answer.

What Gemma 4 Actually Is

The Lineage: Built From Gemini 3 Research

Gemma 4 is Google DeepMind's fourth-generation open-weight model family. The official framing: "Built from the same world-class research and technology as Gemini 3." Gemma is not Google's proprietary offering cut down and rebundled. It is the open-weight counterpart — built from the same research pipeline but packaged for self-hosted deployment. You own the weights. Google does not see your data. Your inference costs are your compute costs, not per-token fees to Google's API.
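The "your compute costs, not per-token fees" trade-off is easy to quantify for your own workload. A rough break-even sketch — all prices here are hypothetical placeholders, not quoted rates; substitute your actual GPU rental cost and API pricing:

```python
def breakeven_tokens_per_month(gpu_cost_per_month: float,
                               api_price_per_million_tokens: float) -> float:
    """Monthly token volume above which self-hosting beats a per-token API.

    Both inputs are assumptions you supply; neither is a real quoted price.
    """
    return gpu_cost_per_month / api_price_per_million_tokens * 1_000_000

# Example: $600/month for a rented 24GB GPU vs a $0.50-per-million-token API.
volume = breakeven_tokens_per_month(600.0, 0.50)
print(f"Self-hosting wins above {volume:,.0f} tokens/month")  # 1,200,000,000
```

Below that volume, a hosted API is usually cheaper; above it, owning the weights starts paying for itself — before counting the privacy and data-residency benefits discussed later in this guide.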

The Gemma family launched in February 2024 with 2B and 7B parameter variants. Gemma 2 (mid-2024) expanded to 9B and 27B. Gemma 3 (early 2025) introduced multimodal capabilities for the first time. Gemma 4, released April 2, 2026, introduces a four-variant architecture spanning edge devices to workstation GPUs, a shift to Apache 2.0, and benchmark results that are not incremental over Gemma 3 — they are categorically different. If you are building a comparison with Google's closed models, see our Complete Guide to Gemini Models 2026.

The first three generations built a substantial community: over 400 million cumulative Gemma downloads and more than 100,000 fine-tuned community variants. Google calls this the "Gemmaverse." The INSAIT Bulgarian-first language model BgGPT was built on Gemma. Yale University used Gemma for cell biology research in cancer therapy pathway discovery. The combination of permissive licensing and open weights has produced applied research use cases that closed-weight models cannot enable.

The Four-Variant Architecture — Every Model Explained

Google released four distinct models on April 2, 2026. They are not a scaling continuum — they are four architecturally distinct systems targeting four different deployment scenarios.

Variant 1: E2B — The Smartphone Model

The E2B is Gemma 4's "effective 2 billion" parameter edge model. The "effective" designation matters: the E2B uses Per-Layer Embeddings (PLE), a technique where each decoder layer has its own small embedding for every token. This maximizes parameter efficiency for on-device deployment. The model fits on an 8GB RAM device and runs completely offline.

What makes E2B unique in the market: Native audio input for speech recognition. No competing open model at this size tier offers audio input natively at E2B's hardware footprint. Google co-developed E2B with Qualcomm Technologies and MediaTek specifically for Pixel, mid-range Android phones, and IoT devices.

Variant 2: E4B — The Premium Edge Model

The E4B is Gemma 4's "effective 4 billion" parameter edge model, also using PLE architecture. It produces meaningfully better reasoning quality than E2B while still fitting within the hardware envelope of a high-end phone or a laptop with limited VRAM. Still carries native audio input. Powers Android Studio's Agent Mode locally.

Variant 3: 26B MoE — The Production Default

The 26B Mixture of Experts model has 25.2 billion total parameters organized across 128 small expert networks plus one always-on shared expert. For each token, the routing mechanism selects 8 of the 128 experts to activate — totaling 3.8B active parameters per inference step. All 25.2B must reside in VRAM. The result: 97% of the 31B Dense model's benchmark quality at approximately 1/8 the compute per inference step.
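The routing step described above can be sketched in a few lines. This is a generic top-k softmax router — an illustration of the mechanism, not Gemma 4's actual routing code; the expert count, k=8, and the always-on shared expert follow the description in this section, while everything else is a toy assumption:

```python
import math
import random

def moe_forward(x, router_scores, experts, shared_expert, k=8):
    """Route one token through the k highest-scoring experts plus a shared expert.

    router_scores: one logit per expert for this token (here: precomputed floats).
    experts / shared_expert: callables mapping a float vector to a float vector.
    """
    top_k = sorted(range(len(router_scores)), key=lambda i: router_scores[i])[-k:]
    exps = [math.exp(router_scores[i]) for i in top_k]
    total = sum(exps)
    weights = [e / total for e in exps]        # softmax over the selected k only
    out = shared_expert(x)                     # shared expert always contributes
    for w, i in zip(weights, top_k):
        out = [o + w * e for o, e in zip(out, experts[i](x))]
    return out

# Toy demo: 128 "experts" that just scale the input, 8 active per token.
random.seed(0)
experts = [lambda v, s=random.random(): [s * t for t in v] for _ in range(128)]
scores = [random.random() for _ in range(128)]
y = moe_forward([1.0] * 16, scores, experts, lambda v: list(v))
print(len(y))  # 16
```

The key property the sketch shows: only 8 expert forward passes run per token, but every expert's weights must be resident because any of the 128 can be selected — which is why the full 25.2B must sit in VRAM.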

Variant 4: 31B Dense — The Flagship

The 31B Dense model is Gemma 4's flagship: all 30.7B parameters activate for every token, delivering the highest and most consistent reasoning quality. Dense models have no routing variance — every token sees the full network. Currently ranks #3 among all open models globally on the Arena AI text leaderboard.

Four Distinct Systems

Gemma 4 Model Variants — Which One Belongs in Your Stack?

  • E2B — The Smartphone Model (Edge). 2.3B effective parameters, ~5GB RAM at Q4, 256K context, Dense + PLE architecture. The only model with native audio at smartphone size.
  • E4B — The Premium Edge Model (Edge). 4.5B effective parameters, ~8GB RAM at Q4, 256K context, Dense + PLE architecture. Powers Android Studio Agent Mode locally.
  • 26B MoE — The Production Default (Recommended). 3.8B active parameters (25.2B total in VRAM), ~24GB VRAM at Q4, 256K context, MoE architecture. #6 among open models on Arena AI. 97% of flagship quality at 1/8 the compute cost.
  • 31B Dense — The Flagship. 30.7B parameters, ~24GB VRAM at Q4, 256K context, Dense architecture. #3 globally on Arena AI — beats 400B models.

Quick picks: production default → 26B MoE (97% flagship quality, 1/8 compute); Android path → E4B via ML Kit GenAI Prompt API; maximum quality → 31B Dense (#3 on Arena AI globally).

Model Specification Summary

| Spec | E2B | E4B | 26B MoE | 31B Dense |
|---|---|---|---|---|
| Effective params | 2.3B | 4.5B | 3.8B active | 30.7B |
| Total params | 2.3B | 4.5B | 25.2B | 30.7B |
| Architecture | Dense + PLE | Dense + PLE | MoE (128 experts, 8 active) | Dense |
| Context window | 256K | 256K | 256K | 256K |
| Audio input | ✅ Native | ✅ Native | — | — |
| VRAM (Q4 quant) | ~5GB | ~8GB | ~24GB | ~24GB |
| Target hardware | Phone, Jetson Nano | Laptop, Jetson | Single workstation GPU | A100 / H100 |
| Arena AI rank | — | — | #6 open models | #3 open models |
| AIME 2026 | — | 42.5% | 88.3% | 89.2% |

The Benchmark Story — What the Numbers Actually Mean

AIME 2026: From 20.8% to 89.2%

AIME (American Invitational Mathematics Examination) 2026 is a competition mathematics benchmark testing multi-step reasoning, algebraic manipulation, and mathematical intuition. Gemma 3 27B scored 20.8%. Gemma 4 31B scores 89.2%.

To put 89.2% in context: Llama 4 Scout, which has 109 billion total parameters, scores approximately 88% on AIME 2026 — Gemma 4 matches it with a model 3.5× smaller in total size.

Codeforces ELO 110 → 2,150: What It Measures and Why It Matters

Codeforces is a competitive programming platform where humans (and now models) compete to solve algorithmic problems under time pressure. A Codeforces ELO of 2,150 places a programmer in the "Master" tier — roughly the top 1–2% of all competitive programmers globally.

Gemma 3 27B had a Codeforces ELO of 110 — the equivalent of a beginner. The jump to 2,150 is not an improvement in kind; it is a category change.

τ2-bench: The Agentic Tool Use Benchmark Most Articles Skip

τ2-bench measures an AI agent's ability to complete multi-step tasks requiring calling tools in sequence, interpreting tool outputs, and adapting based on intermediate results. It is the benchmark most directly predictive of whether a model will work reliably as a production agentic backbone.

Gemma 3 27B scored 6.6% on τ2-bench. Gemma 4 31B scores 86.4%. A 6.6% score means the model reliably completes agentic tasks approximately 1 time in 15. An 86.4% score means 6 times in 7 — the reliability threshold for production agentic systems.
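Why 86.4% is a threshold and 6.6% is a non-starter becomes obvious once tasks chain together. Illustrative arithmetic only — it assumes independent task outcomes, which real agent runs are not:

```python
def pipeline_success(task_rate: float, n_tasks: int) -> float:
    """Probability an agent completes n independent tasks in a row."""
    return task_rate ** n_tasks

for rate in (0.066, 0.864):
    print(f"{rate:.1%} per task -> {pipeline_success(rate, 5):.1%} over a 5-task chain")
# 6.6% per task -> 0.0% over a 5-task chain
# 86.4% per task -> 48.1% over a 5-task chain
```

Even at 86.4%, a five-task chain succeeds less than half the time end to end, which is why production agent frameworks layer retries and validation on top of the raw model.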

Gemma 3 → Gemma 4 Benchmark Jumps

This Is Not an Iteration. This Is a Category Change.

Four benchmarks that tell the same story: something fundamentally different happened in the training pipeline.

| Benchmark | What it measures | Gemma 3 | Gemma 4 | Jump |
|---|---|---|---|---|
| AIME 2026 | Competition mathematics | 20.8% | 89.2% | 4.3× |
| Codeforces ELO | Competitive programming | 110 | 2,150 | ~20× |
| τ2-bench | Multi-step tool-use reliability | 6.6% | 86.4% | 13× |
| LiveCodeBench v6 | Real-world coding tasks | 29.1% | 80.0% | 2.7× |

The AIME jump is the largest inter-generational reasoning leap of any open model. The Codeforces jump takes the model from beginner to Master tier (roughly the top 1–2% globally). The τ2-bench jump unlocks an entire class of production agentic applications. LiveCodeBench moves from weak to frontier-class at real coding evaluation.

The τ2-bench number matters most. A 6.6% score means the model completes agentic tool-use tasks roughly 1 time in 15. An 86.4% score means 6 times in 7 — the reliability threshold for production agentic systems. Gemma 3 essentially could not support agentic backends. Gemma 4 is among the best open models in the world at it.

The Apache 2.0 License — Why It Matters More Than Any Benchmark

The Prior License's Enterprise Problem

Gemma 1, 2, and 3 all shipped under a custom "Gemma Open License" that was not Apache 2.0. The custom license contained acceptable use restrictions requiring legal interpretation, Google-specific terms that raised questions about upstream control, no clear commercial redistribution path matching enterprise standards, and provisions creating uncertainty about scaling a product built on Gemma.

The practical result: companies building commercial products on Gemma 3 needed custom legal analysis. Many enterprise legal teams simply marked Gemma as "review required, probable block." Hugging Face CEO Clément Delangue called the Apache 2.0 switch "a huge milestone." That is not marketing language — the friction was real.

What Apache 2.0 Actually Permits

With Apache 2.0, you can:

  • Build a commercial product using Gemma 4 and charge for it without any agreement with Google
  • Fine-tune the model on your proprietary data and keep the fine-tuned weights
  • Redistribute the base model or fine-tuned variants commercially
  • Use Gemma 4 to build products that compete directly with Google's own offerings
  • Run it on any infrastructure without usage reporting requirements

You must include the Apache 2.0 license text in any distribution. That is the complete obligation.

For enterprises that rejected Gemma 3 on legal grounds, Gemma 4 is a different product. The benchmark improvement matters. The license change may matter more for the adoption decision.

The Android & Edge AI Story

This is the part of Gemma 4 that is systematically underreported in benchmark-focused coverage. Google is not just releasing a smaller version of a server model. The E2B and E4B variants are the result of a coordinated multi-year effort with hardware manufacturers to build a complete on-device AI stack for Android.

The Hardware Partnership

The E2B and E4B were co-developed with Qualcomm Technologies and MediaTek. Both chipmakers provide the neural processing unit (NPU) silicon in most high-end and mid-range Android phones. Gemma 4's edge models are optimized for Qualcomm's Snapdragon and MediaTek's Dimensity platforms — not GPU-first models ported to mobile, but models designed from the training process for the inference hardware they will run on.

The practical difference: near-zero inference latency on supported devices. A Snapdragon 8 Gen 3 phone running Gemma 4 E2B can respond to prompts in under 100ms for short sequences — fast enough for real-time voice interaction without the network round-trip that cloud-based voice assistants require.

Gemma 4 Android Integration Stack

A complete edge AI stack — not a feature list. Three layers from prototype to production, with all processing entirely on-device: no data leaves the phone.

Layer 1 — AICore Developer Preview (Developer Preview). For developers building prototype agentic flows.

Layer 2 — ML Kit GenAI Prompt API (Production Ready). The production path for Android app developers: a standardized interface for prompt-based AI features. Model delivery is handled by ML Kit infrastructure — no bundling model weights in your APK, no cloud API keys, no per-call costs. Runs Gemma 4 E4B on-device. The recommended production integration path.

Key capabilities:
  • No model file bundling in the APK
  • ML Kit handles model delivery and updates
  • No cloud API keys required
  • Zero per-call inference costs

Layer 3 — Android Studio Agent Mode (Available Now). For app developers building in Android Studio.

The Offline Privacy Advantage — Regulated Industries

The most underserved market for on-device AI is regulated industries. These sectors have data residency requirements, HIPAA/GDPR obligations, or security policies prohibiting cloud API data transmission. Gemma 4 E4B running via the ML Kit GenAI Prompt API processes all data locally — zero API calls, zero compliance exposure.

  • Healthcare: AI symptom analysis — all processing local, HIPAA-safe
  • Legal: client document processing — zero API exposure
  • Finance: account-data AI analysis — no cloud compliance risk
  • Government: classified data processing — deployable air-gapped

The Complete Benchmark Table — Verified Data

Complete Benchmark Comparison

Verified data — including where Gemma 4 loses. Gemma 4 wins 6 of 10 categories.

| Benchmark | G4 E4B | G4 26B MoE | G4 31B | G3 27B | Llama 4 Scout | Qwen 3.5 27B |
|---|---|---|---|---|---|---|
| AIME 2026 (math) | 42.5% | 88.3% | 89.2% | 20.8% | ~88% | ~85% |
| LiveCodeBench v6 (coding) | — | 77.1% | 80.0% | 29.1% | ~72% | — |
| GPQA Diamond (reasoning) | — | 82.3% | 84.3% | 42.4% | 74.3% | 85.5% |
| MMLU Pro (knowledge) | — | ~84% | 85.2% | — | — | 86.1% |
| τ2-bench (agentic) | — | ~85% | 86.4% | 6.6% | — | — |
| Codeforces ELO (coding) | — | — | 2,150 | 110 | ~1,800 | — |
| Context window | 256K | 256K | 256K | 128K | 10M | 262K |
| Languages | 140 | 140 | 140 | 140 | 200+ | 201 |
| Audio input | ✅ Native | — | — | — | — | — |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 | Custom ❌ | Community ⚠️ | Apache 2.0 |

Where each model wins: Gemma 4 takes math, coding (×2), agentic tool use, audio input at the edge, and Apache 2.0 licensing. Qwen 3.5 edges ahead on GPQA Diamond (85.5% vs 84.3%), MMLU Pro, and 201 languages. Llama 4 Scout wins on its 10M-token context window — the only category where it leads.

Honest read: Gemma 4 31B wins on math (AIME), coding (LiveCodeBench, Codeforces ELO), agentic tool use (τ2-bench), and Arena AI ranking. Qwen 3.5 edges it on GPQA Diamond, MMLU Pro, language count, and context window. Llama 4 Scout's 10-million-token context is unmatched — if that is your requirement, Gemma 4 is not the answer.

The EDGE Framework — Choosing the Right Gemma 4 Variant

EDGE is the deployment selection methodology derived from the benchmark data, hardware constraints, and use case profiles across all four Gemma 4 variants.

EDGE: Environment Constraints → Deployment Scale → Gap Analysis → Ecosystem Fit

The EDGE Framework: E Step — Environment Constraints

Start with hardware, not features. Before any feature comparison, hardware determines which variants are in scope. The environment is the first and non-negotiable filter.

  • On-device / mobile / IoT → E2B or E4B only
  • Single consumer GPU (24GB VRAM) → 26B MoE recommended
  • Workstation / cloud GPU (48–80GB) → 26B MoE or 31B Dense

EDGE-E verdict: edge deployment → E4B; single consumer GPU → 26B MoE; consistency critical → 31B Dense.
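The E-step rules are mechanical enough to express as a small decision function. A minimal sketch of the filter — the category names ("on-device", "consumer-gpu", "server-gpu") are mine, not an official taxonomy:

```python
def edge_e_step(deployment: str, consistency_critical: bool = False) -> str:
    """First EDGE filter: the hardware environment narrows the variant choice."""
    if deployment == "on-device":      # phones, IoT, Jetson-class boards
        return "E2B or E4B"
    if deployment == "consumer-gpu":   # single 24GB-class card
        return "26B MoE"
    if deployment == "server-gpu":     # 48-80GB workstation or cloud GPU
        return "31B Dense" if consistency_critical else "26B MoE or 31B Dense"
    raise ValueError(f"unknown deployment: {deployment}")

print(edge_e_step("consumer-gpu"))                           # 26B MoE
print(edge_e_step("server-gpu", consistency_critical=True))  # 31B Dense
```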

A Critical Counterintuitive Finding on Throughput

The 26B MoE is slower than the 31B Dense on a single consumer GPU. MoE inference requires loading all 25.2B parameters into VRAM even though only 8 of the 128 experts activate per token, and the routing adds overhead. On an RTX 4090 with Q4 quantization, the 31B Dense runs at ~25 tokens/second while the 26B MoE runs at ~11 tokens/second. The MoE's compute efficiency advantage manifests on optimized server infrastructure (A100/H100 with dedicated MoE serving frameworks such as vLLM). For latency-sensitive local applications, test the 31B Dense before assuming the MoE is faster.
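Rather than trusting any published tokens/second figure (including the ones above), measure on your own hardware. Ollama's generate endpoint reports eval_count and eval_duration in its response; a sketch of turning those into tokens/second — the endpoint and field names follow Ollama's REST API, while the model tags are the ones this guide assumes:

```python
import json
import urllib.request

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Ollama reports decode token count and duration (nanoseconds) per response."""
    return eval_count / (eval_duration_ns / 1e9)

def measure_tps(model: str, prompt: str, host: str = "http://localhost:11434") -> float:
    """One non-streaming generate call against a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return tokens_per_second(data["eval_count"], data["eval_duration"])

# Example (requires a running Ollama server) — compare MoE default vs dense flagship:
# for tag in ("gemma4", "gemma4:31b"):
#     print(tag, f"{measure_tps(tag, 'Count to twenty.'):.1f} tok/s")
print(f"{tokens_per_second(120, 6_000_000_000):.1f} tok/s from sample counters")
```

Run the comparison with your actual prompts and context lengths; routing overhead varies with sequence length and quantization.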

Installation Guide — Running Gemma 4 in Under Five Minutes

# The 26B MoE — default and recommended for most workflows
ollama run gemma4

# The 31B Dense — maximum quality
ollama run gemma4:31b

# The 4B edge model — for laptops or limited VRAM
ollama run gemma4:e4b

# The 2B ultra-edge model — for minimum footprint
ollama run gemma4:e2b

Via Hugging Face Transformers (Python)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "google/gemma-4-31b"  # or gemma-4-26b-moe, gemma-4-e4b, gemma-4-e2b

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a Python function that finds the nth Fibonacci number using dynamic programming."}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant turn header before generating
    return_tensors="pt",
    return_dict=True
).to(model.device)

outputs = model.generate(**input_ids, max_new_tokens=1024)
response = tokenizer.decode(outputs[0][input_ids['input_ids'].shape[-1]:], skip_special_tokens=True)
print(response)

Via vLLM (Production Server Deployment)

# Install vLLM
pip install vllm

# Start the 26B MoE server
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-4-26b-a4b \
    --dtype bfloat16 \
    --tensor-parallel-size 2 \
    --max-model-len 131072 \
    --port 8000
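The server above speaks the OpenAI-compatible chat completions protocol, so any OpenAI client can talk to it. A minimal sketch using only the standard library — the model name must match the --model flag used to launch the server, and the endpoint path follows the OpenAI API shape that vLLM's server exposes:

```python
import json
import urllib.request

def build_chat_payload(prompt: str, model: str = "google/gemma-4-26b-a4b",
                       max_tokens: int = 512) -> dict:
    """OpenAI-style chat completion payload for the vLLM server launched above."""
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens}

def chat(prompt: str, base: str = "http://localhost:8000/v1") -> str:
    """POST one chat completion to the local vLLM server and return the text."""
    req = urllib.request.Request(
        f"{base}/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires the server from the command above to be running):
# print(chat("Explain mixture-of-experts routing in two sentences."))
print(build_chat_payload("ping")["model"])
```

Because the protocol is OpenAI-compatible, existing agent frameworks can usually be pointed at this server by changing only the base URL.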

For Android Development (ML Kit GenAI Prompt API)

// In your Android project's build.gradle
dependencies {
    implementation "com.google.mlkit:genai-common:1.0.0"
    implementation "com.google.mlkit:genai-inferencing:1.0.0"
}

// In your Activity or ViewModel
val promptOptions = GenerativeModelOptions.builder()
    .setModelName("gemma-4-e4b")
    .build()

val model = GenerativeModel.getInstance(context, promptOptions)

val request = GenerateContentRequest.builder()
    .addContent(Content.builder().addText("Summarize this document: $documentText").build())
    .build()

model.generateContent(request).addOnSuccessListener { response ->
    val text = response.candidates[0].content.parts[0].text
    // Handle response
}

Common Mistakes Developers Make With Gemma 4

Defaulting to the 31B Dense when the 26B MoE is sufficient. On server infrastructure, the 26B MoE achieves 97% of the 31B Dense's benchmark quality at 1/8 the active compute per inference step. For the vast majority of production workloads, the quality difference is imperceptible to users. The cost difference at scale is not.

Running the 26B MoE on a single consumer GPU and being surprised by throughput. The MoE architecture requires loading all 25.2B parameters into VRAM regardless of how many activate per token. On an RTX 4090, the routing overhead reduces throughput to approximately 11 tokens/second — slower than the 31B Dense at 25 tokens/second on the same hardware.

Treating the 256K context window as interchangeable with Llama 4 Scout's 10M. Gemma 4 cannot analyze a 500K-token codebase in a single context. Llama 4 Scout can. If your use case involves processing entire large codebases or multi-document archives in a single session, Gemma 4 is not the right model regardless of its other qualities.

Ignoring the τ2-bench score when evaluating for agentic applications. For developers evaluating models for agentic workflows, τ2-bench measures whether the model actually completes tool-calling workflows reliably — the number most relevant to production agents. It appears in fewer than 10% of comparison articles developers will find when searching. Gemma 4's 86.4% τ2-bench is the most important number for agentic deployments. See also: Agentic Development 2.0 & OpenAI Codex.

Deploying Gemma 3 workflows unchanged with Gemma 4. The agentic capabilities in Gemma 4 are qualitatively different from Gemma 3. Prompt patterns optimized for Gemma 3's limited tool-use abilities may under-utilize Gemma 4's native function-calling and structured JSON output. Existing Gemma 3 deployments should be re-evaluated to take advantage of Gemma 4's expanded agentic primitives.

Missing the Android integration path for on-device use cases. Developers building Android apps who encounter Gemma 4 in a server-model context may not realize that E4B runs locally through the ML Kit GenAI Prompt API — no model file bundling, no cloud API keys, no per-call costs. For regulated industries or privacy-sensitive applications, this path is available today.
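One of the mistakes above — under-using native function calling — has a concrete fix: prompt for JSON tool calls and validate them before execution. A minimal, model-agnostic sketch of the parsing side; the schema shape and tool names are invented for illustration, not a Gemma-specific format:

```python
import json

# Hypothetical tool registry: tool name -> required argument keys.
TOOLS = {"get_weather": {"required": ["city"]},
         "search_docs": {"required": ["query"]}}

def parse_tool_call(model_output: str) -> tuple[str, dict]:
    """Validate a model's JSON tool call before dispatching it.

    Expects: {"tool": "<name>", "arguments": {...}} — reject anything else
    rather than letting a malformed call reach a real tool.
    """
    call = json.loads(model_output)
    name, args = call["tool"], call.get("arguments", {})
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    missing = [k for k in TOOLS[name]["required"] if k not in args]
    if missing:
        raise ValueError(f"{name} missing arguments: {missing}")
    return name, args

name, args = parse_tool_call('{"tool": "get_weather", "arguments": {"city": "Oslo"}}')
print(name, args)  # get_weather {'city': 'Oslo'}
```

Validation like this is what converts a high τ2-bench score into a dependable production loop: the model proposes, the harness verifies, and only well-formed calls execute.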

Strategic Conclusion: The Intelligence-Per-Parameter Inflection Point

The most significant thing about Gemma 4 is not any individual benchmark. It is what the benchmark pattern says about where open-weight models are in 2026.

A 31-billion-parameter model ranking third among all open models in the world — beating systems with 400 billion parameters — is not the normal trajectory of open-source AI. The normal trajectory is "good enough for most tasks, close to but below frontier." Gemma 4 is not "good enough." It is the frontier, at a size that runs on a single consumer GPU.

The Codeforces ELO jump from 110 to 2,150 between Gemma 3 and Gemma 4 is the clearest signal: this is not an iteration. Something changed in the research-to-open-weight pipeline. The same techniques that produce Gemini 3's reasoning capability now transfer more completely into the open-weight release.

The Apache 2.0 license removes the last major enterprise adoption barrier. The edge models with native audio create a capability that neither Llama 4 nor Qwen 3.5 can match at phone-sized deployment. The τ2-bench jump from 6.6% to 86.4% opens agentic use cases that Gemma 3 could not reliably support.

The 26B MoE is the production default for the majority of agentic workloads in 2026. The 31B Dense is for applications where consistency and peak reasoning quality justify the additional compute. The E4B is the Android path. The E2B is the minimum viable on-device intelligence.

Gemma 4 has earned a default position in every developer's model evaluation shortlist. The only remaining question is which variant — and the EDGE Framework answers it.

Frequently Asked Questions

Common questions about this topic

Q: What is Gemma 4?
Gemma 4 is Google DeepMind's fourth-generation open-weight AI model family, released on April 2, 2026. Built from the same research foundation as Gemini 3, licensed under Apache 2.0, available in four variants: E2B (2.3B effective parameters), E4B (4.5B effective), 26B MoE (3.8B active), and 31B Dense. The 31B Dense ranks #3 globally on the Arena AI open model text leaderboard. All variants natively process text and images; video on 26B and 31B; native audio on E2B and E4B.

Q: Which Gemma 4 variant should I use?
E2B (2.3B): smartphones, Raspberry Pi, NVIDIA Jetson — 8GB RAM. E4B (4.5B): laptops and Android development — 8–16GB RAM. 26B MoE (25.2B total, 3.8B active): server production deployments — 24GB VRAM at Q4. 31B Dense: maximum quality, consistency-critical applications — 24GB VRAM at Q4 or 80GB at bfloat16. Start with the 26B MoE for most production agentic workloads. Note: on a single RTX 4090, the 31B Dense may actually run faster than the 26B MoE (~25 tok/s vs ~11 tok/s) due to MoE routing overhead.

Q: Can I use Gemma 4 commercially?
Yes. Gemma 4 ships under Apache 2.0 — the same license as Linux, Kubernetes, and TensorFlow. Build a commercial product, charge for it, redistribute fine-tuned versions, and compete with Google using Gemma 4 without any agreement or payment. The only obligation is including the Apache 2.0 license text in distributions. This is the first Gemma generation with truly clean commercial terms — Gemma 1 through 3 used a custom Google license that enterprise legal teams frequently flagged as ambiguous.

Q: How does Gemma 4 compare to Llama 4 and Qwen 3.5?
Gemma 4 31B beats Llama 4 Scout (109B total parameters) on AIME 2026 math, LiveCodeBench coding, GPQA Diamond reasoning, and Codeforces ELO. Llama 4 Scout wins on context window (10M vs 256K). Qwen 3.5 27B edges Gemma 4 on GPQA Diamond (85.5% vs 84.3%) and MMLU Pro (86.1% vs 85.2%) and leads on multilingual coverage (201 languages vs 140). Gemma 4 leads on math reasoning, competitive coding, and has the only native audio input in the open-weight small-model market.

Q: What hardware do I need to run Gemma 4?
E2B: any device with 8GB RAM — modern smartphones, Raspberry Pi 5, MacBook Air. E4B: 8–16GB RAM — any modern laptop. 26B MoE: 24GB VRAM at Q4 quantization (RTX 3090, 4090, 5090), 48GB at bfloat16. 31B Dense: 24GB VRAM at Q4 quantization, 80GB at bfloat16 (A100, H100). Counterintuitive note: on a single RTX 4090, the 31B Dense may run faster than the 26B MoE due to MoE routing overhead. The MoE's efficiency advantage manifests on optimized server infrastructure.

Q: Does Gemma 4 run on Android?
Yes. The E2B and E4B models were co-developed with Qualcomm and MediaTek for on-device deployment. Developers deploy Gemma 4 E4B to Android apps via the ML Kit GenAI Prompt API, which handles model delivery without requiring apps to bundle model weights. The AICore Developer Preview enables agentic on-device flows with forward compatibility toward Gemini Nano 4. Android Studio's Agent Mode runs on Gemma 4 E4B locally.

Q: How do I install Gemma 4 with Ollama?
Run: ollama run gemma4 (installs and runs the 26B MoE default), ollama run gemma4:31b (flagship), ollama run gemma4:e4b (laptop/edge model), ollama run gemma4:e2b (minimum footprint). Ollama handles quantization automatically. A 24GB VRAM GPU runs the 26B MoE and 31B Dense comfortably.

Q: How does the 26B MoE architecture work?
The 26B MoE has 25.2 billion total parameters organized across 128 expert networks plus one always-on shared expert. For each token, a routing mechanism selects 8 of 128 experts to activate — totaling 3.8B active parameters per inference step. All 25.2B must reside in VRAM (random-access expert loading requires the full parameter set). On optimized server infrastructure, this architecture delivers 97% of 31B Dense quality at approximately 1/8 the compute per token. The 26B MoE scores 88.3% on AIME 2026 vs the 31B Dense's 89.2%.

Q: Why does the Apache 2.0 license matter?
Gemma 1, 2, and 3 shipped under a custom Gemma Open License with commercial use restrictions, Google-specific terms, and ambiguous provisions that enterprise legal teams frequently flagged as blockers. Companies building products on Gemma 3 needed custom legal analysis for commercial deployment at scale. Gemma 4 switches to Apache 2.0 with no custom clauses — the same license used by Linux, TensorFlow, and Kubernetes. Hugging Face CEO Clément Delangue publicly called this change a huge milestone.

Q: Should I choose Gemma 4 or Qwen 3.5?
Choose Gemma 4 if: on-device deployment (E2B/E4B), audio input at small model sizes, best math reasoning per parameter, clean Apache 2.0 license, tight Google/Android ecosystem integration, or agentic tool use reliability (τ2-bench 86.4%). Choose Qwen 3.5 if: deep multilingual support beyond 140 languages (especially CJK scripts), largest context window beyond 256K, widest model size range, or maximum reasoning at scale with the 397B flagship. Both are Apache 2.0. Both support agentic workflows. Decision: edge/audio/math → Gemma 4, multilingual/extreme-scale → Qwen 3.5.