Key Takeaway
Google Gemma 4 is the most capable open-weight model family ever released under a truly permissive license. The 31B Dense model scores 85.2% on MMLU Pro and ranks 3rd among all open models on Arena AI, while the 26B MoE achieves nearly identical quality with only 3.8B active parameters. For the first time, Gemma ships under Apache 2.0, removing the licensing friction that held back commercial adoption of previous generations.
Google Gemma 4: Everything You Need to Know
Release Overview
Google DeepMind released Gemma 4 on April 2, 2026, introducing four model sizes built on the same technology foundation as Gemini 3. This generation represents the biggest leap in the Gemma family across every dimension: model quality, multimodal capabilities, context length, and licensing terms.
The key changes from Gemma 3:
- Apache 2.0 licensing — no usage restrictions, no custom license, full commercial freedom
- Four model sizes instead of three, including a new MoE architecture
- Native multimodal support across all sizes (text, images, video, audio)
- Configurable thinking mode with 4,000+ token reasoning chains
- 256K context windows on larger models (up from 128K in Gemma 3)
- 35+ supported languages, pre-trained on 140+ languages
- Structured tool use for agentic workflows
The Four Model Sizes
Gemma 4 ships in four distinct sizes, each targeting different deployment scenarios:
| Model | Parameters | Active Params | Architecture | Context | Modalities |
|---|---|---|---|---|---|
| E2B | 2.3B effective | 2.3B | Dense | 128K | Text, Image, Video, Audio |
| E4B | 4.5B effective | 4.5B | Dense | 128K | Text, Image, Video, Audio |
| 26B MoE | 26B total | 3.8B | Mixture of Experts | 256K | Text, Image |
| 31B Dense | 31B | 31B | Dense | 256K | Text, Image |
E2B and E4B: The Edge Models
The smallest Gemma 4 models are designed for on-device deployment. At 2.3B and 4.5B effective parameters respectively, they run on smartphones, tablets, and laptops with as little as 5GB RAM using 4-bit quantization.
What makes these models remarkable is their modality breadth. Despite being the smallest in the family, E2B and E4B are the only Gemma 4 models that support all four input modalities: text, images, video, and audio. This is a deliberate design choice — edge devices with cameras and microphones benefit most from multimodal capabilities.
Both models support 128K token context windows, which is generous for their parameter count and sufficient for most on-device use cases.
26B MoE: Maximum Efficiency
The 26B Mixture of Experts model is arguably the most interesting entry in the Gemma 4 lineup. It contains 26B total parameters but activates only 3.8B per token, giving it roughly the compute cost of the E4B model with access to dramatically more knowledge and capability.
On Arena AI, the 26B MoE ranks 6th among all open models with a score of 1441, despite using only 3.8B active parameters. This efficiency ratio is unprecedented — no other model achieves comparable quality at this compute cost.
The MoE architecture routes each token through specialized expert sub-networks, allowing the model to maintain large knowledge capacity while keeping inference cost low. For deployment scenarios where you need strong reasoning but have limited GPU memory, the 26B MoE is the optimal choice.
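The routing mechanism described above can be sketched in a few lines of NumPy. This is an illustrative top-k router, not Gemma 4's actual implementation; the dimensions, gating function, and expert count are toy values chosen for clarity.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route one token through the top-k of n experts.

    x: (d,) token activation; gate_w: (d, n) router weights;
    experts: list of n (d, d) toy expert weight matrices.
    Only k experts run, so compute scales with k rather than n.
    """
    logits = x @ gate_w                      # (n,) router scores
    top = np.argsort(logits)[-k:]            # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over the selected experts
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n = 8, 16
out = moe_forward(rng.normal(size=d), rng.normal(size=(d, n)),
                  [rng.normal(size=(d, d)) for _ in range(n)], k=2)
print(out.shape)  # (8,)
```

Because only k of n expert matrices are multiplied per token, inference FLOPs scale with k while total knowledge capacity scales with n, which is the trade-off that lets the 26B MoE run at roughly E4B cost.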
31B Dense: Maximum Quality
The 31B Dense model is Gemma 4's flagship. Every parameter is active for every token, giving it the most consistent and highest-quality outputs across all task types.
On Arena AI, the 31B Dense ranks 3rd among all open models with a score of 1452. On MMLU Pro, it achieves 85.2% — competitive with models several times its size. The 89.2% score on AIME 2026 demonstrates strong mathematical reasoning, while 74% on BigBench Extra Hard (up from 19% in previous generations) shows a massive improvement in complex reasoning tasks.
Benchmarks: The Complete Data
Reasoning and Knowledge
| Benchmark | 31B Dense | 26B MoE | Notes |
|---|---|---|---|
| MMLU Pro | 85.2% | — | Graduate-level knowledge |
| AIME 2026 | 89.2% | — | Competition mathematics |
| BigBench Extra Hard | 74% | — | Up from 19% in previous gen |
| Arena AI Score | 1452 (3rd) | 1441 (6th) | Open model rankings |
Source: Google DeepMind technical report
BigBench Extra Hard: The Standout Result
The jump from 19% to 74% on BigBench Extra Hard deserves special attention. This benchmark tests complex multi-step reasoning, logical deduction, and tasks that require genuine understanding rather than pattern matching. A 55-percentage-point improvement in a single generation suggests fundamental advances in Gemma 4's reasoning architecture, not just scaling.
This improvement is likely connected to the configurable thinking mode and the underlying Gemini 3 technology that Gemma 4 is built on. The thinking mode generates extended reasoning chains that help the model work through complex problems step by step.
Arena AI Rankings Context
Arena AI ranks models based on head-to-head human preference comparisons. The 31B Dense scoring 1452 and ranking 3rd among open models places it above many models with significantly more parameters. For context:
- Models ranking above it are typically 70B+ parameter models
- The 26B MoE achieving 1441 with only 3.8B active parameters is an efficiency breakthrough
- Both models outperform the previous Gemma 3 27B by a significant margin
Multimodal Capabilities
Image Understanding
All four Gemma 4 models process images natively. Capabilities include:
- Image description and analysis — detailed understanding of visual content
- OCR and document parsing — extracting text from images, receipts, screenshots
- Chart and diagram interpretation — understanding data visualizations
- Visual reasoning — answering questions that require understanding spatial relationships
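As a concrete illustration of how an image question might be packaged for inference, here is a minimal message builder. The content-part layout (a text part plus a base64-encoded image part) is a common multimodal chat convention, not necessarily Gemma 4's exact wire format; check the model card for the real schema.

```python
import base64

def image_message(image_bytes, question):
    """Build a multimodal chat message pairing a question with an image.

    The {"type": ..., ...} content-part structure is an illustrative
    convention, not a documented Gemma 4 format.
    """
    return {"role": "user", "content": [
        {"type": "text", "text": question},
        {"type": "image", "data": base64.b64encode(image_bytes).decode("ascii")},
    ]}

msg = image_message(b"\x89PNG fake bytes", "What trend does this chart show?")
print(msg["content"][1]["type"])  # image
```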
Video and Audio (E2B/E4B Only)
The smaller E2B and E4B models add native video and audio processing:
- Video understanding — analyzing video content without frame-by-frame extraction
- Audio transcription and understanding — processing speech and environmental audio
- Cross-modal reasoning — answering questions that span text, image, video, and audio inputs
This design choice reflects Google's focus on edge deployment. Mobile devices capture video and audio natively, so the models designed for those devices support those modalities.
Configurable Thinking Mode
Gemma 4 introduces a configurable thinking mode that generates 4,000+ tokens of internal reasoning before producing a response. This is similar to the extended thinking capabilities seen in Claude's models and OpenAI's o-series, but implemented in an open-weight model.
How It Works
When thinking mode is enabled, the model:
- Receives the input prompt
- Generates an internal reasoning chain (visible or hidden, depending on configuration)
- Uses the reasoning chain to produce a higher-quality final response
The thinking mode can be toggled per request, allowing developers to:
- Enable thinking for complex math, logic, coding, and analysis tasks
- Disable thinking for simple queries, chat, and latency-sensitive applications
- Adjust thinking depth based on the expected complexity of the task
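A per-request toggle might look like the sketch below. The option names `thinking` and `thinking_budget` are hypothetical placeholders invented for illustration; consult the actual Gemma 4 serving documentation for the real field names.

```python
def build_request(prompt, thinking=False, thinking_budget=4096):
    """Assemble a chat request with an optional thinking toggle.

    "thinking" and "thinking_budget" are hypothetical option names;
    the real serving API may use different fields.
    """
    req = {"model": "gemma4:31b",
           "messages": [{"role": "user", "content": prompt}]}
    if thinking:
        req["options"] = {"thinking": True, "thinking_budget": thinking_budget}
    return req

fast = build_request("What is the capital of France?")                 # latency-sensitive: skip thinking
hard = build_request("Prove that sqrt(2) is irrational.", thinking=True)
print("options" in fast, "options" in hard)  # False True
```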
Impact on Quality
The thinking mode is a primary driver behind Gemma 4's strong benchmark performance. The AIME 2026 score of 89.2% and BigBench Extra Hard score of 74% are both achieved with thinking mode enabled. Without thinking mode, these scores would be notably lower — similar to the pattern seen in other models with extended reasoning capabilities.
Apache 2.0: Why the License Change Matters
Previous Gemma generations shipped under Google's custom Gemma license, which imposed:
- Restrictions on use in certain applications
- Conditions on redistribution
- Limits on large-scale commercial deployment
Gemma 4 switches to Apache 2.0, the same license used by projects like Kubernetes, TensorFlow, and Apache HTTP Server. This means:
- No usage restrictions — use it for anything, including commercial products
- No redistribution limitations — share modified weights freely
- No attribution requirements beyond the standard Apache 2.0 notice
- No Google approval needed — deploy at any scale without permission
- Compatible with other open-source licenses — easy to integrate into existing projects
For enterprises and startups building products on top of open models, this removes the legal review overhead that Gemma's custom license required. It also makes Gemma 4 directly comparable to Meta's Llama models (which use their own custom license with some restrictions) and positions it as the most permissively licensed high-quality open model family available.
Language Support
Gemma 4 supports 35+ languages for inference and was pre-trained on 140+ languages. This makes it one of the most multilingual open models available, alongside Qwen's models, which also emphasize broad language coverage.
Supported languages include major world languages (English, Chinese, Spanish, French, German, Japanese, Korean, Arabic, Hindi, Portuguese, Russian) as well as many languages with smaller digital footprints. The pre-training on 140+ languages means the model has some capability in languages beyond the officially supported 35+, though quality may vary.
For applications targeting global audiences or non-English markets, this broad language support reduces the need for specialized fine-tuning or separate models per language.
Structured Tool Use and Agentic Workflows
Gemma 4 includes native support for structured tool use, enabling agentic workflows where the model can:
- Call external APIs with properly formatted requests
- Parse structured responses from tools and services
- Chain multiple tool calls to complete complex tasks
- Handle errors and retries in tool execution
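The loop implied by these capabilities can be sketched generically: call the model, execute any tool it requests, feed the result back, and repeat until a final answer arrives. The message and tool-call schema below is an illustrative convention rather than Gemma 4's documented format, and the model is replaced by a stub.

```python
import json

# Toy tool registry; a real agent would wrap actual APIs.
TOOLS = {"get_weather": lambda city: {"city": city, "temp_c": 21}}

def run_agent(model_step, user_msg, max_turns=4):
    """Feed tool results back to the model until it returns a final answer."""
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_turns):
        reply = model_step(messages)           # one model call
        if reply.get("tool_call") is None:
            return reply["content"]            # no tool requested: final answer
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("agent did not converge")

# Stub model: first requests a tool, then answers with the result it saw.
def stub(messages):
    if messages[-1]["role"] == "user":
        return {"tool_call": {"name": "get_weather", "arguments": {"city": "Oslo"}}}
    return {"tool_call": None, "content": messages[-1]["content"]}

print(run_agent(stub, "Weather in Oslo?"))
```

The error-and-retry handling mentioned above would live around the `TOOLS[...]` call, catching exceptions and appending an error message for the model to react to.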
This capability is particularly relevant for Android Studio integration, where Gemma 4 powers local agentic coding workflows. The model can understand code context, suggest changes, execute tools, and iterate — all running locally on the developer's machine without sending code to external servers.
For developers building AI agents, Gemma 4's structured tool use provides a fully local, fully private foundation. Combined with the Apache 2.0 license, this enables building and deploying agentic applications without any dependency on external model providers.
Hardware Requirements
Local Deployment via Ollama
| Model | RAM Required (4-bit) | RAM Required (FP16) | GPU Recommendation |
|---|---|---|---|
| E2B | ~2 GB | ~5 GB | Any modern GPU / CPU only |
| E4B | ~5 GB | ~9 GB | Any modern GPU / CPU only |
| 26B MoE | ~18 GB | ~52 GB | RTX 4090 / RTX 5090 |
| 31B Dense | ~20 GB | ~62 GB | RTX 4090 / RTX 5090 |
The E2B and E4B models are specifically designed for edge deployment. They run comfortably on laptops, desktop CPUs, and even some smartphones. The 26B MoE and 31B Dense models require dedicated GPU hardware but remain accessible to individual developers with consumer GPUs.
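The table's figures can be sanity-checked with a back-of-the-envelope calculation. This estimates weight memory only; the table's numbers run higher because they also include KV cache and runtime overhead.

```python
def est_weight_gb(params_b, bits):
    """Weight-only memory in GiB for a params_b-billion-parameter model
    at the given precision. Ignores KV cache and runtime overhead."""
    return params_b * 1e9 * bits / 8 / 1024**3

for name, p in [("E4B", 4.5), ("26B MoE", 26.0), ("31B Dense", 31.0)]:
    print(f"{name}: ~{est_weight_gb(p, 4):.1f} GiB @ 4-bit, "
          f"~{est_weight_gb(p, 16):.1f} GiB @ FP16")
```

Note that the MoE's memory footprint follows its 26B total parameters, not its 3.8B active parameters: all experts must be resident, even though only a few run per token.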
NVIDIA Optimization
NVIDIA has released optimized versions of Gemma 4 for RTX GPUs, providing:
- Faster inference through GPU-specific kernel optimizations
- Better memory utilization on RTX 4000 and 5000 series cards
- TensorRT integration for production deployment
- CUDA graph support for reduced overhead in repeated inference
What Changed from Gemma 3
| Feature | Gemma 3 | Gemma 4 |
|---|---|---|
| License | Gemma License (restricted) | Apache 2.0 (unrestricted) |
| Model Sizes | 3 sizes | 4 sizes (added MoE) |
| Context Window | Up to 128K | Up to 256K |
| Modalities | Text, Image | Text, Image, Video, Audio |
| Thinking Mode | No | Yes (configurable) |
| Tool Use | Limited | Structured tool use |
| Languages | 30+ | 35+ (pre-trained on 140+) |
| BigBench Extra Hard | 19% | 74% |
Every dimension improved. The most impactful changes for developers are the Apache 2.0 license (removes legal friction), the thinking mode (improves quality on hard tasks), and the MoE architecture (near-flagship quality at a fraction of the compute cost).
Practical Use Cases
Coding and Development
Gemma 4's structured tool use and thinking mode make it effective for:
- Local code completion and generation
- Code review and bug detection
- Automated test generation
- Documentation writing
- Agentic coding workflows in Android Studio
Document Processing
With 256K context windows and multimodal support:
- Process entire codebases or long documents in a single prompt
- Extract information from images of documents, receipts, and forms
- Analyze charts and data visualizations
- Summarize lengthy research papers or legal documents
Building AI-Powered Applications
For developers building products that incorporate AI capabilities, Gemma 4 provides a strong on-device or self-hosted inference layer. The model handles the intelligence — understanding queries, generating responses, processing images — while your application framework handles the rest. Tools like ZBuild can accelerate building the application shell (frontend, backend, database, deployment), letting you focus development effort on the AI integration layer where Gemma 4's capabilities matter most.
Edge and Mobile Deployment
The E2B and E4B models open up use cases that were previously impossible with open models:
- On-device assistants that work offline
- Privacy-preserving AI features that never send data to external servers
- Real-time video and audio processing on mobile devices
- Embedded AI in IoT and robotics applications
How to Get Started
Ollama (Fastest Path)
```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Gemma 4
ollama run gemma4:e2b       # Smallest, runs anywhere
ollama run gemma4:e4b       # Small, broader capability
ollama run gemma4:26b-moe   # MoE, best efficiency
ollama run gemma4:31b       # Dense, highest quality
```
Hugging Face
All Gemma 4 models are available on Hugging Face with full transformers integration:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "google/gemma-4-31b" follows the article's naming; check Hugging Face for the exact repo ID
model = AutoModelForCausalLM.from_pretrained("google/gemma-4-31b", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31b")
```
Google AI Studio
Google provides free API access to Gemma 4 through AI Studio for experimentation and prototyping, with Vertex AI available for production deployment.
Gemma 4 in the Competitive Landscape
To understand where Gemma 4 sits in the broader ecosystem:
| Model | Params | License | MMLU Pro | Arena AI | Context |
|---|---|---|---|---|---|
| Gemma 4 31B | 31B | Apache 2.0 | 85.2% | 1452 | 256K |
| Gemma 4 26B MoE | 26B (3.8B active) | Apache 2.0 | — | 1441 | 256K |
| Llama 4 Maverick | 400B (~17B active) | Meta License | 79.6% | 1417 | 1M |
| Llama 4 Scout | 109B (~17B active) | Meta License | — | ~1400 | 10M |
| Qwen 3.5 72B | 72B | Apache 2.0 | 81.4% | 1438 | 128K |
| Qwen 3.5 MoE | 397B (~22B active) | Apache 2.0 | 83.1% | 1449 | 128K |
Among the open models in this comparison, Gemma 4 31B posts the highest MMLU Pro score and Arena AI score, and does so with the fewest total parameters. This parameter efficiency is a direct result of the Gemini 3 technology foundation and the configurable thinking mode.
The 26B MoE model's efficiency story is even more compelling. It ranks 6th on Arena AI while activating only 3.8B parameters per token. No other model achieves a comparable quality-to-compute ratio. For production deployments where inference cost scales with usage, this efficiency translates directly into cost savings.
Compared to proprietary models, Gemma 4 31B's benchmarks are competitive with mid-tier offerings from Anthropic and OpenAI. While the top proprietary models still lead on the hardest tasks, the gap has narrowed dramatically — and Gemma 4 comes with zero per-token cost and full Apache 2.0 freedom.
Verdict
Gemma 4 sets a new standard for open-weight models in 2026. The combination of Apache 2.0 licensing, four well-differentiated model sizes, native multimodal support, configurable thinking mode, and benchmark scores competitive with much larger models makes it the most practical open model family available.
The 31B Dense is the right choice when you need maximum quality. The 26B MoE is the right choice when you need strong quality at minimum compute cost. The E2B and E4B are the right choices for edge deployment and on-device AI. For the first time in the Gemma family, the license does not limit any of these use cases.
Sources
- Introducing Gemma 4 - Google Blog
- Gemma 4 Technical Report - Google DeepMind
- Gemma 4 on Hugging Face
- Gemma 4 Ollama Models
- NVIDIA Gemma 4 RTX Optimization
- Gemma 4 Arena AI Rankings
- Gemma 4 Android Studio Integration
- Apache 2.0 License
- Gemma 4 Benchmark Analysis - Artificial Analysis
- Gemma 4 Overview - Google AI for Developers