ZBuild News

Google Gemma 4: Complete Guide to Specs, Benchmarks, and What's New (2026)

Everything you need to know about Google Gemma 4 — the first Apache 2.0 licensed Gemma release. Covers all 4 model sizes (E2B, E4B, 26B MoE, 31B Dense), multimodal capabilities, configurable thinking mode, 256K context, 85.2% MMLU Pro, and hardware requirements for local deployment.

Published: April 3, 2026
Author: ZBuild Team
Reading time: 13 min

Key Takeaway

Google Gemma 4 is the most capable open-weight model family ever released under a truly permissive license. The 31B Dense model scores 85.2% on MMLU Pro and ranks 3rd among all open models on Arena AI — while the 26B MoE achieves nearly identical quality with only 3.8B active parameters. For the first time, Gemma ships under Apache 2.0, removing every licensing friction that held back commercial adoption of previous generations.


Google Gemma 4: Everything You Need to Know

Release Overview

Google DeepMind released Gemma 4 on April 2, 2026, introducing four model sizes built on the same technology foundation as Gemini 3. This generation represents the biggest leap in the Gemma family across every dimension: model quality, multimodal capabilities, context length, and licensing terms.

The key changes from Gemma 3:

  • Apache 2.0 licensing — no usage restrictions, no custom license, full commercial freedom
  • Four model sizes instead of three, including a new MoE architecture
  • Native multimodal support across all sizes (text, images, video, audio)
  • Configurable thinking mode with 4,000+ token reasoning chains
  • 256K context windows on larger models (double Gemma 3's 128K maximum)
  • 35+ supported languages, pre-trained on 140+ languages
  • Structured tool use for agentic workflows

The Four Model Sizes

Gemma 4 ships in four distinct sizes, each targeting different deployment scenarios:

| Model | Parameters | Active Params | Architecture | Context | Modalities |
|---|---|---|---|---|---|
| E2B | 2.3B effective | 2.3B | Dense | 128K | Text, Image, Video, Audio |
| E4B | 4.5B effective | 4.5B | Dense | 128K | Text, Image, Video, Audio |
| 26B MoE | 26B total | 3.8B | Mixture of Experts | 256K | Text, Image |
| 31B Dense | 31B | 31B | Dense | 256K | Text, Image |

Source: Google AI Blog

E2B and E4B: The Edge Models

The smallest Gemma 4 models are designed for on-device deployment. At 2.3B and 4.5B effective parameters respectively, they run on smartphones, tablets, and laptops with as little as 5GB RAM using 4-bit quantization.

What makes these models remarkable is their modality breadth. Despite being the smallest in the family, E2B and E4B are the only Gemma 4 models that support all four input modalities: text, images, video, and audio. This is a deliberate design choice — edge devices with cameras and microphones benefit most from multimodal capabilities.

Both models support 128K token context windows, which is generous for their parameter count and sufficient for most on-device use cases.

26B MoE: Maximum Efficiency

The 26B Mixture of Experts model is arguably the most interesting model in the Gemma 4 lineup. It contains 26B total parameters but only activates 3.8B parameters for any given input — roughly the same compute cost as the E4B model but with access to dramatically more knowledge and capability.

On Arena AI, the 26B MoE ranks 6th among all open models with a score of 1441, despite using only 3.8B active parameters. This efficiency ratio is unprecedented — no other model achieves comparable quality at this compute cost.

The MoE architecture routes each token through specialized expert sub-networks, allowing the model to maintain large knowledge capacity while keeping inference cost low. For deployment scenarios where you need strong reasoning but have limited GPU memory, the 26B MoE is the optimal choice.
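The routing idea can be sketched in a few lines of NumPy. This is a toy illustration only: the dimensions, gate, and experts below are made up, and Gemma 4's actual router is certainly more sophisticated. The point is that only k expert networks run per token, which is why active parameters (3.8B) sit far below total parameters (26B).

```python
# Toy sketch of top-k expert routing, the core idea behind an MoE layer.
import numpy as np

def route_token(token_vec, gate_weights, experts, k=2):
    """Send one token through the k highest-scoring experts."""
    scores = token_vec @ gate_weights            # one score per expert
    top_k = np.argsort(scores)[-k:]              # indices of the k best experts
    probs = np.exp(scores[top_k])
    probs /= probs.sum()                         # softmax over the chosen k
    # Only these k expert networks execute; the rest stay idle.
    return sum(p * experts[i](token_vec) for p, i in zip(probs, top_k))

rng = np.random.default_rng(0)
dim, n_experts = 16, 8
gate = rng.normal(size=(dim, n_experts))
experts = [(lambda W: (lambda x: x @ W))(rng.normal(size=(dim, dim)))
           for _ in range(n_experts)]
out = route_token(rng.normal(size=dim), gate, experts, k=2)
print(out.shape)
```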

31B Dense: Maximum Quality

The 31B Dense model is Gemma 4's flagship. Every parameter is active for every token, giving it the most consistent and highest-quality outputs across all task types.

On Arena AI, the 31B Dense ranks 3rd among all open models with a score of 1452. On MMLU Pro, it achieves 85.2% — competitive with models several times its size. The 89.2% score on AIME 2026 demonstrates strong mathematical reasoning, while 74% on BigBench Extra Hard (up from 19% in previous generations) shows a massive improvement in complex reasoning tasks.


Benchmarks: The Complete Data

Reasoning and Knowledge

| Benchmark | 31B Dense | 26B MoE | Notes |
|---|---|---|---|
| MMLU Pro | 85.2% | n/a | Graduate-level knowledge |
| AIME 2026 | 89.2% | n/a | Competition mathematics |
| BigBench Extra Hard | 74% | n/a | Up from 19% in previous gen |
| Arena AI Score | 1452 (3rd) | 1441 (6th) | Open model rankings |

Source: Google DeepMind technical report

BigBench Extra Hard: The Standout Result

The jump from 19% to 74% on BigBench Extra Hard deserves special attention. This benchmark tests complex multi-step reasoning, logical deduction, and tasks that require genuine understanding rather than pattern matching. A 55-percentage-point improvement in a single generation suggests fundamental advances in Gemma 4's reasoning architecture, not just scaling.

This improvement is likely connected to the configurable thinking mode and the underlying Gemini 3 technology that Gemma 4 is built on. The thinking mode generates extended reasoning chains that help the model work through complex problems step by step.

Arena AI Rankings Context

Arena AI ranks models based on head-to-head human preference comparisons. The 31B Dense scoring 1452 and ranking 3rd among open models places it above many models with significantly more parameters. For context:

  • Models ranking above it are typically 70B+ parameter models
  • The 26B MoE achieving 1441 with only 3.8B active parameters is an efficiency breakthrough
  • Both models outperform the previous Gemma 3 27B by a significant margin
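Arena-style leaderboards are fitted from pairwise human votes with an Elo / Bradley-Terry model, so a rating gap maps directly to an expected head-to-head win rate. The sketch below uses the generic Elo expectation formula (an assumption; Arena AI's exact fitting method may differ) with the scores quoted above:

```python
# Expected win rate under the standard Elo model.
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B, given Elo-style ratings."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# 31B Dense (1452) vs 26B MoE (1441): an 11-point gap is nearly a coin flip.
p = expected_win_rate(1452, 1441)
print(f"{p:.3f}")
```

An 11-point gap corresponds to roughly a 51.6% win rate, which is why the two models are described as achieving nearly identical quality.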

Multimodal Capabilities

Image Understanding

All four Gemma 4 models process images natively. Capabilities include:

  • Image description and analysis — detailed understanding of visual content
  • OCR and document parsing — extracting text from images, receipts, screenshots
  • Chart and diagram interpretation — understanding data visualizations
  • Visual reasoning — answering questions that require understanding spatial relationships

Video and Audio (E2B/E4B Only)

The smaller E2B and E4B models add native video and audio processing:

  • Video understanding — analyzing video content without frame-by-frame extraction
  • Audio transcription and understanding — processing speech and environmental audio
  • Cross-modal reasoning — answering questions that span text, image, video, and audio inputs

This design choice reflects Google's focus on edge deployment. Mobile devices capture video and audio natively, so the models designed for those devices support those modalities.


Configurable Thinking Mode

Gemma 4 introduces a configurable thinking mode that generates 4,000+ tokens of internal reasoning before producing a response. This is similar to the extended thinking capabilities seen in Claude's models and OpenAI's o-series, but implemented in an open-weight model.

How It Works

When thinking mode is enabled, the model:

  1. Receives the input prompt
  2. Generates an internal reasoning chain (visible or hidden, depending on configuration)
  3. Uses the reasoning chain to produce a higher-quality final response

The thinking mode can be toggled per request, allowing developers to:

  • Enable thinking for complex math, logic, coding, and analysis tasks
  • Disable thinking for simple queries, chat, and latency-sensitive applications
  • Adjust thinking depth based on the expected complexity of the task

Impact on Quality

The thinking mode is a primary driver behind Gemma 4's strong benchmark performance. The AIME 2026 score of 89.2% and BigBench Extra Hard score of 74% are both achieved with thinking mode enabled. Without thinking mode, these scores would be notably lower — similar to the pattern seen in other models with extended reasoning capabilities.
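The per-request toggle can be sketched against Ollama's HTTP chat endpoint. Recent Ollama releases accept a `think` flag on `/api/chat` for reasoning-capable models; whether the Gemma 4 tags wire into it in exactly this form is an assumption here. The sketch only builds the request payload:

```python
# Build an Ollama /api/chat payload with thinking toggled per request.
# The "think" flag and the "gemma4:31b" tag are assumptions for illustration.
import json

def build_chat_request(prompt: str, think: bool) -> dict:
    """Construct a chat request with the reasoning chain on or off."""
    return {
        "model": "gemma4:31b",
        "messages": [{"role": "user", "content": prompt}],
        "think": think,   # True: emit internal reasoning before the answer
        "stream": False,
    }

# Deep reasoning for a hard problem, fast path for a trivial one.
hard = build_chat_request("Prove that sqrt(2) is irrational.", think=True)
easy = build_chat_request("What's the capital of France?", think=False)
print(json.dumps(hard, indent=2))
```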


Apache 2.0: Why the License Change Matters

Previous Gemma generations shipped under Google's custom Gemma license, which included restrictions on:

  • Usage in certain applications
  • Redistribution terms
  • Commercial deployment limitations for large-scale use

Gemma 4 switches to Apache 2.0, the same license used by projects like Kubernetes, TensorFlow, and Apache HTTP Server. This means:

  • No usage restrictions — use it for anything, including commercial products
  • No redistribution limitations — share modified weights freely
  • No attribution requirements beyond the license — standard Apache 2.0 notice
  • No Google approval needed — deploy at any scale without permission
  • Compatible with other open-source licenses — easy to integrate into existing projects

For enterprises and startups building products on top of open models, this removes the legal review overhead that Gemma's custom license required. It also makes Gemma 4 directly comparable to Meta's Llama models (which use their own custom license with some restrictions) and positions it as the most permissively licensed high-quality open model family available.


Language Support

Gemma 4 supports 35+ languages for inference and was pre-trained on 140+ languages. This makes it one of the most multilingual open models available, alongside Qwen's models which also emphasize broad language coverage.

Supported languages include major world languages (English, Chinese, Spanish, French, German, Japanese, Korean, Arabic, Hindi, Portuguese, Russian) as well as many languages with smaller digital footprints. The pre-training on 140+ languages means the model has some capability in languages beyond the officially supported 35+, though quality may vary.

For applications targeting global audiences or non-English markets, this broad language support reduces the need for specialized fine-tuning or separate models per language.


Structured Tool Use and Agentic Workflows

Gemma 4 includes native support for structured tool use, enabling agentic workflows where the model can:

  • Call external APIs with properly formatted requests
  • Parse structured responses from tools and services
  • Chain multiple tool calls to complete complex tasks
  • Handle errors and retries in tool execution

This capability is particularly relevant for Android Studio integration, where Gemma 4 powers local agentic coding workflows. The model can understand code context, suggest changes, execute tools, and iterate — all running locally on the developer's machine without sending code to external servers.

For developers building AI agents, Gemma 4's structured tool use provides a fully local, fully private foundation. Combined with the Apache 2.0 license, this enables building and deploying agentic applications without any dependency on external model providers.
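The loop an application runs around this capability can be sketched as follows: the model emits a structured call, the application dispatches it, and the result (or error) is fed back so the model can retry. The JSON shape and the `get_weather` tool are illustrative assumptions, not a schema Gemma 4 defines:

```python
# Minimal dispatch step for a structured tool-use loop.
import json

TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",  # stand-in for a real API
}

def dispatch(tool_call_json: str) -> str:
    """Execute one structured tool call and return its result as text."""
    call = json.loads(tool_call_json)
    fn = TOOLS.get(call["tool"])
    if fn is None:
        return f"error: unknown tool {call['tool']!r}"  # fed back for retry
    try:
        return fn(**call["arguments"])
    except TypeError as e:
        return f"error: bad arguments ({e})"            # model can self-correct

# In a real loop, the model would produce this string; it is hard-coded here.
result = dispatch('{"tool": "get_weather", "arguments": {"city": "Berlin"}}')
print(result)
```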


Hardware Requirements

Local Deployment via Ollama

| Model | RAM Required (4-bit) | RAM Required (FP16) | GPU Recommendation |
|---|---|---|---|
| E2B | ~5 GB | ~5 GB | Any modern GPU / CPU only |
| E4B | ~5 GB | ~9 GB | Any modern GPU / CPU only |
| 26B MoE | ~18 GB | ~52 GB | RTX 4090 / RTX 5090 |
| 31B Dense | ~20 GB | ~62 GB | RTX 4090 / RTX 5090 |

Source: Ollama model library

The E2B and E4B models are specifically designed for edge deployment. They run comfortably on laptops, desktop CPUs, and even some smartphones. The 26B MoE and 31B Dense models require dedicated GPU hardware but remain accessible to individual developers with consumer GPUs.
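The FP16 column above is essentially parameter count times two bytes per weight. A quick sanity check; the gap between raw weight size and the quoted 4-bit figures reflects runtime overhead such as the KV cache and inference buffers:

```python
# Raw weight memory: parameters x bytes per parameter.
def weight_gb(params_b: float, bits: int) -> float:
    """Weight size in GB for `params_b` billion parameters at `bits` precision."""
    return params_b * bits / 8

print(weight_gb(31, 16))  # 62.0 GB, matching the ~62 GB FP16 row
print(weight_gb(31, 4))   # 15.5 GB; the ~20 GB row adds runtime overhead
```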

NVIDIA Optimization

NVIDIA has released optimized versions of Gemma 4 for RTX GPUs, providing:

  • Faster inference through GPU-specific kernel optimizations
  • Better memory utilization on RTX 4000 and 5000 series cards
  • TensorRT integration for production deployment
  • CUDA graph support for reduced overhead in repeated inference

Source: NVIDIA AI Blog


What Changed from Gemma 3

| Feature | Gemma 3 | Gemma 4 |
|---|---|---|
| License | Gemma License (restricted) | Apache 2.0 (unrestricted) |
| Model Sizes | 3 sizes | 4 sizes (added MoE) |
| Context Window | Up to 128K | Up to 256K |
| Modalities | Text, Image | Text, Image, Video, Audio |
| Thinking Mode | No | Yes (configurable) |
| Tool Use | Limited | Structured tool use |
| Languages | 30+ | 35+ (pre-trained on 140+) |
| BigBench Extra Hard | 19% | 74% |

Every dimension improved. The most impactful changes for developers are the Apache 2.0 license (removes legal friction), the thinking mode (improves quality on hard tasks), and the MoE architecture (provides flagship-quality at a fraction of the compute cost).


Practical Use Cases

Coding and Development

Gemma 4's structured tool use and thinking mode make it effective for:

  • Local code completion and generation
  • Code review and bug detection
  • Automated test generation
  • Documentation writing
  • Agentic coding workflows in Android Studio

Document Processing

With 256K context windows and multimodal support:

  • Process entire codebases or long documents in a single prompt
  • Extract information from images of documents, receipts, and forms
  • Analyze charts and data visualizations
  • Summarize lengthy research papers or legal documents
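When a document does exceed the window, a crude splitter keeps each chunk under budget. The 4-characters-per-token ratio below is a rough English-text heuristic, not Gemma 4's actual tokenizer:

```python
# Split text into chunks that each fit a token budget.
def split_to_budget(text: str, max_tokens: int = 256_000,
                    chars_per_token: float = 4.0) -> list[str]:
    """Split `text` into pieces that each fit the context window."""
    max_chars = int(max_tokens * chars_per_token)
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

doc = "x" * 2_100_000            # roughly 525K tokens: too big for one prompt
chunks = split_to_budget(doc)
print(len(chunks))
```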

Building AI-Powered Applications

For developers building products that incorporate AI capabilities, Gemma 4 provides a strong on-device or self-hosted inference layer. The model handles the intelligence — understanding queries, generating responses, processing images — while your application framework handles the rest. Tools like ZBuild can accelerate building the application shell (frontend, backend, database, deployment), letting you focus development effort on the AI integration layer where Gemma 4's capabilities matter most.

Edge and Mobile Deployment

The E2B and E4B models open up use cases that were previously impossible with open models:

  • On-device assistants that work offline
  • Privacy-preserving AI features that never send data to external servers
  • Real-time video and audio processing on mobile devices
  • Embedded AI in IoT and robotics applications

How to Get Started

Ollama (Fastest Path)

```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Gemma 4
ollama run gemma4:e2b      # Smallest, runs anywhere
ollama run gemma4:e4b      # Small, broader capability
ollama run gemma4:26b-moe  # MoE, best efficiency
ollama run gemma4:31b      # Dense, highest quality
```

Hugging Face

All Gemma 4 models are available on Hugging Face with full transformers integration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map="auto" (requires the accelerate package) spreads weights across
# available GPUs; torch_dtype="auto" loads the checkpoint's native precision.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-31b", device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31b")
```

Google AI Studio

Google provides free API access to Gemma 4 through AI Studio for experimentation and prototyping, with Vertex AI available for production deployment.


Gemma 4 in the Competitive Landscape

To understand where Gemma 4 sits in the broader ecosystem:

| Model | Params | License | MMLU Pro | Arena AI | Context |
|---|---|---|---|---|---|
| Gemma 4 31B | 31B | Apache 2.0 | 85.2% | 1452 | 256K |
| Gemma 4 26B MoE | 26B (3.8B active) | Apache 2.0 | n/a | 1441 | 256K |
| Llama 4 Maverick | 400B (~17B active) | Meta License | 79.6% | 1417 | 1M |
| Llama 4 Scout | 109B (~17B active) | Meta License | n/a | ~1400 | 10M |
| Qwen 3.5 72B | 72B | Apache 2.0 | 81.4% | 1438 | 128K |
| Qwen 3.5 MoE | 397B (~22B active) | Apache 2.0 | 83.1% | 1449 | 128K |

Gemma 4 31B achieves the highest MMLU Pro score and Arena AI ranking among open models — with the fewest total parameters. This parameter efficiency is a direct result of the Gemini 3 technology foundation and the configurable thinking mode.

The 26B MoE model's efficiency story is even more compelling. It ranks 6th on Arena AI while activating only 3.8B parameters per token. No other model achieves a comparable quality-to-compute ratio. For production deployments where inference cost scales with usage, this efficiency translates directly into cost savings.

Compared to proprietary models, Gemma 4 31B's benchmarks are competitive with mid-tier offerings from Anthropic and OpenAI. While the top proprietary models still lead on the hardest tasks, the gap has narrowed dramatically — and Gemma 4 comes with zero per-token cost and full Apache 2.0 freedom.


Verdict

Gemma 4 sets a new standard for open-weight models in 2026. The combination of Apache 2.0 licensing, four well-differentiated model sizes, native multimodal support, configurable thinking mode, and benchmark scores competitive with much larger models makes it the most practical open model family available.

The 31B Dense is the right choice when you need maximum quality. The 26B MoE is the right choice when you need strong quality at minimum compute cost. The E2B and E4B are the right choices for edge deployment and on-device AI. For the first time in the Gemma family, the license does not limit any of these use cases.


FAQ: Common Questions

What is Google Gemma 4 and when was it released?
Google Gemma 4 is Google DeepMind's open-weight model family released on April 2, 2026. It includes four sizes: E2B (2.3B effective), E4B (4.5B effective), 26B MoE (3.8B active / 26B total), and 31B Dense. All models are released under Apache 2.0, the most permissive license ever used for a Gemma release.

Is Gemma 4 truly open source?
Yes. Gemma 4 is the first Gemma generation released under the Apache 2.0 license, which allows unrestricted commercial use, modification, and redistribution without requiring permission from Google. Previous Gemma models used Google's custom Gemma license, which imposed usage restrictions.

What context window does Gemma 4 support?
The smaller models (E2B and E4B) support 128K token context windows. The larger models (26B MoE and 31B Dense) support 256K token context windows. This is a major improvement over Gemma 3's context limits and enables processing of entire codebases or long documents in a single prompt.

Can Gemma 4 process images, video, and audio?
Yes. All four Gemma 4 models are natively multimodal and support text and image inputs. The E2B and E4B models go further with native video and audio processing. This makes Gemma 4 the first open-weight model family where the smallest models have the broadest modality support.

How does Gemma 4's thinking mode work?
Gemma 4 includes a configurable thinking mode that generates 4,000+ tokens of internal reasoning before producing a response. This chain-of-thought reasoning can be turned on or off per request, letting developers choose between faster responses for simple tasks and deeper reasoning for complex problems like math, logic, and coding.

What hardware do I need to run Gemma 4 locally?
Gemma 4 E2B and E4B run on devices with as little as 5GB RAM using 4-bit quantization, including smartphones and laptops. The 26B MoE model requires approximately 18GB RAM and the 31B Dense requires approximately 20GB RAM. All models run via Ollama, with NVIDIA RTX GPU optimization available.
