ZBuild News

Google Gemma 4: Complete Guide to Specs, Benchmarks, and What's New (2026)

Everything you need to know about Google Gemma 4 — the first Apache 2.0 licensed Gemma release. Covers all 4 model sizes (E2B, E4B, 26B MoE, 31B Dense), multimodal capabilities, configurable thinking mode, 256K context, 85.2% MMLU Pro, and hardware requirements for local deployment.

Published: April 3, 2026
Author: ZBuild Team
Reading time: 13 min

Key Takeaway

Google Gemma 4 is the most capable open-weight model family ever released under a truly permissive license. The 31B Dense model scores 85.2% on MMLU Pro and ranks 3rd among all open models on Arena AI — while the 26B MoE achieves nearly identical quality with only 3.8B active parameters. For the first time, Gemma ships under Apache 2.0, removing every licensing friction that held back commercial adoption of previous generations.


Google Gemma 4: Everything You Need to Know

Release Overview

Google DeepMind released Gemma 4 on April 2, 2026, introducing four model sizes built on the same technology foundation as Gemini 3. This generation represents the biggest leap in the Gemma family across every dimension: model quality, multimodal capabilities, context length, and licensing terms.

The key changes from Gemma 3:

  • Apache 2.0 licensing — no usage restrictions, no custom license, full commercial freedom
  • Four model sizes instead of three, including a new MoE architecture
  • Native multimodal support across all sizes (text, images, video, audio)
  • Configurable thinking mode with 4,000+ token reasoning chains
  • 256K context windows on larger models (double Gemma 3's 128K maximum)
  • 35+ supported languages, pre-trained on 140+ languages
  • Structured tool use for agentic workflows

The Four Model Sizes

Gemma 4 ships in four distinct sizes, each targeting different deployment scenarios:

| Model | Parameters | Active Params | Architecture | Context | Modalities |
|---|---|---|---|---|---|
| E2B | 2.3B effective | 2.3B | Dense | 128K | Text, Image, Video, Audio |
| E4B | 4.5B effective | 4.5B | Dense | 128K | Text, Image, Video, Audio |
| 26B MoE | 26B total | 3.8B | Mixture of Experts | 256K | Text, Image |
| 31B Dense | 31B | 31B | Dense | 256K | Text, Image |

Source: Google AI Blog

E2B and E4B: The Edge Models

The smallest Gemma 4 models are designed for on-device deployment. At 2.3B and 4.5B effective parameters respectively, they run on smartphones, tablets, and laptops with as little as 5GB RAM using 4-bit quantization.

What makes these models remarkable is their modality breadth. Despite being the smallest in the family, E2B and E4B are the only Gemma 4 models that support all four input modalities: text, images, video, and audio. This is a deliberate design choice — edge devices with cameras and microphones benefit most from multimodal capabilities.

Both models support 128K token context windows, which is generous for their parameter count and sufficient for most on-device use cases.

26B MoE: Maximum Efficiency

The 26B Mixture of Experts model is arguably the most interesting model in the Gemma 4 lineup. It contains 26B total parameters but only activates 3.8B parameters for any given input — roughly the same compute cost as the E4B model but with access to dramatically more knowledge and capability.

On Arena AI, the 26B MoE ranks 6th among all open models with a score of 1441, despite using only 3.8B active parameters. This efficiency ratio is unprecedented — no other model achieves comparable quality at this compute cost.

The MoE architecture routes each token through specialized expert sub-networks, allowing the model to maintain large knowledge capacity while keeping inference cost low. For deployment scenarios where you need strong reasoning but have limited GPU memory, the 26B MoE is the optimal choice.
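The routing idea can be sketched in a few lines of NumPy. This is a toy illustration only: the dimensions, gate, and experts below are made up, and Gemma 4's actual router is certainly more sophisticated. The point is that only k expert networks run per token, which is why active parameters (3.8B) sit far below total parameters (26B).

```python
# Toy sketch of top-k expert routing, the core idea behind an MoE layer.
import numpy as np

def route_token(token_vec, gate_weights, experts, k=2):
    """Send one token through the k highest-scoring experts."""
    scores = token_vec @ gate_weights            # one score per expert
    top_k = np.argsort(scores)[-k:]              # indices of the k best experts
    probs = np.exp(scores[top_k])
    probs /= probs.sum()                         # softmax over the chosen k
    # Only these k expert networks execute; the rest stay idle.
    return sum(p * experts[i](token_vec) for p, i in zip(probs, top_k))

rng = np.random.default_rng(0)
dim, n_experts = 16, 8
gate = rng.normal(size=(dim, n_experts))
experts = [(lambda W: (lambda x: x @ W))(rng.normal(size=(dim, dim)))
           for _ in range(n_experts)]
out = route_token(rng.normal(size=dim), gate, experts, k=2)
print(out.shape)
```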

31B Dense: Maximum Quality

The 31B Dense model is Gemma 4's flagship. Every parameter is active for every token, giving it the most consistent and highest-quality outputs across all task types.

On Arena AI, the 31B Dense ranks 3rd among all open models with a score of 1452. On MMLU Pro, it achieves 85.2% — competitive with models several times its size. The 89.2% score on AIME 2026 demonstrates strong mathematical reasoning, while 74% on BigBench Extra Hard (up from 19% in previous generations) shows a massive improvement in complex reasoning tasks.


Benchmarks: The Complete Data

Reasoning and Knowledge

| Benchmark | 31B Dense | 26B MoE | Notes |
|---|---|---|---|
| MMLU Pro | 85.2% | n/a | Graduate-level knowledge |
| AIME 2026 | 89.2% | n/a | Competition mathematics |
| BigBench Extra Hard | 74% | n/a | Up from 19% in previous gen |
| Arena AI Score | 1452 (3rd) | 1441 (6th) | Open model rankings |

Source: Google DeepMind technical report

BigBench Extra Hard: The Standout Result

The jump from 19% to 74% on BigBench Extra Hard deserves special attention. This benchmark tests complex multi-step reasoning, logical deduction, and tasks that require genuine understanding rather than pattern matching. A 55-percentage-point improvement in a single generation suggests fundamental advances in Gemma 4's reasoning architecture, not just scaling.

This improvement is likely connected to the configurable thinking mode and the underlying Gemini 3 technology that Gemma 4 is built on. The thinking mode generates extended reasoning chains that help the model work through complex problems step by step.

Arena AI Rankings Context

Arena AI ranks models based on head-to-head human preference comparisons. The 31B Dense scoring 1452 and ranking 3rd among open models places it above many models with significantly more parameters. For context:

  • Models ranking above it are typically 70B+ parameter models
  • The 26B MoE achieving 1441 with only 3.8B active parameters is an efficiency breakthrough
  • Both models outperform the previous Gemma 3 27B by a significant margin
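Arena-style leaderboards are fitted from pairwise human votes with an Elo / Bradley-Terry model, so a rating gap maps directly to an expected head-to-head win rate. The sketch below uses the generic Elo expectation formula (an assumption; Arena AI's exact fitting method may differ) with the scores quoted above:

```python
# Expected win rate under the standard Elo model.
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B, given Elo-style ratings."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# 31B Dense (1452) vs 26B MoE (1441): an 11-point gap is nearly a coin flip.
p = expected_win_rate(1452, 1441)
print(f"{p:.3f}")
```

An 11-point gap corresponds to roughly a 51.6% win rate, which is why the two models are described as achieving nearly identical quality.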

Multimodal Capabilities

Image Understanding

All four Gemma 4 models process images natively. Capabilities include:

  • Image description and analysis — detailed understanding of visual content
  • OCR and document parsing — extracting text from images, receipts, screenshots
  • Chart and diagram interpretation — understanding data visualizations
  • Visual reasoning — answering questions that require understanding spatial relationships

Video and Audio (E2B/E4B Only)

The smaller E2B and E4B models add native video and audio processing:

  • Video understanding — analyzing video content without frame-by-frame extraction
  • Audio transcription and understanding — processing speech and environmental audio
  • Cross-modal reasoning — answering questions that span text, image, video, and audio inputs

This design choice reflects Google's focus on edge deployment. Mobile devices capture video and audio natively, so the models designed for those devices support those modalities.


Configurable Thinking Mode

Gemma 4 introduces a configurable thinking mode that generates 4,000+ tokens of internal reasoning before producing a response. This is similar to the extended thinking capabilities seen in Claude's models and OpenAI's o-series, but implemented in an open-weight model.

How It Works

When thinking mode is enabled, the model:

  1. Receives the input prompt
  2. Generates an internal reasoning chain (visible or hidden, depending on configuration)
  3. Uses the reasoning chain to produce a higher-quality final response

The thinking mode can be toggled per request, allowing developers to:

  • Enable thinking for complex math, logic, coding, and analysis tasks
  • Disable thinking for simple queries, chat, and latency-sensitive applications
  • Adjust thinking depth based on the expected complexity of the task

Impact on Quality

The thinking mode is a primary driver behind Gemma 4's strong benchmark performance. The AIME 2026 score of 89.2% and BigBench Extra Hard score of 74% are both achieved with thinking mode enabled. Without thinking mode, these scores would be notably lower — similar to the pattern seen in other models with extended reasoning capabilities.
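The per-request toggle can be sketched against Ollama's HTTP chat endpoint. Recent Ollama releases accept a `think` flag on `/api/chat` for reasoning-capable models; whether the Gemma 4 tags wire into it in exactly this form is an assumption here. The sketch only builds the request payload:

```python
# Build an Ollama /api/chat payload with thinking toggled per request.
# The "think" flag and the "gemma4:31b" tag are assumptions for illustration.
import json

def build_chat_request(prompt: str, think: bool) -> dict:
    """Construct a chat request with the reasoning chain on or off."""
    return {
        "model": "gemma4:31b",
        "messages": [{"role": "user", "content": prompt}],
        "think": think,   # True: emit internal reasoning before the answer
        "stream": False,
    }

# Deep reasoning for a hard problem, fast path for a trivial one.
hard = build_chat_request("Prove that sqrt(2) is irrational.", think=True)
easy = build_chat_request("What's the capital of France?", think=False)
print(json.dumps(hard, indent=2))
```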


Apache 2.0: Why the License Change Matters

Previous Gemma generations shipped under Google's custom Gemma license, which included restrictions on:

  • Usage in certain applications
  • Redistribution terms
  • Commercial deployment limitations for large-scale use

Gemma 4 switches to Apache 2.0, the same license used by projects like Kubernetes, TensorFlow, and Apache HTTP Server. This means:

  • No usage restrictions — use it for anything, including commercial products
  • No redistribution limitations — share modified weights freely
  • No attribution requirements beyond the license — standard Apache 2.0 notice
  • No Google approval needed — deploy at any scale without permission
  • Compatible with other open-source licenses — easy to integrate into existing projects

For enterprises and startups building products on top of open models, this removes the legal review overhead that Gemma's custom license required. It also makes Gemma 4 directly comparable to Meta's Llama models (which use their own custom license with some restrictions) and positions it as the most permissively licensed high-quality open model family available.


Language Support

Gemma 4 supports 35+ languages for inference and was pre-trained on 140+ languages. This makes it one of the most multilingual open models available, alongside Qwen's models which also emphasize broad language coverage.

Supported languages include major world languages (English, Chinese, Spanish, French, German, Japanese, Korean, Arabic, Hindi, Portuguese, Russian) as well as many languages with smaller digital footprints. The pre-training on 140+ languages means the model has some capability in languages beyond the officially supported 35+, though quality may vary.

For applications targeting global audiences or non-English markets, this broad language support reduces the need for specialized fine-tuning or separate models per language.


Structured Tool Use and Agentic Workflows

Gemma 4 includes native support for structured tool use, enabling agentic workflows where the model can:

  • Call external APIs with properly formatted requests
  • Parse structured responses from tools and services
  • Chain multiple tool calls to complete complex tasks
  • Handle errors and retries in tool execution

This capability is particularly relevant for Android Studio integration, where Gemma 4 powers local agentic coding workflows. The model can understand code context, suggest changes, execute tools, and iterate — all running locally on the developer's machine without sending code to external servers.

For developers building AI agents, Gemma 4's structured tool use provides a fully local, fully private foundation. Combined with the Apache 2.0 license, this enables building and deploying agentic applications without any dependency on external model providers.
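The loop an application runs around this capability can be sketched as follows: the model emits a structured call, the application dispatches it, and the result (or error) is fed back so the model can retry. The JSON shape and the `get_weather` tool are illustrative assumptions, not a schema Gemma 4 defines:

```python
# Minimal dispatch step for a structured tool-use loop.
import json

TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",  # stand-in for a real API
}

def dispatch(tool_call_json: str) -> str:
    """Execute one structured tool call and return its result as text."""
    call = json.loads(tool_call_json)
    fn = TOOLS.get(call["tool"])
    if fn is None:
        return f"error: unknown tool {call['tool']!r}"  # fed back for retry
    try:
        return fn(**call["arguments"])
    except TypeError as e:
        return f"error: bad arguments ({e})"            # model can self-correct

# In a real loop, the model would produce this string; it is hard-coded here.
result = dispatch('{"tool": "get_weather", "arguments": {"city": "Berlin"}}')
print(result)
```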


Hardware Requirements

Local Deployment via Ollama

| Model | RAM Required (4-bit) | RAM Required (FP16) | GPU Recommendation |
|---|---|---|---|
| E2B | ~5 GB | ~5 GB | Any modern GPU / CPU only |
| E4B | ~5 GB | ~9 GB | Any modern GPU / CPU only |
| 26B MoE | ~18 GB | ~52 GB | RTX 4090 / RTX 5090 |
| 31B Dense | ~20 GB | ~62 GB | RTX 4090 / RTX 5090 |

Source: Ollama model library

The E2B and E4B models are specifically designed for edge deployment. They run comfortably on laptops, desktop CPUs, and even some smartphones. The 26B MoE and 31B Dense models require dedicated GPU hardware but remain accessible to individual developers with consumer GPUs.
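The FP16 column above is essentially parameter count times two bytes per weight. A quick sanity check; the gap between raw weight size and the quoted 4-bit figures reflects runtime overhead such as the KV cache and inference buffers:

```python
# Raw weight memory: parameters x bytes per parameter.
def weight_gb(params_b: float, bits: int) -> float:
    """Weight size in GB for `params_b` billion parameters at `bits` precision."""
    return params_b * bits / 8

print(weight_gb(31, 16))  # 62.0 GB, matching the ~62 GB FP16 row
print(weight_gb(31, 4))   # 15.5 GB; the ~20 GB row adds runtime overhead
```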

NVIDIA Optimization

NVIDIA has released optimized versions of Gemma 4 for RTX GPUs, providing:

  • Faster inference through GPU-specific kernel optimizations
  • Better memory utilization on RTX 4000 and 5000 series cards
  • TensorRT integration for production deployment
  • CUDA graph support for reduced overhead in repeated inference

Source: NVIDIA AI Blog


What Changed from Gemma 3

| Feature | Gemma 3 | Gemma 4 |
|---|---|---|
| License | Gemma License (restricted) | Apache 2.0 (unrestricted) |
| Model Sizes | 3 sizes | 4 sizes (added MoE) |
| Context Window | Up to 128K | Up to 256K |
| Modalities | Text, Image | Text, Image, Video, Audio |
| Thinking Mode | No | Yes (configurable) |
| Tool Use | Limited | Structured tool use |
| Languages | 30+ | 35+ (pre-trained on 140+) |
| BigBench Extra Hard | 19% | 74% |

Every dimension improved. The most impactful changes for developers are the Apache 2.0 license (removes legal friction), the thinking mode (improves quality on hard tasks), and the MoE architecture (provides flagship-quality at a fraction of the compute cost).


Practical Use Cases

Coding and Development

Gemma 4's structured tool use and thinking mode make it effective for:

  • Local code completion and generation
  • Code review and bug detection
  • Automated test generation
  • Documentation writing
  • Agentic coding workflows in Android Studio

Document Processing

With 256K context windows and multimodal support:

  • Process entire codebases or long documents in a single prompt
  • Extract information from images of documents, receipts, and forms
  • Analyze charts and data visualizations
  • Summarize lengthy research papers or legal documents
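When a document does exceed the window, a crude splitter keeps each chunk under budget. The 4-characters-per-token ratio below is a rough English-text heuristic, not Gemma 4's actual tokenizer:

```python
# Split text into chunks that each fit a token budget.
def split_to_budget(text: str, max_tokens: int = 256_000,
                    chars_per_token: float = 4.0) -> list[str]:
    """Split `text` into pieces that each fit the context window."""
    max_chars = int(max_tokens * chars_per_token)
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

doc = "x" * 2_100_000            # roughly 525K tokens: too big for one prompt
chunks = split_to_budget(doc)
print(len(chunks))
```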

Building AI-Powered Applications

For developers building products that incorporate AI capabilities, Gemma 4 provides a strong on-device or self-hosted inference layer. The model handles the intelligence — understanding queries, generating responses, processing images — while your application framework handles the rest. Tools like ZBuild can accelerate building the application shell (frontend, backend, database, deployment), letting you focus development effort on the AI integration layer where Gemma 4's capabilities matter most.

Edge and Mobile Deployment

The E2B and E4B models open up use cases that were previously impossible with open models:

  • On-device assistants that work offline
  • Privacy-preserving AI features that never send data to external servers
  • Real-time video and audio processing on mobile devices
  • Embedded AI in IoT and robotics applications

How to Get Started

Ollama (Fastest Path)

```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Gemma 4
ollama run gemma4:e2b      # Smallest, runs anywhere
ollama run gemma4:e4b      # Small, broader capability
ollama run gemma4:26b-moe  # MoE, best efficiency
ollama run gemma4:31b      # Dense, highest quality
```

Hugging Face

All Gemma 4 models are available on Hugging Face with full transformers integration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map="auto" (requires the accelerate package) spreads weights across
# available GPUs; torch_dtype="auto" loads the checkpoint's native precision.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-31b", device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31b")
```

Google AI Studio

Google provides free API access to Gemma 4 through AI Studio for experimentation and prototyping, with Vertex AI available for production deployment.


Gemma 4 in the Competitive Landscape

To understand where Gemma 4 sits in the broader ecosystem:

| Model | Params | License | MMLU Pro | Arena AI | Context |
|---|---|---|---|---|---|
| Gemma 4 31B | 31B | Apache 2.0 | 85.2% | 1452 | 256K |
| Gemma 4 26B MoE | 26B (3.8B active) | Apache 2.0 | n/a | 1441 | 256K |
| Llama 4 Maverick | 400B (~17B active) | Meta License | 79.6% | 1417 | 1M |
| Llama 4 Scout | 109B (~17B active) | Meta License | n/a | ~1400 | 10M |
| Qwen 3.5 72B | 72B | Apache 2.0 | 81.4% | 1438 | 128K |
| Qwen 3.5 MoE | 397B (~22B active) | Apache 2.0 | 83.1% | 1449 | 128K |

Gemma 4 31B achieves the highest MMLU Pro score and Arena AI ranking among open models — with the fewest total parameters. This parameter efficiency is a direct result of the Gemini 3 technology foundation and the configurable thinking mode.

The 26B MoE model's efficiency story is even more compelling. It ranks 6th on Arena AI while activating only 3.8B parameters per token. No other model achieves a comparable quality-to-compute ratio. For production deployments where inference cost scales with usage, this efficiency translates directly into cost savings.

Compared to proprietary models, Gemma 4 31B's benchmarks are competitive with mid-tier offerings from Anthropic and OpenAI. While the top proprietary models still lead on the hardest tasks, the gap has narrowed dramatically — and Gemma 4 comes with zero per-token cost and full Apache 2.0 freedom.


Verdict

Gemma 4 sets a new standard for open-weight models in 2026. The combination of Apache 2.0 licensing, four well-differentiated model sizes, native multimodal support, configurable thinking mode, and benchmark scores competitive with much larger models makes it the most practical open model family available.

The 31B Dense is the right choice when you need maximum quality. The 26B MoE is the right choice when you need strong quality at minimum compute cost. The E2B and E4B are the right choices for edge deployment and on-device AI. For the first time in the Gemma family, the license does not limit any of these use cases.


FAQ: Common Questions

What is Google Gemma 4 and when was it released?
Google Gemma 4 is Google DeepMind's open-weight model family released on April 2, 2026. It includes four sizes: E2B (2.3B effective), E4B (4.5B effective), 26B MoE (3.8B active / 26B total), and 31B Dense. All models are released under Apache 2.0, the most permissive license ever used for a Gemma release.

Is Gemma 4 truly open source?
Yes. Gemma 4 is the first Gemma generation released under the Apache 2.0 license, which allows unrestricted commercial use, modification, and redistribution without requiring permission from Google. Previous Gemma models used Google's custom Gemma license, which imposed usage restrictions.

What context window does Gemma 4 support?
The smaller models (E2B and E4B) support 128K token context windows. The larger models (26B MoE and 31B Dense) support 256K token context windows. This is a major improvement over Gemma 3's context limits and enables processing of entire codebases or long documents in a single prompt.

Can Gemma 4 process images, video, and audio?
Yes. All four Gemma 4 models are natively multimodal and support text and image inputs. The E2B and E4B models go further with native video and audio processing. This makes Gemma 4 the first open-weight model family where the smallest models have the broadest modality support.

How does Gemma 4's thinking mode work?
Gemma 4 includes a configurable thinking mode that generates 4,000+ tokens of internal reasoning before producing a response. This chain-of-thought reasoning can be turned on or off per request, letting developers choose between faster responses for simple tasks and deeper reasoning for complex problems like math, logic, and coding.

What hardware do I need to run Gemma 4 locally?
Gemma 4 E2B and E4B run on devices with as little as 5GB RAM using 4-bit quantization, including smartphones and laptops. The 26B MoE model requires approximately 18GB RAM and the 31B Dense requires approximately 20GB RAM. All models run via Ollama, with NVIDIA RTX GPU optimization available.
