ZBuild News

Run Gemma 4 Locally in 5 Minutes: Complete Ollama Setup Guide (2026)

Step-by-step tutorial for running Google Gemma 4 locally with Ollama. Covers installation, model selection (E2B, E4B, 26B MoE, 31B), hardware requirements, quantization options, API integration, performance tuning, and real-world usage tips for developers.

Published
2026-04-03T00:00:00.000Z
Author
ZBuild Team
Reading Time
14 min read
Tags: gemma 4 ollama, run gemma 4 locally, gemma 4 tutorial, gemma 4 local setup, gemma 4 hardware requirements, ollama gemma 4 guide

Key Takeaway

Running Gemma 4 locally takes less than 5 minutes with Ollama: install Ollama, run one command, and you have a fully capable AI model running on your own hardware with zero API costs, zero data leaving your machine, and zero usage restrictions under Apache 2.0. The E2B model runs on any laptop. The 26B MoE model fits on a single RTX 4090 and delivers quality that rivals models 10x its active parameter count.


Run Gemma 4 Locally: The Complete Guide

Why Run Gemma 4 Locally?

Before diving into setup, here is why local inference matters in 2026:

  • Privacy — Your data never leaves your machine. No prompts sent to external servers. Critical for proprietary code, legal documents, medical data, or any sensitive information.
  • Cost — Zero per-token cost after the one-time hardware investment. Heavy users save hundreds of dollars per month compared to API pricing.
  • Latency — No network round trips. The E2B and E4B models respond in milliseconds on modern hardware.
  • Reliability — No API rate limits, no outages, no provider policy changes. Your model is always available.
  • Customization — Fine-tune, quantize, and modify the model freely under Apache 2.0.
  • Offline access — Works without an internet connection once the model is downloaded.

Gemma 4 is particularly well-suited for local deployment because Google designed the smaller models specifically for edge and on-device use. The E2B and E4B models are not afterthoughts — they are first-class models optimized for the constraints of local hardware.


Prerequisites

Hardware Requirements by Model

| Model | Minimum RAM | Recommended VRAM | CPU-Only Viable? | Disk Space |
|---|---|---|---|---|
| E2B (4-bit) | 5 GB | 4 GB | Yes | ~1.5 GB |
| E4B (4-bit) | 5 GB | 4 GB | Yes | ~2.8 GB |
| E4B (FP16) | 9 GB | 9 GB | Slow | ~9 GB |
| 26B MoE (4-bit) | 18 GB | 16 GB | Very slow | ~15 GB |
| 26B MoE (FP16) | 52 GB | 48 GB | No | ~52 GB |
| 31B Dense (4-bit) | 20 GB | 18 GB | Very slow | ~18 GB |
| 31B Dense (FP16) | 62 GB | 48 GB+ | No | ~62 GB |

Key takeaway: If you have a laptop made after 2022, you can run E2B or E4B. If you have an RTX 4090 (24GB VRAM) or Apple M-series Mac with 32GB+ RAM, you can run the 26B MoE or 31B Dense at 4-bit quantization.
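If you prefer to encode that decision in code, here is a minimal sketch that maps available memory to the model tags used throughout this guide. The thresholds mirror the 4-bit rows of the table above; treat them as rough minimums, not guarantees:

```python
def suggest_model(ram_gb: float, vram_gb: float = 0) -> str:
    """Suggest a Gemma 4 tag (4-bit quantization) from available RAM/VRAM.

    Thresholds are taken from the hardware table above.
    """
    if vram_gb >= 18:
        return "gemma4:31b"       # highest quality: ~20 GB RAM, 18 GB VRAM
    if vram_gb >= 16:
        return "gemma4:26b-moe"   # best efficiency: ~18 GB RAM, 16 GB VRAM
    if ram_gb >= 8:
        return "gemma4:e4b"       # small model with broader capability
    if ram_gb >= 5:
        return "gemma4:e2b"       # runs on any modern laptop
    return "insufficient memory for any Gemma 4 model"

# Example: a 32 GB laptop with no dedicated GPU
print(suggest_model(ram_gb=32))  # gemma4:e4b
```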

Software Requirements

  • Operating system: macOS, Linux, or Windows
  • Ollama: Version 0.6+ (download from ollama.com)
  • GPU drivers (optional): CUDA 12+ for NVIDIA GPUs; Apple Silicon needs no extra drivers

Step 1: Install Ollama

macOS

Download from ollama.com/download or use Homebrew:

brew install ollama

Linux

One-line install script:

curl -fsSL https://ollama.com/install.sh | sh

Windows

Download the installer from ollama.com/download and run it. Ollama runs as a background service on Windows.

Verify Installation

ollama --version

You should see ollama version 0.6.x or higher, which confirms Ollama is installed correctly.

Source: Ollama installation guide


Step 2: Pull a Gemma 4 Model

Choose the model that matches your hardware:

For Laptops and Light Workloads

# Smallest model — runs on any modern laptop (5GB RAM)
ollama pull gemma4:e2b

# Small model with broader capability (5-9GB RAM)
ollama pull gemma4:e4b

For Desktops with a Dedicated GPU

# Best efficiency — flagship quality at 3.8B active params (18GB RAM)
ollama pull gemma4:26b-moe

# Highest quality — full 31B parameters (20GB RAM)
ollama pull gemma4:31b

Specifying Quantization

By default, Ollama pulls the recommended quantization for each model (usually Q4_K_M for good quality-to-size balance). You can specify different quantizations:

# Higher quality, larger size
ollama pull gemma4:31b-q5_K_M

# Smaller size, slightly lower quality
ollama pull gemma4:31b-q3_K_M

# Full precision (requires much more RAM)
ollama pull gemma4:31b-fp16

The download will take a few minutes depending on your internet connection. Model sizes range from ~1.5GB (E2B 4-bit) to ~62GB (31B FP16).


Step 3: Run Gemma 4

Interactive Chat

ollama run gemma4:e4b

This opens an interactive chat session. Type your prompt and press Enter:

>>> What are the key differences between REST and GraphQL APIs?

The model will respond directly in your terminal. Type /bye to exit.

Single Prompt (Non-Interactive)

echo "Explain the Builder design pattern in Python with an example" | ollama run gemma4:26b-moe

With Thinking Mode

Gemma 4 supports configurable thinking mode for complex tasks. Enable it by adding a system prompt:

ollama run gemma4:31b --system "Think step by step before answering. Show your reasoning process."

For math, logic, and complex analysis tasks, thinking mode significantly improves answer quality. The model can generate 4,000+ tokens of internal reasoning before producing its final response.


Step 4: Use the Local API

Ollama exposes a REST API on localhost:11434 that is compatible with the OpenAI API format. This means any tool or library that supports OpenAI's API can connect to your local Gemma 4 with a simple URL change.

Test the API with curl

curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:26b-moe",
  "prompt": "Write a Python function to parse CSV files with error handling",
  "stream": false
}'

OpenAI-Compatible Endpoint

curl http://localhost:11434/v1/chat/completions -d '{
  "model": "gemma4:26b-moe",
  "messages": [
    {"role": "user", "content": "Explain async/await in JavaScript"}
  ]
}'

Source: Ollama API documentation


Step 5: Integrate with Your Application

Python

import requests

def ask_gemma(prompt, model="gemma4:26b-moe"):
    """Send a prompt to the local Ollama server and return the full response."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False  # wait for the complete response instead of streaming
        },
        timeout=120  # large models can take a while on long prompts
    )
    response.raise_for_status()
    return response.json()["response"]

# Usage
answer = ask_gemma("What is the time complexity of merge sort?")
print(answer)

Python with OpenAI SDK

from openai import OpenAI

# Point to local Ollama instead of OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Ollama doesn't require a real API key
)

response = client.chat.completions.create(
    model="gemma4:26b-moe",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a React hook for debounced search"}
    ]
)
print(response.choices[0].message.content)

Node.js / TypeScript

const response = await fetch("http://localhost:11434/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "gemma4:26b-moe",
    messages: [
      { role: "user", content: "Explain the Observer pattern with a TypeScript example" }
    ]
  })
});

const data = await response.json();
console.log(data.choices[0].message.content);

Using with LangChain

from langchain_community.llms import Ollama

llm = Ollama(model="gemma4:26b-moe")
response = llm.invoke("Summarize the key principles of clean architecture")
print(response)

Using with LlamaIndex

from llama_index.llms.ollama import Ollama

llm = Ollama(model="gemma4:26b-moe", request_timeout=120.0)
response = llm.complete("What are the SOLID principles in software engineering?")
print(response)

Quantization Options Explained

Quantization reduces model size and memory usage by using lower-precision numbers to represent model weights. The tradeoff is between quality and resource usage:

| Quantization | Bits per Weight | Quality Impact | Memory Savings | Best For |
|---|---|---|---|---|
| FP16 | 16 bits | None (full quality) | Baseline | Servers with ample VRAM |
| Q8_0 | 8 bits | Negligible | ~50% | High-quality local inference |
| Q6_K | 6 bits | Very minor | ~62% | Quality-focused local use |
| Q5_K_M | 5 bits | Minor | ~69% | Good balance |
| Q4_K_M | 4 bits | Small | ~75% | Recommended default |
| Q3_K_M | 3 bits | Moderate | ~81% | Constrained hardware |
| Q2_K | 2 bits | Significant | ~87% | Extreme constraints |

Q4_K_M is the sweet spot for most users. The quality difference from FP16 is small enough that most tasks produce indistinguishable results, while memory savings of 75% make the difference between "needs a server" and "runs on my laptop."
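The sizes in the tables follow directly from parameter count times bits per weight. A rough back-of-the-envelope sketch — the ~4.8 effective bits for Q4_K_M is an approximation, since K-quants store per-block scales that push the effective rate slightly above the nominal 4 bits:

```python
def approx_model_size_gb(params_billion: float, effective_bits: float) -> float:
    """Rough on-disk size: parameters x bits per weight / 8 bits per byte."""
    return params_billion * 1e9 * effective_bits / 8 / 1e9

# FP16: two bytes per weight
print(round(approx_model_size_gb(31, 16)))   # 62 -- matches the ~62 GB in the table
# Q4_K_M at ~4.8 effective bits per weight
print(round(approx_model_size_gb(31, 4.8)))  # 19 -- close to the ~18 GB in the table
```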

Choosing the Right Quantization

For Gemma 4 E2B/E4B: Use the default (Q4_K_M). These models are already small enough that higher quantization does not meaningfully change the user experience.

For Gemma 4 26B MoE: Q4_K_M fits in 18GB RAM, which is within an RTX 4090's 24GB VRAM with room for KV cache. If you have 48GB+ VRAM (A6000, dual GPUs), consider Q8_0 for marginally better quality.

For Gemma 4 31B Dense: Q4_K_M at 20GB fits in an RTX 4090 with tight margins. Q5_K_M produces slightly better results but requires ~24GB, consuming all available VRAM. If you have 32GB+ VRAM (RTX 5090, A6000), Q6_K or Q8_0 are worth the upgrade.


Performance Tuning

GPU Offloading

Ollama automatically offloads model layers to the GPU when VRAM is available. If only part of the model fits in VRAM, Ollama splits between GPU and CPU. You can control this:

# Force all layers to GPU (fails if insufficient VRAM)
OLLAMA_NUM_GPU=999 ollama run gemma4:26b-moe

# Force CPU only (useful for testing)
OLLAMA_NUM_GPU=0 ollama run gemma4:e4b

Context Window Configuration

By default, Ollama uses a context window of 2048 tokens for efficiency. To utilize Gemma 4's full context capabilities:

# Set context window to 32K tokens
ollama run gemma4:26b-moe --num-ctx 32768

# Set context window to 128K tokens (requires more RAM)
ollama run gemma4:26b-moe --num-ctx 131072

Important: Larger context windows consume more RAM for the KV cache. A 128K context window on the 31B model may require 8-16GB additional RAM beyond the model weights. Start with 32K and increase only if your use case requires it.
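The KV-cache cost above can be estimated with the generic transformer formula: 2 (keys and values) × layers × KV heads × head dimension × bytes per value × context tokens. The architecture numbers in this sketch are purely illustrative assumptions, not published Gemma 4 figures:

```python
def kv_cache_gb(context_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """Generic transformer KV-cache size in GB.

    2 (K and V) x layers x KV heads x head dim x bytes x tokens.
    FP16 cache values are 2 bytes each.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens / 1e9

# Hypothetical architecture for illustration only:
# 48 layers, 4 KV heads (grouped-query attention), head dim 128
print(round(kv_cache_gb(131072, 48, 4, 128), 1))  # ~12.9 GB at 128K context
```

With these assumed numbers, a full 128K context lands squarely in the 8-16 GB range noted above, which is why starting at 32K is the safer default.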

Concurrent Requests

Ollama supports serving multiple requests simultaneously:

# Allow up to 4 concurrent requests
OLLAMA_NUM_PARALLEL=4 ollama serve

Each concurrent request adds memory overhead for its KV cache. On a 24GB GPU running the 26B MoE at Q4_K_M (~18GB), you have roughly 6GB headroom — enough for 2-3 concurrent requests with short contexts.

Keep-Alive Settings

By default, Ollama keeps models loaded in memory for 5 minutes after the last request. Adjust this for your use case:

# Keep model loaded for 1 hour
OLLAMA_KEEP_ALIVE=3600 ollama serve

# Keep model loaded indefinitely
OLLAMA_KEEP_ALIVE=-1 ollama serve

# Unload immediately after each request (saves memory)
OLLAMA_KEEP_ALIVE=0 ollama serve

NVIDIA RTX Optimization

NVIDIA has released optimized builds of Gemma 4 for RTX GPUs. These optimizations include:

  • Custom CUDA kernels for Gemma 4's attention mechanism
  • TensorRT-LLM integration for faster inference
  • Flash Attention support for reduced memory usage during long-context inference
  • Optimized KV cache management for better throughput

Installing NVIDIA-Optimized Gemma 4

If you have an RTX 4000 or 5000 series GPU:

# Check your GPU
nvidia-smi

# Pull the NVIDIA-optimized version (if available in Ollama)
ollama pull gemma4:31b-nvidia

Alternatively, use NVIDIA's AI Workbench or TensorRT-LLM directly for maximum performance. The NVIDIA-optimized versions can provide 30-50% faster inference on RTX GPUs compared to standard Ollama builds.


Real-World Performance Benchmarks

Measured on common hardware configurations:

Tokens per Second (Generation Speed)

| Model | RTX 4090 (24GB) | RTX 3090 (24GB) | M3 Max (36GB) | CPU Only (32GB) |
|---|---|---|---|---|
| E2B (Q4) | ~150 tok/s | ~120 tok/s | ~100 tok/s | ~30 tok/s |
| E4B (Q4) | ~100 tok/s | ~80 tok/s | ~70 tok/s | ~15 tok/s |
| 26B MoE (Q4) | ~40 tok/s | ~30 tok/s | ~25 tok/s | ~3 tok/s |
| 31B Dense (Q4) | ~30 tok/s | ~20 tok/s | ~20 tok/s | ~2 tok/s |

Context: Human reading speed is roughly 4-5 tokens per second. Any model generating above 10 tok/s feels "instant" for interactive use. The E2B and E4B models are fast enough for real-time streaming on almost any hardware.

Time to First Token (Latency)

| Model | RTX 4090 | M3 Max | CPU Only |
|---|---|---|---|
| E2B | <100ms | <200ms | <500ms |
| E4B | <200ms | <300ms | ~1s |
| 26B MoE | ~500ms | ~1s | ~5s |
| 31B Dense | ~800ms | ~1.5s | ~8s |

For interactive applications, time to first token matters more than generation speed. The E2B and E4B models start generating almost instantly even on CPU, making them ideal for real-time chat interfaces.


Common Use Cases

Local Coding Assistant

Use Gemma 4 as a private coding assistant that never sends your code to external servers:

ollama run gemma4:26b-moe --system "You are an expert software engineer. When given code, analyze it for bugs, suggest improvements, and explain your reasoning. Be concise and practical."

Pair this with VS Code extensions like Continue or Twinny that support Ollama as a backend.
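For example, Continue reads a config.json listing model backends; a minimal entry pointing at local Gemma 4 might look like the following (field names follow Continue's Ollama provider configuration at the time of writing — check the extension's documentation for the current schema):

```json
{
  "models": [
    {
      "title": "Gemma 4 26B MoE (local)",
      "provider": "ollama",
      "model": "gemma4:26b-moe",
      "apiBase": "http://localhost:11434"
    }
  ]
}
```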

Document Analysis

Process sensitive documents locally:

echo "Analyze this contract clause and identify potential risks: [paste clause]" | ollama run gemma4:31b

With 256K context, the 31B model can process documents up to ~750 pages — sufficient for most contracts, research papers, and technical documentation.

Local RAG (Retrieval-Augmented Generation)

Combine Gemma 4 with a local vector database for a fully private RAG system:

from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Use Gemma 4 for both embeddings and generation
embeddings = OllamaEmbeddings(model="gemma4:e4b")
llm = Ollama(model="gemma4:26b-moe")

# Create a vector store from your documents
# (`documents` is a list of LangChain Document objects you load beforehand,
# e.g. with a document loader and a text splitter)
vectorstore = Chroma.from_documents(documents, embeddings)

# Query with RAG
retriever = vectorstore.as_retriever()
docs = retriever.get_relevant_documents("What is our refund policy?")
context = "\n".join(doc.page_content for doc in docs)
response = llm.invoke(f"Based on this context:\n{context}\n\nAnswer: What is our refund policy?")
print(response)

Building AI Features into Applications

For developers building applications with AI capabilities, running Gemma 4 locally via Ollama's API is the fastest path to a working prototype. The OpenAI-compatible API means you can start with local Gemma 4 for development and switch to cloud APIs for production without changing application code.

Platforms like ZBuild can handle the application infrastructure — frontend, backend, authentication, database — while you focus on the AI integration layer. Point your application's AI endpoint to localhost:11434 during development and swap to a cloud endpoint when you are ready to scale.
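One way to wire this up is to resolve the endpoint from environment variables with a local-Ollama default, so the same code runs against either backend. The variable names here are this sketch's own convention, not a standard:

```python
import os

def llm_config() -> dict:
    """Resolve the LLM endpoint from the environment, defaulting to
    local Ollama. LLM_BASE_URL / LLM_API_KEY / LLM_MODEL are this
    sketch's own convention."""
    return {
        "base_url": os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        "api_key": os.environ.get("LLM_API_KEY", "ollama"),  # Ollama ignores the key
        "model": os.environ.get("LLM_MODEL", "gemma4:26b-moe"),
    }

cfg = llm_config()
# Pass cfg["base_url"] and cfg["api_key"] to any OpenAI-compatible client;
# in production, export LLM_BASE_URL to point at your cloud endpoint.
print(cfg["base_url"])
```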


Troubleshooting

"Out of memory" Errors

If you see memory errors:

  1. Try a smaller quantization: ollama pull gemma4:31b-q3_K_M
  2. Reduce context window: --num-ctx 4096
  3. Close other GPU-intensive applications
  4. Switch to a smaller model: the 26B MoE delivers near-31B quality at lower memory cost

Slow Generation Speed

If generation is slower than expected:

  1. Check GPU utilization: nvidia-smi (should show high GPU usage)
  2. Ensure the model fits entirely in VRAM — partial CPU offloading is dramatically slower
  3. Reduce --num-ctx to free VRAM for compute
  4. Check if other processes are using the GPU

Model Not Found

If ollama run gemma4:26b-moe fails:

# List available models
ollama list

# Browse available Gemma 4 tags in the Ollama model library (ollama.com/library)

# Pull the specific model
ollama pull gemma4:26b-moe

API Connection Refused

If applications cannot connect to localhost:11434:

# Check if Ollama is running
ollama list

# Start the Ollama server manually
ollama serve

# Check the port
curl http://localhost:11434/api/tags

Model Selection Decision Tree

Use this to quickly choose the right model:

Do you have a dedicated GPU with 16GB+ VRAM?

  • Yes → Do you want maximum quality or maximum efficiency?
    • Maximum qualitygemma4:31b (Q4_K_M, needs 20GB)
    • Maximum efficiencygemma4:26b-moe (Q4_K_M, needs 18GB)
  • No → Do you have 8GB+ RAM?
    • Yesgemma4:e4b (Q4_K_M, better quality)
    • Nogemma4:e2b (Q4_K_M, runs on 5GB)

For most developers with a modern desktop or gaming PC: Start with gemma4:26b-moe. It offers the best quality-to-resource ratio in the entire Gemma 4 family.


What You Can Build

With Gemma 4 running locally, you have a zero-cost AI backend for:

  • Chat applications with full conversation privacy
  • Code analysis tools that work on proprietary codebases
  • Document processing pipelines for sensitive data
  • Local AI assistants that work offline
  • Prototype AI features before committing to cloud API costs
  • Fine-tuned models for domain-specific tasks (Apache 2.0 allows this freely)

The Apache 2.0 license means everything you build is yours — no usage restrictions, no revenue sharing, no approval needed. Run it locally, deploy it on your servers, embed it in your products. This is what truly open AI looks like.


FAQ

Common questions

How much RAM do I need to run Gemma 4 locally?
Gemma 4 E2B and E4B run on as little as 5GB RAM with 4-bit quantization — any modern laptop qualifies. The 26B MoE model needs approximately 18GB RAM (fits in an RTX 4090's 24GB VRAM). The 31B Dense model needs approximately 20GB RAM. For CPU-only execution, add 20-30% more RAM than the model weight size.
Which Gemma 4 model should I choose for local use?
For laptops without a dedicated GPU: E2B (fastest, lightest). For laptops with a GPU or desktops: E4B (better quality, still lightweight). For desktops with an RTX 4090 or equivalent: 26B MoE (best quality-to-compute ratio). For workstations with 24GB+ VRAM: 31B Dense (highest quality). The 26B MoE is the sweet spot for most developers.
Is Gemma 4 free to use locally?
Yes. Gemma 4 is released under Apache 2.0, which permits unrestricted use including commercial applications. Ollama is also free and open source. The only cost is your hardware. There are no API fees, no usage limits, and no license restrictions.
How fast is Gemma 4 locally compared to cloud APIs?
On an RTX 4090, Gemma 4 E4B generates 80-120 tokens per second. The 26B MoE generates 30-50 tokens/sec. The 31B Dense generates 20-35 tokens/sec. Cloud APIs like Google AI Studio may be faster for the largest models but add network latency of 100-500ms per request. For interactive use, local inference on the smaller models feels instant.
Can I use Gemma 4 locally as an API for my applications?
Yes. Ollama exposes a local REST API on port 11434 that is compatible with the OpenAI API format. Any application, framework, or tool that supports the OpenAI API can connect to local Gemma 4 by pointing the base URL to http://localhost:11434/v1. This includes Python, Node.js, and most AI frameworks.
Does Gemma 4 support GPU acceleration with Ollama?
Yes. Ollama automatically detects and uses NVIDIA GPUs (CUDA), Apple Silicon (Metal), and AMD GPUs (ROCm). No additional configuration is needed — if your GPU has enough VRAM to hold the model, Ollama will use it automatically. NVIDIA has also released RTX-optimized versions of Gemma 4 for additional performance gains.

Build with ZBuild

Turn your idea into a working app — no coding required.

46,000+ developers built with ZBuild this month
