Key Takeaway
Running Gemma 4 locally takes less than 5 minutes with Ollama: install Ollama, run one command, and you have a fully capable AI model running on your own hardware with zero API costs, zero data leaving your machine, and zero usage restrictions under Apache 2.0. The E2B model runs on any laptop. The 26B MoE model fits on a single RTX 4090 and delivers quality that rivals models 10x its active parameter count.
Run Gemma 4 Locally: The Complete Guide
Why Run Gemma 4 Locally?
Before diving into setup, here is why local inference matters in 2026:
- Privacy — Your data never leaves your machine. No prompts sent to external servers. Critical for proprietary code, legal documents, medical data, or any sensitive information.
- Cost — Zero per-token cost after the one-time hardware investment. Heavy users save hundreds of dollars per month compared to API pricing.
- Latency — No network round trips. The E2B and E4B models respond in milliseconds on modern hardware.
- Reliability — No API rate limits, no outages, no provider policy changes. Your model is always available.
- Customization — Fine-tune, quantize, and modify the model freely under Apache 2.0.
- Offline access — Works without an internet connection once the model is downloaded.
Gemma 4 is particularly well-suited for local deployment because Google designed the smaller models specifically for edge and on-device use. The E2B and E4B models are not afterthoughts — they are first-class models optimized for the constraints of local hardware.
Prerequisites
Hardware Requirements by Model
| Model | Minimum RAM | Recommended VRAM | CPU-Only Viable? | Disk Space |
|---|---|---|---|---|
| E2B (4-bit) | 5 GB | 4 GB | Yes | ~1.5 GB |
| E4B (4-bit) | 5 GB | 4 GB | Yes | ~2.8 GB |
| E4B (FP16) | 9 GB | 9 GB | Slow | ~9 GB |
| 26B MoE (4-bit) | 18 GB | 16 GB | Very slow | ~15 GB |
| 26B MoE (FP16) | 52 GB | 48 GB | No | ~52 GB |
| 31B Dense (4-bit) | 20 GB | 18 GB | Very slow | ~18 GB |
| 31B Dense (FP16) | 62 GB | 48 GB+ | No | ~62 GB |
Key takeaway: If you have a laptop made after 2022, you can run E2B or E4B. If you have an RTX 4090 (24GB VRAM) or Apple M-series Mac with 32GB+ RAM, you can run the 26B MoE or 31B Dense at 4-bit quantization.
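The disk-space column follows almost directly from parameter count and bit width. A quick back-of-envelope check (the ~4.8 effective bits per weight for Q4_K_M, accounting for per-block scale factors, is an assumption; actual GGUF file sizes vary slightly):

```python
def approx_weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone (excludes KV cache)."""
    return round(params_billion * 1e9 * bits_per_weight / 8 / 1e9, 1)

# Q4_K_M stores roughly 4.8 effective bits per weight once per-block
# scales are included (assumption; exact value varies by format).
print(approx_weight_size_gb(31, 4.8))   # → 18.6, close to the ~18 GB row
print(approx_weight_size_gb(31, 16.0))  # → 62.0, matching the FP16 row
```

The same arithmetic explains why the 26B MoE at 4-bit lands near 15 GB on disk.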
Software Requirements
- Operating system: macOS, Linux, or Windows
- Ollama: Version 0.6+ (download from ollama.com)
- GPU drivers (optional): NVIDIA CUDA 12+ for NVIDIA GPUs, no extra drivers needed for Apple Silicon
Step 1: Install Ollama
macOS
Download from ollama.com/download or use Homebrew:
brew install ollama
Linux
One-line install script:
curl -fsSL https://ollama.com/install.sh | sh
Windows
Download the installer from ollama.com/download and run it. Ollama runs as a background service on Windows.
Verify Installation
ollama --version
You should see ollama version 0.6.x or higher; a version number confirms the installation succeeded.
Source: Ollama installation guide
Step 2: Pull a Gemma 4 Model
Choose the model that matches your hardware:
For Laptops and Light Workloads
# Smallest model — runs on any modern laptop (5GB RAM)
ollama pull gemma4:e2b
# Small model with broader capability (5-9GB RAM)
ollama pull gemma4:e4b
For Desktops with a Dedicated GPU
# Best efficiency — flagship quality at 3.8B active params (18GB RAM)
ollama pull gemma4:26b-moe
# Highest quality — full 31B parameters (20GB RAM)
ollama pull gemma4:31b
Specifying Quantization
By default, Ollama pulls the recommended quantization for each model (usually Q4_K_M for good quality-to-size balance). You can specify different quantizations:
# Higher quality, larger size
ollama pull gemma4:31b-q5_K_M
# Smaller size, slightly lower quality
ollama pull gemma4:31b-q3_K_M
# Full precision (requires much more RAM)
ollama pull gemma4:31b-fp16
The download will take a few minutes depending on your internet connection. Model sizes range from ~1.5GB (E2B 4-bit) to ~62GB (31B FP16).
Step 3: Run Gemma 4
Interactive Chat
ollama run gemma4:e4b
This opens an interactive chat session. Type your prompt and press Enter:
>>> What are the key differences between REST and GraphQL APIs?
The model will respond directly in your terminal. Type /bye to exit.
Single Prompt (Non-Interactive)
echo "Explain the Builder design pattern in Python with an example" | ollama run gemma4:26b-moe
With Thinking Mode
Gemma 4 supports configurable thinking mode for complex tasks. Enable it by adding a system prompt:
ollama run gemma4:31b --system "Think step by step before answering. Show your reasoning process."
For math, logic, and complex analysis tasks, the thinking mode significantly improves answer quality. The model will generate 4,000+ tokens of internal reasoning before producing its final response.
Step 4: Use the Local API
Ollama exposes a REST API on localhost:11434 that is compatible with the OpenAI API format. This means any tool or library that supports OpenAI's API can connect to your local Gemma 4 with a simple URL change.
Test the API with curl
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:26b-moe",
  "prompt": "Write a Python function to parse CSV files with error handling",
  "stream": false
}'
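Without `"stream": false`, the endpoint streams its reply as newline-delimited JSON, one fragment per line, ending with an object marked `"done": true`. A minimal sketch of reassembling the full text from those chunks:

```python
import json

def collect_stream(ndjson_lines):
    """Assemble the full reply from Ollama's streaming /api/generate output.

    Each line is a JSON object carrying a "response" text fragment;
    the final object has "done": true.
    """
    parts = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Canned chunks in the shape Ollama emits:
sample = [
    '{"response": "Hello", "done": false}',
    '{"response": ", world!", "done": true}',
]
print(collect_stream(sample))  # → Hello, world!
```

In a real client you would iterate over the HTTP response line by line instead of a list, printing each fragment as it arrives.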
OpenAI-Compatible Endpoint
curl http://localhost:11434/v1/chat/completions -d '{
  "model": "gemma4:26b-moe",
  "messages": [
    {"role": "user", "content": "Explain async/await in JavaScript"}
  ]
}'
Source: Ollama API documentation
Step 5: Integrate with Your Application
Python
import requests

def ask_gemma(prompt, model="gemma4:26b-moe"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
        },
    )
    return response.json()["response"]
# Usage
answer = ask_gemma("What is the time complexity of merge sort?")
print(answer)
Python with OpenAI SDK
from openai import OpenAI

# Point to local Ollama instead of OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Ollama doesn't require a real API key
)

response = client.chat.completions.create(
    model="gemma4:26b-moe",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a React hook for debounced search"},
    ],
)

print(response.choices[0].message.content)
Node.js / TypeScript
const response = await fetch("http://localhost:11434/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "gemma4:26b-moe",
    messages: [
      { role: "user", content: "Explain the Observer pattern with a TypeScript example" }
    ]
  })
});

const data = await response.json();
console.log(data.choices[0].message.content);
Using with LangChain
from langchain_community.llms import Ollama
llm = Ollama(model="gemma4:26b-moe")
response = llm.invoke("Summarize the key principles of clean architecture")
print(response)
Using with LlamaIndex
from llama_index.llms.ollama import Ollama
llm = Ollama(model="gemma4:26b-moe", request_timeout=120.0)
response = llm.complete("What are the SOLID principles in software engineering?")
print(response)
Quantization Options Explained
Quantization reduces model size and memory usage by using lower-precision numbers to represent model weights. The tradeoff is between quality and resource usage:
| Quantization | Bits per Weight | Quality Impact | Memory Savings | Best For |
|---|---|---|---|---|
| FP16 | 16 bits | None (full quality) | Baseline | Servers with ample VRAM |
| Q8_0 | 8 bits | Negligible | ~50% | High-quality local inference |
| Q6_K | 6 bits | Very minor | ~62% | Quality-focused local use |
| Q5_K_M | 5 bits | Minor | ~69% | Good balance |
| Q4_K_M | 4 bits | Small | ~75% | Recommended default |
| Q3_K_M | 3 bits | Moderate | ~81% | Constrained hardware |
| Q2_K | 2 bits | Significant | ~87% | Extreme constraints |
Q4_K_M is the sweet spot for most users. The quality difference from FP16 is small enough that most tasks produce indistinguishable results, while memory savings of 75% make the difference between "needs a server" and "runs on my laptop."
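The savings column above follows from the nominal bit widths (real GGUF quants also store per-block scale factors, so actual files run slightly larger). A quick check:

```python
def memory_savings(bits: float, baseline: float = 16.0) -> float:
    """Fraction of weight memory saved relative to FP16."""
    return 1 - bits / baseline

for name, bits in [("Q8_0", 8), ("Q6_K", 6), ("Q5_K_M", 5),
                   ("Q4_K_M", 4), ("Q3_K_M", 3), ("Q2_K", 2)]:
    print(f"{name}: {memory_savings(bits):.1%} saved")
```

The printed percentages match the table's figures once rounded.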
Choosing the Right Quantization
For Gemma 4 E2B/E4B: Use the default (Q4_K_M). These models are already small enough that higher quantization does not meaningfully change the user experience.
For Gemma 4 26B MoE: Q4_K_M fits in 18GB RAM, which is within an RTX 4090's 24GB VRAM with room for KV cache. If you have 48GB+ VRAM (A6000, dual GPUs), consider Q8_0 for marginally better quality.
For Gemma 4 31B Dense: Q4_K_M at 20GB fits in an RTX 4090 with tight margins. Q5_K_M produces slightly better results but requires ~24GB, consuming all available VRAM. If you have 32GB+ VRAM (RTX 5090, A6000), Q6_K or Q8_0 are worth the upgrade.
Performance Tuning
GPU Offloading
Ollama automatically offloads model layers to the GPU when VRAM is available. If only part of the model fits in VRAM, Ollama splits between GPU and CPU. You can control this:
# Force all layers to GPU (fails if insufficient VRAM)
OLLAMA_NUM_GPU=999 ollama run gemma4:26b-moe
# Force CPU only (useful for testing)
OLLAMA_NUM_GPU=0 ollama run gemma4:e4b
Context Window Configuration
By default, Ollama uses a context window of 2048 tokens for efficiency. To utilize Gemma 4's full context capabilities:
# Set context window to 32K tokens
ollama run gemma4:26b-moe --num-ctx 32768
# Set context window to 128K tokens (requires more RAM)
ollama run gemma4:26b-moe --num-ctx 131072
Important: Larger context windows consume more RAM for the KV cache. A 128K context window on the 31B model may require 8-16GB additional RAM beyond the model weights. Start with 32K and increase only if your use case requires it.
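The extra RAM comes from the KV cache, which grows linearly with context length: two tensors (K and V) per layer, per token. A rough estimator, using hypothetical architecture numbers (the layer count, KV-head count, and head dimension below are illustrative assumptions, not published Gemma 4 figures):

```python
def kv_cache_gb(context_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """KV cache size: 2 tensors (K and V) per layer, per token, at FP16."""
    total = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens
    return round(total / 1e9, 2)

# Hypothetical architecture for illustration only; Gemma 4's real
# layer/head counts may differ.
print(kv_cache_gb(32_768, n_layers=48, n_kv_heads=4, head_dim=128))   # → 3.22
print(kv_cache_gb(131_072, n_layers=48, n_kv_heads=4, head_dim=128))  # → 12.88
```

Under these assumptions, a 128K context costs roughly 13 GB of cache, which is consistent with the 8-16 GB range quoted above.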
Concurrent Requests
Ollama supports serving multiple requests simultaneously:
# Allow up to 4 concurrent requests
OLLAMA_NUM_PARALLEL=4 ollama serve
Each concurrent request adds memory overhead for its KV cache. On a 24GB GPU running the 26B MoE at Q4_K_M (~18GB), you have roughly 6GB headroom — enough for 2-3 concurrent requests with short contexts.
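That headroom arithmetic can be sketched as a helper (the ~2 GB per-request KV cost is an assumption that depends heavily on your num_ctx setting):

```python
def max_concurrent(vram_gb: float, model_gb: float, per_request_kv_gb: float) -> int:
    """How many simultaneous requests fit once the weights are loaded."""
    headroom = vram_gb - model_gb
    return max(0, int(headroom // per_request_kv_gb))

# 24 GB GPU, 26B MoE at Q4 (~18 GB of weights), assuming ~2 GB of
# KV cache per short-context request:
print(max_concurrent(24, 18, 2.0))  # → 3
```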
Keep-Alive Settings
By default, Ollama keeps models loaded in memory for 5 minutes after the last request. Adjust this for your use case:
# Keep model loaded for 1 hour
OLLAMA_KEEP_ALIVE=3600 ollama serve
# Keep model loaded indefinitely
OLLAMA_KEEP_ALIVE=-1 ollama serve
# Unload immediately after each request (saves memory)
OLLAMA_KEEP_ALIVE=0 ollama serve
NVIDIA RTX Optimization
NVIDIA has released optimized builds of Gemma 4 for RTX GPUs. These optimizations include:
- Custom CUDA kernels for Gemma 4's attention mechanism
- TensorRT-LLM integration for faster inference
- Flash Attention support for reduced memory usage during long-context inference
- Optimized KV cache management for better throughput
Installing NVIDIA-Optimized Gemma 4
If you have an RTX 4000 or 5000 series GPU:
# Check your GPU
nvidia-smi
# Pull the NVIDIA-optimized version (if available in Ollama)
ollama pull gemma4:31b-nvidia
Alternatively, use NVIDIA's AI Workbench or TensorRT-LLM directly for maximum performance. The NVIDIA-optimized versions can provide 30-50% faster inference on RTX GPUs compared to standard Ollama builds.
Real-World Performance Benchmarks
Measured on common hardware configurations:
Tokens per Second (Generation Speed)
| Model | RTX 4090 (24GB) | RTX 3090 (24GB) | M3 Max (36GB) | CPU Only (32GB) |
|---|---|---|---|---|
| E2B (Q4) | ~150 tok/s | ~120 tok/s | ~100 tok/s | ~30 tok/s |
| E4B (Q4) | ~100 tok/s | ~80 tok/s | ~70 tok/s | ~15 tok/s |
| 26B MoE (Q4) | ~40 tok/s | ~30 tok/s | ~25 tok/s | ~3 tok/s |
| 31B Dense (Q4) | ~30 tok/s | ~20 tok/s | ~20 tok/s | ~2 tok/s |
Context: Human reading speed is roughly 4-5 tokens per second. Any model generating above 10 tok/s feels "instant" for interactive use. The E2B and E4B models are fast enough for real-time streaming on almost any hardware.
Time to First Token (Latency)
| Model | RTX 4090 | M3 Max | CPU Only |
|---|---|---|---|
| E2B | <100ms | <200ms | <500ms |
| E4B | <200ms | <300ms | ~1s |
| 26B MoE | ~500ms | ~1s | ~5s |
| 31B Dense | ~800ms | ~1.5s | ~8s |
For interactive applications, time to first token matters more than generation speed. The E2B and E4B models start generating almost instantly even on CPU, making them ideal for real-time chat interfaces.
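Combining the two tables gives a feel for end-to-end response time: total time is time to first token plus tokens divided by generation speed.

```python
def response_time_s(ttft_s: float, n_tokens: int, tok_per_s: float) -> float:
    """End-to-end reply time: time to first token + generation time."""
    return round(ttft_s + n_tokens / tok_per_s, 1)

# A 300-token answer from the 26B MoE on an RTX 4090
# (~0.5 s TTFT, ~40 tok/s from the tables above):
print(response_time_s(0.5, 300, 40))  # → 8.0
```

With streaming, the user starts reading after the first half second, so the full 8 seconds is rarely perceived as waiting.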
Common Use Cases
Local Coding Assistant
Use Gemma 4 as a private coding assistant that never sends your code to external servers:
ollama run gemma4:26b-moe --system "You are an expert software engineer. When given code, analyze it for bugs, suggest improvements, and explain your reasoning. Be concise and practical."
Pair this with VS Code extensions like Continue or Twinny that support Ollama as a backend.
Document Analysis
Process sensitive documents locally:
echo "Analyze this contract clause and identify potential risks: [paste clause]" | ollama run gemma4:31b
With 256K context, the 31B model can process documents up to ~750 pages — sufficient for most contracts, research papers, and technical documentation.
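The page estimate assumes roughly 250 words per page at about 1.3 tokens per word, i.e. around 340 tokens per page (both figures are rough assumptions that vary by document):

```python
def pages_from_context(context_tokens: int, tokens_per_page: int = 340) -> int:
    """Rough page capacity of a context window.

    Assumes ~250 words/page at ~1.3 tokens/word (~340 tokens/page);
    dense technical text tokenizes less favorably.
    """
    return context_tokens // tokens_per_page

print(pages_from_context(262_144))  # 256K context → 771 pages
```

That lands in the same ballpark as the ~750-page figure above.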
Local RAG (Retrieval-Augmented Generation)
Combine Gemma 4 with a local vector database for a fully private RAG system:
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
# Use Gemma 4 for both embeddings and generation
embeddings = OllamaEmbeddings(model="gemma4:e4b")
llm = Ollama(model="gemma4:26b-moe")
# Create vector store from your documents
vectorstore = Chroma.from_documents(documents, embeddings)
# Query with RAG
retriever = vectorstore.as_retriever()
docs = retriever.get_relevant_documents("What is our refund policy?")
context = "\n".join([doc.page_content for doc in docs])
response = llm.invoke(f"Based on this context:\n{context}\n\nAnswer: What is our refund policy?")
Building AI Features into Applications
For developers building applications with AI capabilities, running Gemma 4 locally via Ollama's API is the fastest path to a working prototype. The OpenAI-compatible API means you can start with local Gemma 4 for development and switch to cloud APIs for production without changing application code.
Platforms like ZBuild can handle the application infrastructure — frontend, backend, authentication, database — while you focus on the AI integration layer. Point your application's AI endpoint to localhost:11434 during development and swap to a cloud endpoint when you are ready to scale.
Troubleshooting
"Out of memory" Errors
If you see memory errors:
- Try a smaller quantization: ollama pull gemma4:31b-q3_K_M
- Reduce the context window: --num-ctx 4096
- Close other GPU-intensive applications
- Switch to a smaller model: the 26B MoE delivers near-31B quality at lower memory cost
Slow Generation Speed
If generation is slower than expected:
- Check GPU utilization with nvidia-smi (it should show high GPU usage)
- Ensure the model fits entirely in VRAM — partial CPU offloading is dramatically slower
- Reduce --num-ctx to free VRAM for compute
- Check whether other processes are using the GPU
Model Not Found
If ollama run gemma4:26b-moe fails:
# List available models
ollama list
# Search for Gemma 4 models
ollama search gemma4
# Pull the specific model
ollama pull gemma4:26b-moe
API Connection Refused
If applications cannot connect to localhost:11434:
# Check if Ollama is running
ollama list
# Start the Ollama server manually
ollama serve
# Check the port
curl http://localhost:11434/api/tags
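A small self-contained health check is handy before debugging application code; this sketch probes the same /api/tags endpoint as the curl command above:

```python
import urllib.error
import urllib.request

def ollama_reachable(base_url: str = "http://localhost:11434",
                     timeout: float = 2.0) -> bool:
    """True if the Ollama server answers on its /api/tags endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

# With no server listening on the port, the check fails cleanly
# instead of raising:
print(ollama_reachable("http://127.0.0.1:9"))  # → False
```

Call this at application startup and surface a clear "start Ollama first" message instead of a raw connection error.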
Model Selection Decision Tree
Use this to quickly choose the right model:
Do you have a dedicated GPU with 16GB+ VRAM?
- Yes → Do you want maximum quality or maximum efficiency?
  - Maximum quality → gemma4:31b (Q4_K_M, needs 20GB)
  - Maximum efficiency → gemma4:26b-moe (Q4_K_M, needs 18GB)
- No → Do you have 8GB+ RAM?
  - Yes → gemma4:e4b (Q4_K_M, better quality)
  - No → gemma4:e2b (Q4_K_M, runs on 5GB)
For most developers with a modern desktop or gaming PC: Start with gemma4:26b-moe. It offers the best quality-to-resource ratio in the entire Gemma 4 family.
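The decision tree above can be encoded as a small helper, for example:

```python
def pick_model(vram_gb: float, ram_gb: float, prefer_quality: bool = False) -> str:
    """Return the Ollama model tag suggested by the decision tree."""
    if vram_gb >= 16:
        return "gemma4:31b" if prefer_quality else "gemma4:26b-moe"
    if ram_gb >= 8:
        return "gemma4:e4b"
    return "gemma4:e2b"

print(pick_model(vram_gb=24, ram_gb=32))                       # → gemma4:26b-moe
print(pick_model(vram_gb=24, ram_gb=32, prefer_quality=True))  # → gemma4:31b
print(pick_model(vram_gb=0, ram_gb=16))                        # → gemma4:e4b
print(pick_model(vram_gb=0, ram_gb=6))                         # → gemma4:e2b
```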
What You Can Build
With Gemma 4 running locally, you have a zero-cost AI backend for:
- Chat applications with full conversation privacy
- Code analysis tools that work on proprietary codebases
- Document processing pipelines for sensitive data
- Local AI assistants that work offline
- Prototype AI features before committing to cloud API costs
- Fine-tuned models for domain-specific tasks (Apache 2.0 allows this freely)
The Apache 2.0 license means everything you build is yours — no usage restrictions, no revenue sharing, no approval needed. Run it locally, deploy it on your servers, embed it in your products. This is what truly open AI looks like.
Sources
- Gemma 4 Announcement - Google Blog
- Gemma 4 on Ollama
- Ollama Installation Guide
- Ollama API Documentation
- NVIDIA Gemma 4 RTX Optimization
- Gemma 4 Technical Report - Google DeepMind
- Gemma 4 Hugging Face Models
- Continue.dev - Local AI Code Assistant
- LangChain Ollama Integration
- Google AI for Developers - Gemma