Local LLMOllamaGemma 4MoE+1

Run Gemma 4 Locally with Ollama

Gemma 4 now has a 12B laptop lane, MTP drafters, and QAT checkpoints. Pick an Ollama tag, tune context, and keep local inference on the fast memory path.

LeetLLM TeamApril 2, 2026Updated July 11, 202613 min read

A local assistant can read private code, summarize design docs, and test prompts without sending files to a hosted model. Running locally solves privacy, but it creates a fit problem: which tag runs on your machine without crawling?

Gemma 4 and Ollama are a clean starting point. Gemma 4 is Google's Apache 2.0 open-weights family, with dense and Mixture-of-Experts (MoE) variants, native system-prompt support, multimodal inputs, and local tags with 128K or 256K context windows.^{[1]Reference 1Gemma 4 Model Cardhttps://ai.google.dev/gemma/docs/core/model_card_4}^{[2]Reference 2gemma4 tagshttps://ollama.com/library/gemma4/tags} Ollama provides the local server, registry, CLI, and API wrapper.^{[3]Reference 3Ollama GitHub Repositoryhttps://github.com/ollama/ollama}

June 2026 updates changed the recommendation. Google documents Gemma 4 in five sizes: E2B, E4B, 12B, 26B A4B, and 31B. The family also has multi-token prediction (MTP) drafters and quantized checkpoints for lower memory pressure.^{[4]Reference 4Introducing Gemma 4 12B: a unified, encoder-free multimodal modelhttps://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/}^{[5]Reference 5Accelerating Gemma 4: faster inference with multi-token prediction draftershttps://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/}^{[6]Reference 6Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiencyhttps://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/} Start with E4B, try QAT when memory is tight, test 12B when you want more quality, and move to 26B or 31B only with clear headroom.

Pick the right tag

A tag chooses the weights, context limit, input modes, and runtime footprint Ollama loads. Current local tags split into laptop and workstation lanes.^{[7]Reference 7gemma4https://ollama.com/library/gemma4}^{[2]Reference 2gemma4 tagshttps://ollama.com/library/gemma4/tags} Use gemma4:e2b for tight memory, gemma4:e4b as the default local assistant, and gemma4:12b when E4B already stays fast. Use gemma4:26b as the first workstation MoE test and gemma4:31b only when a dense 20 GB artifact still leaves context headroom.

Dense models run most of the model for every generated token; gemma4:31b is the heavy dense option.^{[1]Reference 1Gemma 4 Model Cardhttps://ai.google.dev/gemma/docs/core/model_card_4} MoE models have many expert blocks, but activate a smaller path for each token. Google labels Gemma 4 26B as an A4B MoE, and Ollama describes the 26B tag as a 4B-active workstation model.^{[1]Reference 1Gemma 4 Model Cardhttps://ai.google.dev/gemma/docs/core/model_card_4}^{[7]Reference 7gemma4https://ollama.com/library/gemma4}

The gemma4:cloud and gemma4:31b-cloud tags are different. They are Ollama's cloud paths, not local artifacts to size against your GPU.^{[2]Reference 2gemma4 tagshttps://ollama.com/library/gemma4/tags}

Gemma 4 tag selection flow that starts from memory headroom, tries QAT when memory is tight, uses 12B as middle lane, and reaches 26B or 31B only with workstation budget — Pick the smallest tag that fits comfortably. Try QAT for fit, then 12B for quality, then workstation tags only when memory margin remains.

First choice	Use it when	Move away when
`gemma4:e2b-it-qat`	Memory is the hard limit and you need a smoke-test lane	Quality misses the task after the runtime stays on fast memory
`gemma4:e4b`	You want the default laptop assistant and a simple baseline	E4B is stable and a measured quality gap justifies 12B
`gemma4:12b`	You have headroom after context and process overhead	`ollama ps` shows CPU offload or latency breaks the budget
`gemma4:26b`	You have workstation memory and want to test the MoE path	The smaller tag meets the same eval target with better latency
`gemma4:31b`	You specifically need the dense large-model comparison	Its 20 GB artifact leaves too little room for context and cache

Diagram showing Memory budget, E2B or E4B plain or QAT, Fast-memory headroom?, and yes, need quality. — Memory budget, E2B or E4B plain or QAT, Fast-memory headroom?, and yes, need quality.

Fit rule: Keep the smallest tag that passes your eval and remains on fast memory at the context length you actually use.

Default to gemma4:e4b. As of July 11, 2026, Ollama still marks E4B as the latest local tag even though 12B and QAT variants are listed.^{[7]Reference 7gemma4https://ollama.com/library/gemma4} Pin exact tags in scripts.

Fit beats size

Published size is the download artifact, not the full runtime budget. At inference time, the machine also needs memory for loaded weights, prompt tokens, image tokens, the key-value (KV) cache, and process overhead.

The KV cache stores attention keys and values from previous tokens so the model doesn't recompute the whole conversation every step. Because it grows with context length, a tag can fit at 4K and become painful at 64K.

Use a conservative fit order:

8-12 GB: start with gemma4:e2b-it-qat or gemma4:e2b.
16 GB: test gemma4:e4b-it-qat, gemma4:e4b, then gemma4:12b only if E4B has margin.
24 GB: try gemma4:12b before gemma4:26b-a4b-it-qat or gemma4:26b.
32 GB and above: test gemma4:26b before dense gemma4:31b.

Gemma 4 local model-fit tiers showing compact QAT options, a 12B middle lane, and larger 26B or 31B workstation choices with memory headroom called out. — QAT savings vary by model: E2B and E4B shrink much more than 12B or 31B. These are published artifact sizes, not complete runtime budgets; context, KV cache, image inputs, and process overhead still need headroom.

Rule: a smaller model that stays fully on fast memory usually beats a bigger model that barely fits. If gemma4:26b and gemma4:31b both look possible, test 26B first because its MoE active path is much smaller.^{[1]Reference 1Gemma 4 Model Cardhttps://ai.google.dev/gemma/docs/core/model_card_4} If gemma4:e4b almost fits, try the QAT tag before dropping to E2B.

What QAT changes

Normal post-training quantization compresses a trained model after training. That often works well, but the model didn't learn under those low-precision constraints.

Quantization-aware training (QAT) changes that sequence. The model trains with simulated low-precision behavior, so the checkpoint is better prepared for compressed inference.^{[6]Reference 6Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiencyhttps://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/}

Ollama's current tags show the desktop effect: E2B drops from 7.2 GB to 4.3 GB with QAT, E4B from 9.6 GB to 6.1 GB, 12B from 7.6 GB to 7.2 GB, 26B from 18 GB to 16 GB, and 31B from 20 GB to 19 GB.^{[2]Reference 2gemma4 tagshttps://ollama.com/library/gemma4/tags} Don't read QAT as "same quality, no tradeoff, always much smaller." The biggest published size drops are on E2B and E4B. QAT improves the chance that a useful model stays within your memory budget.

What MTP changes

Standard decoding generates one token, appends it to context, then runs the model again. That loop is reliable, but slow when the next few tokens are predictable.

Multi-token prediction (MTP) adds a drafter path. The drafter proposes likely next tokens, and the target Gemma 4 model verifies them. Accepted drafts give you multiple output tokens from fewer expensive target-model steps. Rejected drafts fall back to the target model's verified output.^{[5]Reference 5Accelerating Gemma 4: faster inference with multi-token prediction draftershttps://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/}

Gemma 4 local optimization stack showing QAT as memory compression and MTP as decode acceleration before Ollama returns streamed tokens — QAT and MTP solve different bottlenecks. QAT reduces weight pressure; MTP reduces one-token-at-a-time decode latency when the runtime exposes a drafter path.

MTP helps generated-token throughput, not model fit. It doesn't remove prompt-processing cost, image-processing cost, or KV-cache memory. Ollama's visible specialized MTP tag is gemma4:31b-coding-mtp-bf16, a 64 GB text-and-image artifact, so it isn't the laptop starting point.^{[2]Reference 2gemma4 tagshttps://ollama.com/library/gemma4/tags} Add MTP after the base or QAT tag fits, then keep it only if sustained tokens per second improves.

Install and run

Install Ollama from the official project. On Linux, the install script is fastest. On macOS and Windows, use the desktop app from Ollama.^{[3]Reference 3Ollama GitHub Repositoryhttps://github.com/ollama/ollama} Then pull one explicit tag:

terminal

curl -fsSL https://ollama.com/install.sh | sh

ollama --version
ollama pull gemma4:e4b
ollama run gemma4:e4b

Swap in gemma4:e4b-it-qat when memory is tight. Try gemma4:12b only after E4B stays fast under your real context length.

After the first response, run ollama ps. The PROCESSOR column tells you whether the model stayed on GPU or spilled partly to CPU.^{[8]Reference 8FAQ - Ollamahttps://docs.ollama.com/faq} Spilling is often why a model "works" but feels unusable.

Verify the service before testing quality

Separate installation failures from model failures. Check the CLI, local HTTP service, downloaded tag, and loaded process in that order:

verify-gemma4.sh

ollama --version
curl -s http://localhost:11434/api/tags
ollama pull gemma4:e4b
ollama run gemma4:e4b "Reply with exactly: local-ready"
ollama ps

The response text verifies generation. It doesn't prove that the model fits well. Read ollama ps after generation and record its NAME, PROCESSOR, and CONTEXT columns. A successful answer with partial CPU placement is a valid smoke test but a poor latency baseline.

If curl can't connect, start ollama serve and repeat the HTTP check. A responsive service with a missing tag needs another pull with the full error preserved. Working generation with bad placement calls for a context or tag change rather than an Ollama reinstall.

Use the local API

Ollama exposes OpenAI-compatible /v1/chat/completions and /v1/responses endpoints, so existing SDK code can point at http://localhost:11434/v1/ with a placeholder key like ollama.^{[9]Reference 9OpenAI compatibility - Ollamahttps://docs.ollama.com/api/openai-compatibility} Use that path when you want one integration surface across hosted and local models.

Ollama local inference workflow for Gemma 4 showing app requests through Ollama server, selected tag, local memory fit, and streamed tokens returning through same endpoint — Keep the integration surface stable while you tune the tag, context, and memory placement underneath it.

Use the native Ollama API when you need model-specific controls, such as Gemma 4's thinking token, sampling settings, or direct num_ctx options. Send model: "gemma4:e4b", normal chat messages, and options like {"temperature": 1.0, "top_p": 0.95, "top_k": 64, "num_ctx": 32768}. The Ollama Gemma 4 page recommends those sampling defaults and enables thinking with <|think|> at the start of the system prompt.^{[7]Reference 7gemma4https://ollama.com/library/gemma4} Keep prior thinking blocks out of durable chat history unless your application has a specific reason to store them.

For multimodal prompts, Google's model card gives model-input ordering guidance: place image content before the text prompt, but place audio content after the text prompt.^{[1]Reference 1Gemma 4 Model Cardhttps://ai.google.dev/gemma/docs/core/model_card_4} Ollama controls the local request format and runtime support. Its main local Gemma 4 tags currently list text and image input, so don't infer local audio support from Google's model-level guidance; verify the exact Ollama tag and runtime before planning an audio workflow.^{[2]Reference 2gemma4 tagshttps://ollama.com/library/gemma4/tags}

Runtime boundary: Google documents how Gemma 4 expects supported modalities to be ordered. Ollama documentation and tag metadata determine which of those modalities your local endpoint can accept.

Tune context before downgrading

A model's maximum context isn't necessarily what Ollama allocates by default. Ollama's current context docs say defaults are tiered by VRAM: less than 24 GiB gets 4K, 24-48 GiB gets 32K, and 48 GiB or more gets 256K.^{[10]Reference 10Context length - Ollamahttps://docs.ollama.com/context-length} The FAQ still describes 4096 tokens as the baseline default and shows override paths.^{[8]Reference 8FAQ - Ollamahttps://docs.ollama.com/faq}

Set the smallest context window that fits the task. For the daemon, start with OLLAMA_CONTEXT_LENGTH=32768 ollama serve; inside ollama run, use /set parameter num_ctx 32768; and through the native API, send "options": {"num_ctx": 32768}.

Verify allocation

Then run ollama ps and check CONTEXT plus PROCESSOR. If the model is split across CPU and GPU, reduce num_ctx, try the matching QAT tag, or move down a model size.^{[10]Reference 10Context length - Ollamahttps://docs.ollama.com/context-length}^{[8]Reference 8FAQ - Ollamahttps://docs.ollama.com/faq}

OpenAI-compatible requests don't set context size per request. For that path, create a derived model with a Modelfile, such as FROM gemma4:e4b plus PARAMETER num_ctx 32768, then call ollama create.^{[9]Reference 9OpenAI compatibility - Ollamahttps://docs.ollama.com/api/openai-compatibility}

Common failure modes

Common failures usually come from mixing up bottlenecks:

Latency climbs after first tokens: KV cache probably pushed the model into offload; lower num_ctx and check ollama ps.
Larger tag feels worse than smaller tag: prefer the smaller model that stays fully on GPU.
Standard tag almost fits: try the matching QAT tag.
MTP doesn't improve first-token time: measure time to first token and tokens per second separately because MTP helps decode throughput, not prompt ingestion.
Audio workflow fails: verify Ollama runtime support before building the feature.

Debugging habit: separate model fit, context fit, and decode speed. QAT helps model fit. num_ctx helps context fit. MTP helps decode speed when your runtime exposes a drafter path.

Diagnose one layer at a time

Use symptoms to choose the next measurement:

Symptom	Check	Likely next move
Connection refused on port 11434	Ollama service status	Start or restart the service
Pull stops repeatedly	Free disk space and network error	Free space, then retry the same tag
First prompt crashes	`num_ctx` and system memory	Restart at a shorter context
First token is slow	Prompt length and `load_duration`	Warm the model, then shorten the prompt
Decode starts fast and degrades	`ollama ps` placement	Reduce context or use a smaller tag
QAT fits but quality drops	Fixed task-level acceptance tests	Keep standard E4B or test 12B with headroom

Don't compare tags while changing prompt, context, and sampling together. That produces a number but hides its cause. Keep a written run record so a later runtime update doesn't get mistaken for a model-quality change.

Practice

Build a small benchmark around the local API. Run the same prompt against gemma4:e4b, gemma4:e4b-it-qat, gemma4:12b, and gemma4:26b-a4b-it-qat if your machine has workstation headroom. Record time to first token, tokens per second, num_ctx, ollama ps PROCESSOR, and image usage. That result tells you whether the limit is weight memory, context pressure, or decode throughput.

Build a reproducible benchmark

Use three prompt classes instead of one favorite prompt:

A short extraction request tests instruction following with little prompt cost.
A repository explanation request tests a realistic context packet.
A fixed generation request tests sustained decode speed.

Run each prompt once to warm the model, then run measured trials with the same num_ctx, sampling settings, and output cap. Store raw native-API responses because Ollama includes prompt and generation token counts plus timing fields.^{[9]Reference 9OpenAI compatibility - Ollamahttps://docs.ollama.com/api/openai-compatibility} Don't compare an image request with a text-only request or a cold load with a warm process.

This shell request disables streaming so one JSON object contains the complete timing record:

benchmark-gemma4.sh

curl -s http://localhost:11434/api/generate \
  -d '{
    "model": "gemma4:e4b",
    "prompt": "List three checks for a failing unit test.",
    "stream": false,
    "options": {
      "num_ctx": 8192,
      "temperature": 0
    }
  }' > gemma4-e4b-run.json

Inspect the fields without inventing a machine-independent target:

inspect-gemma4-run.sh

jq '{
  model,
  done,
  prompt_eval_count,
  eval_count,
  total_duration,
  load_duration,
  prompt_eval_duration,
  eval_duration
}' gemma4-e4b-run.json

Compute rates from each response instead of copying a published result. Durations are reported in nanoseconds, so generated-token throughput is eval_count / (eval_duration / 1e9). Record machine, Ollama version, tag, context, placement, and whether the run was cold or warm beside that value.

Quality needs a separate score. Write expected properties before running models, such as "names all three failing assertions" or "returns valid JSON matching schema." A faster tag loses if it misses the required behavior. Keep the smallest tag that passes those checks and meets your latency budget.

Preserve a run receipt

A useful receipt has enough context to repeat the comparison later:

gemma4-run-receipt.txt

machine: <CPU, GPU, and memory>
ollama_version: <ollama --version>
tag: gemma4:e4b
context: 8192
processor: <ollama ps PROCESSOR>
prompt_pack: <path or commit>
sampling: temperature=0
warm_state: warm
raw_response: gemma4-e4b-run.json

Repeat the same receipt for QAT and 12B. If only one field changes, the comparison can answer a specific question. If several fields change, treat it as a new experiment rather than a continuation.

What to do next

Gemma 4 is a strong local pick when you want an Apache 2.0 family, laptop-scale tags, a 12B middle lane, QAT checkpoints, a MoE workstation option, and a simple local API.^{[1]Reference 1Gemma 4 Model Cardhttps://ai.google.dev/gemma/docs/core/model_card_4}^{[4]Reference 4Introducing Gemma 4 12B: a unified, encoder-free multimodal modelhttps://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/}^{[6]Reference 6Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiencyhttps://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/}^{[9]Reference 9OpenAI compatibility - Ollamahttps://docs.ollama.com/api/openai-compatibility} Start with gemma4:e4b, use QAT when memory is tight, test 12B with margin, and try 26B only when ollama ps keeps model plus context on the fast path.^{[7]Reference 7gemma4https://ollama.com/library/gemma4}^{[10]Reference 10Context length - Ollamahttps://docs.ollama.com/context-length}

PreviousBest AI Plans for OpenClaw in 2026 NextvLLM vs SGLang vs TensorRT-LLM vs Ollama: Choosing an Inference Engine in 2026

Share this article

X Facebook LinkedIn Bluesky Reddit Hacker News Email

References

Gemma 4 Model Card

Gemma Team, Google DeepMind · 2026

gemma4 tags

Ollama · 2026

Ollama GitHub Repository

Ollama Team · 2026

Introducing Gemma 4 12B: a unified, encoder-free multimodal model

Google · 2026

Accelerating Gemma 4: faster inference with multi-token prediction drafters

Google · 2026

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

Google · 2026

gemma4

Ollama · 2026

FAQ - Ollama

Ollama · 2026

OpenAI compatibility - Ollama

Ollama · 2026

Context length - Ollama

Ollama · 2026

Blog

Local LLMOllamaGemma 4MoE+1

Run Gemma 4 Locally with Ollama

Gemma 4 now has a 12B laptop lane, MTP drafters, and QAT checkpoints. Pick an Ollama tag, tune context, and keep local inference on the fast memory path.

LeetLLM TeamApril 2, 2026Updated July 11, 202613 min read

Pick the right tag

First choice	Use it when	Move away when
`gemma4:e2b-it-qat`	Memory is the hard limit and you need a smoke-test lane	Quality misses the task after the runtime stays on fast memory
`gemma4:e4b`	You want the default laptop assistant and a simple baseline	E4B is stable and a measured quality gap justifies 12B
`gemma4:12b`	You have headroom after context and process overhead	`ollama ps` shows CPU offload or latency breaks the budget
`gemma4:26b`	You have workstation memory and want to test the MoE path	The smaller tag meets the same eval target with better latency
`gemma4:31b`	You specifically need the dense large-model comparison	Its 20 GB artifact leaves too little room for context and cache

Fit rule: Keep the smallest tag that passes your eval and remains on fast memory at the context length you actually use.

Fit beats size

Use a conservative fit order:

8-12 GB: start with gemma4:e2b-it-qat or gemma4:e2b.
16 GB: test gemma4:e4b-it-qat, gemma4:e4b, then gemma4:12b only if E4B has margin.
24 GB: try gemma4:12b before gemma4:26b-a4b-it-qat or gemma4:26b.
32 GB and above: test gemma4:26b before dense gemma4:31b.

What QAT changes

Normal post-training quantization compresses a trained model after training. That often works well, but the model didn't learn under those low-precision constraints.

What MTP changes

Standard decoding generates one token, appends it to context, then runs the model again. That loop is reliable, but slow when the next few tokens are predictable.

Install and run

terminal

curl -fsSL https://ollama.com/install.sh | sh

ollama --version
ollama pull gemma4:e4b
ollama run gemma4:e4b

Swap in gemma4:e4b-it-qat when memory is tight. Try gemma4:12b only after E4B stays fast under your real context length.

Verify the service before testing quality

Separate installation failures from model failures. Check the CLI, local HTTP service, downloaded tag, and loaded process in that order:

verify-gemma4.sh

ollama --version
curl -s http://localhost:11434/api/tags
ollama pull gemma4:e4b
ollama run gemma4:e4b "Reply with exactly: local-ready"
ollama ps

Use the local API

Runtime boundary: Google documents how Gemma 4 expects supported modalities to be ordered. Ollama documentation and tag metadata determine which of those modalities your local endpoint can accept.

Tune context before downgrading

Verify allocation

Common failure modes

Common failures usually come from mixing up bottlenecks:

Latency climbs after first tokens: KV cache probably pushed the model into offload; lower num_ctx and check ollama ps.
Larger tag feels worse than smaller tag: prefer the smaller model that stays fully on GPU.
Standard tag almost fits: try the matching QAT tag.
MTP doesn't improve first-token time: measure time to first token and tokens per second separately because MTP helps decode throughput, not prompt ingestion.
Audio workflow fails: verify Ollama runtime support before building the feature.

Debugging habit: separate model fit, context fit, and decode speed. QAT helps model fit. num_ctx helps context fit. MTP helps decode speed when your runtime exposes a drafter path.

Diagnose one layer at a time

Use symptoms to choose the next measurement:

Symptom	Check	Likely next move
Connection refused on port 11434	Ollama service status	Start or restart the service
Pull stops repeatedly	Free disk space and network error	Free space, then retry the same tag
First prompt crashes	`num_ctx` and system memory	Restart at a shorter context
First token is slow	Prompt length and `load_duration`	Warm the model, then shorten the prompt
Decode starts fast and degrades	`ollama ps` placement	Reduce context or use a smaller tag
QAT fits but quality drops	Fixed task-level acceptance tests	Keep standard E4B or test 12B with headroom

Practice

Build a reproducible benchmark

Use three prompt classes instead of one favorite prompt:

A short extraction request tests instruction following with little prompt cost.
A repository explanation request tests a realistic context packet.
A fixed generation request tests sustained decode speed.

This shell request disables streaming so one JSON object contains the complete timing record:

benchmark-gemma4.sh

curl -s http://localhost:11434/api/generate \
  -d '{
    "model": "gemma4:e4b",
    "prompt": "List three checks for a failing unit test.",
    "stream": false,
    "options": {
      "num_ctx": 8192,
      "temperature": 0
    }
  }' > gemma4-e4b-run.json

Inspect the fields without inventing a machine-independent target:

inspect-gemma4-run.sh

jq '{
  model,
  done,
  prompt_eval_count,
  eval_count,
  total_duration,
  load_duration,
  prompt_eval_duration,
  eval_duration
}' gemma4-e4b-run.json

Preserve a run receipt

A useful receipt has enough context to repeat the comparison later:

gemma4-run-receipt.txt

machine: <CPU, GPU, and memory>
ollama_version: <ollama --version>
tag: gemma4:e4b
context: 8192
processor: <ollama ps PROCESSOR>
prompt_pack: <path or commit>
sampling: temperature=0
warm_state: warm
raw_response: gemma4-e4b-run.json

Repeat the same receipt for QAT and 12B. If only one field changes, the comparison can answer a specific question. If several fields change, treat it as a new experiment rather than a continuation.

What to do next

PreviousBest AI Plans for OpenClaw in 2026 NextvLLM vs SGLang vs TensorRT-LLM vs Ollama: Choosing an Inference Engine in 2026

Share this article

X Facebook LinkedIn Bluesky Reddit Hacker News Email

References

Gemma 4 Model Card

Gemma Team, Google DeepMind · 2026

gemma4 tags

Ollama · 2026

Ollama GitHub Repository

Ollama Team · 2026

Introducing Gemma 4 12B: a unified, encoder-free multimodal model

Google · 2026

Accelerating Gemma 4: faster inference with multi-token prediction drafters

Google · 2026

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

Google · 2026

gemma4

Ollama · 2026

FAQ - Ollama

Ollama · 2026

OpenAI compatibility - Ollama

Ollama · 2026

Context length - Ollama

Ollama · 2026