Qwen3.5 in Ollama spans primary aliases from 0.8B to 122B plus explicit quantized variants. This guide shows how to choose the right local tag, keep context size realistic, and expose it through Ollama's OpenAI-compatible API.

Imagine you're prototyping a customer-support assistant for a small SaaS app. You'd like to test it on real account data, but you don't want to send private customer data and refund histories to a third-party API. A local model running on your own machine keeps the data inside your network, avoids a per-token API bill during experiments, and still gives you a capable large language model (LLM) to work with.
That's the practical case for running Qwen3.5 through Ollama. The value isn't only avoiding API bills; it's learning how inference engines, model weights, memory budgets, and context windows fit together. Those same concepts show up in technical interviews for AI engineering roles, and they matter when you later move to production serving stacks like vLLM or TGI.
This guide teaches the operator view: how to choose a model tag that fits your hardware, how Ollama serves it, and how to keep the setup bounded once it is running.
Different local path: This guide uses Ollama's Qwen3.5 tags, Modelfiles, and local API. The newer Qwen3.6 Unsloth GGUF guide uses pinned GGUF files, llama.cpp commands, explicit quant choices, and a lower-level runtime view for coding and agent workflows.[1]
The official Ollama library page for Qwen3.5 lists seven primary local aliases that range from pocket-sized to server-class:[2]
| Primary alias | Published size | Model max context | Input modes |
|---|---|---|---|
| qwen3.5:0.8b | 1.0 GB | 256K | Text, Image |
| qwen3.5:2b | 2.7 GB | 256K | Text, Image |
| qwen3.5:4b | 3.4 GB | 256K | Text, Image |
| qwen3.5:9b | 6.6 GB | 256K | Text, Image |
| qwen3.5:27b | 17 GB | 256K | Text, Image |
| qwen3.5:35b | 24 GB | 256K | Text, Image |
| qwen3.5:122b | 81 GB | 256K | Text, Image |
Those seven rows are the easy-to-read front door, not the whole catalog. The tags view on the same page also exposes explicit variants such as qwen3.5:9b-q8_0, qwen3.5:9b-bf16, qwen3.5:9b-mxfp8, qwen3.5:27b-int4, and qwen3.5:35b-a3b-int4.[2]
The alias relationships matter if you care about reproducibility:[2]

- qwen3.5:latest is shown as qwen3.5:9b on the cited library page
- qwen3.5:9b is the default Q4_K_M artifact, so qwen3.5:9b-q4_K_M is the more explicit pin
- qwen3.5:27b follows the same pattern with qwen3.5:27b-q4_K_M
- qwen3.5:35b maps to qwen3.5:35b-a3b, so the 35B tier in Ollama is the A3B sparse MoE model

The 35b and 122b tags aren't dense models in the traditional sense. They're based on a Mixture of Experts (MoE) architecture. Think of an MoE model as a large panel of specialists where only a few experts are consulted for each token. The total parameter count is huge, but the active count per token is much smaller: the 35B-A3B model activates 3B parameters, while the 122B-A10B model activates 10B.[3][4] That's why an MoE tag can have a very different runtime profile from a dense checkpoint with the same total parameter count.
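
To make the total-versus-active distinction concrete, here is a minimal sketch that computes the share of parameters consulted per token for the two MoE tiers. It uses only the counts quoted above; the function name is illustrative. The ratio describes per-token compute, not memory, since the full artifact still has to be loaded.

```python
def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters used per token (both counts in billions)."""
    if total_b <= 0 or not 0 < active_b <= total_b:
        raise ValueError("expected 0 < active_b <= total_b")
    return active_b / total_b


# Counts taken from the model names: 35B-A3B and 122B-A10B.
moe_tiers = {"qwen3.5:35b": (35.0, 3.0), "qwen3.5:122b": (122.0, 10.0)}

for tag, (total_b, active_b) in moe_tiers.items():
    share = active_fraction(total_b, active_b)
    print(f"{tag}: {active_b:g}B of {total_b:g}B active per token ({share:.1%})")
```

Both tiers land below a tenth of their total parameters per token, which is why their throughput can look closer to a small dense model while their memory footprint does not.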
That also explains an easy gotcha: the default aliases keep Text, Image input, but many optimized variants are text-only. If you need vision support, check the Input column before pinning -mxfp8, -nvfp4, -mlx-bf16, -int4, -int8, or -coding-* tags.[2]
The current Ollama page marks the Qwen3.5 family as supporting both thinking and non-thinking (fast) modes, plus native tool calling.[2] Thinking mode trades latency and tokens for more deliberate reasoning, so for quick local glue tasks the non-thinking path is usually the better default.
The same page also lists qwen3.5:cloud and qwen3.5:397b-cloud, but those aren't local artifacts and don't belong in a local hardware plan.[2]
It also highlights direct integrations with Claude Code, Codex, OpenCode, and OpenClaw through ollama launch ... --model qwen3.5 commands.[2]
That's why Qwen3.5 is attractive for local workflows. You don't need to wrap raw checkpoints yourself. You get a published model tag, a local server, and an OpenAI-compatible API surface.
The local setup path is short, but each step changes a different part of the system:
Before you pick a tag, it helps to picture what happens when you send a prompt.
Ollama is the inference engine. It manages model tags, downloads weights, and routes requests. Under the hood, it uses a compiled runtime backed by llama.cpp to turn tokens into predictions.[5][6] The Qwen3.5 weights are the fuel: the knowledge stored in billions of tuned parameters. The engine can't run without fuel, and the fuel can't produce text without an engine.
The request path below separates the local application programming interface (API) from the runtime and memory placement, which is the distinction that matters when a model feels slow but has not crashed.
Check ollama ps for real context budget and CPU offload.
That architecture gives you three immediate benefits:
When you run ollama run qwen3.5:9b, here's what happens under the hood:
The important detail is step 3. Loading a 6.6 GB model file doesn't mean you only need 6.6 GB of free memory. The runtime also needs space for the KV cache, which stores intermediate attention calculations so the model doesn't recompute everything from scratch on every new token. As your conversation grows, the cache grows too.
Use the published Ollama artifact size as the first filter, then leave extra headroom for context, KV cache, and the rest of your system. A 6.6 GB model file isn't the same thing as "only 6.6 GB required."
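
A rough way to see why the cache needs its own headroom is the generic estimate for a standard attention stack: keys and values are stored per layer, per KV head, per token. The sketch below uses that formula with a hypothetical configuration. The layer count, head count, and head dimension are placeholders, not Qwen3.5's published architecture, so treat the numbers as an illustration of how the cache scales with context rather than a measurement.

```python
def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: float = 2.0) -> float:
    """Rough KV cache size in GiB for a standard attention stack.

    Two tensors (keys and values) are stored per layer, per KV head, per token.
    """
    values = 2 * num_layers * num_kv_heads * head_dim * context_tokens
    return values * bytes_per_value / 1024**3


# Hypothetical configuration for illustration only; not Qwen3.5's real architecture.
example = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for ctx in (4096, 8192, 32768):
    size = kv_cache_gib(context_tokens=ctx, **example)
    print(f"{ctx:>6} tokens -> {size:.2f} GiB at 2 bytes per value")
```

The cache grows linearly with context, so an eightfold increase in context means an eightfold increase in cache memory on top of the loaded weights.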
The alias guide below is a sizing map, not a promise. Treat it as the first filter before you test context length and memory placement.
Use this conservative sizing advice:
| Hardware budget | Safe Qwen3.5 choice | Why |
|---|---|---|
| 8 GB unified memory / VRAM | 0.8b or 2b | Fast enough for basic local chat, classification, and glue tasks |
| 12 GB | 4b | Balanced option for laptops and entry GPUs |
| 16 GB | 9b | Mainstream local choice for coding and agent experiments |
| 24 GB | 27b if you accept lower throughput | Bigger jump in quality, but much tighter memory budget |
| 32 GB+ | 35b | Workstation-class local deployment |
| 128 GB+ | 122b | Server-class machine only |
If you want one starting recommendation for many developers, use qwen3.5:9b. It's small enough to be practical on 16 GB hardware and large enough to be useful for coding, search, and automation experiments.
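
If you want the sizing table in a script, a minimal lookup might look like the sketch below. The thresholds are copied from the table; the function name, the choice of 2b for the 8 GB row, and the fallback to 0.8b below 8 GB are assumptions made for illustration.

```python
# (minimum memory in GB, suggested tag), copied from the sizing table above.
SIZING_TABLE = [
    (128, "qwen3.5:122b"),
    (32, "qwen3.5:35b"),
    (24, "qwen3.5:27b"),
    (16, "qwen3.5:9b"),
    (12, "qwen3.5:4b"),
    (8, "qwen3.5:2b"),
]


def suggest_tag(memory_gb: float) -> str:
    """Return the largest tag whose table row fits the given memory budget."""
    for min_gb, tag in SIZING_TABLE:
        if memory_gb >= min_gb:
            return tag
    # Assumption: below the smallest row, fall back to the smallest alias.
    return "qwen3.5:0.8b"


for budget in (8, 16, 20, 24, 64):
    print(f"{budget} GB -> {suggest_tag(budget)}")
```

Like the table, this is a first filter. It says nothing about context length or what else is running on the machine.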
Here's a quick worked example. Suppose you have a MacBook Pro with 16 GB of unified memory. The operating system and your IDE might already consume 4 GB. That leaves roughly 12 GB for the model. The 9b artifact is 6.6 GB on disk, but runtime also needs room for loaded weights, the KV cache, prompt tokens, image tokens, and system overhead. Starting at 9b with a modest context window (4K to 8K) is the reasonable first test. If you tried 27b (17 GB on disk), the model would likely spill partially onto CPU, and token generation would slow down dramatically.
You can turn the rough memory math into a local sanity check before downloading a larger tag:

```python
def has_starting_headroom(available_gb: float, artifact_gb: float, reserve_gb: float = 3.0) -> bool:
    return available_gb - artifact_gb >= reserve_gb

fits_9b = has_starting_headroom(available_gb=12.0, artifact_gb=6.6)
fits_27b = has_starting_headroom(available_gb=12.0, artifact_gb=17.0)
print("9B starting budget:", fits_9b)
print("27B starting budget:", fits_27b)
print("9B has a plausible starting budget; 27B does not on this machine.")
```

```
9B starting budget: True
27B starting budget: False
9B has a plausible starting budget; 27B does not on this machine.
```

If you later benchmark different quantizations, keep the size-tier alias and the explicit artifact separate in your notes. qwen3.5:9b is a family-facing convenience tag. qwen3.5:9b-q8_0, qwen3.5:9b-bf16, or qwen3.5:9b-mxfp8 are the exact variants you pin when you want apples-to-apples comparisons.[2]
Treat the alias as a starting point and the exact tag as the deployment artifact. That keeps local experiments convenient without making production scripts drift.
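
One way to enforce that split in tooling is a small guard that rejects tags without an explicit quantization suffix. The sketch below is a heuristic, not an Ollama feature: the suffix list is limited to variants named in this guide, and the function name is hypothetical. Extend the list if you pin other variants.

```python
# Quantization suffixes mentioned in this guide; extend as needed.
QUANT_SUFFIXES = ("q4_K_M", "q8_0", "bf16", "mxfp8", "nvfp4", "int4", "int8")


def is_pinned(tag: str) -> bool:
    """Heuristic: a tag is pinned when it ends with a known quantization suffix."""
    _, sep, variant = tag.partition(":")
    if not sep:
        return False  # bare family name such as "qwen3.5"
    return variant.endswith(tuple(f"-{suffix}" for suffix in QUANT_SUFFIXES))


for tag in ("qwen3.5", "qwen3.5:latest", "qwen3.5:9b", "qwen3.5:9b-q4_K_M", "qwen3.5:35b-a3b-int4"):
    status = "pinned" if is_pinned(tag) else "moving alias"
    print(f"{tag}: {status}")
```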
The official project publishes platform-specific install commands in the main repository.[5]
macOS / Linux

```shell
curl -fsSL https://ollama.com/install.sh | sh
```

Windows (PowerShell)

```powershell
irm https://ollama.com/install.ps1 | iex
```

Manual downloads are also available if you prefer the desktop app route.[5]
Once the service is running, confirm it responds:

```shell
ollama --version
curl http://localhost:11434/api/tags
```

Start with the exact tag you have room for.

```shell
ollama pull qwen3.5:9b
ollama run qwen3.5:9b
```

If you run ollama run qwen3.5, Ollama resolves to the tag that the cited library page marked as latest when this article was last updated on May 26, 2026: qwen3.5:9b.[2]
That's convenient for exploration. Pin the tag explicitly when you're building tooling or scripts. latest is a moving alias, so it's weak for reproducibility.
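
A script that depends on a pinned tag can check for it before sending any prompts. The sketch below assumes the /api/tags response is a JSON object with a models list whose entries carry a name field. The helper names are hypothetical, and the demonstration runs on a hand-written payload so it works without a server.

```python
import json
import urllib.request


def local_tags(payload: dict) -> set[str]:
    """Extract model names from a decoded /api/tags response."""
    return {model["name"] for model in payload.get("models", [])}


def fetch_local_tags(base_url: str = "http://localhost:11434") -> set[str]:
    """Query a running Ollama server; raises OSError if it is unreachable."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return local_tags(json.load(resp))


# Offline demonstration with a hand-written payload in the assumed shape.
sample = {"models": [{"name": "qwen3.5:9b"}, {"name": "qwen3.5:9b-q8_0"}]}
required = "qwen3.5:9b-q8_0"
print(required, "available locally:", required in local_tags(sample))
```

Against a live server, replace the sample with fetch_local_tags() and fail fast if the required tag is missing.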
Once you start benchmarking, pin the exact artifact rather than only the size tier:

```shell
ollama pull qwen3.5:9b-q8_0
ollama pull qwen3.5:35b-a3b-int4
```

The first locks the 9B tier to Q8_0. The second makes the 35B MoE tier explicit and picks a 20 GB text-only artifact instead of the 24 GB text-and-image alias.[2]
Ollama lets you define a Modelfile, a recipe that bakes configuration into a new model tag. This helps when you want to reuse the same context length, temperature, and system prompt across sessions without typing them every time.
Create a file named Qwen-Dev.Modelfile:

```
FROM qwen3.5:9b
PARAMETER num_ctx 8192
PARAMETER temperature 0.3
SYSTEM """You are a precise coding assistant. Respond in Markdown. Be concise."""
```

Then build and run it:

```shell
ollama create qwen-dev -f Qwen-Dev.Modelfile
ollama run qwen-dev
```

Now every time you run qwen-dev, you'll get the 8K context, low temperature, and system prompt baked in. This pattern helps when you're building local agent workflows or connecting Ollama to external tools.
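
If you maintain several derived models, you may prefer to generate the Modelfile text rather than edit it by hand. The sketch below is one way to do that, using only the FROM, PARAMETER, and SYSTEM directives from the example above. The function name is hypothetical, and the triple-quote check is a simple precaution rather than a documented Ollama rule.

```python
def render_modelfile(base: str, num_ctx: int, temperature: float, system: str) -> str:
    """Build Modelfile text from the directives used in this guide."""
    if '"""' in system:
        raise ValueError("system prompt must not contain triple quotes")
    lines = [
        f"FROM {base}",
        f"PARAMETER num_ctx {num_ctx}",
        f"PARAMETER temperature {temperature}",
        f'SYSTEM """{system}"""',
    ]
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile(
    base="qwen3.5:9b",
    num_ctx=8192,
    temperature=0.3,
    system="You are a precise coding assistant. Respond in Markdown. Be concise.",
)
print(modelfile_text, end="")
```

Write the result to a file and pass it to ollama create with -f, as in the commands above.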
The official Qwen3.5 Ollama page documents direct launch commands for several tools:[2]

```shell
ollama launch claude --model qwen3.5
ollama launch codex --model qwen3.5
ollama launch opencode --model qwen3.5
ollama launch openclaw --model qwen3.5
```

That doesn't mean Qwen3.5 instantly becomes the right model for every tool. It means the integration surface is clean enough that you can try it before building custom glue code.
For your own scripts, Ollama publishes compatibility for parts of the OpenAI API, including /v1/chat/completions and /v1/responses.[7] That means you can point existing SDK-based code at your local server with only a base URL change.
Here's a minimal Python example that asks Qwen3.5 to explain a retry loop. It assumes the Ollama server is running, qwen3.5:9b has already been pulled, and the Python openai package is installed in your project environment:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama",
)

response = client.chat.completions.create(
    model="qwen3.5:9b",
    messages=[
        {"role": "system", "content": "You are a precise coding assistant."},
        {"role": "user", "content": "Explain how to implement a retry loop with backoff."},
    ],
)

print(response.choices[0].message.content)
```

For local development, that matters. You can test the same application shape against a local model first, then swap the base URL later if you move to a hosted deployment. One caveat matters: Ollama's OpenAI-compatibility docs describe /v1/responses support as non-stateful, so fields like previous_response_id and conversation aren't available for carrying server-side history.[7]
The api_key="ollama" placeholder is there because the OpenAI SDK expects a value. On the local Ollama endpoint, that value is ignored.[7]
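
Because the server does not carry history for you, a multi-turn client has to resend the conversation on every call. The sketch below shows one minimal way to keep that state on the client. LocalChat and fake_send are hypothetical names; fake_send stands in for a real request such as client.chat.completions.create(...) so the example runs without a server.

```python
from typing import Callable

Message = dict[str, str]


class LocalChat:
    """Keeps the conversation on the client and resends it on every turn."""

    def __init__(self, send: Callable[[list[Message]], str], system: str) -> None:
        self._send = send
        self.messages: list[Message] = [{"role": "system", "content": system}]

    def ask(self, user_text: str) -> str:
        self.messages.append({"role": "user", "content": user_text})
        reply = self._send(list(self.messages))  # pass a copy of the full history
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def fake_send(messages: list[Message]) -> str:
    """Stand-in for a real API call; reports how much history it received."""
    return f"(reply after seeing {len(messages)} messages)"


chat = LocalChat(fake_send, system="You are a precise coding assistant.")
chat.ask("Explain a retry loop.")
print(chat.ask("Now add exponential backoff."))
```

Every turn grows the prompt, so long sessions eventually run into the context limits discussed in the next part of this guide.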
The Qwen3.5 model page advertises 256K because that's the model maximum. Ollama's context-length docs describe a separate runtime default that depends on available VRAM: under 24 GiB defaults to 4K, 24 to 48 GiB defaults to 32K, and 48 GiB or more defaults to 256K.[2][8]
That distinction matters. "This model supports 256K" isn't the same statement as "my laptop starts every session at 256K."
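
The VRAM tiers above are easy to encode if you want a script to predict which default a machine will start with. This sketch mirrors the three tiers as quoted; it assumes "K" means 1,024 tokens, and the function name is illustrative. Verify the result on your machine, because the documented defaults can change between Ollama releases.

```python
def default_context_tokens(vram_gib: float) -> int:
    """Map available VRAM to the default context tier described above."""
    if vram_gib < 24:
        return 4 * 1024
    if vram_gib < 48:
        return 32 * 1024
    return 256 * 1024


for vram in (12, 24, 48):
    print(f"{vram} GiB VRAM -> default context of {default_context_tokens(vram)} tokens")
```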
For agents, coding tools, and other long-context workloads, Ollama's docs now recommend at least 64K when the machine can sustain it.[8] For many laptop-class setups, start smaller and move up only when the task needs it.
The FAQ examples still use 4096 when showing override syntax, while the dedicated context-length page documents the current VRAM-based defaults.[9][8]
Inside an interactive ollama run session, change the active context like this:[9]

```
/set parameter num_ctx 32768
```

For the native Ollama API, pass num_ctx per request:[9]

```shell
curl http://localhost:11434/api/chat -d '{
  "model": "qwen3.5:9b",
  "messages": [
    {
      "role": "user",
      "content": "Summarize the memory trade-offs of long local context."
    }
  ],
  "options": {
    "num_ctx": 32768
  }
}'
```

If you're going through /v1/chat/completions or /v1/responses, Ollama's OpenAI-compatibility docs say you don't set context size in the request body. Create a derived model instead, or set a server-wide default when starting ollama serve.[7][8]

```
FROM qwen3.5:9b
PARAMETER num_ctx 32768
```

```shell
ollama create qwen3.5-9b-32k -f Modelfile
ollama run qwen3.5-9b-32k
```

For CLI or service deployments, you can also set a server-wide default before starting Ollama:[8]

```shell
OLLAMA_CONTEXT_LENGTH=64000 ollama serve
```

Before assuming a larger tag is the problem, inspect where Ollama placed it and what context it allocated:[8][9]

```shell
ollama ps
```

If the model is partly on CPU, latency usually gets ugly long before the tag "fails" outright.
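
If you want to automate that placement check, you can parse the PROCESSOR value that ollama ps prints. The sketch below assumes the three string shapes shown in Ollama's FAQ examples ("100% GPU", "100% CPU", and "48%/52% CPU/GPU"); the function name is hypothetical, and any other format raises an error instead of guessing.

```python
import re


def gpu_share(processor: str) -> float:
    """Return the GPU percentage (0-100) from an ollama ps PROCESSOR value."""
    text = processor.strip()
    single = re.fullmatch(r"(\d+)%\s+(CPU|GPU)", text)
    if single:
        pct, device = int(single.group(1)), single.group(2)
        return float(pct) if device == "GPU" else 100.0 - pct
    split = re.fullmatch(r"(\d+)%/(\d+)%\s+CPU/GPU", text)
    if split:
        return float(split.group(2))  # second number is the GPU share
    raise ValueError(f"unrecognized PROCESSOR value: {processor!r}")


for value in ("100% GPU", "48%/52% CPU/GPU", "100% CPU"):
    share = gpu_share(value)
    note = "fully on GPU" if share == 100.0 else "CPU involved, expect slower generation"
    print(f"{value}: {note}")
```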
Context length is only one knob. Ollama's FAQ also documents four server settings that change local behavior fast: OLLAMA_KEEP_ALIVE for model residency, OLLAMA_NUM_PARALLEL for per-model concurrency, OLLAMA_FLASH_ATTENTION=1 to reduce long-context memory pressure, and OLLAMA_KV_CACHE_TYPE to quantize the KV cache when Flash Attention is enabled.[9]
The common tuning mistake is raising context and parallelism together. Memory pressure scales with both, so tune one dimension at a time and check placement after each change.

```shell
OLLAMA_FLASH_ATTENTION=1 \
OLLAMA_KV_CACHE_TYPE=q8_0 \
OLLAMA_KEEP_ALIVE=30m \
ollama serve
```

Use q8_0 first if you need the extra headroom. Ollama's FAQ describes it as roughly half the memory of f16 with a very small precision loss, while q4_0 drops to about one quarter of f16 with a more noticeable quality trade-off at larger contexts.[9]
If you raise parallelism, do it with your eyes open. Ollama's FAQ says required RAM scales with OLLAMA_NUM_PARALLEL * OLLAMA_CONTEXT_LENGTH.[9] A setup that works at 32K for one request can fall over fast at the same context for four concurrent requests.
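
The scaling rule is easy to turn into a back-of-the-envelope comparison. The sketch below expresses KV cache load relative to a single 32K request at f16, using the parallelism-times-context rule and the approximate cache-type ratios quoted from the FAQ. It is a relative estimate only: it ignores the model weights, and the function and constant names are illustrative.

```python
# Approximate KV cache size relative to f16, per the FAQ figures quoted above.
KV_TYPE_FACTOR = {"f16": 1.0, "q8_0": 0.5, "q4_0": 0.25}


def relative_kv_load(context_tokens: int, num_parallel: int = 1, kv_type: str = "f16") -> float:
    """Token slots held in cache, scaled by the cache type's relative size."""
    return context_tokens * num_parallel * KV_TYPE_FACTOR[kv_type]


baseline = relative_kv_load(32768)
scenarios = [
    ("1 x 32K, f16", baseline),
    ("4 x 32K, f16", relative_kv_load(32768, num_parallel=4)),
    ("4 x 32K, q8_0", relative_kv_load(32768, num_parallel=4, kv_type="q8_0")),
]
for label, load in scenarios:
    print(f"{label}: {load / baseline:.1f}x the baseline cache load")
```

Quantizing the cache softens the jump from one to four parallel requests, but it does not cancel it, which is why tuning one dimension at a time is the safer habit.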
Use qwen3.5:2b or qwen3.5:4b.
These fit:
Use qwen3.5:9b.
This is the mainstream local Qwen tag in this family if you want something that can help with:
Use qwen3.5:27b or qwen3.5:35b only if you have the memory budget.
Don't buy a larger tag because the benchmark chart looks impressive. If the model spills into CPU memory, user experience will collapse long before the raw capability difference pays back.
Cause: You picked a tag that's too large for your available memory budget, or you pushed context high enough that the model no longer stays cleanly on GPU.
Fix: Check ollama ps first. If the model is split across CPU and GPU, step down one size class or cut context length before you keep tuning.
Cause: Your context setting is probably too aggressive for the machine.
Fix: Reduce context length first. Long local sessions are often memory problems pretending to be model-quality problems.
Cause: Expecting a 122B-class experience from a laptop GPU.
Fix: Run 9b or 27b locally, and only reach for cloud-hosted frontier tiers when the task needs them.
Cause: Relying on moving defaults like latest or the size-tier alias without pinning the exact quantization.
Fix: Pin exact artifact tags such as qwen3.5:9b-q4_K_M or qwen3.5:35b-a3b-int4 in scripts and clients. Avoid relying on a moving default.
Here's a concrete exercise to test whether the memory concepts have stuck.
Scenario: You have a desktop with an RTX 3060 (12 GB VRAM). The OS and desktop environment use about 2 GB. You want to run a local Qwen3.5 model and leave enough headroom for a 4K context window and the KV cache.
Question: Which of these tags is the safest choice?

- qwen3.5:9b (6.6 GB on disk)
- qwen3.5:27b (17 GB on disk)
- qwen3.5:35b-a3b-int4 (smaller than the default 35b alias)

Think about it before reading on.
The 27b artifact is 17 GB before it even loads into VRAM. With only 10 GB free, that'll spill to CPU immediately. The 35b-a3b-int4 variant is smaller than the default 35B alias, but it is still a 20 GB artifact. The safest pick is qwen3.5:9b. At 6.6 GB on disk, it has room to expand in memory plus the KV cache without leaving the GPU.
Bonus question: If you later upgrade to a 24 GB GPU and want to run qwen3.5:27b with an 8K context, what single environment variable can you set to reduce KV cache memory pressure before starting Ollama?
Answer: OLLAMA_FLASH_ATTENTION=1 enables a memory-efficient attention implementation that reduces memory pressure as context grows.[9]

- If ollama ps shows CPU involvement, is the first fix to lower temperature, lower context, or step down a size class? Why?
- When would qwen3.5:35b-a3b be worth testing instead of staying on qwen3.5:9b, even if both technically run?
- How do you check the CONTEXT Ollama actually allocates on your machine?

By now you can:
These skills are the foundation for local agent development. The next logical step in the LeetLLM path is to build a small retrieval-augmented generation (RAG) loop that feeds documents from your local filesystem into the model through the API. That's where local inference stops being a demo and starts being a tool you can ship.
If you're deciding quickly:

- Start with qwen3.5:9b.[2]
- Check ollama ps.[8][9]
- Move to 27b or 35b only if your machine has clear headroom.

That setup gives you a fast, useful local Qwen workflow without turning the whole exercise into memory debugging.

[1] Qwen Team. Qwen3.6. 2026.
[2] Ollama. qwen3.5. 2026.
[3] Qwen Team. Qwen3.5-35B-A3B. 2026.
[4] Qwen Team. Qwen3.5-122B-A10B. 2026.
[5] Ollama Team. Ollama GitHub Repository. 2026.
[6] Gerganov, G. llama.cpp: Inference of LLaMA model in pure C/C++. 2023.
[7] Ollama. OpenAI compatibility - Ollama. 2026.
[8] Ollama. Context length - Ollama. 2026.
[9] Ollama. FAQ - Ollama. 2026.