Qwen3.5 is available in Ollama from 0.8B to 122B. This guide shows how to choose the right local tag, fit it to your memory budget, and expose it through Ollama's OpenAI-compatible API.
Qwen3.5 is one of the best local-model families you can run through Ollama right now because the official model page covers a very wide size range and keeps the API story simple. Ollama exposes tags from 0.8B all the way to 122B, with published artifact sizes and a consistent 256K context window across the local tags.[1]
That means the real problem is not "can I run Qwen3.5 locally?" It is "which tag fits my machine without turning the experience into sludge?"
This guide stays strict about that question. It uses published Ollama tag sizes and official Ollama docs, then adds conservative deployment advice on top.[1][2][3][4]
The current Ollama library page lists these Qwen3.5 tags for local use:[1]
| Tag | Published size | Context window | Input modes |
|---|---|---|---|
| qwen3.5:0.8b | 1.0 GB | 256K | Text, Image |
| qwen3.5:2b | 2.7 GB | 256K | Text, Image |
| qwen3.5:4b | 3.4 GB | 256K | Text, Image |
| qwen3.5:9b | 6.6 GB | 256K | Text, Image |
| qwen3.5:27b | 17 GB | 256K | Text, Image |
| qwen3.5:35b | 24 GB | 256K | Text, Image |
| qwen3.5:122b | 81 GB | 256K | Text, Image |
The same page also highlights direct integrations with Claude Code, Codex, OpenCode, and OpenClaw through ollama launch ... --model qwen3.5 commands.[1]
That is why Qwen3.5 is attractive for local workflows. You do not need to invent a fragile wrapper around a raw checkpoint. You get a published model tag, a local server, and an OpenAI-compatible API surface.
Use the published Ollama artifact size as the first filter, then leave extra headroom for context, KV cache, and the rest of your system. A 6.6 GB model file is not the same thing as "only 6.6 GB required."
Here is the conservative sizing advice I would use:
| Hardware budget | Safe Qwen3.5 choice | Why |
|---|---|---|
| 8 GB unified memory / VRAM | 0.8b or 2b | Fast enough for basic local chat, classification, and glue tasks |
| 12 GB | 4b | Good balance for laptops and entry GPUs |
| 16 GB | 9b | The best mainstream local choice for real coding and agent experiments |
| 24 GB | 27b if you accept lower throughput | Bigger jump in quality, but much tighter memory budget |
| 32 GB+ | 35b | Workstation-class local deployment |
| 80 GB+ | 122b | Server-class machine only |
If you want one default recommendation for most developers, use qwen3.5:9b. It is small enough to be practical on 16 GB hardware and large enough to be useful for real coding, search, and automation work.
The official project is straightforward to install and run locally.[2]
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
If you are on macOS or Windows, download the native app from Ollama instead.[2]
Once the service is running, confirm it responds:
```bash
ollama --version
curl http://localhost:11434/api/tags
```
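If you want to script that check rather than eyeball curl output, the /api/tags endpoint returns a JSON object with a "models" array. The sketch below parses that shape from a sample payload; the sample values are illustrative, not a live response.

```python
import json

# Parse the JSON shape returned by Ollama's /api/tags endpoint to list
# installed models. The sample payload below is illustrative, not live data.
sample_response = json.dumps({
    "models": [
        {"name": "qwen3.5:9b", "size": 6_600_000_000},
        {"name": "qwen3.5:2b", "size": 2_700_000_000},
    ]
})

def installed_models(raw: str) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in json.loads(raw).get("models", [])]

print(installed_models(sample_response))  # ['qwen3.5:9b', 'qwen3.5:2b']
```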
Start with the exact tag you have room for.
```bash
ollama pull qwen3.5:9b
ollama run qwen3.5:9b
```
If you just run ollama run qwen3.5, Ollama resolves to the default local tag shown on the library page, which is the 9b variant at the time of writing.[1]
That is convenient, but I still recommend pinning the tag explicitly when you are building tooling or scripts. "Latest" is easy for experimentation and bad for reproducibility.
One of the nicest things about the official Qwen3.5 Ollama page is that it documents direct launch commands for several coding tools:[1]
```bash
ollama launch claude --model qwen3.5
ollama launch codex --model qwen3.5
ollama launch opencode --model qwen3.5
ollama launch openclaw --model qwen3.5
```
That does not mean Qwen3.5 instantly becomes the best model for every tool. It means the integration surface is clean enough that you can try it without building custom glue code first.
Ollama is not a hosted inference API. It is a local model server wrapped around a compiled inference runtime.
Under the hood, Ollama uses llama.cpp and GGUF-style local model packaging, which is why it runs across Macs, Linux workstations, and consumer GPUs without asking you to stand up a full datacenter inference stack.[4]
That architecture gives you three immediate benefits: it runs on Macs, Linux workstations, and consumer GPUs; everything stays on your own machine; and it exposes a standard API surface you already know how to program against.
Ollama publishes compatibility for parts of the OpenAI API, including /v1/chat/completions and /v1/responses.[3]
That means you can point existing SDK-based code at your local server with only a base URL change:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama",
)

response = client.chat.completions.create(
    model="qwen3.5:9b",
    messages=[
        {"role": "system", "content": "You are a precise coding assistant."},
        {"role": "user", "content": "Explain how to implement a retry loop with backoff."},
    ],
)

print(response.choices[0].message.content)
```
For local development, that is a big deal. You can test the same application shape against a local model first, then swap the base URL later if you move to a hosted deployment.
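One simple way to make that swap painless is to read the base URL and model name from the environment instead of hardcoding them. The variable names below are my own convention, not an Ollama or OpenAI standard.

```python
import os

# Read the endpoint and model from environment variables so the same code
# runs against local Ollama or a hosted OpenAI-compatible API.
# LLM_BASE_URL and LLM_MODEL are illustrative names, not a standard.

def resolve_endpoint() -> tuple[str, str]:
    """Return (base_url, model) for the current environment."""
    base_url = os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1/")
    model = os.environ.get("LLM_MODEL", "qwen3.5:9b")
    return base_url, model

base_url, model = resolve_endpoint()
print(base_url, model)
```

Pass the resolved values into the OpenAI client constructor, and moving between local and hosted deployments becomes a configuration change rather than a code change.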
Yes, the published local tags advertise 256K context.[1]
No, that does not mean you should casually run every laptop session at 256K.
Long context increases memory pressure and hurts latency. The safe way to use local Qwen3.5 is to set num_ctx deliberately instead of defaulting to the maximum, and raise it only when a task actually needs the extra room.

A practical starting point looks like this:
```bash
ollama run qwen3.5:9b
```
Then set a smaller context in a Modelfile or request options if your machine gets tight on memory.
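For the request-options route, Ollama's native /api/chat endpoint accepts an "options" object that includes num_ctx. The sketch below builds such a request body; actually POSTing it to http://localhost:11434/api/chat is left out so the example stays self-contained.

```python
import json

# Build a /api/chat request body that caps the context window per request
# via Ollama's "options" object, rather than relying on the model default.

def chat_payload(model: str, prompt: str, num_ctx: int = 8192) -> str:
    """Return a JSON body for Ollama's /api/chat with an explicit num_ctx."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"num_ctx": num_ctx},  # far below 256K to save memory
        "stream": False,
    })

body = chat_payload("qwen3.5:9b", "Summarize this repo layout.", num_ctx=8192)
print(json.loads(body)["options"])  # {'num_ctx': 8192}
```

Starting at 8192 and raising the cap only when a task truncates is a more memory-friendly habit than opening every session at the advertised maximum.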
Use qwen3.5:2b or qwen3.5:4b.
These are good fits for basic local chat, lightweight classification, and small glue tasks on constrained hardware.
Use qwen3.5:9b.
This is the best mainstream local Qwen tag if you want something that can still help with real coding questions, local search and summarization, and agent-style automation.
Use qwen3.5:27b or qwen3.5:35b only if you actually have the memory budget.
Do not buy a larger tag just because the benchmark chart looks impressive. If the model spills badly, your user experience will collapse long before the raw capability difference pays back.
If generation slows to a crawl or your system starts swapping, that usually means you picked a tag that is too large for your available memory budget.
The fix is not to keep tuning forever. The fix is usually to step down one size class.
If long sessions degrade, stall, or run out of memory, your context setting is probably too aggressive for the machine.
Reduce context length first. Long local sessions are often memory problems pretending to be model-quality problems.
If you want local models for everyday work without giving up frontier quality when it matters, that is exactly what the smaller tags are for. Run 9b or 27b locally, and only reach for cloud-hosted frontier tiers when the task really needs them.
Pin exact tags such as qwen3.5:9b in scripts and clients. Avoid relying on a moving default.
Qwen3.5 is a strong local default if you want:
It is a worse choice if your main goal is squeezing the absolute largest possible model onto marginal hardware. In that case, the right move is usually to choose a smaller tag cleanly instead of forcing a bad fit.
If you are deciding quickly:
Default to qwen3.5:9b.[1] Step up to 27b or 35b if your machine has obvious headroom. That is the setup that gives you the best odds of a fast, stable, useful local Qwen workflow without turning the whole exercise into memory debugging.