
Run Qwen3.5 Locally with Ollama

Qwen3.5 in Ollama spans primary aliases from 0.8B to 122B plus explicit quantized variants. This guide shows how to choose the right local tag, keep context size realistic, and expose it through Ollama's OpenAI-compatible API.

LeetLLM Team · March 2, 2026 · Updated May 26, 2026 · 17 min read

Imagine you're prototyping a customer-support assistant for a small SaaS app. You'd like to test it on real account data, but you don't want to send private customer data and refund histories to a third-party API. A local model running on your own machine keeps the data inside your network, avoids a per-token API bill during experiments, and still gives you a capable large language model (LLM) to work with.

That's the practical case for running Qwen3.5 through Ollama. The value isn't only avoiding API bills; it's learning how inference engines, model weights, memory budgets, and context windows fit together. Those same concepts show up in technical interviews for AI engineering roles, and they matter when you later move to production serving stacks like vLLM or TGI.

This guide teaches the operator view: how to choose a model tag that fits your hardware, how Ollama serves it, and how to keep the setup bounded once it is running.

Different local path: This guide uses Ollama's Qwen3.5 tags, Modelfiles, and local API. The newer Qwen3.6 Unsloth GGUF guide uses pinned GGUF files, llama.cpp commands, explicit quant choices, and a lower-level runtime view for coding and agent workflows.[1]

What Ollama Ships

The official Ollama library page for Qwen3.5 lists seven primary local aliases that range from pocket-sized to server-class:[2]

| Primary alias | Published size | Model max context | Input modes |
| --- | --- | --- | --- |
| qwen3.5:0.8b | 1.0 GB | 256K | Text, Image |
| qwen3.5:2b | 2.7 GB | 256K | Text, Image |
| qwen3.5:4b | 3.4 GB | 256K | Text, Image |
| qwen3.5:9b | 6.6 GB | 256K | Text, Image |
| qwen3.5:27b | 17 GB | 256K | Text, Image |
| qwen3.5:35b | 24 GB | 256K | Text, Image |
| qwen3.5:122b | 81 GB | 256K | Text, Image |

Those seven rows are the easy-to-read front door, not the whole catalog. The tags view on the same page also exposes explicit variants such as qwen3.5:9b-q8_0, qwen3.5:9b-bf16, qwen3.5:9b-mxfp8, qwen3.5:27b-int4, and qwen3.5:35b-a3b-int4.[2]

The alias relationships matter if you care about reproducibility:[2]

  • as of May 26, 2026, qwen3.5:latest is shown as qwen3.5:9b on the cited library page
  • qwen3.5:9b is the default Q4_K_M artifact, so qwen3.5:9b-q4_K_M is the more explicit pin
  • qwen3.5:27b follows the same pattern with qwen3.5:27b-q4_K_M
  • qwen3.5:35b maps to qwen3.5:35b-a3b, so the 35B tier in Ollama is the A3B sparse MoE model

The 35b and 122b tags aren't dense models in the traditional sense. They're based on a Mixture of Experts (MoE) architecture. Think of an MoE model as a large panel of specialists where only a few experts are consulted for each token. The total parameter count is huge, but the active count per token is much smaller: the 35B-A3B model activates 3B parameters, while the 122B-A10B model activates 10B.[3][4] That's why an MoE tag can have a very different runtime profile from a dense checkpoint with the same total parameter count.
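To make the "total versus active" distinction concrete, here is a minimal arithmetic sketch using only the parameter counts quoted above. The dictionary keys are the model names from the references, and the helper name is illustrative. Keep in mind that the full set of expert weights still has to be loaded, so the active count is mainly a hint about per-token compute rather than memory.

```python
# Total and active parameter counts (in billions) as stated in the text above.
MOE_MODELS = {
    "Qwen3.5-35B-A3B": {"total_b": 35.0, "active_b": 3.0},
    "Qwen3.5-122B-A10B": {"total_b": 122.0, "active_b": 10.0},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters that participate in each token."""
    return active_b / total_b


for name, counts in MOE_MODELS.items():
    share = active_fraction(**counts)
    print(f"{name}: {share:.1%} of parameters active per token")
```

Both models land at roughly 8 to 9 percent active parameters per token, which is why their speed can look closer to a small dense model while their memory footprint does not.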

A related gotcha: the default aliases accept Text, Image input, but many optimized variants are text-only. If you need vision support, check the Input column before pinning -mxfp8, -nvfp4, -mlx-bf16, -int4, -int8, or -coding-* tags.[2]

The current Ollama page marks the Qwen3.5 family as supporting both thinking and non-thinking (fast) modes, plus native tool calling.[2] Thinking mode trades latency and tokens for more deliberate reasoning, so for quick local glue tasks the non-thinking path is usually the better default.

The same page also lists qwen3.5:cloud and qwen3.5:397b-cloud, but those aren't local artifacts and don't belong in a local hardware plan.[2]

It also highlights direct integrations with Claude Code, Codex, OpenCode, and OpenClaw through ollama launch ... --model qwen3.5 commands.[2]

That's why Qwen3.5 is attractive for local workflows. You don't need to wrap raw checkpoints yourself. You get a published model tag, a local server, and an OpenAI-compatible API surface.

The local setup path is short, but each step changes a different part of the system:

Figure: Qwen3.5 local setup flow from choosing an Ollama tag to pulling the model, setting context budget, using local API, and inspecting placement with ollama ps.
Local setup is easier when you separate five jobs: pick a tag, pull it, set context, call the local endpoint, then verify real placement.

How Local Inference Works

Before you pick a tag, it helps to picture what happens when you send a prompt.

Ollama is the inference engine. It manages model tags, downloads weights, and routes requests. Under the hood, it uses a compiled runtime backed by llama.cpp to turn tokens into predictions.[5][6] The Qwen3.5 weights are the fuel: the knowledge stored in billions of tuned parameters. The engine can't run without fuel, and the fuel can't produce text without an engine.

The request path below separates the local application programming interface (API) from the runtime and memory placement, which is the distinction that matters when a model feels slow but has not crashed.

Figure: Local Qwen3.5 request path from app to Ollama to runtime to memory placement, with one rule card explaining that ollama ps reveals allocated context and CPU offload.
Ollama exposes one local endpoint, but useful performance depends on what the runtime actually allocates. Check ollama ps for real context budget and CPU offload.

That architecture gives you three immediate benefits:

  1. No network dependency for inference
  2. A local endpoint for tooling
  3. A clean stepping stone before moving to vLLM or another production engine

When you run ollama run qwen3.5:9b, here's what happens under the hood:

  1. Ollama checks if the weights are already cached locally
  2. If not, it downloads the artifact for the exact tag you requested
  3. It loads the weights into GPU memory (or CPU RAM if the GPU is too small)
  4. It serves the model through Ollama's local HTTP server, which listens on port 11434 by default
  5. Your prompt travels through that server, into the inference runtime, and tokens stream back

The important detail is step 3. Loading a 6.6 GB model file doesn't mean you only need 6.6 GB of free memory. The runtime also needs space for the KV cache, which stores the attention keys and values for tokens it has already processed so the model doesn't recompute everything from scratch on every new token. As your conversation grows, the cache grows too.
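A rough sketch of why the cache grows with context, assuming a standard transformer attention stack where every layer keeps one key tensor and one value tensor per token. The layer count, KV head count, and head dimension below are hypothetical round numbers, not Qwen3.5's published configuration, and models that mix in other layer types would not follow this formula exactly.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size for one sequence in a standard attention stack."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# HYPOTHETICAL architecture numbers, chosen for round arithmetic.
# They are not Qwen3.5's published configuration.
HYPOTHETICAL = {"n_layers": 32, "n_kv_heads": 8, "head_dim": 128}

for ctx in (4096, 8192, 32768):
    gib = kv_cache_bytes(context_tokens=ctx, **HYPOTHETICAL) / 1024**3
    print(f"{ctx:>6} tokens -> {gib:.2f} GiB of KV cache at 16-bit precision")
```

The useful takeaway is the shape, not the exact numbers: under these assumptions the cache scales linearly with context length, so doubling the context doubles the cache.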

Pick the Right Tag for Your Hardware

Use the published Ollama artifact size as the first filter, then leave extra headroom for context, KV cache, and the rest of your system. A 6.6 GB model file isn't the same thing as "only 6.6 GB required."

The alias guide below is a sizing map, not a promise. Treat it as the first filter before you test context length and memory placement.

Figure: Primary Qwen3.5 Ollama aliases from 0.8B through 122B, with 9B highlighted as mainstream local default.
Use published artifact size as the first filter, then verify real placement with `ollama ps` after you change context.

Use this conservative sizing advice:

| Hardware budget | Safe Qwen3.5 choice | Why |
| --- | --- | --- |
| 8 GB unified memory / VRAM | 0.8b or 2b | Fast enough for basic local chat, classification, and glue tasks |
| 12 GB | 4b | Balanced option for laptops and entry GPUs |
| 16 GB | 9b | Mainstream local choice for coding and agent experiments |
| 24 GB | 27b if you accept lower throughput | Bigger jump in quality, but much tighter memory budget |
| 32 GB+ | 35b | Workstation-class local deployment |
| 128 GB+ | 122b | Server-class machine only |

If you want one starting recommendation for many developers, use qwen3.5:9b. It's small enough to be practical on 16 GB hardware and large enough to be useful for coding, search, and automation experiments.

Here's a quick worked example. Suppose you have a MacBook Pro with 16 GB of unified memory. The operating system and your IDE might already consume 4 GB. That leaves roughly 12 GB for the model. The 9b artifact is 6.6 GB on disk, but runtime also needs room for loaded weights, the KV cache, prompt tokens, image tokens, and system overhead. Starting at 9b with a modest context window (4K to 8K) is the reasonable first test. If you tried 27b (17 GB on disk), the model would likely spill partially onto CPU, and token generation would slow down dramatically.

You can turn the rough memory math into a local sanity check before downloading a larger tag:

pick-the-right-tag-for-your-hardware.py
def has_starting_headroom(available_gb: float, artifact_gb: float, reserve_gb: float = 3.0) -> bool:
    return available_gb - artifact_gb >= reserve_gb

fits_9b = has_starting_headroom(available_gb=12.0, artifact_gb=6.6)
fits_27b = has_starting_headroom(available_gb=12.0, artifact_gb=17.0)
print("9B starting budget:", fits_9b)
print("27B starting budget:", fits_27b)
print("9B has a plausible starting budget; 27B does not on this machine.")
Output
9B starting budget: True
27B starting budget: False
9B has a plausible starting budget; 27B does not on this machine.

If you later benchmark different quantizations, keep the size-tier alias and the explicit artifact separate in your notes. qwen3.5:9b is a family-facing convenience tag. qwen3.5:9b-q8_0, qwen3.5:9b-bf16, or qwen3.5:9b-mxfp8 are the exact variants you pin when you want apples-to-apples comparisons.[2]

Treat the alias as a starting point and the exact tag as the deployment artifact. That keeps local experiments convenient without making production scripts drift.

Install Ollama

The official project publishes platform-specific install commands in the main repository.[5]

macOS / Linux

terminal
curl -fsSL https://ollama.com/install.sh | sh

Windows (PowerShell)

install-ollama.ps1
irm https://ollama.com/install.ps1 | iex

Manual downloads are also available if you prefer the desktop app route.[5]

Once the service is running, confirm it responds:

terminal-2
ollama --version
curl http://localhost:11434/api/tags
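If you'd rather do that check from a script, here is a small sketch that reads a tags payload and tells you whether a given model is already on disk. It assumes the /api/tags response is a JSON object with a models list whose entries carry a name field; the sample payload is hand-written for illustration, not captured from a real server, and the helper names are made up for this example.

```python
import json


def local_model_names(tags_payload: dict) -> list:
    """Return the model names found in an /api/tags style payload."""
    return [entry["name"] for entry in tags_payload.get("models", [])]


def has_model(tags_payload: dict, wanted: str) -> bool:
    """True when the exact tag is already present locally."""
    return wanted in local_model_names(tags_payload)


# Hand-written sample payload (assumption), standing in for a live response.
sample = json.loads('{"models": [{"name": "qwen3.5:9b"}, {"name": "qwen3.5:2b"}]}')

print(local_model_names(sample))
print("27b pulled already?", has_model(sample, "qwen3.5:27b"))
```

Against a live server, you could load the payload with urllib.request.urlopen("http://localhost:11434/api/tags") and json.load, then pass the result to the same helpers.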

Pull and Run Qwen3.5

Start with the exact tag you have room for.

terminal-3
ollama pull qwen3.5:9b
ollama run qwen3.5:9b

If you run ollama run qwen3.5 with no tag, Ollama resolves it to latest. When this article was last updated on May 26, 2026, the cited library page showed latest pointing to qwen3.5:9b.[2]

That's convenient for exploration. Pin the tag explicitly when you're building tooling or scripts. latest is a moving alias, so it's weak for reproducibility.

Once you start benchmarking, pin the exact artifact rather than only the size tier:

terminal-4
ollama pull qwen3.5:9b-q8_0
ollama pull qwen3.5:35b-a3b-int4

The first locks the 9B tier to Q8_0. The second makes the 35B MoE tier explicit and picks a 20 GB text-only artifact instead of the 24 GB text-and-image alias.[2]

Customize Behavior with a Modelfile

Ollama lets you define a Modelfile, a recipe that bakes configuration into a new model tag. This helps when you want to reuse the same context length, temperature, and system prompt across sessions without typing them every time.

Create a file named Qwen-Dev.Modelfile:

Qwen-Dev.Modelfile
FROM qwen3.5:9b
PARAMETER num_ctx 8192
PARAMETER temperature 0.3
SYSTEM """You are a precise coding assistant. Respond in Markdown. Be concise."""

Then build and run it:

terminal-5
ollama create qwen-dev -f Qwen-Dev.Modelfile
ollama run qwen-dev

Now every time you run qwen-dev, you'll get the 8K context, low temperature, and system prompt baked in. This pattern helps when you're building local agent workflows or connecting Ollama to external tools.
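If you end up maintaining several of these recipes, one option is to generate the Modelfile text from a small helper so the context length and temperature live in one place. This is only a sketch: render_modelfile is a hypothetical helper, and it emits nothing beyond the FROM, PARAMETER, and SYSTEM lines already shown above.

```python
def render_modelfile(base: str, num_ctx: int, temperature: float, system: str) -> str:
    """Build Modelfile text using the same directives as the example above."""
    lines = [
        f"FROM {base}",
        f"PARAMETER num_ctx {num_ctx}",
        f"PARAMETER temperature {temperature}",
        f'SYSTEM """{system}"""',
    ]
    return "\n".join(lines) + "\n"


text = render_modelfile(
    base="qwen3.5:9b",
    num_ctx=8192,
    temperature=0.3,
    system="You are a precise coding assistant. Respond in Markdown. Be concise.",
)
print(text)
```

You would write the returned text to a file and pass it to ollama create as before. The sketch does no escaping, so a system prompt that itself contains triple quotes would need extra handling.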

Talk to Qwen3.5 From Code

The official Qwen3.5 Ollama page documents direct launch commands for several tools:[2]

terminal-6
ollama launch claude --model qwen3.5
ollama launch codex --model qwen3.5
ollama launch opencode --model qwen3.5
ollama launch openclaw --model qwen3.5

That doesn't mean Qwen3.5 instantly becomes the right model for every tool. It means the integration surface is clean enough that you can try it before building custom glue code.

For your own scripts, Ollama publishes compatibility for parts of the OpenAI API, including /v1/chat/completions and /v1/responses.[7] That means you can point existing SDK-based code at your local server with only a base URL change.

Here's a minimal Python example that asks Qwen3.5 to explain a retry loop. It assumes the Ollama server is running, qwen3.5:9b has already been pulled, and the Python openai package is installed in your project environment:

talk-to-qwen35-from-code.py
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama",
)

response = client.chat.completions.create(
    model="qwen3.5:9b",
    messages=[
        {"role": "system", "content": "You are a precise coding assistant."},
        {"role": "user", "content": "Explain how to implement a retry loop with backoff."},
    ],
)

print(response.choices[0].message.content)

For local development, that matters. You can test the same application shape against a local model first, then swap the base URL later if you move to a hosted deployment. One caveat matters: Ollama's OpenAI-compatibility docs describe /v1/responses support as non-stateful, so fields like previous_response_id and conversation aren't available for carrying server-side history.[7]

The api_key="ollama" placeholder is there because the OpenAI SDK expects a value. On the local Ollama endpoint, that value is ignored.[7]
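One way to make the "swap the base URL later" idea concrete is to read the endpoint settings from environment variables with local defaults. This is a sketch under assumptions: the variable names LLM_BASE_URL, LLM_MODEL, and LLM_API_KEY are hypothetical, and the hosted URL in the demo is a placeholder, not a real service.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LLMConfig:
    base_url: str
    model: str
    api_key: str


def load_llm_config(env: Optional[Mapping[str, str]] = None) -> LLMConfig:
    """Resolve endpoint settings, falling back to the local Ollama defaults."""
    env = os.environ if env is None else env
    return LLMConfig(
        # Hypothetical variable names; pick whatever fits your project.
        base_url=env.get("LLM_BASE_URL", "http://localhost:11434/v1/"),
        model=env.get("LLM_MODEL", "qwen3.5:9b"),
        api_key=env.get("LLM_API_KEY", "ollama"),
    )


local = load_llm_config({})
hosted = load_llm_config({
    "LLM_BASE_URL": "https://llm.example.internal/v1/",  # placeholder URL
    "LLM_MODEL": "hosted-model",
    "LLM_API_KEY": "real-secret",
})
print(local)
print(hosted.base_url)
```

With this in place, the client construction becomes OpenAI(base_url=cfg.base_url, api_key=cfg.api_key), and the rest of the script stays the same whether it points at Ollama or somewhere else.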

What 256K Means

The Qwen3.5 model page advertises 256K because that's the model maximum. Ollama's context-length docs describe a separate runtime default that depends on available VRAM: under 24 GiB defaults to 4K, 24 to 48 GiB defaults to 32K, and 48 GiB or more defaults to 256K.[2][8]

That distinction matters. "This model supports 256K" isn't the same statement as "my laptop starts every session at 256K."
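The VRAM tiers quoted above are easy to encode as a lookup, which helps when you want a script to predict what a machine will start with. This sketch treats "K" as 1024 tokens, which is an assumption, and default_num_ctx is an illustrative name, not an Ollama function.

```python
def default_num_ctx(vram_gib: float) -> int:
    """Map available VRAM to the default context tier described above."""
    if vram_gib < 24:
        return 4 * 1024
    if vram_gib < 48:
        return 32 * 1024
    return 256 * 1024


MODEL_MAX_CTX = 256 * 1024  # the advertised model maximum

for vram in (12, 24, 48):
    ctx = default_num_ctx(vram)
    print(f"{vram} GiB VRAM -> default {ctx} tokens ({ctx / MODEL_MAX_CTX:.1%} of model max)")
```

On a 12 GiB card, the default works out to under 2 percent of the advertised maximum, which is the gap the paragraph above is warning about.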

For agents, coding tools, and other long-context workloads, Ollama's docs now recommend at least 64K when the machine can sustain it.[8] For many laptop-class setups, start smaller and move up only when the task needs it.

The FAQ examples still use 4096 when showing override syntax, while the dedicated context-length page documents the current VRAM-based defaults.[9][8]

Inside an interactive ollama run session, change the active context like this:[9]

terminal-7
/set parameter num_ctx 32768

For the native Ollama API, pass num_ctx per request:[9]

terminal-8
curl http://localhost:11434/api/chat -d '{
  "model": "qwen3.5:9b",
  "messages": [
    {
      "role": "user",
      "content": "Summarize the memory trade-offs of long local context."
    }
  ],
  "options": {
    "num_ctx": 32768
  }
}'
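The same request can be built from Python with only the standard library. This sketch constructs the request object without sending it, so you can inspect the body first. The stream field set to false asks for a single JSON reply instead of a stream; build_chat_request is a hypothetical helper name.

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str, num_ctx: int,
                       base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a native /api/chat request that sets num_ctx for this call only."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"num_ctx": num_ctx},
        "stream": False,  # one JSON object back instead of a token stream
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "qwen3.5:9b",
    "Summarize the memory trade-offs of long local context.",
    num_ctx=32768,
)
print(req.get_method(), req.full_url)
print(json.loads(req.data)["options"])
```

To send it, pass req to urllib.request.urlopen and parse the reply with json.load. That step needs a running Ollama server with the model already pulled.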

If you're going through /v1/chat/completions or /v1/responses, Ollama's OpenAI-compatibility docs say you don't set context size in the request body. Create a derived model instead, or set a server-wide default when starting ollama serve.[7][8]

Modelfile
FROM qwen3.5:9b
PARAMETER num_ctx 32768

terminal-9
ollama create qwen3.5-9b-32k -f Modelfile
ollama run qwen3.5-9b-32k

For CLI or service deployments, you can also set a server-wide default before starting Ollama:[8]

terminal-10
OLLAMA_CONTEXT_LENGTH=64000 ollama serve

Before assuming a larger tag is the problem, inspect where Ollama placed it and what context it allocated:[8][9]

terminal-11
ollama ps

If the model is partly on CPU, latency usually gets ugly long before the tag "fails" outright.
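For scripted checks, a placement report can be computed from the running-models endpoint instead of reading the table by eye. This sketch assumes the GET /api/ps payload lists models with size and size_vram in bytes; verify those field names against your installed Ollama version before relying on them. The byte counts in the sample are invented for illustration.

```python
def gpu_fraction(entry: dict) -> float:
    """Fraction of a loaded model's memory that sits in VRAM."""
    size = entry.get("size", 0)
    if size <= 0:
        return 0.0
    return entry.get("size_vram", 0) / size


def placement_report(ps_payload: dict) -> list:
    """One human-readable line per loaded model."""
    lines = []
    for entry in ps_payload.get("models", []):
        frac = gpu_fraction(entry)
        status = "fully on GPU" if frac >= 1.0 else f"{frac:.0%} on GPU, rest on CPU"
        lines.append(f"{entry.get('name', '?')}: {status}")
    return lines


# Invented sample payload (assumption about the /api/ps shape and sizes).
sample_ps = {"models": [
    {"name": "qwen3.5:9b", "size": 8_000_000_000, "size_vram": 8_000_000_000},
    {"name": "qwen3.5:27b", "size": 20_000_000_000, "size_vram": 12_000_000_000},
]}

for line in placement_report(sample_ps):
    print(line)
```

Anything short of "fully on GPU" is the signal to cut context or step down a size class before tuning anything else.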

Context length is only one knob. Ollama's FAQ also documents four server settings that change local behavior fast: OLLAMA_KEEP_ALIVE for model residency, OLLAMA_NUM_PARALLEL for per-model concurrency, OLLAMA_FLASH_ATTENTION=1 to reduce long-context memory pressure, and OLLAMA_KV_CACHE_TYPE to quantize the KV cache when Flash Attention is enabled.[9]

The common tuning mistake is raising context and parallelism together. Memory pressure scales with both, so tune one dimension at a time and check placement after each change.

terminal-12
OLLAMA_FLASH_ATTENTION=1 \
OLLAMA_KV_CACHE_TYPE=q8_0 \
OLLAMA_KEEP_ALIVE=30m \
ollama serve

Use q8_0 first if you need the extra headroom. Ollama's FAQ describes it as roughly half the memory of f16 with a very small precision loss, while q4_0 drops to about one quarter of f16 with a more noticeable quality trade-off at larger contexts.[9]

If you raise parallelism, do it with your eyes open. Ollama's FAQ says required RAM scales with OLLAMA_NUM_PARALLEL * OLLAMA_CONTEXT_LENGTH.[9] A setup that works at 32K for one request can fall over fast at the same context for four concurrent requests.
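Here is a back-of-the-envelope sketch that combines the two scaling rules from the FAQ: memory grows with parallelism times context, and the KV cache type shrinks it by roughly one half (q8_0) or one quarter (q4_0) relative to f16. The output is a relative load in "f16 token slots", not bytes, and relative_kv_load is an illustrative helper rather than anything Ollama exposes.

```python
# Approximate factors relative to f16, as described in the text above.
KV_FACTOR = {"f16": 1.0, "q8_0": 0.5, "q4_0": 0.25}


def relative_kv_load(num_parallel: int, context_length: int,
                     kv_cache_type: str = "f16") -> float:
    """Relative KV cache pressure: parallel slots x context x cache-type factor."""
    return num_parallel * context_length * KV_FACTOR[kv_cache_type]


baseline = relative_kv_load(1, 32768)
four_way = relative_kv_load(4, 32768)
four_way_q8 = relative_kv_load(4, 32768, "q8_0")

print(f"4 parallel requests at 32K: {four_way / baseline:.1f}x the single-request load")
print(f"Same with q8_0 KV cache:    {four_way_q8 / baseline:.1f}x the single-request load")
```

Four concurrent requests at 32K cost about four times the single-request cache, and switching to q8_0 only claws back half of that. That's the arithmetic behind tuning one dimension at a time.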

Recommended Picks by Use Case

Lightweight local automation

Use qwen3.5:2b or qwen3.5:4b.

These fit:

  • classification
  • rewriting
  • simple extraction
  • offline assistants
  • low-cost API mocks

Serious laptop or 16 GB GPU setup

Use qwen3.5:9b.

This is the mainstream local Qwen tag in this family if you want something that can help with:

  • code explanation
  • shell commands
  • retrieval-augmented Q&A
  • small agent loops
  • structured output tasks

Workstation deployment

Use qwen3.5:27b or qwen3.5:35b only if you have the memory budget.

Don't pick a larger tag just because the benchmark chart looks impressive. If the model spills into CPU memory, user experience will collapse long before the raw capability difference pays back.

When Things Go Wrong

Symptom: The model "runs" but feels unusably slow

Cause: You picked a tag that's too large for your available memory budget, or you pushed context high enough that the model no longer stays cleanly on GPU.

Fix: Check ollama ps first. If the model is split across CPU and GPU, step down one size class or cut context length before you keep tuning.

Symptom: The system becomes unstable on long chats

Cause: Your context setting is probably too aggressive for the machine.

Fix: Reduce context length first. Long local sessions are often memory problems pretending to be model-quality problems.

Symptom: You want the largest Qwen model but only have consumer hardware

Cause: Expecting a 122B-class experience from a laptop GPU.

Fix: Run 9b or 27b locally, and only reach for cloud-hosted frontier tiers when the task needs them.

Symptom: Results aren't reproducible across machines

Cause: Relying on moving defaults like latest or the size-tier alias without pinning the exact quantization.

Fix: Pin exact artifact tags such as qwen3.5:9b-q4_K_M or qwen3.5:35b-a3b-int4 in scripts and clients. Avoid relying on a moving default.
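A quick lint for scripts can catch unpinned references before they cause drift. This sketch treats a model reference as pinned only when its tag ends in one of the quantization suffixes mentioned in this article. That suffix list is an assumption drawn from the examples here, not an exhaustive list from Ollama, and is_pinned is a hypothetical helper.

```python
# Quantization suffixes that appear in this article (not an exhaustive list).
KNOWN_QUANT_SUFFIXES = {"q4_K_M", "q8_0", "bf16", "mxfp8", "nvfp4", "int4", "int8"}


def is_pinned(model_ref: str) -> bool:
    """Heuristic: True when the tag names an explicit quantized artifact."""
    _, sep, tag = model_ref.partition(":")
    if not sep or tag in ("", "latest"):
        return False  # bare name or moving alias
    # Size-tier aliases such as "9b" or "35b-a3b" carry no quantization suffix.
    return tag.split("-")[-1] in KNOWN_QUANT_SUFFIXES


for ref in ("qwen3.5", "qwen3.5:latest", "qwen3.5:9b", "qwen3.5:35b-a3b",
            "qwen3.5:9b-q4_K_M", "qwen3.5:35b-a3b-int4"):
    print(f"{ref:<24} pinned={is_pinned(ref)}")
```

Only the last two references pass, which matches the advice above: the size tier alone is a convenience alias, not a deployment artifact.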

Practice: Can You Fit It?

Here's a concrete exercise to test whether the memory concepts have stuck.

Scenario: You have a laptop with an RTX 3060 (12 GB VRAM). The OS and desktop environment use about 2 GB. You want to run a local Qwen3.5 model and leave enough headroom for a 4K context window and the KV cache.

Question: Which of these tags is the safest choice?

  • qwen3.5:9b (6.6 GB on disk)
  • qwen3.5:27b (17 GB on disk)
  • qwen3.5:35b-a3b-int4 (smaller than the default 35b alias)

Think about it before reading on.

The 27b artifact is 17 GB before it even loads into VRAM. With only 10 GB free, that'll spill to CPU immediately. The 35b-a3b-int4 variant is smaller than the default 35B alias, but it is still a 20 GB artifact. The safest pick is qwen3.5:9b. At 6.6 GB on disk, it has room to expand in memory plus the KV cache without leaving the GPU.
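The same reasoning can be written as a filter over the three candidates. This is a sketch using the artifact sizes quoted in this guide and a 3 GB reserve, which is a rule-of-thumb assumption rather than a measured requirement for a 4K context.

```python
# Artifact sizes in GB as quoted in this guide.
CANDIDATES_GB = {
    "qwen3.5:9b": 6.6,
    "qwen3.5:27b": 17.0,
    "qwen3.5:35b-a3b-int4": 20.0,
}


def fitting_tags(free_gb: float, candidates: dict, reserve_gb: float = 3.0) -> list:
    """Tags whose artifact leaves at least reserve_gb for KV cache and overhead."""
    return [tag for tag, size_gb in candidates.items()
            if free_gb - size_gb >= reserve_gb]


free_gb = 12.0 - 2.0  # RTX 3060 VRAM minus OS and desktop usage
print("Free VRAM:", free_gb, "GB")
print("Fits with headroom:", fitting_tags(free_gb, CANDIDATES_GB))
```

Only qwen3.5:9b survives the filter. The other two exceed the free VRAM before any context is allocated.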

Bonus question: If you later upgrade to a 24 GB GPU and want to run qwen3.5:27b with an 8K context, what single environment variable can you set to reduce KV cache memory pressure before starting Ollama?

Answer: OLLAMA_FLASH_ATTENTION=1 enables a memory-efficient attention implementation that reduces memory pressure as context grows.[9]

Follow-up questions

  1. If a 27B tag launches but ollama ps shows CPU involvement, is the first fix to lower temperature, lower context, or step down a size class? Why?
  2. When would qwen3.5:35b-a3b be worth testing instead of staying on qwen3.5:9b, even if both technically run?
  3. If you are building a private retrieval loop, which matters more first: the model's advertised 256K max context or the runtime CONTEXT Ollama actually allocates on your machine?

Where This Leads Next

By now you can:

  1. Pick a Qwen3.5 tag that fits your hardware without guessing
  2. Explain the difference between a model's maximum context window and the runtime context Ollama allocates
  3. Customize a local model with a Modelfile for repeatable behavior
  4. Connect a Python script to your local endpoint through the OpenAI-compatible API
  5. Diagnose slowdowns by checking whether the model has spilled to CPU

These skills are the foundation for local agent development. The next logical step in the LeetLLM path is to build a small retrieval-augmented generation (RAG) loop that feeds documents from your local filesystem into the model through the API. That's where local inference stops being a demo and starts being a tool you can ship.

If you're deciding quickly:

  1. Install Ollama.[5]
  2. Pull qwen3.5:9b.[2]
  3. Start with a realistic context size and confirm placement with ollama ps.[8][9]
  4. Use the OpenAI-compatible local endpoint for development.[7]
  5. Move up to 27b or 35b only if your machine has clear headroom.

That setup gives you a fast, useful local Qwen workflow without turning the whole exercise into memory debugging.

References

[1] Qwen Team. Qwen3.6. 2026.
[2] Ollama. qwen3.5. 2026.
[3] Qwen Team. Qwen3.5-35B-A3B. 2026.
[4] Qwen Team. Qwen3.5-122B-A10B. 2026.
[5] Ollama Team. Ollama GitHub Repository. 2026.
[6] Gerganov, G. llama.cpp: Inference of LLaMA model in pure C/C++. 2023.
[7] Ollama. OpenAI compatibility - Ollama. 2026.
[8] Ollama. Context length - Ollama. 2026.
[9] Ollama. FAQ - Ollama. 2026.