Tags: Local LLM · Ollama · Qwen3.5 · Tutorial · GPU Inference

Run Qwen3.5 Locally with Ollama

Qwen3.5 is available in Ollama from 0.8B to 122B. This guide shows how to choose the right local tag, fit it to your memory budget, and expose it through Ollama's OpenAI-compatible API.

LeetLLM Team · March 2, 2026 · 24 min read


Qwen3.5 is one of the best local-model families you can run through Ollama right now because the official model page covers a very wide size range and keeps the API story simple. Ollama exposes tags from 0.8B all the way to 122B, with published artifact sizes and a consistent 256K context window across the local tags.[1]

That means the real problem is not "can I run Qwen3.5 locally?" It is "which tag fits my machine without turning the experience into sludge?"

This guide stays strict about that question. It uses published Ollama tag sizes and official Ollama docs, then adds conservative deployment advice on top.[1][2][3][4]

What Ollama Actually Ships

The current Ollama library page lists these Qwen3.5 tags for local use:[1]

| Tag | Published size | Context window | Input modes |
| --- | --- | --- | --- |
| qwen3.5:0.8b | 1.0 GB | 256K | Text, Image |
| qwen3.5:2b | 2.7 GB | 256K | Text, Image |
| qwen3.5:4b | 3.4 GB | 256K | Text, Image |
| qwen3.5:9b | 6.6 GB | 256K | Text, Image |
| qwen3.5:27b | 17 GB | 256K | Text, Image |
| qwen3.5:35b | 24 GB | 256K | Text, Image |
| qwen3.5:122b | 81 GB | 256K | Text, Image |

The same page also highlights direct integrations with Claude Code, Codex, OpenCode, and OpenClaw through ollama launch ... --model qwen3.5 commands.[1]

That is why Qwen3.5 is attractive for local workflows. You do not need to invent a fragile wrapper around a raw checkpoint. You get a published model tag, a local server, and an OpenAI-compatible API surface.

Pick the Right Tag for Your Hardware

Use the published Ollama artifact size as the first filter, then leave extra headroom for context, KV cache, and the rest of your system. A 6.6 GB model file is not the same thing as "only 6.6 GB required."

[Figure: Qwen3.5 model family overview showing published Ollama tag sizes from 0.8B to 122B, with the 9B tag highlighted as the practical mainstream local choice.]

Here is the conservative sizing advice I would use:

| Hardware budget | Safe Qwen3.5 choice | Why |
| --- | --- | --- |
| 8 GB unified memory / VRAM | 0.8b or 2b | Fast enough for basic local chat, classification, and glue tasks |
| 12 GB | 4b | Good balance for laptops and entry GPUs |
| 16 GB | 9b | The best mainstream local choice for real coding and agent experiments |
| 24 GB | 27b if you accept lower throughput | Bigger jump in quality, but much tighter memory budget |
| 32 GB+ | 35b | Workstation-class local deployment |
| 80 GB+ | 122b | Server-class machine only |

If you want one default recommendation for most developers, use qwen3.5:9b. It is small enough to be practical on 16 GB hardware and large enough to be useful for real coding, search, and automation work.
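The sizing table above can be encoded as a small helper. This is a hypothetical sketch, not an official Ollama recommendation: the thresholds simply mirror the conservative table, and `pick_qwen_tag` is a name invented for illustration.

```python
def pick_qwen_tag(mem_gib: float) -> str:
    """Map an available memory budget (GiB of unified memory or VRAM)
    to the most conservative Qwen3.5 tag from the sizing table."""
    tiers = [
        (80, "qwen3.5:122b"),
        (32, "qwen3.5:35b"),
        (24, "qwen3.5:27b"),
        (16, "qwen3.5:9b"),
        (12, "qwen3.5:4b"),
        (8, "qwen3.5:2b"),
    ]
    for floor, tag in tiers:
        if mem_gib >= floor:
            return tag
    # Below 8 GB, the smallest tag is the only safe bet.
    return "qwen3.5:0.8b"
```

For example, `pick_qwen_tag(16)` returns the mainstream `qwen3.5:9b` default, while anything under 8 GiB falls through to the 0.8b tag.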

Install Ollama

The official project is straightforward to install and run locally.[2]

bash
curl -fsSL https://ollama.com/install.sh | sh

If you are on macOS or Windows, download the native app from Ollama instead.[2]

Once the service is running, confirm it responds:

bash
ollama --version
curl http://localhost:11434/api/tags
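The same check can be done from Python with only the standard library. This sketch assumes Ollama is running on its default port 11434; `installed_models` is a helper name made up for this example, which just pulls model names out of the `/api/tags` response.

```python
import json
import urllib.request


def installed_models(tags_payload: dict) -> list:
    """Extract model names from an Ollama /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", [])]


if __name__ == "__main__":
    # Hits the local server; only run this when Ollama is up.
    with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
        payload = json.load(resp)
    print(installed_models(payload))
```

If the list comes back empty, the server is running but no model has been pulled yet.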

Pull and Run Qwen3.5

Start with the exact tag you have room for.

bash
ollama pull qwen3.5:9b
ollama run qwen3.5:9b

If you just run ollama run qwen3.5, Ollama resolves to the default local tag shown on the library page, which is the 9b variant at the time of writing.[1]

That is convenient, but I still recommend pinning the tag explicitly when you are building tooling or scripts. "Latest" is easy for experimentation and bad for reproducibility.

Use Qwen3.5 From Other Tools

One of the nicest things about the official Qwen3.5 Ollama page is that it documents direct launch commands for several coding tools:[1]

bash
ollama launch claude --model qwen3.5
ollama launch codex --model qwen3.5
ollama launch opencode --model qwen3.5
ollama launch openclaw --model qwen3.5

That does not mean Qwen3.5 instantly becomes the best model for every tool. It means the integration surface is clean enough that you can try it without building custom glue code first.

The Local Serving Model

Ollama is not a hosted inference API. It is a local model server wrapped around a compiled inference runtime.

Under the hood, Ollama uses llama.cpp and GGUF-style local model packaging, which is why it runs across Macs, Linux workstations, and consumer GPUs without asking you to stand up a full datacenter inference stack.[4]

[Figure: Ollama local inference workflow — requests flow from an app or terminal through the Ollama server into llama.cpp, which loads the local Qwen3.5 weights and streams tokens back.]

That architecture gives you three immediate benefits:

  1. No network dependency for inference
  2. A stable local endpoint for tooling
  3. A clean stepping stone before moving to vLLM or another production engine

OpenAI-Compatible API

Ollama publishes compatibility for parts of the OpenAI API, including /v1/chat/completions and /v1/responses.[3]

That means you can point existing SDK-based code at your local server with only a base URL change:

python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama",
)

response = client.chat.completions.create(
    model="qwen3.5:9b",
    messages=[
        {"role": "system", "content": "You are a precise coding assistant."},
        {"role": "user", "content": "Explain how to implement a retry loop with backoff."},
    ],
)

print(response.choices[0].message.content)

For local development, that is a big deal. You can test the same application shape against a local model first, then swap the base URL later if you move to a hosted deployment.
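Streaming works over the same endpoint. The sketch below avoids any SDK and parses the server-sent-events stream with the standard library; it assumes the default port and an already-pulled qwen3.5:9b tag, and `delta_from_sse_line` is a helper name invented here.

```python
import json
import urllib.request


def delta_from_sse_line(line: str):
    """Extract the content delta from one 'data: {...}' SSE line, or None."""
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):].strip()
    if payload == "[DONE]":
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content")


if __name__ == "__main__":
    # Requires a running local Ollama server.
    req = urllib.request.Request(
        "http://localhost:11434/v1/chat/completions",
        data=json.dumps({
            "model": "qwen3.5:9b",
            "messages": [{"role": "user", "content": "Say hello in one word."}],
            "stream": True,
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            delta = delta_from_sse_line(raw.decode())
            if delta:
                print(delta, end="", flush=True)
```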

Keep Context in Check

Yes, the published local tags advertise 256K context.[1]

No, that does not mean you should casually run every laptop session at 256K.

Long context increases memory pressure and hurts latency. The safe way to use local Qwen3.5 is:

  • start with a much smaller num_ctx
  • only raise it when the task genuinely needs it
  • remember that "fits in theory" and "pleasant to use" are different standards

A practical starting point looks like this:

bash
ollama run qwen3.5:9b

Then set a smaller context in a Modelfile or request options if your machine gets tight on memory.
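Per-request, the context cap goes through the `options` object of Ollama's native /api/chat endpoint. This is a sketch assuming the default port and the qwen3.5:9b tag; 8192 is an illustrative value, not a recommendation, and `chat_payload` is a helper name made up here.

```python
import json
import urllib.request


def chat_payload(model: str, prompt: str, num_ctx: int) -> dict:
    """Build a non-streaming /api/chat request with a capped context window."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


if __name__ == "__main__":
    # Requires a running local Ollama server.
    body = json.dumps(chat_payload("qwen3.5:9b", "Summarize HTTP caching.", 8192))
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=body.encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["message"]["content"])
```

The Modelfile route is the persistent equivalent: a `PARAMETER num_ctx 8192` line bakes the same cap into a custom local tag.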

Recommended Picks by Use Case

Lightweight local automation

Use qwen3.5:2b or qwen3.5:4b.

These are good fits for:

  • classification
  • rewriting
  • simple extraction
  • offline assistants
  • low-cost API mocks

Serious laptop or 16 GB GPU setup

Use qwen3.5:9b.

This is the best mainstream local Qwen tag if you want something that can still help with:

  • code explanation
  • shell commands
  • retrieval-augmented Q&A
  • small agent loops
  • structured output tasks

Workstation deployment

Use qwen3.5:27b or qwen3.5:35b only if you actually have the memory budget.

Do not buy a larger tag just because the benchmark chart looks impressive. If the model spills badly, your user experience will collapse long before the raw capability difference pays back.

Common Problems

The model "runs" but feels unusably slow

That usually means you picked a tag that is too large for your available memory budget.

The fix is not to keep tuning forever. The fix is usually to step down one size class.

The system becomes unstable on long chats

Your context setting is probably too aggressive for the machine.

Reduce context length first. Long local sessions are often memory problems pretending to be model-quality problems.

You want the largest Qwen model but only have consumer hardware

That is exactly what the smaller tags are for. Run 9b or 27b locally, and only reach for cloud-hosted frontier tiers when the task really needs them.

You need reproducibility

Pin exact tags such as qwen3.5:9b in scripts and clients. Avoid relying on a moving default.

When Qwen3.5 Is the Right Local Choice

Qwen3.5 is a strong local default if you want:

  • one family with many hardware tiers
  • long-context local experimentation
  • simple OpenAI-compatible local serving
  • direct integration with developer tools

It is a worse choice if your main goal is squeezing the absolute largest possible model onto marginal hardware. In that case, the right move is usually to choose a smaller tag cleanly instead of forcing a bad fit.

Practical Recommendation

If you are deciding quickly:

  1. Install Ollama.[2]
  2. Pull qwen3.5:9b.[1]
  3. Use the OpenAI-compatible local endpoint for development.[3]
  4. Only move up to 27b or 35b if your machine has obvious headroom.

That is the setup that gives you the best odds of a fast, stable, useful local Qwen workflow without turning the whole exercise into memory debugging.

References

[1] qwen3.5 — Ollama, 2026.

[2] Ollama GitHub Repository — Ollama Team, 2026.

[3] OpenAI compatibility - Ollama — Ollama, 2026.

[4] llama.cpp: Inference of LLaMA model in pure C/C++ — Gerganov, G., 2023.
