

๐Ÿท๏ธ Local LLM๐Ÿท๏ธ Ollama๐Ÿท๏ธ Qwen3.5๐Ÿท๏ธ Tutorial๐Ÿท๏ธ GPU Inference

Run Qwen3.5 Locally with Ollama

Alibaba's Qwen3.5 is one of the most capable open-source model families available right now. This guide walks you through running it on your own GPU using Ollama, with a focus on getting the most out of 16GB VRAM hardware like the RTX 5070 Ti.

LeetLLM Team · March 2, 2026 · 10 min read


Imagine having a capable AI assistant that answers in a fraction of a second, never hits a rate limit, keeps your conversations completely private, and costs nothing beyond your electricity bill. That's what running a local LLM gets you. With Alibaba's Qwen3.5 release in February 2026, the models you can run at home have gotten genuinely excellent.

This guide walks you through running Qwen3.5 on consumer hardware using Ollama, the easiest and most polished tool for local model inference. We'll use the RTX 5070 Ti (16GB VRAM) as our reference, but the steps and model choices apply to any modern GPU or Apple Silicon Mac with comparable memory.

What Is Qwen3.5?

Qwen3.5 is Alibaba's latest generation of open-weight multimodal models, released in February 2026.[1] The family spans model sizes from 0.8B to 397B parameters and is available for free download under an open-source license on Hugging Face and the Ollama model library.

What makes Qwen3.5 notable isn't just raw benchmark performance, though those numbers are strong. The architecture combines two significant advances:

Unified vision-language training. Unlike earlier models where vision capability was bolted on after the fact, Qwen3.5 trains on multimodal tokens from the start. The result: the same model weights handle text, images, and video understanding, with no separate vision encoder to manage.

Hybrid Mixture-of-Experts architecture. Every Qwen3.5 local model uses a sparse Mixture-of-Experts (MoE) architecture. In a standard "dense" transformer, every parameter activates for every token. In an MoE model, the network is divided into many specialized sub-networks called experts, and a learned router selects only a few of them per token. For Qwen3.5, this means the 9B model has 9 billion total parameters but activates far fewer on each forward pass, which is why memory bandwidth requirements are lower than a naive byte count would suggest.

This matters practically: MoE models are faster per token than their parameter count implies, and they can reach a higher capability ceiling for a given VRAM budget. The tradeoff is that the full weight file still needs to live in VRAM even though only a fraction is active at any moment. If you want the full technical story (routing algorithms, load balancing, expert collapse), our Mixture of Experts article covers it from first principles.
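The routing idea can be sketched in a few lines of Python. This is a toy illustration, not Qwen3.5's actual router: real routers are small learned layers over high-dimensional hidden states, and the scoring function here is invented purely for demonstration.

```python
# Toy sketch of Mixture-of-Experts routing (illustrative only).

def route(token_features, num_experts=8, top_k=2):
    """Score every expert for a token, keep only the top_k."""
    # Stand-in for a learned router: a fixed, made-up scoring pattern.
    scores = [sum(f * ((e + i) % 3 - 1) for i, f in enumerate(token_features))
              for e in range(num_experts)]
    ranked = sorted(range(num_experts), key=lambda e: scores[e], reverse=True)
    return ranked[:top_k]  # only these experts run for this token

# Each token activates just top_k of num_experts sub-networks, so compute
# per token scales with top_k, not with the total parameter count.
active = route([0.2, -0.5, 0.9, 0.1])
print(f"Experts activated for this token: {active} (2 of 8)")
```

The key point the sketch captures: all eight expert networks must exist in memory, but only two run per token, which is why an MoE model's speed tracks its active parameters while its VRAM footprint tracks its total parameters.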

The model also sports a 256K context window, support for 201 languages, and strong reasoning and coding capabilities. In the Qwen team's benchmarks, the 397B flagship model is competitive with GPT-5.2 and Gemini-3 Pro on most tasks.[2]

For local inference, what matters most is the smaller models in the family.

Qwen3.5 model family overview showing available sizes (0.8B to 122B) and VRAM requirements, with the 9B model highlighted as the sweet spot for 16GB GPUs.

Choosing the Right Model for Your Hardware

GPU memory (VRAM) is the hard constraint for local inference. You need the entire model to fit in VRAM for fast generation; models that spill into system RAM drop to a fraction of the speed.

Here's what's available via Ollama and what each variant requires:

| Model | Size on Disk | VRAM Needed | Who It's For |
|---|---|---|---|
| qwen3.5:0.8b | 1.0 GB | ~2 GB | Testing, embedded systems |
| qwen3.5:2b | 2.7 GB | ~4 GB | Low-end laptops, always-on helpers |
| qwen3.5:4b | 3.4 GB | ~5 GB | 6-8 GB VRAM cards (RTX 3060 etc.) |
| qwen3.5:9b | 6.6 GB | ~8 GB | Sweet spot for 12-16 GB cards |
| qwen3.5:27b | 17 GB | ~18 GB | 24 GB cards (RTX 3090, 4090) |
| qwen3.5:35b | 24 GB | ~25 GB | 24 GB+ or multi-GPU |
| qwen3.5:122b | 81 GB | 80+ GB | Multi-GPU workstations |

For a 16 GB VRAM card like the RTX 5070 Ti or RTX 5080, the right choice is qwen3.5:9b. At 6.6 GB on disk, it fits entirely in VRAM with room to spare for a long context window. You'll get 80-120 tokens per second on recent GPU generations like Blackwell, which makes conversations feel instant.

💡 Key insight: The qwen3.5 tag without a size suffix defaults to the 9B model. That's the one Alibaba recommends as the general-purpose choice. Running ollama run qwen3.5 on a 16 GB card gives you the best experience.

The 27B model at 17 GB won't fit in 16 GB VRAM without layer offloading. Community benchmarks on 16 GB GPUs show it drops to roughly 7 tokens per second when transformer layers spill to CPU, a 10x slowdown from the 9B.[3] This is a property of how MoE models interact with VRAM limits: Ollama offloads at the layer level, not the expert level, so the GPU sits idle waiting on CPU memory transfers for every token. The 9B model runs at 80+ tok/s on the same card. Bigger isn't always better when the bottleneck is memory bandwidth.

โš ๏ธ Common mistake: Trying to run qwen3.5:27b on a 16 GB card and expecting it to work well. When VRAM overflows, Ollama offloads entire transformer layers to CPU, creating a massive GPU-CPU data transfer bottleneck on every token. The result: 5-10x slower generation than a smaller model that fits completely in VRAM.

Why Ollama?

The local inference space has several good tools: llama.cpp (the underlying engine), LM Studio (a GUI), vLLM (production-grade serving), and LocalAI (API-compatible server). Ollama wraps llama.cpp with a clean CLI, a REST API, and automatic model management.[4]

Here's why it's the right starting point:

  • One-command model management. ollama run qwen3.5 downloads the model, sets up the quantization, starts the server, and drops you into a chat shell.
  • OpenAI-compatible API. Any tool that works with the OpenAI API works with Ollama by pointing it at http://localhost:11434/v1. This includes Claude Code, OpenCode, Cursor, and hundreds of libraries.
  • Cross-platform. Works on Linux, macOS (including Apple Silicon), and Windows via WSL2.
  • Supports Blackwell GPUs. Starting with Ollama v0.17.x, CUDA 12.8 is the runtime, which enables native sm_120 support for RTX 5000-series cards.
[Diagram]

The diagram above shows the data path: your prompt goes through the Ollama CLI into the llama.cpp inference engine, which loads the model weights into GPU VRAM and streams tokens back to you. No cloud endpoint, no API key, no data leaving your machine.

Setting Up Ollama

Install Ollama

The installation is a single command on Linux and macOS.

Linux:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

This script detects your GPU (NVIDIA, AMD, or Apple Silicon) and installs the appropriate CUDA or ROCm runtime. On NVIDIA systems, it will use CUDA 12.8 if available, which is required for RTX 5000-series (Blackwell) cards.

macOS:

Download the app from ollama.com and drag it into your Applications folder. Ollama runs as a menu bar app and automatically uses Metal for Apple Silicon acceleration.

Windows:

Download the installer from ollama.com. It runs through WSL2, so you'll need WSL installed first (wsl --install in an admin PowerShell).

Verify the installation:

```bash
ollama --version
# Should print: ollama version X.Y.Z
```

RTX 5070 Ti / Blackwell Users: Check Your CUDA Version

The RTX 5070 Ti uses NVIDIA's Blackwell architecture with compute capability 12.0 (sm_120). Older release binaries of Ollama bundled CUDA 12.4, which doesn't include sm_120 support. If you installed an older version, the model will run on CPU rather than GPU.

To check which runtime you have:

```bash
ollama ps     # shows running models and hardware
nvidia-smi    # shows GPU usage during inference
```

If Ollama shows your model on CPU, upgrade to the latest version:

```bash
curl -fsSL https://ollama.com/install.sh | sh
# Or on Linux: sudo apt upgrade ollama (if installed via apt)
```

The latest releases include CUDA 12.8 with full Blackwell support.

Running Your First Qwen3.5 Chat

With Ollama installed, one command pulls the model and starts an interactive session:

```bash
ollama run qwen3.5
```

The first run downloads the 9B model (about 6.6 GB), which takes a few minutes on a typical internet connection. Once downloaded, the model loads from disk into VRAM and you'll see an interactive prompt:

>>> Send a message (/? for help)

Try asking it something technical:

>>> Explain the KV cache in three sentences.

You should see tokens streaming in at 80+ tokens per second on a 16 GB GPU, which feels like real-time. If you see fewer than 10 tok/s, the model may be running on CPU; check the CUDA version note above.

To exit the session, type /bye or press Ctrl+D. The model stays loaded in VRAM for a configurable keep-alive period (default: 5 minutes), so your next request starts instantly.

The REST API

The real power of Ollama isn't the CLI but the REST API, which lets you integrate local AI into any application. The server starts automatically on port 11434 when you run any model.

Here's how to send a chat completion request from the command line. This sends a JSON payload with the model name and a message, and streams the response back:

```bash
curl http://localhost:11434/api/chat \
  -d '{
    "model": "qwen3.5",
    "messages": [{"role": "user", "content": "What is a transformer?"}]
  }'
```

Or using the OpenAI-compatible endpoint (works with most existing tooling):

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5",
    "messages": [{"role": "user", "content": "What is a transformer?"}]
  }'
```

In Python, you can use either the official ollama package or the standard openai library:

```python
# Using the ollama package
from ollama import chat

response = chat(
    model='qwen3.5',
    messages=[{'role': 'user', 'content': 'Explain RLHF in simple terms.'}],
)
print(response.message.content)
```

The ollama Python package (pip install ollama) is the easiest option for new projects. For projects already using the openai package, swap the base URL:

```python
# Using the openai library (base_url overrides the endpoint)
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # can be anything, Ollama ignores it
)

response = client.chat.completions.create(
    model='qwen3.5',
    messages=[{'role': 'user', 'content': 'Explain RLHF in simple terms.'}],
)
print(response.choices[0].message.content)
```
Ollama inference workflow: prompt flows from terminal or application through the Ollama server, into llama.cpp, then loads model weights from GPU VRAM to generate tokens.

Understanding Quantization

The 9B model you downloaded with ollama run qwen3.5 is already a quantized version, stored in GGUF format. Here's what that means.

A neural network stores its knowledge as billions of floating-point numbers called weights. At full precision (float32), a 9B model would need about 36 GB of VRAM, far beyond consumer card capacity. Quantization reduces that by representing each weight with fewer bits, trading a tiny amount of quality for a massive reduction in memory.

The default Ollama models use Q4_K_M quantization: each weight is stored in approximately 4 bits (rather than 32), with a "K" grouping strategy and medium-precision scale factors. The math works out to roughly 4.5 bits per weight on average for the quantized tensors, with embeddings and a few other tensors kept at higher precision, which is why a 9B model takes 6.6 GB rather than 36 GB.
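You can sanity-check these numbers with quick arithmetic. The bit widths below are approximate per-weight averages for each format; real GGUF files run somewhat larger than the pure-weight figure because some tensors stay at higher precision.

```python
# Back-of-envelope weight memory for a 9-billion-parameter model.

PARAMS = 9e9

def weight_gb(bits_per_weight):
    """Parameters x bits, converted to gigabytes."""
    return PARAMS * bits_per_weight / 8 / 1e9

for label, bits in [("float32", 32), ("float16", 16),
                    ("Q8_0", 8.5), ("Q4_K_M", 4.5)]:
    print(f"{label:>8}: ~{weight_gb(bits):.1f} GB")
```

float32 lands at the ~36 GB mentioned above, and Q4_K_M at roughly 5 GB of pure weights, consistent with a 6.6 GB file once the higher-precision tensors and metadata are included.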

The GGUF file format is the container that holds these quantized weights. When you run ollama pull qwen3.5, you're downloading a single .gguf file: no separate config files, no Python environment to set up. Ollama handles everything.

Quality impact: for most conversational and coding tasks, Q4_K_M is nearly indistinguishable from the full-precision model. Community benchmarks on Qwen3.5 show perplexity degradation of less than 1% compared to full-precision weights.[3]

💡 Key insight: Quantization is not a hack; it's how local LLMs work in practice. The quality tradeoff at Q4 is small enough that you won't notice it in normal use, but you get a 4-6x reduction in VRAM requirements. See our Model Quantization deep dive for the full technical story on GPTQ, AWQ, and GGUF.

If you want a slightly higher-quality quantization that still fits in 16 GB, look for Q5_K_M or Q6_K variants in the Ollama library. They use more bits per weight at the cost of a larger model file. For a 9B model, these still fit comfortably in 16 GB VRAM.

Checking What's Running

A few commands you'll use regularly:

```bash
# List all downloaded models
ollama list

# Show which models are currently loaded in VRAM
ollama ps

# Pull a specific size variant
ollama pull qwen3.5:4b

# Remove a model to free disk space
ollama rm qwen3.5:0.8b
```

To see GPU utilization while a model is running:

```bash
# In a separate terminal during inference
nvidia-smi -l 1   # refresh every second
```

You should see GPU Memory Used jump to ~8 GB when qwen3.5:9b loads, and GPU Utilization spike to 90-100% while tokens are generating.

Stopping the Server

By default, Ollama keeps the loaded model in VRAM after your session ends ("keep-alive" period). If you want to free the VRAM immediately:

```bash
# Stop a specific model (free its VRAM)
ollama stop qwen3.5

# Or stop the Ollama service entirely
# On Linux (systemd):
sudo systemctl stop ollama

# On macOS: quit the menu bar app, or:
pkill ollama

# On Windows: right-click the system tray icon and quit
```

For scripting, you can also set the keep-alive to zero when starting a session:

```bash
OLLAMA_KEEP_ALIVE=0 ollama run qwen3.5
# Model unloads from VRAM as soon as the session ends
```

Or via the API:

```bash
# Unload a model immediately via API
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen3.5", "keep_alive": 0}'
```

Integrating with Coding Tools

One practical use case for local LLMs is AI-assisted coding that doesn't send your proprietary code to an external API. Several tools make this easy with Ollama.

Claude Code and OpenCode can both be pointed at a local Ollama endpoint. The configuration is usually an environment variable or a settings file where you set the base URL to http://localhost:11434/v1.

For Claude Code specifically, Ollama ships a launcher shortcut:

```bash
ollama launch claude --model qwen3.5
```

This starts a Claude Code session configured to use your local Qwen3.5 model.

For Open WebUI, a browser-based chat interface that connects to your local Ollama server, install it separately:

```bash
# Run Open WebUI via Docker
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```

Then visit http://localhost:3000 in your browser and connect it to http://localhost:11434.

Performance Numbers: What to Expect

On an RTX 5070 Ti (16 GB, Blackwell) running qwen3.5:9b Q4_K_M (validated live on this hardware: Ollama 0.17.5, CUDA 12.8, driver 581):

| Metric | Value |
|---|---|
| VRAM usage (model loaded) | ~8 GB |
| VRAM at long context (32K) | ~10-11 GB |
| Prompt processing speed (prefill) | ~500-600 tokens/sec |
| Generation speed (decode) | ~90 tokens/sec |
| Time to first token (cold start) | 60-90 s (model load from disk) |
| Time to first token (warm) | <0.3 seconds |
| Context window | 256K tokens (default 4K) |

Prompt processing speed is how fast the GPU reads your input. Generation speed is how fast it produces each new output token, the number you feel most in a live conversation. Our Inference Mechanics article explains the full breakdown of time-to-first-token, tokens-per-second, and how the KV cache works under the hood.

For context on what 90 tok/s feels like: a typical paragraph is 100-150 tokens. You're seeing the AI write a full paragraph in about a second. Fast enough that it never feels like you're waiting.
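To get a feel for end-to-end latency, combine the two rates. A rough model, using the mid-range prefill and decode numbers from the table above; the token counts in the example are illustrative.

```python
# Rough end-to-end latency: prompt processing time plus generation time.
# Rates are the ~550 tok/s prefill and ~90 tok/s decode figures above.

def response_time_s(prompt_tokens, output_tokens,
                    prefill_tps=550, decode_tps=90):
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# A 2,000-token prompt with a 300-token answer:
total = response_time_s(2000, 300)
print(f"~{total:.1f}s total (decode dominates for long answers)")
```

Even a fairly long prompt with a multi-paragraph answer comes back in a handful of seconds, which is why local inference at these rates feels interactive rather than batch-like.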

On Apple Silicon (M4 Pro, 48 GB unified memory), expect 30-50 tok/s for the 9B model, with the option to run the 27B comfortably since Apple Silicon uses unified memory shared between CPU and GPU.

On an AMD Radeon RX 7900 XTX (24 GB), you can run the 9B model at similar speeds to NVIDIA via Ollama's ROCm backend.

🎯 Production tip: If you're running Ollama as a background service for multiple applications, set OLLAMA_NUM_PARALLEL=4 to allow parallel request handling. The default is 1 concurrent request; increasing it lets multiple applications query the model simultaneously, though each will be slightly slower.

Platform Notes

Mac Mini (Apple Silicon)

The Mac Mini M4 with 32 GB unified memory is a particularly capable local inference machine. Since Apple's architecture doesn't separate CPU and GPU memory, the full 32 GB is available to the model. You can comfortably run qwen3.5:27b (14-17 GB quantized) at reasonable speed.

Install Ollama via the macOS app at ollama.com. It auto-detects Apple Silicon and uses Metal for GPU acceleration. No CUDA or ROCm setup needed.

```bash
# Recommended for M4 Pro/Max (24+ GB):
ollama run qwen3.5:27b

# For M4 base (16 GB):
ollama run qwen3.5   # defaults to 9b
```

Linux (NVIDIA)

Linux gives you the most control and typically the best performance. Ollama installs as a systemd service that starts on boot.

```bash
# Check service status
sudo systemctl status ollama

# View logs
sudo journalctl -u ollama -f

# Set environment variables for the service
sudo systemctl edit ollama
# Add:
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=1"
```

Windows

Windows support works via WSL2. Performance is generally comparable to Linux, though memory bandwidth can be slightly lower depending on WSL configuration. For best results, ensure your WSL2 installation has CUDA passthrough enabled and install the NVIDIA CUDA drivers for WSL2 from NVIDIA's website.

A Quick Look at Model Capabilities

The 9B model is a generalist: it handles reasoning, coding, math, and multimodal input (images) in a single package. Here are some things to try:

Code generation:

>>> Write a Python function that implements binary search and explain each step.

Image understanding (drag-and-drop an image into the terminal, or use the API with base64):

>>> Describe what's in this image and identify any technical diagrams.

Multilingual:

>>> Explain gradient descent in Japanese.

Long-context analysis:

```bash
# Pipe a long file into Ollama via stdin
cat long_document.txt | ollama run qwen3.5 "Summarize the key points."
```

🔬 Research insight: Qwen3.5 was trained with reinforcement learning across what Alibaba calls "million-agent environments." This means the model was explicitly trained to be robust at tool use, multi-step planning, and agentic tasks, not just single-turn question answering. It shows in practice: the model is notably good at staying on task, using structured output formats, and self-correcting when given feedback.[1]

Validating Your Setup

This script uses the Ollama API directly (not the CLI) to get accurate token counts from Ollama's own eval_count metric:

```python
import json, urllib.request

def chat_api(messages):
    body = json.dumps({
        "model": "qwen3.5",
        "messages": messages,
        "stream": False,
        "options": {"temperature": 0.6, "num_predict": 300}
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=180) as r:
        return json.loads(r.read())

# Warm-up (loads model into VRAM on first call)
print("Loading model into VRAM...")
chat_api([{"role": "user", "content": "Say: ready"}])

# Benchmark
print("Benchmarking generation speed...")
res = chat_api([{
    "role": "user",
    "content": "List the numbers 1 through 50, one per line. Numbers only."
}])

eval_count = res.get("eval_count", 0)
eval_duration_s = res.get("eval_duration", 1e9) / 1e9  # nanoseconds to seconds
tok_s = eval_count / eval_duration_s

print(f"Generated {eval_count} tokens in {eval_duration_s:.1f}s")
print(f"Rate: {tok_s:.0f} tok/s")

if tok_s < 20:
    print("WARNING: Speed looks low. Check GPU is in use: nvidia-smi")
elif tok_s > 50:
    print("GPU acceleration confirmed.")
```

Run it:

```bash
python3 validate_ollama.py
```

Expected output on RTX 5070 Ti (validated):

Loading model into VRAM...
Benchmarking generation speed...
Generated 200 tokens in 2.2s
Rate: 90 tok/s
GPU acceleration confirmed.

💡 Key insight: Qwen3.5 has a built-in chain-of-thought "thinking mode" that activates automatically for complex reasoning tasks. When it does, you'll see <think>...</think> blocks before the answer; these consume tokens and reduce apparent speed. For simple tasks like the benchmark above, thinking doesn't activate. For complex math, code, or multi-step problems, the extra tokens are doing real work. This is the same test-time compute scaling technique used in models like DeepSeek-R1.
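If you're consuming responses programmatically, it's easy to separate the reasoning from the final answer. A minimal sketch, assuming well-formed, non-nested <think> tags:

```python
import re

def split_thinking(text):
    """Return (reasoning blocks, final answer) from a raw model response."""
    thoughts = re.findall(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return thoughts, answer

raw = "<think>2+2: add the numbers.</think>The answer is 4."
thoughts, answer = split_thinking(raw)
print(answer)  # The answer is 4.
```

This lets an application log or display the reasoning separately (or discard it) while showing users only the answer text.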

Troubleshooting

Here are the most common issues you'll hit, with the exact error message and the fix.

"pull model manifest: 412" โ€” Ollama version too old

Error: pull model manifest: 412:
The model you are attempting to pull requires a newer version of Ollama.

This error means the Ollama binary on your system is too old to know about Qwen3.5 (released February 2026). Qwen3.5 requires Ollama v0.17 or later.

Fix: Reinstall Ollama using the official script, which always fetches the latest version:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

On Linux, after the install completes, restart the service to pick up the new binary:

```bash
sudo systemctl restart ollama
ollama --version   # should show 0.17.x or later
```

On macOS, download the latest .dmg from ollama.com and reinstall the app.

Model running on CPU instead of GPU

Symptom: Generation speed is 5-10 tok/s instead of 80+. ollama ps shows 100% CPU or nvidia-smi shows 0% GPU utilization during inference.

Cause A: Ollama version doesn't support your GPU architecture. RTX 5000-series (Blackwell, sm_120) requires Ollama v0.17+, which ships with CUDA 12.8. Earlier builds bundled CUDA 12.4, which has no sm_120 kernels. The model silently falls back to CPU.

To confirm: check the Ollama service logs right after startup:

```bash
sudo journalctl -u ollama -n 50 | grep "inference compute"
```

If GPU acceleration is working you'll see:

inference compute ... library=CUDA compute=12.0 description="NVIDIA GeForce RTX 5070 Ti"

If it shows library=cpu or the GPU line is absent, upgrade Ollama (see above).

Cause B: NVIDIA drivers not installed or outdated. The driver version must support CUDA 12.8 or later. Check:

```bash
nvidia-smi
# Look for: Driver Version: 570.xx or later
# (570.xx is the first driver branch with CUDA 12.8 support,
#  and is required for RTX 5000-series Blackwell cards)
```

If drivers are old, update them via your distro's package manager or from developer.nvidia.com.

Cause C: CUDA libraries not in the library path (Linux). Verify:

```bash
ldconfig -p | grep libcuda
# Should show: libcuda.so.1 -> /usr/lib/x86_64-linux-gnu/libcuda.so.1
```

If nothing appears, install nvidia-cuda-toolkit or add the CUDA lib path to /etc/ld.so.conf.d/.

Very slow context window: speed drops after a few exchanges

Symptom: The first response is fast (80+ tok/s), but by the 5th or 6th message in a conversation your speed drops to 20-30 tok/s.

Cause: Context length. As your conversation grows, the KV cache grows with it. The KV cache stores the computed attention states for every previous token so the model doesn't have to reprocess them, but it costs VRAM proportional to how many tokens are in context. The default context size in Ollama is 4096 tokens. Once the conversation nears that limit, Ollama applies rolling eviction (dropping oldest context), which can cause the model to "forget" earlier messages. Qwen3.5 supports up to 256K context, so if you want long sessions, set it explicitly. See Long Context Window Management for a deeper guide.

```bash
# Set 32K context for long coding or document sessions
OLLAMA_CONTEXT_LENGTH=32768 ollama run qwen3.5
```

Or via a Modelfile:

FROM qwen3.5
PARAMETER num_ctx 32768

```bash
ollama create qwen3.5-32k -f Modelfile
ollama run qwen3.5-32k
```

Note: larger context takes more VRAM. At 32K context the 9B model uses about 10-11 GB on a 16 GB card, still comfortably within budget.
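The KV-cache cost scales linearly with context length, so you can budget it ahead of time. The formula below is the standard one for an fp16 cache with grouped-query attention; the layer and head counts are hypothetical stand-ins (this post doesn't publish Qwen3.5's exact dimensions), chosen so the result roughly matches the ~2-3 GB growth at 32K noted above.

```python
# KV cache size: 2 (K and V) x layers x kv_heads x head_dim x
# context_tokens x bytes per element. Dimensions below are assumptions.

def kv_cache_gb(ctx_tokens, layers=36, kv_heads=4, head_dim=128,
                bytes_per_elem=2):  # 2 bytes = fp16 cache
    return 2 * layers * kv_heads * head_dim * ctx_tokens * bytes_per_elem / 1e9

for ctx in (4096, 32768, 131072):
    print(f"{ctx:>6} tokens: ~{kv_cache_gb(ctx):.1f} GB of KV cache")
```

The linear scaling is the thing to internalize: quadrupling the context quadruples the cache, which is why jumping straight to 256K on a 16 GB card is not practical even when the weights themselves fit easily.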

"could not connect to a running Ollama instance"

Warning: could not connect to a running Ollama instance
Error: ollama server not responding

The Ollama API server isn't running. On Linux after a fresh install, start it:

```bash
sudo systemctl start ollama
sudo systemctl enable ollama   # auto-start on boot
```

On macOS, launch the Ollama app from Applications. On Windows, start Ollama from the Start menu.

You can verify the server is alive:

```bash
curl http://localhost:11434/api/version
# Returns: {"version":"0.17.5"}
```

Port 11434 already in use

If you previously ran ollama serve manually as a background process and then installed it as a systemd service, you may end up with two instances fighting over port 11434.

Fix: Kill the manual process and let the service take over:

```bash
pkill -f "ollama serve"        # kill the manual instance
sudo systemctl restart ollama  # start the managed service
```

Model output quality seems low (rambling or off-topic)

Unlike older Qwen models, Qwen3.5 uses test-time reasoning by default โ€” it thinks through its answer before responding. This is great for hard problems, but for simple questions it can produce overly verbose replies with visible <think>...</think> blocks.

To get more direct answers, set a concise system prompt:

```bash
ollama run qwen3.5
>>> /set system "You are a concise assistant. Answer directly without extended reasoning."
```

Or tune the temperature: 0.6 works well for factual and coding tasks, 0.9 for creative ones. The default is 0.8. Temperature controls how "random" the model's word choices are: lower means more predictable and focused, higher means more varied and creative.
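Temperature can also be set per request through the API rather than the CLI. A sketch of the payload, following the same "options" field used by the validation script earlier in this post; the top_p value here is an illustrative addition:

```python
import json

# Build an /api/chat request body with per-request sampling options.

def build_payload(prompt, temperature=0.6, top_p=0.9):
    return {
        "model": "qwen3.5",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"temperature": temperature, "top_p": top_p},
    }

payload = build_payload("Summarize quicksort in two sentences.")
print(json.dumps(payload, indent=2))
# POST this to http://localhost:11434/api/chat as shown earlier.
```

Setting options per request, rather than baking them into a Modelfile, lets one loaded model serve both low-temperature coding queries and high-temperature creative ones without a reload.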

โš ๏ธ Common mistake: Assuming slow or poor output means you need a bigger model. In most cases, the bottleneck is context length or temperature, not model capability. Try /clear to reset the conversation context and start fresh before switching to a larger variant.

Key Takeaways

  • Qwen3.5 is one of the strongest open-weight model families available as of early 2026, with competitive performance on math, coding, and multimodal tasks.
  • Ollama is the easiest way to run it locally: one command installs the runtime, another downloads the model and starts a chat session.
  • For 16 GB VRAM GPUs (RTX 5070 Ti, RTX 5080), qwen3.5:9b is the right choice. It fits entirely in VRAM, generates at 80-120 tok/s, and handles a 256K context window.
  • Trying to run the 27B model on 16 GB causes CPU offloading, which drops speed by 10x. Size alone doesn't win when VRAM is the constraint.
  • RTX 5000-series (Blackwell) users need Ollama with CUDA 12.8 (v0.17+) for native GPU support.
  • The OpenAI-compatible API at localhost:11434/v1 lets any existing AI tool switch to a local model with one configuration change.
  • Stop the model explicitly after use (ollama stop qwen3.5) to free VRAM for other applications.
References

  • Qwen3.5: A New Generation of Open-Source Multimodal Models. Alibaba Qwen Team · 2026
  • Ollama: Get up and running with large language models. Ollama Contributors · 2026
  • llama.cpp: Inference of LLaMA model in pure C/C++. Gerganov, G. · 2023
  • Ollama VRAM Requirements: Complete 2026 Guide to GPU Memory for Local LLMs. LocalLLM.in · 2026
  • Follow-up: Qwen3.5-35B-A3B — 7 community-requested experiments on RTX 5080 16GB. gaztrab (Reddit r/LocalLLaMA) · 2026
  • Alibaba releases multimodal Qwen3.5 mixture of experts model. SiliconANGLE · 2026