

๐Ÿท๏ธ Local LLM๐Ÿท๏ธ Ollama๐Ÿท๏ธ Qwen3.5๐Ÿท๏ธ Tutorial๐Ÿท๏ธ GPU Inference

Run Qwen3.5 Locally with Ollama

Alibaba's Qwen3.5 is one of the most capable open-source model families available right now. This guide walks you through running it on your own GPU using Ollama, with a focus on getting the most out of 16GB VRAM hardware like the RTX 5070 Ti.

LeetLLM Team · March 2, 2026 · 10 min read


Imagine having a capable AI assistant that answers in a fraction of a second, never hits a rate limit, keeps your conversations completely private, and costs nothing beyond your electricity bill. That's what running a local LLM gets you. With Alibaba's Qwen3.5 release in February 2026, the models you can run at home have gotten genuinely excellent.

This guide walks you through running Qwen3.5 on consumer hardware using Ollama, the easiest and most polished tool for local model inference. We'll use the RTX 5070 Ti (16GB VRAM) as our reference, but the steps and model choices apply to any modern GPU or Apple Silicon Mac with comparable memory.

What Is Qwen3.5?

Qwen3.5 is Alibaba's latest generation of open-weight multimodal models, released in February 2026.[1] The family spans model sizes from 0.8B to 397B parameters and is available for free download under an open-source license on Hugging Face and the Ollama model library.

What makes Qwen3.5 notable isn't just raw benchmark performance, though those numbers are strong. The architecture combines two significant advances:

Unified vision-language training. Unlike earlier models where vision capability was bolted on after the fact, Qwen3.5 trains on multimodal tokens from the start. The result: the same model weights handle text, images, and video understanding, with no separate vision encoder to manage.

Hybrid Mixture-of-Experts architecture. Every Qwen3.5 local model uses a sparse Mixture-of-Experts (MoE) architecture. In a standard "dense" transformer, every parameter activates for every token. In an MoE model, the network is divided into many specialized sub-networks called experts, and a learned router selects only a few of them per token. For Qwen3.5, this means the 9B model has 9 billion total parameters but activates far fewer on each forward pass, which is why memory bandwidth requirements are lower than a naive byte count would suggest.

This matters practically: MoE models are faster per token than their parameter count implies, and they can reach a higher capability ceiling for a given VRAM budget. The tradeoff is that the full weight file still needs to live in VRAM even though only a fraction is active at any moment. If you want the full technical story (routing algorithms, load balancing, expert collapse), our Mixture of Experts article covers it from first principles.
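The routing idea can be sketched in a few lines of Python. This is a toy illustration, not Qwen3.5's actual router: real routers are small learned layers over high-dimensional hidden states, and the scoring function here is invented purely for demonstration.

```python
# Toy sketch of Mixture-of-Experts routing (illustrative only).

def route(token_features, num_experts=8, top_k=2):
    """Score every expert for a token, keep only the top_k."""
    # Stand-in for a learned router: a fixed, made-up scoring pattern.
    scores = [sum(f * ((e + i) % 3 - 1) for i, f in enumerate(token_features))
              for e in range(num_experts)]
    ranked = sorted(range(num_experts), key=lambda e: scores[e], reverse=True)
    return ranked[:top_k]  # only these experts run for this token

# Each token activates just top_k of num_experts sub-networks, so compute
# per token scales with top_k, not with the total parameter count.
active = route([0.2, -0.5, 0.9, 0.1])
print(f"Experts activated for this token: {active} (2 of 8)")
```

The key point the sketch captures: all eight expert networks must exist in memory, but only two run per token, which is why an MoE model's speed tracks its active parameters while its VRAM footprint tracks its total parameters.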

The model also sports a 256K context window, support for 201 languages, and strong reasoning and coding capabilities. In the Qwen team's benchmarks, the 397B flagship model is competitive with GPT-5.2 and Gemini-3 Pro on most tasks.[2]

For local inference, what matters most is the smaller models in the family.

Qwen3.5 model family overview showing available sizes (0.8B to 122B) and VRAM requirements, with the 9B model highlighted as the sweet spot for 16GB GPUs.

Choosing the Right Model for Your Hardware

GPU memory (VRAM) is the hard constraint for local inference. You need the entire model to fit in VRAM for fast generation; models that spill into system RAM drop to a fraction of the speed.

Here's what's available via Ollama and what each variant requires:

| Model | Size on Disk | VRAM Needed | Who It's For |
|---|---|---|---|
| qwen3.5:0.8b | 1.0 GB | ~2 GB | Testing, embedded systems |
| qwen3.5:2b | 2.7 GB | ~4 GB | Low-end laptops, always-on helpers |
| qwen3.5:4b | 3.4 GB | ~5 GB | 6-8 GB VRAM cards (RTX 3060 etc.) |
| qwen3.5:9b | 6.6 GB | ~8 GB | Sweet spot for 12-16 GB cards |
| qwen3.5:27b | 17 GB | ~18 GB | 24 GB cards (RTX 3090, 4090) |
| qwen3.5:35b | 24 GB | ~25 GB | 24 GB+ or multi-GPU |
| qwen3.5:122b | 81 GB | 80+ GB | Multi-GPU workstations |

For a 16 GB VRAM card like the RTX 5070 Ti or RTX 5080, the right choice is qwen3.5:9b. At 6.6 GB on disk, it fits entirely in VRAM with room to spare for a long context window. You'll get 80-120 tokens per second on recent GPU generations like Blackwell, which makes conversations feel instant.

💡 Key insight: The qwen3.5 tag without a size suffix defaults to the 9B model. That's the one Alibaba recommends as the general-purpose choice. Running ollama run qwen3.5 on a 16 GB card gives you the best experience.

The 27B model at 17 GB won't fit in 16 GB VRAM without layer offloading. Community benchmarks on 16 GB GPUs show it drops to roughly 7 tokens per second when transformer layers spill to CPU, a 10x slowdown from the 9B.[3] This is a property of how MoE models interact with VRAM limits: Ollama offloads at the layer level, not the expert level, so the GPU sits idle waiting on CPU memory transfers for every token. The 9B model runs at 80+ tok/s on the same card. Bigger isn't always better when the bottleneck is memory bandwidth.

โš ๏ธ Common mistake: Trying to run qwen3.5:27b on a 16 GB card and expecting it to work well. When VRAM overflows, Ollama offloads entire transformer layers to CPU, creating a massive GPU-CPU data transfer bottleneck on every token. The result: 5-10x slower generation than a smaller model that fits completely in VRAM.

Why Ollama?

The local inference space has several good tools: llama.cpp (the underlying engine), LM Studio (a GUI), vLLM (production-grade serving), and LocalAI (API-compatible server). Ollama wraps llama.cpp with a clean CLI, a REST API, and automatic model management.[4]

Here's why it's the right starting point:

  • One-command model management. ollama run qwen3.5 downloads the model, sets up the quantization, starts the server, and drops you into a chat shell.
  • OpenAI-compatible API. Any tool that works with the OpenAI API works with Ollama by pointing it at http://localhost:11434/v1. This includes Claude Code, OpenCode, Cursor, and hundreds of libraries.
  • Cross-platform. Works on Linux, macOS (including Apple Silicon), and Windows via WSL2.
  • Supports Blackwell GPUs. Starting with Ollama v0.17.x, CUDA 12.8 is the runtime, which enables native sm_120 support for RTX 5000-series cards.
[Diagram]

The diagram above shows the data path: your prompt goes through the Ollama CLI into the llama.cpp inference engine, which loads the model weights into GPU VRAM and streams tokens back to you. No cloud endpoint, no API key, no data leaving your machine.

Setting Up Ollama

Install Ollama

The installation is a single command on Linux and macOS.

Linux:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

This script detects your GPU (NVIDIA, AMD, or Apple Silicon) and installs the appropriate CUDA or ROCm runtime. On NVIDIA systems, it will use CUDA 12.8 if available, which is required for RTX 5000-series (Blackwell) cards.

macOS:

Download the app from ollama.com and drag it into your Applications folder. Ollama runs as a menu bar app and automatically uses Metal for Apple Silicon acceleration.

Windows:

Download the installer from ollama.com. It runs through WSL2, so you'll need WSL installed first (wsl --install in an admin PowerShell).

Verify the installation:

```bash
ollama --version
# Should print: ollama version X.Y.Z
```

RTX 5070 Ti / Blackwell Users: Check Your CUDA Version

The RTX 5070 Ti uses NVIDIA's Blackwell architecture with compute capability 12.0 (sm_120). Older release binaries of Ollama bundled CUDA 12.4, which doesn't include sm_120 support. If you installed an older version, the model will run on CPU rather than GPU.

To check which runtime you have:

```bash
ollama ps     # shows running models and hardware
nvidia-smi    # shows GPU usage during inference
```

If Ollama shows your model on CPU, upgrade to the latest version:

```bash
curl -fsSL https://ollama.com/install.sh | sh
# Or on Linux: sudo apt upgrade ollama (if installed via apt)
```

The latest releases include CUDA 12.8 with full Blackwell support.

Running Your First Qwen3.5 Chat

With Ollama installed, one command pulls the model and starts an interactive session:

```bash
ollama run qwen3.5
```

The first run downloads the 9B model (about 6.6 GB), which takes a few minutes on a typical internet connection. Once downloaded, the model loads from disk into VRAM and you'll see an interactive prompt:

>>> Send a message (/? for help)

Try asking it something technical:

>>> Explain the KV cache in three sentences.

You should see tokens streaming in at 80+ tokens per second on a 16 GB GPU, which feels like real-time. If you see fewer than 10 tok/s, the model may be running on CPU; check the CUDA version note above.

To exit the session, type /bye or press Ctrl+D. The model stays loaded in VRAM for a configurable keep-alive period (default: 5 minutes), so your next request starts instantly.

The REST API

The real power of Ollama isn't the CLI but the REST API, which lets you integrate local AI into any application. The server starts automatically on port 11434 when you run any model.

Here's how to send a chat completion request from the command line. This sends a JSON payload with the model name and a message, and streams the response back:

```bash
curl http://localhost:11434/api/chat \
  -d '{
    "model": "qwen3.5",
    "messages": [{"role": "user", "content": "What is a transformer?"}]
  }'
```

Or using the OpenAI-compatible endpoint (works with most existing tooling):

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5",
    "messages": [{"role": "user", "content": "What is a transformer?"}]
  }'
```

In Python, you can use either the official ollama package or the standard openai library:

```python
# Using the ollama package
from ollama import chat

response = chat(
    model='qwen3.5',
    messages=[{'role': 'user', 'content': 'Explain RLHF in simple terms.'}],
)
print(response.message.content)
```

The ollama Python package (pip install ollama) is the easiest option for new projects. For projects already using the openai package, swap the base URL:

```python
# Using the openai library (base_url overrides the endpoint)
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # can be anything, Ollama ignores it
)

response = client.chat.completions.create(
    model='qwen3.5',
    messages=[{'role': 'user', 'content': 'Explain RLHF in simple terms.'}],
)
print(response.choices[0].message.content)
```
Ollama inference workflow: prompt flows from terminal or application through the Ollama server, into llama.cpp, then loads model weights from GPU VRAM to generate tokens.

Understanding Quantization

The 9B model you downloaded with ollama run qwen3.5 is already a quantized version, stored in GGUF format. Here's what that means.

A neural network stores its knowledge as billions of floating-point numbers called weights. At full precision (float32), a 9B model would need about 36 GB of VRAM, far beyond consumer card capacity. Quantization reduces that by representing each weight with fewer bits, trading a tiny amount of quality for a massive reduction in memory.

The default Ollama models use Q4_K_M quantization: each weight is stored in approximately 4 bits (rather than 32), with a "K" grouping strategy and medium-precision scale factors. The math works out to roughly 4.5 bits per weight on average for the quantized tensors, with embeddings and a few other tensors kept at higher precision, which is why a 9B model takes 6.6 GB rather than 36 GB.
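You can sanity-check these numbers with quick arithmetic. The bit widths below are approximate per-weight averages for each format; real GGUF files run somewhat larger than the pure-weight figure because some tensors stay at higher precision.

```python
# Back-of-envelope weight memory for a 9-billion-parameter model.

PARAMS = 9e9

def weight_gb(bits_per_weight):
    """Parameters x bits, converted to gigabytes."""
    return PARAMS * bits_per_weight / 8 / 1e9

for label, bits in [("float32", 32), ("float16", 16),
                    ("Q8_0", 8.5), ("Q4_K_M", 4.5)]:
    print(f"{label:>8}: ~{weight_gb(bits):.1f} GB")
```

float32 lands at the ~36 GB mentioned above, and Q4_K_M at roughly 5 GB of pure weights, consistent with a 6.6 GB file once the higher-precision tensors and metadata are included.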

The GGUF file format is the container that holds these quantized weights. When you run ollama pull qwen3.5, you're downloading a single .gguf file: no separate config files, no Python environment to set up. Ollama handles everything.

Quality impact: for most conversational and coding tasks, Q4_K_M is nearly indistinguishable from the full-precision model. Community benchmarks on Qwen3.5 show perplexity degradation of less than 1% compared to full-precision weights.[3]

💡 Key insight: Quantization is not a hack; it's how local LLMs work in practice. The quality tradeoff at Q4 is small enough that you won't notice it in normal use, but you get a 4-6x reduction in VRAM requirements. See our Model Quantization deep dive for the full technical story on GPTQ, AWQ, and GGUF.

If you want a slightly higher-quality quantization that still fits in 16 GB, look for Q5_K_M or Q6_K variants in the Ollama library. They use more bits per weight at the cost of a larger model file. For a 9B model, these still fit comfortably in 16 GB VRAM.

Checking What's Running

A few commands you'll use regularly:

```bash
# List all downloaded models
ollama list

# Show which models are currently loaded in VRAM
ollama ps

# Pull a specific size variant
ollama pull qwen3.5:4b

# Remove a model to free disk space
ollama rm qwen3.5:0.8b
```

To see GPU utilization while a model is running:

```bash
# In a separate terminal during inference
nvidia-smi -l 1   # refresh every second
```

You should see GPU Memory Used jump to ~8 GB when qwen3.5:9b loads, and GPU Utilization spike to 90-100% while tokens are generating.

Stopping the Server

By default, Ollama keeps the loaded model in VRAM after your session ends ("keep-alive" period). If you want to free the VRAM immediately:

```bash
# Stop a specific model (free its VRAM)
ollama stop qwen3.5

# Or stop the Ollama service entirely
# On Linux (systemd):
sudo systemctl stop ollama

# On macOS: quit the menu bar app, or:
pkill ollama

# On Windows: right-click the system tray icon and quit
```

For scripting, you can also set the keep-alive to zero when starting a session:

```bash
OLLAMA_KEEP_ALIVE=0 ollama run qwen3.5
# Model unloads from VRAM as soon as the session ends
```

Or via the API:

```bash
# Unload a model immediately via API
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen3.5", "keep_alive": 0}'
```

Integrating with Coding Tools

One practical use case for local LLMs is AI-assisted coding that doesn't send your proprietary code to an external API. Several tools make this easy with Ollama.

Claude Code and OpenCode can both be pointed at a local Ollama endpoint. The configuration is usually an environment variable or a settings file where you set the base URL to http://localhost:11434/v1.

For Claude Code specifically, Ollama ships a launcher shortcut:

```bash
ollama launch claude --model qwen3.5
```

This starts a Claude Code session configured to use your local Qwen3.5 model.

For Open WebUI, a browser-based chat interface that connects to your local Ollama server, install it separately:

```bash
# Run Open WebUI via Docker
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```

Then visit http://localhost:3000 in your browser and connect it to http://localhost:11434.

Performance Numbers: What to Expect

On an RTX 5070 Ti (16 GB, Blackwell) running qwen3.5:9b Q4_K_M (validated live on this hardware: Ollama 0.17.5, CUDA 12.8, driver 581):

| Metric | Value |
|---|---|
| VRAM usage (model loaded) | ~8 GB |
| VRAM at long context (32K) | ~10-11 GB |
| Prompt processing speed (prefill) | ~500-600 tokens/sec |
| Generation speed (decode) | ~90 tokens/sec |
| Time to first token (cold start) | 60-90 s (model load from disk) |
| Time to first token (warm) | <0.3 seconds |
| Context window | 256K tokens (default 4K) |

Prompt processing speed is how fast the GPU reads your input. Generation speed is how fast it produces each new output token, the number you feel most in a live conversation. Our Inference Mechanics article explains the full breakdown of time-to-first-token, tokens-per-second, and how the KV cache works under the hood.

For context on what 90 tok/s feels like: a typical paragraph is 100-150 tokens. You're seeing the AI write a full paragraph in about a second. Fast enough that it never feels like you're waiting.
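To get a feel for end-to-end latency, combine the two rates. A rough model, using the mid-range prefill and decode numbers from the table above; the token counts in the example are illustrative.

```python
# Rough end-to-end latency: prompt processing time plus generation time.
# Rates are the ~550 tok/s prefill and ~90 tok/s decode figures above.

def response_time_s(prompt_tokens, output_tokens,
                    prefill_tps=550, decode_tps=90):
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# A 2,000-token prompt with a 300-token answer:
total = response_time_s(2000, 300)
print(f"~{total:.1f}s total (decode dominates for long answers)")
```

Even a fairly long prompt with a multi-paragraph answer comes back in a handful of seconds, which is why local inference at these rates feels interactive rather than batch-like.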

On Apple Silicon (M4 Pro, 48 GB unified memory), expect 30-50 tok/s for the 9B model, with the option to run the 27B comfortably since Apple Silicon uses unified memory shared between CPU and GPU.

On an AMD Radeon RX 7900 XTX (24 GB), you can run the 9B model at similar speeds to NVIDIA via Ollama's ROCm backend.

🎯 Production tip: If you're running Ollama as a background service for multiple applications, set OLLAMA_NUM_PARALLEL=4 to allow parallel request handling. The default is 1 concurrent request; increasing it lets multiple applications query the model simultaneously, though each will be slightly slower.

Platform Notes

Mac Mini (Apple Silicon)

The Mac Mini M4 with 32 GB unified memory is a particularly capable local inference machine. Since Apple's architecture doesn't separate CPU and GPU memory, the full 32 GB is available to the model. You can comfortably run qwen3.5:27b (14-17 GB quantized) at reasonable speed.

Install Ollama via the macOS app at ollama.com. It auto-detects Apple Silicon and uses Metal for GPU acceleration. No CUDA or ROCm setup needed.

```bash
# Recommended for M4 Pro/Max (24+ GB):
ollama run qwen3.5:27b

# For M4 base (16 GB):
ollama run qwen3.5   # defaults to 9b
```

Linux (NVIDIA)

Linux gives you the most control and typically the best performance. Ollama installs as a systemd service that starts on boot.

```bash
# Check service status
sudo systemctl status ollama

# View logs
sudo journalctl -u ollama -f

# Set environment variables for the service
sudo systemctl edit ollama
# Add:
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=1"
```

Windows

Windows support works via WSL2. Performance is generally comparable to Linux, though memory bandwidth can be slightly lower depending on WSL configuration. For best results, ensure your WSL2 installation has CUDA passthrough enabled and install the NVIDIA CUDA drivers for WSL2 from NVIDIA's website.

A Quick Look at Model Capabilities

The 9B model is a generalist: it handles reasoning, coding, math, and multimodal input (images) in a single package. Here are some things to try:

Code generation:

>>> Write a Python function that implements binary search and explain each step.

Image understanding (drag-and-drop an image into the terminal, or use the API with base64):

>>> Describe what's in this image and identify any technical diagrams.

Multilingual:

>>> Explain gradient descent in Japanese.

Long-context analysis:

```bash
# Pipe a long file into Ollama via stdin
cat long_document.txt | ollama run qwen3.5 "Summarize the key points."
```

🔬 Research insight: Qwen3.5 was trained with reinforcement learning across what Alibaba calls "million-agent environments." This means the model was explicitly trained to be robust at tool use, multi-step planning, and agentic tasks, not just single-turn question answering. It shows in practice: the model is notably good at staying on task, using structured output formats, and self-correcting when given feedback.[1]

Validating Your Setup

This script uses the Ollama API directly (not the CLI) to get accurate token counts from Ollama's own eval_count metric:

```python
import json, urllib.request

def chat_api(messages):
    body = json.dumps({
        "model": "qwen3.5",
        "messages": messages,
        "stream": False,
        "options": {"temperature": 0.6, "num_predict": 300}
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=180) as r:
        return json.loads(r.read())

# Warm-up (loads model into VRAM on first call)
print("Loading model into VRAM...")
chat_api([{"role": "user", "content": "Say: ready"}])

# Benchmark
print("Benchmarking generation speed...")
res = chat_api([{
    "role": "user",
    "content": "List the numbers 1 through 50, one per line. Numbers only."
}])

eval_count = res.get("eval_count", 0)
eval_duration_s = res.get("eval_duration", 1e9) / 1e9  # nanoseconds to seconds
tok_s = eval_count / eval_duration_s

print(f"Generated {eval_count} tokens in {eval_duration_s:.1f}s")
print(f"Rate: {tok_s:.0f} tok/s")

if tok_s < 20:
    print("WARNING: Speed looks low. Check GPU is in use: nvidia-smi")
elif tok_s > 50:
    print("GPU acceleration confirmed.")
```

Run it:

```bash
python3 validate_ollama.py
```

Expected output on RTX 5070 Ti (validated):

Loading model into VRAM...
Benchmarking generation speed...
Generated 200 tokens in 2.2s
Rate: 90 tok/s
GPU acceleration confirmed.

💡 Key insight: Qwen3.5 has a built-in chain-of-thought "thinking mode" that activates automatically for complex reasoning tasks. When it does, you'll see <think>...</think> blocks before the answer; these consume tokens and reduce apparent speed. For simple tasks like the benchmark above, thinking doesn't activate. For complex math, code, or multi-step problems, the extra tokens are doing real work. This is the same test-time compute scaling technique used in models like DeepSeek-R1.
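If you're consuming responses programmatically, it's easy to separate the reasoning from the final answer. A minimal sketch, assuming well-formed, non-nested <think> tags:

```python
import re

def split_thinking(text):
    """Return (reasoning blocks, final answer) from a raw model response."""
    thoughts = re.findall(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return thoughts, answer

raw = "<think>2+2: add the numbers.</think>The answer is 4."
thoughts, answer = split_thinking(raw)
print(answer)  # The answer is 4.
```

This lets an application log or display the reasoning separately (or discard it) while showing users only the answer text.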

Troubleshooting

Here are the most common issues you'll hit, with the exact error message and the fix.

"pull model manifest: 412" โ€” Ollama version too old

Error: pull model manifest: 412:
The model you are attempting to pull requires a newer version of Ollama.

This error means the Ollama binary on your system is too old to know about Qwen3.5 (released February 2026). Qwen3.5 requires Ollama v0.17 or later.

Fix: Reinstall Ollama using the official script, which always fetches the latest version:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

On Linux, after the install completes, restart the service to pick up the new binary:

```bash
sudo systemctl restart ollama
ollama --version   # should show 0.17.x or later
```

On macOS, download the latest .dmg from ollama.com and reinstall the app.

Model running on CPU instead of GPU

Symptom: Generation speed is 5-10 tok/s instead of 80+. ollama ps shows 100% CPU or nvidia-smi shows 0% GPU utilization during inference.

Cause A: Ollama version doesn't support your GPU architecture. RTX 5000-series (Blackwell, sm_120) requires Ollama v0.17+, which ships with CUDA 12.8. Earlier builds bundled CUDA 12.4, which has no sm_120 kernels. The model silently falls back to CPU.

To confirm: check the Ollama service logs right after startup:

```bash
sudo journalctl -u ollama -n 50 | grep "inference compute"
```

If GPU acceleration is working you'll see:

inference compute ... library=CUDA compute=12.0 description="NVIDIA GeForce RTX 5070 Ti"

If it shows library=cpu or the GPU line is absent, upgrade Ollama (see above).

Cause B: NVIDIA drivers not installed or outdated. The driver version must support CUDA 12.8 or later. Check:

```bash
nvidia-smi
# Look for: Driver Version: 570.xx or later
# (570.xx is the first driver branch with CUDA 12.8 support,
#  and is required for RTX 5000-series Blackwell cards)
```

If drivers are old, update them via your distro's package manager or from developer.nvidia.com.

Cause C: CUDA libraries not in the library path (Linux). Verify:

```bash
ldconfig -p | grep libcuda
# Should show: libcuda.so.1 -> /usr/lib/x86_64-linux-gnu/libcuda.so.1
```

If nothing appears, install nvidia-cuda-toolkit or add the CUDA lib path to /etc/ld.so.conf.d/.

Very slow context window: speed drops after a few exchanges

Symptom: The first response is fast (80+ tok/s), but by the 5th or 6th message in a conversation your speed drops to 20-30 tok/s.

Cause: Context length. As your conversation grows, the KV cache grows with it. The KV cache stores the computed attention states for every previous token so the model doesn't have to reprocess them, but it costs VRAM proportional to how many tokens are in context. The default context size in Ollama is 4096 tokens. Once the conversation nears that limit, Ollama applies rolling eviction (dropping oldest context), which can cause the model to "forget" earlier messages. Qwen3.5 supports up to 256K context, so if you want long sessions, set it explicitly. See Long Context Window Management for a deeper guide.

```bash
# Set 32K context for long coding or document sessions
OLLAMA_CONTEXT_LENGTH=32768 ollama run qwen3.5
```

Or via a Modelfile:

FROM qwen3.5
PARAMETER num_ctx 32768

```bash
ollama create qwen3.5-32k -f Modelfile
ollama run qwen3.5-32k
```

Note: larger context takes more VRAM. At 32K context the 9B model uses about 10-11 GB on a 16 GB card, still comfortably within budget.
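The KV-cache cost scales linearly with context length, so you can budget it ahead of time. The formula below is the standard one for an fp16 cache with grouped-query attention; the layer and head counts are hypothetical stand-ins (this post doesn't publish Qwen3.5's exact dimensions), chosen so the result roughly matches the ~2-3 GB growth at 32K noted above.

```python
# KV cache size: 2 (K and V) x layers x kv_heads x head_dim x
# context_tokens x bytes per element. Dimensions below are assumptions.

def kv_cache_gb(ctx_tokens, layers=36, kv_heads=4, head_dim=128,
                bytes_per_elem=2):  # 2 bytes = fp16 cache
    return 2 * layers * kv_heads * head_dim * ctx_tokens * bytes_per_elem / 1e9

for ctx in (4096, 32768, 131072):
    print(f"{ctx:>6} tokens: ~{kv_cache_gb(ctx):.1f} GB of KV cache")
```

The linear scaling is the thing to internalize: quadrupling the context quadruples the cache, which is why jumping straight to 256K on a 16 GB card is not practical even when the weights themselves fit easily.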

"could not connect to a running Ollama instance"

Warning: could not connect to a running Ollama instance
Error: ollama server not responding

The Ollama API server isn't running. On Linux after a fresh install, start it:

```bash
sudo systemctl start ollama
sudo systemctl enable ollama   # auto-start on boot
```

On macOS, launch the Ollama app from Applications. On Windows, start Ollama from the Start menu.

You can verify the server is alive:

```bash
curl http://localhost:11434/api/version
# Returns: {"version":"0.17.5"}
```

Port 11434 already in use

If you previously ran ollama serve manually as a background process and then installed it as a systemd service, you may end up with two instances fighting over port 11434.

Fix: Kill the manual process and let the service take over:

```bash
pkill -f "ollama serve"        # kill the manual instance
sudo systemctl restart ollama  # start the managed service
```

Model output quality seems low (rambling or off-topic)

Unlike older Qwen models, Qwen3.5 uses test-time reasoning by default โ€” it thinks through its answer before responding. This is great for hard problems, but for simple questions it can produce overly verbose replies with visible <think>...</think> blocks.

To get more direct answers, set a concise system prompt:

```bash
ollama run qwen3.5
>>> /set system "You are a concise assistant. Answer directly without extended reasoning."
```

Or tune the temperature: 0.6 works well for factual and coding tasks, 0.9 for creative ones. The default is 0.8. Temperature controls how "random" the model's word choices are: lower means more predictable and focused, higher means more varied and creative.
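Temperature can also be set per request through the API rather than the CLI. A sketch of the payload, following the same "options" field used by the validation script earlier in this post; the top_p value here is an illustrative addition:

```python
import json

# Build an /api/chat request body with per-request sampling options.

def build_payload(prompt, temperature=0.6, top_p=0.9):
    return {
        "model": "qwen3.5",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"temperature": temperature, "top_p": top_p},
    }

payload = build_payload("Summarize quicksort in two sentences.")
print(json.dumps(payload, indent=2))
# POST this to http://localhost:11434/api/chat as shown earlier.
```

Setting options per request, rather than baking them into a Modelfile, lets one loaded model serve both low-temperature coding queries and high-temperature creative ones without a reload.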

โš ๏ธ Common mistake: Assuming slow or poor output means you need a bigger model. In most cases, the bottleneck is context length or temperature, not model capability. Try /clear to reset the conversation context and start fresh before switching to a larger variant.

Key Takeaways

  • Qwen3.5 is one of the strongest open-weight model families available as of early 2026, with competitive performance on math, coding, and multimodal tasks.
  • Ollama is the easiest way to run it locally: one command installs the runtime, another downloads the model and starts a chat session.
  • For 16 GB VRAM GPUs (RTX 5070 Ti, RTX 5080), qwen3.5:9b is the right choice. It fits entirely in VRAM, generates at 80-120 tok/s, and handles a 256K context window.
  • Trying to run the 27B model on 16 GB causes CPU offloading, which drops speed by 10x. Size alone doesn't win when VRAM is the constraint.
  • RTX 5000-series (Blackwell) users need Ollama with CUDA 12.8 (v0.17+) for native GPU support.
  • The OpenAI-compatible API at localhost:11434/v1 lets any existing AI tool switch to a local model with one configuration change.
  • Stop the model explicitly after use (ollama stop qwen3.5) to free VRAM for other applications.
References

  • Qwen3.5: A New Generation of Open-Source Multimodal Models. Alibaba Qwen Team · 2026
  • Ollama: Get up and running with large language models. Ollama Contributors · 2026
  • llama.cpp: Inference of LLaMA model in pure C/C++. Gerganov, G. · 2023
  • Ollama VRAM Requirements: Complete 2026 Guide to GPU Memory for Local LLMs. LocalLLM.in · 2026
  • Follow-up: Qwen3.5-35B-A3B — 7 community-requested experiments on RTX 5080 16GB. gaztrab (Reddit r/LocalLLaMA) · 2026
  • Alibaba releases multimodal Qwen3.5 mixture of experts model. SiliconANGLE · 2026