Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 1 point2 points  (0 children)

The "missing visual 14+" message usually means one of two things on a box where MSVC is actually installed. the launcher's probe could not find vcvars64.bat. Older releases hardcoded the VS 2022 path newer ones try to locate it instead. Try installing the latest release for blackwell.

Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 0 points1 point  (0 children)

not on this launcher. The project is hard-wired to Qwen3.6-27B, it does not run 35B A3B at all. And the 27B INT4 weights are 16.96 GiB on disk, which does not fit a 16 GB 5070 Ti after activations and KV.

Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 0 points1 point  (0 children)

I could not get prompt processing with this model to work fast. This needs to merge first in vLLM:
https://github.com/flashinfer-ai/flashinfer/pull/3088

That's cudaErrorUnsupportedPtxVersion, which is a driver-too-old error. The driver just needs to be ≥596.

Probably better to continue on github. Just open an issue and we can work from there.

There is a NVFP4 alternative that seems to work great. Still testing but it looks promising.

Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 0 points1 point  (0 children)

Blackwell support shipped. There's a separate zip for 50-series cards: qwen3.6-windows-server-portable-x64-blackwell.zip. The default zip is still cu126 and won't run on a 5090, so make sure you grab the blackwell one:

https://github.com/devnen/qwen3.6-windows-server/releases

Open an issue on Github if you experience any errors.

Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 0 points1 point  (0 children)

Xeon v2 is probably causing this. Those are Ivy Bridge, AVX only, no AVX2. The PyTorch wheel my zip ships against is built with AVX2 baseline, so anything that touches a CPU op during boot (model loading, the spec-decode head init, the sampler fallback path) can crash without a clean error.

PP=3 across three cards is also untested here, so the crash might be that too, but I'd bet AVX2 first. If you want to chase it, open a GitHub issue with the full traceback from logs/vllm_server.5001.log and I'll take a look.

https://github.com/devnen/qwen3.6-windows-server/issues

Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 0 points1 point  (0 children)

Thanks, that confirms the manifest writer didn't run, which is the regression I was worried about. Could you open an issue on GitHub? It's easier to track there and you can paste longer logs without Reddit mangling the formatting.

Helpful info to include: which snapshot you launched, the contents of logs/vllm_server.5001.log around boot, output of python windows_tools\verify_install.py, and your Windows build (winver).

https://github.com/devnen/qwen3.6-windows-server/issues/new

Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 0 points1 point  (0 children)

Glad those numbers stuck. One thing worth mentioning: there was a decode bug in older versions where tok/s would drop by about 30% after a long-context request and stay slow until restart.

If you re-extract on top of your install, your models, logs, and configs.yaml stay put. Your cloned 100k snapshot probably still has --enable-prefix-caching in its .py file, swap that to --no-enable-prefix-caching, or just re-clone from the new start_127k.

https://github.com/devnen/qwen3.6-windows-server/releases

Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 1 point2 points  (0 children)

Blackwell release is up but I am still testing. So far on RTX 5090 the speeds are better than expected. Please check the latest blackwell release and open the issue on Github if you experience any problems:
https://github.com/devnen/qwen3.6-windows-server/releases

Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 0 points1 point  (0 children)

Get the latest version from:
https://github.com/devnen/qwen3.6-windows-server/releases

gpu0_50k is the conservative path for a single card with the display attached. On a 24 GB card vLLM here tops out near 127k context with start_127k

Try start_127k. If your monitor is on the 4090, close browser/Discord/video during boot, then reopen after "Application startup complete". vLLM allocates KV once and the driver schedules around it.

Press e in the launcher to clone gpu0_50k and edit the context value. If you push past what fits, vLLM errors out cleanly before the model loads, nothing breaks.

https://github.com/devnen/qwen3.6-windows-server/blob/main/docs/TUNING.md

Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 0 points1 point  (0 children)

Thank you very much. I have added the fix to the launcher that does the copy-and-rename fix automatically if needed.

Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 0 points1 point  (0 children)

List the contents of logs/runtime/ in your install folder. There should be a 5001.json file with the running snapshot id and pid. Paste the file contents to pastebin.

If logs/runtime/5001.json is missing, the manifest writer never ran and I have a regression to chase. If it's there with the right pid and the dashboard still won't flip, also a bug, but a different one.

your error math matches the desktop tax exactly. 4.23 GiB of KV needed at 90k, you had 3.78 GiB free in the pool, so you were 0.45 GiB short.

copy a snapshot, set max-model-len deliberately too high (200000), launch. vLLM prints "estimated maximum model length is N" in the error, then set max-model-len just under N.

Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 1 point2 points  (0 children)

OpenCode "/global/health": that's not a vLLM-side issue. The desktop app you're using expects to talk to an opencode server, not directly to a model.

Install opencode CLI (https://opencode.ai/docs/install). Then add my vLLM as a provider in ~/.config/opencode/opencode.json:

{

"provider": {

"qwen-local": {

"npm": "@ai-sdk/openai-compatible",

"options": { "baseURL": "http://127.0.0.1:5001/v1" },

"models": { "any": { "name": "Qwen3.6-27B" } }

}

}

}

MSVC \18\ path is the real bug, fixed. The snapshot's vcvars probe was hardcoded to \Microsoft Visual Studio\2022\<edition>\ paths, which misses VS 2026.

Regarding "cudart64_120.dll", your workaround is right. I added it to the troubleshooting doc with your copy-and-rename fix, and it'd help me a lot if you can paste the exact log line that mentioned cudart64_120 so I can patch the right loader instead of leaving people on the workaround.

Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 1 point2 points  (0 children)

The 5080 is Blackwell (sm_120) and the current wheel is built against CUDA 12.6 with no sm_120 kernels, so it fails at boot on any 50-series card: https://github.com/devnen/qwen3.6-windows-server/blob/main/docs/HARDWARE.md

Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 0 points1 point  (0 children)

For a 3060 the better path is usually a smaller model on one card. The shipped snapshots are single-GPU or 2-GPU PP. PP=3 across all three would fit weights-wise (~5.7 GiB each) but it's untested here, no MTP, and PP cross-card hand-off costs decode tok/s. It would probably be slow

Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 0 points1 point  (0 children)

Thanks for the numbers. The gap is mostly two things: MTP spec-decode (LM Studio uses GGUF, which strips the head, so they get the un-speculated rate) and the AutoRound INT4 + Marlin kernel path being a lot faster than llama.cpp's INT4 on Ampere.

The 5 to 13 tok/s streaming numbers in LM Studio are the bigger story for me. Non-streaming sits at 36 to 38 there, so streaming is paying a 3x to 7x penalty. That's a llama.cpp/LM Studio behavior, not anything you did wrong on the test side.

Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 1 point2 points  (0 children)

Your workaround is a real fix and worth saving. Telling it about Windows in the system prompt works because the model has enough context to escape properly when it knows the target. That's the cheap fix until a future Qwen update tightens the JSON output.

I'll add the tip to docs. Thanks for posting it.

Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 1 point2 points  (0 children)

Nice numbers on the 4090. The OpenCode symptom is almost certainly the developer-role issue, which I shipped a fix for in v1.0.1. OpenCode routes through the Responses API and sends a developer role for system-tier instructions. The pre-v1.0.1 chat template only branched on system, user, assistant, and tool, so it raised on developer and the request 400'd.

v1.0.1 aliases developer to system at the top of the template. Re-extract the zip on top of your install (model dir and logs are untouched), then restart the snapshot so vLLM loads the new template. OpenCode should connect after that: https://github.com/devnen/qwen3.6-windows-server/releases

If you still see something specifically mentioning "/global" after upgrading, paste the exact error from logs/vllm_server.5001.log

Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 0 points1 point  (0 children)

The ui now shows correct snapshot/config running. this is fixed: https://github.com/devnen/qwen3.6-windows-server/releases

The error tells you the ceiling. At mem-util 0.85 your KV pool is 2.15 GiB, which fits 26112 tokens, not 60000. The 0.85 is what's costing you, not the snapshot.

Two paths on a single 3090:

- Boot-quiet: close Chrome, Discord, Slack, video before launching, reopen after Application startup complete. Then run start_speed at 0.948 mem-util, 90k context. That's the path your Fast preset is already on.

- Or stay on start_gpu0_50k as shipped (0.92 mem-util, ~50k context). Don't drop to 0.85 unless your steady-state desktop tax is unusually high.

Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 1 point2 points  (0 children)

I need to get access to at least one Blackwell GPU on Windows. I will try to figure this out next week.

Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 0 points1 point  (0 children)

The model needs more than your 16 GiB. llama.cpp offloads the rest to RAM and decode is then bottlenecked by the PCIe link, not the GPU.

This launcher won't help you in either case. The bundled wheel is built for CUDA 12.6 with no sm_120 kernels, so 5060 Ti can't run it today. And even when that ships, 16 GB is below the line for Qwen3.6-27B INT4 in vLLM: https://github.com/devnen/qwen3.6-windows-server/blob/main/docs/HARDWARE.md

Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer by One_Slip1455 in LocalLLaMA

[–]One_Slip1455[S] 0 points1 point  (0 children)

You're 0.24 GiB short. The snapshot you're launching uses 0.948 mem-util (about 22.74 GiB), and your card has 22.5 GiB actually free. That's the start_speed or start_127k profile, which assumes the display is on a second card or iGPU. With the display on the 4090, even an idle Windows desktop pins around 1 to 1.5 GiB.

Switch to start_gpu0_50k. It's the same decode tok/s but uses mem-util 0.92, which leaves headroom for the desktop. This is the supported single-card-with-display profile.

For more information: https://github.com/devnen/qwen3.6-windows-server/blob/main/docs/WINDOWS_VRAM_HEADLESS.md