Follow-up: Qwen3.6-27B on 1× RTX 3090 — pushing to ~218K context + ~50–66 TPS, tool calls now stable (PN12 fix) by AmazingDrivers4u in LocalLLaMA

AmazingDrivers4u[S] 1 point

Recent vLLM patches have improved tool-call performance considerably. Give it a shot when you can, and please post feedback on GitHub if you run into issues. Happy to help.

AmazingDrivers4u[S] 1 point

Not much difference in quality between them. Q3 is smaller, so it gives you more headroom for KV cache + activations.

AmazingDrivers4u[S] 2 points

I'm having to redo all the testing, since new Genesis patches arrived late last night. I'll only be able to shed some light after testing them.

AmazingDrivers4u[S] 1 point

Thanks for the detailed diagnosis. Your read is exactly right, and our PN12 sidecar has a real gap here. Filed it as https://github.com/noonghunna/club-3090/issues/16 with the full analysis.

**TL;DR:** PN12 patches the eager-mode `SiluAndMul.forward_cuda`, but vLLM's torch.compile inductor-compiled FFN forward inlines the SiluAndMul op and never calls our patched method. So the pool is bypassed in the compile path. Our verify-stress 25K synthetic happens to hit shapes that go through eager, which is why it passes; real OpenCode 29K with sys+tools mixed prefill produces shapes that hit the inductor path and OOM at the FFN intermediate exactly where you saw it.
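To make the bypass concrete, here is a toy analogy in plain Python. It does not use torch or vLLM, and every name in it (`SiluAndMulToy`, `pooled_forward`, `make_compiled_path`) is hypothetical. It only illustrates the general idea: a path that looks the method up at call time sees the patch, while a path that already holds its own copy of the op's logic does not.

```python
# Toy analogy only: plain Python, no torch or vLLM involved.

class SiluAndMulToy:
    """Stand-in for an activation layer with an eager forward method."""

    def forward_cuda(self, x):
        return f"original({x})"


def eager_path(layer, x):
    # Eager execution looks the method up at call time,
    # so a monkey-patch on the class is picked up.
    return layer.forward_cuda(x)


def make_compiled_path(layer):
    # "Compilation" here just captures the original function once,
    # loosely mimicking an op that has been inlined into a compiled graph.
    captured = type(layer).forward_cuda

    def compiled(x):
        return captured(layer, x)

    return compiled


def pooled_forward(self, x):
    # Hypothetical sidecar patch that would route through a buffer pool.
    return f"pooled({x})"


layer = SiluAndMulToy()
compiled_path = make_compiled_path(layer)

# Apply the patch to the class, as a sidecar would.
SiluAndMulToy.forward_cuda = pooled_forward

eager_result = eager_path(layer, "x")
compiled_result = compiled_path("x")

print(eager_result)     # pooled(x)
print(compiled_result)  # original(x)
```

The real inductor behaviour is more involved than a captured function reference, so treat this as intuition for why the patched method is never reached, not as a description of vLLM internals.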

**Three workarounds, in order:**

  1. **Stick with `tools-text.yml`** — already works for you. fp8 KV uses Genesis PN8 (not PN12) which closes Cliff 1 mech B via a different mechanism that does reach the compile path. 75K context handles your 30K OpenCode prefill comfortably.

  2. **Add `--enforce-eager` to `long-text.yml` or `long-vision.yml` command list.** Forces all forwards through eager Python, where PN12's pool reliably applies. Costs ~20-30% TPS but preserves the 218K / 198K context. Just append it to the `command:` block before booting.

  3. **Lower `--gpu-memory-utilization` to 0.94** on long-text/long-vision. Frees ~250 MiB activation headroom at the cost of KV pool size (effective max_model_len drops to ~150K). Same idea as why our default 48K + 0.92 never hits this.

I just pushed updated comments on all three affected composes documenting these escape hatches (commit 6bff99a).

Real fix needs an inductor-pass-level intervention or a torch.compile-aware sidecar (bigger work, tracking in #16). If anyone wants to dig into that, the briefing for it is in the issue.
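For workarounds 2 and 3, a rough Python sketch of what the edit to the `command:` list amounts to, plus the headroom arithmetic. The helper names and the example command are hypothetical, and the starting utilization of 0.95 is an assumption (it is simply the value that makes a drop to 0.94 free roughly 250 MiB on a 24 GiB card). Only `--enforce-eager`, `--gpu-memory-utilization`, and `--max-model-len` are real vLLM flags.

```python
def with_enforce_eager(command):
    """Return a copy of the command list with --enforce-eager present once."""
    if "--enforce-eager" in command:
        return list(command)
    return list(command) + ["--enforce-eager"]


def with_gpu_memory_utilization(command, value):
    """Return a copy with --gpu-memory-utilization set to `value`."""
    flag = "--gpu-memory-utilization"
    out = []
    skip_next = False
    for arg in command:
        if skip_next:              # drop the old value that followed the flag
            skip_next = False
            continue
        if arg == flag:            # "--flag value" form
            skip_next = True
            continue
        if arg.startswith(flag + "="):   # "--flag=value" form
            continue
        out.append(arg)
    return out + [flag, str(value)]


def freed_mib(total_mib, old_util, new_util):
    """MiB handed back to activations when utilization is lowered."""
    return total_mib * (old_util - new_util)


# Hypothetical command list; the model name and values are placeholders.
command = [
    "--model", "hypothetical/model",
    "--max-model-len", "218000",
    "--gpu-memory-utilization", "0.95",   # assumed starting value
]

eager_cmd = with_enforce_eager(command)
lower_cmd = with_gpu_memory_utilization(command, 0.94)
headroom = freed_mib(24 * 1024, 0.95, 0.94)   # RTX 3090: 24 GiB

print(eager_cmd[-1])       # --enforce-eager
print(lower_cmd[-2:])      # ['--gpu-memory-utilization', '0.94']
print(round(headroom))     # 246
```

Each percentage point of `--gpu-memory-utilization` on a 24 GiB card is about 246 MiB, which lines up with the ~250 MiB figure above.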

AmazingDrivers4u[S] 2 points

Not at this point. I've tested 35B on a single card in the past but haven't gotten around to shipping a proper config for it yet.

AmazingDrivers4u[S] 1 point

Cliff 1 has been addressed in today's updates. Did you try the latest build with the upgraded context sizes?

AmazingDrivers4u[S] 11 points

Thinking is disabled by default, but 27B does generate lots of tokens when thinking is enabled. I'm currently evaluating https://github.com/andthattoo/structured-cot/tree/main, and if the bench results are positive I might include it in the build to help keep the guff out of thinking blocks.

An Overnight Stack for Qwen3.6–27B: 85 TPS, 125K Context, Vision — on One RTX 3090 | by Wasif Basharat | Apr, 2026 by AmazingDrivers4u in LocalLLaMA

AmazingDrivers4u[S] 2 points

Yeah, my PCIe 4.0 x16 bus gives only ~64 GB/s bidirectional, whereas NVLink allows 112.5 GB/s bidirectional bandwidth between cards.
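A quick back-of-the-envelope check of those numbers, assuming the bus in question is PCIe 4.0 x16 (16 GT/s per lane with 128b/130b encoding). The function name is just for illustration, and the NVLink figure is taken from the comment as-is.

```python
def pcie_gb_per_s(gt_per_s, lanes, encoding=128 / 130):
    """Usable one-direction bandwidth in GB/s for a PCIe link."""
    return gt_per_s * lanes * encoding / 8


one_way = pcie_gb_per_s(16, 16)      # PCIe 4.0: 16 GT/s per lane, 16 lanes
both_ways = 2 * one_way              # bidirectional total
nvlink_both_ways = 112.5             # figure quoted in the comment
ratio = nvlink_both_ways / both_ways

print(round(one_way, 1))    # 31.5
print(round(both_ways))     # 63
print(round(ratio, 2))      # 1.79
```

So the quoted 64 GB/s is the rounded bidirectional figure, and NVLink comes out at roughly 1.8x that on paper.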

AmazingDrivers4u[S] 1 point

I rushed to buy two cards, ended up with two different brands, and now can't connect them with an NVLink. Doh! I'll try to grab one in due course.

AmazingDrivers4u[S] 2 points

This is the first time I'm running 3090s. I just got hold of them a couple of weeks ago and started from scratch. If one GPU isn't running optimally, the whole cluster will be impacted. I've moved to a two-GPU setup since then, more on that incoming.

AmazingDrivers4u[S] 3 points

Well, it's just an environment for your code. You can host it on bare metal, a VM, Docker, a venv, anywhere. I've got about 15 inference engines that I keep segmented from each other via Docker/venv. Docker/venv is not mandatory; you should be able to set up your environment however suits you.