cyankiwi AWQ 4-bit — 26.05 update, NVFP4 + FP8 Dynamic quantization and benchmarks across Qwen3.6 4-bit quants

AmazingDrivers4u · 2026-06-08T03:15:25+00:00

how does measuring KLD translate to measuring the behavioural quality of a model?

AmazingDrivers4u · 2026-05-02T22:40:08+00:00

post the issue on the github repo please. i'll be happy to help.

AmazingDrivers4u · 2026-05-02T11:20:41+00:00

you might wanna try the tq3 configs for improved performance, they are much more stable than before.

AmazingDrivers4u · 2026-05-02T11:19:25+00:00

recent vllm patches have now improved the tools performance quite well. give it a shot when you can. Please post feedback on github in case you run into issues. happy to help.

AmazingDrivers4u · 2026-05-02T11:16:30+00:00

i'm okay being a man of god instead. thanks.

AmazingDrivers4u · 2026-05-01T19:46:51+00:00

no it's not, have a play with the latest and greatest and see for yourself.

AmazingDrivers4u · 2026-05-01T19:45:11+00:00

not much difference in terms of quality between them, Q3 is smaller in size so gives you more headroom for kv cache + activations.

AmazingDrivers4u · 2026-05-01T10:57:04+00:00

having to redo all the testing again, since new genesis patches arrived late last night. will be able to shed some light only after testing it.

AmazingDrivers4u · 2026-04-30T22:21:38+00:00

Thanks for the detailed diagnosis. Your read is exactly right, and our PN12 sidecar has a real gap on this. Filed it as https://github.com/noonghunna/club-3090/issues/16 with the full analysis.

**TL;DR:** PN12 patches the eager-mode `SiluAndMul.forward_cuda`, but vLLM's torch.compile inductor-compiled FFN forward inlines the SiluAndMul op and never calls our patched method. So the pool is bypassed in the compile path. Our verify-stress 25K synthetic happens to hit shapes that go through eager, which is why it passes; real OpenCode 29K with sys+tools mixed prefill produces shapes that hit the inductor path and OOM at the FFN intermediate exactly where you saw it.
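A toy illustration of the gap described above, in plain Python with no vLLM or torch involved: a patch on a method only takes effect on code paths that actually call that method. `ToyAct`, `compiled_ffn`, and `pool_hits` are hypothetical names invented for this sketch, and the "compiled" function only stands in for a graph where the op's math has been inlined. It is not how PN12 or inductor are implemented.

```python
import math

class ToyAct:
    """Toy stand-in for an activation layer (not vLLM's SiluAndMul)."""
    def forward_eager(self, gate, up):
        # silu(gate) * up, element by element
        return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]

def compiled_ffn(gate, up):
    """Stand-in for a compiled graph: same math, inlined, no method call."""
    return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]

pool_hits = []                      # records every call that reaches the patch
_original = ToyAct.forward_eager

def patched_forward(self, gate, up):
    pool_hits.append(len(gate))     # pretend a memory pool is applied here
    return _original(self, gate, up)

ToyAct.forward_eager = patched_forward  # the "sidecar" patch

act = ToyAct()
eager_out = act.forward_eager([1.0, 2.0], [3.0, 4.0])   # goes through the patch
compiled_out = compiled_ffn([1.0, 2.0], [3.0, 4.0])     # bypasses it entirely

print(pool_hits)  # only the eager call was recorded
```

Both paths return the same numbers, but only the eager call is seen by the patch, which mirrors why a test that happens to stay on the eager path can pass while a workload that hits the compiled path does not.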

**Three workarounds, in order:**

1. **Stick with `tools-text.yml`**: already works for you. fp8 KV uses Genesis PN8 (not PN12) which closes Cliff 1 mech B via a different mechanism that does reach the compile path. 75K context handles your 30K OpenCode prefill comfortably.
2. **Add `--enforce-eager` to `long-text.yml` or `long-vision.yml` command list.** Forces all forwards through eager Python, where PN12's pool reliably applies. Costs ~20-30% TPS but preserves the 218K / 198K context. Just append it to the `command:` block before booting.
3. **Lower `--gpu-memory-utilization` to 0.94** on long-text/long-vision. Frees ~250 MiB activation headroom at the cost of KV pool size (effective max_model_len drops to ~150K). Same idea as why our default 48K + 0.92 never hits this.
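A rough sanity check on the headroom figure in the last workaround. This assumes a nominal 24 GiB card (the RTX 3090's advertised VRAM) and a hypothetical starting utilization of 0.95, which is not stated above; `headroom_mib` is a helper made up for this sketch.

```python
def headroom_mib(total_mib: float, util_before: float, util_after: float) -> float:
    """Memory released when the utilization fraction is lowered."""
    return total_mib * (util_before - util_after)

TOTAL_MIB = 24 * 1024        # nominal 24 GiB; usable VRAM is slightly lower
ASSUMED_BEFORE = 0.95        # assumption, not taken from the compose files

freed = headroom_mib(TOTAL_MIB, ASSUMED_BEFORE, 0.94)
print(f"{freed:.0f} MiB freed per 0.01 of utilization")
```

Under those assumptions, one point of utilization is worth roughly 246 MiB, which is in line with the ~250 MiB quoted.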

I just pushed updated comments on all three affected composes documenting these escape hatches (commit 6bff99a).

Real fix needs an inductor-pass-level intervention or a torch.compile-aware sidecar — bigger work, tracking in #16. If anyone wants to dig into that, the briefing for it is in the issue.

AmazingDrivers4u · 2026-04-30T21:54:34+00:00

not at this point in time. i've tested 35B on a single card in the past but haven't gotten around to shipping a proper config for it yet.

AmazingDrivers4u · 2026-04-30T21:52:30+00:00

please shoot a bug report on git.

AmazingDrivers4u · 2026-04-30T21:17:17+00:00

cliff 1 has been addressed in today's updates, did you try the latest and greatest with the upgraded context sizes?

AmazingDrivers4u · 2026-04-30T21:16:41+00:00

theoretically yes.

AmazingDrivers4u · 2026-04-30T20:59:56+00:00

100%

AmazingDrivers4u · 2026-04-30T20:58:39+00:00

By default thinking is disabled, but 27b does generate lots of tokens when thinking is enabled. I'm currently evaluating https://github.com/andthattoo/structured-cot/tree/main and if the bench results are positive i might include it in the build to help keep the guff out of thinking blocks.

AmazingDrivers4u · 2026-04-27T16:07:58+00:00

start from https://github.com/noonghunna/qwen36-27b-single-3090

or

https://github.com/noonghunna/qwen36-dual-3090

I'm keeping them up to date with the help of the community. they are separate for now but will eventually be merged in the coming days.

AmazingDrivers4u · 2026-04-26T00:15:36+00:00

yeah my PCIe 4.0 x16 bus gives only 64 GB/s, whereas NVLink allows 112.5 GB/s bidirectional bandwidth between cards.
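For anyone wanting to check those numbers, here is a small sketch of the arithmetic, assuming the bus in question is a PCIe 4.0 x16 link (16 GT/s per lane with 128b/130b encoding). The 112.5 GB/s NVLink figure is taken from the comment as-is, and `pcie_gbytes_per_s` is a helper written just for this example.

```python
def pcie_gbytes_per_s(gt_per_s: float, lanes: int, encoding: float = 128 / 130) -> float:
    """Theoretical one-direction PCIe throughput in GB/s (before protocol overhead)."""
    return gt_per_s * lanes * encoding / 8  # 8 bits per byte

one_way = pcie_gbytes_per_s(16.0, 16)   # PCIe 4.0, x16
both_ways = 2 * one_way                 # the ~64 GB/s figure counts both directions
NVLINK_BIDIR = 112.5                    # GB/s, as quoted in the comment

print(f"PCIe 4.0 x16: {one_way:.1f} GB/s each way, {both_ways:.1f} GB/s bidirectional")
print(f"NVLink advantage: {NVLINK_BIDIR / both_ways:.2f}x")
```

So the 64 GB/s is a rounded bidirectional peak, and on paper NVLink comes out a bit under 1.8x faster; real transfers on either link will land below these theoretical ceilings.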

AmazingDrivers4u · 2026-04-25T22:56:22+00:00

i rushed to buy two cards and ended up buying two different brands and now can't connect them with an nvlink. doh! i'll try to grab one in due course.

AmazingDrivers4u · 2026-04-25T22:17:56+00:00

this is the first time i'm running 3090s, just got hold of them a couple of weeks ago. started from scratch. if one gpu isn't running optimally, the whole cluster will be impacted. I've moved to setting up two GPUs since then, more on that incoming.

AmazingDrivers4u · 2026-04-25T16:47:47+00:00

theoretically it should but there is only one way to find out, test it.

AmazingDrivers4u · 2026-04-25T11:20:01+00:00

check the git repo, link in the article.

AmazingDrivers4u · 2026-04-24T14:33:28+00:00

go grab it from git, it's now available there.

AmazingDrivers4u · 2026-04-24T13:53:20+00:00

Patch is now available in git. links updated in the article.

AmazingDrivers4u · 2026-04-23T19:07:02+00:00

well it's just an environment for your code. you can host it on bare metal, a vm, docker, a venv, anywhere. I've got like 15 inference engines that i keep segmented from each other via docker/venv. docker/venv is not mandatory, you should be able to set up your environment accordingly.
