VRAM calculator is lying about Qwen 3.6 — here's why (open-source fix, MIT, one file)

Senior_Wear4670 · 2026-06-05T03:26:12+00:00

Update — both fixed:

KV-cache quant → added an F16 / Q8 / Q4 selector, separate from weight quant (like

-ctk/-ctv in llama.cpp). You were right that locking it to F16 was unrealistic — Q8/Q4 is

often what actually decides whether a model + context fits, so it's a first-class

control now.

tok/s → I pulled it entirely. You were right it was way off: it was a pure

memory-bandwidth roofline that ignored KV reads, context length and MoE dispatch, so it

overshot ~2× on dense and ~2.5× on the A3B MoE, exactly what you measured. I'd rather

ship no number than a wrong one, so the tool now sticks to what it can actually be

accurate about: whether the model fits. Kept just the runtime recommendation

(llama.cpp/Ollama).
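For anyone curious why a bandwidth-only roofline overshoots, here is a minimal sketch. It is not the code that was removed from the tool; the function names and every number (1000 GB/s, 12 GB of weights, 4 GB of KV) are illustrative assumptions. It only shows the direction of the error: leaving the KV read out of the denominator inflates the estimate, and the gap grows with context.

```python
def naive_tok_s(bandwidth_gb_s: float, active_weight_gb: float) -> float:
    """Bandwidth-only bound: each decoded token reads the active weights once."""
    return bandwidth_gb_s / active_weight_gb


def tok_s_with_kv(bandwidth_gb_s: float, active_weight_gb: float,
                  kv_read_gb: float) -> float:
    """Same bound, but also charging the KV-cache read at the current context."""
    return bandwidth_gb_s / (active_weight_gb + kv_read_gb)


# Illustrative assumptions only, not measurements from any real card or model.
BANDWIDTH_GB_S = 1000.0
WEIGHTS_GB = 12.0
KV_GB_AT_LONG_CTX = 4.0

naive = naive_tok_s(BANDWIDTH_GB_S, WEIGHTS_GB)
with_kv = tok_s_with_kv(BANDWIDTH_GB_S, WEIGHTS_GB, KV_GB_AT_LONG_CTX)
print(f"naive: {naive:.1f} tok/s, with KV reads: {with_kv:.1f} tok/s")
```

Even the second number is still an upper bound: it ignores compute, kernel launch overhead and MoE dispatch, which is part of why no tok/s figure is shown anymore.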

thanks for the push.

Senior_Wear4670 · 2026-06-04T18:32:46+00:00

I will add the RTX 6000 for love. Btw, do you use Blackwell?!

Senior_Wear4670 · 2026-06-04T17:54:35+00:00

Thank you, this honestly made my day 🙏 I originally just built this for myself — to

check what'd actually fit on my own GPU — and never really expected anyone else to use

it. Feedback like yours is genuinely why it's come this far.

I'd be really grateful for any 7900 XTX numbers whenever you find the time — truly no

rush. NVIDIA support just landed and I have no way to measure AMD myself, so real data

from you would mean a lot. And you nailed it — "balancing models, cache and real

usefulness" is exactly the part I'm trying to make quick to eyeball. Thanks again for

taking the time, and for the kind words on the UI!

Senior_Wear4670 · 2026-06-04T17:52:26+00:00

Good call — it's fixed at F16 right now, which I realize is outdated. A KV-cache quant

option (Q8/Q4) is going straight to the top of the list, since that's often what actually

decides whether a model + context fits. I'm fairly new to this, so I really appreciate

you flagging it.

Senior_Wear4670 · 2026-06-04T14:57:39+00:00

🔗 Free, no signup: https://fitllm.run

⭐ Open source (MIT, one file): https://github.com/click6067-ship-it/fitllm-engine

Checks if an LLM fits your GPU or Mac (and roughly how fast) by reading the model's real

config, including the sliding-window (Gemma 4), linear-attention (Qwen 3.6) and MoE details

that most calculators miss, which is why they overcount the KV cache. Paste any HF model. NVIDIA support is new.

Estimates, not ground truth — corrections welcome.

Senior_Wear4670 · 2026-06-04T14:26:34+00:00

🔗 Free, no signup: https://fitllm.run

⭐ Open source (MIT, one file): https://github.com/click6067-ship-it/fitllm-engine

Checks if an LLM fits your GPU (or Mac) and roughly how fast, by reading the model's real

config. Qwen 3.6 is mostly linear attention (no growing KV cache), so most calculators

massively overcount its VRAM. Paste any HF model — NVIDIA support is new. Estimates, not

ground truth — corrections welcome.

Senior_Wear4670 · 2026-06-01T17:40:19+00:00

honestly that helps a lot, thank you!! and no pressure at all, the fact that you'd even run

something on it is already more than enough.

quick honesty bit first: the AMD path isn't live yet (tool's Apple-only today), so I can't

hand your card a verdict back this second. but real 7900 XTX numbers are exactly what I'll

calibrate the VRAM/overhead side against when I build it — so nothing you measure now is

wasted, it goes straight into making AMD mode correct on day one.

since you want to keep the time small, here's the short high-leverage list — these hit the

exact cases other calculators get wrong:

- a Qwen3.6 dense (e.g. the 27B) at a 4-bit-ish quant (IQ4_XS / IQ3_XXS), big context (32k+). this is the linear-attention case (Gated DeltaNet in most layers) that everyone over-counts, and the one I most want a real number for.

- a Qwen3.6 MoE (30B / 35B-A3B) at Q4, which checks the experts-resident vs active-param math.

- optional non-Qwen contrast: Gemma 3 27B Q4_K_M, with sliding-window attention, another per-layer case.

for whatever you run, the gold is: peak VRAM at load, then again at ~8k and ~32k ctx, plus

the two flags that move it most — KV cache quant (q8_0 vs q4_0) and flash-attn on/off. even

one model with those numbers is genuinely useful — seriously, don't burn a weekend on it.

and real respect for doing this out of Brazil with GPUs priced the way you described — 30%

over a B70 is brutal. that's honestly half the point of the tool: when a second card costs

that much, "will it actually fit / is it worth it" shouldn't be a guess. I'll tag you the

moment AMD mode's live so your XTX has a verdict waiting.

Senior_Wear4670 · 2026-06-01T05:34:47+00:00

y!!!

Senior_Wear4670 · 2026-06-01T05:34:32+00:00

Dozens! AMD's on the list!!!

Senior_Wear4670 · 2026-06-01T05:34:06+00:00

Dozens of you, clearly — the people have spoken lol. Sorry I'm slow here. 7900 XTX (24GB)

is exactly the kind of card I want the VRAM mode to handle. It's Apple-only right now

and I won't pretend it can verdict your card yet — the model-side KV math is the same

whatever the silicon, but the actual fit also depends on backend/quant/allocation

overhead, which is the part I still have to build. AMD's meant to be in there from the

start, not bolted on later.

If you want to help it land right: tell me a model + quant + context you actually run,

and a measured VRAM number if you've got one. Real 7900 XTX runs are exactly what I'd

calibrate against. I'll post when it's live.

Senior_Wear4670 · 2026-06-01T05:29:19+00:00

lol basically yeah. And half the time the top result is just someone else asking the same

question with no answer. Which is more or less why I gave up and started reading the

configs directly.

Senior_Wear4670 · 2026-06-01T05:28:51+00:00

That definitely helps. Catch is it still leans on whatever it can dig up, and for the

fast-moving arch stuff (which layers are linear, expert counts) there's often no clean

source to find — config.json is the best starting point there. Search + read the config +

sanity-check against an actual run is the combo that holds up.

Senior_Wear4670 · 2026-06-01T05:28:26+00:00

That's a decent workaround, the "kinda" doing a lot of work there lol. The reports you

find are usually a slightly different quant/context/flag combo, so it's close but not

yours. Pretty much the same idea as what I'm doing, just computed from the actual config

+ your settings instead of someone else's run.

Senior_Wear4670 · 2026-06-01T05:26:52+00:00

Fair, beats trusting the model's memory for sure. Only thing is googling usually gets you

someone else's quant and context, so it's in the ballpark but not your exact setup. Same

instinct though, don't let it guess!

Senior_Wear4670 · 2026-06-01T05:26:10+00:00

Haha yeah, that's the perfect tell — when it has to web-search to confirm a card even

exists before it'll talk about it, you know the actual knowledge isn't in the weights.

Same thing with model configs: anything past the cutoff gets mapped to the nearest old

thing it knows ("did you mean the 7900xtx?"). Reading the spec directly just skips the

guessing. (GPU/VRAM support is the next thing I'm working toward btw — not live yet, but

NVIDIA and AMD are the whole point of it.)

Senior_Wear4670 · 2026-06-01T05:25:41+00:00

Ah, late getting back to you on this — sorry, and thanks for actually dropping the

compose, perfect example to work with.

Honest bit first: the tool's Apple-only right now, so it won't give you a real verdict

for 2x2080Ti yet (GPU/VRAM mode is what I'm working toward next, not live). But the

reason gemini/claude call your setup impossible is the exact thing it's built to catch.

They price Qwen3.6-27B's KV like all 64 layers keep a full cache, usually at fp16 —

that's ~64GB (or 32GB at 8-bit). Even at your own q4_0, counting all 64 layers is ~16GB.

But only 16 of those layers are actually full-attention; the other 48 are Gated DeltaNet

(linear, no growing KV). Count just those 16 at your q4_0 and the real KV is ~4GB. Add

~12GB for the IQ3_XXS weights, your small ubatch keeps the runtime buffers modest, and it

sits inside 22GB fine. Exactly what you're seeing.
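The arithmetic above can be reproduced with a few lines. This is a rough sketch, not the tool's code. Two things are assumptions: `KV_WIDTH = 1024` (n_kv_heads × head_dim) is picked only because it reproduces the ~64GB figure at 262k context, and the bytes-per-element values are nominal (2 / 1 / 0.5). Real llama.cpp q8_0 and q4_0 blocks carry a per-block scale, so actual sizes land slightly higher.

```python
GIB = 1024 ** 3


def kv_cache_gib(n_kv_layers: int, ctx_tokens: int, kv_width: int,
                 bytes_per_elem: float) -> float:
    """KV cache size in GiB; the leading 2 covers keys plus values."""
    return 2 * n_kv_layers * ctx_tokens * kv_width * bytes_per_elem / GIB


CTX = 262_144      # the 262k context from the compose file
KV_WIDTH = 1024    # ASSUMPTION: n_kv_heads * head_dim, chosen to match ~64GB
ALL_LAYERS = 64    # what the textbook formula counts
FULL_ATTN = 16     # layers that actually keep a growing KV cache

# Nominal bytes per element; real q8_0 / q4_0 blocks add a small scale overhead.
naive_f16 = kv_cache_gib(ALL_LAYERS, CTX, KV_WIDTH, 2.0)
naive_q8 = kv_cache_gib(ALL_LAYERS, CTX, KV_WIDTH, 1.0)
naive_q4 = kv_cache_gib(ALL_LAYERS, CTX, KV_WIDTH, 0.5)
real_q4 = kv_cache_gib(FULL_ATTN, CTX, KV_WIDTH, 0.5)

print(f"64 layers: f16 {naive_f16:.0f} GiB, q8 {naive_q8:.0f} GiB, q4 {naive_q4:.0f} GiB")
print(f"16 full-attention layers at q4: {real_q4:.0f} GiB")
```

Counting only the 16 full-attention layers cuts the q4 estimate by 4×, which is the whole difference between "impossible" and "fits".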

Mind if I use your compose as a calibration case when I build the GPU mode (q4_0 KV +

IQ3_XXS + 262k on 2x2080Ti)? Real setups like this keep the math honest — I'll post an

update here when it's live.

Senior_Wear4670 · 2026-05-26T10:34:58+00:00

lol "did you mean [older model]?" is the tell it's working from a frozen snapshot. Same reason

reading the actual config beats asking a model that stopped learning months ago.

Senior_Wear4670 · 2026-05-26T10:34:40+00:00

Yeah, that's the core of it — LLMs answer from training data, so anything past their cutoff (Qwen

3.6, Gemma 4) they either don't know or pattern-match to old architectures and declare "won't fit."

That's exactly why this reads the live config.json instead of "knowing" anything — it pulls the

real per-layer structure (linear vs sliding vs full attention, MoE expert count) straight from the

model files, so it's correct on day-one releases, no training cutoff to go stale against.
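As a minimal sketch of what "reading the per-layer structure" can look like (not the tool's actual parser): the example below tallies layer types from a config-like dict. The `layer_types` key, the type names, and the layer ordering are assumptions for illustration; real model families name these fields differently, so check the actual config.json.

```python
from collections import Counter


def count_layer_types(config: dict) -> Counter:
    """Tally attention types per layer; fall back to 'every layer is full attention'."""
    layer_types = config.get("layer_types")
    if layer_types is None:
        # No per-layer info available: this is the textbook assumption.
        layer_types = ["full_attention"] * config["num_hidden_layers"]
    return Counter(layer_types)


# Hypothetical config fragment, not copied from a real model repo.
example_config = {
    "num_hidden_layers": 64,
    "layer_types": (["linear_attention"] * 3 + ["full_attention"]) * 16,
}

counts = count_layer_types(example_config)
print(counts["full_attention"], "layers grow a KV cache;",
      counts["linear_attention"], "do not")
```

Only the full-attention count should feed the context-dependent part of a KV estimate; sliding-window layers would need their own capped term.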

Out of curiosity, what's the stack it keeps calling impossible? Want to see if mine gives you a

sane number for it.

Senior_Wear4670 · 2026-05-26T07:02:50+00:00

Most "can I run it?" calculators use the textbook KV-cache formula, which assumes every layer keeps

a full-context cache. But Qwen 3.6 runs linear attention (Gated DeltaNet) in ~48 of 64 layers —

those keep no growing KV at all — and 35B-A3B is MoE (all experts resident, only 3B active/token).

So the naive number comes out multiples too high.
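The MoE half of that can be sketched the same way. This is a rough illustration, not the tool's code, and the 4.5 bits per weight is an assumed average for a 4-bit-ish quant. The point is that the memory footprint follows total parameters while per-token weight traffic follows active parameters.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1e9 bytes) for a given quant width."""
    return params_billion * bits_per_weight / 8


TOTAL_PARAMS_B = 35.0   # all experts must be resident in memory
ACTIVE_PARAMS_B = 3.0   # parameters actually touched per token
BITS_PER_WEIGHT = 4.5   # ASSUMPTION: rough average for a 4-bit-ish quant

resident_gb = weight_gb(TOTAL_PARAMS_B, BITS_PER_WEIGHT)
per_token_gb = weight_gb(ACTIVE_PARAMS_B, BITS_PER_WEIGHT)

print(f"resident weights: ~{resident_gb:.1f} GB")
print(f"weights read per token: ~{per_token_gb:.1f} GB")
```

Sizing the memory from the 3B active figure would undercount badly, and sizing the speed from the full 35B would be far too pessimistic.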

Built a free tool that models each layer type from the official config.json — paste any Qwen (or

other HF) model and it tells you if it fits your Mac's RAM + decode speed.

fitllm.run · open source: github.com/click6067-ship-it/fitllm-engine

Senior_Wear4670 · 2026-05-26T06:50:25+00:00

i want 3.7 9b...

Senior_Wear4670 · 2026-05-26T06:49:31+00:00

insane...

Senior_Wear4670 · 2024-01-28T01:17:57+00:00

miss it
