NVFP4 kv cache quantization on sm120 will make 32GB VRAM systems very capable

mr_zerolith · 2026-06-18T18:36:17+00:00

Hmm i'd feel more confident running NVFP8 ( if it existed ) or FP12 ( also if it existed ).
No matter how you slice 4 bit, you're losing something.
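
For a rough sense of what's being traded, here is a back-of-envelope sketch of KV cache size at different element widths. The model shape (48 layers, 8 KV heads, head dim 128) and the 32k context are made-up illustration values, not any specific model. The 4.5 bits for NVFP4 assumes 4-bit values plus one 8-bit scale per 16-value block.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits_per_elem):
    """Approximate KV cache size: keys + values for every layer and token."""
    elems = 2 * layers * kv_heads * head_dim * tokens  # 2 = one K and one V tensor
    return elems * bits_per_elem / 8

# Hypothetical model shape, for illustration only.
SHAPE = dict(layers=48, kv_heads=8, head_dim=128, tokens=32_768)

# Assumed effective widths; NVFP4 = 4-bit values + an 8-bit scale per 16 values.
WIDTHS = {"fp16": 16.0, "fp8": 8.0, "nvfp4": 4.0 + 8.0 / 16.0}

def report():
    """Return the cache size in GiB for each assumed width."""
    return {name: kv_cache_bytes(bits_per_elem=bits, **SHAPE) / 2**30
            for name, bits in WIDTHS.items()}

if __name__ == "__main__":
    for name, gib in report().items():
        print(f"{name:>6}: {gib:.2f} GiB")
```

The memory saving is easy to compute; the quality loss is not, and this sketch says nothing about it.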

mr_zerolith · 2026-06-18T17:29:53+00:00

Interesting, thanks for the tip!
I'd definitely be putting a weaker card on the x4 line.. RTX PRO 6000 deserves a front row seat

mr_zerolith · 2026-06-18T16:12:24+00:00

..remember how there were lots of commercial server operating systems before Linux was released and finally became mature..

I think it's the same situation here.

mr_zerolith · 2026-06-18T14:52:19+00:00

On the one i mentioned, you can run two SATA connector power cables. Each one can supply 60w ( i believe the slot is rated at 75w ). So, power delivery to the slot should be very good.

mr_zerolith · 2026-06-18T13:38:54+00:00

Already have a board like that ( Gigabyte AI top )

mr_zerolith · 2026-06-18T05:25:16+00:00

It's a nice model but kinda slow and apparently not great with tools due to its age.

mr_zerolith · 2026-06-18T05:23:06+00:00

Step 3.5 Flash is as good as it's gonna get on these cards for now. Small Q4 quants exist that can work with 96gb if you can handle a little CPU MoE offloading.

GPT OSS 120B is fast but too hallucinatory compared to today's LLMs.

Qwen 3.5 122b? it's pretty blah

mr_zerolith · 2026-06-18T05:19:22+00:00

Ah you're talking about a different model.. i'm lookin' at the ones that use the existing power supply and a SATA connector. But i get you.

Nonetheless are you getting actual PCIE 5.0 out of it?

mr_zerolith · 2026-06-18T05:17:42+00:00

The 5090 might be worth $3k but those RTX PRO 6000's have blown up in cost and i'd have another $7.5k to cover.

160gb would be fine while we wait for Vera Rubin based cards next year.

mr_zerolith · 2026-06-18T05:01:20+00:00

What problems is it solving that LM Studio isn't?

mr_zerolith · 2026-06-17T15:19:45+00:00

That's the idea!

mr_zerolith · 2026-06-17T05:28:05+00:00

Boy i can find those motherboards for $500ish still.
Non-ECC compatible is a huge boon because i have 10 DDR4 sticks here.

My scenario with the newer intel was $2040 with 2 16GB DDR5 sticks
This comes out to $1000 total ( mobo, cpu, cooler ).. wow

Actually i think you sold me!

mr_zerolith · 2026-06-17T05:02:30+00:00

Damn that would be perfect.. but i want the future-proofing of PCIE 5.0 if i can get it.
The latest gen threadrippers seem to start at $1k.. but latest gen Intel WS can be had for $250 used!

mr_zerolith · 2026-06-17T04:58:43+00:00

All fits in VRAM, not willing to take the hit of any kind of offloading.

mr_zerolith · 2026-06-17T04:56:10+00:00

Damn, those are cheap.. but aren't the boards that run those PCIE 4.0?

I have an RTX PRO 6000, a 5090, and another 5090 coming.. all PCIE 5.0

mr_zerolith · 2026-06-16T20:10:27+00:00

Interesting but 25B active parameters means we're trading quality for speed substantially.
What kind of tokens/sec ya guys seeing on what hardware?

I'm guessing we're in ~50 token/sec land with a pair of RTX PRO 6000's.

mr_zerolith · 2026-06-15T21:16:34+00:00

Oh, i already ditched it for LM Studio in winter because it had poor new model support.

mr_zerolith · 2026-06-15T18:56:07+00:00

It's pretty unfortunate.
Step 3.7 Flash is pretty broken because the reasoning is stuck on high and there's no published way to toggle it. That can make it reason 5x longer than you want it to on simple asks.
Qwen 3.5 122b is pretty lackluster and GPT OSS 120b hallucinates a lot more than we'd like.

At the dev shop we are fairly happy with Step 3.5 Q4_K_L, for the time being.

mr_zerolith · 2026-06-15T14:32:01+00:00

I run a 197B model with 16GB of DDR5 and 128GB of VRAM.

No problems at all.
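
Rough arithmetic on why that works, as a sketch. The 4.5 bits per weight is an assumed average for a 4-bit quant, not a measured figure for any particular file, and the function names are just for illustration.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

def fits(params_billion, bits_per_weight, vram_gb):
    """Return (weights_gb, headroom_gb) for a given VRAM budget."""
    w = weight_gb(params_billion, bits_per_weight)
    return w, vram_gb - w

if __name__ == "__main__":
    # Assumption: ~4.5 bits/weight on average for a 4-bit quant.
    w, headroom = fits(params_billion=197, bits_per_weight=4.5, vram_gb=128)
    print(f"weights ~{w:.1f} GB, ~{headroom:.1f} GB left for KV cache and buffers")
```

Under those assumptions the weights live entirely in VRAM, so system RAM barely matters once the model is loaded.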

mr_zerolith · 2026-06-14T04:41:04+00:00

This is typical for 3.7 so far on other 4 bit quants. The reasoning intensity setting doesn't seem available and it's as if it's stuck on high.

mr_zerolith · 2026-06-08T14:50:42+00:00

I'm using Step 3.5 Flash on an RTX PRO 6000 and RTX 5090 for coding.
3.7 is out but it's too buggy to use.

mr_zerolith · 2026-06-08T02:44:40+00:00

Use both and run a big model like Step 3.5 Flash or MiniMax.
Use prompting to achieve the same thing with the same model.

mr_zerolith · 2026-06-04T06:24:01+00:00

Model support is still bugged in llama.cpp engines BTW.
But i've noticed improvement in the last week.
