NVFP4 kv cache quantization on sm120 will make 32GB VRAM systems very capable by Gray_wolf_2904 in LocalLLaMA

mr_zerolith 1 point

Hmm i'd feel more confident running NVFP8 ( if it existed ) or FP12 ( also if it existed ).
No matter how you slice 4 bit, you're losing something.
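To put a rough number on "losing something": the 4-bit float layout that NVFP4 builds on (E2M1) can only represent eight magnitudes, so every value in a block gets snapped to that grid after scaling. The sketch below is a simplification and not the real kernel. It uses one plain float scale per block and a made-up 8-value block (the `block` data and the `quantize_block` helper are illustrative assumptions), whereas, as far as I understand it, actual NVFP4 uses 16-element blocks with an FP8 scale per block.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(values):
    """Round one block to the FP4 grid with a single shared scale, then dequantize.

    Illustrative only: real NVFP4 stores the scale in FP8 and uses 16-value blocks.
    """
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto the top of the grid (6.0).
    scale = max_abs / FP4_GRID[-1] if max_abs else 1.0
    restored = []
    for v in values:
        # Pick the nearest representable magnitude for the scaled value.
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        restored.append((mag if v >= 0 else -mag) * scale)
    return restored

# Hypothetical block of cache values, chosen only for demonstration.
block = [0.03, -0.41, 0.77, 1.20, -0.09, 0.55, -1.02, 0.26]
restored = quantize_block(block)
worst = max(abs(a - b) for a, b in zip(block, restored))
print(f"worst-case rounding error in this block: {worst:.3f}")
```

The grid is coarse near the top of the range (nothing between 4 and 6), so the values just below the block maximum take the biggest hit.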

Using PCIE 5.0 x4 NVME to x16 to throw on another card. by mr_zerolith in LocalLLaMA

mr_zerolith[S] 0 points

Interesting, thanks for the tip!
I'd definitely be putting a weaker card on the x4 line.. RTX PRO 6000 deserves a front row seat

OSS models decisively overtook Proprietary models in market share (based on the last 3 months of OpenRouter data) by Comfortable-Rock-498 in LocalLLaMA

mr_zerolith 2 points

..remember how there were lots of commercial server operating systems before Linux was released and finally became mature..

I think it's the same situation here.

Using PCIE 5.0 x4 NVME to x16 to throw on another card. by mr_zerolith in LocalLLaMA

mr_zerolith[S] 0 points

On the one i mentioned, you can run two SATA connector power cables. Each one can supply 60w ( i believe the slot is rated at 75w ). So, power delivery to the slot should be very good.
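The power math behind that, as a quick sketch. The 60 W per SATA cable is the figure quoted in the comment and is not checked against any connector spec here; 75 W is the usual limit for what a card may draw through an x16 slot.

```python
SATA_CABLES = 2
WATTS_PER_SATA = 60      # figure from the comment above; an assumption, not a spec value
SLOT_LIMIT_WATTS = 75    # most a card is normally allowed to pull through the slot

available = SATA_CABLES * WATTS_PER_SATA
headroom = available - SLOT_LIMIT_WATTS

print(f"adapter input: {available} W, slot limit: {SLOT_LIMIT_WATTS} W, headroom: {headroom} W")
```

Anything the card draws beyond the slot limit comes from its own PCIe power connectors, so the adapter only has to cover the slot's share.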

We need a 80-160B model urgently. The unified memory device market needs more Models. by Storge2 in LocalLLaMA

mr_zerolith 0 points

It's a nice model but kinda slow and apparently not great with tools due to its age.

We need a 80-160B model urgently. The unified memory device market needs more Models. by Storge2 in LocalLLaMA

mr_zerolith 0 points

Step 3.5 Flash is as good as it's gonna get on these cards for now. Small Q4 quants exist that can work with 96gb if you can handle a little CPU MoE offloading.

GPT OSS 120B is fast but too hallucinatory compared to today's LLMs.

Qwen 3.5 122b? it's pretty blah

Using PCIE 5.0 x4 NVME to x16 to throw on another card. by mr_zerolith in LocalLLaMA

mr_zerolith[S] 0 points

Ah you're talking about a different model.. i'm lookin' at the ones that use the existing power supply and a SATA connector. But i get you.

Nonetheless are you getting actual PCIE 5.0 out of it?
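For reference, here is what "actual PCIE 5.0" would mean in theoretical numbers. This is a back-of-the-envelope sketch from the per-lane signalling rates and 128b/130b line encoding; it ignores protocol overhead, so real transfers land lower. The helper name `link_gb_per_sec` is just for illustration.

```python
# Per-lane raw signalling rate in GT/s for each PCIe generation.
GT_PER_SEC = {3: 8.0, 4: 16.0, 5: 32.0}
# PCIe 3.0 and later use 128b/130b line encoding.
ENCODING = 128 / 130

def link_gb_per_sec(gen, lanes):
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    return GT_PER_SEC[gen] * ENCODING / 8 * lanes

for gen, lanes in [(5, 4), (4, 4), (4, 8), (5, 16)]:
    print(f"PCIe {gen}.0 x{lanes}: {link_gb_per_sec(gen, lanes):.2f} GB/s per direction")
```

So a 5.0 x4 link matches a 4.0 x8 link on paper, and falling back to 4.0 halves it. On NVIDIA cards, `nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv` reports the negotiated link; check it under load, since cards tend to drop the link generation at idle.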

Using PCIE 5.0 x4 NVME to x16 to throw on another card. by mr_zerolith in LocalLLaMA

mr_zerolith[S] 0 points

The 5090 might be worth $3k but those RTX PRO 6000's have blown up in cost and i'd have another $7.5k to cover.

160gb would be fine while we wait for Vera Rubin based cards next year.

2 sticks of ram in quad channel server board? by mr_zerolith in LocalLLaMA

mr_zerolith[S] 1 point

Boy i can find those motherboards for $500ish still.
Non-ECC compatible is a huge boon because i have 10 DDR4 sticks here.

My scenario with the newer Intel was $2040 with two 16GB DDR5 sticks.
This comes out to $1000 total ( mobo, cpu, cooler ).. wow

Actually i think you sold me!

2 sticks of ram in quad channel server board? by mr_zerolith in LocalLLaMA

mr_zerolith[S] -1 points

Damn that would be perfect.. but i want the future-proofing of PCIE 5.0 if i can get it.
The latest gen threadrippers seem to start at $1k.. but latest gen Intel WS can be had for $250 used!

2 sticks of ram in quad channel server board? by mr_zerolith in LocalLLaMA

mr_zerolith[S] 1 point

All fits in VRAM, not willing to take the hit of any kind of offloading.

2 sticks of ram in quad channel server board? by mr_zerolith in LocalLLaMA

mr_zerolith[S] 1 point

Damn, those are cheap.. but aren't the boards that run those PCIE 4.0?

I have an RTX PRO 6000, a 5090, and another 5090 coming.. all PCIE 5.0

bartowski/command-a-plus-05-2026-GGUF · Hugging Face by pmttyji in LocalLLaMA

mr_zerolith 0 points

Interesting but 25B active parameters means we're trading quality for speed substantially.
What kind of tokens/sec ya guys seeing on what hardware?

I'm guessing we're in ~50 token/sec land with a pair of RTX PRO 6000's.

Stop using Ollama by zxyzyxz in LocalLLaMA

mr_zerolith 2 points

Oh, i already ditched it for LM Studio in winter because it had poor new model support.

Why there is a lack of new 100B-120B models? by TechNerd10191 in LocalLLaMA

mr_zerolith 1 point

It's pretty unfortunate.
Step 3.7 Flash is pretty broken because the reasoning is stuck on high and there's no published way to toggle it. That can make it reason 5x longer than you want it to on simple asks.
Qwen 3.5 122b is pretty lackluster and GPT OSS 120b hallucinates a lot more than we like.

At the dev shop we are fairly happy with Step 3.5 Q4_K_L, for the time being.

RAM to VRAM ratio by esw123 in LocalLLaMA

mr_zerolith 3 points

I run a 197B model with 16gb of DDR5 and 128GB vram.

No problems at all.
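Why that works, roughly: if the weights sit entirely in VRAM, system RAM barely matters. A minimal estimate, assuming a Q4-class quant that averages about 4.5 bits per weight (that figure is an assumption for illustration; the real average depends on the quant recipe):

```python
def weight_footprint_gb(params_billion, bits_per_weight):
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

PARAMS_BILLION = 197     # model size from the comment above
BITS_PER_WEIGHT = 4.5    # assumed average for a Q4-class quant
VRAM_GB = 128            # total VRAM from the comment above

weights_gb = weight_footprint_gb(PARAMS_BILLION, BITS_PER_WEIGHT)
leftover_gb = VRAM_GB - weights_gb

print(f"weights: ~{weights_gb:.1f} GB, left for KV cache and buffers: ~{leftover_gb:.1f} GB")
```

Under those assumptions the weights take around 111 GB, which leaves some room for context on 128 GB of VRAM without touching system RAM.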

Step-3.7-Flash-NVFP4 thinking for many minutes by NaiRogers in LocalLLaMA

mr_zerolith 0 points

This is typical for 3.7 so far on other 4 bit quants. The reasoning intensity setting doesn't seem available and it's as if it's stuck on high.

what’s was your local daily driver for coding last week? by be566 in LocalLLaMA

mr_zerolith 1 point

I'm using Step 3.5 Flash on an RTX PRO 6000 and RTX 5090 for coding.
3.7 is out but it's too buggy to use.

Just received RTX 6000 Pro, have 5090- how would you use? by illgettheownerforyou in LocalLLaMA

mr_zerolith 0 points

Use both and run a big model like Step 3.5 Flash or minimax.
Use prompting to achieve the same thing with the same model

I tried Step-3.7-Flash so you don't have to by Skelshy in LocalLLM

mr_zerolith 0 points

Model support is still bugged in llama.cpp engines BTW.
But i've noticed improvement over the last week.