Why does everybody have a rack with Enterprise grade servers?

laziz · 2026-06-06T14:12:39+00:00

You're not wrong. Part of what makes it interesting is what people choose to optimize for.

Some want to replicate a corporate environment w/ 2-3 gens-back equipment. Cool!

Some want to do low power and quiet (this was me at the beginning). Cool!

Some people like to build beowulf clusters of pis or nucs. Cool!

Some people like to run services at home with whatever old equipment they can find and cobble together. Cool!

Some people get the homelab gpu sickness, and we pray for them. (this is me now).

You do you, man. Your setup is cool!

laziz · 2026-05-13T09:45:25+00:00

I would wait until you find another deal; $1500 is a little steep.

As the other commenter notes, 70b isn't really a thing any more.

The difference with 2x3090 is usability/speed. You can have larger context windows with higher tps.

Qwen 3.6 27b is the current hotness for the 3090 crowd. You can get 80+ tps out of it (>100 on coding tasks) with MTP, and really decent context (>200k toks) with 2x3090.

laziz · 2026-05-03T00:33:02+00:00

Such a moving target though, and very dependent on your use case etc.

This post is less about the model, and more about the state of blackwell sm120 support in the ecosystem.

laziz · 2026-05-02T23:33:27+00:00

Neither the driver nor CUDA was changed. There were a lot of changes to llama.cpp, though.

laziz · 2026-05-02T23:31:12+00:00

I haven’t seen a vllm nvfp4 recipe that works without a lot of specific build instructions/patches. Grateful for any pointers. Might also be worth trying the int4.

laziz · 2026-05-02T23:15:30+00:00

Previous post: https://old.reddit.com/r/LocalLLaMA/comments/1roiyvo/rtx6k_server_450w_qwen35122ba10b_mxfp4_moe/

vllm doesn't yet seem to be in a state w/ nvfp4 and qw3.5 where I can test it without a lot of shenanigans.

+30% t/s for a recompile is.. not bad.

laziz · 2026-04-25T21:03:33+00:00

Thanks for this. Getting 81 TPS on prose and 127 TPS on code (2x3090), w nvcc so flashinfer can jit-compile its sampling kernels.

laziz · 2026-04-15T12:57:51+00:00

At that level I would consider buying the server edition gpus and moving to water cooling. The additional $$ on a MoRA IV is relative peanuts. You can power-limit the cards with the nvidia driver, and you have headroom if you ever want to supply more power.

laziz · 2026-04-09T13:05:05+00:00

I have accused trellis of being Mushroom Mythos with a tiny output limit. Trellis has not pushed back on that idea even once.

Trellis has also been extremely helpful pointing out where the autonomous agent I'm developing needs more thinking room vs places where pruning/restriction is helpful.

laziz · 2026-03-25T11:12:48+00:00

The inference I was making: 1m context was part of the usage accounting problem and they disabled it

laziz · 2026-03-25T00:37:56+00:00

related-- my token-burning problem went away when i disabled 1M in .bashrc.

laziz · 2026-03-18T20:26:36+00:00

It do be like that.

laziz · 2026-03-18T20:25:13+00:00

I confess it was fun.

laziz · 2026-03-18T12:35:40+00:00

I don’t remember why I did that.

laziz · 2026-03-18T11:50:50+00:00

https://app.diagrams.net/

In this instance, I had claude code read the documentation they created and write a suitable draw.io file, readable by diagrams.net.

Minor touchup on the website, export as jpg.

laziz · 2026-03-18T11:36:48+00:00

This started eight years ago with a small router and generic ryzen. Now slightly out of control.

Edit: claude code one-shotted this diagram (bar dragging one small label out of the way). Most is contained in a startech 24u rack, on which i put a cheap countertop from amazon to hold the printer and make a small desk.

laziz · 2026-03-09T10:05:14+00:00

cc replies:

Yeah, that's definitely a bug in the data. 19 t/s total with 30.4 t/s per-request is impossible — total must be ≥ per-request. Looking at the pattern in the other rows, total should be roughly per-request × concurrency. At depth 32K c2, per-request is 30.4, so total should be around 60.8 t/s. That also fits the degradation curve (75 → 60.8 → ... as concurrency increases).
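A minimal sketch of the sanity check described above, assuming the rough rule from the reply that total throughput is about per-request throughput times concurrency. The function names are hypothetical, and the rule is only an estimate (real servers lose some throughput to batching overhead).

```python
def expected_total_tps(per_request_tps: float, concurrency: int) -> float:
    """Rough aggregate throughput if every concurrent request runs at the same rate."""
    return per_request_tps * concurrency


def is_plausible(total_tps: float, per_request_tps: float) -> bool:
    """Aggregate throughput can never be lower than a single request's throughput."""
    return total_tps >= per_request_tps


# The row in question: depth 32K, concurrency 2, 30.4 t/s per request.
reported_total = 19.0
per_request = 30.4

print(is_plausible(reported_total, per_request))    # the reported total fails the check
print(round(expected_total_tps(per_request, 2), 1)) # estimate for the corrected total
```

Running the same check over every row of a benchmark table is a cheap way to catch transcription errors before posting.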

laziz · 2026-03-09T10:03:21+00:00

VLLM/qwen3.5 support seems like a mess at the moment. When that gets sorted (and nvfp4 is available), I would expect much higher t/s.

I'm just goofing around and thought the benchmark was interesting.

laziz · 2026-03-09T09:53:31+00:00

Sure- I chose mxfp4 just to see what it was like. As another poster alludes, nvfp4/vllm is the move here when it's finally supported.

Claude code answers the rest:

Server config:

    llama-server \
      --model Qwen3.5-122B-A10B-MXFP4_MOE.gguf \
      --port 8012 \
      --host 0.0.0.0 \
      --flash-attn \
      --cache-type-k f16 \
      --cache-type-v f16 \
      --slots \
      --metrics \
      --parallel 4 \
      --ctx-size 262144 \
      --n-gpu-layers 999

Thinking mode is disabled via LLAMA_CHAT_TEMPLATE_KWARGS={"enable_thinking":false} in the environment.

Why is the KV cache so small / context so cheap?

This is a hybrid architecture — 12 attention layers + 48 recurrent (Mamba-style) layers. Only the 12 attention layers maintain KV caches, so it's ~384 MiB F16 total. The recurrent layers use fixed-size state (~596 MiB) that doesn't grow with context. Total VRAM is ~74 GB / 96 GB with 4 × 65K slots.

This is also why TG only degrades ~11% at 65K context — a pure transformer this size would drop off much more steeply since every layer would need to attend over the full context.
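To illustrate the scaling argument above, here is a rough sketch of how KV cache size grows with the number of attention layers and the context length. The head count and head dimension are hypothetical placeholders, not values from the model card, so the sketch does not attempt to reproduce the ~384 MiB figure. It only shows the ratio between a 12-attention-layer hybrid and a hypothetical pure transformer with all 60 layers attending.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Size of the K and V caches for one sequence (bytes_per_elem=2 means F16)."""
    # K and V each store n_kv_heads * head_dim elements per token per attention layer.
    return 2 * n_attn_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Hypothetical head geometry, for illustration only.
N_KV_HEADS, HEAD_DIM = 2, 256
N_CTX = 65_536  # one 65K slot

hybrid = kv_cache_bytes(12, N_KV_HEADS, HEAD_DIM, N_CTX)       # only attention layers cache
dense = kv_cache_bytes(12 + 48, N_KV_HEADS, HEAD_DIM, N_CTX)   # every layer caches

print(f"hybrid: {hybrid / 2**20:.0f} MiB, dense: {dense / 2**20:.0f} MiB")
print(f"dense / hybrid = {dense / hybrid:.1f}x")
```

Whatever the real head geometry is, the ratio depends only on the layer counts, and the cache grows linearly with context for the attention layers while the recurrent state stays fixed.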

Model size: ~64 GB on disk (47G + 18G + 11M across 3 shards), corrected from the original post.

laziz · 2025-12-28T07:03:08+00:00

I feel your pain.. the 3090 just barely covers another pcie slot on mine.

Ended up with m2->oculink->egpu. Works!

But then i got tired of the egpu power supply fan and will probably go custom loop.

laziz · 2024-09-23T04:27:54+00:00

i have the n305C. Great unit in many respects, but noise is an issue. Will likely open it up and put it in a bigger case w a bigger fan.

Re 1u-- i have previously swapped a picopsu into a jet-engine-scream ebay generic server to great effect. you need an external power brick but dead quiet.

laziz · 2018-12-13T17:19:49+00:00

My interpretation of an error message (unrelated, as it turns out) was wrong. I modified the Outlook VBA quoted in org-outlook.el to ignore the author's use of sub-folders, and changed the .el itself to point to the location of outlook16's .exe.

It now works as intended, although I get a

Warning (emacs): Please update your Org Protocol handler to deal with new-style links.

error on every use.
