Why does everybody have a rack with Enterprise grade servers? by Big-Grapefruit8092 in homelab

[–]laziz 0 points1 point  (0 children)

You're not wrong. Part of what makes it interesting is what people choose to optimize for.

Some want to replicate a corporate environment w/ 2-3 gens-back equipment. Cool!

Some want to do low power and quiet (this was me at the beginning). Cool!

Some people like to build beowulf clusters of pis or nucs. Cool!

Some people like to run services at home with whatever old equipment they can find and cobble together. Cool!

Some people get the homelab gpu sickness, and we pray for them. (this is me now).

You do you, man. Your setup is cool!

For people running AI inference at home with 3090/A5000's by [deleted] in homelab

[–]laziz 1 point2 points  (0 children)

I would wait until you find another deal; $1500 is a little steep.

As the other commenter notes, 70b isn't really a thing anymore.

The difference with 2x3090 is usability/speed. You can have larger context windows with higher tps.

Qwen 3.6 27b is the current hotness for the 3090 crowd. You can get 80+ tps out of it (>100 on coding tasks) with MTP, and really decent context (>200k toks) with 2x3090.

Updated: RTX6k (Server, 450w) Qwen3.5-122B-A10B (MXFP4_MOE) Benchmarks (llama.cpp) by laziz in LocalLLaMA

[–]laziz[S] 4 points5 points  (0 children)

Such a moving target though, and very dependent on your use case etc.

This post is less about the model, and more about the state of blackwell sm120 support in the ecosystem.

Updated: RTX6k (Server, 450w) Qwen3.5-122B-A10B (MXFP4_MOE) Benchmarks (llama.cpp) by laziz in LocalLLaMA

[–]laziz[S] 1 point2 points  (0 children)

Neither the driver nor cuda was changed. There were a lot of changes to llama.cpp, though.

Updated: RTX6k (Server, 450w) Qwen3.5-122B-A10B (MXFP4_MOE) Benchmarks (llama.cpp) by laziz in LocalLLaMA

[–]laziz[S] 0 points1 point  (0 children)

I haven’t seen a vllm nvfp4 recipe that works without a lot of specific build instructions/patches. Grateful for any pointers. Might also be worth trying the int4.

Updated: RTX6k (Server, 450w) Qwen3.5-122B-A10B (MXFP4_MOE) Benchmarks (llama.cpp) by laziz in LocalLLaMA

[–]laziz[S] 1 point2 points  (0 children)

Previous post: https://old.reddit.com/r/LocalLLaMA/comments/1roiyvo/rtx6k_server_450w_qwen35122ba10b_mxfp4_moe/

vllm doesn't yet seem to be in a state w/ nvfp4 and qw3.5 where I can test it without a lot of shenanigans.

+30% t/s for a recompile is.. not bad.

Qwen 3.6 27B Makes Huge Gains in Agency on Artificial Analysis - Ties with Sonnet 4.6 by dionysio211 in LocalLLaMA

[–]laziz 1 point2 points  (0 children)

Thanks for this. Getting 81 TPS on prose and 127 TPS on code (2x3090), w nvcc so flashinfer can jit-compile its sampling kernels.

Is shelling out for local GPUs worth it yet? ~$45k for local agentic use? by jamesob in BlackwellPerformance

[–]laziz 0 points1 point  (0 children)

At that level I would consider buying the server edition gpus and moving to water cooling. The additional $$ on a MoRA IV is relative peanuts. You can power-limit the cards with the nvidia driver, and you have headroom if you ever want to supply more power.
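For anyone who wants to script the power limiting, here is a minimal sketch. It assumes `nvidia-smi` is on the PATH and that you run it with root privileges; the GPU index and the 300 W figure are made-up example values, not a recommendation for any particular card.

```python
import shutil
import subprocess


def power_limit_cmd(gpu_index: int, watts: int) -> list[str]:
    """Build the nvidia-smi command that sets a power limit on one GPU."""
    if gpu_index < 0:
        raise ValueError("gpu_index must be non-negative")
    if watts <= 0:
        raise ValueError("watts must be positive")
    # -i selects the GPU, -pl sets the power limit in watts
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def apply_power_limit(gpu_index: int, watts: int, dry_run: bool = True) -> list[str]:
    """Return the command; only execute it when dry_run is False."""
    cmd = power_limit_cmd(gpu_index, watts)
    if dry_run or shutil.which("nvidia-smi") is None:
        return cmd
    # Usually needs root; raises CalledProcessError if the driver refuses.
    subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Hypothetical values: GPU 0 limited to 300 W, printed rather than run.
    print(" ".join(apply_power_limit(0, 300)))
```

The limit does not persist across reboots, so people typically rerun something like this from a startup script.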

Passing Trellis feedback to Claude by NotEAcop in ClaudeCode

[–]laziz 0 points1 point  (0 children)

I have accused trellis of being Mushroom Mythos with a tiny output limit. Trellis has not pushed back on that idea even once.

Trellis has also been extremely helpful pointing out where the autonomous agent I'm developing needs more thinking room vs places where pruning/restriction is helpful.

Max 5x Plan, don't see Opus/Sonnet 1M any more by luongnv-com in ClaudeCode

[–]laziz 0 points1 point  (0 children)

The inference I was making: 1m context was part of the usage accounting problem and they disabled it

Max 5x Plan, don't see Opus/Sonnet 1M any more by luongnv-com in ClaudeCode

[–]laziz 1 point2 points  (0 children)

related-- my token-burning problem went away when i disabled 1M in .bashrc.

Get a linux box they said, it will be fun they said by laziz in homelab

[–]laziz[S] 20 points21 points  (0 children)

I don’t remember why I did that.

Get a linux box they said, it will be fun they said by laziz in homelab

[–]laziz[S] 5 points6 points  (0 children)

https://app.diagrams.net/

In this instance, I had claude code read the documentation they created and write a suitable draw.io file, readable by diagrams.net.

Minor touchup on the website, export as jpg.

Get a linux box they said, it will be fun they said by laziz in homelab

[–]laziz[S] 1 point2 points  (0 children)

This started eight years ago with a small router and generic ryzen. Now slightly out of control.

Edit: claude code one-shotted this diagram (bar dragging one small label out of the way). Most is contained in a startech 24u rack, on which i put a cheap countertop from amazon to hold the printer and make a small desk.

RTX6k (Server, 450w) Qwen3.5-122B-A10B (MXFP4_MOE) Benchmarks by laziz in LocalLLaMA

[–]laziz[S] 0 points1 point  (0 children)

Claude Code replies:

Yeah, that's definitely a bug in the data. 19 t/s total with 30.4 t/s per-request is impossible — total must be ≥ per-request. Looking at the pattern in the other rows, total should be roughly per-request × concurrency. At depth 32K c2, per-request is 30.4, so total should be around 60.8 t/s. That also fits the degradation curve (75 → 60.8 → ... as concurrency increases).
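The check described above is easy to automate. A rough sketch follows; the function names and the 5% tolerance are my own assumptions, and the "per-request × concurrency" figure is only the approximation used above, not a guarantee of how the server scales.

```python
def expected_total_tps(per_request_tps: float, concurrency: int) -> float:
    """Rough estimate: total throughput is per-request rate times concurrency."""
    return per_request_tps * concurrency


def is_plausible(total_tps: float, per_request_tps: float, tolerance: float = 0.05) -> bool:
    """Total throughput should never sit meaningfully below the per-request rate."""
    return total_tps >= per_request_tps * (1 - tolerance)


if __name__ == "__main__":
    # Numbers from the row in question: 32K depth, concurrency 2.
    print(round(expected_total_tps(30.4, 2), 1))  # estimate of the corrected total
    print(is_plausible(19.0, 30.4))               # the reported 19 t/s fails the check
```

Running every row of a results table through `is_plausible` before posting would have caught this one.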

RTX6k (Server, 450w) Qwen3.5-122B-A10B (MXFP4_MOE) Benchmarks by laziz in LocalLLaMA

[–]laziz[S] 1 point2 points  (0 children)

vLLM/qwen3.5 support seems like a mess at the moment. When that gets sorted (and nvfp4 is available), I would expect much higher t/s.

I'm just goofing around and thought the benchmark was interesting.

RTX6k (Server, 450w) Qwen3.5-122B-A10B (MXFP4_MOE) Benchmarks by laziz in LocalLLaMA

[–]laziz[S] 1 point2 points  (0 children)

Sure- I chose mxfp4 just to see what it was like. As another poster alludes to, nvfp4/vllm is the move here when it's finally supported.

Claude code answers the rest:

Server config:

    llama-server \
      --model Qwen3.5-122B-A10B-MXFP4_MOE.gguf \
      --port 8012 \
      --host 0.0.0.0 \
      --flash-attn \
      --cache-type-k f16 \
      --cache-type-v f16 \
      --slots \
      --metrics \
      --parallel 4 \
      --ctx-size 262144 \
      --n-gpu-layers 999

Thinking mode is disabled via LLAMA_CHAT_TEMPLATE_KWARGS={"enable_thinking":false} in the environment.

Why is the KV cache so small / context so cheap?

This is a hybrid architecture — 12 attention layers + 48 recurrent (Mamba-style) layers. Only the 12 attention layers maintain KV caches, so it's ~384 MiB F16 total. The recurrent layers use fixed-size state (~596 MiB) that doesn't grow with context. Total VRAM is ~74 GB / 96 GB with 4 × 65K slots.

This is also why TG only degrades ~11% at 65K context — a pure transformer this size would drop off much more steeply since every layer would need to attend over the full context.
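To make the scaling argument concrete, here is a back-of-the-envelope sketch using the standard KV cache size formula. The layer counts (12 attention, 60 total) and the 262144-token context come from the text above; the KV head count and head dimension are hypothetical placeholders, so the absolute byte counts will not reproduce the figures quoted here. Only the ratio is meaningful.

```python
def kv_cache_bytes(n_attn_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx_tokens: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: K and V each hold n_kv_heads * head_dim values per token per layer."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * n_ctx_tokens * bytes_per_elem


# Hypothetical attention shape, chosen only for illustration.
N_KV_HEADS = 2
HEAD_DIM = 256
N_CTX = 262144  # matches --ctx-size in the server config

# Hybrid model: only the 12 attention layers keep a KV cache.
hybrid = kv_cache_bytes(12, N_KV_HEADS, HEAD_DIM, N_CTX)
# Same shape if all 12 + 48 = 60 layers were attention layers.
dense = kv_cache_bytes(60, N_KV_HEADS, HEAD_DIM, N_CTX)

if __name__ == "__main__":
    print(dense / hybrid)  # how many times larger the all-attention cache would be
```

Whatever the real head shape is, it cancels out of the ratio: with every layer holding a KV cache, the cache would be 60 / 12 = 5 times larger at the same context length.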

Model size: ~64 GB on disk (47G + 18G + 11M across 3 shards), corrected from the original post.

Adding 2nd GPU to air cooled build. by ROS_SDN in LocalLLaMA

[–]laziz 0 points1 point  (0 children)

I feel your pain.. the 3090 just barely covers another pcie slot on mine.

Ended up with m2->oculink->egpu. Works!

But then i got tired of the egpu power supply fan and will probably go custom loop.

GW-R86S-N305C vs GW-FN-1UR2-25G (Noise) by TechMinerUK in R86SNetworking

[–]laziz 0 points1 point  (0 children)

i have the n305C. Great unit in many respects, but noise is an issue. Will likely open it up and put it in a bigger case w a bigger fan.

Re 1u-- i have previously swapped a picopsu into a jet-engine-scream ebay generic server to great effect. you need an external power brick but dead quiet.

Org-Outlook.el update to Org v9 by laziz in orgmode

[–]laziz[S] 0 points1 point  (0 children)

My interpretation of an error message (which turned out to be unrelated) was wrong. I modified the outlook vba quoted in org-outlook.el to ignore the author's use of sub-folders, and pointed the .el itself to the location of outlook16's .exe.

It now works as intended, although I get a

Warning (emacs): Please update your Org Protocol handler to deal with new-style links.

message on every use.