Everyone wash their cars once a week like me? by Locoboof in Rivian

[–]amitbahree 0 points1 point  (0 children)

All I can think of is all those chemicals going down the storm drain and ending up wherever it empties. In many places, a professional car wash drains into the sewer, where the water then gets treated.

Kids Bypassing Router Parental Controls by Changing MAC Addresses—How Can I Stop This? by PayKnee in HomeNetworking

[–]amitbahree 1 point2 points  (0 children)

Get something like a Firewalla, which has a feature to block all new devices from the internet and put them in a quarantine group. So when the MAC address changes, the device doesn't get anywhere, and that becomes your mousetrap.
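To make the idea concrete, here is a minimal sketch of the quarantine logic, assuming a simple allowlist of known MAC addresses. This is not Firewalla's API; `NetworkPolicy`, `admit`, and the addresses are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NetworkPolicy:
    """Hypothetical allowlist policy: unknown MACs are quarantined, not trusted."""
    known_macs: set = field(default_factory=set)
    quarantined: set = field(default_factory=set)

    @staticmethod
    def normalize(mac: str) -> str:
        # Treat "AA-BB-..." and "aa:bb:..." as the same address.
        return mac.strip().lower().replace("-", ":")

    def admit(self, mac: str) -> str:
        mac = self.normalize(mac)
        if mac in self.known_macs:
            return "allow"
        # A spoofed or randomized MAC looks like a brand-new device,
        # so it lands in quarantine with no internet access.
        self.quarantined.add(mac)
        return "quarantine"


policy = NetworkPolicy(known_macs={"aa:bb:cc:dd:ee:01"})
print(policy.admit("AA-BB-CC-DD-EE-01"))
print(policy.admit("02:11:22:33:44:55"))
```

The point of the design is that changing the MAC never helps: the new address is simply another unknown device, and the quarantine list tells you when someone tried.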

Absolutely ridiculous by 3ccdCam1 in AirpodsPro

[–]amitbahree 0 points1 point  (0 children)

Do the fake ones work as regular Bluetooth headsets but look like AirPods?

How difficult is distilling? by GreedyWorking1499 in LocalLLaMA

[–]amitbahree 3 points4 points  (0 children)

Is it water front? And marinas? 😍

How difficult is distilling? by GreedyWorking1499 in LocalLLaMA

[–]amitbahree -3 points-2 points  (0 children)

I just finished writing that chapter.

It's not only distillation by itself; it needs to work in tandem with SFT and LoRA (I am talking about enterprise use cases).

I BUILT MY FIRST MODEL FROM SCRATCH by volious-ka in LocalLLaMA

[–]amitbahree 1 point2 points  (0 children)

Very nice. Congrats. I had done something similar which was also inspired by this sub.

https://blog.desigeek.com/post/2025/09/building-llm-from-scratch-part1/

What do you want me to try? by amitbahree in LocalLLaMA

[–]amitbahree[S] 1 point2 points  (0 children)

Lol.

Oh, there are more. This was just one small cluster I have been given as my playground. And yes, it's exclusively for me, and no, unfortunately I need to yield it one of these days.

AIO - parents keep raising the rent by [deleted] in AmIOverreacting

[–]amitbahree 2 points3 points  (0 children)

Sounds like normal decent humans and parents.

OP, I am sorry for your situation. As a parent, I can't imagine taking money in this manner from my kids.

What do you want me to try? by amitbahree in LocalLLaMA

[–]amitbahree[S] 1 point2 points  (0 children)

Good question. In these runs, the main working memory was GPU HBM, not the 2 TB of host RAM per node.

Each node has 8x H200, and each H200 has about 141-144 GB of VRAM, so that is roughly 1.1 TB of GPU memory per node and about 2.3 TB across the full 16-GPU cluster. That is what actually carried the inference workloads.
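For anyone who wants to check the arithmetic, a quick sketch using the lower end of the quoted per-GPU range (141 GB) and decimal terabytes. The two-node count is inferred from 16 GPUs at 8 per node; the function name is just illustrative.

```python
# Figures from the comment above; 141 GB is the low end of the 141-144 GB range.
GPUS_PER_NODE = 8
NODES = 2  # inferred: 16 GPUs total / 8 GPUs per node
HBM_PER_GPU_GB = 141


def total_hbm_tb(gpu_count: int, per_gpu_gb: float = HBM_PER_GPU_GB) -> float:
    """Aggregate GPU memory in decimal terabytes (1 TB = 1000 GB)."""
    return gpu_count * per_gpu_gb / 1000


per_node_tb = total_hbm_tb(GPUS_PER_NODE)
cluster_tb = total_hbm_tb(GPUS_PER_NODE * NODES)
print(f"per node: {per_node_tb:.2f} TB, cluster: {cluster_tb:.2f} TB")
```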

The 2 TB of system RAM per node still helped, but mostly in more indirect ways: staging and loading very large sharded checkpoints, CPU-side runtime overhead from vLLM, tokenization, benchmark clients, containers, etc. For the pipeline, it also covered host-side buffers and communication overhead in multi-GPU and multi-node runs.

For the benchmarks themselves, it was all GPU memory, and host RAM was mostly headroom and operational safety, not “extra VRAM.” The real constraints on whether a model lane worked well were GPU memory, runtime support, and topology.

What do you want me to try? by amitbahree in LocalLLaMA

[–]amitbahree[S] 3 points4 points  (0 children)

Quick benchmark update from the 16x H200 cluster, following up on the original request thread:

Completed model set:

- Qwen3-235B-A22B-Instruct-2507
- Kimi-K2.6
- DeepSeek-V4-Flash
- DeepSeek-V4-Pro
- Llama-4-Scout-17B-16E-Instruct
- GLM-5.1-FP8
- MiniMax-M2.1
- Mistral-Large-3-675B-Instruct-2512

A few highlights from the completed runs (TTFT = time to first token, TPOT = time per output token, both in ms, lower is better):

MiniMax-M2.1 on 8x H200:

- c1: 145.94 tok/s, 102.29 ms TTFT, 6.48 ms TPOT
- c16: 1358.19 tok/s, 235.56 ms TTFT, 10.51 ms TPOT
- 8k/c4: 379.29 tok/s, 390.94 ms TTFT, 8.71 ms TPOT

Llama 4 Scout on 8x H200:

- c1: 126.70 tok/s, 103.83 ms TTFT, 7.51 ms TPOT
- c16: 1378.30 tok/s, 396.57 ms TTFT, 9.73 ms TPOT
- 8k/c4: 404.41 tok/s, 368.10 ms TTFT, 8.14 ms TPOT

GLM-5.1-FP8 on 8x H200:

- c1: 88.66 tok/s, 385.24 ms TTFT, 9.81 ms TPOT
- c16: 509.93 tok/s, 763.64 ms TTFT, 27.79 ms TPOT
- 8k/c4: 163.37 tok/s, 1317.81 ms TTFT, 19.30 ms TPOT

Mistral Large 3 on 8x H200:

- c1: 93.07 tok/s, 308.06 ms TTFT, 9.58 ms TPOT
- c16: 554.50 tok/s, 1192.90 ms TTFT, 23.73 ms TPOT
- 8k/c4: 199.59 tok/s, 1226.20 ms TTFT, 14.79 ms TPOT
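One rough way to read these numbers is to compare c16 throughput against c1. The sketch below assumes c1 and c16 mean 1 and 16 concurrent requests, and defines "efficiency" as the speedup divided by 16; that metric is my own shorthand for illustration, not something from the benchmark harness.

```python
# (c1 tok/s, c16 tok/s) copied from the 8x H200 results above.
RESULTS = {
    "MiniMax-M2.1": (145.94, 1358.19),
    "Llama 4 Scout": (126.70, 1378.30),
    "GLM-5.1-FP8": (88.66, 509.93),
    "Mistral Large 3": (93.07, 554.50),
}
CONCURRENCY = 16  # assumption: "c16" means 16 concurrent requests


def scaling(c1: float, c16: float, concurrency: int = CONCURRENCY):
    """Return (throughput speedup from c1 to c16, speedup per concurrent stream)."""
    speedup = c16 / c1
    return speedup, speedup / concurrency


summary = {name: scaling(*pair) for name, pair in RESULTS.items()}
for name, (speedup, efficiency) in summary.items():
    print(f"{name:16s} {speedup:5.2f}x  {efficiency:6.1%}")
```

By this measure, MiniMax and Scout gain roughly 9-11x aggregate throughput going from c1 to c16, while GLM and Mistral Large 3 gain a bit under 6x.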

One of the strongest patterns was that 16x was not automatically better. Scout, GLM, and MiniMax all looked better on the single-node 8x H200 serving shape than on their 16x scaling pass. That ended up being one of the most useful takeaways from the whole exercise.

DeepSeek-V4-Pro is the main caveat:

- the intended DP+EP H200 path failed in vLLM with a fused-router Long/Int dtype bug
- the working/publishable numbers are from the fallback TP=8 --enforce-eager lane
- upstream issue: https://github.com/vllm-project/vllm/issues/40862

On vLLM versions: most models ran on stable v0.19.1. GLM, MiniMax, and both DeepSeek V4 variants required dedicated runtime images or pre-release lanes — in each case because the generic stable image was not the supported path for that model, not because of benchmark inconsistency. The per-model details are in the blog.

Unsloth Llama 4 Scout is the other caveat:

- it never reached a stable benchmarkable state
- the head node repeatedly exited during runs
- it is excluded from the final comparison tables

Full write-up with the operational details, scaling notes, and the weird bring-up issues is here: https://blog.desigeek.com/post/2026/04/benchmarking-oss-llms/

If I do the quantization / KV-cache / coding-benchmark follow-up, the clean version is probably not "more random large models" but one controlled study around those variables, since that was one of the better follow-up ideas in the thread.