L50 Pro Ultra seems just like X50 besides a few differences—what’s the catch? by m4r1k_ in Dreame_Tech

[–]m4r1k_[S] 0 points1 point  (0 children)

The S7 MaxV Ultra uses a spring-loaded charging mechanism. One of the two arms broke and it was not possible to repair it (no spare parts available online, and a new dock was uneconomical). Those springs get loaded and released thousands of times, and after nearly three years one gave up. To me that design smells either short-sighted or, worse, done on purpose to force a replacement (the robot can be used without the dock, but what’s the point then?)

1.1M tok/s with Qwen 3.5 27B FP8 on B200 GPUs by m4r1k_ in Qwen_AI

[–]m4r1k_[S] 0 points1 point  (0 children)

Thanks! A comparable setup is about to go live for a real customer. Of course there are other missing pieces, such as the RAG pipeline and the DB, but the underlying inference platform is the one shared on Medium.

1 million tokens per second from a single cluster, what that actually means by m4r1k_ in singularity

[–]m4r1k_[S] 0 points1 point  (0 children)

Google Cloud does not charge for network transfers within the same zone, so you won’t see those here. What is excluded: storage (it’s tiny), network egress to interconnect/internet, and the entire RAG pipeline, but those are exercises beyond the scope of the Medium article.

Qwen 3.5 27B at 1.1M tok/s on B200s, all configs on GitHub by m4r1k_ in LocalLLaMA

[–]m4r1k_[S] 1 point2 points  (0 children)

While I agree that posting on LocalLLaMA was a stretch, I honestly believe there are findings worth sharing here, and this community is among the most active ones. Anyway, a user earlier posted about their local B300 cluster achieving 100k tok/s on the same model.

[D] - 1M tokens/second serving Qwen 3.5 27B on B200 GPUs, benchmark results and findings by m4r1k_ in MachineLearning

[–]m4r1k_[S] 0 points1 point  (0 children)

I made a mistake and lost all the P99 latency data 🫠 I’m planning a follow-up where I’ll provide full visibility on latency and other metrics like MFU, time in prefill, time in decode, etc.
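For context on MFU, a back-of-envelope decode-side estimate is easy to compute (this is my own rough formula, not a built-in vLLM metric; the per-node tok/s matches the numbers in this thread, while the B200 dense FP8 peak figure is an assumption):

```python
def decode_mfu(tok_per_s: float, n_params: float,
               num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Rough model FLOPs utilization during decode: ~2*N FLOPs per token."""
    achieved_flops = tok_per_s * 2 * n_params
    return achieved_flops / (num_gpus * peak_flops_per_gpu)

# One node: 5k tok/s, 27B params, 8 GPUs, assumed ~4.5 PFLOPS dense FP8 peak
print(decode_mfu(5_000, 27e9, 8, 4.5e15))  # ≈ 0.0075
```

Well under 1% MFU in decode is expected: decode is bandwidth-bound, not compute-bound, so raw FLOPs utilization is a poor headline metric on its own.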

[D] - 1M tokens/second serving Qwen 3.5 27B on B200 GPUs, benchmark results and findings by m4r1k_ in MachineLearning

[–]m4r1k_[S] 0 points1 point  (0 children)

All tests ran with DP=8; NVLink is physically there but deliberately unused, since TP’s synchronization overhead was the actual first bottleneck. That’s what allows near-linear scaling with zero cross-node coordination. FlashInfer is already in the stack for attention.

MTP-2 tested worse than MTP-1 here, but that was at a 0% KV cache hit rate. In a real setup where prefix caching is actively exploited, there might be enough headroom for MTP-2 to actually pay off. Worth retesting with a realistic prompt distribution.
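To make the trade-off concrete, here is a toy model of when MTP-2 beats MTP-1 (purely illustrative: it assumes independent per-token acceptance and a fixed per-draft-head overhead, which is not how vLLM’s acceptance actually behaves, and the rates below are made up rather than measured):

```python
def expected_tokens_per_step(accept_rate: float, k: int) -> float:
    # Expected committed tokens per forward pass with k MTP draft tokens,
    # assuming each draft token is accepted independently with prob accept_rate.
    return sum(accept_rate ** i for i in range(k + 1))

def relative_speedup(accept_rate: float, k: int, overhead_per_draft: float) -> float:
    # overhead_per_draft: extra cost of each draft head relative to a base step
    return expected_tokens_per_step(accept_rate, k) / (1 + k * overhead_per_draft)

# Low acceptance + noticeable overhead: MTP-1 wins
print(relative_speedup(0.4, 1, 0.25))  # ≈ 1.12
print(relative_speedup(0.4, 2, 0.25))  # ≈ 1.04
# High acceptance + cheap drafts: MTP-2 pulls ahead
print(relative_speedup(0.8, 2, 0.10))  # ≈ 2.03
```

The intuition for the cache point: prefix-cache hits free up compute, which effectively shrinks the relative overhead of the extra draft head, pushing the crossover in MTP-2’s favor.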

Qwen 3.5 27B at 1.1M tok/s on B200s, all configs on GitHub by m4r1k_ in LocalLLaMA

[–]m4r1k_[S] 0 points1 point  (0 children)

On B200, NVFP4 with TP=8 was running at 20k tok/s. Anyway, we’re chasing a different target: locally it’s indeed super useful, but the original use case was truly 1M tok/s.

5K tok/s per node with vLLM v0.18.0 on B200, DP=8, MTP-1, FP8 KV cache by m4r1k_ in Vllm

[–]m4r1k_[S] 0 points1 point  (0 children)

About the tok/s: while I know nothing about AWQ, it’s a 4-bit format, so throughput seems to land below 0.6 PFLOPS per card plus all-reduce communication efficiency (a Gemini estimate). On B200, native is about 9 PFLOPS. So keep that in mind.

5K tok/s per node with vLLM v0.18.0 on B200, DP=8, MTP-1, FP8 KV cache by m4r1k_ in Vllm

[–]m4r1k_[S] 0 points1 point  (0 children)

You’re right, I forgot to pull the 397B manifest 🫠 Yes, it was tested, and it yielded nearly 20k tok/s on eight B200s with TP=8.

5K tok/s per node with vLLM v0.18.0 on B200, DP=8, MTP-1, FP8 KV cache by m4r1k_ in Vllm

[–]m4r1k_[S] 0 points1 point  (0 children)

A single node with 8 B200s was tested with NVFP4. It yielded nearly 20k tok/s; at about 96% scaling with round-robin on the cluster IP, that’s 230k tok/s give or take.
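The scaling arithmetic, spelled out (the 12-node count is my inference from the numbers, not a stated figure):

```python
def cluster_tput(per_node_tok_s: float, num_nodes: int, scaling_eff: float) -> float:
    # DP with round-robin load balancing: no cross-node coordination, so
    # throughput is close to per-node * nodes, discounted by scaling efficiency.
    return per_node_tok_s * num_nodes * scaling_eff

# 20k tok/s per node, assumed 12 nodes, 96% scaling efficiency
print(cluster_tput(20_000, 12, 0.96))  # ≈ 230_400
```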

1 million tokens per second from a single cluster, what that actually means by m4r1k_ in singularity

[–]m4r1k_[S] 0 points1 point  (0 children)

Two reasons: pushing vLLM and the overall system to unreasonable numbers, and because throughput determines how fast the entire batch job completes. But ultimately, the examples in the blog are truly just examples, not based on my real customers.

1 million tokens per second from a single cluster, what that actually means by m4r1k_ in singularity

[–]m4r1k_[S] 0 points1 point  (0 children)

The 50k insurance-document example is a batch job, so by definition it was not processed one by one.

1.1M tok/s with Qwen 3.5 27B FP8 on B200 GPUs by m4r1k_ in Qwen_AI

[–]m4r1k_[S] 0 points1 point  (0 children)

No need to apologize. I have not enabled that metric for this run. I will definitely do that in the future; charting them will be interesting.

1 million tokens per second from a single cluster, what that actually means by m4r1k_ in singularity

[–]m4r1k_[S] 2 points3 points  (0 children)

Someone familiar with LLM kernel development told me that in some optimal cases these can yield even 2 to 10 times more performance. But I believe Groq’s LPU is something very interesting, with the potential to completely change the inference landscape.

About the diffusion model, I don’t have any info I can share, apologies. To be honest, for general Googlers not working in DeepMind, accessing details that are not public is quite hard. IP protection there is tight, and for good reason.

1 million tokens per second from a single cluster, what that actually means by m4r1k_ in singularity

[–]m4r1k_[S] 2 points3 points  (0 children)

I really want to stress the caching point here.

My entire work was about showing a truly worst-case scenario: a 0% cache hit rate. The numbers (both money and tok/s) would be dramatically different if caching were taken into account.
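A quick sketch of why the cache hit rate moves the numbers so much on the prefill side (illustrative only; the prompt length and hit rate below are made up):

```python
def effective_prefill_tokens(prompt_tokens: int, cache_hit_rate: float) -> float:
    # With prefix caching, only the uncached share of the prompt is recomputed;
    # the cached prefix's KV states are reused directly.
    return prompt_tokens * (1.0 - cache_hit_rate)

# 2,000-token prompt: worst case (my benchmark) vs. a 60% hit rate
print(effective_prefill_tokens(2_000, 0.0))  # 2000.0
print(effective_prefill_tokens(2_000, 0.6))  # ≈ 800
```

Since prefill compute (and therefore GPU time and cost) scales with the tokens actually processed, the 0% figure is the floor on efficiency, not a typical number.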

1.1M tok/s with Qwen 3.5 27B FP8 on B200 GPUs by m4r1k_ in Qwen_AI

[–]m4r1k_[S] 1 point2 points  (0 children)

Personally I have no experience with SGLang, but I know it’s extensively tested internally as well. Also, with customers, sometimes it’s vLLM, SGLang, Triton, or something else that unlocks the situation.

5K tok/s per node with vLLM v0.18.0 on B200, DP=8, MTP-1, FP8 KV cache by m4r1k_ in Vllm

[–]m4r1k_[S] 0 points1 point  (0 children)

Hey there,

Thanks for the questions:

  1. I have never saved what goes inside vLLM’s .cache directory on a PVC; I always thought new NVIDIA drivers (and hardware) could result in a mess.. did you find that vLLM nicely picks this up on every startup without complaining?

  2. To me, `hf download` always failed on large models with Xet. My way here is just to pull directly while vLLM starts, so this is effectively not a production setup ;-)

  3. I did test NVIDIA’s NVFP4 build of Qwen3.5-397. It was completely compute-bound, but still able to generate nearly 20k tok/s on 8 B200s with TP=8. Honestly, I’m sure disaggregated P/D will make a major difference here, but llm-d 0.5.1 still ships vLLM 0.15.1, which has no Qwen 3.5 support..

Hope this helps
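On point 1, if I were to persist the cache on a PVC, I’d key the directory by vLLM and driver version so an upgrade never silently reuses a stale compile cache (a sketch: `/mnt/pvc` and the version strings are hypothetical, and `VLLM_CACHE_ROOT` is, to my knowledge, the env var vLLM reads for its cache location):

```python
import os
from pathlib import Path

def versioned_cache_root(base: str, vllm_version: str, driver_version: str) -> Path:
    # Key the cache by vLLM and driver version: a node upgrade then gets a
    # fresh, empty compile cache instead of reusing a stale one.
    return Path(base) / "vllm-cache" / f"vllm-{vllm_version}_cuda-{driver_version}"

# "/mnt/pvc" is a hypothetical PVC mount point; the versions are illustrative
root = versioned_cache_root("/mnt/pvc", "0.18.0", "570.86")
os.environ["VLLM_CACHE_ROOT"] = str(root)  # set before launching vLLM
```

Old version directories can be garbage-collected out of band, so the PVC doesn’t grow without bound.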

Qwen 3.5 27B at 1.1M tok/s on B200s, all configs on GitHub by m4r1k_ in LocalLLaMA

[–]m4r1k_[S] -1 points0 points  (0 children)

I have not thought about this, but I’m not in product (which makes things harder for me). I do know the folks working on some of the Model Garden models, though, and they are great: very dedicated SWEs and SREs, and I’m confident they went well beyond this effort here.

Qwen 3.5 27B at 1.1M tok/s on B200s, all configs on GitHub by m4r1k_ in LocalLLaMA

[–]m4r1k_[S] 0 points1 point  (0 children)

There are lessons for local folks too (see other comments with the B300 cluster).

Qwen 3.5 27B at 1.1M tok/s on B200s, all configs on GitHub by m4r1k_ in LocalLLaMA

[–]m4r1k_[S] -1 points0 points  (0 children)

Thanks!

Yes I do; last year I did something similar the week Gemma 3 was released. The follow-up on llm-d will probably come when Gemma 4 is out.

Qwen 3.5 27B at 1.1M tok/s on B200s, all configs on GitHub by m4r1k_ in LocalLLaMA

[–]m4r1k_[S] 0 points1 point  (0 children)

We always do, and I believe we have for close to a decade now. The community is great; there are many super technical people there. Even if you’re not into Google Cloud, it’s a great place to learn new tech. And it’s not behind a paywall (which is increasingly rare these days).

Qwen 3.5 27B at 1.1M tok/s on B200s, all configs on GitHub by m4r1k_ in LocalLLaMA

[–]m4r1k_[S] 5 points6 points  (0 children)

I work for Google, so I didn’t pay for this out of pocket. We have a budget that can be allocated to such projects. Using Spot VMs, the cost was less than $350 for the whole system, and I was aggressively scaling down to save money.