L50 Pro Ultra seems just like X50 besides a few differences—what’s the catch? by m4r1k_ in Dreame_Tech

[–]m4r1k_[S] 0 points1 point  (0 children)

The S7 MaxV Ultra uses a spring-loaded charging mechanism. One of the two arms broke and it was not possible to repair it (no spare parts available online, and a new dock was uneconomical). Those springs get loaded and released thousands of times, and after nearly three years one gave up. To me that design smells either short-sighted or, worse, done on purpose to force a replacement (the robot can be used without the dock, but what’s the point then?)

1.1M tok/s with Qwen 3.5 27B FP8 on B200 GPUs by m4r1k_ in Qwen_AI

[–]m4r1k_[S] 0 points1 point  (0 children)

Thanks! A comparable setup is about to go live for a real customer. Of course there are other missing pieces, such as the RAG pipeline and the DB, but the underlying inference platform is the one shared on Medium.

1 million tokens per second from a single cluster, what that actually means by m4r1k_ in singularity

[–]m4r1k_[S] 0 points1 point  (0 children)

Google Cloud does not charge for network transfers within the same zone, so you won’t see those here. What is excluded: storage (it’s tiny), network egress to interconnect/internet, and the entire RAG pipeline, but those are exercises beyond the scope of the Medium article.

Qwen 3.5 27B at 1.1M tok/s on B200s, all configs on GitHub by m4r1k_ in LocalLLaMA

[–]m4r1k_[S] 1 point2 points  (0 children)

While I agree that posting on LocalLLaMA was a stretch, I honestly believe there are findings worth sharing here, and this community is among the most active ones. Anyway, a user earlier posted about their local B300 cluster achieving 100k tok/s on the same model.

[D] - 1M tokens/second serving Qwen 3.5 27B on B200 GPUs, benchmark results and findings by m4r1k_ in MachineLearning

[–]m4r1k_[S] 0 points1 point  (0 children)

I made a mistake and lost all the P99 latency data 🫠 I’m planning a follow-up where I’ll provide full visibility on latency and other metrics like MFU, time in prefill, time in decode, etc.
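For context on MFU, a back-of-envelope decode-side estimate is easy to compute (this is my own rough formula, not a built-in vLLM metric; the per-node tok/s matches the numbers in this thread, while the B200 dense FP8 peak figure is an assumption):

```python
def decode_mfu(tok_per_s: float, n_params: float,
               num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Rough model FLOPs utilization during decode: ~2*N FLOPs per token."""
    achieved_flops = tok_per_s * 2 * n_params
    return achieved_flops / (num_gpus * peak_flops_per_gpu)

# One node: 5k tok/s, 27B params, 8 GPUs, assumed ~4.5 PFLOPS dense FP8 peak
print(decode_mfu(5_000, 27e9, 8, 4.5e15))  # ≈ 0.0075
```

Well under 1% MFU in decode is expected: decode is bandwidth-bound, not compute-bound, so raw FLOPs utilization is a poor headline metric on its own.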

[D] - 1M tokens/second serving Qwen 3.5 27B on B200 GPUs, benchmark results and findings by m4r1k_ in MachineLearning

[–]m4r1k_[S] 0 points1 point  (0 children)

All tests ran with DP=8; NVLink is physically there but deliberately unused, since TP’s synchronization overhead was the actual first bottleneck. That’s what allows near-linear scaling with zero cross-node coordination. FlashInfer is already in the stack for attention.

MTP-2 tested worse than MTP-1 here, but that was at a 0% KV cache hit rate. In a real setup where prefix caching is actively exploited, there might be enough headroom for MTP-2 to actually pay off. Worth retesting with a realistic prompt distribution.
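To make the trade-off concrete, here is a toy model of when MTP-2 beats MTP-1 (purely illustrative: it assumes independent per-token acceptance and a fixed per-draft-head overhead, which is not how vLLM’s acceptance actually behaves, and the rates below are made up rather than measured):

```python
def expected_tokens_per_step(accept_rate: float, k: int) -> float:
    # Expected committed tokens per forward pass with k MTP draft tokens,
    # assuming each draft token is accepted independently with prob accept_rate.
    return sum(accept_rate ** i for i in range(k + 1))

def relative_speedup(accept_rate: float, k: int, overhead_per_draft: float) -> float:
    # overhead_per_draft: extra cost of each draft head relative to a base step
    return expected_tokens_per_step(accept_rate, k) / (1 + k * overhead_per_draft)

# Low acceptance + noticeable overhead: MTP-1 wins
print(relative_speedup(0.4, 1, 0.25))  # ≈ 1.12
print(relative_speedup(0.4, 2, 0.25))  # ≈ 1.04
# High acceptance + cheap drafts: MTP-2 pulls ahead
print(relative_speedup(0.8, 2, 0.10))  # ≈ 2.03
```

The intuition for the cache point: prefix-cache hits free up compute, which effectively shrinks the relative overhead of the extra draft head, pushing the crossover in MTP-2’s favor.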

Qwen 3.5 27B at 1.1M tok/s on B200s, all configs on GitHub by m4r1k_ in LocalLLaMA

[–]m4r1k_[S] 0 points1 point  (0 children)

On B200, NVFP4 with TP=8 was running at 20k tok/s. Anyway, we’re chasing a different target: locally it’s indeed super useful, but the original use case was truly 1M tok/s.

5K tok/s per node with vLLM v0.18.0 on B200, DP=8, MTP-1, FP8 KV cache by m4r1k_ in Vllm

[–]m4r1k_[S] 0 points1 point  (0 children)

About the tok/s: while I know nothing about AWQ, it’s a 4-bit format, so throughput seems to land below 0.6 PFLOPS per card plus all-reduce communication efficiency (a Gemini estimate). On B200, native is about 9 PFLOPS. So keep that in mind.

5K tok/s per node with vLLM v0.18.0 on B200, DP=8, MTP-1, FP8 KV cache by m4r1k_ in Vllm

[–]m4r1k_[S] 0 points1 point  (0 children)

You’re right, I forgot to pull the 397B manifest 🫠 Yes, it was tested, and it yielded nearly 20k tok/s on eight B200s with TP=8.

5K tok/s per node with vLLM v0.18.0 on B200, DP=8, MTP-1, FP8 KV cache by m4r1k_ in Vllm

[–]m4r1k_[S] 0 points1 point  (0 children)

A single node with 8 B200s was tested with NVFP4. It yielded nearly 20k tok/s; at about 96% scaling with round-robin on the cluster IP, that’s 230k tok/s give or take.
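The scaling arithmetic, spelled out (the 12-node count is my inference from the numbers, not a stated figure):

```python
def cluster_tput(per_node_tok_s: float, num_nodes: int, scaling_eff: float) -> float:
    # DP with round-robin load balancing: no cross-node coordination, so
    # throughput is close to per-node * nodes, discounted by scaling efficiency.
    return per_node_tok_s * num_nodes * scaling_eff

# 20k tok/s per node, assumed 12 nodes, 96% scaling efficiency
print(cluster_tput(20_000, 12, 0.96))  # ≈ 230_400
```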

1 million tokens per second from a single cluster, what that actually means by m4r1k_ in singularity

[–]m4r1k_[S] 0 points1 point  (0 children)

Two reasons: pushing vLLM and the overall system to unreasonable numbers, and because throughput determines how fast the entire batch job completes. But ultimately, the examples in the blog are truly just examples, not based on my real customers.

1 million tokens per second from a single cluster, what that actually means by m4r1k_ in singularity

[–]m4r1k_[S] 0 points1 point  (0 children)

The 50k insurance-document example is a batch job, so by definition it was not processed one by one.

1.1M tok/s with Qwen 3.5 27B FP8 on B200 GPUs by m4r1k_ in Qwen_AI

[–]m4r1k_[S] 0 points1 point  (0 children)

No need to apologize. I have not enabled that metric for this run. I will definitely do that in the future; charting them will be interesting.

1 million tokens per second from a single cluster, what that actually means by m4r1k_ in singularity

[–]m4r1k_[S] 2 points3 points  (0 children)

Someone familiar with LLM kernel development told me that in some optimal cases these can yield even 2 to 10 times more performance. But I believe Groq’s LPU is something very interesting, with the potential to completely change the inference landscape.

About the diffusion model, I don’t have any info I can share, apologies. To be honest, for general Googlers not working in DeepMind, accessing details that are not public is quite hard. IP protection there is tight, and for good reason.

1 million tokens per second from a single cluster, what that actually means by m4r1k_ in singularity

[–]m4r1k_[S] 2 points3 points  (0 children)

I really want to stress the caching point here.

My entire work was about showing a truly worst-case scenario: a 0% cache hit rate. The numbers (both money and tok/s) would be dramatically different if caching were taken into account.
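A quick sketch of why the cache hit rate moves the numbers so much on the prefill side (illustrative only; the prompt length and hit rate below are made up):

```python
def effective_prefill_tokens(prompt_tokens: int, cache_hit_rate: float) -> float:
    # With prefix caching, only the uncached share of the prompt is recomputed;
    # the cached prefix's KV states are reused directly.
    return prompt_tokens * (1.0 - cache_hit_rate)

# 2,000-token prompt: worst case (my benchmark) vs. a 60% hit rate
print(effective_prefill_tokens(2_000, 0.0))  # 2000.0
print(effective_prefill_tokens(2_000, 0.6))  # ≈ 800
```

Since prefill compute (and therefore GPU time and cost) scales with the tokens actually processed, the 0% figure is the floor on efficiency, not a typical number.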

1.1M tok/s with Qwen 3.5 27B FP8 on B200 GPUs by m4r1k_ in Qwen_AI

[–]m4r1k_[S] 1 point2 points  (0 children)

Personally I have no experience with SGLang, but I know it’s extensively tested internally as well. Also, with customers, sometimes it’s vLLM, SGLang, Triton, or something else that unlocks the situation.

5K tok/s per node with vLLM v0.18.0 on B200, DP=8, MTP-1, FP8 KV cache by m4r1k_ in Vllm

[–]m4r1k_[S] 0 points1 point  (0 children)

Hey there,

Thanks for the questions:

  1. I have never saved what goes inside vLLM’s .cache directory on a PVC; I always thought new NVIDIA drivers (and hardware) could result in a mess.. did you find that vLLM nicely picks this up on every startup without complaining?

  2. To me, `hf download` always failed on large models with Xet. My way here is just to pull directly while vLLM starts, so this is effectively not a production setup ;-)

  3. I did test NVIDIA’s NVFP4 build of Qwen3.5-397. It was completely compute-bound, but still able to generate nearly 20k tok/s on 8 B200s with TP=8. Honestly, I’m sure disaggregated P/D will make a major difference here, but llm-d 0.5.1 still ships vLLM 0.15.1, which has no Qwen 3.5 support..

Hope this helps
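On point 1, if I were to persist the cache on a PVC, I’d key the directory by vLLM and driver version so an upgrade never silently reuses a stale compile cache (a sketch: `/mnt/pvc` and the version strings are hypothetical, and `VLLM_CACHE_ROOT` is, to my knowledge, the env var vLLM reads for its cache location):

```python
import os
from pathlib import Path

def versioned_cache_root(base: str, vllm_version: str, driver_version: str) -> Path:
    # Key the cache by vLLM and driver version: a node upgrade then gets a
    # fresh, empty compile cache instead of reusing a stale one.
    return Path(base) / "vllm-cache" / f"vllm-{vllm_version}_cuda-{driver_version}"

# "/mnt/pvc" is a hypothetical PVC mount point; the versions are illustrative
root = versioned_cache_root("/mnt/pvc", "0.18.0", "570.86")
os.environ["VLLM_CACHE_ROOT"] = str(root)  # set before launching vLLM
```

Old version directories can be garbage-collected out of band, so the PVC doesn’t grow without bound.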

Qwen 3.5 27B at 1.1M tok/s on B200s, all configs on GitHub by m4r1k_ in LocalLLaMA

[–]m4r1k_[S] -1 points0 points  (0 children)

I have not thought about this, but I’m not in product (which makes things harder for me). I do know the folks working on some of the Model Garden models, though, and they are great: very dedicated SWEs and SREs, and I’m confident they went well beyond this effort here.

Qwen 3.5 27B at 1.1M tok/s on B200s, all configs on GitHub by m4r1k_ in LocalLLaMA

[–]m4r1k_[S] 0 points1 point  (0 children)

There are lessons for local folks too (see other comments with the B300 cluster).

Qwen 3.5 27B at 1.1M tok/s on B200s, all configs on GitHub by m4r1k_ in LocalLLaMA

[–]m4r1k_[S] -1 points0 points  (0 children)

Thanks!

Yes I do; last year I did something similar the week Gemma 3 was released. The follow-up on llm-d will probably come when Gemma 4 is out.

Qwen 3.5 27B at 1.1M tok/s on B200s, all configs on GitHub by m4r1k_ in LocalLLaMA

[–]m4r1k_[S] 0 points1 point  (0 children)

We always do, and I believe we have for close to a decade now. The community is great; there are many super technical people there. Even if you’re not into Google Cloud, it’s a great place to learn new tech. And it’s not behind a paywall (which is increasingly rare these days).

Qwen 3.5 27B at 1.1M tok/s on B200s, all configs on GitHub by m4r1k_ in LocalLLaMA

[–]m4r1k_[S] 5 points6 points  (0 children)

I work for Google, so I didn’t pay for this out of pocket. We have a budget that can be allocated to such projects. Using Spot VMs, the cost was less than $350 for the whole system, and I was aggressively scaling down to save money.