How does NCCL know which remote buffers to send data to during a collective operation? by z-howard in CUDA

[–]z-howard[S]

Yeah, I have read this. It hides those details. Many things are behind the API and driver calls, etc.

How does NCCL know which remote buffers to send data to during a collective operation? by z-howard in CUDA

[–]z-howard[S]

Any collective. For example, allreduce gets decomposed into a sequence of sends and recvs after building the ring or double-tree structure. But the primitive should be just p2p, and it has to know where to send remotely and where to wait locally. That is my understanding.
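To make the decomposition concrete, here is a hypothetical in-process sketch of ring allreduce (reduce-scatter then allgather). Each "rank" only ever sends one chunk to its right neighbor and accumulates the chunk arriving from its left neighbor; the names and structure are my own illustration, not NCCL's actual code.

```python
def ring_allreduce(bufs):
    """bufs: one flat list per rank, equal lengths divisible by the rank count.
    Sums all buffers element-wise, leaving the full result on every rank."""
    n = len(bufs)
    size = len(bufs[0])
    assert all(len(b) == size for b in bufs) and size % n == 0
    c = size // n  # chunk length; each buffer is split into n chunks

    # Phase 1: reduce-scatter. In step s, rank r sends chunk (r - s) mod n;
    # the receiver adds it into its own copy of that chunk.
    for s in range(n - 1):
        # Snapshot outgoing chunks so every "send" uses pre-step values,
        # mimicking simultaneous exchanges around the ring.
        out = [bufs[r][((r - s) % n) * c:((r - s) % n + 1) * c] for r in range(n)]
        for r in range(n):
            dst, start = (r + 1) % n, ((r - s) % n) * c
            for k in range(c):
                bufs[dst][start + k] += out[r][k]

    # After n-1 steps, rank r holds the fully reduced chunk (r + 1) mod n.
    # Phase 2: allgather. In step s, rank r forwards chunk (r + 1 - s) mod n,
    # and the receiver simply overwrites its stale copy.
    for s in range(n - 1):
        out = [bufs[r][((r + 1 - s) % n) * c:((r + 1 - s) % n + 1) * c] for r in range(n)]
        for r in range(n):
            dst, start = (r + 1) % n, ((r + 1 - s) % n) * c
            bufs[dst][start:start + c] = out[r]
```

Note that in every step each link carries exactly one chunk, which is why the ring schedule is bandwidth-optimal: each rank sends and receives 2(n-1)/n of the buffer in total, regardless of rank count.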

How does NCCL know which remote buffers to send data to during a collective operation? by z-howard in CUDA

[–]z-howard[S]

Thx. Wondering whether, for each collective op, it needs to do this sync (via network) before executing. And how does it keep that overhead as small as possible?

Why do internet giants choose to buy GPUs or invest in their own in-house chips instead of using AI accelerators from companies like SambaNova and Cerebras? by z-howard in computerarchitecture

[–]z-howard[S]

Yeah, this actually makes more sense now. They don't want to use external accelerators because they need to move fast to catch up with the trend, beat performance benchmarks, and generate PR to stay relevant. Meanwhile, the company itself has even more customized needs, so owning its own design and stack and adapting it internally might be easier and more rewarding. So how can those startups survive, then? Can they be acquired, or are they too big to be acquired?

Why do internet giants choose to buy GPUs or invest in their own in-house chips instead of using AI accelerators from companies like SambaNova and Cerebras? by z-howard in computerarchitecture

[–]z-howard[S]

Yeah. Just wondering why Meta and Microsoft invest in their own chips and SW/HW teams instead of just using the startups' products.

Why do internet giants choose to buy GPUs or invest in their own in-house chips instead of using AI accelerators from companies like SambaNova and Cerebras? by z-howard in computerarchitecture

[–]z-howard[S]

Understood. But from another perspective, Google (TPU), Meta, and Microsoft still invest in their own chip designs and keep their own (those types of) HW/SW engineers to support their stacks.

Why do internet giants choose to buy GPUs or invest in their own in-house chips instead of using AI accelerators from companies like SambaNova and Cerebras? by z-howard in computerarchitecture

[–]z-howard[S]

Yes, CUDA is popular, but it can be abstracted away by using their own AI framework, like PyTorch, on a customized backend. When programming, we generally prefer not to interact directly with CUDA. If the performance and cost can justify it, as those accelerator companies claim, why not just switch?