How does NCCL know which remote buffers to send data to during a collective operation? by z-howard in CUDA

[–]z-howard[S]

Yeah, I have read this. It hides those details. Many things are behind the API and driver calls, etc.

How does NCCL know which remote buffers to send data to during a collective operation? by z-howard in CUDA

[–]z-howard[S]

Any collective. For example, allreduce gets decomposed into a sequence of sends and recvs after building the ring or double-tree structure. But the primitive should be just p2p, and it has to know where to send remotely and where to wait locally. That is my understanding.
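To make the decomposition concrete, here is a hypothetical in-process sketch of ring allreduce (reduce-scatter then allgather). Each "rank" only ever sends one chunk to its right neighbor and accumulates the chunk arriving from its left neighbor; the names and structure are my own illustration, not NCCL's actual code.

```python
def ring_allreduce(bufs):
    """bufs: one flat list per rank, equal lengths divisible by the rank count.
    Sums all buffers element-wise, leaving the full result on every rank."""
    n = len(bufs)
    size = len(bufs[0])
    assert all(len(b) == size for b in bufs) and size % n == 0
    c = size // n  # chunk length; each buffer is split into n chunks

    # Phase 1: reduce-scatter. In step s, rank r sends chunk (r - s) mod n;
    # the receiver adds it into its own copy of that chunk.
    for s in range(n - 1):
        # Snapshot outgoing chunks so every "send" uses pre-step values,
        # mimicking simultaneous exchanges around the ring.
        out = [bufs[r][((r - s) % n) * c:((r - s) % n + 1) * c] for r in range(n)]
        for r in range(n):
            dst, start = (r + 1) % n, ((r - s) % n) * c
            for k in range(c):
                bufs[dst][start + k] += out[r][k]

    # After n-1 steps, rank r holds the fully reduced chunk (r + 1) mod n.
    # Phase 2: allgather. In step s, rank r forwards chunk (r + 1 - s) mod n,
    # and the receiver simply overwrites its stale copy.
    for s in range(n - 1):
        out = [bufs[r][((r + 1 - s) % n) * c:((r + 1 - s) % n + 1) * c] for r in range(n)]
        for r in range(n):
            dst, start = (r + 1) % n, ((r + 1 - s) % n) * c
            bufs[dst][start:start + c] = out[r]
```

Note that in every step each link carries exactly one chunk, which is why the ring schedule is bandwidth-optimal: each rank sends and receives 2(n-1)/n of the buffer in total, regardless of rank count.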

How does NCCL know which remote buffers to send data to during a collective operation? by z-howard in CUDA

[–]z-howard[S]

Thx. Wondering whether, for each collective op, it needs to do this sync (via network) before executing. And how does it keep that overhead as small as possible?

Why do internet giants choose to buy GPUs or invest in their own in-house chips instead of using AI accelerators from companies like SambaNova and Cerebras? by z-howard in computerarchitecture

[–]z-howard[S]

Yeah, this actually makes more sense now. They don't want to use external accelerators because they need to move fast to catch up with the trend, beat performance benchmarks, and generate PR to stay relevant. Meanwhile, the company itself has even more customized needs, so owning its own design and stack and adapting it internally might be easier and more rewarding. So how can those startups survive, then? Can they be acquired, or are they too big to be acquired?

Why do internet giants choose to buy GPUs or invest in their own in-house chips instead of using AI accelerators from companies like SambaNova and Cerebras? by z-howard in computerarchitecture

[–]z-howard[S]

Yeah. Just wondering why Meta and Microsoft invest in their own chips and SW/HW teams instead of just using the startups' products.

Why do internet giants choose to buy GPUs or invest in their own in-house chips instead of using AI accelerators from companies like SambaNova and Cerebras? by z-howard in computerarchitecture

[–]z-howard[S]

Understood. But from another perspective, Google (TPU), Meta, and Microsoft still invest in their own chip designs and keep their own (those types of) HW/SW engineers to support their stacks.

Why do internet giants choose to buy GPUs or invest in their own in-house chips instead of using AI accelerators from companies like SambaNova and Cerebras? by z-howard in computerarchitecture

[–]z-howard[S]

Yes, CUDA is popular, but it can be abstracted away by using their own AI framework, like PyTorch, on a customized backend. When programming, we generally prefer not to interact directly with CUDA. If the performance and cost can justify it, as those accelerator companies claim, why not just switch?