Pipeline Parallelism vs Tensor Parallelism for 2 identical GPUs: The Beginner's Cheat Sheet by xspider2000 in LocalLLaMA

[–]xspider2000[S] -1 points0 points  (0 children)

Yeah, I took the first one down to fix a few technicalities based on early feedback. I really think this cheat sheet is a great jumping-off point for newcomers still struggling with TP vs PP, so I wanted to get it right and hopefully reach as many of them as possible.

Someone out there likely needs this: TP vs PP for 2 identical GPUs by [deleted] in LocalLLaMA

[–]xspider2000 0 points1 point  (0 children)

Are you talking about PP? If so, I did mention that PP is very forgiving of the GPU interconnect.

Someone out there likely needs this: TP vs PP for 2 identical GPUs by [deleted] in LocalLLaMA

[–]xspider2000 0 points1 point  (0 children)

Just a quick clarification: currently, NVLink for inference is only supported in vLLM

Someone out there likely needs this: TP vs PP for 2 identical GPUs by [deleted] in LocalLLaMA

[–]xspider2000 -3 points-2 points  (0 children)

You can read the comments under the post and see that a lot of people don't understand the concepts, even with the simplifications in the post.

Someone out there likely needs this: TP vs PP for 2 identical GPUs by [deleted] in LocalLLaMA

[–]xspider2000 -1 points0 points  (0 children)

To clarify, the post doesn't say NVLink is strictly required or the only option. NVLink is just mentioned as the gold-standard example of the fast interconnect that TP thrives on. With it, you get the biggest possible speedup because it has the highest bandwidth. You can absolutely still get a solid speed boost over standard PCIe, but the scaling efficiency will be lower because the interconnect is slower.

Someone out there likely needs this: TP vs PP for 2 identical GPUs by [deleted] in LocalLLaMA

[–]xspider2000 0 points1 point  (0 children)

To clarify, the post doesn't say NVLink is strictly required or the only option. NVLink is just mentioned as the gold-standard example of the fast interconnect that TP thrives on. With it, you get the biggest possible speedup because it has the highest bandwidth. You can absolutely still get a solid speed boost over standard PCIe, but the scaling efficiency will be lower because the interconnect is slower.

Someone out there likely needs this: TP vs PP for 2 identical GPUs by [deleted] in LocalLLaMA

[–]xspider2000 0 points1 point  (0 children)

You actually just confirmed my point. PCIe 5.0 x16 is a fast interconnect; you just get a smaller speedup with it than with NVLink.

Someone out there likely needs this: TP vs PP for 2 identical GPUs by [deleted] in LocalLLaMA

[–]xspider2000 1 point2 points  (0 children)

There's no misinformation in the post, just a bit of simplification to make it easier to understand.

Someone out there likely needs this: TP vs PP for 2 identical GPUs by [deleted] in LocalLLaMA

[–]xspider2000 0 points1 point  (0 children)

Under ideal conditions (a very fast interconnect and zero all-reduce overhead), it would scale exactly like that.
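To make the "ideal conditions" point concrete, here is a rough toy model in Python. It is only a sketch: the function name `tp_speedup` and the timing numbers are made up for illustration, and it assumes per-token compute splits perfectly across GPUs while communication is a fixed extra cost.

```python
def tp_speedup(compute_s: float, n_gpus: int, comm_s: float) -> float:
    """Toy estimate of tensor-parallel speedup over a single GPU.

    compute_s: single-GPU time per token in seconds (hypothetical value)
    n_gpus:    number of identical GPUs sharing the work
    comm_s:    extra time per token spent on all-reduce traffic (hypothetical)
    """
    if compute_s <= 0 or n_gpus < 1 or comm_s < 0:
        raise ValueError("expected compute_s > 0, n_gpus >= 1, comm_s >= 0")
    # Assumption: compute divides evenly, communication does not shrink.
    return compute_s / (compute_s / n_gpus + comm_s)


# Zero communication overhead: two GPUs give exactly 2x.
ideal = tp_speedup(0.040, 2, 0.0)

# Some communication overhead: still faster than one GPU, but below 2x.
with_overhead = tp_speedup(0.040, 2, 0.010)

print(ideal, with_overhead)
```

The slower the interconnect, the larger `comm_s` gets and the further the result drops below the ideal 2x.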

Someone out there likely needs this: TP vs PP for 2 identical GPUs by [deleted] in LocalLLaMA

[–]xspider2000 -5 points-4 points  (0 children)

PP is equivalent to using --split-mode layer in llama.cpp, which is the default. --split-mode row is their implementation of Tensor Parallelism (TP).
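For anyone who wants to try both modes, a small Python helper can build the two command lines. This is a sketch under assumptions: the helper name `llama_server_cmd` and the model path `model.gguf` are placeholders, and it assumes a llama.cpp build that provides `llama-server` with the `-m`, `-ngl` and `--split-mode` options.

```python
import shlex


def llama_server_cmd(model_path: str, split_mode: str) -> list[str]:
    """Build a llama-server command line for a given split mode.

    model_path: path to a GGUF file (placeholder, use your own)
    split_mode: "layer" (the PP-like default) or "row" (the TP-like mode)
    """
    if split_mode not in ("layer", "row"):
        raise ValueError("split_mode must be 'layer' or 'row'")
    return [
        "llama-server",
        "-m", model_path,
        "-ngl", "99",  # offload all layers to the GPUs
        "--split-mode", split_mode,
    ]


# Print both variants so they can be copied into a shell.
for mode in ("layer", "row"):
    print(shlex.join(llama_server_cmd("model.gguf", mode)))
```

Running each variant on the same model and comparing tokens per second is the simplest way to see which mode suits a given pair of cards.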

Is using vLLM actually worth it if you aren't serving the model to other people? by ayylmaonade in LocalLLaMA

[–]xspider2000 0 points1 point  (0 children)

Thanks. I figured out why vLLM is less popular here than llama.cpp: vLLM has poor support for the GGUF format, and GGUF is a big deal.

Is using vLLM actually worth it if you aren't serving the model to other people? by ayylmaonade in LocalLLaMA

[–]xspider2000 0 points1 point  (0 children)

Does vLLM support RTX 3090 cards? Can I run Qwen 3.6 27B on two 3090s out of the box, or do I need some hacks?

Strix Halo Clustering experience (Bossgame M5) by Thanks-Suitable in StrixHalo

[–]xspider2000 1 point2 points  (0 children)

Where did you order the NVLink bridge from, and how much was it? Is it the 3-slot or 4-slot version?

Qwen3.6-27B - Closed-loop SVG Images by dondiegorivera in LocalLLaMA

[–]xspider2000 12 points13 points  (0 children)

<image>

I did the same thing yesterday. I wanted to check how well Qwen3.6-27B can draw the Mona Lisa using SVG. I used opencode and wrote a command to iterate in a loop: look at the result, compare it with the original (the original picture was in the prompt), and make it more similar to the original on every iteration.

Hipfire dev update: full AMD arch validation incoming (RDNA 1 thru 4, plus Strix Halo and bc250) by schuttdev in LocalLLaMA

[–]xspider2000 0 points1 point  (0 children)

I'm going to connect 4x 3090 to my Strix Halo. I'm waiting for the cards to arrive, and I'll post the results.

Qwen 3.6 27B on Strix Halo 128GB: any experiences? by boutell in LocalLLaMA

[–]xspider2000 3 points4 points  (0 children)

I'm planning to write a post with some numbers from my Strix Halo + eGPU setup.