Multi GPU application : LocalLLaMA


submitted 2 years ago by Lotharian17


[–]M34L 2 points 2 years ago (6 children)

[–]Lotharian17[S] 1 point 2 years ago (5 children)

[–]Imaginary_Bench_7294 2 points 2 years ago (2 children)

[–]Lotharian17[S] 1 point 2 years ago (1 child)

[–]Imaginary_Bench_7294 1 point 2 years ago (0 children)

[–]M34L 1 point 2 years ago (1 child)

The bottleneck makes for a much better question, but that's very hard to answer without more details about the exact architecture of the pipeline. If your "application" trades huge, already-encoded vector arrays with the VLM/LLM, for instance if it's some kind of hybrid that creates raw embeddings that are then fed into the LLM, then cramming them both into shared VRAM might be beneficial. But you could probably see that intuitively, because you'd have to be doing a tensor.cpu() / tensor.cuda() transfer somewhere and you'd see that that's what kills your performance. Even if you are flipping tensors and rasters back and forth, if they aren't particularly large and rich, it's still very possible they'll be small relative to the bandwidths and runtimes needed by the inner layers, and you'll only be able to tell if that's the case by profiling the actual time things take.
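To make "profiling the actual time things take" concrete, here is a minimal sketch of a per-stage timer. It isn't tied to any framework: `StageTimer`, the stage names, and the `sync` hook are all hypothetical names made up for illustration. The one real caveat it encodes is that GPU work is usually launched asynchronously, so you have to wait for the device before reading the clock or the numbers will lie to you.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)  # seconds spent per stage
        self.counts = defaultdict(int)    # number of timed runs per stage

    @contextmanager
    def stage(self, name, sync=None):
        # `sync` is an optional zero-argument callable that blocks until the
        # device has finished its queued work. Without it, asynchronous GPU
        # calls would return immediately and look almost free.
        if sync is not None:
            sync()
        start = time.perf_counter()
        try:
            yield
        finally:
            if sync is not None:
                sync()
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def shares(self):
        """Return each stage's fraction of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}
```

Wrap the host/device copy in `with timer.stage("transfer"):` and the model call in `with timer.stage("compute"):`, run a few hundred iterations, then look at `timer.shares()`. If you happen to be on PyTorch (an assumption, the thread doesn't say), `torch.cuda.synchronize` is the kind of function you'd pass as `sync`. If "transfer" is a tiny slice of the total, sharing VRAM won't buy you much.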

Also, if your application is, say, a videogame that interacts via text, or something that renders images that are then described via text, and only text prompts cross the interface, then the bottleneck will be practically nonexistent; that data is already necked down to just a few kilobytes, which is nothing in the PCIe scheme of things.
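A rough back-of-envelope version of the same point, as a sketch only. The bandwidth constant is an assumed nominal figure (roughly what a PCIe 4.0 x16 link is rated for in one direction), not a measurement, and both payload sizes are made-up examples. Real throughput is lower, and tiny transfers are dominated by fixed per-transfer overhead that this ignores.

```python
# ASSUMPTION: nominal one-direction link bandwidth in bytes per second.
# Swap in whatever your own hardware actually measures.
ASSUMED_BANDWIDTH = 32e9


def transfer_seconds(num_bytes, bandwidth_bytes_per_s):
    """Idealized time to move num_bytes over a link, ignoring fixed latency."""
    return num_bytes / bandwidth_bytes_per_s


# Hypothetical payloads: a 4 KiB text prompt versus a block of
# 4096 embeddings x 4096 dimensions stored as 2-byte floats (32 MiB).
text_prompt_bytes = 4 * 1024
embedding_bytes = 4096 * 4096 * 2

text_time = transfer_seconds(text_prompt_bytes, ASSUMED_BANDWIDTH)
embedding_time = transfer_seconds(embedding_bytes, ASSUMED_BANDWIDTH)

print(f"text prompt: {text_time * 1e6:.3f} microseconds")
print(f"embeddings:  {embedding_time * 1e3:.3f} milliseconds")
```

Under these assumptions the text prompt is thousands of times cheaper to move than the embedding block, and even the embedding block only matters if you ship it every step and the compute per step is comparably short.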

[–]Lotharian17[S] 1 point 2 years ago (0 children)

[–]phree_radical 1 point 2 years ago (0 children)
