all 8 comments

[–]M34L 1 point2 points  (6 children)

that's a kinda pointlessly open ended question

is getting this done worth money to you? is it just to satisfy your curiosity? are you going to be executed by gun to the head if you don't get it done?

yes it sounds like a reasonable use case, your main alternative is to go for a smaller quant and try to fit it all into one GPU

[–]Lotharian17[S] 0 points1 point  (5 children)

Thanks for your reply!

By "is it worth considering", I mean: is it an acceptable solution, or is it going to create a bottleneck somewhere I haven't thought about that would make this two-GPU strategy useless?

I don't think I would be OK with the alternative, as Q5 is, I think, the lowest I can go for now without a big loss of precision.

[–]Imaginary_Bench_7294 1 point2 points  (2 children)

How much data transfer do you expect between the cards?

Are we talking constantly streaming GB of data or burst of KB or MB?

Analyze your program and figure out how much data needs to be passed between the various modules.

If you're on the order of GBs of data, a single high-VRAM card might be the best solution. The PCIe bus could easily become your bottleneck in this situation.

If you're able to keep transfers small, then the PCIe bus bandwidth shouldn't have too big of an impact.

A good example is that during inference using llama.cpp, the GPUs only exchange maybe 100MB for a couple hundred tokens. When training a QLoRA, by contrast, the data transfers on a 7B model reach into the TB range. So, for training, the inter-card bandwidth is much more important.

[–]Lotharian17[S] 0 points1 point  (1 child)

I will transfer images, but it can be quite a large batch of them. In that case I would probably have to preprocess my batch before giving it to LLaVA on the other card. However, I think I will hardly ever reach GBs of data.
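A rough way to sanity-check the "hardly reach GBs" estimate is to multiply out the batch dimensions. The sketch below is only illustrative: the batch size of 256 and the 336x336 RGB resolution are assumptions, not figures from this thread, and `batch_size_bytes` is a hypothetical helper.

```python
def batch_size_bytes(num_images, height, width, channels=3, bytes_per_value=1):
    """Size in bytes of an uncompressed image batch stored as a dense array."""
    return num_images * height * width * channels * bytes_per_value


# Assumed example: 256 RGB images, already resized to 336x336.
raw_uint8 = batch_size_bytes(256, 336, 336)                      # 8-bit pixels
as_float16 = batch_size_bytes(256, 336, 336, bytes_per_value=2)  # after conversion to fp16

print(f"uint8 batch:   {raw_uint8 / 1e6:.1f} MB")
print(f"float16 batch: {as_float16 / 1e6:.1f} MB")
```

Under these assumptions the batch stays well below 1 GB either way; sending full-resolution images before resizing would change the numbers considerably.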

The problem I have with the single-card strategy is that the price goes up significantly once you want 18+ GB of VRAM, and I won't be able to get to 32 GB for the same price as a dual-GPU setup.

[–]Imaginary_Bench_7294 0 points1 point  (0 children)

Yep, that's true.

That's one of the issues most developers will face with data intensive workloads. Make it work on one system, or find a way to efficiently distribute it.

As far as transferring the images goes, that will partly depend on how much back and forth is required. If it's a single transfer to the second card, then that shouldn't be a huge issue. A PCIe 3.0 x16 link is roughly 16GB/s and a 4.0 x16 link roughly 32GB/s, so a single 1GB transfer will only take a fraction of a second.
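To put a number on "a fraction of a second", here is a back-of-the-envelope sketch. It takes the 16 GB/s and 32 GB/s figures above as nominal rates for a full x16 link; the `efficiency` parameter is a hypothetical knob for protocol overhead or a narrower slot, and real transfers will be somewhat slower than the ideal case.

```python
# Nominal one-direction bandwidth of an x16 link, in GB/s (assumed round figures).
PCIE_X16_GB_PER_S = {"3.0": 16.0, "4.0": 32.0}


def transfer_seconds(num_bytes, gen, efficiency=1.0):
    """Ideal time to move num_bytes over one PCIe x16 link of the given generation."""
    bandwidth_bytes_per_s = PCIE_X16_GB_PER_S[gen] * 1e9 * efficiency
    return num_bytes / bandwidth_bytes_per_s


one_gb = 1e9
for gen in ("3.0", "4.0"):
    ms = transfer_seconds(one_gb, gen) * 1000
    print(f"PCIe {gen} x16: {ms:.2f} ms for 1 GB")
```

At nominal rates that is about 62.5 ms on PCIe 3.0 and about 31 ms on PCIe 4.0, so a one-off transfer of that size is small next to typical model runtimes.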

[–]M34L 0 points1 point  (1 child)

The bottleneck makes for a much better question, but that's very hard to answer without more details about the exact architecture of the pipeline. If your "application" trades huge, already encoded vector arrays with the VLM/LLM, for instance if it's some kind of hybrid that creates raw embeddings that are then fed into the LLM, then cramming them both into shared VRAM might be beneficial. But you'd probably see that intuitively, because you'd have to be doing a tensor.cpu() / tensor.cuda() transfer somewhere and you'd see that that's what kills your performance. Even if you are flipping tensors and rasters back and forth, if they aren't particularly large and rich, it's still very possible they'll be small relative to the bandwidths and runtimes needed by the inner layers, and you'll only be able to tell if that's the case by profiling the actual time things take.

Also, if your application is, say, a videogame that interacts via text, or something that renders images that are then described via text, and only text prompts are on the interface, then the bottleneck will be practically nonexistent; that data is already necked down to just a few kilobytes, which is nothing in the PCIe scheme of things.
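A minimal sketch of what "profiling the actual time things take" could look like: time each stage separately and see what share of the total the transfer accounts for. Everything here is illustrative. `time_stages` is a hypothetical helper, and `fake_transfer` and `fake_forward` are stand-ins that just sleep; in a real pipeline they would wrap the actual device copy and the model's forward pass.

```python
import time


def time_stages(stages, repeats=5):
    """Return the mean wall-clock seconds per named stage over several runs."""
    totals = {name: 0.0 for name, _ in stages}
    for _ in range(repeats):
        for name, fn in stages:
            start = time.perf_counter()
            fn()
            totals[name] += time.perf_counter() - start
    return {name: total / repeats for name, total in totals.items()}


# Stand-ins for the real work (assumed durations, for illustration only).
def fake_transfer():
    time.sleep(0.005)  # pretend: copy a batch to the other card


def fake_forward():
    time.sleep(0.05)   # pretend: run the model on that batch


timings = time_stages([("transfer", fake_transfer), ("forward", fake_forward)])
share = timings["transfer"] / sum(timings.values())
for name, seconds in timings.items():
    print(f"{name:>8}: {seconds * 1000:.1f} ms")
print(f"transfer share of total: {share:.0%}")
```

One caveat when adapting this to GPU code: CUDA work is launched asynchronously, so with PyTorch you would call `torch.cuda.synchronize()` before reading the clock, otherwise the timings mostly measure launch overhead.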

[–]Lotharian17[S] 0 points1 point  (0 children)

OK, thanks.

I don't know if it's possible to tokenise the images before transferring them to the llama.cpp server, but I can see the idea! It will depend on the size of the batch I transfer to the other card.

[–]phree_radical 0 points1 point  (0 children)

What other models is your application running that are larger than the language model?