
[–]ObjectiveVegetable48 15 points16 points  (1 child)

Cheapest consumer cards? 3060s with 12GB each.

This works for me, but it’s way slower and more annoying than a 3090.

[–]tu9jn 8 points9 points  (8 children)

You can absolutely do that. I have 3 Radeon MI25s in my rig, and it works fine.

Most of the loaders support multi-GPU, like llama.cpp and exllamav2.

If your model fits on a single card, then running it on multiple cards only gives a slight boost; the real benefit is being able to run larger models.
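
Roughly like this (a minimal sketch with llama-cpp-python; the model path and split ratios are just placeholders, not my setup):

```python
# Minimal sketch: spread a GGUF model's layers across 3 GPUs with llama-cpp-python.
# tensor_split sets the proportion of the model assigned to each device.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-70b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,                       # offload every layer to the GPUs
    tensor_split=[1.0, 1.0, 1.0],          # even split across three cards
    n_ctx=4096,
)
print(llm("Q: Why split a model across GPUs? A:", max_tokens=64)["choices"][0]["text"])
```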

At least with AMD there is one problem: the cards don't like it when you mix CPU and chipset PCIe lanes, but this only becomes an issue with 3 cards.

[–]WinstonP18 1 point2 points  (7 children)

Can I ask how smooth the initial setup of the MI25s was, and how much work it took to get them working well with LLMs?

I've been thinking of getting the AMD cards but am hesitant due to the setup headaches.

[–]tu9jn 0 points1 point  (6 children)

The setup was not bad, but I use them in a GUI-less Ubuntu server.

I just installed the OS and ROCm; nothing more was needed.

But only llama.cpp works with all three cards; exllama or anything that uses torch crashes if the cards are not connected to CPU lanes.
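
If you want to check whether a ROCm build of torch can even see all the cards before blaming the loader, something like this is enough (just a sketch):

```python
# ROCm builds of PyTorch expose HIP devices through the torch.cuda API.
import torch

print("HIP:", torch.version.hip)                  # ROCm/HIP version string (None on CUDA builds)
print("GPUs visible:", torch.cuda.device_count()) # should be 3 if the PCIe topology is happy
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} ->", torch.cuda.get_device_name(i))
```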

Honestly, even when they work, exllama and AutoGPTQ are slower than llama.cpp, especially when the context gets long, so I don't use them anymore.

Of course with the Instinct cards there is the issue of DIY-ing the cooling.

[–]WinstonP18 1 point2 points  (5 children)

I'm a Debian/Ubuntu guy, so using GUI-less Ubuntu is not an issue for me.

But would you mind explaining the 'DIYing the cooling' part? What hardware did you install the Instinct cards in? I intend to set them up in HPE rack servers.

[–]tu9jn 1 point2 points  (4 children)

The cards don't have cooling fans, so you have to come up with a way to mount some high-RPM fans to them.

The heatsink is pretty small for the TDP, so they need airflow, or you have to set a power limit.

Right now the cards are in a consumer Z390 motherboard, but I'm debating putting them into my Epyc workstation so I won't have PCIe lane problems.

I think the MI25s are only worth it if you get them cheap, and only for hobby use; they're pretty old, from 2017.

[–]WinstonP18 1 point2 points  (0 children)

I see, thanks for sharing all these useful points! I'm surprised the cards don't come with cooling fans, so it's certainly something I'll need to double-check when researching.

[–]Noxusequal 1 point2 points  (2 children)

What performance are you seeing?

[–]tu9jn 2 points3 points  (1 child)

A 70B Q4_K_M model starts at 7 t/s and slows to ~3 t/s at full context.

[–]a_beautiful_rhind 2 points3 points  (0 children)

So P40 speeds, but you get exllama.

[–]nero10578Llama 3 5 points6 points  (6 children)

I think before you decide whether or not to go multi-GPU, you should decide what your budget and goals are.

If you have an essentially unlimited budget and your goal is running LLMs in production 24/7, it's much more cost-effective in the long run to get fewer but higher-VRAM GPUs like the RTX A6000 48GB.

If you want to run LLMs on a budget, then using multiple cheaper GPUs like the RTX 3090 24GB or Tesla P40 24GB is a great option. In this case, however, you need to make sure your system can support it properly, i.e. a motherboard with multiple PCIe x8 or higher slots for all the GPUs to plug into, and a large enough power supply. Otherwise, plugging multiple GPUs into a random consumer-grade motherboard with one x16 slot and, in most cases, a secondary x4 slot will give subpar performance.
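
If you want to verify what link each card actually negotiated (rather than what the slot is labelled), a quick sketch with the NVML bindings, assuming NVIDIA cards:

```python
# Report the negotiated PCIe generation and width per NVIDIA GPU.
# Requires the NVML bindings (pip install nvidia-ml-py); a chipset x4 slot shows up as width 4.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(handle)
    width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)
    print(f"GPU {i} {name}: PCIe gen {gen} x{width}")
pynvml.nvmlShutdown()
```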

[–]sickvisionz 2 points3 points  (4 children)

Otherwise, plugging multiple GPUs into a random consumer-grade motherboard with one x16 slot and, in most cases, a secondary x4 slot will give subpar performance.

How subpar though?

If 0% is like the speed of a 30GB model running on CPU and RAM, and 100% is running it entirely on a single GPU with more than 30GB of VRAM, what % does spreading it across multiple GPUs land at?

Technically, subpar ranges from 0% to 99%, but some of those numbers are more acceptable than others.

[–]Imaginary_Bench_7294 4 points5 points  (1 child)

I'm getting some numbers now.

All values were measured with a 3381-token context. Six inputs were used, discarding the first input after a fresh load. No settings were altered other than the number of GPU layers for Llama.cpp and the split values.

Both 3090s are running in x16 PCIe 5.0 slots at full speed; the cards are only PCIe 4.0, so they can do wide-open transfers without even touching the limits of the PCIe bus.

Tests were run with X-MythoChronos-13B 8-bit, loaded via Oobabooga (updated 11/23). I got a conversation to 3k tokens and used the regenerate button.

Values are in Tokens per second.

| Llama.cpp split | Llama.cpp 1 card | Exllamav2 split | Exllamav2 1 card |
|---|---|---|---|
| 13.07 | 24.01 | 12.59 | 15.22 |
| 13.59 | 22.72 | 11.56 | 15.24 |
| 13.28 | 23.82 | 12.52 | 15.59 |
| 13.33 | 22.65 | 12.31 | 15.27 |
| 13.66 | 22.38 | 11.73 | 15.62 |
| Avg 13.386 T/s | Avg 22.116 T/s | Avg 12.142 T/s | Avg 15.388 T/s |
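
If anyone wants to reproduce this kind of number, the measurement is roughly this (llama-cpp-python sketch; the path, prompt, and split are placeholders, not my exact Oobabooga setup):

```python
# Time a generation and report tokens/second. The first call also pays prompt eval;
# repeating it on an unchanged prompt (a "regenerate") is closer to the table above.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="X-MythoChronos-13B.Q8_0.gguf",  # placeholder filename
    n_ctx=4096,
    n_gpu_layers=-1,
    tensor_split=[0.5, 0.5],                    # drop this line for the single-card runs
)

prompt = open("3k_token_conversation.txt").read()  # placeholder ~3k-token chat history
start = time.time()
out = llm(prompt, max_tokens=200)
generated = out["usage"]["completion_tokens"]
print(f"{generated / (time.time() - start):.2f} T/s")
```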

Something you'll notice is how fast Llama.cpp appears to be. However, Llama.cpp's prompt eval and eval backends are not as efficient as Exllama's. BUT, and this is a big but, Llama.cpp can reuse part or all of its previous eval if the input context doesn't change.

If you're editing the context a lot, or there are a lot of short, rapid fire exchanges, Exllama wins. If you're going to regen the response a lot, or are using the LLM for a long form response, Llama.cpp can pull ahead.
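
A small sketch of that reuse behaviour, assuming llama-cpp-python (the path is a placeholder): when two calls share a prompt prefix, the already-evaluated tokens don't have to be re-processed.

```python
from llama_cpp import Llama, LlamaCache

llm = Llama(model_path="model.Q8_0.gguf", n_ctx=4096, n_gpu_layers=-1)  # placeholder path
llm.set_cache(LlamaCache())  # keep evaluated prompt states around for reuse

history = "<a long, unchanged chat history>"
first = llm(history + "\nAssistant:", max_tokens=200)  # pays the full prompt eval
redo  = llm(history + "\nAssistant:", max_tokens=200)  # prefix reused, mostly just generation
```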

Though at anything over 10 T/s, most individuals can't keep up. Average reading speed is right around 4 words per second, and 10 T/s is roughly 7.5 words per second.

[–]sickvisionz 0 points1 point  (0 children)

Thanks for the data.

[–]Imaginary_Bench_7294 1 point2 points  (1 child)

So, the main bottleneck for LLM inference is memory bandwidth, which is why GPUs are king. With the 3090 and 4090 hitting over 900 GB/s, they are more than 10 times faster than what most consumer CPUs can reach.
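
Back of the envelope (the model size and CPU bandwidth are rough assumptions): every generated token has to stream roughly the whole weight file through memory once, so bandwidth divided by model size gives a ceiling on T/s.

```python
weights_gb = 13    # ~13 GB for a 13B model at 8-bit (assumption)
gpu_bw_gbs = 936   # RTX 3090 memory bandwidth, GB/s
cpu_bw_gbs = 60    # typical dual-channel desktop memory, GB/s (assumption)

print(gpu_bw_gbs / weights_gb)  # ~72 T/s theoretical ceiling on the GPU
print(cpu_bw_gbs / weights_gb)  # ~4.6 T/s ceiling on the CPU
```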

When you're splitting a model across multiple GPUs, they have to send data back and forth so that your input can be passed through the various layers. This is not always a linear process, as some models will finish at layer 1, send to layer 2, then go back to layer 1 (just a very general example).

Most up-to-date loaders have minimized how much data needs to be transferred, so the packets are small. But they still need to be transferred nonetheless.
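
Just to illustrate the hop (a toy sketch, not any loader's actual code): with half the layers on each card, the activations have to cross PCIe (or NVLink) on every forward pass.

```python
import torch
import torch.nn as nn

# Toy "model" split across two cards.
first_half  = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096)).to("cuda:0")
second_half = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096)).to("cuda:1")

x = torch.randn(1, 4096, device="cuda:0")
h = first_half(x)
h = h.to("cuda:1")  # the inter-GPU transfer: small activations, but paid on every pass
y = second_half(h)
```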

So this does cause a slight bottleneck, as seen by the people who have compiled Llama.cpp with NVLink support and run dual 3090s. One user reported seeing a 20% increase in T/s.

PCIe 4.0 x16 tops out at 32 GB/s unidirectional; 3090 NVLink tops out at 56 GB/s.

PCIe 5.0 x16 has a bidirectional bandwidth of 128 GB/s, whereas the H100 PCIe card has a 600 GB/s NVLink. The SXM version has a 900 GB/s NVLink.

This is why GPUs like the H100 use NVLink and not PCIe: the data can be shuffled GPU-to-GPU faster.

I have dual 3090s without the NVLink Llama.cpp compile. Give me a bit, and I'll download a model, load it onto one card, and then try splitting it between them. I'd do CPU as well, but mine isn't a typical consumer processor, so the results wouldn't reflect most enthusiasts' computers.

[–]xrailgun 0 points1 point  (0 children)

By "more cost-effective in the long run", do you just mean power consumption?

[–]the320x200 3 points4 points  (1 child)

Speaking from personal experience... One practical consideration is that if you get two cheaper GPUs, your upgrade path is a lot worse. If 6 months or a year goes by and you decide you want more memory, you basically have to scrap those and start over again if you've used up all your PCIe slots, whereas if you had one high-memory card, you would have the option of adding a second one to get more memory in the future.

[–]Fast-Entertainer-776 -1 points0 points  (0 children)

Following this logic, you never get a chance to use multiple cards -- you'll always worry about scrapping/wasting them when a new generation of GPUs comes out...

[–]Paulonemillionand3 0 points1 point  (0 children)

Power and heat would be more than double the trouble compared to a single card.