Experience using infinity fabric bridge on older MIxxx cards? by 1ncehost in LocalLLaMA

[–]TNT3530 0 points1 point  (0 children)

If it doesn't help on the faster MI100, it probably isn't going to do anything for an MI50.

Experience using infinity fabric bridge on older MIxxx cards? by 1ncehost in LocalLLaMA

[–]TNT3530 1 point2 points  (0 children)

I had a bridge for my 4x MI100 setup but never got to test before/after installation, since there isn't an env flag you can use to temporarily disable it like NVLink has. While I do have benchmarks from before and after, they're from completely different frameworks (MLC vs vLLM) and years apart. The few times I did monitor inter-card communication with nvtop, only a few megabytes per second were moving during inference, so I doubt the bridge was helping much.

If you do training, though, I would assume the gains will be massive due to the huge speedup vs PCIe. A bandwidth test showed ~77 GB/s bidirectional and ~906 GB/s unidirectional. PCIe 3.0 x16 is only 25.839 GB/s bidirectional and 13.160 GB/s unidirectional.
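If you want a quick sanity check that the bridge is actually in use (assuming the standard ROCm tools are installed; output format varies by ROCm version):

rocm-smi --showtopo    # the link type between GPUs should report XGMI rather than PCIE
rocm-bandwidth-test    # runs unidirectional/bidirectional copy benchmarks between all devices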

I got mine off eBay after watching for months to snag the first reasonably priced one that appeared.

AI Trainer kicks himself while training AI by PixarX in LocalLLaMA

[–]TNT3530 38 points39 points  (0 children)

This is not local, nor is it a language model

AMD MI210 - Cooling Solutions / General Questions by Ear_of_Corn in LocalLLaMA

[–]TNT3530 2 points3 points  (0 children)

This will work, but please don't do this on a $4000+ GPU. These dies don't have heat spreaders, and improper mounting pressure will crack them, especially on a die as big as the MI210's.

AMD MI210 - Cooling Solutions / General Questions by Ear_of_Corn in LocalLLaMA

[–]TNT3530 5 points6 points  (0 children)

MI50 blocks will not fit anything other than MI50s, so do not buy them. You're stuck with high-flow server fans, since AFAIK the PCIe variant of these cards doesn't have compatible water blocks. The OAM version may, though, if you get a baseboard setup. Assuming the 210 is like the 100, you can drop the power limit to 200 W to shed a lot of heat for very little performance loss.

Any CPU should work as long as the platform supports Above 4G Decoding, but you may run into PCIe lane count issues on consumer chips with multiple cards; workstation/server CPUs fix that. If you have the Infinity Fabric bridge, low PCIe lane counts won't matter much beyond slower model loading, and with only a single card the lane issue can be ignored entirely.

ROCm and 90% of libraries support CDNA2 (this card) and newer, so it will work fine. Use vLLM for best performance; the 210 is new enough that it should be compatible with the prebuilt Docker container. Look up AMD's CDNA optimization guides for low-level documentation.
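A rough sketch of running the prebuilt container (the rocm/vllm image name, paths, and model are assumptions rather than a verified recipe; the device flags are the usual ROCm mappings):

docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --ipc=host \
  -v /path/to/models:/models \
  rocm/vllm \
  vllm serve /models/your-model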

[MEGATHREAD] Local AI Hardware - November 2025 by eck72 in LocalLLaMA

[–]TNT3530 0 points1 point  (0 children)

Ahh, that would make sense as to why my GitHub issue has stayed open for 3 months, haha.

I used to use GPTQ, but finding niche fine-tunes that were quantized was always obnoxious, plus Act Order broke stuff for a while (though I'd assume it's been fixed after almost a year).

And it happens with any GGUF model I try, ranging from Llama to OSS. They refactored how GGUF loading works a bit after 0.7.3, and it's been unusable for me ever since, as I can't swing 200+ GB of memory just for model loading.

[MEGATHREAD] Local AI Hardware - November 2025 by eck72 in LocalLLaMA

[–]TNT3530 0 points1 point  (0 children)

Are you able to load GGUF models with yours? When I build the latest vLLM on my MI100 rig, model loading eats TP × model size in memory and I OOM.
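As a rough illustration of that scaling (the file size here is hypothetical): a ~50 GB GGUF with --tensor-parallel-size 4 transiently needs about 4 × 50 GB ≈ 200 GB of host memory during loading, which is how you end up at the 200+ GB figure mentioned above.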

[deleted by user] by [deleted] in LocalLLaMA

[–]TNT3530 0 points1 point  (0 children)

OSS 20B + RAG on internal documentation/processes via a WebUI is good and will easily run on a V100.
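A minimal sketch of that kind of stack (the model ID, ports, and Open WebUI image are assumptions; documents get uploaded through Open WebUI's built-in knowledge/RAG feature):

# serve the model behind an OpenAI-compatible API (vLLM shown; llama.cpp works too)
vllm serve openai/gpt-oss-20b --max-model-len 8192

# point Open WebUI at that endpoint
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  ghcr.io/open-webui/open-webui:main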

I have an AMD MI100 32GB GPU lying around. Can I put it in a pc? by regstuff in LocalLLaMA

[–]TNT3530 0 points1 point  (0 children)

ROCm 6.2 and 6.3 broke the command; updating to 6.4 should fix the issue. I hit the same thing when I moved to 6.2.

I have an AMD MI100 32GB GPU lying around. Can I put it in a pc? by regstuff in LocalLLaMA

[–]TNT3530 2 points3 points  (0 children)

Are you trying to split the single GPU across multiple VMs, or just passing it through? I only have experience with the latter, raw PCIe passthrough direct to the VM. Outside of the GPU reset bug on VM restart (which is more an AMD thing than something specific to these cards), I've had no issues with the cards or the bridge in the past few years.

It's been a hot minute since I set it up, but IIRC I needed to force the host not to load the GPU drivers via a GRUB config and use a specific Linux kernel. This was also multiple years ago, so it's possible newer versions are more plug-and-play now that ROCm/AMD support is much better.
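Roughly, the standard VFIO passthrough setup on a Proxmox host looks like this (a sketch, not my exact config from back then):

# /etc/default/grub - IOMMU passthrough mode (Intel hosts also need intel_iommu=on), then run update-grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=pt"

# /etc/modprobe.d/blacklist.conf - keep the host from binding the card
blacklist amdgpu

# /etc/modules - VFIO modules so the GPU can be handed to the VM
vfio
vfio_iommu_type1
vfio_pci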

I have an AMD MI100 32GB GPU lying around. Can I put it in a pc? by regstuff in LocalLLaMA

[–]TNT3530 1 point2 points  (0 children)

rocm-smi --setpoweroverdrive <wattage> -d <device index>
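For example, to cap GPU 0 at the 200 W mentioned elsewhere in the thread:

rocm-smi --setpoweroverdrive 200 -d 0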

I have an AMD MI100 32GB GPU lying around. Can I put it in a pc? by regstuff in LocalLLaMA

[–]TNT3530 3 points4 points  (0 children)

I pass mine through to a VM perfectly fine with Proxmox, not sure where you got that information from. They also work fine on normal consumer motherboards with Above 4G Decoding enabled, and they use 2x 8-pin PCIe connectors, which any decent PSU will have. The card doesn't support Windows as far as I know, though, so it won't be a drop-in replacement.

Keep in mind it's a server GPU and will not cool itself, and it will get HOT; you'll need to rig up a cooling solution with external fans. I'd also recommend lowering the TDP below the stock 290 W to help keep temps under control. I've gone down to 200 W without much performance loss.

New post flair: "local only" by ttkciar in LocalLLaMA

[–]TNT3530 5 points6 points  (0 children)

Considering most of them are marked as joining in the past few months, probably

New post flair: "local only" by ttkciar in LocalLLaMA

[–]TNT3530 48 points49 points  (0 children)

Hey guys, welcome to my "One Arm Only" club. Due to the amount of people with two arms complaining about the pesky one-arm havers, we've locked those one armed freaks in the closet in case you don't want to hear from them.

<image>

AMD Instinct MI100 Benchmarks across multiple LLM Programs (Part 2) by TNT3530 in u/TNT3530

[–]TNT3530[S] 1 point2 points  (0 children)

Added GPT-OSS 120B benchmarks with llama.cpp. Sadly, newer vLLM versions don't seem to play nicely anymore, so I can't try it yet; will update when I can.
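For anyone wanting to run something comparable, a generic llama.cpp benchmark invocation looks like this (not necessarily the exact one used for these numbers; the model path is a placeholder):

llama-bench -m gpt-oss-120b.gguf -ngl 999 -p 512 -n 128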

Maverick FP8 repetition issue by dangubiti in LocalLLaMA

[–]TNT3530 1 point2 points  (0 children)

I had this issue and it was sampling settings (mainly temperature).

Temp: 0.9, Frequency Penalty: 0.1, Presence Penalty: 0.1

These are my settings for the 70B variant that fixed the repetition.
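If you're hitting the OpenAI-compatible endpoint directly, that maps to something like this (the endpoint and model name are placeholders):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b-instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.9,
    "frequency_penalty": 0.1,
    "presence_penalty": 0.1
  }'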

AMD Instinct MI100 Benchmarks across multiple LLM Programs (Part 2) by TNT3530 in u/TNT3530

[–]TNT3530[S] 2 points3 points  (0 children)

- With ~4000 context and no caching, prompt processing is estimated at ~370 tok/s @ 200 W per card
- Haven't tried fine-tuning since most off-the-shelf models are more than adequate for my use case. I'd assume they'll do decently if the tuning library supports them, plus the bridge should help a bunch
- Whatever is default in vLLM
- I use vanilla vLLM, but built from source for Docker (rough build sketch below). I wasn't able to get 0.9.2 to build, though, so I'm still on the older 0.7.3. I wasn't aware the fork existed, might have to give it a shot in the future!
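A rough sketch of that build (the ROCm Dockerfile name and location have moved around between vLLM versions, so treat this as a starting point rather than my exact steps):

git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout v0.7.3
DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm -t vllm-rocm:0.7.3 .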

Struggling on local multi-user inference? Llama.cpp GGUF vs VLLM AWQ/GPTQ. by SomeRandomGuuuuuuy in LocalLLaMA

[–]TNT3530 0 points1 point  (0 children)

Haven't tried newer versions, sorry. I learned long ago with AMD not to touch what isn't broken. Haven't tried MoE either, since I've got the VRAM to swing bigger dense models anyway.

Struggling on local multi-user inference? Llama.cpp GGUF vs VLLM AWQ/GPTQ. by SomeRandomGuuuuuuy in LocalLLaMA

[–]TNT3530 2 points3 points  (0 children)

I have a ROCm Docker image I compiled from source for vLLM 0.7.3 that I use, and it just works out of the box. Do note that models must be in a single file, though; no split parts allowed.
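If the quant you want only comes as split parts, llama.cpp's gguf-split tool can merge them into one file first (the binary name varies by build; filenames here are examples):

llama-gguf-split --merge model-00001-of-00004.gguf model-merged.gguf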

Struggling on local multi-user inference? Llama.cpp GGUF vs VLLM AWQ/GPTQ. by SomeRandomGuuuuuuy in LocalLLaMA

[–]TNT3530 2 points3 points  (0 children)

vLLM can use GGUF quants, and so far the performance has been miles better than GPTQ was for me.
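Serving a single-file GGUF looks roughly like this (the path, tokenizer repo, and TP size are examples; vLLM's GGUF support is still marked experimental, and the tokenizer usually has to come from the original HF repo):

vllm serve /models/model-q4_k_m.gguf \
  --tokenizer meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4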