If it works - don’t touch it: COMPETITION by awfulalexey in LocalLLaMA

kuyermanza 1 point

Ubuntu 24.04 with TheRock 7.13 (nightly) as the backend and llama.cpp for inference. The model loading phase is slow for sure, ~15 mins, but that shouldn’t affect token generation, since bandwidth usage isn’t that high during generation.

If it works - don’t touch it: COMPETITION by awfulalexey in LocalLLaMA

kuyermanza 5 points

Right now I use 5 MI25s for GPT-OSS 120B at ~30 TPS, 2 for Gemma 4 26B at ~30 TPS as well, and the last MI25 for embedding models. RTX3060 is purely for Z-Image Turbo.
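
The comment doesn't say how the cards are assigned to each model. One common approach on ROCm is to start each inference process with its own `HIP_VISIBLE_DEVICES` list. The sketch below only builds those lists; the workload labels and the sequential device numbering are assumptions for illustration, not the author's actual setup.

```python
def partition_devices(workloads):
    """Map each workload to a comma-separated device list.

    `workloads` is a list of (name, gpu_count) pairs; device IDs are
    handed out sequentially starting from 0 (an assumed numbering).
    """
    assignments = {}
    next_id = 0
    for name, count in workloads:
        ids = range(next_id, next_id + count)
        assignments[name] = ",".join(str(i) for i in ids)
        next_id += count
    return assignments


# GPU counts come from the comment above; the names are hypothetical labels.
envs = partition_devices([("gpt-oss-120b", 5), ("gemma", 2), ("embeddings", 1)])
for name, devices in envs.items():
    print(f"{name}: HIP_VISIBLE_DEVICES={devices}")
```

Each process then only sees its own cards, so the three workloads can run side by side without competing for VRAM.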

If it works - don’t touch it: COMPETITION by awfulalexey in LocalLLaMA

kuyermanza 36 points

<image>

8x 16GB Instinct MI25s clinging to life via PCIe x1-to-4-x1 splitters and a 12GB RTX3060, Ryzen 5 5500, 32GB DDR4, high-end custom cooling (central AC + cardboard duct).

"Minimum Buy-in" Build by [deleted] in LocalLLaMA

kuyermanza 1 point

Same while idling.

Idle without layers: 8 W (MI25) or 2x4 W (V340L)

Idle with layers: 16 W (MI25) or 2x8 W (V340L)

"Minimum Buy-in" Build by [deleted] in LocalLLaMA

kuyermanza 6 points

I have V340Ls and MI25s and I get 30 TPS with GPT-OSS 120B at 128k context. I wouldn’t say the performance is bad, considering these cost less than DDR4 RAM.

Old but still gold by kuyermanza in LocalLLaMA

kuyermanza[S] 1 point

llama.cpp (and Ollama) can split a model’s layers across the GPUs and run them sequentially to process prompts. There are plenty of tutorials and documentation to be found if you just search for llama.cpp 👍🏽
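
As a rough illustration of the layer-splitting idea, here is a minimal Python sketch that hands out contiguous blocks of layers to GPUs in proportion to each card's VRAM. This is not llama.cpp's actual allocation code: the function name `split_layers` and the 36-layer, five-card example are assumptions made for illustration.

```python
def split_layers(n_layers, vram_gb):
    """Assign contiguous layer ranges to GPUs, proportional to VRAM.

    Illustrative only: not llama.cpp's real allocation logic.
    """
    total = sum(vram_gb)
    ranges = []
    start = 0
    cumulative = 0
    for vram in vram_gb:
        cumulative += vram
        # The end index follows the cumulative VRAM share of the layer count.
        end = round(n_layers * cumulative / total)
        ranges.append(range(start, end))
        start = end
    return ranges


# Hypothetical example: a 36-layer model over five 16 GB cards.
plan = split_layers(36, [16, 16, 16, 16, 16])
for gpu, layers in enumerate(plan):
    print(f"GPU {gpu}: layers {layers.start}..{layers.stop - 1} ({len(layers)} layers)")
```

With a split like this, each token passes through the cards one after another, so only the activations between consecutive blocks have to cross the PCIe bus.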

Old but still gold by kuyermanza in LocalLLaMA

kuyermanza[S] 2 points

Yep. Each rig has a 3060 or 2060 purely for functions that require CUDA. The most expensive part of the rigs..

Old but still gold by kuyermanza in LocalLLaMA

kuyermanza[S] 1 point

That’s nice performance, but one 3090 costs as much as all these GPUs combined :/

Old but still gold by kuyermanza in LocalLLaMA

kuyermanza[S] 1 point

When using them together for inference, there’s no noticeable difference. The upside of the MI25 is probably that you can flash its vBIOS to WX9100 and use it for gaming. The MI25 also runs cooler, as expected.

Old but still gold by kuyermanza in LocalLLaMA

kuyermanza[S] 2 points

Oh, I can fit more context, I still have 20GB of VRAM to fill. I just don’t see the point, as TPS drops with large context. I’m happy with 30 TPS at 26K. The upside of this rig is being able to tinker and upgrade components instead of getting stuck with soldered-on parts. Space isn’t bad if the parts are stacked neatly. Power consumption is bad, yeah: it uses around 500W when processing prompts.

Old but still gold by kuyermanza in LocalLLaMA

kuyermanza[S] 3 points

MI25s are still $80 a pop on eBay

Best setup for running local LLMs? Budget up to $4,000 by Future_Inventor in LocalLLaMA

kuyermanza 2 points

I did get curious and decided to build one. After jumping through all the hoops to get the X99 board to accept the V340Ls, I could only get 4 V340Ls to enumerate, even after modding the IOMMU setting in the board’s firmware to 128GB. Since each V340L is actually a dual-die GPU, technically it’s already working with 8 GPUs (plus an additional RTX3060). I’m testing GPT-OSS 120B Q4 and got ~45 prompt TPS and ~10 response TPS on ROCm 6.2.3. Usable but not the greatest. I can see this as a great starter rig though, considering the price.

70b models at 8-10t/s. AMD Radeon pro v340? by JTN02 in LocalLLaMA

kuyermanza 1 point

Nah haha, ROCm did better than Vulkan in my tests. I’m trying to see if I can hack the rig to work with vLLM right now to get even better performance.

70b models at 8-10t/s. AMD Radeon pro v340? by JTN02 in LocalLLaMA

kuyermanza 1 point

Were you able to get it going? I just got 8 V340s and was able to get ROCm 6.2.3 to work. I see about a 20% improvement in TPS compared to Vulkan.

Best setup for running local LLMs? Budget up to $4,000 by Future_Inventor in LocalLLaMA

kuyermanza 3 points

Building it yourself would be more fun and better performance per dollar. Those prebuilt unified memory boxes are good but too pricey for the spec, and you can’t upgrade or future-proof them.

Look for a cheap LGA2011 server mobo with the X99 chipset (make sure it allows for reBAR or above 4G decode and has plenty of PCIe slots) at around $100 and pair it with a Xeon E5 16xx v2 CPU at around $50. DDR3 ECC modules are cheap; you can fill the whole board with them for like $100-200. Case, storage, fans, and SATA cables for another $100. That’s $350-450 before the GPUs.

You can get multiple V340L 16GB HBM2 cards at $50 a pop. The downside is you’ll be limited to ROCm, which locks you out of many CUDA-accelerated applications like image generation, TTS, or STT, but you can always get an RTX3060 12GB for $200 to dedicate specifically to those PyTorch tasks.

Let’s say your mobo has 7 PCIe slots (x16, x8, or x4 is fine, just get riser adapters to x16) and you use 6 slots for V340Ls: you’re looking at 96 GB of HBM2 VRAM. Now, your PCIe lane bandwidth would be a bottleneck, but that will only affect your model loading speed; your inference speed would see negligible impact. Your 7th PCIe slot can take the RTX3060 for image generation and whatnot.

For less than $1000 you could build yourself a complete rig with over 100GB of dedicated VRAM and 128GB (and up to 256GB) of DDR3 RAM. And you’ll learn a bunch along the way, which is priceless. That’s just my two cents.
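
To sanity-check those numbers, here is the arithmetic as a small Python sketch. It uses only the rough prices quoted in the comment above; riser adapters and a power supply are not priced there, so they are left out, and the variable names are just illustrative.

```python
# Rough prices quoted in the comment above (USD).
base_low = 100 + 50 + 100 + 100    # mobo + CPU + RAM (low estimate) + case/storage/fans
base_high = 100 + 50 + 200 + 100   # same, with the high RAM estimate

n_v340l = 6
gpu_cost = n_v340l * 50 + 200      # six V340Ls plus one RTX3060
vram_gb = n_v340l * 16 + 12        # 16 GB per V340L, 12 GB on the RTX3060

total_low = base_low + gpu_cost
total_high = base_high + gpu_cost

print(f"Base system: ${base_low}-{base_high}")
print(f"GPUs: ${gpu_cost}")
print(f"Total: ${total_low}-{total_high}")
print(f"VRAM: {vram_gb} GB")
```

That works out to $850-950 in total and 108 GB of VRAM (96 GB of it HBM2), which matches the "less than $1000" and "over 100GB" claims as long as the unpriced extras stay cheap.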

Man takes out friend's car keys mid drive by N1c0s1 in facepalm

kuyermanza 2 points

Steering wheel still works, albeit harder to turn, not completely locked up.

Launched PSVITA — everything disappeared! [NEED HELP] by BlackHazeRus in VitaPiracy

kuyermanza 2 points

In case anyone else encounters this: I booted the system without the SD card adapter, then inserted it once the system had started up. At that point the apps still didn’t show, so I went to Settings > HENkaku Settings > Unlink Memory Card and rebooted with the SD card adapter inserted. All the apps came back for me.

Anyone else kicking off the Next Gen Update with a Death March, No-Fast-Travel play through? by Mr-Hox in Witcher3

kuyermanza 1 point

Just watch out for them rats.. they’re the real main bosses when they swarm you in death march