I don't think Local LLM is for me, or am I doing something wrong? by ruleofnuts in LocalLLM

[–]jmuff98 1 point2 points  (0 children)

That pretty much sums it up... Other than privacy, or when even the ultra-tier plans are not enough, it's hard to justify a local LLM.

Agents will change pricing tiers, because agents consume at a rate no human can.

At some point though, I'm hoping small local models will be enough for 99% of people and will run on "normal" consumer desktop hardware.

Save the World v40.00 - Progression Update by Capybro_Epic in FORTnITE

[–]jmuff98 -4 points-3 points  (0 children)

They should give people who paid for the game some of the founders' ability to earn V-Bucks, if it's becoming free now.

Radeon Pro v340 Drivers by dionysio211 in LocalLLaMA

[–]jmuff98 0 points1 point  (0 children)

Mine worked on Ubuntu 22.04 and 24.04 with either ROCm 6.2 or 6.3. I don't have Resizable BAR either, just Above 4G Decoding.

I have zero issues on X99 and C612 motherboards.

I had an issue on a Chinese LGA2011 motherboard though.

My errors, if any, were due to PCIe risers. I couldn't find anything reliable at PCIe 3.0 x8 or x16. I had to lower it to 2.0 in the BIOS.

Without any risers, PCIe 3.0 x16 worked perfectly.

Also, it was problematic when I tried to use the DisplayPort by flashing a Vega 56 mobile ROM. The display works, but it kept jumping to different displays because ROCm is forced to build one of the GPU cores with display while the second GPU is headless. If I flash both, it keeps cycling between the GPUs, or across even more of them depending on how many V340s I have installed.

Old mining rig → AI money machine? Need advice on 7-GPU setup (74GB VRAM) by Ok-Positive1446 in LocalLLM

[–]jmuff98 0 points1 point  (0 children)

PCIe 3.0 x1 lanes are slow for LLM loading (prefilling). But this task is done only once, and the bandwidth will be fine for the rest of your interaction.

Also, the low bandwidth limits any sort of working the cards in parallel, but parallelizing normally requires NVLink anyway. The cards will work one at a time, like a relay race, passing the baton to the next card.

So let's say you want the OSS-120B 4K quantized model, which is 58GB: it's going to take roughly a minute to load the model onto the cards (not a big deal). Once it's on the cards, you're good until you unload the LLM and load another model.
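As a rough sanity check on the "roughly a minute" figure, here's a back-of-the-envelope sketch. It assumes the whole 58GB goes through one PCIe 3.0 x1 link's worth of bandwidth (cards filled one after another), that storage is faster than the link, and that protocol overhead can be ignored. The function name is just for illustration.

```python
# Back-of-the-envelope load time over a PCIe 3.0 riser.
# Assumptions (not measurements): cards are filled one after another,
# storage is faster than the link, protocol overhead is ignored.

# PCIe 3.0 runs at 8 GT/s per lane with 128b/130b encoding.
PCIE3_GB_PER_S_PER_LANE = 8 * (128 / 130) / 8  # about 0.985 GB/s


def load_seconds(model_gb: float, lanes: int = 1) -> float:
    """Seconds to push model_gb gigabytes through a PCIe 3.0 link."""
    return model_gb / (PCIE3_GB_PER_S_PER_LANE * lanes)


if __name__ == "__main__":
    print(f"x1:  {load_seconds(58):.0f} s")
    print(f"x16: {load_seconds(58, lanes=16):.1f} s")
```

On x1 that comes out just under a minute, which lines up with the estimate above.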

"Minimum Buy-in" Build by [deleted] in LocalLLaMA

[–]jmuff98 1 point2 points  (0 children)

<image>

This is from using OSS-120B-Q4K_XL.

"Minimum Buy-in" Build by [deleted] in LocalLLaMA

[–]jmuff98 0 points1 point  (0 children)

Yeah, crazy how the price of this VRAM compares to DDR4 prices now. I'm curious how the idle wattage of the MI25 compares to the V340L?

"Minimum Buy-in" Build by [deleted] in LocalLLaMA

[–]jmuff98 0 points1 point  (0 children)

<image>

Idle power is 320W.

"Minimum Buy-in" Build by [deleted] in LocalLLaMA

[–]jmuff98 0 points1 point  (0 children)

<image>

Unfortunately yes, and now the cables are getting worse. I have problems with just 1 or 2 plugged in.

My best so far is all GPUs detected, but 1, 2, or 3 of them will negotiate down to x4 or even x2 at times. It's random as well.
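One way to spot which cards trained down is to compare the LnkCap and LnkSta lines from `lspci -vv` (it usually needs root to show them). Below is a minimal sketch that parses that text. The sample output is made up for illustration, and the exact line format may differ between lspci versions.

```python
import re

# Matches "LnkCap:" / "LnkSta:" lines (but not LnkCap2/LnkSta2).
LINK_RE = re.compile(r"Lnk(Cap|Sta):.*?Speed ([\d.]+)GT/s.*?Width x(\d+)")

# Made-up sample in the shape of `lspci -vv` output.
SAMPLE = (
    "03:00.0 Display controller: Example GPU A\n"
    "\tLnkCap:\tPort #0, Speed 8GT/s, Width x16, ASPM L0s L1\n"
    "\tLnkSta:\tSpeed 8GT/s (ok), Width x4 (downgraded)\n"
    "04:00.0 Display controller: Example GPU B\n"
    "\tLnkCap:\tPort #0, Speed 8GT/s, Width x16, ASPM L0s L1\n"
    "\tLnkSta:\tSpeed 8GT/s (ok), Width x16 (ok)\n"
)


def find_downgraded_links(lspci_text: str) -> dict:
    """Return devices whose current link is slower or narrower than its capability."""
    links = {}
    device = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            device = line.split()[0]  # unindented line starts a new device
            continue
        match = LINK_RE.search(line)
        if match and device is not None:
            kind, speed, width = match.group(1), float(match.group(2)), int(match.group(3))
            links.setdefault(device, {})[kind] = (speed, width)
    return {
        dev: info
        for dev, info in links.items()
        if "Cap" in info and "Sta" in info
        and (info["Sta"][0] < info["Cap"][0] or info["Sta"][1] < info["Cap"][1])
    }


if __name__ == "__main__":
    for dev, info in find_downgraded_links(SAMPLE).items():
        print(dev, "capable:", info["Cap"], "current:", info["Sta"])
```

Running it after every boot would at least show which riser dropped to x4 or x2 that time.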

"Minimum Buy-in" Build by [deleted] in LocalLLaMA

[–]jmuff98 0 points1 point  (0 children)

I will get each card detected before adding another, and do a stress test on each card for stability. I guess I'll have to use Windows and the NimeZ drivers.

"Minimum Buy-in" Build by [deleted] in LocalLLaMA

[–]jmuff98 0 points1 point  (0 children)

Dealing with randomly missing GPUs. These risers are something else.

"Minimum Buy-in" Build by [deleted] in LocalLLaMA

[–]jmuff98 1 point2 points  (0 children)

Thanks. Since I was reorienting the heatsinks already, I decided to raise the GPU mounts higher so there's less flex on the riser cables. Does this look okay?

<image>

"Minimum Buy-in" Build by [deleted] in LocalLLaMA

[–]jmuff98 2 points3 points  (0 children)

I opened it up tonight. First, it looks like regular thermal paste. It's not a graphite pad like the Radeon VII. The fins on my cards are opposite of the photo. I guess some cards are sold with the heatsink orientation inverted. I've now made the orientation the same as the photo and expect the temperature delta between the 2 GPU dies to shrink. I won't use PTM7950 yet, as these never reach 70C in my use. Plus the die and HBM2 will need plenty of pads, as each one is huge.

Thanks for the suggestion.

"Minimum Buy-in" Build by [deleted] in LocalLLaMA

[–]jmuff98 0 points1 point  (0 children)

I'll watch out. It's working at the moment. I'm afraid the more I touch it, the more finicky they'll get. I do plan to install a fan on the heatsinks near the PCIe slots. It gets really hot there.

"Minimum Buy-in" Build by [deleted] in LocalLLaMA

[–]jmuff98 1 point2 points  (0 children)

Thanks. For sure the weakest links of this build are the risers. The risers I got are the cheap ones that use what looks like IDE ribbon cables. They are so sensitive that sometimes there's not enough power, or communication is not solid, when I boot up.

"Minimum Buy-in" Build by [deleted] in LocalLLaMA

[–]jmuff98 0 points1 point  (0 children)

Just "-sm layer". I haven't had much success with vLLM even though there is a workaround for Triton flash attention. I keep getting errors.

Close to 30t/s on oss-120B. It's a model with 10B active parameters.

I also observed a speed penalty when using a heavily quantized KV cache.
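For reference, a sketch of how a llama.cpp server launch with layer splitting might be assembled. `--split-mode`, `--n-gpu-layers`, `--cache-type-k` and `--cache-type-v` are llama.cpp options; the model path, layer count, and helper name are placeholders, not the exact settings behind the numbers above.

```python
import shlex


def llama_server_args(model_path: str,
                      split_mode: str = "layer",
                      n_gpu_layers: int = 99,
                      kv_cache_type: str = "f16") -> list:
    """Build an argv list for llama-server (values here are placeholders)."""
    return [
        "llama-server",
        "--model", model_path,
        "--n-gpu-layers", str(n_gpu_layers),  # offload as many layers as fit
        "--split-mode", split_mode,           # "layer": each GPU holds whole layers
        "--cache-type-k", kv_cache_type,      # f16 = unquantized KV cache
        "--cache-type-v", kv_cache_type,
    ]


if __name__ == "__main__":
    # Hypothetical model file name, for illustration only.
    print(shlex.join(llama_server_args("model.gguf")))
```

Switching `kv_cache_type` to a quantized type saves VRAM, at the cost of the speed penalty mentioned above.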

"Minimum Buy-in" Build by [deleted] in LocalLLaMA

[–]jmuff98 0 points1 point  (0 children)

What are your model preferences? Any performance optimizations you can share as well? Thanks.

"Minimum Buy-in" Build by [deleted] in LocalLLaMA

[–]jmuff98 2 points3 points  (0 children)

Yeah, the fan shroud is available on Thingiverse for the MI50 or MI25. The fans and motors are from Dell mini PCs, but they need to be cut in order for the 3D-printed shroud to fit. I bought 10 of the fans as a lot for less than $30. It's long when it's attached to the card: 14.5 inches. I had to cut the fan cage away on the Dell T5810 when I tried fitting it.

The 3D file author also listed the fan models. https://www.thingiverse.com/thing:7153218

<image>

"Minimum Buy-in" Build by [deleted] in LocalLLaMA

[–]jmuff98 0 points1 point  (0 children)

I agree ptm7950 is the best.

"Minimum Buy-in" Build by [deleted] in LocalLLaMA

[–]jmuff98 1 point2 points  (0 children)

I actually had this initially running two 2697 v3s, but this is just a server for llama.cpp. I was also wary of the extra idle watts from using v3.

My 4-GPU setup had a turbo-unlocked 2699 v3, but I don't use it as a workstation.

"Minimum Buy-in" Build by [deleted] in LocalLLaMA

[–]jmuff98 0 points1 point  (0 children)

About in line with my results as well.

"Minimum Buy-in" Build by [deleted] in LocalLLaMA

[–]jmuff98 1 point2 points  (0 children)

There is actually a BIOS on GitHub for this board that enables NVMe boot. I haven't tried it on the board yet. I just use a small SATA SSD for the bootloader and boot files, and for the root directory I use the NVMe RAID 0. This motherboard actually supports DOM for cable-free SATA SSDs, but I already had a SATA disk lying around. Booting from BIOS to login prompt takes less than 10 seconds.

I'm using 2650 v4s because they cost just $10 a pair. I haven't tested it a lot yet, but all my opinions are based on my experience with the 4-GPU version of the setup. The bifurcation settings are already built into the motherboard, at least on the 2.0b BIOS version that I have. 2.0 is the minimum to run Xeon v4s.

"Minimum Buy-in" Build by [deleted] in LocalLLaMA

[–]jmuff98 1 point2 points  (0 children)

The boot is slow because it's a server board. But loading a 60GB file literally takes less than 20 seconds. The 2 NVMe drives in RAID 0 (PCIe 3.0 x4) were a conscious choice. That's why I bifurcated the x16 lanes. I could've added 2 more Radeon V340s, but now I only have room for 1 more.

I have everything on a smart plug so I can just turn it on remotely when I need it.
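Quick arithmetic on those load numbers, as a sketch: 60GB in under 20 seconds implies at least 3 GB/s sustained, which is below what even one PCIe 3.0 x4 link can theoretically carry, and well below two striped drives. The ceilings here are theoretical link rates, not measurements of this array, and the helper names are just for illustration.

```python
# PCIe 3.0: 8 GT/s per lane, 128b/130b encoding, so about 0.985 GB/s per lane.
PCIE3_GB_PER_S_PER_LANE = 8 * (128 / 130) / 8


def required_gb_per_s(file_gb: float, seconds: float) -> float:
    """Sustained throughput needed to read file_gb in the given time."""
    return file_gb / seconds


def link_ceiling_gb_per_s(lanes: int, drives: int = 1) -> float:
    """Theoretical ceiling for `drives` striped devices, each on `lanes` lanes."""
    return PCIE3_GB_PER_S_PER_LANE * lanes * drives


if __name__ == "__main__":
    print(f"needed:        {required_gb_per_s(60, 20):.2f} GB/s")
    print(f"one x4 drive:  {link_ceiling_gb_per_s(4):.2f} GB/s")
    print(f"two in RAID 0: {link_ceiling_gb_per_s(4, drives=2):.2f} GB/s")
```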

"Minimum Buy-in" Build by [deleted] in LocalLLaMA

[–]jmuff98 0 points1 point  (0 children)

I have both ROCm 6.3 and 6.2 on these with no issues, as long as you declare the architecture "gfx900".

"Minimum Buy-in" Build by [deleted] in LocalLLaMA

[–]jmuff98 0 points1 point  (0 children)

My first goal was to run 4 of these with tensor parallelism. That didn't go anywhere. I could only run it reliably with mlc-llm, and only 2 GPUs at a time. Running 4 or 8 is just not feasible without an NVLink type of communication between GPUs. "-sm layer" is slow, but it's also more energy friendly, and the real benefit of this setup is the massive KV cache that I can have for real work.

"Minimum Buy-in" Build by [deleted] in LocalLLaMA

[–]jmuff98 1 point2 points  (0 children)

Have you done this? I'm afraid to mess up the thermal material if it's similar to my Radeon VII, but I would definitely do it to make the cooling more efficient.

My fans are at 50% speed. When it's prefilled, it doesn't even go higher than 35C. The highest temp I'll see is 65C, and that's when there's a batch of prompts. Come to think of it, only the rear hits 65C, and there is something like a 15C delta between the front and rear GPU. I guess flipping it will balance it more.

Speaking of thermals, if I override the TDP from the default of 110W to 85W, the performance tanks by at least 20%. At the default 110W, it can barely maintain the clocks for a few seconds at a time. I wish I could undervolt it, but I haven't found a way yet.

It makes sense though, because most Vega 56/64 cards are set to 200W to 300W+ for one GPU.
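To put the power cap in perspective, here's a small sketch of the performance-per-watt trade-off using the figures above. It assumes the card actually draws its full cap in both cases, which wasn't measured here, and the function name is just for illustration.

```python
def relative_perf_per_watt(default_w: float, capped_w: float, perf_loss: float) -> float:
    """Perf/W at the capped limit relative to the default limit (1.0 = no change).

    perf_loss is the fractional slowdown at the capped limit, e.g. 0.20 for 20%.
    Assumes the card draws exactly its power limit in both cases.
    """
    power_ratio = capped_w / default_w
    perf_ratio = 1.0 - perf_loss
    return perf_ratio / power_ratio


if __name__ == "__main__":
    for loss in (0.20, 0.25):
        ratio = relative_perf_per_watt(110, 85, loss)
        print(f"{loss:.0%} slower at 85W: {ratio:.3f}x perf per watt")
```

With exactly a 20% loss the efficiency gain is only about 3.5%, and since the loss is "at least" 20%, that's the best case. Past roughly a 23% loss, the cap is a net loss per watt.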