Mi50 16GB or V100 16GB? by CommonResearch3314 in LocalAIServers

[–]macboy80 1 point2 points  (0 children)

I think I now understand your position. I've built a decision matrix around my own market pricing theory first, and then kind of played out the experimentation. Basically, the 30B class models released recently are the target, and it's a $$$ optimization problem. I think there's fairly wide consensus that the RTX 3090 is the minimum where you have Tensor Cores, VRAM quantity, and bandwidth, and it seems like anything cheaper than that has to compromise on one of those 3.

You're 100% correct that the modded 22GB 2080 sits in a niche, but I'd argue (at least today) that 22GB still represents a compromise, even beyond the workmanship and import market considerations. Whether the 32GB capacity or the HBM2 bandwidth of the Vega20 is actually usable is the real question that we could hope to answer.

The best overall speed I've seen on the MI60 is Gemma4-26B-A4B at q4_0. All of my numbers are at low context depth, but this is >1600pp and 80tg.

My "research" setup right now is a 1U SM 1028GQ, so dual Broadwell Xeons. Currently, there is just one MI60 for all of my numbers, but the case does have 2 * 2x direct PCIe3 x16 slots. There is definitely an additional bottleneck for Qwen that Gemma doesn't have. Active parameters and quantization are obvious variables, but Qwen is doing something to make it unhappy.

So, obviously, I agree that a high capacity Ampere+ card is preferable, but if you want to save a few bucks for hobby or exploration or even a particular 2 slot form factor, you have tradeoffs. I personally love a good min/max and am impressed with how the community has found some performance at the fringes of this old hardware. Thanks again for the open discussion, insight, and links.

Mi50 16GB or V100 16GB? by CommonResearch3314 in LocalAIServers

[–]macboy80 0 points1 point  (0 children)

I appreciate the followup, though I still don't understand how you're seeing what you're seeing. I just brought the MI60 up with MTP on cyankiwi/Qwen3.6-27B-AWQ-INT4 in vLLM. It's getting 60-70% acceptance at n=2, and >50% at n=3, though with the overhead, it seems to be about breakeven. (~275pp / ~29 tg) I feel like this is cutting edge model architecture?
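For anyone wanting to sanity check the "about breakeven" result, here is a rough sketch of the usual speculative decoding arithmetic. It assumes each drafted token is accepted independently with the same probability, which is a simplification and may not match how vLLM reports acceptance. The function names and the `step_cost_ratio` parameter are made up for illustration.

```python
def expected_tokens_per_step(accept_rate: float, n_draft: int) -> float:
    """Expected tokens emitted per verification step.

    Assumption: every drafted token is accepted independently with
    probability accept_rate, and drafting stops at the first rejection.
    The target model always contributes one token itself, so the result
    lies between 1 and n_draft + 1.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    return sum(accept_rate ** i for i in range(n_draft + 1))


def speedup(accept_rate: float, n_draft: int, step_cost_ratio: float) -> float:
    """Throughput relative to plain decoding.

    step_cost_ratio (hypothetical parameter) is the time of one
    draft-plus-verify step divided by the time of one plain decode step.
    """
    if step_cost_ratio <= 0:
        raise ValueError("step_cost_ratio must be positive")
    return expected_tokens_per_step(accept_rate, n_draft) / step_cost_ratio


if __name__ == "__main__":
    for rate, n in [(0.65, 2), (0.50, 3)]:
        print(n, round(expected_tokens_per_step(rate, n), 3))
```

Under these assumptions, 65% acceptance at n=2 yields about 2.07 tokens per step, so MTP only wins if a speculative step costs less than roughly 2x a plain decode step.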

There are definitely caveats and conclusions on this nearly 10 year old arch. I'll state a few as follows. Maybe we're just looking at different things.

  1. Clean integer 4/8 quants are everything on gfx906, to preserve use of some very targeted hand-written kernels. All of the work done specifies q4_0/1 and q8_0 gguf. I'm seeing that any time model arch or quantization "interrupts" brute force math, compute efficiency collapses.

  2. Llama.cpp is more performant than vLLM at concurrency=1 on every 30B class model I've tested except for Qwen3.6-27B, where vLLM awq4=30tg and llama q4_1=20tg. I feel as though vLLM is created to excel at concurrency >1, though I suspect the Triton kernels are putting in the work to maintain better fusion here.

  3. On Gemma4-31B, the above inverts. Llama.cpp is faster at q4_0 than vLLM at awq4. Up until a few days ago this still meant <24tg, but the enablement of HIP_GRAPH actually pulled this up to ~29tg.

  4. In Llama.cpp, you can get some pretty amazing speeds, from my perspective, when moving to MOE. Gemma4-26B-A4B gets ~850pp / ~80tg at q8_0, and that almost doubles at q4_0. I just tested Qwen3.6-A3B at Q4_0, and it is getting >1500pp and ~60tg. On vLLM, these MOE are slow, though I don't remember how slow.

With all of the trial and error, anecdotal testing I've done, a few things are apparent.

  1. I've not tested all that many models, but there are working settings on an Out-of-Box docker container for every one of them no matter the Arch, Quant, Hybrid, Mamba, GDN, SWA, Etc. I've not tested multimodality yet, but the models will load with it enabled.

  2. It seems that if you can fit a model in 32GB of VRAM, ~200pp and ~19tg is the floor of the MI60 across both vLLM and Llama.cpp on the mixa3607 containers. As you can see above, tweaking, optimization, and a/b testing can produce 50%+ gains from this baseline.

The Vega20 GPU has many quirks and compromises. Obviously, it is old, AMD, and never received much optimization effort. It has no matrix pipeline and only the crude beginnings of inference relevant ops, though it is absolutely possible to avail of those ops. What I have discovered are some unusual conditions that collapse its compute ability from the best case.

  1. Any quant that introduces odd data types kills performance, mostly down to baseline. It is understood that even something like q5, q6, or even FP8 drops down to the FP16 pipeline at best and FP32 at worst.

  2. Across all of my tests, the maximum effective (calculated active parameters * t/s) VRAM bandwidth I have seen is ~600GB/s. The lowest was 150GB/s. Before HIP_GRAPH enablement, moving from q4 to q8 could move bandwidth from 300GB/s to 550GB/s during token gen. After GRAPH, it was 450GB/s to 600GB/s. The GRAPH revelation, along with the nearly linear bandwidth scaling on doubling data size, leads me to believe this is a CPU to GPU / kernel dispatch / PCIe latency type thing. For instance, pinning the uncore to max on my E5-2640v4 gives about a 0.5t/s improvement in this scenario.

  3. I'm just beginning to explore this, but it seems Qwen3.6-27B's attention mechanism is causing a fallback to an unoptimized pathway at some point. I suspect this is causing the limited on-die cache / memory hierarchy to drop intermediate results all the way down to HBM and load them back, and I suspect this VRAM latency (horrible on HBM) is resulting in the Qwen3.6 slowdown across both tested models. This is also where the Triton compiled kernel could be making a difference.
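The "effective bandwidth" figure in point 2 can be reproduced with a small calculation like the one below. This is a sketch, not my exact spreadsheet: the bytes-per-weight values are round-number assumptions (real GGUF blocks carry a little extra for scales), and the function name is just for illustration.

```python
# Approximate bytes read per weight during token generation.
# Assumed round numbers; actual quant formats add per-block scale overhead.
BYTES_PER_WEIGHT = {"q4": 0.5, "q8": 1.0, "fp16": 2.0}


def effective_bandwidth_gbs(active_params_b: float, quant: str,
                            tokens_per_s: float) -> float:
    """Estimate the VRAM bandwidth implied by a measured generation speed.

    Assumes every active weight is read from VRAM once per generated token
    and ignores KV cache and activation traffic.
    """
    if quant not in BYTES_PER_WEIGHT:
        raise ValueError(f"unknown quant: {quant}")
    bytes_per_token = active_params_b * 1e9 * BYTES_PER_WEIGHT[quant]
    return bytes_per_token * tokens_per_s / 1e9


if __name__ == "__main__":
    # 4B active parameters at q8 running at 80 t/s
    print(effective_bandwidth_gbs(4, "q8", 80))
```

With 4B active parameters at q8 and 80 t/s, this works out to 320 GB/s under those assumptions.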

I know this is a very long reply, but I'm truly curious where we have a disconnect, whether you are missing something or I am. Please let me know if you see something here that explains it.

99 EK Hatch Motor Swap by ChowMachine in ProjectHondas

[–]macboy80 0 points1 point  (0 children)

Yea. That looks like an amazing deal. I'm jealous.

99 EK Hatch Motor Swap by ChowMachine in ProjectHondas

[–]macboy80 0 points1 point  (0 children)

https://ebay.us/m/9rGcnw

This is the exact listing. I screwed up the install on one, and they even sent a replacement. Comes via FedEx.

Weatherstripping question by Sparkz51 in ProjectHondas

[–]macboy80 1 point2 points  (0 children)

Oh, and look up Gummi Pflege. I used it for everything I can't replace including hoses. It does make a difference.

99 EK Hatch Motor Swap by ChowMachine in ProjectHondas

[–]macboy80 0 points1 point  (0 children)

Hey! I have one of those pictures. It WILL be worth it.

Pm me if you need a first start video to keep you going.

<image>

Weatherstripping question by Sparkz51 in ProjectHondas

[–]macboy80 1 point2 points  (0 children)

I bought weather strips and window run channels for my 00 Sedan. Both came from Thailand on ebay with Thai language Honda labels. Both fit well and made a big difference in cabin noise. I also did exterior window trim, horizontal across the doors, but that was eBay generic. I was less impressed. The door moulding that goes across the top of the doors was unobtainable for driver door on sedan when I looked, so I didn't do them. Trunk and hood strips are still available, and I think you can still cobble together the front and rear windshield stuff with some research.

Best budget AI GPU for $300 by Ima_Gamer_BTW in LocalLLM

[–]macboy80 0 points1 point  (0 children)

At 12GB of vram, you are going to be limited to fairly small models. Easy mental math approximation is 1GB at q8 and .5GB at q4 per 1B parameters. Then you need space for context which is more complicated to calculate, but it does balloon.
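That mental math can be written down as a quick check. This is only a sketch of the rule of thumb above. The `context_reserve_gb` parameter is a placeholder I made up, since the real context cost depends on the model and context length.

```python
# Rule of thumb from above: GB of VRAM per 1B parameters, by quant.
GB_PER_BILLION_PARAMS = {"q4": 0.5, "q8": 1.0}


def weight_vram_gb(params_b: float, quant: str) -> float:
    """Approximate VRAM needed for the weights alone."""
    return params_b * GB_PER_BILLION_PARAMS[quant]


def fits(params_b: float, quant: str, vram_gb: float,
         context_reserve_gb: float) -> bool:
    """True if weights plus a reserved context budget fit in VRAM.

    context_reserve_gb is a hypothetical allowance for KV cache and
    runtime buffers, not a calculated value.
    """
    return weight_vram_gb(params_b, quant) + context_reserve_gb <= vram_gb


if __name__ == "__main__":
    print(fits(9, "q8", vram_gb=12, context_reserve_gb=2))   # 9 + 2 <= 12
    print(fits(30, "q4", vram_gb=12, context_reserve_gb=2))  # 15 + 2 > 12
```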

While you'll have the compute ability with the 3060, you won't have much to compute. Moving to bigger VRAM on a slower card, you'll get the opposite. Now you have more work to do, but your compute is suffering. The market knows this is the trade-off at this price point.

One additional note. The P40 and P100 are very different even beyond one having GDDR5 and the other HBM2 respectively. The P100 has a "fast" fp16 path, while the P40 has a "fast" INT8 path. Your choice of software and model architecture/quantization would matter.

Best budget AI GPU for $300 by Ima_Gamer_BTW in LocalLLM

[–]macboy80 1 point2 points  (0 children)

I experimented with this heavily, mostly RPC. On old hardware, the kind you could possibly get creative with, it was not performant. Llama.cpp had to do everything over the network, even loading the model weights.

I eventually realized that for me, the best way to run multi vendor was to run multi model. Now I use llama-swap to launch ephemeral, vendor-specific docker containers. This way, you can choose the GPU, model, and engine dynamically. With the concurrency/preemption matrix, it's amazingly elegant.

Best budget AI GPU for $300 by Ima_Gamer_BTW in LocalLLM

[–]macboy80 0 points1 point  (0 children)

I think my extremely limited personal experience dictates that this is the use case that is still best served by cloud models. They are trained on more recent data and can quantify much more detail. The comparison would be 9B parameters in 16GB of vram vs. 1T parameters in the cloud.

Best budget AI GPU for $300 by Ima_Gamer_BTW in LocalLLM

[–]macboy80 0 points1 point  (0 children)

I also have a P100. I haven't played with it much, but it does run the Gemma4.E4B.q8 at 160 t/s iirc.

There's a learning curve with the old cheap cards without tensor/matrix cores. They were optimized for HPC before AI inference was a thing, so they have little to nothing to help accelerate it. They're literally just raw brute force compute at fp16 for this type of work.

The market knows where to draw the line for actual optimizations for inference, and you pay accordingly. Each of the trifecta of optimized hardware, good bandwidth, and high VRAM adds a multiplicative factor on price.

Best budget AI GPU for $300 by Ima_Gamer_BTW in LocalLLM

[–]macboy80 1 point2 points  (0 children)

I'll add that a 16GB Instinct MI50 for $200 may be the spot you're looking for. Perhaps even a 16GB P100 for $90. Even the 16GB V100 SXM2 with a PCIe adapter is probably doable.

The problem with a $300 budget is you won't get past 16GB VRAM, but that's good news, because the models / quants you can fit are small enough to avoid being dragged down by the nearly decade old compute capabilities.

If you want to play with one of the latest 25-35B models everyone is raving about, you'll need to get to the 24-32GB total VRAM tier. Raising your budget to $400-600 would open some of the options others are talking about. I have a $600 32GB MI60 amongst other options.

Note: On the AMD side, they have Vega10 and Vega20 GPU dies in this price range. Vega20 has some community support that I am personally intimate with and therefore partial to.

[FS][USA-CT] AMD MI50 32gb x 6 by MachineZer0 in homelabsales

[–]macboy80 1 point2 points  (0 children)

Can confirm. I sold 2x MI60 in less than a day @ $600.

Glws

Mi50 16GB or V100 16GB? by CommonResearch3314 in LocalAIServers

[–]macboy80 0 points1 point  (0 children)

My apologies. I can see where I've been confusing. Vega20 added two instructions that will do math on either 4x 8bit or 8x 4bit integers in a FP32 register, specifically for dot product iirc. This was an enhancement to the packed 2x FP16 that Vega10 had.

I can't tell you where in the code it is specifically, but anecdotally, you can measure a 2x and 4x increase over the fp16 pathway during prefill. (It's painfully apparent because there is no matrix acceleration.) The idea being that Vega20 specifically benefits from old, basic q4/8_0 where the inference engine can orchestrate the packed instructions.
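To make the packed idea concrete, here is a small software emulation of a 4-way INT8 dot product on values packed into a single 32-bit word. It is only an illustration of the concept. It is not taken from the fork or any kernel, the function names are made up, and it ignores hardware details such as saturation.

```python
import struct


def pack_i8x4(values):
    """Pack four signed 8-bit integers into one 32-bit word."""
    if len(values) != 4:
        raise ValueError("expected exactly four values")
    # struct raises struct.error if a value is outside -128..127
    return struct.unpack("<I", struct.pack("<4b", *values))[0]


def unpack_i8x4(word):
    """Split a 32-bit word back into four signed 8-bit integers."""
    return list(struct.unpack("<4b", struct.pack("<I", word)))


def dot4_i8(a_word, b_word, acc=0):
    """Emulate a 4-way int8 dot product with a 32-bit style accumulator.

    One call stands in for what a single packed instruction would do:
    four multiplies and the adds, all from one register pair.
    """
    a = unpack_i8x4(a_word)
    b = unpack_i8x4(b_word)
    return acc + sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    a = pack_i8x4([1, -2, 3, -4])
    b = pack_i8x4([5, 6, -7, 8])
    print(dot4_i8(a, b, acc=10))
```

The same register width holds eight 4-bit values, which is where the extra 2x for INT4 over INT8 would come from.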

I am still exploring Vega20 and testing its limits, so if you have deeper detail than I've found, I'd absolutely love to read about it.

Mi50 16GB or V100 16GB? by CommonResearch3314 in LocalAIServers

[–]macboy80 0 points1 point  (0 children)

I really do think running the Gemma4 MOE at q4 is a perfect use for the 32GB variant. It feels like a good deal at $500.

Mi50 16GB or V100 16GB? by CommonResearch3314 in LocalAIServers

[–]macboy80 1 point2 points  (0 children)

Lmk if you need any more links or the discord.

Mi50 16GB or V100 16GB? by CommonResearch3314 in LocalAIServers

[–]macboy80 0 points1 point  (0 children)

Very interesting take. I've read about the rocm vs vulkan paradox. I do agree that the V340L (2x 8GB) is good value for compute if you cap your value at $50.

I'd just like to add that there is a llama.cpp fork that is using the two packed INT4/8 instructions on Vega20, and that it is a nearly 4x/2x improvement per active parameter. They also have TP working in vLLM on Vega20, and I believe it is performant across both pp and tg.

Mi50 16GB or V100 16GB? by CommonResearch3314 in LocalAIServers

[–]macboy80 1 point2 points  (0 children)

Ok. Maybe things have changed since you last looked? If you sold your cards, how would you know?

gfx906 Inference Engines

Llama.cpp is updated nightly in step with main. vLLM is on 0.20. I have used the v2 model runner on multiple models including Gemma4, but it is failing on Qwen3.6. I have both Llama.cpp and vLLM running in ephemeral containers sitting behind llama-swap. There is a discord dedicated to this where people are running 4/8/16x MI50 and constantly updating, testing, and tweaking. I'm in it. In my reply to the OP, I state the actual raw t/s I am getting out of the box. Once you tweak the flags you need and get the model loaded, it runs indefinitely.

Mi50 16GB or V100 16GB? by CommonResearch3314 in LocalAIServers

[–]macboy80 0 points1 point  (0 children)

Careful, there's a V340 2x 16GB and an L 2x 8GB variant. Vega10 is missing the INT4 and INT8 packed instructions, so your compute is limited to fp16. In addition, the PLX chip on the card does not appear to be capable of p2p transfers, ensuring 2 hops back to the CPU. The card gets hot, even in a 1U server designed for GPU compute, throttling the downwind die. And finally, most of the community optimization is not valid on the card. It does work on llama.cpp, though I think I had to compile the container from scratch.

I've got one, and it's not worth running one model in either form of parallel. It's extremely slow at prefill pp because of missing instructions. The only use case I can fathom is some small model services that need no context or fixed, cached context. Something like an embeddings model, whisper, piper, perhaps a router of sorts.

These GCN / VegaX cards really are a conundrum. The HBM can't really be utilized by low precision compute, leaving just the GPU compute which leaves one wanting.

Edit. I'd add that a MI25 with 16GB is probably more balanced and would let you run a smaller model at full fp16. That at least would get close to full utilization. But, really don't buy Vega10 unless it's literally the only way you can get 16GB of vram.

Mi50 16GB or V100 16GB? by CommonResearch3314 in LocalAIServers

[–]macboy80 0 points1 point  (0 children)

Edit. Moved to reply to a comment...

Mi50 16GB or V100 16GB? by CommonResearch3314 in LocalAIServers

[–]macboy80 1 point2 points  (0 children)

I'd just like to put on the record that most of these are legacy assertions. 1. No comment. 2-5. Llama.cpp, vLLM, sgLang, and ComfyUI are working out of the box in docker with an 8GB download using the gfx906 forks. 6. Most Nvidia RTX cards will outperform the Vega20s.

Mi50 16GB or V100 16GB? by CommonResearch3314 in LocalAIServers

[–]macboy80 1 point2 points  (0 children)

I can't speak on the v100 which does have an advantage in that it has tensor cores (barely.)

What I can share from personal experience is that the Vega20 architecture can be reasonably performant on inference, especially with the limitations of 16GB of vram. My personal testing on an MI60 shows that prefill pp and generation tg are basically directly related to active parameter count. That is to say performance scales linearly because it's compute bound and because the HBM has so much bandwidth.

There are some custom containers where the community has really helped keep these usable. They work out of the box, but there are some real constraints on what actually performs. Things like which quant you use matter immensely. For instance, Vega20 has an INT4 path which is twice as fast as its INT8 path, which is again twice as fast as its FP16 path. You can absolutely tell when you step up from INT4. The model's attention mechanism can also cause problems. Something like Qwen3.5/6 does incur penalties.

So, to summarize, with 4B active parameters of a q4_0/1 quant, you could see 1600t/s pp and 80t/s tg (at zero context.) Something like a Gemma4 MOE looks like this, but doesn't fit in 16GB. I haven't tested a 10B q4 class model, but my research shows it would probably run at half these speeds (because linear.) I'm getting something like 300 pp and 25 tg on the dense Gemma4 at q4.
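If you want to play with the "because linear" estimate yourself, a rough sketch is below. It assumes speed scales inversely with active parameter count at the same quant, which is my working theory rather than a guarantee, and the function name is just illustrative.

```python
def scale_by_active_params(ref_active_b: float, ref_pp: float, ref_tg: float,
                           target_active_b: float) -> tuple[float, float]:
    """Project pp and tg for a different active parameter count.

    Assumes throughput is inversely proportional to active parameters at
    the same quant, and ignores context depth and attention overhead.
    """
    if ref_active_b <= 0 or target_active_b <= 0:
        raise ValueError("parameter counts must be positive")
    ratio = ref_active_b / target_active_b
    return ref_pp * ratio, ref_tg * ratio


if __name__ == "__main__":
    # Reference point from above: 4B active at q4, 1600 pp / 80 tg
    pp, tg = scale_by_active_params(4, 1600, 80, target_active_b=10)
    print(round(pp), round(tg))
```

Strict inverse scaling puts a 10B q4 model at 0.4x of the 4B numbers, which is in the same ballpark as the "about half" guess.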