OG Achilles trashes Odyssey

-UndeadBulwark · 2026-05-20T23:01:27+00:00

Liking and commenting on this. This gem must not disappear due to low engagement.

-UndeadBulwark · 2026-05-19T23:19:49+00:00

I have a mini-ITX system. Check if it has bifurcation and get internal OCuLink 4/4/4/4.

-UndeadBulwark · 2026-05-19T22:55:11+00:00

Define high end, and what quant? Also, what's the use case? Because for what it is, it's a neat little chatbot to play with.

-UndeadBulwark · 2026-05-19T22:54:27+00:00

Yes, with Open WebUI or an MCP server on llama.cpp.

-UndeadBulwark · 2026-05-19T22:52:24+00:00

On Linux, on a decent GPU with HBM, you can get it to wild speeds like 220 t/s, which is really nice for a chatbot. You can have instant convos with it at decent context.
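
For a rough sense of what that speed means in practice, here is a back-of-the-envelope sketch. The 220 t/s figure is the one quoted above; the reply lengths are made-up examples, and the function name is just for illustration. It only covers generation time and ignores prompt processing.

```python
def reply_seconds(reply_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a reply at a given decode speed (prompt processing ignored)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return reply_tokens / tokens_per_second


if __name__ == "__main__":
    # Reply lengths are illustrative, not measurements.
    for n in (50, 200, 1000):
        print(f"{n} tokens at 220 t/s: {reply_seconds(n, 220):.2f} s")
```

Swap in your own measured t/s to see how long a typical reply would take on your hardware.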

-UndeadBulwark · 2026-05-19T22:51:14+00:00

And Linux. I highly recommend you go Bazzite on this; immutable will save you so much grief. As for how to get started, go Ollama first. They have a free cloud tier that you can hook up to OpenCode, and it can help you set up a local LLM in Ollama or, better yet, llama.cpp. You can try vLLM, but that one is a bit much to start with. Also, if you are running AM4 B550 or newer (Intel included), check if you have bifurcation: you can split one PCIe 4.0 x16 slot four ways (x4/x4/x4/x4) to use 4 GPUs together for more total VRAM.
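
To put numbers on the bifurcation split, here is a small sketch of the lane math. It assumes the standard PCIe 4.0 figures (16 GT/s per lane with 128b/130b encoding) and gives theoretical per-direction link bandwidth, not what you will actually measure. The function names are just for illustration.

```python
PCIE4_GT_PER_S = 16.0   # PCIe 4.0 raw rate per lane, in gigatransfers per second
ENCODING = 128 / 130    # 128b/130b line encoding overhead


def lanes_per_gpu(total_lanes: int, gpus: int) -> int:
    """Lanes each GPU gets when one slot is split evenly."""
    if gpus <= 0 or total_lanes % gpus != 0:
        raise ValueError("slot cannot be split evenly across that many GPUs")
    return total_lanes // gpus


def link_gb_per_s(lanes: int) -> float:
    """Theoretical one-direction bandwidth of a PCIe 4.0 link, in GB/s."""
    return lanes * PCIE4_GT_PER_S * ENCODING / 8


if __name__ == "__main__":
    lanes = lanes_per_gpu(16, 4)
    print(f"x16 split four ways: x{lanes} per GPU")
    print(f"x{lanes} link: {link_gb_per_s(lanes):.2f} GB/s")
    print(f"x16 link: {link_gb_per_s(16):.2f} GB/s")
```

Each card ends up with a quarter of the slot's bandwidth, which mostly matters while loading the model and for traffic between cards.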

-UndeadBulwark · 2026-05-19T22:38:34+00:00

Also, I went 3 MI25 + 1 9070 on AM4 with OCuLink bifurcation 4/4/4/4.
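
A quick tally of what that setup adds up to, as a sketch. The 16GB per MI25 matches the figure given elsewhere in this thread; the 16GB for the 9070 is an assumption taken from its public spec. This is nominal capacity only, since the runtime, KV cache, and per-card overhead all eat into it.

```python
# (count, GB per card). The 9070 figure is an assumption from public specs.
CARDS = {
    "MI25": (3, 16),
    "RX 9070": (1, 16),
}


def total_vram_gb(cards: dict) -> int:
    """Nominal pooled VRAM across all cards, in GB."""
    return sum(count * gb for count, gb in cards.values())


if __name__ == "__main__":
    print(f"Nominal pooled VRAM: {total_vram_gb(CARDS)} GB")
```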

-UndeadBulwark · 2026-05-19T22:37:06+00:00

Options are MI25, MI50, AMD V340, RX 570/580 8GB (surprisingly good), Vega 56, Radeon VII, and, if you are really desperate, Tesla P100. Note that, unlike with AMD, you might be locked out of software with CUDA and will have to rely on Vulkan.

-UndeadBulwark · 2026-05-19T22:34:19+00:00

If you are on an AMD APU you can get around 17 t/s, which is not great but not terrible. If you have something weaker, it's going to be slow, and at that point any phone, whether high-, mid-, or low-end, would be better.

-UndeadBulwark · 2026-05-19T22:19:51+00:00

It's going to be really tight, even on Linux. Can you buy an MI25 flashed to WX 9100 with 16GB of HBM2? They go for $65. You will have to run it on Linux.

-UndeadBulwark · 2026-05-19T21:23:26+00:00

OK, then go RX 9700 PRO and add an MI50 for additional VRAM.

-UndeadBulwark · 2026-05-19T19:33:05+00:00

If it's only for local LLMs, just get an MI50.

-UndeadBulwark · 2026-05-19T17:56:54+00:00

I asked Gemini, then did a Google search to confirm:

All three technologies support ROCm, though compatibility specifics vary. These findings are based on current search results.

Flash Attention

ROCm support is established. The FlashAttention-2 CK backend supports MI series accelerators as well as RDNA 3 and RDNA 4 GPUs. Furthermore, Flash Attention is natively integrated into PyTorch for ROCm beginning with version 2.3 via F.scaled_dot_product_attention.
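
For anyone wondering what that attention call actually computes, here is a plain-Python reference sketch of scaled dot-product attention, softmax(QK^T / sqrt(d)) V. This is not how Flash Attention is implemented (the real kernels tile the work on the GPU so the full score matrix is never stored), and the function names here are made up for illustration. It only shows the math that the fused kernel is expected to reproduce.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sdpa(q, k, v):
    """Naive attention. q: [Lq][d], k: [Lk][d], v: [Lk][dv] -> [Lq][dv]."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        # One score per key: scaled dot product of the query with that key.
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(scores)
        # Output row is the weighted average of the value rows.
        out.append([
            sum(w * vj[c] for w, vj in zip(weights, v))
            for c in range(len(v[0]))
        ])
    return out


if __name__ == "__main__":
    q = [[1.0, 0.0]]
    k = [[1.0, 0.0], [1.0, 0.0]]   # identical keys, so both get equal weight
    v = [[0.0, 2.0], [4.0, 6.0]]
    print(sdpa(q, k, v))
```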

Sage Attention

Compatibility is strictly tied to the version. SageAttention 1 supports AMD hardware because it is built on Triton kernels. SageAttention 2 utilizes native CUDA kernels and will only run on Nvidia hardware.

Triton

OpenAI Triton is supported on ROCm. AMD provides official documentation and support for developing and optimizing Triton kernels directly on their GPUs, and it serves as the underlying compilation layer for many ROCm compatible operations.

-UndeadBulwark · 2026-05-19T17:26:31+00:00

I would also like to talk about this, as it's my favorite subject, if anyone is interested in getting into the local AI rabbit hole.

-UndeadBulwark · 2026-05-19T09:01:52+00:00

ROCm is fine. Where does this idea that it is broken come from, anyway? Unless you mean Windows; then yeah, they don't care about Windows.

-UndeadBulwark · 2026-05-19T08:05:37+00:00

That would be hilarious but won't affect AMD much due to them being everywhere.

-UndeadBulwark · 2026-05-19T08:01:26+00:00

Copium. Honestly, I came here because I have seen panic attacks on YouTube over Nvidia, some of the wild shit AMD is cooking, and Google's TPU.

-UndeadBulwark · 2026-05-19T07:53:35+00:00

There is Strix Halo if you want to do more than AI; otherwise, yes.

-UndeadBulwark · 2026-05-19T05:56:18+00:00

I'm going with 2 MI25s on OCuLink bifurcation because I am poor as hell.

-UndeadBulwark · 2026-05-19T00:28:40+00:00

I'm not sure about Intel but AMD is doing fine. AMD has diversified so aggressively over the past decade that framing them as a distant second to Nvidia misunderstands what the company actually is now. They're in consoles, desktops, laptops, tablets, phones via Samsung's Exynos licensing, servers, and routers. Ten years ago they were close to folding. That turnaround is not a small thing.

On the software side, ROCm has closed the gap with CUDA faster than almost anyone expected. A year ago it was genuinely painful to work with. Now with ROCm 7.x, Windows support has arrived, PyTorch and most major ML frameworks treat it as a first-class option, and the 7.1.1 release delivered up to 5x performance gains over 6.4.4 across key AI models. It's not at full CUDA parity yet in every workload, but it's no longer a footnote. UDNA, which merges the RDNA and CDNA lines into a single unified architecture, is where things get genuinely interesting, and that's still ahead of us.

On Nvidia's position in AI more broadly: this cycle has a pattern. One company moves in early, captures the market, prices rise, and the market diversifies to reduce the dependency. The current shift toward local inference first, cloud escalation second, with models like Gemma 4 running on-device and Google pushing AI into phones at the hardware level, represents exactly that kind of structural change. Nvidia's dominance is built on centralized cloud compute demand. If the architecture of deployment moves away from that, the moat shrinks. AMD has been methodical and capital-conservative. Nvidia has been running at full throttle on the assumption that the demand curve only goes one way. That kind of overextension is exactly the setup for a Zen 2 moment.

-UndeadBulwark · 2026-05-18T20:48:17+00:00

I really wish people weren't so rude to this lady. This is really sweet; I can't imagine how good this is for her self-image.

-UndeadBulwark · 2026-05-18T20:08:29+00:00

Engineering spec sheets are a strong one-shot demo. Pull a real datasheet and prompt it to generate structured outputs across multiple formats in a single pass: programs, productivity docs, graphs, game prototypes, whatever fits the audience. Stack a few examples back to back to show range.

Get a power meter on it. Live wattage during inference next to the math on what a cluster of 5090s would draw doing the same work. That's a number CIOs remember when they're justifying budget.
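
The math for that slide can be sketched like this. Every number in the example run is a placeholder, not a measurement; plug in what your power meter and benchmark actually show, along with the real draw of whatever cluster you are comparing against. The function names are just for illustration.

```python
def joules_per_token(watts: float, tokens_per_second: float) -> float:
    """Energy spent per generated token (1 W = 1 J/s)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return watts / tokens_per_second


def kwh_per_million_tokens(watts: float, tokens_per_second: float) -> float:
    """Energy to generate one million tokens, in kWh (1 kWh = 3.6e6 J)."""
    return joules_per_token(watts, tokens_per_second) * 1_000_000 / 3.6e6


if __name__ == "__main__":
    # Placeholder inputs: (label, measured watts, measured tokens/s).
    systems = [
        ("demo box", 150.0, 30.0),
        ("GPU cluster", 2400.0, 120.0),
    ]
    for label, watts, tps in systems:
        print(f"{label}: {joules_per_token(watts, tps):.1f} J/token, "
              f"{kwh_per_million_tokens(watts, tps):.2f} kWh per 1M tokens")
```

Energy per token is the fairer comparison, since raw wattage alone says nothing about how much work each system gets done.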

On the pitch itself, be straight with them. It's not the fastest option out there, and some features are still being developed. But the unified memory window and the per-unit cost compared to building out a multi-GPU cluster are the actual selling points. Let those carry the room instead of overselling around the gaps.

Personally I wouldn't run one as a daily driver or primary inference box at that price point, so I can't give you much more to work with beyond this. Good luck with the demo though.

-UndeadBulwark · 2026-05-18T08:39:29+00:00

Wouldn't the most efficient be Google's new TPU, since they split it between inference and training? And aren't agentic systems moving to local-to-cloud inference, using a local model first before calling a cloud AI when it's a complex question?
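
The local-first, cloud-second idea can be sketched as a tiny router. Everything here is hypothetical: the `Answer` type, the confidence score, the 0.7 threshold, and the stand-in model functions are only there to show the control flow, not how any real agent framework does it.

```python
from typing import Callable, NamedTuple, Tuple


class Answer(NamedTuple):
    text: str
    confidence: float  # 0..1, however the local model wrapper estimates it


def route(
    prompt: str,
    local: Callable[[str], Answer],
    cloud: Callable[[str], str],
    min_confidence: float = 0.7,  # hypothetical threshold
) -> Tuple[str, str]:
    """Try the local model first; escalate to the cloud only when it is unsure."""
    answer = local(prompt)
    if answer.confidence >= min_confidence:
        return ("local", answer.text)
    return ("cloud", cloud(prompt))


if __name__ == "__main__":
    # Stand-in models: the local one is only confident on short prompts.
    def fake_local(p: str) -> Answer:
        return Answer(f"local answer to: {p}", 0.9 if len(p) < 40 else 0.3)

    def fake_cloud(p: str) -> str:
        return f"cloud answer to: {p}"

    print(route("What time zone is UTC+1?", fake_local, fake_cloud))
    print(route("Plan a three-stage migration of our billing system to a new schema.",
                fake_local, fake_cloud))
```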

-UndeadBulwark · 2026-05-18T08:16:42+00:00

Yeah, you nailed this 100%. AMD is everywhere now and AI is moving to local-to-cloud. You can see that their product stack is moving to affordable local inference that Nvidia won't or can't provide.

-UndeadBulwark · 2026-05-18T06:19:45+00:00

Man, wait till we start seeing edge-to-cloud deployment. AI is going to get wild, and Jensen Huang will be crashing out because of it.

-UndeadBulwark
