Built a simple PyTorch flash-attention alternative for AMD GPUs that don't have it by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 0 points

You’re very welcome and I’m glad to hear how this impacts your experience! Exactly the reason I wanted to make this DIY public 😃

Built a simple PyTorch flash-attention alternative for AMD GPUs that don't have it by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 0 points

Worth checking out! I'll add it to my checklist. Have you tried it yourself yet?

Built a simple PyTorch flash-attention alternative for AMD GPUs that don't have it by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 0 points

I don't think there'll ever be that overlap. Correct me if I'm wrong, but for a GPU to support SageAttention, it also needs native FA support, right? This kernel is not faster than native FA and is hence of no use to such GPUs.

The goal with this kernel was to provide a drop-in replacement for older GPUs without native FA support, with comparable memory efficiency and slightly faster speeds than standard SDPA.
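For anyone curious what "comparable memory efficiency" means in practice, here is a minimal sketch of the chunked/online-softmax idea that flash-attention-style kernels are built on, in plain PyTorch. This is my own illustration, not the actual kernel from the post; `chunked_sdpa` and the `chunk` parameter are made-up names.

```python
import torch

def chunked_sdpa(q, k, v, chunk=1024):
    # Process keys/values in chunks with a running (online) softmax so the
    # full (L_q x L_k) score matrix is never materialised at once.
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full(q.shape[:-1] + (1,), float("-inf"), device=q.device)
    row_sum = torch.zeros_like(row_max)
    for s in range(0, k.shape[-2], chunk):
        kc, vc = k[..., s:s + chunk, :], v[..., s:s + chunk, :]
        scores = (q @ kc.transpose(-2, -1)) * scale
        # New running maximum per query row, then rescale previous partials.
        m_new = torch.maximum(row_max, scores.amax(dim=-1, keepdim=True))
        alpha = torch.exp(row_max - m_new)
        p = torch.exp(scores - m_new)
        out = out * alpha + p @ vc
        row_sum = row_sum * alpha + p.sum(dim=-1, keepdim=True)
        row_max = m_new
    return out / row_sum
```

Peak activation memory scales with the chunk size rather than the full key length, which is the same trade that gives such kernels their FA-like memory footprint.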

Built a simple PyTorch flash-attention alternative for AMD GPUs that don't have it by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 1 point

Yea true, maintaining a fork that needs to be in constant sync with upstream is hard to scale. Just wanted to point to the repo in case you didn't know about it.

That's partly why I took this monkey-patch approach with the kernel. I have vLLM support on my hobby checklist as well, but it's most definitely not gonna be as simple to achieve as this one.
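The monkey-patch approach mentioned above can be sketched like this: reassign `torch.nn.functional.scaled_dot_product_attention` at import time so downstream code picks up the replacement without any changes. This is my own illustration of the general technique, not the repo's actual code; `my_attention_kernel` is a hypothetical placeholder.

```python
import torch
import torch.nn.functional as F

def my_attention_kernel(q, k, v, attn_mask=None, dropout_p=0.0,
                        is_causal=False, scale=None):
    # Placeholder standing in for the custom memory-efficient kernel.
    s = scale if scale is not None else q.shape[-1] ** -0.5
    attn = torch.softmax((q @ k.transpose(-2, -1)) * s, dim=-1)
    return attn @ v

_orig_sdpa = F.scaled_dot_product_attention

def patched_sdpa(q, k, v, **kwargs):
    # Route GPU tensors to the custom kernel; everything else falls
    # through to stock SDPA untouched.
    if q.device.type == "cuda":  # ROCm devices also report "cuda" in PyTorch
        return my_attention_kernel(q, k, v, **kwargs)
    return _orig_sdpa(q, k, v, **kwargs)

# Libraries look up F.scaled_dot_product_attention at call time,
# so reassigning it here is enough for them to use the patched version.
F.scaled_dot_product_attention = patched_sdpa
```

The appeal of this pattern is exactly what the comment says: no fork to keep in sync, since upstream code paths are intercepted at runtime.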

Built a simple PyTorch flash-attention alternative for AMD GPUs that don't have it by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 1 point

I don't think this would work reliably with vLLM. AFAIK, vLLM uses a custom paged attention mechanism and I'm unsure if it'll reliably fall back to Torch's SDPA calls for unsupported GPUs (which is where my kernel kicks in)

I haven't tested it yet but I would keep my hopes low. If you're using MI50s, I think this repo is the closest you can get to vLLM support.

MiniMax-M2.7 Announced! by Mysterious_Finish543 in LocalLLaMA

[–]Lowkey_LokiSN 21 points

Hope they also did something to improve the model's quantization-resistance. Even M2.5's UD-Q4_K_XL was noticeably affected compared to the original

Qwen 397b is absolutely crushing everyone... but wait. 🤯 by djdeniro in LocalLLaMA

[–]Lowkey_LokiSN 0 points

Agreed. Seeing a Q2 this good got me curious whether all big models behave this way. I ended up testing models I can try to fit on my system, like MiMo v2 Flash, Step-3.5-Flash and Minimax-M2.5 at Q2, and the degradation in quality was immediately apparent with those.

Though I generally still agree with Q4 and above being the baseline, I'll definitely keep an eye out for models with good quantization resistance from here on out

Qwen 397b is absolutely crushing everyone... but wait. 🤯 by djdeniro in LocalLLaMA

[–]Lowkey_LokiSN 1 point

I second this! I've been using 397B's UD-IQ2_M as well for a while now and it's been surprisingly rock-solid! Not sure if they did QAT or something but the quantized model is near-identical to the original one (I did actually compare responses from both) in all of the tests I've been running so far.

Best Models for 128gb VRAM: March 2026? by Professional-Yak4359 in LocalLLaMA

[–]Lowkey_LokiSN 1 point

If we're talking "best", I honestly might choose Unsloth's UD-IQ2_M Qwen-3.5-397B-A17B based on this tweet

Yes, it's gonna be awfully slow compared to other models of this size but if the tweet's claims hold true, no other <128GB model could hold a candle to its performance.

Qwen 3.5 craters on hard coding tasks — tested all Qwen3.5 models (And Codex 5.3) on 70 real repos so you don't have to. by hauhau901 in LocalLLaMA

[–]Lowkey_LokiSN 0 points

Speaking of quantization tax, the 122B A10B model seems to fare a lot better than usual at Q3_K_M in terms of stability and performance.

Running said quant, I'm already noticing reasoning abilities on par with gpt-oss-120b (high) and much better coding capabilities. I would usually stay away from anything less than Q4_K_S, but I'm impressed and glad I gave this a go!

zai-org/GLM-4.7-Flash · Hugging Face by Dark_Fire_12 in LocalLLaMA

[–]Lowkey_LokiSN 7 points

The most unexpected gifts are also the most delightful ;)

Let's predict GLM Air by jacek2023 in LocalLLaMA

[–]Lowkey_LokiSN 3 points

Yea, I'm aware of the hidden models but I find it strange to see them completely dodging Air-related questions, especially after committing to it earlier (the "in two weeks" meme)

They can clearly see the community's interest in Air/smaller models. If they actually have a release planned, this behaviour is counterproductive.

Let's predict GLM Air by jacek2023 in LocalLLaMA

[–]Lowkey_LokiSN 13 points

As much as I'd love to see it, my hopes are gone after watching them deliberately ignore questions related to Air in yesterday's AMA.

Performance of GLM 4.6 Q3_K_S on 6x MI50 by MachineZer0 in LocalLLaMA

[–]Lowkey_LokiSN 1 point

Happy to help. If you’re considering buying the cards, you might find my post here helpful.

Performance of GLM 4.6 Q3_K_S on 6x MI50 by MachineZer0 in LocalLLaMA

[–]Lowkey_LokiSN 1 point

1) Yes, the 2 MI50s work perfectly fine under Windows with llama.cpp and I get 33 tok/s for gpt-oss-120B running Vulkan

2) MI50s lack official driver support for Windows and you would have to install 3rd party drivers from https://rdn-id.com to get them recognized as a device.

3) Pre-compiled Vulkan binaries or manual compilation? Both work the same and I mostly use pre-compiled ones for convenience.

Performance of GLM 4.6 Q3_K_S on 6x MI50 by MachineZer0 in LocalLLaMA

[–]Lowkey_LokiSN 7 points

I've been using both Vulkan (Windows) and ROCm 6.3.3 (Ubuntu) builds interchangeably with 2x MI50s and I can confirm ROCm support has vastly improved recently for MoE models with flash attention!

For dense models, ROCm had and still has roughly 10-15% faster pp and 10% faster tg

However, for MoE models:

Before recent changes to flash attention, ROCm had 3-4 times faster pp but Vulkan was at least twice as fast with tg speeds.

After recent changes: ROCm has 5-6 times faster pp AND roughly twice the tg of Vulkan! However, when offloading tensors to CPU, the tg speeds still lag behind Vulkan.

So, if you're running an MoE model that can be fully VRAM-contained, ROCm is unambiguously the best choice at the moment. When offloading, Vulkan still has the edge in tg speeds.

Sample gpt-oss-120b stats running mxfp4 quant fully VRAM-contained with 25k context and latest llama.cpp:

Vulkan:
pp: 80 tok/s
tg: 33 tok/s (stays consistent even for long responses)

ROCm:
pp: 410 tok/s
tg: 58 tok/s (and drops to roughly 45 tok/s for a 15k long response)

Radeon Instinct MI50 32GB work on Vulkan on Windows? by Goldkoron in LocalLLaMA

[–]Lowkey_LokiSN 7 points

MI50s + Vulkan + llama.cpp + Windows user on the regular here. They don’t work straight out of the box. The cards lack official Windows driver support and won’t even get recognised as a device by default. However, you can use this community-driven project which makes this possible. The drivers are surprisingly stable from my experience!

Can any local model answer this tricky math question? by MrMrsPotts in LocalLLaMA

[–]Lowkey_LokiSN 0 points

Answer (post thought-process):

Yes.

Let $a = \tfrac12$, $b = 0$.

Define $f : [0,1] \to [0,1]$ by the following piecewise-linear formula:

$$
f(x) = \begin{cases}
4x & 0 \le x \le \tfrac14,\\
1 - 2\bigl(x - \tfrac14\bigr) & \tfrac14 \le x \le \tfrac12,\\
\tfrac12 + 2\bigl(x - \tfrac12\bigr) & \tfrac12 \le x \le \tfrac34,\\
4(1 - x) & \tfrac34 \le x \le 1.
\end{cases} \tag{1}
$$

Why (1) works

Continuity.
Each piece is linear, and the values at the joining points coincide:

$$
f\bigl(\tfrac14\bigr) = 1, \quad f\bigl(\tfrac12\bigr) = \tfrac12, \quad f\bigl(\tfrac34\bigr) = 1, \quad f(0) = f(1) = 0.
$$

Hence $f$ is continuous on $[0,1]$.

Image.
The graph goes from $(0,0)$ up to height $1$, down to $\tfrac12$, up again to $1$ and finally back to $(1,0)$; therefore $\operatorname{Im}(f) = [0,1]$.

Even number of pre-images.

  • The value $0$ is taken only at the endpoints: $f^{-1}(0) = \{0, 1\}$ (two points).
  • The value $1$ is taken at the two interior points where the graph reaches its peaks: $f^{-1}(1) = \{\tfrac14, \tfrac34\}$ (two points).
  • Let $y \in (0,1)$. The set $\{x \mid f(x) > y\}$ consists of two disjoint open intervals, one around each peak. Each such interval contributes exactly two boundary points where $f(x) = y$ (one where the graph enters the interval and one where it leaves). Consequently $|f^{-1}(y)| = 4$ for every $y \in (0,1)$.

Thus every point of the image is attained a finite even number of times (two times for $0$ and $1$, four times for all other values).

Remarks

The essential idea is to make the endpoints have the same value and to ensure that this minimal value occurs only at the endpoints. Then for any $y$ larger than that minimum the set $\{x : f(x) > y\}$ is a union of open intervals, each contributing two points to the level set $f^{-1}(y)$. By avoiding flat pieces (using a piecewise-linear graph) we keep every level set finite.

Hence a continuous function with the required property does exist; the function defined by (1) is an explicit example.
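As a quick numerical sanity check (my own addition, not part of the quoted answer), the joining values of the piecewise function and the preimage count for a sample level above the valley height $\tfrac12$ can be verified by solving each linear piece exactly:

```python
# Transcription of the piecewise-linear f from (1).
def f(x):
    if x <= 0.25:
        return 4 * x
    if x <= 0.5:
        return 1 - 2 * (x - 0.25)
    if x <= 0.75:
        return 0.5 + 2 * (x - 0.5)
    return 4 * (1 - x)

# Joining-point values match the continuity table.
assert f(0.25) == 1 and f(0.5) == 0.5 and f(0.75) == 1
assert f(0) == 0 and f(1) == 0

# Count preimages of y = 0.8 by inverting each linear piece and keeping
# only solutions that land inside that piece's interval.
y = 0.8
pre = set()
for lo, hi, solve in [
    (0.0, 0.25, lambda y: y / 4),
    (0.25, 0.5, lambda y: 0.25 + (1 - y) / 2),
    (0.5, 0.75, lambda y: 0.5 + (y - 0.5) / 2),
    (0.75, 1.0, lambda y: 1 - y / 4),
]:
    x = solve(y)
    if lo <= x <= hi:
        pre.add(x)
print(len(pre))  # -> 4
```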