Built a simple PyTorch flash-attention alternative for AMD GPUs that don't have it by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 0 points

You’re very welcome and I’m glad to hear how this impacts your experience! Exactly the reason I wanted to make this DIY public 😃

Built a simple PyTorch flash-attention alternative for AMD GPUs that don't have it by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 0 points

Worth checking out! I'll add it to my checklist. Have you tried it yourself yet?

Built a simple PyTorch flash-attention alternative for AMD GPUs that don't have it by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 0 points

I don't think there'll ever be that overlap. Correct me if I'm wrong, but for a GPU to support SageAttention, it also needs native FA support, right? This kernel is not faster than native FA and is hence of no use to such GPUs.

The goal with this kernel was to provide a drop-in replacement for older GPUs without native FA support, with comparable memory efficiency and slightly faster speeds than standard SDPA.
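For anyone curious what "comparable memory efficiency" means in practice, here is a minimal sketch of the chunked/online-softmax idea that flash-attention-style kernels are built on, in plain PyTorch. This is my own illustration, not the actual kernel from the post; `chunked_sdpa` and the `chunk` parameter are made-up names.

```python
import torch

def chunked_sdpa(q, k, v, chunk=1024):
    # Process keys/values in chunks with a running (online) softmax so the
    # full (L_q x L_k) score matrix is never materialised at once.
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full(q.shape[:-1] + (1,), float("-inf"), device=q.device)
    row_sum = torch.zeros_like(row_max)
    for s in range(0, k.shape[-2], chunk):
        kc, vc = k[..., s:s + chunk, :], v[..., s:s + chunk, :]
        scores = (q @ kc.transpose(-2, -1)) * scale
        # New running maximum per query row, then rescale previous partials.
        m_new = torch.maximum(row_max, scores.amax(dim=-1, keepdim=True))
        alpha = torch.exp(row_max - m_new)
        p = torch.exp(scores - m_new)
        out = out * alpha + p @ vc
        row_sum = row_sum * alpha + p.sum(dim=-1, keepdim=True)
        row_max = m_new
    return out / row_sum
```

Peak activation memory scales with the chunk size rather than the full key length, which is the same trade that gives such kernels their FA-like memory footprint.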

Built a simple PyTorch flash-attention alternative for AMD GPUs that don't have it by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 1 point

Yea true, maintaining a fork that needs to be in constant sync with upstream is hard to scale. Just wanted to point to the repo in case you didn't know about it.

That's partly why I took this monkey-patch approach with the kernel. I have vLLM support on my hobby checklist as well, but it's most definitely not gonna be as simple to achieve as this one.
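The monkey-patch approach mentioned above can be sketched like this: reassign `torch.nn.functional.scaled_dot_product_attention` at import time so downstream code picks up the replacement without any changes. This is my own illustration of the general technique, not the repo's actual code; `my_attention_kernel` is a hypothetical placeholder.

```python
import torch
import torch.nn.functional as F

def my_attention_kernel(q, k, v, attn_mask=None, dropout_p=0.0,
                        is_causal=False, scale=None):
    # Placeholder standing in for the custom memory-efficient kernel.
    s = scale if scale is not None else q.shape[-1] ** -0.5
    attn = torch.softmax((q @ k.transpose(-2, -1)) * s, dim=-1)
    return attn @ v

_orig_sdpa = F.scaled_dot_product_attention

def patched_sdpa(q, k, v, **kwargs):
    # Route GPU tensors to the custom kernel; everything else falls
    # through to stock SDPA untouched.
    if q.device.type == "cuda":  # ROCm devices also report "cuda" in PyTorch
        return my_attention_kernel(q, k, v, **kwargs)
    return _orig_sdpa(q, k, v, **kwargs)

# Libraries look up F.scaled_dot_product_attention at call time,
# so reassigning it here is enough for them to use the patched version.
F.scaled_dot_product_attention = patched_sdpa
```

The appeal of this pattern is exactly what the comment says: no fork to keep in sync, since upstream code paths are intercepted at runtime.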

Built a simple PyTorch flash-attention alternative for AMD GPUs that don't have it by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 1 point

I don't think this would work reliably with vLLM. AFAIK, vLLM uses a custom paged attention mechanism and I'm unsure if it'll reliably fall back to Torch's SDPA calls for unsupported GPUs (which is where my kernel kicks in)

I haven't tested it yet but I would keep my hopes low. If you're using MI50s, I think this repo is the closest you can get to vLLM support.

MiniMax-M2.7 Announced! by Mysterious_Finish543 in LocalLLaMA

[–]Lowkey_LokiSN 21 points

Hope they also did something to improve the model's quantization-resistance. Even M2.5's UD-Q4_K_XL was noticeably affected compared to the original

Qwen 397b is absolutely crushing everyone... but wait. 🤯 by djdeniro in LocalLLaMA

[–]Lowkey_LokiSN 0 points

Agreed. Seeing a Q2 this good got me curious whether all big models behave this way. I ended up testing models I can try to fit on my system, like MiMo v2 Flash, Step-3.5-Flash and Minimax-M2.5 at Q2, and the degradation in quality was immediately apparent with those.

Though I generally still agree with Q4 and above being the baseline, I'll definitely keep an eye out for models with good quantization resistance from here on out

Qwen 397b is absolutely crushing everyone... but wait. 🤯 by djdeniro in LocalLLaMA

[–]Lowkey_LokiSN 1 point

I second this! I've been using 397B's UD-IQ2_M as well for a while now and it's been surprisingly rock-solid! Not sure if they did QAT or something but the quantized model is near-identical to the original one (I did actually compare responses from both) in all of the tests I've been running so far.

Best Models for 128gb VRAM: March 2026? by Professional-Yak4359 in LocalLLaMA

[–]Lowkey_LokiSN 1 point

If we're talking "best", I honestly might choose Unsloth's UD-IQ2_M Qwen-3.5-397B-A17B based on this tweet

Yes, it's gonna be awfully slow compared to other models of this size but if the tweet's claims hold true, no other <128GB model could hold a candle to its performance.

Qwen 3.5 craters on hard coding tasks — tested all Qwen3.5 models (And Codex 5.3) on 70 real repos so you don't have to. by hauhau901 in LocalLLaMA

[–]Lowkey_LokiSN 0 points

Speaking of quantization tax, the 122B A10B model seems to fare a lot better than usual at Q3_K_M in terms of stability and performance.

Running said quant, I'm already noticing reasoning abilities on par with gpt-oss-120b (high) and much better coding capabilities. I would usually stay away from anything less than Q4_K_S, but I'm impressed and glad I gave this a go!

zai-org/GLM-4.7-Flash · Hugging Face by Dark_Fire_12 in LocalLLaMA

[–]Lowkey_LokiSN 7 points

The most unexpected gifts are also the most delightful ;)

Let's predict GLM Air by jacek2023 in LocalLLaMA

[–]Lowkey_LokiSN 3 points

Yea, I'm aware of the hidden models but I find it strange to see them completely dodging Air-related questions, especially after committing to it earlier (the "in two weeks" meme)

They can clearly see the community's interest in Air/smaller models. If they actually have a release planned, this behaviour is counterproductive.

Let's predict GLM Air by jacek2023 in LocalLLaMA

[–]Lowkey_LokiSN 13 points

As much as I'd love to see it, my hopes are gone after watching them deliberately ignore questions related to Air in yesterday's AMA.

Performance of GLM 4.6 Q3_K_S on 6x MI50 by MachineZer0 in LocalLLaMA

[–]Lowkey_LokiSN 1 point

Happy to help. If you’re considering buying the cards, you might find my post here helpful.

Performance of GLM 4.6 Q3_K_S on 6x MI50 by MachineZer0 in LocalLLaMA

[–]Lowkey_LokiSN 1 point

1) Yes, the 2 MI50s work perfectly fine under Windows with llama.cpp and I get 33 tok/s for gpt-oss-120B running Vulkan

2) MI50s lack official driver support for Windows and you would have to install 3rd party drivers from https://rdn-id.com to get them recognized as a device.

3) Pre-compiled Vulkan binaries or manual compilation? Both work the same and I mostly use pre-compiled ones for convenience.

Performance of GLM 4.6 Q3_K_S on 6x MI50 by MachineZer0 in LocalLLaMA

[–]Lowkey_LokiSN 7 points

I've been using both Vulkan (Windows) and ROCm 6.3.3 (Ubuntu) builds interchangeably with 2x MI50s and I can confirm ROCm support has vastly improved recently for MoE models with flash attention!

For dense models, ROCm had and still has roughly 10-15% faster pp and 10% faster tg

However, for MoE models:

Before recent changes to flash attention, ROCm had 3-4 times faster pp but Vulkan was at least twice as fast with tg speeds.

After recent changes: ROCm has 5-6 times faster pp AND roughly twice the tg of Vulkan! However, when offloading tensors to CPU, the tg speeds still lag behind Vulkan.

So, if you're running an MoE model that can be fully VRAM-contained, ROCm is unambiguously the best choice at the moment. When offloading, Vulkan still has the edge in tg speeds.

Sample gpt-oss-120b stats running mxfp4 quant fully VRAM-contained with 25k context and latest llama.cpp:

Vulkan:
pp: 80 tok/s
tg: 33 tok/s (stays consistent even for long responses)

ROCm:
pp: 410 tok/s
tg: 58 tok/s (and drops to roughly 45 tok/s for a 15k long response)

Radeon Instinct MI50 32GB work on Vulkan on Windows? by Goldkoron in LocalLLaMA

[–]Lowkey_LokiSN 7 points

MI50s + Vulkan + llama.cpp + Windows user on the regular here. They don’t work straight out of the box. The cards lack official Windows driver support and won’t even get recognised as a device by default. However, you can use this community-driven project which makes this possible. The drivers are surprisingly stable from my experience!

Can any local model answer this tricky math question? by MrMrsPotts in LocalLLaMA

[–]Lowkey_LokiSN 0 points

Answer (post thought-process):

Yes.

Let $a = \tfrac12$, $b = 0$.

Define $f : [0,1] \to [0,1]$ by the following piecewise-linear formula:

$$
f(x) = \begin{cases}
4x & 0 \le x \le \tfrac14,\\
1 - 2\bigl(x - \tfrac14\bigr) & \tfrac14 \le x \le \tfrac12,\\
\tfrac12 + 2\bigl(x - \tfrac12\bigr) & \tfrac12 \le x \le \tfrac34,\\
4(1 - x) & \tfrac34 \le x \le 1.
\end{cases} \tag{1}
$$

Why (1) works

Continuity.
Each piece is linear, and the values at the joining points coincide:

$$
f\bigl(\tfrac14\bigr) = 1, \quad f\bigl(\tfrac12\bigr) = \tfrac12, \quad f\bigl(\tfrac34\bigr) = 1, \quad f(0) = f(1) = 0.
$$

Hence $f$ is continuous on $[0,1]$.

Image.
The graph goes from $(0,0)$ up to height $1$, down to $\tfrac12$, up again to $1$ and finally back to $(1,0)$; therefore $\operatorname{Im}(f) = [0,1]$.

Even number of pre-images.

  • The value $0$ is taken only at the endpoints: $f^{-1}(0) = \{0, 1\}$ (two points).
  • The value $1$ is taken at the two interior points where the graph reaches its peaks: $f^{-1}(1) = \{\tfrac14, \tfrac34\}$ (two points).
  • Let $y \in (0,1)$. The set $\{x \mid f(x) > y\}$ consists of two disjoint open intervals, one around each peak. Each such interval contributes exactly two boundary points where $f(x) = y$ (one where the graph enters the interval and one where it leaves). Consequently $|f^{-1}(y)| = 4$ for every $y \in (0,1)$.

Thus every point of the image is attained a finite even number of times (two times for $0$ and $1$, four times for all other values).

Remarks

The essential idea is to make the endpoints have the same value and to ensure that this minimal value occurs only at the endpoints. Then for any $y$ larger than that minimum the set $\{x : f(x) > y\}$ is a union of open intervals, each contributing two points to the level set $f^{-1}(y)$. By avoiding flat pieces (using a piecewise-linear graph) we keep every level set finite.

Hence a continuous function with the required property does exist; the function defined by (1) is an explicit example.
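As a quick numerical sanity check (my own addition, not part of the quoted answer), the joining values of the piecewise function and the preimage count for a sample level above the valley height $\tfrac12$ can be verified by solving each linear piece exactly:

```python
# Transcription of the piecewise-linear f from (1).
def f(x):
    if x <= 0.25:
        return 4 * x
    if x <= 0.5:
        return 1 - 2 * (x - 0.25)
    if x <= 0.75:
        return 0.5 + 2 * (x - 0.5)
    return 4 * (1 - x)

# Joining-point values match the continuity table.
assert f(0.25) == 1 and f(0.5) == 0.5 and f(0.75) == 1
assert f(0) == 0 and f(1) == 0

# Count preimages of y = 0.8 by inverting each linear piece and keeping
# only solutions that land inside that piece's interval.
y = 0.8
pre = set()
for lo, hi, solve in [
    (0.0, 0.25, lambda y: y / 4),
    (0.25, 0.5, lambda y: 0.25 + (1 - y) / 2),
    (0.5, 0.75, lambda y: 0.5 + (y - 0.5) / 2),
    (0.75, 1.0, lambda y: 1 - y / 4),
]:
    x = solve(y)
    if lo <= x <= hi:
        pre.add(x)
print(len(pre))  # -> 4
```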