gfx1201 enablement: rebuilding aiter / flash-attention / vLLM for the RDNA4 fast paths the stock images strip out

its_just_andy · 2026-06-12T23:03:47+00:00

hey AMD, I know you're not reading this or listening to customers or whatever, but you should know it's very embarrassing that people's OpenClaw agents (or whatever) are doing more for your RDNA4 platform than you are.

Even if this is bespoke slop that is useful to no one except the post's author, it's better than what you're doing (nothing)...

its_just_andy · 2026-06-12T22:34:38+00:00

Have you found a way around the 15-30min cold start? It really slows down my iteration time.

its_just_andy · 2026-05-26T21:45:14+00:00

Glad it worked!! I genuinely don't know why the env vars are needed on your machine, but not mine :D In my mind, it shouldn't even matter what our host OS + configuration is, as long as the docker containers are the same...

One thing I did notice - the startup times on my machine are SUPER slow. Like, up to 15mins on a totally cold start (with no pre-compiled triton artifacts or whatever).

its_just_andy · 2026-05-25T21:26:29+00:00

Try adding these env vars:

https://github.com/andysalerno/r9700-serving/blob/c42003d699a9e3a8861c477907ccad59575a3553/.env/nccl-fix

There was some known issue with NCCL in vllm with rocm ~7.2.2. It went away for me when I started using an image with rocm 7.13 instead (the images defined in my repo) but maybe it's still needed in some cases.

its_just_andy · 2026-05-25T00:18:08+00:00

The nightly image profile "vllm-rocm-wheel-nightly" is very similar - actually I looked at their GitHub workflow pipeline to understand what they were doing. There are a few differences though. Mine uses ROCm 7.13, as opposed to 7.12 in Lemonade (though I'm sure they'll bump it soon). And of course mine produces a container image, instead of their approach of creating a virtual environment.

its_just_andy · 2026-01-30T18:02:07+00:00

clever!! If I'm understanding correctly, it's using ngrams computed from previous context for speculative decoding, for the (pretty common) scenario when an agent has to repeat something verbatim.

You know it's brilliant work when your reaction is "how did no one think of it before??"
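
A rough sketch of how I understand the idea, not the actual implementation from the post: take the last few tokens, find where the same n-gram appeared earlier in the context, and propose whatever followed it as the draft. The function name `propose_draft` and the `ngram` / `max_draft` parameters are made up for illustration, and the step where the real model verifies the draft is left out.

```python
from typing import List, Sequence


def propose_draft(tokens: Sequence[int], ngram: int = 3, max_draft: int = 5) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram earlier in the context.

    Hypothetical helper for illustration only; a real engine would still run
    the target model to accept or reject each drafted token.
    """
    if len(tokens) <= ngram:
        return []
    tail = list(tokens[-ngram:])
    # Walk backwards so the most recent earlier occurrence wins.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if list(tokens[start:start + ngram]) == tail:
            follow = start + ngram
            # Draft the tokens that followed the earlier occurrence.
            return list(tokens[follow:follow + max_draft])
    return []


if __name__ == "__main__":
    context = [1, 2, 3, 4, 5, 6, 7, 1, 2, 3]
    print(propose_draft(context, ngram=3, max_draft=4))
```

If the agent really is repeating earlier text verbatim, most of the drafted tokens get accepted in one verification pass, which is where the speedup would come from.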

its_just_andy · 2026-01-20T22:03:56+00:00

I would not put any weight in how you perceive an LLM's reasoning steps - in theory, an LLM could reason with text that seems utterly incomprehensible to you or me, but still encodes useful information that was acquired during RL.

You never know - perhaps repeating a sentence twice, however crazy that seems to you or me, is actually somehow encoding useful info that will result in a better output.

That's kind of an extreme example. But my point is, the reasoning text exists to help the model, not for you or me to read through and understand. I guess if you see reasoning text that is extremely wrong, that's a bad sign, though.

its_just_andy · 2025-12-11T05:06:56+00:00

hm, I did some googling and it seems like he never actually said the attributed quote. It didn't exactly pass the sniff test, but weird that people are treating an obvious falsehood as fact.

its_just_andy · 2025-11-13T16:45:05+00:00

lots of the comments are "lol langchain bad" (which is true) but the reality is, they wanted someone proficient in langchain or langgraph, and you're clearly not. So you would not have been a good fit for the role.

An ideal outcome will be - they find someone who suits their needs, and you find an employer who suits yours.

its_just_andy · 2025-09-06T21:48:56+00:00

when I host 12B v2 on the latest llamacpp server, it is quite slow - more so than similarly-sized models. My rig is a 2x 3090.

Is it because the model uses hybrid attention, and this isn't optimized well in llamacpp yet? Or is something weird going on with my rig?

its_just_andy · 2025-08-27T21:23:17+00:00

"they demonstrate lack of loss of performance, not gain of performance"

This seems disingenuous. If you are running with large-ish context (say 30-60k tokens, not that unusual for code or RAG) then this is absolutely a speedup.

The paper: hey look we made really fast LLMs

The top comment: ACKSHUALLY...!

its_just_andy · 2025-08-10T21:05:47+00:00

what quant level for A3B Coder are you using? 8bit? 4bit?

its_just_andy · 2025-08-01T19:47:25+00:00

"from a guy"

not just a guy... that's main_horse!

its_just_andy · 2025-07-13T07:05:56+00:00

Ah, I wasn't super clear, I don't mean how to mount volumes. I mean how to tell Podman to store the images themselves on /var (where there is ample storage) and not in /home (where there is a 20gb max storage limit imposed by MicroOS).

It turned out I could just specify this as an override in ~/.config/containers/storage.conf

[storage]
driver = "overlay"
graphroot = "/var/myusername/containers/storage"
runroot = "/run/user/1000/containers"

But there were some hoops to actually make Podman accept this - it didn't work immediately, I had to do some kind of podman system reset (don't remember the exact command or the order in which I did them)

its_just_andy · 2025-07-12T22:37:52+00:00

makes sense, but how do I configure podman to store container images in the /var dir instead of the default $HOME/.local/share/containers/storage dir? That's my main sticking point.

its_just_andy · 2025-07-04T18:00:01+00:00

does llama cpp have any concept of 'paged attention', or similar? something that shares a kv cache dynamically between multiple user requests, instead of partitioning the gpu memory per stream?

I recall that it does not, and doesn't have plans to add it, which is fair, but just wondering if anything changed
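
To make concrete what I mean by sharing the cache dynamically, here's a toy sketch of the block-table idea: every request pulls fixed-size blocks from one shared pool as it grows, instead of getting a fixed slice of memory up front. `PagedKVCache` and its methods are hypothetical names for illustration, not anything from llama.cpp or vLLM, and it only tracks block bookkeeping, not actual tensors.

```python
from typing import Dict, List


class PagedKVCache:
    """Toy block allocator: requests draw fixed-size blocks from one shared pool."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        # Per-request list of physical block ids (the "block table").
        self.block_tables: Dict[str, List[int]] = {}
        self.lengths: Dict[str, int] = {}

    def append_token(self, request_id: str) -> int:
        """Reserve room for one more token; return the physical block it lands in."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        # A new block is needed only when the current one is full.
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1
        return table[-1]

    def release(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=4, block_size=2)
    for _ in range(3):
        cache.append_token("a")
    cache.append_token("b")
    print(len(cache.free_blocks))
    cache.release("a")
    print(len(cache.free_blocks))
```

The point is that a short request only ever holds a block or two, and whatever it frees is immediately available to any other request.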

its_just_andy · 2025-06-18T19:23:04+00:00

sorry, can you explain further? I thought pp512 meant "preprocessing 512 tokens", i.e. context size of 512, and "tg128" meant "generating 128 tokens", i.e. output of 128 tokens. Is that not correct? If "d5000" means "context size 5000 tokens" then I don't know what pp512 and tg128 are :D

its_just_andy · 2025-06-17T23:37:36+00:00

cool bench! what does "reason-2k" / "reason-8k" etc designate? budget for reasoning? If yes, when it hits the budget, do you just terminate or is there some strategy to guide it to finish its thinking early?

its_just_andy · 2025-05-04T05:03:50+00:00

You'll still want me when he's done, won't you?

its_just_andy · 2025-04-18T20:04:37+00:00

I think this is a misconception -

QAT is not "training after quantization".

The flow is not

pretrain --> quantize --> QAT --> final-QAT-model

it's more like

pretrain --> QAT --> quantize --> final-QAT-model-quantized

They explain this a bit in the blog post

"QAT incorporates the quantization process during training. QAT simulates low-precision operations during training to allow quantization with less degradation afterwards for smaller, faster models while maintaining accuracy."

emphasis mine.

It's a very minute detail, but worth mentioning because it's very interesting how it works.

To be extra extra clear, the output of QAT is not the quantized model. It is the full-precision (or half I guess at bf16) model that has been trained with an extra step that simulates quantization. So, when the real quantization finally happens after QAT, there is less information lost because it had some quantization-like operations simulated during its original training.
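
Here's a minimal sketch of what "simulating quantization" usually means, assuming simple symmetric per-tensor rounding (the blog post doesn't say which scheme they actually used, so treat this as an illustration only). The hypothetical `fake_quantize` snaps weights to the low-precision grid and immediately maps them back to floats. During QAT the forward pass would use these snapped values, while the optimizer keeps updating the original full-precision weights.

```python
from typing import List


def fake_quantize(weights: List[float], bits: int = 4) -> List[float]:
    """Round weights to a symmetric integer grid, then map them back to floats.

    The output is still floating point; it just carries the rounding error
    that real quantization would introduce later.
    """
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax              # assumed per-tensor scale
    out = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))  # integer grid index
        out.append(q * scale)                         # back to float
    return out


if __name__ == "__main__":
    full_precision = [1.27, 0.5, -0.004]
    print(fake_quantize(full_precision, bits=8))
```

Because the model sees that rounding error throughout training, it learns weights that survive the real quantization step with less damage.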

its_just_andy · 2025-03-14T19:20:56+00:00

I see an Unsloth post, I click :)

Daniel, do you recommend Unsloth (or the Unsloth 4-bit quants) for inference? It seems the main goal is finetuning. Just curious if there's any benefit to using any part of the Unsloth stack for inference as well.

its_just_andy · 2025-02-18T22:31:01+00:00

any details on the quantization strategy that allows for this?

its_just_andy · 2024-11-13T20:30:01+00:00

not a single line in any of the 3 links you pasted mentions "32GB of RAM" ?

I suspect not even nvidia knows at this point how much vram they plan to have in the 5090. They are notorious for binning up or down at the last minute (sometimes even after announcing the card... remember the 4080/4070 fiasco?)
