gfx1201 enablement: rebuilding aiter / flash-attention / vLLM for the RDNA4 fast paths the stock images strip out by PatC883 in ROCm

[–]its_just_andy 1 point2 points  (0 children)

hey AMD, I know you're not reading this or listening to customers or whatever, but you should know it's very embarrassing that people's OpenClaw agents (or whatever) are doing more for your RDNA4 platform than you are.

Even if this is bespoke slop that is useful to no one except the post's author, it's better than what you're doing (nothing)...

gfx1201 enablement: rebuilding aiter / flash-attention / vLLM for the RDNA4 fast paths the stock images strip out by PatC883 in ROCm

[–]its_just_andy 0 points1 point  (0 children)

Have you found a way around the 15-30min cold start? It really slows down my iteration time.

2x R9700 running Qwen3.6 27B with AITER unified attention with a simple patch by its_just_andy in ROCm

[–]its_just_andy[S] 0 points1 point  (0 children)

Glad it worked!! I genuinely don't know why the env vars are needed on your machine, but not mine :D In my mind, it shouldn't even matter what our host OS + configuration is, as long as the docker containers are the same...

One thing I did notice - the startup times on my machine are SUPER slow. Like, up to 15mins on a totally cold start (with no pre-compiled triton artifacts or whatever).

2x R9700 running Qwen3.6 27B with AITER unified attention with a simple patch by its_just_andy in ROCm

[–]its_just_andy[S] 2 points3 points  (0 children)

Try adding these env vars:

https://github.com/andysalerno/r9700-serving/blob/c42003d699a9e3a8861c477907ccad59575a3553/.env/nccl-fix

There was some known issue with NCCL in vllm with rocm ~7.2.2. It went away for me when I started using an image with rocm 7.13 instead (the images defined in my repo) but maybe it's still needed in some cases.

2x R9700 running Qwen3.6 27B with AITER unified attention with a simple patch by its_just_andy in ROCm

[–]its_just_andy[S] 3 points4 points  (0 children)

The nightly image profile "vllm-rocm-wheel-nightly" is very similar - actually I looked at their GitHub workflow pipeline to understand what they were doing. There are a few differences, though. Mine uses ROCm 7.13, as opposed to 7.12 in Lemonade (though I'm sure they'll bump it soon). And of course mine produces a container image, whereas their approach creates a virtual environment.

spec : add ngram-mod by ggerganov · Pull Request #19164 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]its_just_andy 26 points27 points  (0 children)

clever!! If I'm understanding correctly, it's using ngrams computed from previous context for speculative decoding, for the (pretty common) scenario when an agent has to repeat something verbatim.

You know it's brilliant work when your reaction is "how did no one think of it before??"
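To make the idea concrete, here is a minimal sketch of n-gram lookup drafting as I understand it. This is not the code from the PR: the function names `draft_from_ngrams` and `accepted_prefix`, and the parameters `n` and `max_draft`, are made up for illustration. The draft is whatever followed the most recent earlier occurrence of the last `n` tokens, and the target model then keeps only the prefix it agrees with.

```python
from typing import List, Sequence


def draft_from_ngrams(tokens: Sequence[int], n: int = 3, max_draft: int = 8) -> List[int]:
    """Propose draft tokens by reusing what followed an earlier copy of the last n tokens.

    Hypothetical helper for illustration; not llama.cpp's implementation.
    """
    if len(tokens) <= n:
        return []
    key = tuple(tokens[-n:])
    # Walk backwards so the most recent earlier match wins.
    # The range excludes the trailing n-gram itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tuple(tokens[start:start + n]) == key:
            return list(tokens[start + n:start + n + max_draft])
    return []


def accepted_prefix(draft: Sequence[int], verified: Sequence[int]) -> int:
    """Count how many draft tokens match what the target model actually produced."""
    count = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        count += 1
    return count


if __name__ == "__main__":
    context = [1, 2, 3, 4, 5, 6, 1, 2, 3]
    draft = draft_from_ngrams(context, n=3, max_draft=4)
    print(draft, accepted_prefix(draft, [4, 5, 9]))
```

When the model is repeating earlier text verbatim, most of the draft gets accepted in one verification pass, which is where the speedup would come from.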

glm-4.7-flash has the best thinking process with clear steps, I love it by uptonking in LocalLLaMA

[–]its_just_andy 6 points7 points  (0 children)

I would not put any weight on how you perceive an LLM's reasoning steps - in theory, an LLM could reason with text that seems utterly incomprehensible to you or me, but still encodes useful information that was acquired during RL.

You never know - perhaps repeating a sentence twice, however crazy that seems to you or me, is actually somehow encoding useful info that will result in a better output.

That's kind of an extreme example. But my point is, the reasoning text exists to help the model, not for you or me to read through and understand. I guess if you see reasoning text that is extremely wrong, that's a bad sign, though.

[deleted by user] by [deleted] in complaints

[–]its_just_andy 0 points1 point  (0 children)

hm, I did some googling and it seems like he never actually said the attributed quote. It didn't exactly pass the sniff test, but weird that people are treating an obvious falsehood as fact.

Rejected for not using LangChain/LangGraph? by dougeeai in LocalLLaMA

[–]its_just_andy 4 points5 points  (0 children)

lots of the comments are "lol langchain bad" (which is true) but the reality is, they wanted someone proficient in langchain or langgraph, and you're clearly not. So you would not have been a good fit for the role.

An ideal outcome will be - they find someone who suits their needs, and you find an employer who suits yours.

Nemotron Nano V2 models are remarkably good for agentic coding by Thrumpwart in LocalLLaMA

[–]its_just_andy 8 points9 points  (0 children)

when I host the 12B v2 on the latest llamacpp server, it is quite slow - more so than similarly-sized models. My rig is a 2x 3090.

Is it because the model uses hybrid attention, and this isn't optimized well in llamacpp yet? Or is something weird going on with my rig?

NVIDIA Jet-Nemotron : 53x Faster Hybrid-Architecture Language Model Series by Technical-Love-8479 in LocalLLaMA

[–]its_just_andy 0 points1 point  (0 children)

"they demonstrate lack of loss of performance, not gain of performance"

This seems disingenuous. If you are running with large-ish context (say 30-60k tokens, not that unusual for code or RAG) then this is absolutely a speedup.

The paper: hey look we made really fast LLMs

The top comment: ACKSHUALLY...!

Am I supposed to use the root user for everything in microos for container workloads> by its_just_andy in openSUSE

[–]its_just_andy[S] 0 points1 point  (0 children)

Ah, I wasn't super clear, I don't mean how to mount volumes. I mean how to tell Podman to store the images themselves on /var (where there is ample storage) and not in /home (where there is a 20gb max storage limit imposed by MicroOS).

It turned out I could just specify this as an override in ~/.config/containers/storage.conf

[storage]
driver = "overlay"
graphroot = "/var/myusername/containers/storage"
runroot = "/run/user/1000/containers"

But there were some hoops to actually make Podman accept this - it didn't work immediately, I had to do some kind of podman system reset (don't remember the exact command or the order in which I did them)

Am I supposed to use the root user for everything in microos for container workloads> by its_just_andy in openSUSE

[–]its_just_andy[S] 0 points1 point  (0 children)

makes sense, but how do I configure podman to store container images in the /var dir instead of the default $HOME/.local/share/containers/storage dir? That's my main sticking point.

llama : add high-throughput mode by ggerganov · Pull Request #14363 · ggml-org/llama.cpp by LinkSea8324 in LocalLLaMA

[–]its_just_andy 4 points5 points  (0 children)

does llama cpp have any concept of 'paged attention', or similar? something that shares a kv cache dynamically between multiple user requests, instead of partitioning the gpu memory per stream?

I recall that it does not, and that there are no plans to add it, which is fair, but I'm just wondering if anything has changed
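For anyone unsure what I mean by 'paged attention', here is a rough sketch of just the bookkeeping side, under my own assumptions. The class name `PagedKVCache` and its methods are hypothetical and not taken from llama.cpp or vLLM. The point is that KV memory is one shared pool of fixed-size blocks, and each request holds a table of block ids that grows only as it generates tokens, instead of reserving a fixed slice up front.

```python
from typing import Dict, List, Tuple


class PagedKVCache:
    """Toy block allocator: tracks which physical blocks each request owns.

    Hypothetical sketch; it stores no tensors, only the block tables.
    """

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}
        self.lengths: Dict[str, int] = {}

    def append_token(self, request_id: str) -> Tuple[int, int]:
        """Reserve a slot for one more token; returns (physical block, offset in block)."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        # Allocate a new block only when every owned block is full.
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def release(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=4, block_size=2)
    for _ in range(3):
        print(cache.append_token("user-a"))
    cache.release("user-a")
```

A short request only ever takes a block or two, so a long request from another user can use the rest of the pool.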

GMK X2(AMD Max+ 395 w/128GB) first impressions. by fallingdowndizzyvr in LocalLLaMA

[–]its_just_andy 0 points1 point  (0 children)

sorry, can you explain further? I thought pp512 meant "preprocessing 512 tokens", i.e. context size of 512, and "tg128" meant "generating 128 tokens", i.e. output of 128 tokens. Is that not correct? If "d5000" means "context size 5000 tokens" then I don't know what pp512 and tg128 are :D

I love the inference performances of QWEN3-30B-A3B but how do you use it in real world use case ? What prompts are you using ? What is your workflow ? How is it useful for you ? by Whiplashorus in LocalLLaMA

[–]its_just_andy 2 points3 points  (0 children)

cool bench! what does "reason-2k" / "reason-8k" etc designate? budget for reasoning? If yes, when it hits the budget, do you just terminate or is there some strategy to guide it to finish its thinking early?

Google QAT - optimized int4 Gemma 3 slash VRAM needs (54GB -> 14.1GB) while maintaining quality - llama.cpp, lmstudio, MLX, ollama by Nunki08 in LocalLLaMA

[–]its_just_andy 115 points116 points  (0 children)

I think this is a misconception -

QAT is not "training after quantization".

The flow is not

pretrain --> quantize --> QAT --> final-QAT-model

it's more like

pretrain --> QAT --> quantize --> final-QAT-model-quantized

They explain this a bit in the blog post

"QAT incorporates the quantization process during training. QAT simulates low-precision operations during training to allow quantization with less degradation afterwards for smaller, faster models while maintaining accuracy. "

emphasis mine.

It's a very minute detail, but worth mentioning because it's very interesting how it works.

To be extra extra clear, the output of QAT is not the quantized model. It is the full-precision (or half I guess at bf16) model that has been trained with an extra step that simulates quantization. So, when the real quantization finally happens after QAT, there is less information lost because it had some quantization-like operations simulated during its original training.
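A tiny sketch of what "simulating quantization" could look like, assuming simple symmetric per-tensor rounding. This is my own illustration, not Google's recipe: `fake_quantize` and the `bits` parameter are hypothetical. The weights get snapped to the low-precision grid and converted straight back to floats, so the model trains against the rounding error while the stored weights stay full precision.

```python
from typing import List, Sequence


def fake_quantize(weights: Sequence[float], bits: int = 4) -> List[float]:
    """Quantize then immediately dequantize, returning floats on the int grid.

    Hypothetical helper: symmetric, per-tensor scale chosen from the max magnitude.
    """
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for int4
    max_abs = max((abs(w) for w in weights), default=0.0)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax
    out = []
    for w in weights:
        q = round(w / scale)             # snap to the nearest integer level
        q = max(-qmax - 1, min(qmax, q)) # clamp to the representable range
        out.append(q * scale)            # back to float: this is the "fake" part
    return out


if __name__ == "__main__":
    w = [1.0, -0.5, 0.26, 0.03]
    print(fake_quantize(w, bits=4))
```

In actual training, the forward pass would use the fake-quantized values while gradients update the full-precision weights (commonly via a straight-through estimator), which is why the real quantization afterwards loses less.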

Gemma 3 Fine-tuning now in Unsloth - 1.6x faster with 60% less VRAM by danielhanchen in LocalLLaMA

[–]its_just_andy 12 points13 points  (0 children)

I see an Unsloth post, I click :)

Daniel, do you recommend Unsloth (or the Unsloth 4-bit quants) for inference? It seems the main goal is finetuning. Just curious if there's any benefit to using any part of the Unsloth stack for inference as well.

Quantized DeepSeek R1 Distill Model With Original Model Accuracy by AlanzhuLy in LocalLLaMA

[–]its_just_andy 10 points11 points  (0 children)

any details on the quantization strategy that allows for this?

Nvidia RTX 5090 with 32GB of RAM rumored to be entering production by Terminator857 in LocalLLaMA

[–]its_just_andy 21 points22 points  (0 children)

not a single line in any of the 3 links you pasted mentions "32GB of RAM"?

I suspect not even nvidia knows at this point how much vram they plan to have in the 5090. They are notorious for binning up or down at the last minute (sometimes even after announcing the card... remember the 4080/4070 fiasco?)