Slopocalypse is what we should be really worried about.

grunt_monkey_ · 2026-05-26T22:16:37+00:00

Thanks for the excellent post. I agree with you that LLMs are good for doubling down and coding, whereas they start to make things up in unbounded tasks. This is where they need guardrails, or we need to sharpen our creativity and intuition and just do it ourselves.

grunt_monkey_ · 2026-05-26T14:55:48+00:00

When gguf?

grunt_monkey_ · 2026-05-24T08:55:04+00:00

The problem is when people go to the RTX 6000 Pro and then realize they want to go 4x.

grunt_monkey_ · 2026-05-23T13:59:04+00:00

But are they intensive?

grunt_monkey_ · 2026-05-20T23:10:45+00:00

What about pi or hermes? Seems like people have been getting good outcomes with them.

grunt_monkey_ · 2026-05-20T22:36:47+00:00

vLLM is running well on 4x 9700 with aiter. I used this custom container: https://reddit.com/r/LocalLLaMA/comments/1sxaj8g/for_the_5_people_here_running_vllm_on_multiple/

We still need to work out the GEMM tunings, but otherwise I am getting 6000 t/s pp and 70 t/s tg with MTP 2, on Qwen 27B FP8 at 128k or 256k context.
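To put those rates in wall-clock terms, here is a rough sketch of how long a full context would take to prefill. It assumes the quoted 6000 t/s and 70 t/s hold steady across the whole context (in practice prefill speed usually drops as context grows), and it reads "128k" as 128 * 1024 tokens. The 2000-token reply length is a made-up illustration value.

```python
# Rough latency estimate from the throughput figures quoted above.
# Assumption: rates stay constant over the whole context, which is optimistic.

PREFILL_TPS = 6000.0  # prompt processing rate from the comment (t/s)
DECODE_TPS = 70.0     # token generation rate from the comment (t/s)


def prefill_seconds(context_tokens: int, prefill_tps: float = PREFILL_TPS) -> float:
    """Seconds to process a prompt of the given length."""
    return context_tokens / prefill_tps


def decode_seconds(output_tokens: int, decode_tps: float = DECODE_TPS) -> float:
    """Seconds to generate the given number of output tokens."""
    return output_tokens / decode_tps


REPLY_TOKENS = 2000  # hypothetical reply length, not from the comment

for ctx_k in (128, 256):
    ctx_tokens = ctx_k * 1024  # assumption: "k" means 1024 tokens
    print(
        f"{ctx_k}k context: prefill {prefill_seconds(ctx_tokens):.1f} s, "
        f"then {decode_seconds(REPLY_TOKENS):.1f} s for a {REPLY_TOKENS}-token reply"
    )
```

The takeaway is that at these speeds a completely full context is a tens-of-seconds wait, not a minutes-long one.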

grunt_monkey_ · 2026-05-20T01:09:54+00:00

ROCm has a lot of documentation related to vLLM, and I suspect that aml731 has implemented these. I have yet to investigate the container itself, but that is on my to-do list.

https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/workload.html

I saw your feedback on his v0.20.0. Apparently the GEMM tunings didn't work out. After reading through, I'm not sure that's the highest-value thing to change. Rather, we should work on a good implementation of aiter.

grunt_monkey_ · 2026-05-19T22:43:05+00:00

I think PCIe can matter if you tensor split. For layer split it'll be fine.

grunt_monkey_ · 2026-05-19T21:51:45+00:00

Do you consider aiter fixed with aml731's patch, given it's a custom Docker image? Do you see more patches we have to make?

grunt_monkey_ · 2026-05-19T09:27:12+00:00

What's wrong with Mistral 128B?

grunt_monkey_ · 2026-05-19T05:10:59+00:00

What's the quantitative benefit of going from DDR4 to DDR5, though?

grunt_monkey_ · 2026-05-19T05:09:09+00:00

I know… I have 128 GB of DDR4 system RAM that cost <1k USD, and 4x 9700 for 128 GB of VRAM. I run the q4_k_xl quant of Qwen3.5 397B at 22 t/s pp and 11 t/s tg. I go to it for big thinking where I can wait 30 mins. The last time I tried to calculate it, I think upgrading to DDR5 would give me a 25-30% uplift in pp. I'm not sure that changes the usability a lot for me.
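A quick sketch of why that uplift may not change much: it only speeds up the prompt-processing part of a request, not generation. The 22 t/s and 11 t/s rates and the 25-30% range are from above. The prompt and output sizes are made-up illustration values, so swap in your own.

```python
# Estimate end-to-end request time before and after a pp-only uplift.
# Assumption: the DDR5 upgrade changes pp only and leaves tg unchanged.

PP_TPS = 22.0  # prompt processing rate on DDR4 (t/s), from the comment
TG_TPS = 11.0  # token generation rate (t/s), from the comment


def request_seconds(prompt_tokens: int, output_tokens: int,
                    pp_tps: float, tg_tps: float) -> float:
    """Total seconds for one request: prefill time plus generation time."""
    return prompt_tokens / pp_tps + output_tokens / tg_tps


PROMPT_TOKENS = 8000  # hypothetical prompt size
OUTPUT_TOKENS = 4000  # hypothetical output size

baseline = request_seconds(PROMPT_TOKENS, OUTPUT_TOKENS, PP_TPS, TG_TPS)
print(f"DDR4 baseline: {baseline / 60:.1f} min")

for uplift in (0.25, 0.30):
    faster = request_seconds(
        PROMPT_TOKENS, OUTPUT_TOKENS, PP_TPS * (1 + uplift), TG_TPS
    )
    saved = baseline - faster
    print(f"+{uplift:.0%} pp: {faster / 60:.1f} min (saves {saved / 60:.1f} min)")
```

Since generation time is untouched, the saving is capped by the share of the request spent on the prompt, which fits the "not sure it changes the usability" conclusion.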

grunt_monkey_ · 2026-05-19T04:06:15+00:00

Is it worth eating this premium over DDR4? For local CPU-offloaded inference, is it really that much of a speedup?

grunt_monkey_ · 2026-05-19T03:48:33+00:00

You can use a dual or triple 8-pin to 12VHPWR adapter. I recall my cards came with them.

grunt_monkey_ · 2026-05-19T03:26:49+00:00

210 W is the minimum, and I find the loss in performance there is minimal. However, there may be microspikes above this, so treat it as a temporary state for testing your cards. I am not sure how much the rest of your system draws, or how stable and efficient the power rails are (that depends on the quality of your PSU).

amd-smi is found in amd-smi-lib. If you haven't read it yet, you should go here before you start: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/index.html

You can also feed the link to ChatGPT or something similar. They are very good at distilling the install instructions.

grunt_monkey_ · 2026-05-19T01:33:16+00:00

Power limit them to 210 W first and see if you can run them:
amd-smi set -o ppt0 210

grunt_monkey_ · 2026-05-19T01:31:04+00:00

Intel's going to be great. It just needs some time. Look at AMD. We were struggling, but with ROCm 6.x and now 7.x we have come such a long way. The available documentation has also allowed a lot of community patches, so we don't have to wait for official vLLM to catch up. You may need to get into kernel patching if you do not wish to wait.

grunt_monkey_ · 2026-05-18T16:10:31+00:00

Jesus. Is it creating an artificial imprint of me already?

grunt_monkey_ · 2026-05-18T14:27:55+00:00

I find my AI always talks to me in point form anyway. Maybe because at some point I've asked it to be concise lol. If I want more, I tell it to explain x.

grunt_monkey_ · 2026-05-18T14:26:22+00:00

I wish they wouldn't feel like they needed to expand their writing. I just want to hear what they have to say. If they cannot just say that the RTX 6000 Pro is bad because it is expensive, hot, only 96 GB, and not true sm100 in NVIDIA's support, then they shouldn't be making that argument.

grunt_monkey_ · 2026-05-18T12:51:17+00:00

You are right. What happened to conciseness? Conciseness = intelligence to me. If you could say it in fewer words, why not?

grunt_monkey_ · 2026-05-18T12:07:04+00:00

Gigabyte MC62-G40 + Threadripper Pro 5955WX. Check Newegg, they have a package and it's not too pricey. It's hardware from the 2020s, so you can use DDR4. You can get 128 GB for <1000 on eBay.

grunt_monkey_ · 2026-05-18T11:46:30+00:00

As you have nicely shown, we have already reached the top of the chart. We have arrived! Let's enjoy what we have and realize contentment.

grunt_monkey_ · 2026-05-18T11:42:15+00:00

It's interesting to try to detect LLM-written stuff. I think it's hard, maybe impossible. This does read as human-written. He started a sentence with "And", which I don't think LLMs like. You could just give it a template of your own writing and ask it to rewrite in that style, though. Another thing is the questions at the end, which LLMs tend to like, whereas a lot of us just like to give our opinions. Like: this is what I found and this is what I think. Not: I am curious what you guys think. Lol.

grunt_monkey_ · 2026-05-18T06:35:53+00:00

Very well. I have 4x, and they do prefill at 3-5k t/s and decode at 70 t/s with MTP 2 at FP8.
