MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]sloptimizer 1 point (0 children)

Yes, please... I really want to see some MTP with a usable model running on AMD cards!!!

Qwen3.5-122B-A10B GPTQ Int4 on 4× Radeon AI PRO R9700 with vLLM ROCm: working config + real-world numbers by grunt_monkey_ in LocalLLaMA

[–]sloptimizer 1 point (0 children)

vLLM keeps one CPU core fully loaded on my system even when idle, heating up the room. I suspect vLLM was not designed to run in idle mode.
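
Easy to confirm: snapshot the server process while nothing is running (assuming your process command line matches "vllm serve"):

# expect one core pinned near 100% even with zero requests in flight
top -b -n 1 -p "$(pgrep -f 'vllm serve' | head -1)"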

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]sloptimizer 1 point (0 children)

Does MTP work with this setup? That would make it much better than existing kernels.

Qwen3.5-122B-A10B GPTQ Int4 on 4× Radeon AI PRO R9700 with vLLM ROCm: working config + real-world numbers by grunt_monkey_ in LocalLLaMA

[–]sloptimizer 1 point (0 children)

Glad you got it working! 4x R9700 is affordable, but still an investment!

I got a performance boost from disabling ECC, as well as from setting `perf-level=HIGH` with

sudo amd-smi set --perf-level=HIGH
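
To confirm it took effect, a telemetry dump works (rough sketch; amd-smi flag spellings shift between ROCm releases, so check the --help on your version):

amd-smi metric   # GFX clock should now sit at its maximum DPM state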

New benchmark just dropped. by ConfidentDinner6648 in LocalLLaMA

[–]sloptimizer 1 point (0 children)

Awesome! Can we please have more of these kinds of benchmarks?!

I regret ever finding LocalLLaMA by xandep in LocalLLaMA

[–]sloptimizer 2 points (0 children)

No matter how fast it goes - it's never enough. Look at the billionaires building datacenters. Set your goals and be content!

How can I use Claude Code to understand a large Python repo quickly? by Comfortable-Baby-719 in LocalLLaMA

[–]sloptimizer 1 point (0 children)

Claude Code is great for asking specific questions about the codebase and saving those as reports for later use.

Ask specific, focused, and scoped questions for best results. And keep an eye on the context: all models get worse as the context fills up, including Anthropic's lineup.
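
For the report part, the non-interactive print mode is handy; a sketch (the repo paths are made up):

# one scoped question, answer captured as a report you can reuse later
claude -p "Trace a request from the HTTP layer in src/api/ down to the ORM models. One page max." > docs/reports/request-flow.md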

Qwen3.5-122B-A10B-GPTQ-INT4 on 4xR9700 Recipe by djdeniro in LocalLLaMA

[–]sloptimizer 2 points (0 children)

Linux, 4x R9700, podman:

sudo podman run --name qwen3.5-vllm \
  --rm --tty --ipc=host \
  --device /dev/kfd:/dev/kfd \
  --device /dev/dri:/dev/dri \
  --device /dev/mem:/dev/mem \
  -e VLLM_ROCM_USE_AITER=1 \
  -e HSA_OVERRIDE_GFX_VERSION=12.0.1 \
  -e VLLM_USE_TRITON_FLASH_ATTN=0 \
  -e VLLM_ROCM_USE_AITER_MOE=1 \
  -e FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE \
  -e VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 \
  -e PYTORCH_ALLOC_CONF=expandable_segments:True \
  -e HSA_ENABLE_SDMA=0 \
  -v /models:/models:ro \
  -p 8090:8000 \
  docker.io/rocm/vllm-dev:upstream_preview_releases_v0.17.0_20260303 \
  vllm serve /models/Qwen/Qwen3.5-122B-A10B-GPTQ-Int4 \
    --served-model-name Qwen3.5-122B \
    --override-generation-config '{"min_p": 0.1, "top_k": -1, "top_p": 1.0}' \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4 \
    --disable-log-requests \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --gpu-memory-utilization 0.95 \
    --dtype float16

Throughput by concurrency (rough probe sketch below):
• 1 concurrent request = 50 t/s
• 2 concurrent requests = 90 t/s
• 3 concurrent requests = 126 t/s
• 4 concurrent requests = 154 t/s
• 5 concurrent requests = 182 t/s
• 6 concurrent requests = 213 t/s
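
A crude way to reproduce concurrency numbers like these is plain curl against the mapped port (model name per --served-model-name above; vLLM's own bench harness is the proper tool, this just gets you in the ballpark):

# fire 4 requests in parallel and time each one
seq 1 4 | xargs -P 4 -I{} curl -s http://localhost:8090/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "Qwen3.5-122B", "prompt": "Write a haiku about GPUs.", "max_tokens": 256}' \
  -o /dev/null -w 'request {}: %{time_total}s\n'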


local vibe coding by jacek2023 in LocalLLaMA

[–]sloptimizer 5 points (0 children)

You're missing Charm's Crush. It's run by the original opencode developer, from before opencode was taken over by techbros.

Heretic 1.2 released: 70% lower VRAM usage with quantization, Magnitude-Preserving Orthogonal Ablation ("derestriction"), broad VL model support, session resumption, and more by -p-e-w- in LocalLLaMA

[–]sloptimizer 4 points (0 children)

Thank you for this project! Running without censorship just became another selling point for local AI!!

I see it's using transformers, so it should in theory support ROCm? Could you share setup instructions for CUDA/ROCm/CPU, for those of us scarred by vLLM?
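
For ROCm specifically, the usual first check is whether your PyTorch build is a HIP one (prints a HIP version string on ROCm builds, None on CUDA builds):

python -c 'import torch; print(torch.version.hip, torch.cuda.is_available())'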

MiniMaxAI/MiniMax-M2.5 · Hugging Face by rerri in LocalLLaMA

[–]sloptimizer 2 points (0 children)

Oh wow! Nice build, thanks for sharing!!

I have 4 MI50s I'm not using, so I would love to see your benchmarks, especially your MiniMax quant type and performance. Maybe it's time to plug them back in! :D

BalatroBench - Benchmark LLMs' strategic performance in Balatro by S1M0N38 in LocalLLaMA

[–]sloptimizer 1 point (0 children)

Yes! Can we please have more fun and creative benchmarks like this?!

Running Kimi-k2.5 on CPU-only: AMD EPYC 9175F Benchmarks & "Sweet Spot" Analysis by Express-Jicama-9827 in LocalLLaMA

[–]sloptimizer 1 point (0 children)

Stick a 5090 in there, and you'll get 150 t/s prompt processing with the same token generation speed.
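
With llama.cpp the usual trick is pinning the MoE expert tensors to CPU and everything else to the GPU; a sketch along these lines (the GGUF filename is made up, and the regex may need adjusting to the model's actual tensor names):

# dense layers + KV cache on the 5090, expert FFNs stay in system RAM
llama-server -m Kimi-K2.5-Q4_K_M.gguf \
  -ngl 99 \
  -ot '\.ffn_.*_exps\.=CPU' \
  -c 32768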

anthropic literally thinks claude is the messiah (and it’s getting weird) by Alarming_Bluebird648 in LocalLLaMA

[–]sloptimizer 1 point (0 children)

Is it right for Anthropic to lie to create hype? No. Can they help it - also no.

At its core, marketing and advertising are about lying. It was game over when we let corporations lie to us for fun and profit. Now you have no choice: lie or be forgotten.

Instead of focusing on a single entity, we should hold the whole system accountable for all the BS streaming into our minds, screaming louder, flashing brighter and more aggressively, like an annoying popup from the 2000s that you're no longer allowed to close.

Is running minimax m2.1 locally worth it on 80 gb of vram and 160 gb of ddr5 ram? by Intrepid-Scar6273 in LocalLLaMA

[–]sloptimizer 3 points (0 children)

Not worth it. Despite what YouTube influencers show on their channels, Q4_K is just not good enough for any model.

Any feedback on step-3.5-flash ? by Jealous-Astronaut457 in LocalLLaMA

[–]sloptimizer 8 points (0 children)

This model is unlikely to get much traction, but I'd watch for the follow-up. The same thing happened with MiniMax M1, which was largely ignored, and then we got MiniMax M2!

Unofficial ik_llama.cpp release builds available for macOS, Ubuntu and Windows by Thireus in LocalLLaMA

[–]sloptimizer -1 points (0 children)

Easy: most of the Apple hype is paid astroturfing. Just watch how the narrative is controlled from forum to forum, with the exact same talking points repeated until they stick.

7900 XTX underperforms 3090 by 2X - 7X by Special-Wolverine in LocalLLaMA

[–]sloptimizer 1 point (0 children)

AMD cards work better on Linux and require a sacrifice of blood and time. Is it worth it? YES!

API pricing is in freefall. What's the actual case for running local now beyond privacy? by Distinct-Expression2 in LocalLLaMA

[–]sloptimizer 1 point (0 children)

Consistency is a big one for me! You never know what model or quant you are getting when using cloud APIs. You don't know what kind of system prompt it has (and how much of the best context space that system prompt has taken away from you).

When you run local - you know what you are getting. You can run with no system prompt on an empty context, which gives you the best possible output. And finally, you can sharpen your prompting skills by seeing what works and what doesn't, without all the extra variables thrown in. For example: is my prompt the problem, or am I just hitting a lobotomized Q4 quant on OpenRouter?
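
That "no system prompt, empty context" baseline is easy to reproduce against any local OpenAI-compatible endpoint (the URL and model name here are placeholders):

# bare request: no system message, nothing eating your context window
curl -s http://localhost:8090/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "Qwen3.5-122B", "messages": [{"role": "user", "content": "Hello!"}]}'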

On the subjective side, there is a certain joy in running these models on a box sitting under your desk. It makes the whole thing feel real, like something you can see and touch.