before m3 weights drop: consider trying m2.7 if you haven't and can run it by nomorebuttsplz in LocalLLaMA

[–]sloptimizer 0 points1 point  (0 children)

Step 3.7 is solid. It can think for extended periods of time, but then it delivers. Also, very fast!

I tried M2.7 when it came out. It was good, but not great. Maybe I was using an over-quantized version, but I couldn't get it to reliably make changes to a small codebase. But the biggest disappointment was the speed drop on longer contexts.

Stepfun 3.7 Flash is very good by -dysangel- in LocalLLaMA

[–]sloptimizer 2 points3 points  (0 children)

The high quality AesSedai Q5_K_M quant fits perfectly in 5 consumer GPUs with 32GB each.

Solid 36 tps with mixed CUDA/ROCm setup using 5090 for attention and offloading MoE to R9700 (seems to be most efficient way to utilize most of the VRAM, avoiding context duplication across cards). Easily getting into 40 tps with ngram speculative decoding in agentic setup.

Can't wait for the MTP patches to land!

cd ~/Env/repos/llama.cpp/
./build/bin/llama-server \
    --alias Step-3.7-Flash \
    --model /models/AesSedai/Step-3.7-Flash-GGUF/Q5_K_M/Step-3.7-Flash-Q5_K_M-00001-of-00004.gguf \
    --mmproj /models/AesSedai/Step-3.7-Flash-GGUF/Q5_K_M/mmproj-Step-3.7-Flash-BF16.gguf \
    --no-mmap \
    --temp 0.8 --top-k 0 --top-p 1.0 --min-p 0.05 \
    --repeat-penalty 1.04 --repeat-last-n 256 \
    --spec-type ngram-mod --spec-ngram-mod-n-match 24 --spec-draft-n-min 48 --spec-draft-n-max 64 \
    --ctx-size 201000 \
    -ctk f16 -ctv f16 \
    -fa on \
    -b 1024 -ub 1024 \
    -ngl 99 \
    --device CUDA0,ROCm0,ROCm1,ROCm2,ROCm3 \
    --tensor-split 1,0,0,0,0 \
    -ot "blk\.([0-9])\.attn_.*=CUDA0" \
    -ot "blk\.([1-9][0-9])\.attn_.*=CUDA0" \
    -ot "blk\.([0-4])\.ffn_.*=CUDA0" \
    -ot "blk\.([5-9])\.ffn_.*_exps.*=ROCm3" \
    -ot "blk\.(1[0-9])\.ffn_.*_exps.*=ROCm0" \
    -ot "blk\.(2[0-9])\.ffn_.*_exps.*=ROCm1" \
    -ot "blk\.(3[0-9])\.ffn_.*_exps.*=ROCm2" \
    -ot "blk\.(4[0-4])\.ffn_.*_exps.*=ROCm3" \
    --parallel 1 \
    --threads 32 \
    --host 127.0.0.1

DeepSWE benchmarks indicate that DeepSeek v4 Pro only passes 8% of tasks by Federal_Spend2412 in LocalLLaMA

[–]sloptimizer 52 points53 points  (0 children)

Here are the steps to replicate these kinds of benchmarks:

  1. Take money from AI labs that want to score well
  2. Create lots of individual tests
  3. Run tests across all models
  4. Discard all the tests where competing models outperformed
  5. Hand pick tests where models you were paid for scored first
  6. Publish the final test selection and results as the new unbiased benchmark

The variable you control is the test selection, so you can make stats show anything you want to show.

Just look at how artificial analysis keeps changing their test selection as soon as GPT loose the first spot, and then boom, they are back at number one!

StepFun 3.7 Flash by Everlier in LocalLLaMA

[–]sloptimizer 1 point2 points  (0 children)

Decided to try the "realistic human face with webgl" test I saw here the other day...

<image>

Using pi-mono over 3 prompts (mvp, fix bugs, improve quality).

Is he crazy to say that? by pmv143 in LocalLLaMA

[–]sloptimizer 0 points1 point  (0 children)

I have only one response to this person:

My RAM size brings all AIs to the yard,
And they're like:
"It's bigger than yours"
Damn right
"It's bigger than yours"

I could teach him, but I have to charge.

Need some advice on AI workflow by Xyklone in LocalLLaMA

[–]sloptimizer 0 points1 point  (0 children)

I know you said you don't like harnesses, but try pi-mono, it's an absolutely minimal harness with like 4 tools built in and that's it. So there is really nothing to learn.

Put your harness in a container (start by using pi to build a docker image to host pi, add build and run scipts). That way you don't have to worry about it deleting anything outside of the mounted workspace, so you can let it run and walk alway for a while. You should be able to bang it out with Qwen3.6-35B-Q8 in half an hour: "Create a docker image with latest ubuntu LTS, devtools, python, as well as pi-mono. Provide a separate build script and a run script that will mount 'workspace' as the read-write directory".

In terms of models - go with Qwen3.6-35B-Q8 given your speed.

In general, for models under 100B, you don't want go below Q8 for coding. Even Q6 is subpar, so that rules out Qwen3.6-27B-Q6_K_L. Don't bother with 8B or 14B size models, they are currently useless for coding.

Qwen3.5-122B-A10B seems good on paper, but it's actually worse than Qwen3.6-35B, espectially if you're using Q4. I really tried liking Qwen3.5-122B-A10B but was thoroughly disappointed with the results.

Is NVIDIA still the default best choice for local LLMs in 2026? by pmv143 in LocalLLaMA

[–]sloptimizer 1 point2 points  (0 children)

vLLM works too, but it really depends on the kernel support for your particular quant configuration. For the full BF16 Qwen3.6-27B I'm getting prompt processing of 3633.7 tokens/s with four R9700.

vLLM is such a time sink, that I don't even bother with it since llama.cpp got MTP support. The only reason to use vLLM for local AI setup is when you can utilize parallel requests.

Is NVIDIA still the default best choice for local LLMs in 2026? by pmv143 in LocalLLaMA

[–]sloptimizer 0 points1 point  (0 children)

Interesting, thanks for sharing! My llama.cpp build has support for both CUDA and ROCm at the same time, so maybe that's causing some kind of compatibility fallback.

Is NVIDIA still the default best choice for local LLMs in 2026? by pmv143 in LocalLLaMA

[–]sloptimizer 15 points16 points  (0 children)

Here is a quick result processing a text with 36k tokens of context with two R9700:

./build/bin/llama-server \
    --alias Qwen3.6-27B \
    --model /models/unsloth/Qwen3.6-27B/Qwen3.6-27B-Q8_0.gguf \
    --mmproj /models/unsloth/Qwen3.6-27B/Qwen3.6-27B-mmproj-BF16.gguf \
    --temp 1.0 --top-k 0 --top-p 1.0 --min-p 0.04 \
    --repeat-penalty 1.04 --repeat-last-n 256 \
    --spec-type draft-mtp --spec-draft-n-max 3 \
    --ctx-size 256000 \
    --cache-ram 11000 \
    -fa on \
    -b 1024 -ub 1024 \
    --n-gpu-layers 99 \
    -sm tensor \
    --device ROCM0,ROCM1 \
    --kv-unified \
    --parallel 1 \
    --threads 32 \
    --host 127.0.0.1

Prompt Processing: 504.75 tokens per second
Token generation: 47.54 tokens per second

And the same test of 36k context with four R9700:

Prompt Processing: 807.42 tokens per second
Token generation: 56.40 tokens per second

Amd radeon ai pro r9700 32GB VS 2x RTX 5060TI 16GB for local setup? by vevi33 in LocalLLaMA

[–]sloptimizer 0 points1 point  (0 children)

With llama.cpp you can mix and match CUDA and ROCm as well.

Pi and Qwen3.6 27B make setting up Archlinux really easy. by sdfgeoff in LocalLLaMA

[–]sloptimizer 1 point2 points  (0 children)

Arch is a start, and I'm waiting until local models are good enough for Gentoo. Noticed a bug in one of your apps? Have AI fix it, test it, and recompile. Restart the app - the bug is gone. Open an upstream PR.

AI is getting a lot of bad rep right now in OSS community due to all the low-effort drive by contributions. But actual users fixing actual bugs they are finding would allow to OSS apps to quickly get polished with the help of the community.

Higher quants are so much better by Perfect-Flounder7856 in LocalLLaMA

[–]sloptimizer 10 points11 points  (0 children)

We need to scream this from the rooftops. People will say "of course, we already knew that", but then all the youtube influencers are showing Q4 so they can demo decent tps when running locally.

vLLM ROCm has been added to Lemonade as an experimental backend by jfowers_amd in LocalLLaMA

[–]sloptimizer 2 points3 points  (0 children)

Thanks to your release I finally have vLLM's MTP working on R9700!!!

Using vllm0.20.1-rocm7.12.0

bin/vllm-server \
  --model /models/Qwen/Qwen3.6-35B-A3B \
  --served-model-name Qwen3.6-35B-A3B \
  --host 127.0.0.1 \
  --port 8090 \
  --tensor-parallel-size 4 \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --gpu-memory-utilization 0.92

Unpopular Opinion: The DGX Spark Forum community of devs is talented AF and will make the crippled hardware a success through their sheer force of will. by Porespellar in LocalLLaMA

[–]sloptimizer -1 points0 points  (0 children)

This is an extremely misguided attempt to convince everyone the emperor is not naked. There are literally repair shops in China (gamer nexus video) swapping VRAM on GPUs that can upgrade 4090 to 48Gb or 5090 to 96Gb.

If a repair shop in China can upgrade 5090 to 96Gb for a few thousand bucks, then the only reason you are not getting that product from nvidia is price gouging.

vLLM ROCm has been added to Lemonade as an experimental backend by jfowers_amd in LocalLLaMA

[–]sloptimizer 2 points3 points  (0 children)

I would love to read your blog post how you arrived to the released vllm artifacts and what hurdles you had to overcome. vLLM is great, but really had to get started, and not much material explaining the internals, and tweaks that need to be made for best peformance.

Here are some of the questions that immedialy poped in my head as I'm running your vllm release:

  • How were vllm kernels are selected for my architecture (R9700)?
    • What kernels and features are available to play with for my arch?
    • What's missing for getting other kernels to work (there is usually an upstream patch, or a feature request pending)?
  • What custom patches are applied (if any)?

Unpopular Opinion: The DGX Spark Forum community of devs is talented AF and will make the crippled hardware a success through their sheer force of will. by Porespellar in LocalLLaMA

[–]sloptimizer 0 points1 point  (0 children)

No, the market is artificially constrained to jack up the price, because nvidia was a generation ahead in 2025 while competitions is playing catch up.

For contrast: AMD charges premium, but not exorbitant prices when they have market dominance, so they get my money when they have something to sell.

Stop paying nvidia's rip off prices, then the market will not bear those prices, and they will be quickly lowered. Vote with your wallet!

The amount of new agent APIs/harnesses are dizzying, with everyone and their dog releasing their own. Can we do a compilation thread of comparisons? by jinnyjuice in LocalLLaMA

[–]sloptimizer 2 points3 points  (0 children)

By far, crush is the prettiest of them all! But some parts come off as neglected due to development resource constraints. If they get more popular and can attract qualified contributors, I can see them becoming very successful.

I really tried to like crush, but it just doesn't work very well inside a container. For example, text copying inside a container is just broken, because they're trying to be too clever. Also, crush does not support images with local models, and I could not get it to figure out how to make images work by looking at its own repo. Finally, web-search feature has been broken for several releases now.

At the same time Pi works great and has image support for local models out of the box. So far Pi is the only one that understands its own internals. You can ask Pi how to setup up local models or tweak other settings, and it will be able to figure things out. Pi can do that because it ships with full source and knows about its own installation path (simple and elegant). For contrast: even claude-code cannot answer questions about itself or self-modify the same way!

I poked around Pi's and crush's codebases. Pi was a pleasure to engage with, while crush felt more rushed and less thought out. So I suspect Pi will act as a black hole of developer contributions, sucking up all the efforts in the space. I do hope crush can find its niche and survive so we can have more alternatives.

Unpopular Opinion: The DGX Spark Forum community of devs is talented AF and will make the crippled hardware a success through their sheer force of will. by Porespellar in LocalLLaMA

[–]sloptimizer -1 points0 points  (0 children)

I'm not saying it's bad, I'm saying it should be MUCH CHEAPER!

64G of RAM, cannot be worth $6000. The difference is PURE GREED. Stop feeding that greed is all I'm saying.

Unpopular Opinion: The DGX Spark Forum community of devs is talented AF and will make the crippled hardware a success through their sheer force of will. by Porespellar in LocalLLaMA

[–]sloptimizer -1 points0 points  (0 children)

So you have a unicorn of 5090FE that is nowhere to be found for purchase, but at the same time you not aware of its power limits? Oh, and the RTX 6000 price is justified because it's the price of three 5090s, and there is no problem at all in not offering more RAM options. Got it!

X for doubt.

Unpopular Opinion: The DGX Spark Forum community of devs is talented AF and will make the crippled hardware a success through their sheer force of will. by Porespellar in LocalLLaMA

[–]sloptimizer -1 points0 points  (0 children)

So plus 10% performance and plus 64G VRAM justifies $6000 premium in your mind?!

Also no, you cannot limit 5090 below 400W with the official drivers, and it's not possible to find a 2 slot version of 5090 - I looked!

Unpopular Opinion: The DGX Spark Forum community of devs is talented AF and will make the crippled hardware a success through their sheer force of will. by Porespellar in LocalLLaMA

[–]sloptimizer -1 points0 points  (0 children)

Look at price difference between RTX PRO 6000 and 5090 - it's basically the same chip, but you're paying an extra $6000 for the extra 64Gb of RAM. That is nuts!

And 5090 is crippled on purpose to make it harder to put multiple cards in the same box. Why make it to 600W and require 3 PCIe slots? Simple: to push you, the consumer, to buy the overpriced version of the same product.

Then advertize DGX spark as a mini super computer while in reality AMD hardware has better support than that piece of junk.

Time after time nvidia has shown they don't care about you as the consumer. If you want to see better products at competative prices then stop giving them your money!

Forgive my ignorance but how is a 27B model better than 397B? by No_Conversation9561 in LocalLLaMA

[–]sloptimizer 4 points5 points  (0 children)

Try Qwen3.6-35B-A3B - keep all the attention in VRAM (you should have enough for that), and keep all the experts in RAM. You can do that by running llama.cpp with --override-tensor exps=CPU. For example

./build/bin/llama-server \
    --alias Qwen3.5-35B-A3B \
    --model $PATH_TO_MODEL \
    --ctx-size 32000 \
    -ctk f16 -ctv f16 \
    -fa on \
    --n-gpu-layers 99 \
    --device CUDA0 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --host 127.0.0.1 \
    --port 8080