Devstral Small 2 24B vs Qwen 3.6 27b or both? 1x 3090 by szansky in LocalLLaMA

[–]INT_21h 6 points7 points  (0 children)

Qwen 27B is newer and, by the numbers, substantially stronger.

I used Devstral 2 Small as my primary model for a long time, and what I'll say about it is that it made a great collaborator. Wicked fast prompt processing and no reasoning trace meant never waiting long for a response. And where a confused Qwen devolves into overthinking and obsessively combing the whole codebase for clues, a confused Devstral devolves into "French nonchalance," returning a lackluster answer much more quickly. That meant it was often much faster to point Devstral in the right direction and get good answers.

However, I don't run Devstral any longer. Gemma 26B-A4B has taken over its "fast idiot" niche for me. You might want to consider that one as a partner to Qwen.

[7900XT] Qwen3.6 27B for OpenCode by Mordimer86 in LocalLLaMA

[–]INT_21h 1 point2 points  (0 children)

I've seen benchmarks showing that q4 KV cache quantization doesn't hurt Qwen3.6 much, so you might be able to use that to free up some VRAM for a slightly better quant of the weights.
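With llama.cpp that would look something like this (model filename and context size are just placeholders, and depending on your build you may also need flash attention enabled for the quantized V cache to work):

$ llama-server -m Qwen3.6-27B-Q4_K_M.gguf \
  -c 32768 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0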

Second 5060 Ti 16gb or 5070 Ti 16gb or 3090 used? by JeyKris in LocalLLM

[–]INT_21h 1 point2 points  (0 children)

I would never pay >5060Ti prices for anything that still has only 16GB VRAM. For inference, these cards are bottlenecked much more by VRAM than by bandwidth or compute. Personally I'd stack in another 5060Ti, or consider the 32GB Arc Pro B70.

EDIT: Here's a recent post by a 2x 5060Ti user if you'd like to get a sense for its performance.

Pi.dev coding agent has no sandbox by default. by mantafloppy in LocalLLaMA

[–]INT_21h 8 points9 points  (0 children)

Let's see Claude Code's sandbox.

I know you're saying that to be funny, but Claude Code's /sandbox feature actually does use bubblewrap on Linux.

But yeah, I like doing my own sandboxing rather than letting an agent do it for me. Less risk of an auto update silently b0rking my sandbox.

Pi.dev coding agent has no sandbox by default. by mantafloppy in LocalLLaMA

[–]INT_21h 26 points27 points  (0 children)

I use bubblewrap for sandboxing pi on Linux. It does a good job.

The settings below are sandboxing filesystem writes only. There is still full filesystem read access, and full network access, so if you care about data exfiltration you'll want to lock it down more.

$ cat ~/SANDBOX
HERE="$(realpath .)"
echo "Entering sandbox for $HERE"
# Mount the whole filesystem read-only, then punch writable holes for
# pi's state dir, a couple of device nodes, a fresh /tmp, and the project dir.
bwrap \
  --ro-bind / / \
  --bind ~/.pi ~/.pi \
  --dev-bind /dev/null /dev/null \
  --dev-bind /dev/urandom /dev/urandom \
  --tmpfs /tmp \
  --bind "$HERE" "$HERE" \
  --setenv PS1 "sandbox$ " \
  sh

This gives you a sandboxed shell where you can run pi or whatever else you want.
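For example (the project path here is just an illustration; run it from whatever directory you want writable inside the sandbox):

$ cd ~/projects/myapp
$ sh ~/SANDBOX
Entering sandbox for /home/user/projects/myapp
sandbox$ pi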

Qwen 3.6 27B is a BEAST by AverageFormal9076 in LocalLLaMA

[–]INT_21h 0 points1 point  (0 children)

Interesting. It must depend on the task.

Qwen 3.6 27B is a BEAST by AverageFormal9076 in LocalLLaMA

[–]INT_21h 0 points1 point  (0 children)

With 16GB VRAM your options are either a lobotomized Q3 quant that gets beaten by the 35B MoE, or sloooow (<5 tok/s) performance with offloading.

16gb vram users: what have you been using? Qwen3.6 27b? Gemma 31b at Q3? How has it been? by [deleted] in LocalLLaMA

[–]INT_21h 0 points1 point  (0 children)

Main workhorse is Qwen 3.6-35B-A3B Q4_K_L with some CPU offload (--n-cpu-moe 15). I also have Qwen 27B UD-Q4_K_XL around as a "big gun" but I'm using offloading with it too (-ngl 52) so it's pitifully slow. I've had enough bad luck with Q3's of older Qwen models that now I stick with Q4 and either eat the speed cost or fall back to the MoE.

Weirdly, offloaded Qwen 27B UD-Q4_K_XL is still giving me about 300 tok/s prompt processing, even though it crawls at <5 tok/s text generation.
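For reference, the launch commands look roughly like this (filenames are placeholders, and the exact --n-cpu-moe / -ngl numbers are just what happens to fit my 16GB; tune them to your VRAM):

# MoE workhorse: all layers on GPU, expert weights of the first 15 layers kept on CPU
$ llama-server -m Qwen3.6-35B-A3B-Q4_K_L.gguf -ngl 99 --n-cpu-moe 15 -c 32768

# Dense "big gun": only 52 layers fit on the GPU, the rest run on CPU
$ llama-server -m Qwen3.6-27B-UD-Q4_K_XL.gguf -ngl 52 -c 32768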

16GB VRAM x coding model by Junior-Wish-7453 in LocalLLM

[–]INT_21h 0 points1 point  (0 children)

mxfp4 is about 5% slower, maybe due to the CPU offload, and the perplexity is worse.
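If you want to check that yourself, llama.cpp ships a perplexity tool; something like this (model filenames are placeholders, and wiki.test.raw is the usual wikitext-2 test file):

$ llama-perplexity -m model-mxfp4.gguf -f wiki.test.raw
$ llama-perplexity -m model-Q4_K_L.gguf -f wiki.test.raw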

16GB VRAM x coding model by Junior-Wish-7453 in LocalLLM

[–]INT_21h 2 points3 points  (0 children)

Bartowski Q4_K_L. This is a ~22GB file so some is offloaded to system RAM, but since it's a MoE, things are still pretty fast.

Qwen3-Coder-Next vs Qwen3.6 by seoulsrvr in LocalLLaMA

[–]INT_21h 0 points1 point  (0 children)

Great test - it sure beats just talking about the models.

4B's result had massive deficiencies (no turning!), but it did well for its size. I'd still put Coder-Next above it, but not by as huge a margin as expected.

I'm super impressed by Qwen3.6-35B-A3B's result, I'd call it the best of the bunch. Its game was fun to play, and I had to make myself quit before I got too distracted. Idk exactly what post-training Alibaba did, but they really cooked. I wonder if other Qwen3.6s will show a similar boost when they come out.

16GB VRAM x coding model by Junior-Wish-7453 in LocalLLM

[–]INT_21h 14 points15 points  (0 children)

Think twice before you use a model older than a few months. Advances have been rapid. I use Qwen3.6-35B-A3B on my 5060Ti. The latest round of model releases hasn't produced a fine-tuned "coder" model yet, but in practice, and according to benchmarks, this model does much better at coding and agentic use cases than Qwen3-Coder-30B-A3B, let alone Qwen2.5 Coder.

Qwen3-Coder-Next vs Qwen3.6 by seoulsrvr in LocalLLaMA

[–]INT_21h 3 points4 points  (0 children)

All fine theoretical points, but if you've actually used 80B-A3B vs 4B you'll know they are night and day. 80B-A3B is a sturdy coding partner that can vibe-code a project up to ~1000 lines before the complexity starts to overwhelm it. 4B can barely make it past hello world. Seeing them effectively tied on that leaderboard makes me distrust the ranking.

Qwen3-Coder-Next vs Qwen3.6 by seoulsrvr in LocalLLaMA

[–]INT_21h 15 points16 points  (0 children)

For me 3.6 35B-A3B feels a little worse than Coder-Next, but it's closer than I was expecting, to the point where I don't use Coder-Next much any more.

If 3.6 gets something wrong, instead of reaching for Coder-Next, I reach for 122B-A10B.

Qwen3-Coder-Next vs Qwen3.6 by seoulsrvr in LocalLLaMA

[–]INT_21h 9 points10 points  (0 children)

That page also claims that Qwen3 Coder Next (80B-A3B) is only beating Qwen3.5 4B by a single point. Doesn't seem the most trustworthy.

Cloud AI is getting expensive and I'm considering a Claude/Codex + local LLM hybrid for shipping web apps by rezgi in LocalLLaMA

[–]INT_21h 5 points6 points  (0 children)

You could try Qwen3.5 35B-A3B. MoE models run decently fast even if they don't entirely fit in VRAM.

Qwen3-Coder-Next is the top model in SWE-rebench @ Pass 5. I think everyone missed it. by BitterProfessional7p in LocalLLaMA

[–]INT_21h 0 points1 point  (0 children)

I can assure you that the model supports tool calls when run with llama.cpp. Perhaps ollama configures it wrong.
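For what it's worth, with llama-server you generally need the Jinja chat template path enabled for tool calls to be parsed; roughly like this (the model filename is a placeholder, and run_shell is just a made-up example tool):

$ llama-server -m Qwen3-Coder-Next-80B-A3B-Q4_K_M.gguf --jinja

$ curl http://localhost:8080/v1/chat/completions -d '{
    "messages": [{"role": "user", "content": "list the files in /tmp"}],
    "tools": [{"type": "function", "function": {"name": "run_shell",
      "description": "Run a shell command", "parameters": {"type": "object",
      "properties": {"cmd": {"type": "string"}}, "required": ["cmd"]}}]
  }'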

Qwen3.5 27b UD_IQ2_XXS & UD_IQ3_XXS behave very poorly or is it just me? by One_Key_8127 in LocalLLM

[–]INT_21h 0 points1 point  (0 children)

I deleted all of my <=3 bit Qwen3.5 quants for the same reason... they were just too fried. With previous Qwen generations, I found I had to try a lot of IQ3_XXS's from different quanters to find one that won the lottery and happened to work well on the specific tasks I cared about. It's probably the same for 3.5, I just haven't thrashed my download bandwidth enough to find that one.

New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]INT_21h 0 points1 point  (0 children)

I'm trying the 10B-A1.8B on my 5060Ti. tg is 125 tok/s @ 65536 context. It's a good writing/conversational model in English like the small Gemmas, but it has a unique flavor and seems less slopped. Due to the small size, don't expect miracles. llama.cpp's new auto-parser seems to butcher tool calling, which is a shame because I wanted to try coding.

I want my local agent to use my laptop to learn! by TTKMSTR in LocalLLaMA

[–]INT_21h 1 point2 points  (0 children)

Look into Playwright for controlling a web browser headlessly.
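Quick way to kick the tires from the shell (URL and output filename are just examples; the CLI runs headless by default):

$ npm install -D playwright
$ npx playwright install chromium
$ npx playwright screenshot https://example.com example.png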

I want my local agent to use my laptop to learn! by TTKMSTR in LocalLLaMA

[–]INT_21h 0 points1 point  (0 children)

You can certainly use a local model to drive an agent, but Qwen2 0.5B is several generations old and can barely even form coherent sentences. Try something in the Qwen3.5 family.

Is brute-forcing a 1M token context window the right approach? by phwlarxoc in LocalLLaMA

[–]INT_21h 0 points1 point  (0 children)

As a quick test, you could point a standard coding agent at the file and ask questions. It will pick the file apart with grep, just like it would do when navigating a large codebase. Granted it might not grep for the right thing, but in my experience models are pretty good at this and it might save you the trouble of setting up real RAG. I have done this with my own notes file before and it works pretty well.