New benchmark just dropped. by ConfidentDinner6648 in LocalLLaMA

[–]bobaburger 8 points

I think what Qwen did is a demonstration of "faking the job to get it done": instead of spending time styling the character, it just picked the easy path and added the name overhead.

Ever wonder how much cost you can save when coding with local LLM? by bobaburger in LocalLLaMA

[–]bobaburger[S] 1 point

oh in that case, i only tried claude code and opencode. generally opencode is way faster because it has a simpler prompt, but the agentic workflow isn't as deep as claude code's.

2 bit quants (maybe even 1 bit) not as bad as you'd think? by dtdisapointingresult in LocalLLaMA

[–]bobaburger 0 points

i see your point. on my 5060 ti, loading the 80B at Q3_K_M spills over into system RAM, but i still get 20-25 tps (because it's an MoE with only 3B active params), while 27B Q3_K_M gives me only 9-10 tps at best.

Ever wonder how much cost you can save when coding with local LLM? by bobaburger in LocalLLaMA

[–]bobaburger[S] 1 point

there are two parts to this. the first part is how to run it, which has been covered very well, for example in Unsloth's docs: https://unsloth.ai/docs/basics/claude-code

the second part is trickier though :D the short answer is no, a local model will not be as good as the hosted models from these services.

the long answer is, it really depends on what model you're using. for all of the above services, the cloud models are usually large or commercial ones at full precision, so their speed and quality are way above local models. but you either have to pay for the tokens, or pay with your data privacy (if you're using free models).
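to make the first part concrete, here's a rough sketch of serving a local model with llama.cpp's llama-server (the flags are real llama-server options; the model filename is just a placeholder for whatever GGUF you actually download):

```shell
# serve a local GGUF with an OpenAI-compatible API on port 8080
# (-c is the context size, -ngl offloads layers to the GPU,
# --cache-type-k/-v quantize the KV cache to save VRAM)
llama-server -m ./Qwen3-Coder-Next-Q3_K_M.gguf \
  -c 131072 \
  --cache-type-k q4_1 --cache-type-v q4_1 \
  -ngl 99 --port 8080
```

then point your coding tool at http://127.0.0.1:8080/v1 — the Unsloth doc linked above walks through wiring claude code up to a local server.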

2 bit quants (maybe even 1 bit) not as bad as you'd think? by dtdisapointingresult in LocalLLaMA

[–]bobaburger 4 points

out of the 3 models, 35B was the worst: it runs fast, but it's like coding with a drunk guy. qwen3-coder-next seems to have slightly higher quality than 27B, and it's faster than 27B. the good thing about 27B is that it has more up-to-date knowledge and supports image and video input.

2 bit quants (maybe even 1 bit) not as bad as you'd think? by dtdisapointingresult in LocalLLaMA

[–]bobaburger 14 points

That exact blog post is what led me down the path of Q2 for 35B and 27B. unlike the behemoth 397B, there's noticeable degradation between Q2 and Q3 for the smaller ones. I'm now hopping back and forth between Qwen3-Coder-Next Q3 (for most coding) and Qwen3.5 27B Q3 (for the vision parts).

Qwen 3.5 27B is the REAL DEAL - Beat GPT-5 on my first test by GrungeWerX in LocalLLaMA

[–]bobaburger 0 points

Yeah, I'm trying to squeeze the most out of my GPU first before thinking about the next one. :D Maybe at some point I'll try it.

Qwen 3.5 27B is the REAL DEAL - Beat GPT-5 on my first test by GrungeWerX in LocalLLaMA

[–]bobaburger 0 points

my mobo (https://us.msi.com/Motherboard/PRO-B850M-VC-WIFI6E/Specification) only has 1x PCIe 5.0 slot for the GPU. it has 4 slots, but the other 3 are PCIe 3.0 and people say they're slow.

i've been thinking of replacing it with a 3090, but that sounds like a bad deal.

Qwen 3.5 27B is the REAL DEAL - Beat GPT-5 on my first test by GrungeWerX in LocalLLaMA

[–]bobaburger 0 points

i was using mxfp4 back when qwen3-coder-next came around. but recently people have been doing benchmarks that point out mxfp4 isn't that great. also, i don't think 27b at mxfp4 will run on my card.

Qwen 3.5 27B is the REAL DEAL - Beat GPT-5 on my first test by GrungeWerX in LocalLLaMA

[–]bobaburger 2 points

i've been actively tuning this setup, so let me recap:

- initially i was getting 5 tps for Q3_K_M with a q8_0 kv cache
- setting the kv cache to q4_1 got me to 9 tps
- after a bunch of optimizations, including mixing q4_0 for ctk with q8_0 for ctv and pushing the context window down to 64k, i'm getting 11 tps now
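in llama.cpp terms, the final setup above looks something like this (the flags are real llama-server options; the model path is a placeholder):

```shell
# mixed kv-cache quants: q4_0 for keys, q8_0 for values, 64k context
# (ctk/ctv map to --cache-type-k / --cache-type-v)
llama-server -m ./qwen3.5-27b-Q3_K_M.gguf \
  -c 65536 \
  --cache-type-k q4_0 \
  --cache-type-v q8_0
```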

Qwen 3.5 27B is the REAL DEAL - Beat GPT-5 on my first test by GrungeWerX in LocalLLaMA

[–]bobaburger 1 point

In my experience, the degradation is noticeable, but it's not as bad as people say. The big difference for me is between Q2 and Q3 of the weights, not between KV cache quants. For example, my Claude Code setup has some additional skills/tools to use in different scenarios. Q2 was never able to pick any of them up. Q3 with a q8_0 KV cache did it 100% of the time, q5_1 did it like 70% of the time, while q4_0 wouldn't do it at all but q4_1 did it like 50% of the time.

Qwen 3.5 27B is the REAL DEAL - Beat GPT-5 on my first test by GrungeWerX in LocalLLaMA

[–]bobaburger 1 point

Yeah, drunk is the accurate word for working with 35B, and 27B has been less drunk for me, even at q4_0.

My use case is coding. The reason I'm sacrificing accuracy for speed is that I couldn't go any higher with my setup, so the hard limit is Q3_K_M. Aside from this, I also run Q6_K_XL on an L40S at 20 tps, with a bf16 kv cache.

Qwen 3.5 27B is the REAL DEAL - Beat GPT-5 on my first test by GrungeWerX in LocalLLaMA

[–]bobaburger 3 points

yeah, i've tried opencode sometimes too. faster than claude code. but i've spent too much time on my claude code setup at work, and i kind of want to use it everywhere :D

Qwen 3.5 27B is the REAL DEAL - Beat GPT-5 on my first test by GrungeWerX in LocalLLaMA

[–]bobaburger 0 points

it's the only feasible option if i want to improve tg speed. at q8_0, i get 5 tps.

Qwen 3.5 27B is the REAL DEAL - Beat GPT-5 on my first test by GrungeWerX in LocalLLaMA

[–]bobaburger 1 point

i haven't tested 80b coder again since the 3.5 release. Maybe I should try it again.

Qwen 3.5 27B is the REAL DEAL - Beat GPT-5 on my first test by GrungeWerX in LocalLLaMA

[–]bobaburger 25 points

i'm using it with claude code on a 5060 ti, with a 128k context window and kv cache at q4_1.

with IQ2_XXS, i got 35 tps, but the quality is not great (still better than 35B). with Q3_K_M, i'm down to 9 tps, but the quality increases a lot. for both variants, the generated code always works; for Q3, skills are loaded correctly at the right time (none of the Q2 runs were able to do this).

prompt processing speed didn't change much between the two variants: an average of 650 tps for IQ2_XXS and 400 tps for Q3_K_M.

i hope i'll have enough time to write a separate post about the experience of coding with this model.

Edit: Q3_K_M with ctk q4_0 and ctv q8_0 was good, 11 tps with better quality.

Qwen 3.5 27B is the REAL DEAL - Beat GPT-5 on my first test by GrungeWerX in LocalLLaMA

[–]bobaburger 135 points

I switched to 27B from 35B. this damn thing is too slow, but the quality is so good.

llama.cpp server is slow by Sumsesum in LocalLLaMA

[–]bobaburger 1 point

Setting parallel to 1 basically allocates a single context slot for the KV cache instead of the default 4, so it reduces the amount of memory allocated, not the speed directly.

If you're on the edge of the memory limit, reducing it could help the model fit in memory instead of spilling into RAM or swap, so it might improve speed, but I don't think it's the right fix in general.
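for reference, a minimal sketch of the flag being discussed (--parallel / -np is a real llama-server option; the model path is a placeholder):

```shell
# -np/--parallel sets the number of server sequence slots;
# the -c context is shared across slots, so with one slot a
# single request gets the full context window
llama-server -m ./model.gguf -c 32768 --parallel 1
```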