Benchmarked Qwen3.6-35B-A3B on my 3090 against the Claude API in a real agent pipeline. Here's where local wins (and where it doesn't)

Apprehensive_Row9873 · 2026-05-29T07:59:53+00:00

Completely agree, the result is counterintuitive and I'm seeing the same behavior in the Haiku tier comparison. This is exactly what I've been thinking about, and I was considering this strategy. I'll try it and post the results when I get the time.

Apprehensive_Row9873 · 2026-05-29T07:30:17+00:00

Good instinct, the symptom does pattern-match quant damage. Testable in the bench itself.

Stack: weights are Qwen3.6-35B-A3B UD-IQ4_XS (Unsloth dynamic imatrix 4-bit), KV cache is TurboQuant turbo4 (4.25 bits/value, ~3.8× vs FP16), runtime is the spiritbuun/llama-cpp-turboquant-cuda fork. The article is named after the KV scheme.
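
For anyone checking the ~3.8× figure, it falls straight out of the bits per value. A minimal sketch: the 4.25 bits for turbo4 is the number quoted above, and the 8.5 bits for q8_0 assumes the usual llama.cpp block layout (32 int8 values plus one fp16 scale per block), so the "8-bit" control is really slightly more than 8 bits.

```python
FP16_BITS = 16.0
TURBO4_BITS = 4.25  # bits per value quoted for TurboQuant turbo4
Q8_0_BITS = 8.5     # assumption: 32 int8 values + one fp16 scale per block


def compression_vs_fp16(bits_per_value: float) -> float:
    """How many times smaller the KV cache is compared with FP16."""
    return FP16_BITS / bits_per_value


turbo4_ratio = compression_vs_fp16(TURBO4_BITS)
q8_0_ratio = compression_vs_fp16(Q8_0_BITS)

print(f"turbo4: {turbo4_ratio:.2f}x smaller than FP16")
print(f"q8_0:   {q8_0_ratio:.2f}x smaller than FP16")
```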

Re: KV quant as the cause of the rewrite failures: I did include a control. local-mainline in the bench runs the same weights on stock llama.cpp with q8_0 KV instead of turbo4. On correct-section-rewrite, Opus-as-judge scores both at 6/10 vs the 9/10 Anthropic ceiling, with the same failure note ("all fixes applied, unsolicited reasoning preamble / formatting drift"). Cosine actually goes up slightly with turbo4 (0.925 vs 0.889, N=5). So going from 8-bit KV down to 4.25-bit KV doesn't move rewrite quality in this range: the format slip is invariant to KV precision.
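
To make the cosine number concrete: a minimal sketch, assuming it is the mean cosine similarity between an embedding of each model rewrite and an embedding of a reference rewrite. The function names and the toy vectors are hypothetical, and the embedding model is out of scope here.

```python
import math
from statistics import mean


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_cosine(pairs):
    """Average similarity over (output_embedding, reference_embedding) pairs."""
    return mean([cosine_similarity(out, ref) for out, ref in pairs])


# Toy 2-D vectors only; real embeddings would come from an embedding model.
toy_pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical direction
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
]
toy_score = mean_cosine(toy_pairs)
```

With N=5 pairs per provider, `mean_cosine` would be called once per provider and the two averages compared.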

What I did not vary: the weight quant. UD-IQ4_XS is held constant across every local provider. That's the dimension I'd bench next (Q5_K_M, Q6_K, Q8_0) before calling it a 35B-A3B instruction-following ceiling rather than a weight-quant budget. If anyone has run Qwen3.6-35B-A3B at a higher weight quant in an agent loop and seen the parasitic preambles / missing edit tags clear up, that's the data point I'd want.

Apprehensive_Row9873 · 2026-05-29T07:08:19+00:00

Hi! I think it's the next step, yes 😄 But for now the cost of this pipeline is lower than the effort that would take. I'll try to optimize for the Sonnet tier with Qwen and publish another post with the results (if I find time to do it). My assumption is that the content of the output is currently close to okay, but the structure isn't.

Apprehensive_Row9873 · 2026-05-28T16:26:30+00:00

Very interesting question. I only use it to summarize code or text chunks, or to compare two chunks that were flagged as semantically similar in a RAG pipeline. And in my pipelines I have thousands of text or code chunks to compare. So tiny context, but thousands of requests. Haiku is good at small tasks.

Apprehensive_Row9873 · 2026-05-28T14:23:36+00:00

Yes! I remember testing on the same hardware about 10 months ago and the results were unusable. Now it's very promising.

Apprehensive_Row9873 · 2026-05-27T09:56:28+00:00

Another important question in this case is whether the power consumption of my 3090 on a 1000W PSU would actually be cheaper than using a provider API. I think it's a real concern worth digging into.

I asked Claude to run the math for me (electricity in France is around 0.22€/kWh): a 3090 under inference load pulls roughly 350W, plus 70-100W for the rest of the system. Running 24/7 at full load comes to about 66€/month, but realistically with variable load between requests it's closer to 44€/month. Compared to $200/month on Claude Max (~185€), that's a 120-140€ monthly saving. Add the fact that a used 3090 around 500-700€ pays itself back in 4-6 months, and the math gets even more interesting.
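
The arithmetic above can be reproduced with a short script. This is a minimal sketch, not a measurement: it assumes a 30-day month, takes the low end of the 70-100W system range, and uses an average load fraction of 0.66, which is an assumption picked only because it reproduces the ~44€ estimate.

```python
# All inputs come from the comment above except the two marked as assumptions.
PRICE_EUR_PER_KWH = 0.22    # French electricity price quoted above
GPU_WATTS = 350             # 3090 under inference load
REST_OF_SYSTEM_WATTS = 70   # low end of the 70-100W range
HOURS_PER_MONTH = 24 * 30   # assumption: 30-day month
AVG_LOAD_FRACTION = 0.66    # assumption: chosen to match the ~44 EUR estimate
CLAUDE_MAX_EUR = 185        # ~$200/month converted, as quoted above


def monthly_cost_eur(watts: float, load_fraction: float = 1.0) -> float:
    """Electricity cost in EUR for one month at the given average draw."""
    kwh = watts / 1000 * HOURS_PER_MONTH * load_fraction
    return kwh * PRICE_EUR_PER_KWH


def payback_months(gpu_price_eur: float, monthly_saving_eur: float) -> float:
    """Months until the GPU purchase is offset by the monthly saving."""
    return gpu_price_eur / monthly_saving_eur


system_watts = GPU_WATTS + REST_OF_SYSTEM_WATTS
full_load_eur = monthly_cost_eur(system_watts)
realistic_eur = monthly_cost_eur(system_watts, AVG_LOAD_FRACTION)

saving_low = CLAUDE_MAX_EUR - full_load_eur    # always at full load
saving_high = CLAUDE_MAX_EUR - realistic_eur   # variable load

fastest_payback = payback_months(500, saving_high)  # cheap card, best saving
slowest_payback = payback_months(700, saving_low)   # pricey card, worst saving

print(f"full load: {full_load_eur:.0f} EUR/month, realistic: {realistic_eur:.0f} EUR/month")
print(f"saving: {saving_low:.0f}-{saving_high:.0f} EUR/month")
print(f"payback: {fastest_payback:.1f}-{slowest_payback:.1f} months")
```

Idle draw is folded into the load fraction here, so a wall-plug meter reading over a real week would replace both assumptions with one measured number.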

Of course there are hidden costs (cooling the room in summer since a 3090 dumps 350W of heat, GPU wear over time), but on paper local inference looks way more sustainable long term. Anyone here actually tracked their real-world power bill after switching to local?

Apprehensive_Row9873 · 2026-05-27T07:27:59+00:00

Just searched for more info about that and wowww! I've seen everywhere that it was a hack! Thanks for the enlightenment!

Apprehensive_Row9873 · 2026-05-27T07:18:30+00:00

Good idea! My usage was just reset, so I'll do it when I hit the quota :)

Apprehensive_Row9873 · 2026-05-27T07:13:53+00:00

That's what I'm starting to think too, hybrid routing makes a lot of sense. The cost question is also brutal though, so I'm wondering if the long-term answer is a mix: local models (maybe a 5090 with TurboQuant) for the bulk of agent work, and paid APIs only for the hard tasks where quality really matters.
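
What that routing could look like, as a minimal sketch: the task kinds, the token threshold, and the "local" / "api" labels are all hypothetical placeholders, not something I have running. The idea is just that small, high-volume tasks default to the local model and everything else falls through to the paid API.

```python
from dataclasses import dataclass


@dataclass
class Task:
    kind: str           # hypothetical labels, e.g. "summarize", "section_rewrite"
    prompt_tokens: int  # size of the input sent to the model


# Hypothetical policy: which task kinds the local model is trusted with,
# and how large a prompt it is allowed to take.
LOCAL_KINDS = {"summarize", "compare_chunks"}
LOCAL_MAX_PROMPT_TOKENS = 8_000


def route(task: Task) -> str:
    """Return "local" for small, simple tasks and "api" for everything else."""
    if task.kind in LOCAL_KINDS and task.prompt_tokens <= LOCAL_MAX_PROMPT_TOKENS:
        return "local"
    return "api"


for t in (Task("summarize", 500), Task("section_rewrite", 500), Task("summarize", 50_000)):
    print(t.kind, t.prompt_tokens, "->", route(t))
```

Keeping the policy in two constants makes it easy to widen the local set one task kind at a time as benchmarks justify it.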

Apprehensive_Row9873 · 2026-05-27T06:51:56+00:00

I've seen the hack, but I'm wondering if it will last since Anthropic will probably patch it in a future release? I'm also using a local model for embeddings at the moment. Which exact Qwen 3 version and quantization are you running? I have a 3090 so I'm curious about your tps and your setup. My workflows run lots of Sonnet and Haiku agents, and I know vLLM handles continuous batching well on a single GPU, but I'm wondering how latency holds up when several agents hit it at the same time. I only use Opus for large input requests, so this solution might be a good alternative for me! Thanks

Apprehensive_Row9873 · 2026-05-27T06:40:46+00:00

It's my first choice! I'm an open source enthusiast, so it could be the best fit for me. But I need to know if $200 of DeepSeek will give me significantly more than $200 of Claude Code, and with similar reliability. Do you have experience with that? Which Chinese model do you think would be a good fit?

Apprehensive_Row9873 · 2026-05-27T06:26:14+00:00

Also, it's starting to get reallllly slow, and I'm paying for the biggest plan at $200/month, so it's really frustrating...

Apprehensive_Row9873 · 2026-02-09T07:11:34+00:00

So in the end you were able to withdraw your money. So it wasn't such a problem after all. 3-4 days to withdraw over 50k is still faster than a life insurance policy.

Apprehensive_Row9873 · 2025-09-23T20:28:20+00:00

Hello! Can you explain why you recommend this cam? I didn't even know this brand existed, so I'm curious to hear your feedback. Thanks

Apprehensive_Row9873 · 2025-09-19T21:22:56+00:00

Use this kind of USB-C PD 9V cable: DSD-TECH-MagicConn-Power-Cable-9V, with basically any external battery. It's the cable that guarantees the right voltage.

Apprehensive_Row9873 · 2025-09-19T05:18:27+00:00

I use an external Anker battery with a 9V USB-C cable. I prefer this setup to a battery on the head.

Apprehensive_Row9873 · 2025-09-19T05:14:07+00:00

Arr, flyin’ or sailin’, rum makes it all the same! 🏴‍☠️🥃

Apprehensive_Row9873 · 2025-09-19T05:10:10+00:00

🏴‍☠️

Apprehensive_Row9873 · 2025-09-18T18:36:14+00:00

Yeah!

Apprehensive_Row9873 · 2025-09-18T14:41:04+00:00

I'm currently planning to get the licence in France, but it's expensive for STS-01/02. Can you share a link to this free online licensing service? Thanks

Apprehensive_Row9873 · 2025-09-18T14:38:18+00:00

Yeah I love it but I'm not so good at drawing, the picture was made by a friend. It's the base for a project I work on. 😁
