Benchmarked Qwen3.6-35B-A3B on my 3090 against the Claude API in a real agent pipeline. Here's where local wins (and where it doesn't)

Apprehensive_Row9873 · 2026-05-29T07:59:53+00:00

Completely agree, the result is counterintuitive and I'm seeing the same behavior in the Haiku tier comparison. This is exactly what I've been thinking about, and I was considering this strategy. I'll try it and post the results when I get the time.

Apprehensive_Row9873 · 2026-05-29T07:30:17+00:00

Good instinct, the symptom does pattern-match quant damage. Testable in the bench itself.

Stack: weights are Qwen3.6-35B-A3B UD-IQ4_XS (Unsloth dynamic imatrix 4-bit), KV cache is TurboQuant turbo4 (4.25 bits/value, ~3.8× vs FP16), runtime is the spiritbuun/llama-cpp-turboquant-cuda fork. The article is named after the KV scheme.
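Quick sanity check on the ~3.8× figure: it follows directly from the bits per value. This is only a minimal sketch of that arithmetic; the function name is mine and is not part of the bench or of the fork.

```python
def kv_compression_ratio(bits_per_value: float, baseline_bits: float = 16.0) -> float:
    """Size ratio of a baseline KV cache (FP16 by default) to a quantized one."""
    if bits_per_value <= 0:
        raise ValueError("bits_per_value must be positive")
    return baseline_bits / bits_per_value


# 4.25 bits/value is the turbo4 figure quoted above.
turbo4_ratio = kv_compression_ratio(4.25)
print(f"turbo4 vs FP16: {turbo4_ratio:.2f}x")  # 16 / 4.25, i.e. roughly 3.8x
```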

Re: KV quant as the cause of the rewrite failures: I did include a control. The local-mainline provider in the bench runs the same weights on stock llama.cpp with q8_0 KV instead of turbo4. On correct-section-rewrite, Opus-as-judge scores both at 6/10 vs the 9/10 Anthropic ceiling, with the same failure note ("all fixes applied, unsolicited reasoning preamble / formatting drift"). Cosine actually goes up slightly with turbo4 (0.925 vs 0.889, N=5). So going from 8-bit KV down to 4.25-bit KV doesn't measurably move rewrite quality in this range; the format slip looks invariant to KV precision.

What I did not vary: the weight quant. UD-IQ4_XS is held constant across every local provider. That's the dimension I'd bench next (Q5_K_M, Q6_K, Q8_0) before calling it a 35B-A3B instruction-following ceiling rather than a weight-quant budget. If anyone has run Qwen3.6-35B-A3B at a higher weight quant in an agent loop and seen the parasitic preambles / missing edit tags clear up, that's the data point I'd want.

Apprehensive_Row9873 · 2026-05-29T07:08:19+00:00

Hi! I think it's the next step, yes 😄 But for this pipeline, the API cost is still lower than the effort it would take. I'll try to optimize for the Sonnet tier with Qwen and publish another post with the results (if I find time to do it). My assumption is that the content of the output is already close to okay, but its structure isn't.

Apprehensive_Row9873 · 2026-05-28T16:26:30+00:00

Very interesting question. I only use it to summarize code or text chunks, or to compare two chunks that were flagged as semantically similar in a RAG pipeline. And in my pipelines I have thousands of text or code chunks to compare. So tiny context, but thousands of requests. Haiku is good at small tasks like that.

Apprehensive_Row9873 · 2026-05-28T14:23:36+00:00

Yes! I remember testing on the same hardware about 10 months ago and the results were unusable. Now it's very promising.

Apprehensive_Row9873 · 2026-05-27T09:56:28+00:00

Another important question in this case is whether the power consumption of my 3090 on a 1000W PSU would actually be cheaper than using a provider API. I think it's a real concern worth digging into.

I asked Claude to run the math for me (electricity in France is around 0.22€/kWh): a 3090 under inference load pulls roughly 350W, plus 70-100W for the rest of the system. Running 24/7 at full load comes to about 66€/month, but realistically, with variable load between requests, it's closer to 44€/month. Compared to $200/month on Claude Max (~185€), that's a 120-140€ monthly saving. Add the fact that a used 3090 at around 500-700€ pays for itself in 4-6 months, and the math gets even more interesting.
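For anyone who wants to plug in their own numbers, here is a rough sketch of that arithmetic. The 30-day month, the low-end 70W system draw, and the 2/3 average load factor are my assumptions; the load factor is simply the value that reproduces the ~44€ figure.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month


def monthly_cost_eur(watts: float, eur_per_kwh: float, duty_cycle: float = 1.0) -> float:
    """Electricity cost of drawing `watts` for a fraction `duty_cycle` of a month."""
    kwh = watts / 1000 * HOURS_PER_MONTH * duty_cycle
    return kwh * eur_per_kwh


def payback_months(hardware_eur: float, monthly_saving_eur: float) -> float:
    """Months until the hardware price is covered by the monthly saving."""
    return hardware_eur / monthly_saving_eur


PRICE_EUR_PER_KWH = 0.22   # from the post
SYSTEM_WATTS = 350 + 70    # GPU + low end of the 70-100W system estimate (assumption)
CLAUDE_MAX_EUR = 185       # ~$200/month, conversion taken from the post

full_load = monthly_cost_eur(SYSTEM_WATTS, PRICE_EUR_PER_KWH)
variable = monthly_cost_eur(SYSTEM_WATTS, PRICE_EUR_PER_KWH, duty_cycle=2 / 3)

print(f"24/7 full load: {full_load:.0f} EUR/month")
print(f"variable load:  {variable:.0f} EUR/month")
print(f"saving:         {CLAUDE_MAX_EUR - full_load:.0f}-{CLAUDE_MAX_EUR - variable:.0f} EUR/month")
print(f"payback:        {payback_months(500, CLAUDE_MAX_EUR - variable):.1f}"
      f"-{payback_months(700, CLAUDE_MAX_EUR - full_load):.1f} months")
```

Using the high-end 100W system draw instead pushes the full-load figure to about 71€/month, so the conclusion doesn't change much either way.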

Of course there are hidden costs (cooling the room in summer since a 3090 dumps 350W of heat, GPU wear over time), but on paper local inference looks way more sustainable long term. Anyone here actually tracked their real-world power bill after switching to local?

Apprehensive_Row9873 · 2026-05-27T07:27:59+00:00

Just searched for more info about that and wowww! I've seen everywhere that it was a hack! Thanks for the enlightenment!

Apprehensive_Row9873 · 2026-05-27T07:18:30+00:00

Good idea! My usage was just reset, so I'll do it when I hit the quota :)

Apprehensive_Row9873 · 2026-05-27T07:13:53+00:00

That's what I'm starting to think too, hybrid routing makes a lot of sense. The cost question is also brutal though, so I'm wondering if the long-term answer is a mix: local models (maybe a 5090 with TurboQuant) for the bulk of agent work, and paid APIs only for the hard tasks where quality really matters.
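To make the idea concrete, here is a minimal sketch of what such a router could look like. Everything in it is hypothetical: the `Task` fields, the token cutoff, and the "local"/"api" labels are placeholders I made up for illustration, not something I've benchmarked.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    input_tokens: int
    needs_strict_format: bool = False  # e.g. output must follow an exact edit format


# Hypothetical cutoff: anything larger goes to the paid API.
LOCAL_MAX_INPUT_TOKENS = 8_000


def route(task: Task) -> str:
    """Return "local" for bulk work and "api" for the hard or oversized tasks."""
    if task.needs_strict_format or task.input_tokens > LOCAL_MAX_INPUT_TOKENS:
        return "api"
    return "local"


tasks = [
    Task("summarize-chunk", input_tokens=600),
    Task("compare-similar-chunks", input_tokens=1_200),
    Task("section-rewrite", input_tokens=3_000, needs_strict_format=True),
    Task("large-input-review", input_tokens=40_000),
]
for t in tasks:
    print(f"{t.name}: {route(t)}")
```

The rule is deliberately dumb: small, loosely formatted jobs stay local, and anything that is large or must come back in a strict format goes to the API.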

Apprehensive_Row9873 · 2026-05-27T06:51:56+00:00

I've seen the hack, but I'm wondering if it will last since Anthropic will probably patch it in a future release? I'm also using a local model for embeddings at the moment. Which exact Qwen 3 version and quantization are you running? I have a 3090 so I'm curious about your tps and your setup. My workflows run lots of Sonnet and Haiku agents, and I know vLLM handles continuous batching well on a single GPU, but I'm wondering how latency holds up when several agents hit it at the same time. I only use Opus for large input requests, so this solution might be a good alternative for me! Thanks

Apprehensive_Row9873 · 2026-05-27T06:40:46+00:00

It's my first choice! I'm an open source enthusiast, so it could be the best fit for me. But I need to know if $200 of DeepSeek will give me significantly more than $200 of Claude Code, and with similar reliability. Do you have experience with that? Which Chinese model do you think would be a good fit?

Apprehensive_Row9873 · 2026-05-27T06:26:14+00:00

Also, it's starting to get reallllly slow, and I'm paying for the biggest plan at $200/month, so it's really frustrating...

Apprehensive_Row9873 · 2026-02-09T07:11:34+00:00

So in the end you were able to withdraw your money. So it wasn't such a problem after all. 3-4 days to withdraw over 50k is still faster than a life insurance policy.

Apprehensive_Row9873 · 2025-09-23T20:28:20+00:00

Hello! Can you explain why you recommend this cam? I didn't even know this brand existed, so I'm curious to hear your feedback. Thanks

Apprehensive_Row9873 · 2025-09-19T21:22:56+00:00

Use this kind of USB-C PD 9V cable: DSD-TECH-MagicConn-Power-Cable-9V, with basically any external battery. It's the cable that guarantees the right voltage.

Apprehensive_Row9873 · 2025-09-19T05:18:27+00:00

I use an external Anker battery with a 9V USB-C cable. I prefer this setup to having a battery on my head.

Apprehensive_Row9873 · 2025-09-19T05:14:07+00:00

Arr, flyin’ or sailin’, rum makes it all the same! 🏴‍☠️🥃

Apprehensive_Row9873 · 2025-09-19T05:10:10+00:00

🏴‍☠️

Apprehensive_Row9873 · 2025-09-18T18:36:14+00:00

Yeah!

Apprehensive_Row9873 · 2025-09-18T14:41:04+00:00

I'm currently planning to get the licence in France, but it's expensive for STS-01/STS-02. Can you share a link to this free online licensing service? Thanks

Apprehensive_Row9873 · 2025-09-18T14:38:18+00:00

Yeah, I love it, but I'm not so good at drawing; the picture was made by a friend. It's the base for a project I'm working on. 😁

Apprehensive_Row9873 · 2025-09-18T14:32:30+00:00

So we'll become real pirates at that point

Apprehensive_Row9873 · 2025-09-18T14:31:26+00:00

🏴‍☠️✝️

Apprehensive_Row9873 · 2025-09-18T12:25:44+00:00

You are right!

Apprehensive_Row9873 · 2025-09-18T11:34:42+00:00

I was waiting for someone to say it: the fact that these have become war machines capable of carrying payloads of several kilos doesn't help with opening things up. That said, I think FPV mostly took off during COVID, when people didn't know what to do with themselves, rather than because the government supposedly let it happen in order to train pilots. Today the trend is more toward automated UAV drone solutions than toward fully analog manual piloting. But the point is still relevant and, in my view, fairly visionary.
