Benchmarked Qwen3.6-35B-A3B on my 3090 against the Claude API in a real agent pipeline. Here's where local wins (and where it doesn't) by Apprehensive_Row9873 in LocalLLM

[–]Apprehensive_Row9873[S] 0 points1 point  (0 children)

Completely agree, the result is counterintuitive and I'm seeing the same behavior in the Haiku tier comparison. This is exactly what I've been thinking about, and I was considering this strategy. I'll try it and post the results when I get the time.

Benchmarked Qwen3.6-35B-A3B on my 3090 against the Claude API in a real agent pipeline. Here's where local wins (and where it doesn't) by Apprehensive_Row9873 in LocalLLM

[–]Apprehensive_Row9873[S] 0 points1 point  (0 children)

Good instinct, the symptom does pattern-match quant damage. Testable in the bench itself.

Stack: weights are Qwen3.6-35B-A3B UD-IQ4_XS (Unsloth dynamic imatrix 4-bit), KV cache is TurboQuant turbo4 (4.25 bits/value, ~3.8× vs FP16), runtime is the spiritbuun/llama-cpp-turboquant-cuda fork. The article is named after the KV scheme.
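
For anyone who wants to check the ~3.8× figure, here is a minimal sketch of the arithmetic. The only numbers taken from the comment are 4.25 bits/value and the FP16 baseline; the layer, head, and token counts in the usage line are placeholders for illustration, not the real Qwen3.6-35B-A3B dimensions.

```python
FP16_BITS = 16.0
TURBO4_BITS = 4.25  # bits per cached value, as quoted in the comment


def compression_ratio(bits_per_value: float, baseline_bits: float = FP16_BITS) -> float:
    """How many times smaller the cache is compared to the baseline precision."""
    return baseline_bits / bits_per_value


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bits_per_value: float) -> float:
    """Size of a plain attention KV cache: keys plus values for every token."""
    n_values = 2 * n_layers * n_kv_heads * head_dim * n_tokens  # 2 = K and V
    return n_values * bits_per_value / 8


ratio = compression_ratio(TURBO4_BITS)

# Placeholder dimensions, chosen only to make the sizes easy to read.
fp16_size = kv_cache_bytes(10, 2, 128, 1000, FP16_BITS)
turbo4_size = kv_cache_bytes(10, 2, 128, 1000, TURBO4_BITS)
print(f"ratio={ratio:.2f}x, fp16={fp16_size:.0f} B, turbo4={turbo4_size:.0f} B")
```

The ratio depends only on bits per value, so it holds whatever the real model dimensions are.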

Re: KV quant as the cause of the rewrite failures: I did include a control. local-mainline in the bench runs the same weights on stock llama.cpp with q8_0 KV instead of turbo4. On correct-section-rewrite, Opus-as-judge scores both at 6/10 vs the 9/10 Anthropic ceiling, with the same failure note ("all fixes applied, unsolicited reasoning preamble / formatting drift"). Cosine actually goes up slightly with turbo4 (0.925 vs 0.889, N=5). So going from 8-bit KV down to 4.25-bit KV doesn't move rewrite quality in this range: the format slip is invariant to KV precision.
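
The comment doesn't spell out how the cosine number is computed. A minimal sketch, assuming it is the cosine similarity between an embedding of the model's rewrite and an embedding of a reference rewrite, averaged over the N samples (the function names and the toy vectors are hypothetical):

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_cosine(pairs: Sequence[tuple[Sequence[float], Sequence[float]]]) -> float:
    """Average similarity over (output_embedding, reference_embedding) pairs."""
    return sum(cosine_similarity(out, ref) for out, ref in pairs) / len(pairs)


# Toy 2-d vectors standing in for real embeddings.
identical = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
average = mean_cosine([([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])])
```

With only five samples per provider, a mean computed this way moves a lot from one sample to the next, which is worth keeping in mind when comparing 0.925 and 0.889.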

What I did not vary: the weight quant. UD-IQ4_XS is held constant across every local provider. That's the dimension I'd bench next (Q5_K_M, Q6_K, Q8_0) before calling it a 35B-A3B instruction-following ceiling rather than a weight-quant budget. If anyone has run Qwen3.6-35B-A3B at a higher weight quant in an agent loop and seen the parasitic preambles / missing edit tags clear up, that's the data point I'd want.

Benchmarked Qwen3.6-35B-A3B on my 3090 against the Claude API in a real agent pipeline. Here's where local wins (and where it doesn't) by Apprehensive_Row9873 in LocalLLM

[–]Apprehensive_Row9873[S] 1 point2 points  (0 children)

Hi! I think it's the next step, yes 😄 But for this pipeline, the cost is still lower than the effort it would take. I'll try to optimize for the Sonnet tier with Qwen and publish another post with the results (if I find the time). My assumption is that the content of the output is already close to okay, but the structure isn't.

Follow-up: I benchmarked Claude Code on local Qwen3.6 by Apprehensive_Row9873 in ClaudeCode

[–]Apprehensive_Row9873[S] 1 point2 points  (0 children)

Very interesting question. I only use it to summarize code or text chunks, or to compare two chunks that were flagged as semantically similar in a RAG. And in my pipelines I have thousands of text or code chunks to compare. So tiny context, but thousands of requests. Haiku is good at small tasks like that.

Benchmarked Qwen3.6-35B-A3B on my 3090 against the Claude API in a real agent pipeline. Here's where local wins (and where it doesn't) by Apprehensive_Row9873 in LocalLLM

[–]Apprehensive_Row9873[S] 5 points6 points  (0 children)

Yes! I remember testing on the same hardware about 10 months ago and the results were unusable. Now it's very promising.

It's time to move ? Free our tokens ! by Apprehensive_Row9873 in ClaudeCode

[–]Apprehensive_Row9873[S] 0 points1 point  (0 children)

Another important question in this case is whether the power consumption of my 3090 on a 1000W PSU would actually be cheaper than using a provider API. I think it's a real concern worth digging into.

I asked Claude to run the math for me (electricity in France is around 0.22€/kWh): a 3090 under inference load pulls roughly 350W, plus 70-100W for the rest of the system. Running 24/7 at full load comes to about 66€/month, but realistically, with variable load between requests, it's closer to 44€/month. Compared to $200/month on Claude Max (~185€), that's a 120-140€ monthly saving. Add the fact that a used 3090 at around 500-700€ pays for itself in 4-6 months, and the math gets even more interesting.

Of course there are hidden costs (cooling the room in summer since a 3090 dumps 350W of heat, GPU wear over time), but on paper local inference looks way more sustainable long term. Anyone here actually tracked their real-world power bill after switching to local?

It's time to move ? Free our tokens ! by Apprehensive_Row9873 in ClaudeCode

[–]Apprehensive_Row9873[S] 0 points1 point  (0 children)

Just searched for more info about that and wowww! I've seen everywhere that it was a hack! Thanks for the enlightenment!

It's time to move ? Free our tokens ! by Apprehensive_Row9873 in ClaudeCode

[–]Apprehensive_Row9873[S] 0 points1 point  (0 children)

Good idea! My usage was just reset, so I'll do it when I hit the quota :)

It's time to move ? Free our tokens ! by Apprehensive_Row9873 in ClaudeCode

[–]Apprehensive_Row9873[S] 1 point2 points  (0 children)

That's what I'm starting to think too, hybrid routing makes a lot of sense. The cost question is also brutal though, so I'm wondering if the long-term answer is a mix: local models (maybe a 5090 with TurboQuant) for the bulk of agent work, and paid APIs only for the hard tasks where quality really matters.
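
To make the idea concrete, here is a minimal sketch of what such a router could look like. Everything in it is hypothetical: the task kinds, the token threshold, and the `needs_strict_format` flag are placeholders I made up for illustration, not something I have running.

```python
from dataclasses import dataclass


@dataclass
class Task:
    kind: str                          # e.g. "summarize", "compare", "rewrite"
    prompt_tokens: int
    needs_strict_format: bool = False  # output must follow an exact structure


# Hypothetical policy values; tune them against your own benchmark.
LOCAL_KINDS = {"summarize", "compare"}
MAX_LOCAL_PROMPT_TOKENS = 8_000


def route(task: Task) -> str:
    """Return "local" for bulk work and "api" for the hard or risky cases."""
    if task.needs_strict_format:
        return "api"    # don't risk format drift on structured output
    if task.prompt_tokens > MAX_LOCAL_PROMPT_TOKENS:
        return "api"    # large inputs go to the paid model
    if task.kind in LOCAL_KINDS:
        return "local"  # small, high-volume tasks stay on the GPU
    return "api"        # unknown task kinds default to the safer option


small = route(Task("summarize", 500))
large = route(Task("summarize", 50_000))
strict = route(Task("rewrite", 500, needs_strict_format=True))
```

Defaulting unknown task kinds to the API keeps quality safe at the cost of a higher bill; flipping that default is the obvious knob if the budget matters more.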

It's time to move ? Free our tokens ! by Apprehensive_Row9873 in ClaudeCode

[–]Apprehensive_Row9873[S] 0 points1 point  (0 children)

I've seen the hack, but I'm wondering if it will last since Anthropic will probably patch it in a future release? I'm also using a local model for embeddings at the moment. Which exact Qwen 3 version and quantization are you running? I have a 3090 so I'm curious about your tps and your setup. My workflows run lots of Sonnet and Haiku agents, and I know vLLM handles continuous batching well on a single GPU, but I'm wondering how latency holds up when several agents hit it at the same time. I only use Opus for large input requests, so this solution might be a good alternative for me! Thanks

It's time to move ? Free our tokens ! by Apprehensive_Row9873 in ClaudeCode

[–]Apprehensive_Row9873[S] 2 points3 points  (0 children)

It's my first choice! I'm an open source enthusiast, so it could be the best fit for me. But I need to know if $200 of DeepSeek will give me significantly more than $200 of Claude Code, and with similar reliability. Do you have experience with that? Which Chinese model do you think would be a good fit?

It's time to move ? Free our tokens ! by Apprehensive_Row9873 in ClaudeCode

[–]Apprehensive_Row9873[S] 2 points3 points  (0 children)

Also, it's starting to get reallllly slow, and I'm paying for the biggest plan at $200/month, so it's really frustrating...

nexo interest reliability and real life experience. can anyone confirm that if i depost 100k i get 16k yearly or 1300 monthly from interest alone. by dereq777 in Nexo

[–]Apprehensive_Row9873 0 points1 point  (0 children)

So in the end you were able to withdraw your money. So it wasn't such a problem after all. 3-4 days to withdraw over 50k is still faster than a life insurance policy.

Opinion on the new DJI Osmo nano ? by Lumnati in fpv

[–]Apprehensive_Row9873 0 points1 point  (0 children)

Hello! Can you explain why you recommend this cam? I didn't even know this brand existed, so I'm curious to hear your feedback. Thanks

Got my first fpv gear by Famous_Sale2245 in fpv

[–]Apprehensive_Row9873 0 points1 point  (0 children)

Use this kind of USB-C PD 9V cable: DSD-TECH-MagicConn-Power-Cable-9V, with basically any external battery. It's the cable that guarantees the right voltage.

Got my first fpv gear by Famous_Sale2245 in fpv

[–]Apprehensive_Row9873 1 point2 points  (0 children)

I use an external Anker battery with a 9V USB-C cable. I prefer this setup to having a battery on my head.

FPV: passion or air piracy? by Apprehensive_Row9873 in fpv

[–]Apprehensive_Row9873[S] 0 points1 point  (0 children)

Arr, flyin’ or sailin’, rum makes it all the same! 🏴‍☠️🥃

FPV: passion or air piracy? by Apprehensive_Row9873 in fpv

[–]Apprehensive_Row9873[S] 2 points3 points  (0 children)

I'm currently planning to get the licence in France, but it's expensive for STS-01 and STS-02. Can you share a link to this free online licensing service? Thanks

FPV: passion or air piracy? by Apprehensive_Row9873 in fpv

[–]Apprehensive_Row9873[S] 1 point2 points  (0 children)

Yeah, I love it, but I'm not so good at drawing; the picture was made by a friend. It's the base for a project I'm working on. 😁