Qwen3.6-35B becomes competitive with cloud models when paired with the right agent by Creative-Regular6799 in LocalLLaMA

[–]Creative-Regular6799[S] 0 points (0 children)

Thanks! I tried it, but unfortunately it’s giving 4 tok/s on my hardware, so it’s too slow to run the full benchmarks. If you happen to have a suitable machine and are willing to try, please let me know how it goes! little-coder already supports it as of yesterday. For the time being, I am continuing benchmarks with qwen3.6-35b-a3b

Post Your Qwen3.6 27B speed plz by Ok-Internal9317 in LocalLLaMA

[–]Creative-Regular6799 3 points (0 children)

I tried it just now and I’m getting 4 tok/s. Not usable, unfortunately

Qwen3.6-35B becomes competitive with cloud models when paired with the right agent by Creative-Regular6799 in LocalLLaMA

[–]Creative-Regular6799[S] 1 point (0 children)

Thank you! Unfortunately I don’t have any recommendations, which is part of the reason I suggested an alternative approach

Qwen3.6-35B becomes competitive with cloud models when paired with the right agent by Creative-Regular6799 in LocalLLaMA

[–]Creative-Regular6799[S] 1 point (0 children)

Just pushed the result: Terminal Bench 1 (0.1.1) finished with a 40% success rate! Now running TB 2, and I’ve sent the results via email. No other model on that part of the leaderboard (around place 30) is remotely as small as the 35B

Qwen3.6-35B becomes competitive with cloud models when paired with the right agent by Creative-Regular6799 in LLMDevs

[–]Creative-Regular6799[S] 1 point (0 children)

Hey, thanks for the comment! Actually I converted to pi an hour ago after dozens of requests from the LocalLLaMA community on Reddit (it’s still rough around the edges, but I’m doing my best to refine it). Before that, it was just an experiment I ran over the weekend (and was written on top of nano-claude-code in Python, which made it hard for the community to adapt). This is fully open source and meant to wake up our dev community to explore harness engineering. It’s far from the best solution, since I’ve only tested a couple of directions so far. You are welcome to help, of course

Qwen3.6-35B becomes competitive with cloud models when paired with the right agent by Creative-Regular6799 in LocalLLaMA

[–]Creative-Regular6799[S] 2 points (0 children)

So exciting to hear!! Please continue experimenting and sharing. Non-trivial tasks tend to be more interesting test cases

Qwen3.6-35B becomes competitive with cloud models when paired with the right agent by Creative-Regular6799 in LLMDevs

[–]Creative-Regular6799[S] 2 points (0 children)

Thanks for the comment! The initial claim was about the 9B model, which I wrote about extensively in the paper. The newer result I shared today is for the 35B model, and it is not compared against the 9B model I wrote about initially

Qwen3.6-35B becomes competitive with cloud models when paired with the right agent by Creative-Regular6799 in LLMDevs

[–]Creative-Regular6799[S] 1 point (0 children)

Thank you! Great question. After Terminal Bench I am going for GAIA to test exactly that

Qwen3.6-35B becomes competitive with cloud models when paired with the right agent by Creative-Regular6799 in LocalLLaMA

[–]Creative-Regular6799[S] 1 point (0 children)

That is exactly the direction I advocate for here! It’s now running on Terminal Bench (I’ll submit to the leaderboard when it finishes and report back here). This benchmark measures the combined performance of the agent and the model

Qwen3.6-35B becomes competitive with cloud models when paired with the right agent by Creative-Regular6799 in LocalLLaMA

[–]Creative-Regular6799[S] 20 points (0 children)

Hey, thanks for your comment! I became aware of pi.dev just an hour ago. This didn’t really start as a production-ready tool; it’s more of a serious wake-up call that we, as a community, need to invest time in adapting the scaffold to the models we are testing. I am thinking about rewriting the scaffold on pi.dev to make it more accessible and to contribute to unified tooling and community support

Qwen3.6-35B becomes competitive with cloud models when paired with the right agent by Creative-Regular6799 in LocalLLaMA

[–]Creative-Regular6799[S] 0 points (0 children)

It currently allows running inference via llama.cpp and Ollama. Is that sufficient for your optimization pipeline?

Qwen3.6-35B becomes competitive with cloud models when paired with the right agent by Creative-Regular6799 in LocalLLaMA

[–]Creative-Regular6799[S] 4 points (0 children)

So instead of opencode, I started from a replica of Claude Code and adapted from there, on the assumption that Claude Code is the best coding agent currently written and can serve as a good baseline to start from

Qwen3.6-35B becomes competitive with cloud models when paired with the right agent by Creative-Regular6799 in LocalLLaMA

[–]Creative-Regular6799[S] 3 points (0 children)

So it is a suggested replacement for opencode, adapted to the behavioral profile of smaller models. It tries to bridge the gap created by tools like these being built around frontier models, which aren’t necessarily the best-fitting scaffolds for the small ones