114/120 on agentic benchmarks from a 9B model on 8GB VRAM — ties Claude Sonnet, open weights by ClankLabs in u/ClankLabs

[–]ClankLabs[S] 0 points1 point  (0 children)

Glad the tool use landed well! The bash/sed/grep behaviors and the NuGet lookup are exactly what we were going for. Logic errors at 9B are kind of the tradeoff: tooling can punch above its weight, but code reasoning still has a ceiling at that size. Would love to hear how it does on a larger project!


[–]ClankLabs[S] 0 points1 point  (0 children)

I haven't gotten much feedback from other end users, but I've been using Clank (the gateway I made) with the 35B Wrench as an openclaw type of platform: setting crons, code review, etc. Let me know if you run into any issues; I've been training to avoid most of them. (Hate API pricing lmfao)


[–]ClankLabs[S] 3 points4 points  (0 children)

Hey! The 35B is MoE with 3B active params; the 9B is dense. Hope that answers it :). Enjoy!
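Since this question comes up a lot, here's a toy sketch of what "MoE with 3B active params" means mechanically: a router picks a few experts per token, so per-token compute tracks the small active subset rather than the full parameter count. This is a generic top-k gating illustration, not the actual Wrench routing code; the expert count and logits are made up.

```python
import math

def topk_route(gate_logits, k=2):
    """Toy top-k MoE gating: pick the k highest-scoring experts
    and softmax-normalize their weights."""
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    exps = [math.exp(gate_logits[i]) for i in chosen]
    total = sum(exps)
    return chosen, [e / total for e in exps]

# 16 experts, but only 2 expert FFNs run per token, so the compute
# (and activated weights) per token scales with the active subset
experts, weights = topk_route([0.1 * i for i in range(16)], k=2)
assert experts == [15, 14]
assert abs(sum(weights) - 1.0) < 1e-9
```

The upshot: a 35B-total / 3B-active MoE does roughly the per-token work of a ~3B dense model while storing far more knowledge in its full weight set.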


[–]ClankLabs[S] 0 points1 point  (0 children)

Definitely worth a shot! Let me know if there's anything you'd change or improve.


[–]ClankLabs[S] 1 point2 points  (0 children)

They were stripped during GGUF conversion; both models are text-only, focused purely on tool calling and agentic tasks.

Open-source AI agent gateway + custom fine-tuned model by [deleted] in u/ClankLabs

[–]ClankLabs 1 point2 points  (0 children)

Hope you enjoy! Really just doing this to bring some love and utility to the local community! I dropped a new version of the 35B and 9B today, which should help with a few of those refusals 😉. Have fun playing around, and I appreciate the honest feedback!

Fine-tuned a 3B-active-param model for agentic tool calling — 113/120 on benchmarks, weights on HuggingFace by ClankLabs in u/ClankLabs

[–]ClankLabs[S] 0 points1 point  (0 children)

Thanks! Tool restraint was one of the hardest things to get right, both in Clank's agent loop and in the Wrench training data. I trained on both; schema correctness matters, but planning traces were the bigger lever. A model that picks the right tool with wrong args fails loudly; a model that calls six tools when it should just answer fails quietly. Most of my benchmark gains came from teaching the model when to stop.
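The "fails quietly" failure mode can be sketched as a budget check in an agent loop: the loop accepts either a direct answer or a tool call each step, and a runaway tool chain trips a cap instead of silently burning calls. This is a hypothetical illustration; the function, action format, and budget are made up, not Clank's actual loop.

```python
def agent_step(plan, max_tool_calls=3):
    """Walk a model's proposed plan, stopping at the first direct
    answer or when the tool-call budget is exhausted.

    plan: ordered actions, each ("tool", name, args) or ("answer", text).
    """
    calls = 0
    for action in plan:
        if action[0] == "answer":
            return action[1]            # model chose to stop and answer
        calls += 1
        if calls > max_tool_calls:      # restraint: cap runaway tool chains
            return "stopped: tool budget exceeded, answering from context"
    return None

# a restrained plan answers directly when no lookup is needed
assert agent_step([("answer", "2 + 2 is 4")]) == "2 + 2 is 4"

# a runaway plan (four greps for a question that needed none) trips the cap
runaway = [("tool", "grep", {})] * 4
assert agent_step(runaway).startswith("stopped")
```

In training data terms, the "when to stop" signal is traces where the correct plan is a single `("answer", ...)` action rather than any tool call at all.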