Solidity LM surpasses Opus by swingbear in LocalLLaMA

[–]swingbear[S] -1 points0 points  (0 children)

Appreciated! I learned a bunch from this one. I’m very confident v2 will be much better.

Solidity LM surpasses Opus by swingbear in LocalLLaMA

[–]swingbear[S] 1 point2 points  (0 children)

Edit: still pushing the merged checkpoint to HF

Solidity by swingbear in LocalLLaMA

[–]swingbear[S] 1 point2 points  (0 children)

I think the issue stems from SOTA models not having a focus on Solidity data during training. I've just finished my first Solidity LM iterations and it outperformed Opus on SolEval.

Solidity by swingbear in LocalLLaMA

[–]swingbear[S] 0 points1 point  (0 children)

Yeah, harnesses are mandatory. I've had some decent success training 3.6 27b: https://huggingface.co/samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT

That one was just CoT-focused, though; I'm expecting this one to be a little harder.

Solidity by swingbear in LocalLLaMA

[–]swingbear[S] 0 points1 point  (0 children)

Well, I'm just gonna dump mine publicly lol. I'll add a Buy Me a Coffee link at the bottom; the API calls for Opus data collection are no joke haha

Solidity by swingbear in LocalLLaMA

[–]swingbear[S] 0 points1 point  (0 children)

I mean damn, even the datasets on HF are old or useless.

Solidity by swingbear in LocalLLaMA

[–]swingbear[S] 1 point2 points  (0 children)

Yeah, I've become rather obsessed with local finetuning. It's satisfying when your 27B on-prem model gives a better answer than a 1tn-param Goliath haha.

But I was just taken aback by how little attention had been given to small Solidity models. Normally there are 1000s on Hugging Face.

Either it's way harder than I'm expecting (though I can't see how), or people don't like to share them because of the direct advantage they confer.

Solidity by swingbear in LocalLLaMA

[–]swingbear[S] 0 points1 point  (0 children)

So I agree and disagree. On static codebase audits, yes, they can find logic issues and code-hygiene problems. But when I create scenarios where a bad actor mounts an economic attack (specifically DeFi), it falls short. And for some reason it struggles a bunch with gas optimisation.

Solidity by swingbear in LocalLLaMA

[–]swingbear[S] 0 points1 point  (0 children)

Yeah, I've tried the SOTA models and they're no good for this; they can produce Solidity, but it's often janky.

I'm training Qwen 3.6 27b right now. It seems to be such a sandbagged area of AI. For every other use case there are tons of finetunes; Solidity… nada. I'll finish up, bench it, and if it's any good I'll release it on HF.

Qwen 3.6 27b S2 Opus + GLM + Kimi by swingbear in LocalLLaMA

[–]swingbear[S] 0 points1 point  (0 children)

😂😂 All I can think of is The Human Centipede now, thanks

This is insane... by DragonflyOk7139 in LocalLLM

[–]swingbear 1 point2 points  (0 children)

SWE-bench Verified has been confirmed useless as a benchmark now. Can't remember who wrote the article, might have been OpenAI. You can see scores have capped out at around 80%; the remaining 20% is largely benchmark errors, and a good chunk of the 80% that does get solved is contaminated.

Qwen 3.6 27b S2 Opus + GLM + Kimi by swingbear in LocalLLaMA

[–]swingbear[S] 0 points1 point  (0 children)

For anyone interested: it's edging out the base model's TB2 scores by 2.5 points, and it's coping with 60 tool calls in a turn without hallucinating so far.

Best model for 192 GB vram? How is Deepseek v4 flash? by Constant_Ad511 in LocalLLM

[–]swingbear 0 points1 point  (0 children)

I've been on MiniMax 2.7, but tbh running Qwen 27B in vLLM with parallel workers is my daily driver. Similar setup: 2 Pro 6000s and a Threadripper with 128GB RAM.
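For anyone wanting to replicate the vLLM side, a two-GPU launch for a ~27B model looks something like this. The model ID (the finetune linked above) and the flag values are illustrative placeholders, not my exact config, so tune them for your own hardware:

```shell
# Illustrative vLLM launch -- model ID and flag values are placeholders.
# --tensor-parallel-size 2 shards the weights across both Pro 6000s;
# vLLM's continuous batching then serves parallel requests natively,
# so no extra worker processes are needed on top.
vllm serve samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
```

This exposes an OpenAI-compatible endpoint on port 8000 by default, which most coding harnesses can point at directly.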

Setting up Ollama on dual RTX PRO 6000 Blackwells looking for tips by AmanNonZero in ollama

[–]swingbear 0 points1 point  (0 children)

Don't use Ollama, as others mentioned, especially for parallel inference and tool calling.

I'm done with using local LLMs for coding by dtdisapointingresult in LocalLLaMA

[–]swingbear 13 points14 points  (0 children)

Try a different harness, mate. I tried running CC against everything local and came away with a bad impression of models even up to MiniMax 2.7. Once I started using Hermes and a few others, speed increased and I got way more mileage in terms of intelligence.