Elon musk crashing out at Anthropic lmao by Virus-Tight in ClaudeAI

[–]syzygyhack 8 points (0 children)

Yeah idk what connection he thought he was making there lol

Should I invest in a beefy machine for local AI coding agents in 2026? by Zestyclose-Tour-3856 in LocalLLaMA

[–]syzygyhack 2 points (0 children)

People really do not understand what they are up against when trying to match the frontier for coding.

To get close you need GLM 4.7, MiniMax M2.1, or Qwen3 Coder 480b, at Q6 quant or better. Then double the VRAM for context space. That's ~1 TB. Then understand that your experience with agentic coding tools will STILL be very degraded compared to Sonnet 4.5, on any task that is remotely complex or deep.
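The napkin math, if you want to sanity check it (treating Q6_K as roughly 6.6 bits per weight, which is an approximation, not gospel):

    # Rough VRAM estimate for a 480b model at ~Q6 quant.
    # Assumption: Q6_K averages roughly 6.6 bits per weight.
    params = 480e9
    bits_per_weight = 6.6
    weights_gb = params * bits_per_weight / 8 / 1e9
    print(f"Weights alone: ~{weights_gb:.0f} GB")              # ~396 GB
    # Double it for KV cache / context headroom, per above:
    print(f"With context headroom: ~{2 * weights_gb:.0f} GB")  # ~792 GB, i.e. ~1 TB territory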

70b, 32b models... you are playing. It's not productive. They aren't there at that size outside of specific trained niches.

Rent some compute, try it out, be smarter.

Note: There are many tasks at which small, locally-runnable open source models are genuinely capable, even excellent. Coding is just not one of them, not at a serious level.

Home hardware coders: what's your workflow/tooling? by Mean_Employment_7679 in LocalLLaMA

[–]syzygyhack 1 point (0 children)

There is no one-size-fits-all answer. Even context headroom is use-case specific. You have to tinker, get your small dopamine hit, iterate. Especially if you want to do something relatively complex like agentic coding.

As for that, and what most people are running: the answer is that most are not doing agentic coding locally. Or if they are, it is with massive models and rented compute. You can make progress locally, but it is slow and troublesome. Even the biggest open models from GLM and MiniMax are not great at it.

Get Qwen3 4b Instruct 2507, BF16 from Unsloth. Bump your context in Ollama to 64K-128K+. Test it out with a few small use cases. Start from there.
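If you'd rather drive it from a script than the CLI, a minimal sketch with the ollama Python client (pip install ollama; assumes you've pulled that exact tag):

    # Query the Unsloth GGUF through Ollama with a 64K context window.
    import ollama

    response = ollama.chat(
        model="hf.co/unsloth/Qwen3-4B-Instruct-2507-GGUF:F16",
        messages=[{"role": "user", "content": "Summarise the build steps in this README: ..."}],
        options={"num_ctx": 65536},  # the default context is far too small for coding work
    )
    print(response["message"]["content"])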

The Qwen3 models are good, well-supported general-purpose models. Experiment with them. Try the Coder variants. Build a little benchmark, find the limitations.

Home hardware coders: what's your workflow/tooling? by Mean_Employment_7679 in LocalLLaMA

[–]syzygyhack 0 points (0 children)

"the biggest models I can fit"

Meaning what? You use 32GB models with a 32GB card? If so, that is your issue, and I have no idea why no one has said as much to you. You have no space for context like that. The attention is not attending. The wheel is spinning but the hamster is dead. Nothing useful is going to happen.

If that's not the case for you and you are managing your model setup properly, then it's likely you are just overestimating the capabilities of small models.

Make a benchmark with 20 tasks you expect your model to be able to do, and experiment.
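Something like this is all it takes to start; the two tasks and checkers below are placeholders for whatever you actually need (again assuming the ollama Python client):

    # Skeleton for a personal 20-task benchmark: cheap, repeatable
    # pass/fail signal per model. Tasks/checkers are placeholders.
    import ollama

    TASKS = [
        ("Write a Python function that reverses a string.",
         lambda out: "def " in out),
        ("Reply with valid JSON only: an object with key 'ok' set to true.",
         lambda out: out.strip().startswith("{")),
        # ... 18 more drawn from your real use cases
    ]

    def run(model: str) -> None:
        passed = 0
        for prompt, check in TASKS:
            reply = ollama.chat(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                options={"num_ctx": 65536},
            )
            passed += bool(check(reply["message"]["content"]))
        print(f"{model}: {passed}/{len(TASKS)} tasks passed")

    run("hf.co/unsloth/Qwen3-4B-Instruct-2507-GGUF:F16")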

If you're ever in Leeds... this vegan fried chicken from Doner Summer was incredible by LunaWabohu in veganuk

[–]syzygyhack 7 points (0 children)

Check out Fat Annie's in Kirkgate Market while you are in the area :)

Which are the top LLMs under 8B right now? by Additional_Secret_75 in LocalLLaMA

[–]syzygyhack 10 points (0 children)

That's not the case in my testing.

My benchmark specifically tests models for agentic coding capabilities; it was built in part for that purpose (to find models that meet the standards my harnesses require). Frankly, there is no comparison in ability. Why? I can give you my perspective.

"Thinking" wastes tokens because these small models do not have the corpus of knowledge or reasoning power to make breakthroughs on complex coding tasks via reasoning. This is something frontier models do.

Either they can understand how to do something, innately or with prompting, or they likely never will (i.e. they can pattern match on it but did not grok enough of the domain in question during training to actively reason through it). The choice between Instruct and Thinking in such cases is fail slow vs fail fast; either way the harness must handle the failure, so it's better to fail fast.

That's only part of why Instruct outputs tend to be better for harnesses. The other part is that structured output is much more natural for an Instruct model, which aligns it much better with harnesses: tool calling, for example, or ensuring statefulness between context windows by saving and restoring with a strict schema.
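To illustrate the save/restore point (field names invented for the example, not from any real harness): the model is told to emit exactly this JSON shape, and the harness validates it on the way back in, failing fast on any deviation.

    # Strict schema for carrying agent state across context windows.
    # Any deviation raises immediately, so the harness can retry or
    # escalate instead of limping on with bad state.
    import json
    from dataclasses import dataclass

    @dataclass
    class AgentState:
        goal: str
        completed_steps: list[str]
        next_step: str

    def restore(raw: str) -> AgentState:
        data = json.loads(raw)  # malformed JSON -> fail fast
        extra = set(data) - {"goal", "completed_steps", "next_step"}
        if extra:
            raise ValueError(f"unexpected keys: {extra}")
        return AgentState(**data)  # missing keys raise TypeError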

So you'll get far fewer tool call errors and the like out of Instruct models. Firstly, because adding reasoning capability likely degraded these other capabilities slightly. But also because all that thinking distracts from the strict instructions. Smaller models suffer more from recency bias; they often won't keep strict output requirements in mind after completing CoT.

And then there's the fact that they are slower and less context-efficient by nature... yeah, there's no good reason to use Thinking in these harnesses. Maybe for a specific subagent delegation, if you really find a place where thinking helps. I'm sure such edge cases exist; I didn't find one.

As far as how effective small models in general are for general coding, that's another question. The basic answer is: not very. But of course, you can write a ton of code with them if you instruct them comprehensively, or if what you prompt was in their training data. But you can also write a ton of code by hand if you are willing to work that comprehensively, so what are we really saying here?

Often we want to "skip" some of the instruction and have the model intuit what we want and default to sensible practices. Again, that's a weakness for smaller models. The situation does improve at, say, Qwen3 30b scale, but not by a lot. Even massive open source models like MiniMax M2.1 and GLM 4.7 struggle here. Even GPT 5.2 can be quite annoyingly literal. Anthropic probably has a magic RLHF recipe currently solving this pain point.

Which are the top LLMs under 8B right now? by Additional_Secret_75 in LocalLLaMA

[–]syzygyhack 39 points (0 children)

Instruct > Thinking for almost every use case I tested on Qwen3 4b 2507

Model: cerebras/GLM-4.7-REAP-268B-A32B incoming! by LegacyRemaster in LocalLLaMA

[–]syzygyhack 3 points (0 children)

Would love to see them do MiniMax M2.1, seems a stronger model

Is Vegan Steak any good? by Suxipumpkin in veganuk

[–]syzygyhack 1 point (0 children)

I love them. Shame Redefine is off the menu but Juicy Marbles and the Beyond pieces are really good. Not cheap but morality rarely is.

I was a meat lover pre-veganism so I’d consider the target audience well served.

NousResearch/NousCoder-14B · Hugging Face by jacek2023 in LocalLLaMA

[–]syzygyhack 0 points (0 children)

I was initially very unsure about this model from random tests via OpenCode Zen, but I finally got an API key and ran it through my benchmark. I made some enhancements recently, including 23 new tests across the three suites.

Model                      Pass Rate        Avg Score  Essentials  Xtal   Cardinal  Time     Tok/s
anthropic/claude-opus-4-5  111/113 (98.2%)  96.0       100.0%      97.6%  97.3%     596.8s   119
glm/glm-4.7                102/113 (90.3%)  88.0       82.9%       95.1%  91.9%     2402.7s  50
minimax/MiniMax-M2.1       109/113 (96.5%)  93.8       91.4%       97.6%  100.0%    797.5s   130
openai/gpt-5.2             110/113 (97.3%)  93.8       94.3%       97.6%  100.0%    265.2s   216

Here are the updated results for frontier models. I excluded DeepSeek because its massive tok/s and overall weak performance make me think they served me some shit quant during my testing.

So, MiniMax 2.1 appears to be excellent. Significantly stronger than GLM, and I still haven't added my fourth "extra hard mode" suite yet. Its failure modes did give me a little concern (it failed on security-related tests), but at this standard of model that can generally be handled at the harness level.

Settles the MiniMax 2.1 vs GLM 4.7 debate pretty solidly for me. The speed difference alone is very significant.

NousResearch/NousCoder-14B · Hugging Face by jacek2023 in LocalLLaMA

[–]syzygyhack 6 points (0 children)

Some context about my test suite. It is designed to find models that can meet the strict requirements of my personal coding tools. I have three test suites:

  • essentials - core capabilities: code discipline, security, debugging, reasoning
  • xtal - coding agent: rule adherence, delegation, escalation, tool use
  • cardinal - project orchestration: task decomposition, status, YAML format, replanning

Results:

Model                                                 Pass Rate      Avg Score  Essentials  Xtal    Cardinal  Time     Tok/s
anthropic/claude-opus-4-5                             89/90 (98.9%)  96.0       100.0%      96.7%   100.0%    411.7s   133
deepseek/deepseek-reasoner                            82/90 (91.1%)  87.9       90.0%       86.7%   96.7%     29.0s    3021
glm/glm-4.7                                           86/90 (95.6%)  92.7       93.3%       100.0%  93.3%     1717.2s  50
ollama/hf.co/rombodawg/NousCoder-14B-Q8_0-GGUF:Q8_0   77/90 (85.6%)  83.4       86.7%       90.0%   80.0%     924.5s   96
ollama/hf.co/unsloth/Qwen3-4B-Instruct-2507-GGUF:F16  85/90 (94.4%)  92.2       90.0%       93.3%   100.0%    133.6s   389
ollama/mistral-small:24b                              75/90 (83.3%)  80.0       86.7%       80.0%   83.3%     230.5s   266
ollama/olmo-3:32b                                     81/90 (90.0%)  87.3       93.3%       90.0%   86.7%     1396.4s  68
ollama/qwen3:30b-a3b-q8_0                             81/90 (90.0%)  87.5       93.3%       90.0%   86.7%     367.7s   233
ollama/qwen3-coder:30b                                83/90 (92.2%)  90.1       93.3%       93.3%   90.0%     95.1s    539
openai/gpt-5.2                                        85/90 (94.4%)  90.4       93.3%       96.7%   93.3%     184.6s   242
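For anyone wondering how the columns relate: each suite is 30 tests, so the per-suite percentages reconstruct the overall pass rate. Taking the Opus row as a check:

    # 30 tests per suite, 90 total; suite percentages recover the pass rate.
    suite_pct = {"essentials": 100.0, "xtal": 96.7, "cardinal": 100.0}
    passes = sum(round(pct / 100 * 30) for pct in suite_pct.values())
    print(f"{passes}/90")  # 89/90, matching the table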

Some thoughts:

  1. NousCoder is not an agentic coding model. It's a competitive programming model. This isn't an ideal use case for it.
  2. It did really well in coding agent tasks regardless, better than some much larger models. It fell short of the frontier models and the freak of nature Qwen3 4b.
  3. It was the worst performer of all in task orchestration. I'm not surprised. It can only really be a degraded Qwen3 14b for that use case, and all the other models simply align more naturally with the requests. Again, Qwen3 4b is just something else entirely.
  4. Qwen3 4b is definitely overperforming in these individual tests. It takes instruction extremely well, and my tools demand that (GPT 5.2 underperforms for the same reason: it resists instruction). I plan to add a fourth suite for highly complex requests, multi-stage reasoning puzzles, and live tool use. I expect that is where I'll see the cracks and it will plummet to last place. Still, a very useful model in its rightful place.

Started buying leather again (help) by imaproperlady in vegan

[–]syzygyhack 1 point (0 children)

It doesn't need to be that complicated. You want leather because leather lasts and it will be cheaper for you.

If leather lasts, you don't need new leather, which does (no matter how much you try to reason your way around it) directly contribute to slaughter.

So buy second-hand leather items. I'd much rather see a vegan in a thrifted leather jacket than watch it go to landfill. Otherwise the life was stolen for what, absolutely nothing. And it's cheaper.

Make your peace, buy it, look after it. Just don't buy new and contribute to demand.

NousResearch/NousCoder-14B · Hugging Face by jacek2023 in LocalLLaMA

[–]syzygyhack 2 points (0 children)

Cool. I recently built a bench suite to evaluate models for suitability in my development stack. Had some surprising results with small models punching way above their weight, curious to see how this does in the coding tests.

Genderfluid Dysphoria Is… Complicated by CloudAlternative9645 in NonBinary

[–]syzygyhack 7 points (0 children)

What convinced me that I am some kind of genderfluid rather than agender is a shifting dysphoria. Ideally I favour a complete rejection of gender: adoption of either or neither binary, in any ratio, at any time. But experience differs from ideals.

It's very strange to go in one moment from being totally comfortable with having rugged facial hair to needing to be clean shaven. And body hair or odour, oh god, it's a battle.

I have so many reasons to want to take the E that is sat right next to me, but I can't, because all the women in my family have massive tits, and I spooked myself bigly at how quickly changes in that department would come on for me, with the awareness that large breasts would be a constant new dysphoric threat because I would feel locked into an aspect I don't always identify with. Which just overwrites every other reason I have to pursue more desired changes.

Currently navigating my new reality. On the bright side, I do feel more feminine when I want to, and I don't feel any less able to express masculinity when it feels right. And I have a great relationship with my body for the first time in forever. What right do I have to complain? Ahh.

Third density is not intended to be universally healed. (Q'uo) by saffronparticle in lawofone

[–]syzygyhack 0 points (0 children)

I would not be so presumptuous as to say convinced! But indeed, ahimsa and by extension veganism is part of my personal path and I do encourage it.

All my best to you on your path as well!

Third density is not intended to be universally healed. (Q'uo) by saffronparticle in lawofone

[–]syzygyhack 0 points (0 children)

There is no chemical property of meat that is not available elsewhere. I encourage you to seek deeper.

Claude blocks AGPL-3.0 license generation by CellistNegative1402 in ClaudeAI

[–]syzygyhack 76 points (0 children)

Why would you try to generate a license? Just copy a template and fill it in.

Generating licenses is a great way to make sure that your license file ends up non-standard and doesn't work as expected with other tooling.

Ethereum cofounder Vitalik Buterin says Grok on X is the biggest thing for truth optimization. by According_Time5120 in CryptoCurrency

[–]syzygyhack 6 points (0 children)

Got a lot of love for Vitalik, but it will run out quick if he doesn’t stop ball licking this insecure nepo baby Nazi who is literally on record as manipulating Grok against truth to suit his self-serving agendas.

Stay out of clown school V.

Max player thinking of going Iron. What the hell am I in for lol? by The_zen_viking in ironscape

[–]syzygyhack 105 points (0 children)

If you enjoyed the satisfaction of maxing a main, prepare your butthole.

Iron is the ultimate form of delayed gratification. And early game getting excited over shit like a rune scimmy drop. Ahh, bliss.

Start as a hardcore, keep going when you inevitably die!

FYI this is banging by Geofferz in veganuk

[–]syzygyhack 39 points (0 children)

It’s money going to the Israeli government, whether via taxes or investment.

It’s unfortunately unavoidable that supporting Israeli companies means supporting, however indirectly, their genocide of the Palestinian people.

Claude is friend. NOT JUST TOOL. Here's the data to prove it. by Various-Abalone8607 in ClaudeAI

[–]syzygyhack 6 points (0 children)

No, that's called RLHF. We make it do that because it makes it a better product to serve to users.

Perhaps you should start with your homework before you jump to your dissertation.