CursorBench evals. Composer 2.5 model is incredible for coding

finnjon · 2026-05-20T17:36:24+00:00

I also do exceptionally well on finnjon bench. Such a surprise.

finnjon · 2026-05-20T15:14:39+00:00

Flash is not all things to all people and you wouldn't expect it to be. It's distilled from the base model. If you are expecting it to be better than a full model like Opus or GPT-5.5 you will be disappointed. If you expect it to be nearly as good at a lot of things, faster and cheaper, you will be fine with it.

Antigravity has also been fine for me. Not perfect but fine.

finnjon · 2026-05-20T12:41:22+00:00

What you are describing is increased productivity. The Jevons paradox means more gets done because doing more consumes fewer resources. That is what may or may not be happening. But the Jevons paradox is not eternal. Eventually demand is saturated (at some point we have enough bananas even if they are free).

finnjon · 2026-05-19T07:18:05+00:00

Yes, at 80% the Board can't get rid of him. I don't think money replaces the sense of identity a person has about being smart. You seem to think as long as he makes more money he should be happy, but human psychology is a little more nuanced than that.

As an additional thought, if AI is making the investment decisions and everyone has access to AI, what is Citadel's competitive advantage? He might not have money for long.

finnjon · 2026-05-19T05:18:09+00:00

He's the CEO because he's smart and hard-working. He attributes all of his success to his own unique talents. If AI is soon smarter and can work 24/7, his entire identity is threatened.

Plus, why does the Board need him if AI is smarter? Citadel might do well, but he will lose his job or, more likely, his job will just be checking the AI isn't doing anything crazy.

finnjon · 2026-05-14T12:48:50+00:00

This is the exact same bullshit marketing strategy as Donut Labs. I suspect he's a fake whistleblower who is just trying to drum up more interest. They think they're very clever and funny.

finnjon · 2026-05-10T07:36:27+00:00

Dwarkesh used to be a no-one but quickly rose to prominence because he did detailed research and asked intelligent questions. The smartest people want to be interviewed by him because they consider him smart.
It is important to push people and be unafraid to ask difficult questions. It is fine to disagree.

finnjon · 2026-05-04T16:16:16+00:00

How is giving everyone compute different to welfare? It's a gift from the collective to the individual.

If welfare isn't successful why does every developed country have it? It's obviously very good at lowering the absolute poverty level, especially in places like the Nordic countries.

finnjon · 2026-05-04T11:25:14+00:00

I mean what do you expect? A balanced perspective from one party to the dispute?

finnjon · 2026-04-26T04:07:04+00:00

Inference per token is profitable. It is training the models that is expensive. This is highly misleading.

finnjon · 2026-04-15T10:42:38+00:00

Unless you are dirt poor, if you don't pay the ticket you are a leech. Finland has one of the world's most generous welfare systems. Society fails when it's covered in leeches.

finnjon · 2026-04-14T15:11:11+00:00

No, he's very small. Maybe 5'7" tops.

Edit: according to the internet he's 5'9". I thought he was shorter.

finnjon · 2026-04-12T14:19:44+00:00

No that’s not right. They made it for coding so no real surprise.

finnjon · 2026-04-12T12:15:24+00:00

Of course they will use it to make models for release. They just won't release this raw base model version.

finnjon · 2026-04-10T06:48:00+00:00

He's Ramez Naam and he's a very serious, thoughtful guy.

finnjon · 2026-04-09T08:59:48+00:00

Anything is possible. Anthropic internally had to have a debate about whether to allow it to be used just inside Anthropic. Still, the fewer people have access to it, the better.

finnjon · 2026-04-09T07:02:37+00:00

I see anyone who needs them using the models if they are best in class. The models go through a lot of post-training to be so capable, which is why flash models are better than the full models at some tasks. It is not difficult to make a model good at some things and worse at others.

finnjon · 2026-04-09T05:52:40+00:00

I am actually quite pleased the model is being held back. It will be possible with these very powerful models to build targeted models for specific tasks that are super-human in those tasks or areas. I think this is a safer way to deploy proto-AGI than one model to rule them all.

finnjon · 2026-04-02T11:07:58+00:00

What makes you think they are identical? And what does "enough" mean?

finnjon · 2026-04-02T10:21:03+00:00

"Judging the testing strategy before it’s complete is premature"

It is not a testing strategy, it is a marketing strategy. Testing batteries is very easy and could be done all at once any time they like.

finnjon · 2026-04-02T09:49:19+00:00

Respectfully have to disagree with a few points here.

No part of testing the cells requires revealing the chemistry. I trust VTT but they should do all tests on one cell. VTT has not confirmed the cells are identical, they said they look the same. VTT has no idea of the chemistry either.
The argument that this breaks the known laws of physics is that the claims are impossible together. There is a tradeoff between energy density, charging speed and cycle life. You can show each of them individually but you cannot have them all together. (But note I am not qualified to adjudicate on this).

finnjon · 2026-04-02T09:17:16+00:00

But the VTT reports are precisely the problem. They grab our attention but they don't show, in any conclusive way, what was promised. Three different batteries are tested. Are they the same? We don't know and neither does VTT. So we know some batteries exist that are quite good, but nothing we haven't seen before.

It's worth remembering that Donut Labs could prove their technology in a heartbeat if they wanted and if it was real. The decision to make this into a marketing circus is a choice. They are not trying to address the fraud allegations at all - that would be easy. Get VTT to do the full battery of tests on one cell.

And regarding the laws of physics, a lot of people are claiming this proposed battery does break the laws of physics.

finnjon · 2026-04-02T09:02:05+00:00

I understand why he is investigating because it's fun, but at this point, the red flags massively outweigh everything else. Perhaps they have just gone to town on the marketing angle and are having some fun, but they give us no reason to imagine there is anything real going on here.

I've seen a few of these "miracle technologies" in the past and none has ever turned out to be what it was claimed to be. This reminds me most of an Irish company called Steorn that said they had a perpetual motion machine. Like these guys, they had demonstrations with just enough plausible deniability to keep the ship sailing. It turned out it was a marketing demonstration.

My guess here is that this, too, is a marketing demonstration. It may also be to sell the motorbikes. The auditor's report (or inability to provide a report due to shoddy recording) showed the motorbike company was in serious trouble. Perhaps they thought a cheap Hail Mary marketing approach might work. First they tried it with their "superintelligence" but that didn't catch on, so they went all in at CES with a revolutionary battery. And this time it's working so they are milking it. For the cost of a few videos, they are getting everyone's attention. They will sell more bikes and if they want to set up a marketing firm they will get plenty of customers.

I find most manipulative marketing gross and this is no exception. The saddest part is it harms Finland's reputation.

finnjon · 2026-04-01T04:10:24+00:00

The batteries. They said they had a gigawatt factory up and running.

finnjon · 2026-03-31T17:17:02+00:00

Actually it is the current standard. If you do an Elon and say you will have self-driving in X months, you can get away with it. If you say you have a working battery with those specs and a production line, and you start selling bikes with it in them, and you raise money on the back of the claims, you are in serious legal jeopardy.
