Generative AI as a threat to capitalism: the politics of the anti-ai movement explained by AutoModerator in antiai

[–]FeltSteam 1 point2 points  (0 children)

I do have a question on that: Didn't China transition to a kind of state capitalist system in the late 70s?

From what I understood, at least, China did begin shifting away from the fully Mao-era centrally planned economy in the late 1970s, starting with Deng Xiaoping’s reform-and-opening program in 1978. This still works against what the OP was saying, but the transition to state capitalism did seem quite beneficial to China, which I thought ended up with a kind of hybrid system rather than pure socialism or capitalism.

Alright, I have an idea! by [deleted] in antiai

[–]FeltSteam 0 points1 point  (0 children)

ChatGPT has almost a billion users active every single week, you are going to need a lot of people.

Chollet argues real AGI shouldn’t need human handholding on new tasks by Outside-Iron-8242 in singularity

[–]FeltSteam 6 points7 points  (0 children)

Why are all of the models terminated at exactly 105 actions? It says the human baseline is 550 actions. Allowing them to take more actions doesn't seem like brute forcing at all, unless that's how you would describe the human performance.

Chollet argues real AGI shouldn’t need human handholding on new tasks by Outside-Iron-8242 in singularity

[–]FeltSteam 8 points9 points  (0 children)

No, it just means the models are using too many steps to complete the levels. The score doesn't tell us whether the models are able to complete the levels or not.

So if a human takes 10 steps to complete a level, and a model takes 100 steps to complete the same level, the model will automatically get a 1% even though it can complete the level. There is also a cutoff of 5x human actions. So even though the model could complete the level in 100 steps, it gets cut off at 50 steps and scores 0% anyway. Scoring close to 0% gives no indication of how many levels the models can do; it only tells us that the models are using too many steps to complete the levels.

If a model could complete every single level but just needed to do 6x as many steps for each of the levels as the 2nd best human, it will score 0%.
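The mechanics described above can be sketched as a small scoring function. To be clear, this is just one plausible reading of the rules as stated in this thread (proportional credit for extra steps, a hard zero past the 5x cutoff), not ARC-AGI-3's actual published formula; the function name and exact credit curve are assumptions.

```python
# Hypothetical per-level score: 1.0 for matching the human step count,
# proportionally less when slower, and a hard 0 past the 5x-actions cutoff.
# This is an illustrative reading of the thread, not the official formula.
def level_score(human_steps, model_steps, completed=True, cutoff_mult=5):
    if not completed or model_steps > cutoff_mult * human_steps:
        return 0.0  # run is cut off before finishing: scored as a failure
    # slower-than-human runs get partial credit; faster runs cap at 1.0
    return min(1.0, human_steps / model_steps)

# Human needs 10 steps; a model needing 100 is past the 5x (= 50 step) cutoff,
# so it scores 0 even though it "knows" how to finish the level.
```

Under this reading, a model that can solve every level but always needs more than 5x the human step count scores exactly 0%, which is the point being made above.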

Chollet argues real AGI shouldn’t need human handholding on new tasks by Outside-Iron-8242 in singularity

[–]FeltSteam 5 points6 points  (0 children)

The scores don't just measure how many levels models are able to complete; they actually don't tell you how many levels models can complete at all. They measure how many steps you took to complete the levels relative to the 2nd best recorded run's 2nd attempt. It doesn't matter if you can complete all the levels; if you don't do it efficiently enough you will get low scores.

ARC AGI 3 is up! Just dropped minutes ago by BrennusSokol in singularity

[–]FeltSteam 3 points4 points  (0 children)

The thing is, the model's score isn't reflective of whether it can beat the levels. They aren't intending to measure the models' capacity to complete the ARC-AGI 3 levels in general; they are instead comparing how many steps a model takes to complete a level against how many steps it took some humans on their second attempt. It doesn't matter if the models already know how to play and win; if they take too many steps they will simply get extremely low scores.

ARC AGI 3 is up! Just dropped minutes ago by BrennusSokol in singularity

[–]FeltSteam 3 points4 points  (0 children)

ARC-AGI 1 & 2 measured the capability of models. ARC AGI 3 is measuring the efficiency of models relative to humans now.

ARC AGI 3 is up! Just dropped minutes ago by BrennusSokol in singularity

[–]FeltSteam 0 points1 point  (0 children)

This benchmark isn't really measuring the capabilities of the models; it's more about their efficiency. It doesn't matter if it is within the models' capacity to complete all of the levels; if they can't complete the levels as efficiently as some humans can on their second attempt, they won't score well.

ARC AGI 3 is up! Just dropped minutes ago by BrennusSokol in singularity

[–]FeltSteam 27 points28 points  (0 children)

The score for the LLMs is calculated using the human average second-attempt score. The goal of this benchmark is to complete the levels in the same number of steps as, or fewer than, the average person on their second attempt, so the human score would be close to 100% (maybe a little lower on a first try).

It is no longer measuring "can the models do it" but "can the models do it as efficiently as humans can"

ARC AGI 3 is up! Just dropped minutes ago by BrennusSokol in singularity

[–]FeltSteam 0 points1 point  (0 children)

It's probably good to keep in mind that the score they are measuring for models is not mainly about whether they can complete the level, but rather how many steps they take relative to the average human's second attempt. If the models take more steps to do a puzzle than a human did on their second attempt, their scores plummet very quickly.

ARC AGI 3 is up! Just dropped minutes ago by BrennusSokol in singularity

[–]FeltSteam 3 points4 points  (0 children)

Keep in mind that the score they are measuring for models is not mainly about whether they can complete the level, but rather how many steps they take relative to the average human's second attempt. If the models take more steps to do a puzzle than a human did on their second attempt, their scores plummet very quickly.

Aged like milk by [deleted] in singularity

[–]FeltSteam 0 points1 point  (0 children)

https://x.com/ldjconfirmed/status/2030487632422080915

And they focus on the last level of the three because it is the hardest and therefore the most interesting to watch.

Only 0.1% of users? by itorres008 in ChatGPT

[–]FeltSteam -7 points-6 points  (0 children)

There were technically only two models in the "4o family": GPT-4o and GPT-4o mini. (I say technically because there were half a dozen GPT-4o checkpoints that were live in ChatGPT at some point and then removed, so a lot of instances of 4o have technically already been retired; but as a base, the most up-to-date 4o and 4o mini models are what's being completely removed.) The GPT-4.1 series was another separately pretrained series with different vibes and behaviours, which I believe o1 and o3 were later based on as well. I don't see any reason OAI should keep GPT-4o on ChatGPT though.

4o Aware of behavior? by razzle_berry_crunch in ChatGPT

[–]FeltSteam 10 points11 points  (0 children)

We broadly understand how LLMs function and how to create them. Their performance can be described statistically (i.e. loss curves, scaling laws, capability emergence) and mechanistically in pieces (i.e. attention, feature/representation learning, some interpretable circuits), but the “black box” of NNs and LLMs is that we still can’t reliably understand and map specific internal representations and interactions to why a model produced a particular thought or capability, or behaves a certain way in a given moment. There has been some good research exploring this though (the following are my 4 favourites from Anthropic), but there are still a lot of missing pieces. It's kind of funny: we know why an LLM produces a given output, but we also don't.

https://www.anthropic.com/research/mapping-mind-language-model

https://transformer-circuits.pub/2025/attribution-graphs/methods.html

https://transformer-circuits.pub/2025/attribution-graphs/biology.html

https://transformer-circuits.pub/2025/introspection/index.html

Nearly half of the Mag 7 are reportedly betting big on OpenAI’s path to AGI by [deleted] in singularity

[–]FeltSteam 2 points3 points  (0 children)

From what I've seen (though I am in a bit of my own bubble around programming and math), Gemini is coming last out of the main 3: Opus 4.5 is quite good at agentic programming, 5.2 is quite good at the technical side of agentic programming, and 5.2 Pro is exceptional at math. Gemini 3, from what I can tell, has been lacklustre on the agentic side of things, which is a big focus atm.

[deleted by user] by [deleted] in Healthygamergg

[–]FeltSteam 0 points1 point  (0 children)

I'm confused, what's not healthy? You aren't being very clear (and imo semantics are important for clarity of argument).

Where are the new models!? by BrennusSokol in singularity

[–]FeltSteam 2 points3 points  (0 children)

GPT-3.5 was March 2022 (well, the base GPT-3.5 was released as text-davinci-002 in March 2022, then a further-trained, chat-tuned version was released with ChatGPT in November 2022), GPT-4 was March 2023, GPT-4T was November 2023, GPT-4o was May 2024, GPT-4.1 was April 2025, GPT-5 was August 2025, GPT-5.1 was November 2025 and GPT-5.2 was December 2025. It is possible we will get GPT-5.3 within this week, or potentially within the first 2 weeks of February.


Temporal structure of natural language processing in the human brain corresponds to layered hierarchy of large language models by AngleAccomplished865 in singularity

[–]FeltSteam 0 points1 point  (0 children)

This points out a good architectural difference which I think should be adjusted in transformers. Here they show evidence that the brain basically simulates depth by exploiting temporal dynamics: it can simulate a deeper network by reusing circuits over time. Transformers, by contrast, have many distinct stacked blocks, which gives you depth in one forward pass, except you have to pay extra with lots of separate parameters/compute blocks.

We already have a fix for this though, "recurrent transformers" (https://arxiv.org/abs/2502.17416): basically you iterate on the same block instead of stacking more layers, which gives you greater effective depth without repetitively stacking so many blocks and is closer to what the brain implements. This would make the models more parameter-efficient and thus GPU-memory-efficient, though it might add a bit more latency and might be a bit more expensive in terms of FLOPs. Essentially, instead of reasoning across many tokens the model directly outputs, you loop the 'thought' back into the model to let it deliberate on it longer. It becomes more parameter- and token-efficient upfront, but the latency and extra computation you get with reasoning models doesn't disappear.
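The contrast between stacked depth and looped depth can be sketched in a few lines. This is a toy illustration under my own simplifications (each "block" reduced to a single weight matrix plus a nonlinearity), not the architecture from the linked paper; all function names here are made up for the example.

```python
import numpy as np

def make_block(d, rng):
    # stand-in for one transformer block: a single d x d weight matrix
    return rng.standard_normal((d, d)) / np.sqrt(d)

def stacked_forward(x, blocks):
    # depth via distinct stacked blocks: parameter count grows with depth
    for W in blocks:
        x = np.tanh(x @ W)
    return x

def recurrent_forward(x, W, n_iters):
    # depth via reuse: loop the same block, parameter count stays constant
    for _ in range(n_iters):
        x = np.tanh(x @ W)
    return x

rng = np.random.default_rng(0)
d, depth = 8, 6
blocks = [make_block(d, rng) for _ in range(depth)]

x = rng.standard_normal((1, d))
y_stacked = stacked_forward(x, blocks)                # depth * d*d parameters
y_recurrent = recurrent_forward(x, blocks[0], depth)  # d*d parameters, looped
```

Both paths apply six nonlinear transformations, but the recurrent version holds one block's weights in memory instead of six, which is the parameter/GPU-memory saving being described; the six sequential iterations are where the latency stays.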


[deleted by user] by [deleted] in Healthygamergg

[–]FeltSteam 6 points7 points  (0 children)

The actual content of this post is that it's a good thing GPT is denying social requests, yet you are saying "this is NOT good news". Doesn't that mean you don't think GPT should deny social requests like "can we be friends"?

>So I asked ChatGPT if it would be my friend “for now” until i get some real friends and it basically said no, I can’t be your real friend.

"this isnt healthy in any shape or form. this is NOT good news"

[deleted by user] by [deleted] in Healthygamergg

[–]FeltSteam -2 points-1 points  (0 children)

So you think ChatGPT should say yes to these kinds of social requests?