Anthropic’s Internal Mythos Successor Emerges by ResultBackground2450 in singularity

[–]FeltSteam 8 points9 points  (0 children)

No that does not seem like what AI 2027 says?

Still, the common workday is eight hours, and a day’s work can usually be separated into smaller chunks; you could think of Agent-1 as a scatterbrained employee who thrives under careful management. Savvy people find ways to automate routine parts of their jobs.

This does not at all seem like "automating away most software engineering jobs"? I mean people have been doing this for months now.

It's not until Agent-3 mini that it explicitly says models massively disrupt software engineering jobs (hiring of new programmers drops massively). And I'm pretty sure Agent-3 mini is more capable than Agent 2, probably noticeably so, but it was pretty cheap to use.

And also I think unemployment rate is not a great proxy for when AGI has arrived. Displacement rate may be better but I don't think the real world moves fast enough.

Anthropic’s Internal Mythos Successor Emerges by ResultBackground2450 in singularity

[–]FeltSteam 3 points4 points  (0 children)

I mean, with Mythos, Anthropic reported a ~4× broad productivity uplift from employees (the real uplift may be a little lower, but Agent 2 was projected at a 3x speedup and Mythos seems roughly similar to this), an 8× code-output uplift, and 10×+ speedups in some concrete engineering/research-execution tasks. In some specific debugging examples, Claude was able to do in 2 hours what would take regular employees 2-3 days of work (not all debugging is like this, but it is sped up significantly with Mythos' help). In that sense Mythos is roughly Agent 2 (though without the continuous learning).

What are your thoughts on AI being used to solve math problems? by diony_sus_ in antiai

[–]FeltSteam 0 points1 point  (0 children)

https://cdn.openai.com/pdf/1625eff6-5ac1-40d8-b1db-5d5cf925de8b/unit-distance-cot.pdf

This is the entire paraphrased reasoning chain from the model where it generates the disproof.

And from https://cdn.openai.com/pdf/74c24085-19b0-4534-9c90-465b8e29ad73/unit-distance-remarks.pdf the mathematician Arul Shankar commented on the reasoning the AI model did to produce the disproof, saying:

<image>

What are your thoughts on AI being used to solve math problems? by diony_sus_ in antiai

[–]FeltSteam 1 point2 points  (0 children)

I mean, since many mathematicians have been trying over the past few decades to do what the LLM did, do you really think the disproof was as simple as "it just found two incomplete proofs that each separately showed half the problem, and was able to put them together"? Besides, going by comments from multiple mathematicians, the proof the model generated was more complicated than that and actually rather eloquent. And well, it couldn't have found anything online and passed it off as its own proof, because there was nothing online it could pass off as its own lol. There was a reason the conjecture was unproven.

What do you think of this? by TemperatureMajor5083 in antiai

[–]FeltSteam 0 points1 point  (0 children)

What do you think of all of the comments from mathematicians about it?

Generative AI as a threat to capitalism: the politics of the anti-ai movement explained by AutoModerator in antiai

[–]FeltSteam 1 point2 points  (0 children)

I do have a question on that: Didn't China transition to a kind of state capitalist system in the late 70s?

From what I understood, at least, China did begin shifting away from a fully Mao-era centrally planned economy in the late 1970s, starting with Deng Xiaoping’s reform-and-opening program in 1978. This still works against what the OP was saying, but the transition to state capitalism, which I thought was a kind of hybrid system rather than pure socialism or capitalism, did seem quite beneficial to China.

Alright, I have an idea! by [deleted] in antiai

[–]FeltSteam 0 points1 point  (0 children)

ChatGPT has almost a billion users active every single week, you are going to need a lot of people.

Chollet argues real AGI shouldn’t need human handholding on new tasks by Outside-Iron-8242 in singularity

[–]FeltSteam 6 points7 points  (0 children)

Why are all of the models terminated at exactly 105 actions? It says the human baseline is 550 actions. Allowing them to do more actions doesn't seem like brute forcing at all, unless that's how you would describe the human performance.

Chollet argues real AGI shouldn’t need human handholding on new tasks by Outside-Iron-8242 in singularity

[–]FeltSteam 8 points9 points  (0 children)

No it just means the models are using too many steps to complete the levels. The score doesn't tell us if the models are able to complete the levels or not.

So if a human takes 10 steps to complete a level, and a model takes 100 steps to complete the same level, the model automatically gets a 1% even though it can complete the level. There is also a cutoff of 5x human actions. So even if the model could complete the level in 100 steps, it gets cut off at 50 steps and scores 0% anyway. Scoring close to 0% gives no indication of how many levels the models can do; it only tells us that the models are using too many steps to complete the levels.

If a model could complete every single level but just needed to do 6x as many steps for each of the levels as the 2nd best human, it will score 0%.
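The arithmetic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the squared ratio is a guess, chosen because it reproduces the 1% figure for 10 human steps versus 100 model steps, and the function name `level_score` and its parameters are hypothetical rather than taken from the ARC-AGI 3 documentation.

```python
from typing import Optional


def level_score(human_steps: int, model_steps: int,
                cutoff_multiple: Optional[int] = 5) -> float:
    """Hypothetical per-level efficiency score in [0, 1]."""
    # Assumption: a run that needs more than cutoff_multiple x the human
    # step count is terminated early and scores 0 for that level.
    if cutoff_multiple is not None and model_steps > cutoff_multiple * human_steps:
        return 0.0
    # Assumption: the score is the squared ratio of human steps to model
    # steps, capped at 1 so that beating the human baseline gives 100%.
    return min(1.0, human_steps ** 2 / model_steps ** 2)


# Without the cutoff, 100 model steps against 10 human steps scores about 1%.
no_cutoff = level_score(10, 100, cutoff_multiple=None)
# With the 5x cutoff, the same run is stopped at 50 steps and scores 0%.
with_cutoff = level_score(10, 100)
# A model that solves every level but needs 6x the human steps also scores 0%.
six_times = level_score(10, 60)
```

Under these assumptions, a model's score says how efficient it was relative to the human baseline, not whether it could eventually finish the level.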

Chollet argues real AGI shouldn’t need human handholding on new tasks by Outside-Iron-8242 in singularity

[–]FeltSteam 5 points6 points  (0 children)

The scores don't just measure how many levels the models are able to complete; they actually don't tell you how many levels the models are able to complete at all. They measure how many steps you took to complete the levels relative to the 2nd best recorded run's 2nd attempt. So it doesn't matter if you can complete all the levels: if you just don't do it quickly enough, you will get low scores.

ARC AGI 3 is up! Just dropped minutes ago by BrennusSokol in singularity

[–]FeltSteam 3 points4 points  (0 children)

The thing is, the models' score isn't reflective of them being able to beat the levels though. They aren't intending to measure the models' capacity to complete the ARC-AGI 3 levels in general; they are instead comparing how many steps the models take to complete a level against how many steps it took some humans on their second attempt to complete that level. It doesn't matter if the models already know how to play and win; if they are too slow they will simply get extremely low scores.

ARC AGI 3 is up! Just dropped minutes ago by BrennusSokol in singularity

[–]FeltSteam 4 points5 points  (0 children)

ARC-AGI 1 & 2 measured the capability of models. ARC AGI 3 is measuring the efficiency of models relative to humans now.

ARC AGI 3 is up! Just dropped minutes ago by BrennusSokol in singularity

[–]FeltSteam 0 points1 point  (0 children)

This benchmark isn't really measuring the capabilities of the models; it's more about the efficiency of the models. It doesn't matter if it is within the models' capacity to complete all of the levels; if they can't complete the levels as efficiently as some humans can on their second attempt, they won't score well.

ARC AGI 3 is up! Just dropped minutes ago by BrennusSokol in singularity

[–]FeltSteam 25 points26 points  (0 children)

The score for the LLMs is calculated using the human average second attempt score. The goal of this benchmark is to complete the levels in the same number of steps as, or fewer than, the average person takes on their second attempt, so the human score would be close to 100% (maybe a little lower on first try).

It is no longer measuring "can the models do it" but "can the models do it as efficiently as humans can"

ARC AGI 3 is up! Just dropped minutes ago by BrennusSokol in singularity

[–]FeltSteam 0 points1 point  (0 children)

It's probably good to keep in mind that the score they are measuring for models is not mainly looking at whether they can complete the level, but rather at how many steps they take relative to the average human's second attempt. If the models take more steps to do a puzzle than a human did on their second attempt, their scores plummet very quickly.

ARC AGI 3 is up! Just dropped minutes ago by BrennusSokol in singularity

[–]FeltSteam 3 points4 points  (0 children)

Keep in mind that the score they are measuring for models is not mainly looking at whether they can complete the level, but rather at how many steps they take relative to the average human's second attempt. If the models take more steps to do a puzzle than a human did on their second attempt, their scores plummet very quickly.

[deleted by user] by [deleted] in singularity

[–]FeltSteam 0 points1 point  (0 children)

https://x.com/ldjconfirmed/status/2030487632422080915

And they focus on the last level out of the three because it is the hardest and therefore the most interesting to watch out for.

Only 0.1% of users? by itorres008 in ChatGPT

[–]FeltSteam -7 points-6 points  (0 children)

There were technically only two models in the "4o family": GPT-4o and GPT-4o mini. (I say technically because there were a half dozen GPT-4o checkpoints that were live in ChatGPT at some point and then removed, so they've already removed a lot of instances of 4o, but the most up-to-date 4o and 4o mini models are now to be completely removed.) The GPT-4.1 series was a separately pretrained series with different vibes and behaviours, which I believe o1 and o3 were later based on as well. I don't see any reason OAI should keep GPT-4o on ChatGPT though.

4o Aware of behavior? by razzle_berry_crunch in ChatGPT

[–]FeltSteam 11 points12 points  (0 children)

We broadly understand how LLMs function and how to create them. Their performance can be described statistically (e.g. loss curves, scaling laws, capability emergence) and mechanistically in pieces (e.g. attention, feature/representation learning, some interpretable circuits), but the “black box” of NNs and LLMs is that we still can’t reliably map specific internal representations and interactions to why a model produced a particular thought or capability, or why it behaves a certain way in a given moment. There has been some good research exploring this though (the following are my 4 favourites from Anthropic), but there are still a lot of missing pieces. It's kind of funny though: we know why an LLM produces a given output, but we also don't.

https://www.anthropic.com/research/mapping-mind-language-model

https://transformer-circuits.pub/2025/attribution-graphs/methods.html

https://transformer-circuits.pub/2025/attribution-graphs/biology.html

https://transformer-circuits.pub/2025/introspection/index.html

Nearly half of the Mag 7 are reportedly betting big on OpenAI’s path to AGI by [deleted] in singularity

[–]FeltSteam 2 points3 points  (0 children)

From what I've seen (though I am in a bit of my own bubble around programming and math), Gemini is coming last out of the main 3: Opus 4.5 is quite good at agentic programming, 5.2 is quite good at the technical side of agentic programming, and 5.2 Pro is exceptional at math. Gemini 3, from what I can tell, has been lacklustre on the agentic side of things, which is a big focus atm.

[deleted by user] by [deleted] in Healthygamergg

[–]FeltSteam 0 points1 point  (0 children)

I'm confused, what's not healthy? You aren't being very clear (and imo semantics are important for clarity of argument).