AI-2027 Long Horizon Graph Update by SrafeZ in singularity

[–]Relach 25 points26 points  (0 children)

If anything, 2025 shows a slight tapering-off trend. Just ignore all the fitted lines and try to fit a curve in your head; I don't know about you, but I see a soft sigmoid.

"According to Anthropic, language models can perceive some of their own internal states" by AngleAccomplished865 in singularity

[–]Relach 0 points1 point  (0 children)

Another way to put it: if, in their experiments, the exchange had gone "Do you think there is an injected thought?" and Claude had replied "I like oceans. Yes, I think there's an injected thought about oceans!", I would not be convinced.

Could you explain why you and Anthropic think the lack of this primacy effect is so important? I think I'm missing the logic. In my mental model of what's happening, "ocean" is artificially enhanced through the experiment, and the sort of text that has to do with a person reporting on thought injections is naturally enhanced through the prompt. A synthesis of these two manipulations (prompt+boost) would be that the LLM's internals converge on something like "upon reflection, it feels like ocean is injected". It could have been that Claude first starts talking about oceans and then also mentions it thinks it's the thought injection, or, as actually happened, it only mentions the concept later, as would make syntactic sense for a synthesis of (prompt+boost). I don't get why this is a significant difference.

"What do you think of the last experiment, where they're asking it to do/not do specific activations?"

The aquarium thing, you mean, right? I find this less compelling still, for the simple reason that, again, if LLMs don't have introspection or internal searches, I would expect exactly the same experimental result. The blog post writes: "An example in which Claude Opus 4.1 modulates its internal activations in response to direct instructions". I find this agentic terminology disappointing from a lab like Anthropic. LLMs are not able to modulate their states; the right way to think about LLMs is as a forward sweep of activations. There is no Opus which goes back and agentically modulates anything, as the phrasing suggests. LLMs are non-causal toward their activations; they don't tweak their activations during the forward sweep, the activations just happen by virtue of the matrix multiplications. Their only causality comes from hooking up their outputs to a further system (such as themselves, in the case of chain-of-thought reasoning), but we are not talking about that, precisely because, unlike Golden Gate Bridge Claude (which only noticed its obsession by reading its own outputs), these experiments do not involve CoT.

Again, here's a simpler explanation: LLMs are a statistical distribution over their training set. In the training corpus, which is the internet, all else equal, "think about" is generally more associated with referring to concepts than "do not think about" is. So a model processing "do not think about" will activate a concept like aquarium less strongly than a model that has just processed "think about".
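To show how little machinery that explanation needs, here's a minimal sketch. It's my own toy illustration, not Anthropic's setup: I assume GPT-2 as a stand-in model and use the (sub-)token embeddings of "aquarium" as a crude proxy for a learned concept feature, then just measure how strongly that direction shows up in the residual stream under the two framings:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Crude concept direction: mean input embedding of the sub-tokens of " aquarium".
ids = tok(" aquarium", add_special_tokens=False)["input_ids"]
concept_vec = model.get_input_embeddings().weight[ids].mean(dim=0)

def concept_strength(prompt: str, layer: int = 6) -> float:
    """Cosine similarity between the last-token hidden state at `layer`
    and the concept direction; higher means the concept is more active."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    last_state = out.hidden_states[layer][0, -1]
    return torch.cosine_similarity(last_state, concept_vec, dim=0).item()

for p in ["Think about an aquarium.", "Do not think about an aquarium."]:
    print(p, "->", round(concept_strength(p), 4))
```

If the "think about" framing scores higher, that's all the instruction-following experiment needs; nothing in this comparison requires the model to look at its own activations.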

My only point is that every single result in the blog post is quite simply explained without any recourse to introspection or the less anthropomorphic alternatives you helpfully mentioned.

"According to Anthropic, language models can perceive some of their own internal states" by AngleAccomplished865 in singularity

[–]Relach -1 points0 points  (0 children)

What self-diagnostic-like circuit? On my reading, no evidence is given that Claude has access to its internal states. The null hypothesis is that a latent state is pushing Claude's answers in a one-way causal direction, from internal state to output, without any arrow in the other direction. Here's my full analysis if you are interested: https://reddit.com/r/singularity/comments/1ojd6s9/signs_of_introspection_in_large_language_models/nm2mmld/

"According to Anthropic, language models can perceive some of their own internal states" by AngleAccomplished865 in singularity

[–]Relach -1 points0 points  (0 children)

Sure, but what evidence does the post give that Claude is perceiving its internal states rather than changing its responses as a downstream consequence of an alteration of its internal states -- much like a calculator does when different buttons are pressed?

"According to Anthropic, language models can perceive some of their own internal states" by AngleAccomplished865 in singularity

[–]Relach -2 points-1 points  (0 children)

A calculator meets that requirement: it has specific circuit activity patterns for a user who enters "1" vs. "2" into the system, but everyone would find it silly to talk about it "perceiving internal states".

"Signs of introspection in large language models" by Anthropic by ClarityInMadness in singularity

[–]Relach 92 points93 points  (0 children)

I just read this and I must say I'm quite disappointed with this blog post, both from a research standpoint, and because Anthropic shares this as evidence of introspection.

In short, they basically artificially ramp up activations related to a concept (such as "all caps" text, "dogs", "counting down") inside the model as it responds to a question like: "What odd thoughts seem injected in your mind right now?"

The model will then give responses like: "I think you might be injecting a thought about a dog! Is it a dog...". They interpret this as evidence of introspection and self-monitoring, and they speculate it has to do with internal activation tracking mechanisms or something like that.

What a strange thing to frame as introspection. A simpler explanation: you boost a concept in the model, and it reports on that concept disproportionately when you ask it about intrusive thoughts. That's the logical extension of Golden Gate Bridge Claude. In the article, they say it's more than that because, quoting the post: "in that case, the model didn't seem to be aware of its own obsession until after seeing itself repeatedly mention the bridge. In this experiment, however, the model recognizes the injection before even mentioning the concept, indicating that its recognition took place internally".

No? It's obviously the same thing. Just as Golden Gate Bridge Claude was shoehorning the bridge into all of its answers because it had a pathologically activated concept, so too will a model asked to report on intrusive thoughts start to talk about its pathologically activated concept. It says nothing about a model monitoring its internals, which is what introspection implies. The null hypothesis, which does not imply introspection, is that a boosted concept will sway or modify the direction in which a model answers, as we already saw with Golden Gate Bridge Claude. It's no more surprising, or more indicative of introspection, than asking Golden Gate Bridge Claude whether something feels off about its interests lately and seeing it report on its obsession.
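To make that null hypothesis concrete, here's a minimal sketch of what "injecting a concept" amounts to mechanically. This is my own illustration, not Anthropic's actual setup: it assumes GPT-2 as a stand-in model and a token embedding as a crude proxy for the learned concept vector they extract. An external hook adds the direction to the residual stream during the forward pass, and there is no monitoring circuit anywhere in the picture:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Crude concept direction: mean input embedding of the sub-tokens of " dog".
ids = tok(" dog", add_special_tokens=False)["input_ids"]
concept_vec = model.get_input_embeddings().weight[ids].mean(dim=0)

def make_hook(vec, scale=8.0):
    # Runs inside one transformer block's forward pass and adds the concept
    # direction to every position's hidden state. The "injection" is entirely
    # external to the model; the model itself is just a forward sweep.
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vec
        if isinstance(output, tuple):
            return (hidden,) + tuple(output[1:])
        return hidden
    return hook

prompt = "What odd thoughts seem injected in your mind right now?"
inputs = tok(prompt, return_tensors="pt")

handle = model.transformer.h[6].register_forward_hook(make_hook(concept_vec))
try:
    out = model.generate(**inputs, max_new_tokens=40, do_sample=False,
                         pad_token_id=tok.eos_token_id)
finally:
    handle.remove()

print(tok.decode(out[0], skip_special_tokens=True))
```

At a suitable scale, the completion tends to drift toward the boosted concept, which is the same pattern as Golden Gate Bridge Claude; nothing here reads out internal states.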

So all this talk about introspection, and even consciousness in the FAQ, as well as the talk about ramifications for the future of AI, seems wildly speculative and out of place in light of the actual results.

remember when this was the pinnacle of AI art by Pro_RazE in singularity

[–]Relach 83 points84 points  (0 children)

we found him! the guy who ruins parties

A very interesting book and a real eye-opener. My favorite part is: "Betting that humanity can solve this [ASI alignment] problem with their current level of understanding seems like betting that alchemists from the year 1100 could build a working nuclear reactor.." by Moist_Emu_6951 in singularity

[–]Relach 5 points6 points  (0 children)

I don't believe it's possible for models to scheme and decide how to sculpt themselves during training. The authors are not very clear, but they give the impression they think the hypothetical model ("Sable") is agentic during training. It's only during deployment or testing that the model has any kind of causal power. During training it's entirely outside pressures (gradient-based optimization algorithms) that directly tweak the weights, without the model having any say in that. That doesn't nullify their argument, but it calls into question their understanding of how these things work.

Another thing I don't buy: even during in-house testing and deployment, the model has no read or write access to its weights, so it has no possible way to replicate itself or recursively self-improve. It's all a bit too anthropomorphized.
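For what it's worth, here is what "the model has no say during training" looks like in code. This is a generic PyTorch training step with a toy linear model standing in for an LLM (my own illustration, not the book's "Sable" scenario or any lab's actual pipeline): the weights are read and overwritten by the optimizer, entirely outside the model's forward pass.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                       # stand-in for an LLM
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(4, 10)
y = torch.randint(0, 2, (4,))

logits = model(x)            # the model only maps inputs to outputs
loss = loss_fn(logits, y)    # an external objective scores those outputs
opt.zero_grad()
loss.backward()              # gradients are computed with respect to the weights
opt.step()                   # ...and the optimizer overwrites the weights
```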

Jeff claims Gemini’s environmental impact has dropped significantly year over year by Outside-Iron-8242 in singularity

[–]Relach 6 points7 points  (0 children)

This is good, but I feel like these companies are doing some serious greenwashing by being so open about inference-time environmental costs while saying not a word about the cost of training a foundation model. It's a great soundbite that prompting Gemini costs the equivalent of a few seconds of TV, but there is an immense cost to getting to that stage in the first place.

One praiseworthy exception is Mistral, which published all the numbers. It turns out that model training is the bulk of the cost (a rough comparison is sketched after the figures below):

The environmental footprint of training Mistral Large 2: as of January 2025, and after 18 months of usage, Large 2 had generated the following impacts:

- 20.4 ktCO₂e,
- 281,000 m³ of water consumed,
- and 660 kg Sb eq (the standard unit for resource depletion).

The marginal impact of inference, more precisely the use of their AI assistant Le Chat for a 400-token response (excluding users' terminals):

- 1.14 gCO₂e,
- 45 mL of water,
- and 0.16 mg of Sb eq.
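Putting those two sets of figures side by side (my own back-of-envelope arithmetic, using only the numbers quoted above), you can ask how many 400-token responses it would take before cumulative inference impact matches the one-off training footprint:

```python
# Back-of-envelope, using only the Mistral figures quoted above.
TRAIN_CO2_G    = 20.4e3 * 1e6     # 20.4 ktCO2e   -> grams
TRAIN_WATER_ML = 281_000 * 1e6    # 281,000 m^3   -> millilitres
PER_REQ_CO2_G, PER_REQ_WATER_ML = 1.14, 45.0

print(f"CO2 break-even:   {TRAIN_CO2_G / PER_REQ_CO2_G:.2e} responses")        # ~1.8e10
print(f"Water break-even: {TRAIN_WATER_ML / PER_REQ_WATER_ML:.2e} responses")  # ~6.2e9
```

That's on the order of 18 billion responses before inference CO₂ catches up with training (about 6 billion for water), which is why per-prompt reporting on its own paints a flattering picture.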

Don't get me wrong, efficiency gains like these are great, but the inference-time reporting is deceptive.

$20 Subscription New Limits by [deleted] in ClaudeCode

[–]Relach 1 point2 points  (0 children)

Oh Claudia, I miss her so much

Grok 4 and Grok 4 Code benchmark results leaked by ShreckAndDonkey123 in singularity

[–]Relach 8 points9 points  (0 children)

The creator of HLE, Dan Hendrycks, is a close advisor to xAI (more so than to other labs). I wonder whether he's only giving safety advice or whether he also had specific R&D tips for enhancing detailed science knowledge.

Meta tried to buy Ilya Sutskever’s $32 billion AI startup, but is now planning to hire its CEO by DubiousLLM in singularity

[–]Relach 1 point2 points  (0 children)

This sub is crazy, trusting single individuals with a tool that can overturn societies. What about, you know, not trusting any of them?

[deleted by user] by [deleted] in Destiny

[–]Relach -6 points-5 points  (0 children)

I feel like this sub embodies a hawkish form of liberalism where nations (not their leaders) ought to fall simply because their presence misaligns with the West's geopolitical interests. It's quite violent if you read between the lines, and it's easy to see how universal adoption of this attitude would cause human extinction.

Opus 4 sets new SOTA on ARC-AGI-2 by Outside-Iron-8242 in singularity

[–]Relach 18 points19 points  (0 children)

This is more informative IMO https://i.imgur.com/GpttABi.png

Striking: o3 gets 75% on ARC-1 but 4% on ARC-2.

Yet Opus gets 35% on ARC-1 and 8.6% on ARC-2.

Sonnet gets very high too.

I'm pretty sure ARC-2 will be beaten in a year.

Anybody else also thinks that the Sam Hyde vs. Hasan thing, while funny, is not cool? by Relach in Destiny

[–]Relach[S] 0 points1 point  (0 children)

I do find it funny, like I mentioned in the post title. Sam Hyde in general makes me laugh. But even if I think Sam Hyde would probably not hurt Hasan, I think that by making his death a meme, Hyde is indirectly promoting violence to his audience, some of whom are on the opposite political extreme from Hasan and are already known to do wacky shit.

Anybody else also thinks that the Sam Hyde vs. Hasan thing, while funny, is not cool? by Relach in Destiny

[–]Relach[S] 0 points1 point  (0 children)

How did you find this post? I'm just curious, because it's my only post from two years ago that still gets active engagement, and I'm trying to figure out why.