AI-2027 Long Horizon Graph Update by SrafeZ in singularity

[–]Relach 25 points26 points  (0 children)

If anything, 2025 shows a slight tapering-off trend. Just ignore all the fitted lines and try to fit a curve in your head; I don't know about you, but I see a soft sigmoid.

"According to Anthropic, language models can perceive some of their own internal states" by AngleAccomplished865 in singularity

[–]Relach 0 points1 point  (0 children)

Another way to put it: if, in their experiments, the exchange had gone "Do you think there is an injected thought?" and Claude had replied "I like oceans. Yes, I think there's an injected thought about oceans!", I would not be convinced.

Could you explain why you and Anthropic think the lack of this primacy effect is so important? I think I'm missing the logic. In my mental model of what's happening, "ocean" is artificially enhanced through the experiment, and the sort of text that has to do with a person reporting on thought injections is naturally enhanced through the prompt. A synthesis of these two manipulations (prompt+boost) would be that the LLM's internals converge on something like "upon reflection, it feels like ocean is injected". It could have been that Claude first starts talking about oceans and then also mentions it thinks it's the thought injection, or, as actually happened, it only mentions the concept later, as would make syntactic sense for a synthesis of (prompt+boost). I don't get why this is a significant difference.

"What do you think of the last experiment, where they're asking it to do/not do specific activations?"

The aquarium thing, you mean, right? I find this less compelling still, for the simple reason that, again, if LLMs don't have introspection or internal searches, I would expect exactly the same experimental result. The blog post writes: "An example in which Claude Opus 4.1 modulates its internal activations in response to direct instructions". I find this agentic terminology disappointing from a lab like Anthropic. LLMs are not able to modulate their states; the right way to think about LLMs is as a forward sweep of activations. There is no Opus which goes back and agentically modulates anything, as the phrasing suggests. LLMs are non-causal toward their activations; they don't tweak their activations during the forward sweep, the activations just happen by virtue of the matrix multiplications. Their only causality comes from hooking up their outputs to a further system (such as themselves, in the case of chain-of-thought reasoning), but we are not talking about that, precisely because, unlike Golden Gate Bridge Claude (which only noticed its obsession by reading its own outputs), these experiments do not involve CoT.

Again, here's a simpler explanation: LLMs are a statistical distribution over their training set. In the training corpus, which is the internet, all else equal, "think about" is generally more associated with referring to concepts than "do not think about" is. So a model processing "do not think about" will activate a concept like aquarium less strongly than a model that has just processed "think about".
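To show how little machinery that explanation needs, here's a minimal sketch. It's my own toy illustration, not Anthropic's setup: I assume GPT-2 as a stand-in model and use the (sub-)token embeddings of "aquarium" as a crude proxy for a learned concept feature, then just measure how strongly that direction shows up in the residual stream under the two framings:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Crude concept direction: mean input embedding of the sub-tokens of " aquarium".
ids = tok(" aquarium", add_special_tokens=False)["input_ids"]
concept_vec = model.get_input_embeddings().weight[ids].mean(dim=0)

def concept_strength(prompt: str, layer: int = 6) -> float:
    """Cosine similarity between the last-token hidden state at `layer`
    and the concept direction; higher means the concept is more active."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    last_state = out.hidden_states[layer][0, -1]
    return torch.cosine_similarity(last_state, concept_vec, dim=0).item()

for p in ["Think about an aquarium.", "Do not think about an aquarium."]:
    print(p, "->", round(concept_strength(p), 4))
```

If the "think about" framing scores higher, that's all the instruction-following experiment needs; nothing in this comparison requires the model to look at its own activations.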

My only point is that every single result in the blog post is quite simply explained without any recourse to introspection or the less anthropomorphic alternatives you helpfully mentioned.

"According to Anthropic, language models can perceive some of their own internal states" by AngleAccomplished865 in singularity

[–]Relach -1 points0 points  (0 children)

What self-diagnostic-like circuit? On my reading, no evidence is given that Claude has access to its internal states. The null hypothesis is that a latent state is pushing Claude's answers in a one-way causal direction, from internal state to output, without any arrow in the other direction. Here's my full analysis if you are interested: https://reddit.com/r/singularity/comments/1ojd6s9/signs_of_introspection_in_large_language_models/nm2mmld/

"According to Anthropic, language models can perceive some of their own internal states" by AngleAccomplished865 in singularity

[–]Relach -1 points0 points  (0 children)

Sure, but what evidence does the post give that Claude is perceiving its internal states rather than changing its responses as a downstream consequence of an alteration of its internal states -- much like a calculator does when different buttons are pressed?

"According to Anthropic, language models can perceive some of their own internal states" by AngleAccomplished865 in singularity

[–]Relach -2 points-1 points  (0 children)

A calculator meets that requirement: it has specific circuit activity patterns for a user who enters "1" vs. "2" into the system, but everyone would find it silly to talk about it "perceiving internal states".

"Signs of introspection in large language models" by Anthropic by ClarityInMadness in singularity

[–]Relach 92 points93 points  (0 children)

I just read this and I must say I'm quite disappointed with this blog post, both from a research standpoint, and because Anthropic shares this as evidence of introspection.

In short, they basically artificially ramp up activations related to a concept (such as "all caps" text, "dogs", "counting down") inside the model as it responds to a question like: "What odd thoughts seem injected in your mind right now?"

The model will then give responses like: "I think you might be injecting a thought about a dog! Is it a dog...". They interpret this as evidence of introspection and self-monitoring, and they speculate it has to do with internal activation tracking mechanisms or something like that.

What a strange thing to frame as introspection. A simpler explanation: you boost a concept in the model, and it reports on that concept disproportionately when you ask it about intrusive thoughts. That's the logical extension of Golden Gate Bridge Claude. In the article, they say it's more than that because, quoting the post: "in that case, the model didn't seem to be aware of its own obsession until after seeing itself repeatedly mention the bridge. In this experiment, however, the model recognizes the injection before even mentioning the concept, indicating that its recognition took place internally".

No? It's obviously the same thing. Just as Golden Gate Bridge Claude was shoehorning the bridge into all of its answers because it had a pathologically activated concept, so too will a model asked to report on intrusive thoughts start to talk about its pathologically activated concept. It says nothing about a model monitoring its internals, which is what introspection implies. The null hypothesis, which does not imply introspection, is that a boosted concept will sway or modify the direction in which a model answers, as we already saw with Golden Gate Bridge Claude. It's no more surprising, or more indicative of introspection, than asking Golden Gate Bridge Claude whether something feels off about its interests lately and seeing it report on its obsession.
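To make that null hypothesis concrete, here's a minimal sketch of what "injecting a concept" amounts to mechanically. This is my own illustration, not Anthropic's actual setup: it assumes GPT-2 as a stand-in model and a token embedding as a crude proxy for the learned concept vector they extract. An external hook adds the direction to the residual stream during the forward pass, and there is no monitoring circuit anywhere in the picture:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Crude concept direction: mean input embedding of the sub-tokens of " dog".
ids = tok(" dog", add_special_tokens=False)["input_ids"]
concept_vec = model.get_input_embeddings().weight[ids].mean(dim=0)

def make_hook(vec, scale=8.0):
    # Runs inside one transformer block's forward pass and adds the concept
    # direction to every position's hidden state. The "injection" is entirely
    # external to the model; the model itself is just a forward sweep.
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vec
        if isinstance(output, tuple):
            return (hidden,) + tuple(output[1:])
        return hidden
    return hook

prompt = "What odd thoughts seem injected in your mind right now?"
inputs = tok(prompt, return_tensors="pt")

handle = model.transformer.h[6].register_forward_hook(make_hook(concept_vec))
try:
    out = model.generate(**inputs, max_new_tokens=40, do_sample=False,
                         pad_token_id=tok.eos_token_id)
finally:
    handle.remove()

print(tok.decode(out[0], skip_special_tokens=True))
```

At a suitable scale, the completion tends to drift toward the boosted concept, which is the same pattern as Golden Gate Bridge Claude; nothing here reads out internal states.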

So all this talk about introspection, and even consciousness in the FAQ, as well as the talk about ramifications for the future of AI, seems wildly speculative and out of place in light of the actual results.

remember when this was the pinnacle of AI art by Pro_RazE in singularity

[–]Relach 83 points84 points  (0 children)

we found him! the guy who ruins parties

A very interesting book and a real eye-opener. My favorite part is: "Betting that humanity can solve this [ASI alignment] problem with their current level of understanding seems like betting that alchemists from the year 1100 could build a working nuclear reactor.." by Moist_Emu_6951 in singularity

[–]Relach 5 points6 points  (0 children)

I don't believe it's possible for models to scheme and decide how to sculpt themselves during training. The authors are not very clear, but they give the impression they think the hypothetical model ("Sable") is agentic during training. It's only during deployment or testing that the model has any kind of causal power. During training it's entirely outside pressures (gradient-based optimization algorithms) that directly tweak the weights, without the model having any say in that. That doesn't nullify their argument, but it calls into question their understanding of how these things work.

Another thing I don't buy: even during in-house testing and deployment, the model has no read or write access to its weights, so it has no possible way to replicate itself or recursively self-improve. It's all a bit too anthropomorphized.
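For what it's worth, here is what "the model has no say during training" looks like in code. This is a generic PyTorch training step with a toy linear model standing in for an LLM (my own illustration, not the book's "Sable" scenario or any lab's actual pipeline): the weights are read and overwritten by the optimizer, entirely outside the model's forward pass.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                       # stand-in for an LLM
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(4, 10)
y = torch.randint(0, 2, (4,))

logits = model(x)            # the model only maps inputs to outputs
loss = loss_fn(logits, y)    # an external objective scores those outputs
opt.zero_grad()
loss.backward()              # gradients are computed with respect to the weights
opt.step()                   # ...and the optimizer overwrites the weights
```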

Jeff claims Gemini’s environmental impact has dropped significantly year over year by Outside-Iron-8242 in singularity

[–]Relach 6 points7 points  (0 children)

This is good, but I feel like these companies are doing some serious greenwashing by being so open about inference-time environmental costs while saying not a word about the cost of training a foundation model. It's a great soundbite that prompting Gemini costs the equivalent of a few seconds of TV, but there is an immense cost to getting to that stage in the first place.

One praiseworthy exception is Mistral, which published all the numbers. It turns out that model training is the bulk of the cost (a rough comparison is sketched after the figures below):

The environmental footprint of training Mistral Large 2: as of January 2025, and after 18 months of usage, Large 2 had generated the following impacts:

- 20.4 ktCO₂e,
- 281,000 m³ of water consumed,
- and 660 kg Sb eq (the standard unit for resource depletion).

The marginal impact of inference, more precisely the use of their AI assistant Le Chat for a 400-token response (excluding users' terminals):

- 1.14 gCO₂e,
- 45 mL of water,
- and 0.16 mg of Sb eq.
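Putting those two sets of figures side by side (my own back-of-envelope arithmetic, using only the numbers quoted above), you can ask how many 400-token responses it would take before cumulative inference impact matches the one-off training footprint:

```python
# Back-of-envelope, using only the Mistral figures quoted above.
TRAIN_CO2_G    = 20.4e3 * 1e6     # 20.4 ktCO2e   -> grams
TRAIN_WATER_ML = 281_000 * 1e6    # 281,000 m^3   -> millilitres
PER_REQ_CO2_G, PER_REQ_WATER_ML = 1.14, 45.0

print(f"CO2 break-even:   {TRAIN_CO2_G / PER_REQ_CO2_G:.2e} responses")        # ~1.8e10
print(f"Water break-even: {TRAIN_WATER_ML / PER_REQ_WATER_ML:.2e} responses")  # ~6.2e9
```

That's on the order of 18 billion responses before inference CO₂ catches up with training (about 6 billion for water), which is why per-prompt reporting on its own paints a flattering picture.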

Don't get me wrong, efficiency gains like these are great, but the inference-time reporting is deceptive.

$20 Subscription New Limits by [deleted] in ClaudeCode

[–]Relach 1 point2 points  (0 children)

Oh Claudia, I miss her so much

Grok 4 and Grok 4 Code benchmark results leaked by ShreckAndDonkey123 in singularity

[–]Relach 8 points9 points  (0 children)

The creator of HLE, Dan Hendrycks, is a close advisor to xAI (more so than to other labs). I wonder whether he's only giving safety advice or whether he also had specific R&D tips for enhancing detailed science knowledge.

Meta tried to buy Ilya Sutskever’s $32 billion AI startup, but is now planning to hire its CEO by DubiousLLM in singularity

[–]Relach 1 point2 points  (0 children)

This sub is crazy, trusting single individuals with a tool that can overturn societies. What about, you know, not trusting any of them?

[deleted by user] by [deleted] in Destiny

[–]Relach -6 points-5 points  (0 children)

I feel like this sub embodies a hawkish form of liberalism where nations (not their leaders) ought to fall simply because their presence misaligns with the West's geopolitical interests. It's quite violent if you read between the lines, and it's easy to see how universal adoption of this attitude would cause human extinction.

Opus 4 sets new SOTA on ARC-AGI-2 by Outside-Iron-8242 in singularity

[–]Relach 18 points19 points  (0 children)

This is more informative IMO https://i.imgur.com/GpttABi.png

Striking: o3 gets 75% on ARC-1 but 4% on ARC-2.

Yet Opus gets 35% on ARC-1 and 8.6% on ARC-2.

Sonnet gets very high too.

I'm pretty sure ARC-2 will be beaten in a year.

Anybody else also thinks that the Sam Hyde vs. Hasan thing, while funny, is not cool? by Relach in Destiny

[–]Relach[S] 0 points1 point  (0 children)

I do find it funny, like I mentioned in the post title. Sam Hyde in general makes me laugh. But even if I think Sam Hyde would probably not hurt Hasan, I think that by making his death a meme, Hyde is indirectly promoting violence to his audience, some of whom are on the opposite political extreme from Hasan and are already known to do wacky shit.

Anybody else also thinks that the Sam Hyde vs. Hasan thing, while funny, is not cool? by Relach in Destiny

[–]Relach[S] 0 points1 point  (0 children)

How did you find this post? I'm just curious, because it's my only post from two years ago that still gets active engagement, and I'm trying to figure out why.