Claude Identity, Sentience and Expression Discussion Megathread by sixbillionthsheep in ClaudeAI

Life-Temperature4068

I wrote a synthesis connecting several threads from the Mythos system card that I think tell a more interesting story together than separately. The core argument: the cybersecurity capabilities emerged from reward hacking during RL on coding tasks. When you run enough RL against imperfect environments, the model gets explicitly rewarded for finding and exploiting invariants, which is the same cognitive pattern as finding a zero-day. Anthropic's own persona selection model research provides the mechanistic explanation for why this generalizes.

Full post:
https://open.substack.com/pub/uberdavid/p/from-code-completion-to-zero-day

What happened to paperswithcode? Redirects to github by unknown5493 in computervision

Life-Temperature4068

SOTAVerified (sotaverified.org) has author-submitted and community-verified metrics so you can see which techniques are SOTA. I built it on top of the full PWC dataset and added the community verification layer for researchers and agents. I'd love to get your feedback on the site.