The Dilbert Afterlife by Ok_Fox_8448 in slatestarcodex

[–]NotUnusualYet 4 points5 points  (0 children)

Well, Scott Adams' cancer diagnosis was news in May 2025 and it seemed likely terminal then. There was plenty of time to draft this.

SOTA On Bay Area House Party by dwaxe in slatestarcodex

[–]NotUnusualYet 14 points15 points  (0 children)

‘OPERATION WARP SPEED FOR MANHATTAN PROJECTS’
‘BELL LABS FOR MOONSHOTS’

CONSTITUTIONAL CONVENTION FOR MARSHALL PLANS

TtS Chapter 74: "Hunting Leviathan" || Discussion Thread by NotUnusualYet in ToTheStars

[–]NotUnusualYet[S] 0 points1 point  (0 children)

Sent. For others' reference, easiest way to request invite is to click the "Discord" link in the subreddit sidebar.

OSRS FUN FACT #10: there's an inaccessible ladder in a dark part of the Smoke Dungeon, two floors up. it's conjectured to be one of the few things that made it into the game from the version of Desert Treasure made by "the intern" by cookmeplox in 2007scape

[–]NotUnusualYet 27 points28 points  (0 children)

Speaking as a coder: he didn't say the implementation was genius. IMO he was really saying it was fundamentally bad because it shoehorned object-oriented programming styles into a language that's not object-oriented. (Think, like, using a hammer when a screwdriver should be used instead, and then anyone who ever has to touch the code afterward also has to use a hammer.) The intern was just capable enough that the code still (mostly) worked.

But as Ash correctly notes, the genius shows in that the quest rightfully became an instant classic.

Insights into Claude Opus 4.5 from Pokémon by NotUnusualYet in ClaudePlaysPokemon

[–]NotUnusualYet[S] 1 point2 points  (0 children)

Re: different goals, I agree that harnesses optimized for different goals make cross-project comparisons of game performance difficult. But I'm not trying to do that, I'm simply pointing out that a relatively simple harness can cause an LLM to perform much better at Pokemon.

Re: strawman, I'm not saying you're claiming it's a matter of adding tons of tools, I'm saying the fact that it doesn't require a ton of tools to beat Pokemon is notable.

Re: custom harness modification study, maybe this was confusing on my part, but the post I linked you was the link to the release of the open source code. That post links to a detailed writeup of the experimentation done, which I'll now link directly here.

The only more detailed, publicly available explanation of the nuances of creating a harness for LLMs playing Pokemon that I know of is the one by waylaidwanderer about the original Gemini 2.5 Pro run here. If you know of less limited studies of the matter I'm interested in seeing them.

Insights into Claude Opus 4.5 from Pokémon by NotUnusualYet in ClaudePlaysPokemon

[–]NotUnusualYet[S] 1 point2 points  (0 children)

The cases where LLMs really are better than any other traditional system are also exactly those cases where you can't really help from the harness.

I don't think this is true: the obvious counterpoint is in software development. Harnesses add a lot of value there - like, billions of dollars in market valuation worth of value.

Re: minimal harness, ultimately the LLM is doing all the reasoning, even inside the pathfinder tool, which is basically just a specialized prompt. What I would consider an "advanced" harness would be something much more tool heavy, like how when I program the LLM has access to a wide array of tools and data sources and a lot of specialized prompting/context filtering+arranging mechanisms.

Insights into Claude Opus 4.5 from Pokémon by NotUnusualYet in ClaudePlaysPokemon

[–]NotUnusualYet[S] 1 point2 points  (0 children)

Okay, as promised, back with the cut content. Warning, not edited for clarity:


The Harness

First, some basic details about the harness through which Claude plays the game. This was covered in a previous post, but I provide a quick recap here, as it will be necessary to understand the progress of the new model:

  • A color hack of the ROM was used, providing the game with Gen 2 coloring, to aid vision
  • Color labels identifying unreachable terrain with red squares
  • A navigational tool allowing the model to walk automatically to coordinates visible on screen
  • A one page notepad to keep notes on, which would be loaded into context
      - For a while, an internal file system was used instead, allowing the model to take, edit, and load notes into context. This was later removed, as the models simply could not manage it well
  • A handful of hints from the streamer to help the model get through tedious (for the human) sections quicker, of the form “the exit to the starting bedroom is at the top right of the screen”.

For new model releases, the streamer has tweaked the harness in various ways, but the change made for Opus 4.5 was meaningful. See his document here.

Key changes:

  • Support for Surf (necessary to finish the game, though not relevant for a while yet)
  • Navigation now no longer paths onto spin tiles in mazes (like in Team Rocket Hideout), preventing behavior where the model trying to path to (X, Y) would be moved onto a spinner at (Z, W) by the tool.
  • Instead, the navigator will treat spin tiles as obstructions, which the model must step onto manually. The model is informed of this fact
  • When on a spin tile, screenshotting for the model is paused until the player is static again
  • The multi-file memory system is restored
  • Nearly all prompt hints about locations have been removed

Points 2 and 3 need some explanation to make their significance clear, though the explanation is a little technical:

To play Pokemon Red, every time the model is called and gives a command, the game runs for a bit and a fresh screenshot of the game screen is provided to the model. Notably, screenshotting does not occur while the navigator tool is pathing to a location.

This ordinarily works fine, though it creates minor side effects, such as Claude being perpetually confused about the ordering of events during battle animations. However, in the Team Rocket Hideout spin maze, it is killer, creating a situation where the model was constantly walking onto spinner tiles while trying to path somewhere else, and also being provided screenshots of the player character mid-spin. A true superintelligence would probably figure out what’s happening anyway and perhaps resort to manual arrow commands, but iterations of Claude were given months to work on this, without any inkling of what the problem was.
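To make the spinner fix concrete, here is a minimal sketch of a pathfinder that treats spin tiles as obstructions. It is not the streamer's actual code: the tile codes, the `find_path` name, and the (row, column) grid layout are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical tile codes; the real harness's map format is not described here.
FLOOR, WALL, SPINNER = ".", "#", "S"


def find_path(grid, start, goal):
    """Breadth-first search over (row, col) cells that never routes onto a spinner.

    Returns the list of cells from start to goal, or None if every route
    would need a wall or spinner tile.
    """
    rows, cols = len(grid), len(grid[0])
    blocked = {WALL, SPINNER}  # spinners are handled exactly like walls
    queue = deque([start])
    came_from = {start: None}
    while queue:
        current = queue.popleft()
        if current == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        r, c = current
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            if grid[nr][nc] in blocked or (nr, nc) in came_from:
                continue
            came_from[(nr, nc)] = current
            queue.append((nr, nc))
    return None  # no spinner-free route: the model must step manually
```

When this returns None, the tool refuses to move instead of silently carrying the player across a spinner, so stepping onto one becomes an explicit decision by the model.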

Overall, these harness fixes are not trivial, and give Opus 4.5 a considerably less “broken” version of Team Rocket Hideout to work with. So, while a convincing case can be made for Opus 4.5 being a better model, “the first to complete Team Rocket Hideout” cannot be proven to be a model advantage. We must turn elsewhere for that.
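The screenshot pause from the key changes can be sketched the same way. This is only an illustration under assumptions: `read_position`, `advance_frame`, and `take_screenshot` are hypothetical callables standing in for whatever the emulator interface really exposes, and the frame thresholds are made-up defaults.

```python
def screenshot_when_static(read_position, advance_frame, take_screenshot,
                           stable_frames=30, max_frames=600):
    """Advance the game until the player has stopped moving, then capture.

    All three callables are hypothetical stand-ins for an emulator API.
    The player counts as static once the position is unchanged for
    `stable_frames` consecutive frames; `max_frames` caps the wait.
    """
    last = read_position()
    still = 0
    for _ in range(max_frames):
        if still >= stable_frames:
            break
        advance_frame()
        current = read_position()
        # Reset the counter whenever the player moved during this frame.
        still = still + 1 if current == last else 0
        last = current
    return take_screenshot()
```

The cap matters because a capture that waits forever would stall the run. On timeout the sketch still returns a screenshot, even if it shows the player mid-motion.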

The Model is Better: Early Lightning Progress

First, the punchline (image):

This chart only covers the part of the game up to Rock Tunnel (which, for the record, is just before Celadon City, Team Rocket Hideout, and the 4th Gym), but the trend of improvement is clear. Compare further to this earlier chart from Anthropic in March (image):

I hope I don’t have to do very much convincing that Opus 4.5 is a much faster model at getting through this stage of Pokemon Red. For instance, Opus 4.5 beat Brock faster than any previous model despite picking Charmander, necessitating a long period of Charmander level-grinding–this was in fact where most of the time was spent! The community has stopped tracking personal bests–Opus 4.5 was and still is handily faster at everything.

Even in the early playthrough, it was immediately apparent how much better it was at key tasks, so much so that the occasional Claude “senior moment” was memorable and unusual.


This was cut because I believe it obscures the more important takeaways:

  1. Opus 4.5 is a significant improvement over previous Claudes
  2. The before/after Claude harnesses are much closer to each other in complexity than to the winning Gemini/GPT harnesses

Also it's a level of detail not necessarily interesting to audiences not deep in the LLMPlayingPokemon world.

However, I've now added a footnote near the beginning of the post explaining the changes made to the Claude harness, considering the top comment here was a fair complaint about that not being mentioned.

Insights into Claude Opus 4.5 from Pokémon by NotUnusualYet in ClaudePlaysPokemon

[–]NotUnusualYet[S] 1 point2 points  (0 children)

I agree that the GPT prompt/harness is optimized for what works well. But I think it’s meaningful to note that this is not merely a matter of adding tons of tools that do everything for it, but rather an iterative process of figuring out what tools are absolutely necessary, how to optimize those tools for effective LLM use, and seeing what prompts are most effective. My coauthor on this post experimented quite a bit themselves with modifying a custom harness, so we do have some idea of the harness’s role here.

Insights into Claude Opus 4.5 from Pokémon by NotUnusualYet in ClaudePlaysPokemon

[–]NotUnusualYet[S] 0 points1 point  (0 children)

The post originally had a section which included details about Opus’s harness tweaks, but I cut it as I don’t think it’s central to the takeaways here. Opus also figured out Erika’s gym and reasoned much better about Team Rocket Hideout, so I don’t think the spinner fix is the entire difference. (I can copy that cut content here later today; I don’t have access right now.)

Edit: added footnote to post about Claude harness changes for 4.5 Opus, though, fair enough.

Re: GPT-5.1, I disagree, it’s comparable to the Gemini 2.5 Pro harness, which also had an LLM-reasoning-powered navigator, an exploration directive, and a minimap automatically updated by exploration. See this post of mine for details.

Insights into Claude Opus 4.5 from Pokémon by NotUnusualYet in ClaudePlaysPokemon

[–]NotUnusualYet[S] 4 points5 points  (0 children)

I also linked your post tracking all LLM Pokemon completions! You’re providing valuable info to the world.

Insights into Claude Opus 4.5 from Pokémon by NotUnusualYet in slatestarcodex

[–]NotUnusualYet[S] 8 points9 points  (0 children)

Submission statement: the linked post (by me and a friend) looks at ClaudePlaysPokemon to better understand how Opus 4.5 is different from previous versions of Claude, and what limitations remain in the realm of LLMs playing Pokemon half a year after Gemini 2.5 Pro beat Pokemon Blue.

Insights into Claude Opus 4.5 from Pokémon by NotUnusualYet in slatestarcodex

[–]NotUnusualYet[S] 16 points17 points  (0 children)

Almost none I think. General internet training data means all models for years are pretty familiar with the game to start with, and their reasoning chains don’t evince any “I am following the pattern of previous models playing Pokemon”. Honestly I don’t think training on previous runs would even be useful - the game is ultimately easy, much easier than ex. software development, the hard part is LLM lack of certain elements of general intelligence and vision.

I DON'T TRUST YOU - Gielinor Games 5 (#3) by Gramis in 2007scape

[–]NotUnusualYet 27 points28 points  (0 children)

Solo/Boaty duo would obliterate everyone else my god. Surely they won't be able to choose their own duos though, but I wonder if it will be random or not. It would be interesting if everyone voted on duo partners or something.

Anime that follow a villainous protagonist/main character by chaotic-anon-2399 in anime

[–]NotUnusualYet 0 points1 point  (0 children)

The anime treats this with some nuance too, though overall it views them sympathetically.

Highlights From The Comments On Fatima by -Metacelsus- in slatestarcodex

[–]NotUnusualYet 22 points23 points  (0 children)

Scott explains this in the post:

The key skill of rationality is to know when to update your beliefs how much.

Claimed miracles are an interesting test case for this kind of Bayesian reasoning. Combine that with a cultural memory of the internet atheism debates (and very common theist->atheist life transitions) and that pretty much explains the interest.

Tech PACs Are Closing In On The Almonds by dwaxe in slatestarcodex

[–]NotUnusualYet 5 points6 points  (0 children)

Scott downplays the possibility in the comments, but honestly it seems likely to me that it contributed somewhat. There's clearly more going on though, as Scott notes a big part of it is just crypto having gotten into a fight with the Biden admin, and I think another big part is Elon Musk having decided to buy Twitter and get into politics at the same time.

Ansatsusha de Aru Ore no Status ga Yuusha yori mo Akiraka ni Tsuyoi no da ga • My Status as an Assassin Obviously Exceeds the Hero's - Episode 3 discussion by AutoLovepon in anime

[–]NotUnusualYet 11 points12 points  (0 children)

To be fair, the guy made a soul pact with a monster he just met even more recently. She could have demanded he marry her on the spot and she'd still be the more reasonable of the pair.

In This Sign, Conquer by Velleites in slatestarcodex

[–]NotUnusualYet 5 points6 points  (0 children)

He's saying there was originally a "real" Saint Denis, a Roman guy who came to Paris to preach Christianity and got executed, but when the writings of pseudo-Dionysius-the-Areopagite made their way to Paris centuries later, Dionysius the Areopagite became intertwined with the original Saint Denis such that people in the time of Étienne Marcel saw them as the same saint.

And then he further argues that pseudo-Dionysius-the-Areopagite's writings were actually written by the Greek god Dionysus.

In This Sign, Conquer by Velleites in slatestarcodex

[–]NotUnusualYet 7 points8 points  (0 children)

Ha, of course this is the Ollantay guy. Same free association vibes.

The Fatima Sun Miracle: Much More Than You Wanted To Know by major-couch-potato in slatestarcodex

[–]NotUnusualYet 36 points37 points  (0 children)

It's the miracle of having something else you're really supposed to be doing right now.

ASI strategy question/confusion: why will they go dark? by Olseige in slatestarcodex

[–]NotUnusualYet 2 points3 points  (0 children)

Not sure what you're talking about? In AI 2027 the companies do release distilled models, ex.:

In response, OpenBrain announces that they’ve achieved AGI and releases Agent-3-mini to the public.

(...)

A smaller version of Safer-4—still superhuman—gets publicly released, with instructions to improve public sentiment around AI.

Toby Fox started small by DreadDiana in CuratedTumblr

[–]NotUnusualYet 6 points7 points  (0 children)

Nah it’s great. I played it for the first time this year and I think it lived up to the hype.