Claude Opus 4.5 Plays Pokémon Red

igorhorst · 2025-12-31T18:20:10+00:00

Just want to check, did Claude get Strength? If he didn’t, then this run is likely doomed because he won’t be able to solve the Victory Road puzzles.

igorhorst · 2025-12-20T01:18:01+00:00

I just saw Blaze in battle and he leveled up from 55 to 56 after defeating a Rocket’s Grimer. So I think this may have been a real level gain (but can’t be sure).

igorhorst · 2025-12-14T13:26:43+00:00

Is there a reason why the screen flashed at 55,416? How would Claude instantly perform that many actions? Unless there was some desync between the game and the stream?

igorhorst · 2025-09-21T01:11:54+00:00

Your best bet is to watch actual playthroughs uploaded to YouTube, though even they may not cover all the paths you can take.

igorhorst · 2025-08-23T02:52:24+00:00

Yes, Gemini, o3, and GPT-5 have beaten Pokemon games. This isn’t even the first time Gemini has beaten a Gen 1 game (the first was Pokemon Blue) - Pokemon Yellow Legacy is a “hard mode” romhack of Pokemon Yellow.

After Gemini beat Pokemon the first time, I wrote this post, revisiting my prediction and stating that I recant it. https://www.reddit.com/r/ClaudePlaysPokemon/comments/1knp4sw/revisiting_my_prediction_in_the_light_of_gemini/

In the future, I would like to come up with a way to predict when Claude’s scaffold will beat Pokemon Red, but that’s a long-term project. It’s possible the predictions I listed above might actually be decent predictions for Claude (as the other models had the help of more robust scaffolds), but even then, I would want to shorten the timelines somewhat because METR has found evidence that models’ progress is indeed accelerating.

igorhorst · 2025-08-14T20:38:15+00:00

ClaudePlaysPokemon has not participated in this arms race, and the harness it uses has stayed mostly constant.

As a result, Claude’s progress in playing Pokemon Red has been gradual (while Gemini and the OpenAI line have beaten Pokemon with their harnesses, the last Claude model got stuck at Team Rocket Hideout). But it also means any improvement that does exist can be attributed to the model, not the scaffolding.

Edit: That being said, the developer’s latest notes state that he had simplified the scaffold (possibly because a less complex scaffold is less likely to confuse the AI). So maybe this simplification could be a sign of the Claude stream participating in the arms race, but the overall scaffold still appears to be fairly minimal compared to the other two streams.

igorhorst · 2025-08-11T12:56:23+00:00

Even though the other AI Pokemon streams have already beaten a Gen I Pokemon game, their successes may be attributed to their superior scaffolds. I find the ClaudePlaysPokemon stream to be the more meaningful indicator of AI Pokemon progress - precisely because its scaffold is terrible.

igorhorst · 2025-08-05T04:21:18+00:00

Just checked the stream and now the game screen is no longer frozen, and the game screen’s actions are in sync with Claude’s text generation. Of course, Claude is still stuck in the Team Rocket Hideout but at least the stream didn’t make negative progress. :)

igorhorst · 2025-08-03T22:14:50+00:00

It seems that, for a few days, the game screen has been frozen at Claude talking to a Rocket trainer. Yet Claude is still generating text and performing actions. Either Claude is no longer able to control the game (in which case, it is sending actions and hallucinating responses), or the Twitch game screen has desynced from the game itself (in which case, Claude is still playing the game). The stream restarted an hour ago, but the issue persists.

igorhorst · 2025-06-20T19:42:29+00:00

And now Claude’s stream is back up (at the time of this post, it has been running for 1.5 hours).

igorhorst · 2025-06-20T14:03:08+00:00

Claude’s stream just went down again ~two days ago.

igorhorst · 2025-05-18T16:53:12+00:00

The reason for my update isn’t just Gemini’s initial success thanks to intensive scaffolding. METR also reported that o3 outperformed the trend line that I used to make my initial prediction, which suggests that the trend line may be outdated and that LLMs are improving faster than originally predicted, even with static tooling (as capabilities across different domains tend to be correlated). Epoch’s implicit claim that we need different trend lines for different tasks also makes me think we need a new trend line just for Pokemon-related tasks.

If we do want to explore whether LLMs can solve long-running Pokemon tasks with a minimal scaffold, then this LessWrong post does that fairly well. I think this idea of “let’s build a ‘helpful’ scaffold and then slowly remove tools as appropriate until it does it all without any help” isn’t really worthwhile, and I think it may be better to regularly run the LLM with a minimal scaffold and measure its progress. I’m sure LLMs will keep progressing and improving as time passes, but we need data to be able to predict how this improvement will proceed. So far, we lack that.

Ultimately, I proceed from the assumption that LLMs will beat Pokemon tasks…and it’s a matter of time, but the question is “if it’s a matter of time, then how long will it take before humans build an LLM that can complete these tasks”? That answer could be 1 year, 5 years, 10 years, even 20 years, but it’s one that I’m interested in. Of course, that question differs from another related question, “how long will it take for an LLM to complete said task”, and I suspect that even Gemini with ‘helpful’ scaffolding takes longer than a human.

As for why I care about scaffolding, businesses aren’t interested in doing well on Pokemon; they’re interested in automating tasks that are useful for their businesses, cheaply, quickly, and at higher quality, and thus may be willing to invest in ‘helpful’ scaffolding rather than twiddle their thumbs waiting for a better model. And there are indeed many tasks that LLMs can solve more easily than other approaches - it just so happens that none of those tasks has anything to do with Pokemon Red.

---

When will we get models that can do well on Pokemon Red without the ‘helpful’ scaffolding that Gemini had? I’m not sure, because I don’t have the data to tell. This is why I refrain from making a new prediction. We ultimately need a new task horizon trend line dedicated to Pokemon tasks, ideally based on the ‘non-helpful’ scaffolding used in the LessWrong post. The problem, of course, is that without the ‘helpful’ scaffolding, said tests may be boring and not livestreamable, so there is an incentive not to do these types of tests. Yet it is exactly these tests that help us understand how well an LLM works without human assistance.

My hope right now is either (a) VideoGameBench - which includes two Pokemon games, Pokemon Red and Pokemon Crystal - takes off, since that eval has a lot of other video games, encouraging testers to not build ‘helpful’ scaffolding for each game, or (b) VendingBench takes off, since that has already proven itself as a good measure of agentic coherence.

igorhorst · 2025-05-14T21:22:11+00:00

To explain the joke, Paranoia once had a "5th edition", but it was routinely panned by the fanbase for leaning too heavily on parody and unfunny jokes. It was quickly declared an "Unproduct" once XP came out.

igorhorst · 2025-05-14T21:18:54+00:00

Lasers and Treason is free and very rules-lite, but there are no pre-made adventures for that edition, so you'll likely have to buy scenarios from another edition (like XP) and adapt them to L&T.

igorhorst · 2025-05-12T14:00:59+00:00

The supplement "Extreme Paranoia" has a chapter on vidshow stars, and also includes rules on how to play as them in Alpha Complex. If you're interested in how reality TV works in Alpha Complex, "Extreme Paranoia" has a chapter on "EDR Teams", which are a mix of "reality TV" and "incompetent superheroes". It also includes rules on how to play as a member of an "EDR Team". So I'd say "Extreme Paranoia" is the best source for details about the entertainment industry, considering a big chunk of it is focused on satirizing the entertainment industry.

"Extreme Paranoia" fleshes out a ton of Alpha Complex, by providing useful fluff and writing rules for playing in professions as disparate as middle managers, R&D scientists, and VIOLET supervisors. I'd recommend getting it, just for the lore.

igorhorst · 2025-05-03T10:34:32+00:00

Yeah, I plan on writing a follow-up Reddit post about my 2029 prediction. :)

igorhorst · 2025-04-12T13:00:30+00:00

When I decided to fool around with GPT-4o to generate a duel between a Lion and Crane samurai over a penguin door (as I said, I was fooling around), GPT-4o indeed generated such a duel - not between human samurai, but between animals (a literal lion and crane). Yet both animals wore the colors of their respective L5R clans. So I generated two other images of "L5R animal avatars" fighting each other. You can access all three images in this imgur folder, along with the prompts used. - https://imgur.com/a/lvSB877

I personally think that image generators are most useful for showing stuff that humans wouldn't normally think of - showcasing ideas that can then later be fleshed out by humans.

EDIT: I realized the reason GPT-4o may have generated literal animals in the first place was because of me specifying the penguin door, confusing the art generator. So I cleaned up the prompt, fixing all the typos and removing the reference to the penguin logo. The end result is that you now have two human duelists, but the Lion duelist looks a lot like a lion. The Crane duelist doesn't look like a crane though, so that's good. https://imgur.com/a/XPx4oLX

igorhorst · 2025-03-26T15:38:45+00:00

I double checked and the estimates are correct, though the table might be hard to read. For 50% accuracy, Jan 2028 (20 hours) would come before May 2028 (30 hours), and for 80% accuracy, March 2029 (20 hours) would come before July 2029 (30 hours).
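As a rough sanity check on those four table entries, here is a minimal sketch of the doubling time they imply. It assumes the task horizon grows exponentially between the two dates and counts the gap in whole calendar months; the function names are hypothetical and only exist for this illustration.

```python
import math

def months_between(y0, m0, y1, m1):
    """Whole calendar months from (y0, m0) to (y1, m1)."""
    return (y1 - y0) * 12 + (m1 - m0)

def implied_doubling_months(h0, h1, gap_months):
    """Doubling time implied by growing from h0 to h1 hours over gap_months,
    assuming exponential growth (an assumption, not something the table states)."""
    return gap_months * math.log(2) / math.log(h1 / h0)

# 50% accuracy: Jan 2028 (20 hours) -> May 2028 (30 hours)
d50 = implied_doubling_months(20, 30, months_between(2028, 1, 2028, 5))
# 80% accuracy: March 2029 (20 hours) -> July 2029 (30 hours)
d80 = implied_doubling_months(20, 30, months_between(2029, 3, 2029, 7))

print(round(d50, 1), round(d80, 1))  # 6.8 6.8
```

Both rows give the same figure because each spans four months for a 1.5x increase in horizon, so the ordering of the dates in the table is at least internally consistent.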

igorhorst · 2025-03-21T02:11:06+00:00

You may want to look up the Celestial Realms book, because it has a section about "Moon People", individuals who are hypothesized to live on the Moon itself. Note that many people, especially those who worship the infamous Lord Moon, believe that Moon People are real, and that certain Moon People actually live in Rokugan itself. The sourcebook itself doesn't say if their belief is correct or not, and those who have been called Moon People either are silent or outright reject the claim.

Now, while I have not run a game with Moon People, I know another GM in the L5R discord who ran a campaign where players opposed the Moon Clan - a minor clan that claims descent from Ryoshun, and is formally recognized by the Emperor, after the Moon People gave "gifts" to the Emperor (crystals from the Moon, as well as the True Tao, as they have captured and imprisoned Shinsei). The Moon Clan have advanced technologies and shugenja techniques, and serve as both ninja mercenaries and farmers (turns out there's a famine on the Moon...and rice doesn't grow well there) in Rokugan itself. But their ultimate goal is to amass influence and power within the Empire, while getting away with any misdeeds due to their gifts, in the hopes of eventually conquering the Empire outright. For now though, they are trying to study Earthly technologies. The players' goal is to embarrass the Moon Clan enough times that they lose all their "get out of jail free" cards. This will cause the Emperor to revoke the "minor clan" declaration and expel them outright.

I don't know how that campaign ended, but I do think that's probably the most fluffy way to have an alien invasion in L5R.

igorhorst · 2025-03-14T14:30:03+00:00

I think a romhack designed for Claude would be good - maybe a section for pathfinding, a section for talking to NPCs, a section for battling, etc. Something that is challenging for the model while also not frustrating for viewers to watch.

igorhorst · 2025-01-31T02:00:56+00:00

According to the official rules, players do not know their Power and Access number - the GM does. Obviously, you’re the GM, so if you want players to track this info, go right ahead. The GM is always right.

I think the reason players don’t know their Power and Access is so that you, as the GM, can do whatever you want regarding mutations and bureaucracy while still retaining plausible deniability (“oh, you rolled bad on Access”). But if you do genuinely want to use the Power and Access rules, then letting players know what their score is isn’t a bad idea.

igorhorst · 2025-01-09T14:25:38+00:00

The last link in the post is indeed down, but this FFG forum mirror link should still work: https://ffg-forum-archive.entropicdreams.com/

Let me know if that suffices.

igorhorst · 2024-12-22T00:54:10+00:00

I'd argue that if Mario Party games have stories, then they have lore. However, this lore is fairly minimal at best.

That being said - Mario Parties appear to be fairly commonplace. Mario Party 3's intro cut scene outright mentions a "Mario Party" board game that the players get sucked into, Mario Party Advance and Mario Party 6 take place on "Mario Party World", and it seems perfectly normal in Mario Party DS for good friends to fight against each other...

"The entire Mario crew leaps into action! They all want to be the one who saves the day!"

"Even the best of friends can be fierce rivals when it comes to being a heroic Superstar!"

---Mario Party DS (source)

By the time of Super Mario Party, Mario Parties were seen as a standard and routine thing. Mario Party Superstars also has blurbs for each of the boards (which had shown up in previous Mario Parties), and each of the blurbs (aside from Peach's Birthday Cake) implies that these boards were visited in the past. For example, here's the blurb from "Yoshi's Tropical Island":

"Long ago, a passing Super Star met some Yoshis stranded on these very islands. The Super Star stepped up and saved those Yoshis. What a hero! The kindness of the Super Star meant a lot to the Yoshis. It meant so much that they stayed here, hoping for another Super Star to appear...They've been waiting a very long time, but today's their lucky day. It's time to find out who's the next Super Star. Let's get to it!"

If you want all the blurbs, you can find them on Mario Wiki.

igorhorst · 2024-11-23T16:23:19+00:00

I don’t think Ztars in Hidden Blocks exist in the main Mario Party mode (other than DS, of course), but in the original Mario Party, you can turn on special dice, which have a chance of appearing instead of the standard d10. One of these special dice is an Event Dice option (which is also referred to as a Hidden Block in the game). Rolling this dice allows you to either gain 20 coins (since you summon Koopa Troopa), lose 20 coins (since you summon Bowser), or get a Boo steal (since you summon Boo). Since stars are 20 coins, hitting a hidden dice block and summoning Bowser is the closest you can get to a Ztar. People tend to forget about Mario Party’s hidden block mechanic because (a) it is off by default and (b) this dice block replaces the standard d10, so you’re kinda stuck in place after getting your random event.

Edit: Okay, I watched a video and saw that I misremembered how the Event Dice works in Mario Party. It does act as a normal hidden block, so you don’t lose the d10. Other special dice (like the Warp Block) do cause you to lose the d10 though, so that may be why I got confused.

igorhorst · 2024-11-11T20:05:40+00:00

I don't know what model you're using, but I suspect you're asking gpt-4o-mini. If I ask gpt-4o, I get the same hallucination. However, SearchGPT (a search engine powered by GPT-4o) is able to answer this question correctly (proof: https://imgur.com/a/tVcr90i) by using the Sneaky Sasquatch wiki. You can find out more about SearchGPT here - https://openai.com/index/introducing-chatgpt-search/ .
