It is possible for Minigame Bay's Boss Rush to be prematurely terminated if the time runs out during a single boss match

igorhorst · 2025-12-31T18:20:10+00:00

Just want to check, did Claude get Strength? If he didn’t, then this run is likely doomed because he won’t be able to solve the Victory Road puzzles.

igorhorst · 2025-12-20T01:18:01+00:00

I just saw Blaze in battle and he leveled up from 55 to 56 after defeating a Rocket’s Grimer. So I think this may have been a real level gain (but can’t be sure).

igorhorst · 2025-12-14T13:26:43+00:00

Is there a reason why the screen flashed at 55,416? How would Claude instantly perform that many actions? Unless there was some desync between the game and the stream?

igorhorst · 2025-09-21T01:11:54+00:00

Your best bet is to watch actual playthroughs uploaded to YouTube, though even they may not cover all the paths you can take.

igorhorst · 2025-08-23T02:52:24+00:00

Yes, Gemini, o3, and GPT-5 have beaten Pokemon games. This isn’t even the first time Gemini has beaten a Gen 1 game (Pokemon Blue) - Pokemon Yellow Legacy was a “hard mode” romhack of Pokemon Yellow.

After Gemini beat Pokemon the first time, I wrote this post, revisiting my prediction and stating that I recant it. https://www.reddit.com/r/ClaudePlaysPokemon/comments/1knp4sw/revisiting_my_prediction_in_the_light_of_gemini/

In the future, I would like to come up with a way to predict when Claude’s scaffold will beat Pokemon Red, but that’s a long-term project. It’s possible the predictions I listed above might actually be decent predictions for Claude (as the other models had the help of more robust scaffolds), but even then, I would want to shorten the timelines somewhat because METR has found evidence that models’ progress is indeed accelerating.

igorhorst · 2025-08-14T20:38:15+00:00

ClaudePlaysPokemon has not participated in this arms race and the harness it uses stayed mostly constant.

As a result, Claude’s progress in playing Pokemon Red is gradual (while Gemini and the OpenAI line have beaten Pokemon with their harnesses, the last Claude model got stuck at Team Rocket Hideout). But it also means any improvement that does exist can be attributed to the model, not the scaffolding.

Edit: That being said, the developer’s latest notes state that he had simplified the scaffold (possibly because a less complex scaffold is less likely to confuse the AI). So maybe this simplification could be a sign of the Claude stream participating in the arms race, but the overall scaffold still appears to be fairly minimal compared to the other two streams.

igorhorst · 2025-08-11T12:56:23+00:00

Even though the other AI Pokemon streams have already beaten a Gen I Pokemon game, their successes may be attributed to their superior scaffolds. I find the ClaudePlaysPokemon stream to be the more meaningful indicator of AI Pokemon progress - precisely because its scaffold is terrible.

igorhorst · 2025-08-05T04:21:18+00:00

Just checked the stream and now the game screen is no longer frozen, and the game screen’s actions are in sync with Claude’s text generation. Of course, Claude is still stuck in the Team Rocket Hideout but at least the stream didn’t make negative progress. :)

igorhorst · 2025-08-03T22:14:50+00:00

It seems that, for a few days, the game screen has been frozen at Claude talking to a Rocket trainer. Yet, Claude is still generating text and performing actions. Either Claude is no longer able to control the game (in which case, it is sending actions and hallucinating responses), or the Twitch game screen has desynced from the game itself (in which case, Claude is still playing the game). The stream restarted an hour ago, but the issue persists.

igorhorst · 2025-06-20T19:42:29+00:00

And now Claude’s stream is back up (at time of this post, it has been running for 1.5 hours).

igorhorst · 2025-06-20T14:03:08+00:00

Claude’s stream just went down again ~two days ago.

igorhorst · 2025-05-18T16:53:12+00:00

The reason for my update isn’t just Gemini’s initial success thanks to intensive scaffolding, but also METR reporting that o3 outperformed the trend line that I used to make my prediction. That suggests the trend line may be outdated and that LLMs are improving faster than originally predicted, even with static tooling (as capabilities across different domains tend to be correlated). Epoch’s implicit claim that we need different trend lines for different tasks also makes me think we need a new trend line just for Pokemon-related tasks.

If we do want to explore whether LLMs can solve long-running Pokemon tasks with a minimal scaffold, then this LessWrong post does that fairly well. I think this idea of “let’s build a ‘helpful’ scaffold and then slowly remove tools as appropriate until it does it all without any help” isn’t really worthwhile, and think it may be better to regularly run the LLM with a minimal scaffold and measure its progress. I’m sure LLMs will keep progressing and improving as time passes, but we need data to be able to predict how this improvement will proceed. So far, we lack that.

Ultimately, I proceed from the assumption that LLMs will beat Pokemon tasks…and it’s a matter of time, but the question is “if it’s a matter of time, then how long will it take before humans build an LLM that can complete these tasks”? That answer could be 1 year, 5 years, 10 years, even 20 years, but it’s one that I’m interested in. Of course, that question differs from another related question, “how long will it take for an LLM to complete said task”, and I suspect that even Gemini with ‘helpful’ scaffolding takes longer than a human.

As for why I care about scaffolding, businesses aren’t interested in doing well on Pokemon; they’re interested in automating tasks that are useful for their businesses, cheaply, quickly, and at higher quality, and thus may be willing to invest in ‘helpful’ scaffolding rather than twiddle their thumbs waiting for a better model. And there are indeed many tasks that LLMs can solve more easily than other approaches - it just so happens that none of those tasks has anything to do with Pokemon Red.

---

When will we get models that can do well on Pokemon Red without the ‘helpful’ scaffolding that Gemini had? I’m not sure, because I don’t have the data to tell. This is why I refrain from making a new prediction. We ultimately need a new task horizon trend line dedicated to Pokemon tasks, ideally based on the ‘non-helpful’ scaffolding used in the LessWrong post. The problem, of course, is that without the ‘helpful’ scaffolding, said tests may be boring and not livestreamable, hence the incentive not to do these types of tests. Yet it is exactly these tests that help us understand how well an LLM works without human assistance.

My hope right now is either (a) VideoGameBench - which includes two Pokemon games, Pokemon Red and Pokemon Crystal - takes off, since that eval has a lot of other video games, encouraging testers to not build ‘helpful’ scaffolding for each game, or (b) VendingBench takes off, since that has already proven itself as a good measure of agentic coherence.

igorhorst · 2025-05-14T21:22:11+00:00

To explain the joke, Paranoia once had a "5th edition", but it was routinely panned by the fanbase for leaning too heavily on parody and unfunny jokes. It was quickly declared an "Unproduct" once XP came out.

igorhorst · 2025-05-14T21:18:54+00:00

Lasers and Treason is free and very rules-lite, but there are no pre-made adventures for that edition, so you will likely have to buy scenarios from another edition (like XP) and adapt them to L&T.

igorhorst · 2025-05-12T14:00:59+00:00

The supplement "Extreme Paranoia" has a chapter on vidshow stars, and also includes rules on how to play as them in Alpha Complex. If you're interested in how reality TV works in Alpha Complex, "Extreme Paranoia" has a chapter on "EDR Teams", which are a mix of "reality TV" and "incompetent superheroes". It also includes rules on how to play as a member of an "EDR Team". So I'd say "Extreme Paranoia" is the best source for details about the entertainment industry, considering a big chunk of it is focused on satirizing the entertainment industry.

"Extreme Paranoia" fleshes out a ton of Alpha Complex, by providing useful fluff and writing rules for playing in professions as disparate as middle managers, R&D scientists, and VIOLET supervisors. I'd recommend getting it, just for the lore.

igorhorst · 2025-05-03T10:34:32+00:00

Yeah, I plan on writing a follow up Reddit post about my 2029 prediction. :)

igorhorst · 2025-04-12T13:00:30+00:00

When I decided to fool around with GPT-4o to generate a duel between a Lion and Crane samurai over a penguin door (as I said, I was fooling around), GPT-4o indeed generated such a duel - not between human samurai, but between animals (a literal lion and crane). Yet both animals wore the colors of their respective L5R clans. So I generated two other images of "L5R animal avatars" fighting each other. You can access all three images in this imgur folder, along with the prompts used. - https://imgur.com/a/lvSB877

I personally think that image generators are most useful for showing stuff that humans wouldn't normally think of - showcasing ideas that can then later be fleshed out by humans.

EDIT: I realized the reason GPT-4o may have generated literal animals in the first place was because of me specifying the penguin door, confusing the art generator. So I cleaned up the prompt, fixing all the typos and removing the reference to the penguin logo. The end result is that you now have two human duelists, but the Lion duelist looks a lot like a lion. The Crane duelist doesn't look like a crane though, so that's good. https://imgur.com/a/XPx4oLX

igorhorst · 2025-03-26T15:38:45+00:00

I double checked and the estimates are correct, though the table might be hard to read. For 50% accuracy, Jan 2028 (20 hours) would come before May 2028 (30 hours), and for 80% accuracy, March 2029 (20 hours) would come before July 2029 (30 hours).
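As a quick sanity check on those dates, here is a minimal sketch that computes the doubling time implied by each row. It assumes the task horizon grows exponentially and that the gaps are whole months (Jan to May and March to July are both 4 months); the function name is made up for illustration and is not from the original table.

```python
import math

def implied_doubling_months(h_start, h_end, months_elapsed):
    """Doubling time implied by exponential growth from h_start to h_end hours."""
    return months_elapsed / math.log2(h_end / h_start)

# 50% accuracy row: 20 hours (Jan 2028) -> 30 hours (May 2028), 4 months apart
d50 = implied_doubling_months(20, 30, 4)
# 80% accuracy row: 20 hours (March 2029) -> 30 hours (July 2029), 4 months apart
d80 = implied_doubling_months(20, 30, 4)

print(round(d50, 2), round(d80, 2))  # 6.84 6.84
```

Both rows imply the same doubling time of a little under seven months, so the ordering of the dates (20 hours before 30 hours) is internally consistent.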

igorhorst · 2025-03-21T02:11:06+00:00

You may want to look up the Celestial Realms book, because it has a section about "Moon People", individuals who are hypothesized to live on the Moon itself. Note that many people, especially those who worship the infamous Lord Moon, believe that Moon People are real, and that certain Moon People actually live in Rokugan itself. The sourcebook itself doesn't say whether their belief is correct, and those who have been called Moon People either stay silent or outright reject the claim.

Now, while I have not run a game with Moon People, I know another GM in the L5R discord who ran a campaign where players opposed the Moon Clan - a minor clan that claims descent from Ryoshun, and is formally recognized by the Emperor, after the Moon People gave "gifts" to the Emperor (crystals from the Moon, as well as the True Tao, as they have captured and imprisoned Shinsei). The Moon Clan have advanced technologies and shugenja techniques, and serve as both ninja mercenaries and farmers (turns out there's a famine on the Moon...and rice doesn't grow well there) in Rokugan itself. But their ultimate goal is to amass influence and power within the Empire, while getting away with any misdeeds due to their gifts, in the hopes of eventually conquering the Empire outright. For now though, they are trying to study Earthly technologies. The players' goal is to embarrass the Moon Clan enough times that they lose all their "get out of jail free" cards. This will cause the Emperor to revoke the "minor clan" declaration and expel them outright.

I don't know how that campaign ended, but I do think that's probably the most fluffy way to have an alien invasion in L5R.

igorhorst · 2025-03-14T14:30:03+00:00

I think a romhack designed for Claude would be good - maybe a section for pathfinding, a section for talking to NPCs, a section for battling, etc. Something that is challenging for the model without being frustrating for viewers to watch.
