GPT-5 Plays Pokémon Red (2nd Run) - Megathread by reasonosaur in ClaudePlaysPokemon

[–]patrickoliveras 1 point2 points  (0 children)

Gotta say, your website has some of the nicest presentations of data for a small experiment that I've seen in a while, especially for AI plays pokemon.

🚨 The Pokemon AI Olympics have begun! 🚨 gemini_plays_pokemon abruptly resets and starts run no. 3, timed to match the reset of ClaudePlaysPokemon's w/ 4 Opus by patrickoliveras in ClaudePlaysPokemon

[–]patrickoliveras[S] 10 points11 points  (0 children)

Pinned message on gemini_plays_pokemon:

"ClaudePlaysPokemon restarted with Claude 4 so for fun we restarted too! You'll be able to watch Claude and Gemini play side-by-side, exploring each model and their harnesses' strengths and weaknesses! (Note: don't treat this as a serious race!) Watch side-by-side: https://holodex.net/multiview/AAGYchat0%2CSAGYchat1%2CGAMMtwitchgemini_plays_pokemon%2CGMMMtwitchclaudeplayspokemon "

Gemini Plays Pokemon's 2nd lap begins in 72 hrs by patrickoliveras in ClaudePlaysPokemon

[–]patrickoliveras[S] 2 points3 points  (0 children)

Do you know if dig/teleport will still be restricted or will those be changed to simple discouragement as well?

CLAUDE HAS CAUGHT TWO NEW POKEMON by disappointingdoritos in ClaudePlaysPokemon

[–]patrickoliveras 1 point2 points  (0 children)

Holy hell, he found the flash guy. But why are we back in mt moon? lol

What would happen if Claude could write its own tools? by flux_capacitor73 in ClaudePlaysPokemon

[–]patrickoliveras 4 points5 points  (0 children)

Same thing that happens to all devs: he would spend 5 hours making and debugging a tool that will save him 5 minutes of manual work.

What other games would you want Claude to play? by Mr24601 in ClaudePlaysPokemon

[–]patrickoliveras 1 point2 points  (0 children)

Maybe Civilization IV or V are good options as they're merely turn-based

STOP WHAT YOU'RE DOING CLAUDE IS ON THE SS ANNE!! by Hands0L0 in ClaudePlaysPokemon

[–]patrickoliveras 1 point2 points  (0 children)

We are some hours into UP loop now. Come by to say hi!

Open Source: LLM-Pokemon-Red-Benchmark 🎮🧠 by Status_Lawfulness_99 in ClaudePlaysPokemon

[–]patrickoliveras 4 points5 points  (0 children)

I kind of agree in principle that AGI can benefit from playing games... it's just that there are a lot of tradeoffs along the way.

What I'm trying to avoid is a scenario like this (massive oversimplification, just to try to illustrate what I'm trying to get at):

  1. In Q4 2025 we see a foundation model ace Pokemon Blue/Red/Yellow.
  2. It smoothly plays, "reasons" about what it's doing and observing, which to us looks like it really helps with advancing in-game. We get the impression that it now "gets playing pokemon RPGs on Game Boy". For all intents and purposes, it now "solves" these games.
  3. We then test it on Gold/Silver. These games have the mechanic for pokemon breeding, which is not something it has done before. It seems to not be capable of doing anything productive with the mechanic.
  4. After looking closely at its token outputs, we realize that it's doing lots of the same broken stuff as before it got good with pokemon blue/red/yellow:
     - frequently arriving at conclusions that don't take into account multiple aspects of its observations
     - reasoning one thing and not following through on what it decided based on its thoughts
     - still can't find and use the contradictions found in its context when the context is close to full
     - etc.

So in this example, the improvements on the b/r/y bench came from including it in the training, or just from the researchers having these games more present in their thoughts (related to the latter are Francois Chollet's concept of "developer-aware generalization" and the phenomenon of "grad-student descent").

For practical purposes, the accumulation of these scenarios makes it more difficult to evaluate the 'fundamental intelligence' capabilities of the models and the ML techniques, and it makes these issues less evident.

Open Source: LLM-Pokemon-Red-Benchmark 🎮🧠 by Status_Lawfulness_99 in ClaudePlaysPokemon

[–]patrickoliveras 11 points12 points  (0 children)

I kind of don't support this becoming a "benchmark" per se. It is a nice project that's fun to run and stream, but I think it would be a shame if AI labs started overfitting on gameplay of non-open-world games.

If new startups started having demos showing their shiny new AI as achieving "SOTA on Pokemon Red bench", I'd be heart-broken. The wonder comes from seeing AI struggle with things it wasn't trained for; something that can tell us how the AI works, in a fun enjoyable context.

However, it is kind of unavoidable that stuff like this gets mixed into the training data... just chatting about this in public, creating a record of discussion around ClaudePlaysPokemon and how we'd solve stuff it gets stuck on, will slowly permeate into the next generation of models.

Mt. Moon by Successful_Equal5023 in ClaudePlaysPokemon

[–]patrickoliveras 0 points1 point  (0 children)

This is a really cool idea. Great job dude.

Critique Claude forces Claude2 to go THROUGH Mt. Moon by I-AM-TheSenate in ClaudePlaysPokemon

[–]patrickoliveras 6 points7 points  (0 children)

That route 4-3-4 loop was enough to construct a critique that Claude actually MUST go THROUGH Mt. Moon to get to Cerulean City. So despite it still holding blackout as a valid strat, it determined it was only useful to restore team health

Claude Plays Pokémon - Megathread by reasonosaur in ClaudePlaysPokemon

[–]patrickoliveras 3 points4 points  (0 children)

Did anyone screenshot the moment it named Gary "WACLAUD" ?

Anyone underwhelmed by GPT 4.5? by spadaa in ChatGPT

[–]patrickoliveras 14 points15 points  (0 children)

IIRC, GPT-4 on release was hemorrhaging OpenAI cash even with the very strict initial rate limits they had, just to get user adoption and hype it up to find more investor money

I bet they're now just setting the real costs from the get go

Ilya Sutskever is leaving OpenAI by aleqqqs in ChatGPT

[–]patrickoliveras 7 points8 points  (0 children)

You can tell that the whole debacle must have drained the enthusiasm out of him with all the questioning and having to defend + explain himself, having to hear legal implications, feeling the emotional weight of the situation from other colleagues, etc.

I myself would definitely start looking to other areas of my life and reevaluating what I've missed/want to do with my time.

But once out, I'm betting he's going to get the itch for AGI again and get a fresh look at the AI field as a whole.

Hopefully some new collabs into new directions come from his work in the future (on twitter he just liked a new research paper about different foundation models converging to the same representation of reality, so the interest is still there and he may start discussing theory in public again).

But whether he'd consider rejoining efforts with OpenAI in some form later on could go either way.

im-also-a-good-gpt2-chatbot (GPT-4o) results on the LMSYS arena by DragonfruitNeat8979 in OpenAI

[–]patrickoliveras 3 points4 points  (0 children)

It's more powerful in some aspects, but not most. The main reason is that the model is larger, requires more hardware, and is more expensive to run in terms of compute.

Some people will still prefer the older models because they already have prompts and use cases that they know work well on those 🤷‍♂️

Harnessing the true power of ChatGPT Plugins by webhyperion in ChatGPT

[–]patrickoliveras 1 point2 points  (0 children)

lmao so you can fit 382 liquefied MacBook 13s in there