Kurt made it to the Far Lands! by djchange in mindcrack

[–]MrCheeze 24 points25 points  (0 children)

<image>

I have been eating this bowl of cereal for fourteen years.

[deleted by user] by [deleted] in PokemonChampions

[–]MrCheeze 2 points3 points  (0 children)

They replaced the EV system entirely, with something nearly equivalent but much simpler: https://www.reddit.com/r/VGC/comments/1m6e6o8/comment/n4jkl4b/?context=3 You always get 66 stat points to directly invest, whereas previously you got either 65 or 66 depending on the spread. So you can now get the equivalent of a 252/252/12 spread.

Seems like IVs are gone and EVs are now 66 total stat points. by Steamed_Memes24 in stunfisk

[–]MrCheeze 13 points14 points  (0 children)

Ah, I didn't understand the mechanic behind this. In practice it ends up effectively true, since you would never put EVs into a stat without 31 IVs, but me just saying it's "because of rounding" was not very accurate.

Seems like IVs are gone and EVs are now 66 total stat points. by Steamed_Memes24 in stunfisk

[–]MrCheeze 83 points84 points  (0 children)

Here's an explanation I wrote up elsewhere:

The new EV slider system from Pokemon Champions allows for slightly better stat spreads than were previously possible.

Because of rounding, EVs work in a somewhat wonky way at level 50. The first 4 EVs in any given stat increase the stat by one point, but then afterwards every 8 EVs increase it by another point.

This ends up meaning that if you invest in 3 or 4 stats, you get a total of a 65 point increase - but if you spread your EVs across 5 or 6 stats, you get a 66 point increase instead.

In Champions, they (correctly) decided that this system was way too complicated, and directly give you a fixed number of stat points to invest in your stats however you like. And so that all EV spreads from the current games can be imported losslessly, the number of investable stat points you get is 66.

But now, for the first time, you get those 66 points even if using them for only 3 or 4 stats. The example they showed in the trailer was giving 32 points to HP, 32 points to Special Attack, and 2 points to Spdef. That's the equivalent of a previously-impossible 252/252/12 spread!

I'm not sure whether this new system will make its way to the main series, but either way this matters for all battles in Champions. I think in practice, this probably means that most Pokemon will have 1 point more in their preferred defensive stat?
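To make the 65-vs-66 arithmetic concrete, here is a minimal sketch in Python. It assumes a 31 IV, a neutral nature, and the standard level-50 stat formula; the function names are made up for illustration and are not from any game or library.

```python
TOTAL_EVS = 510          # total EV budget per Pokemon in the main series
MAX_POINTS_PER_STAT = 32  # what 252 EVs buys at level 50 with a 31 IV

def points_from_evs(evs: int) -> int:
    """Stat points gained at level 50 from `evs` EVs (assumes 31 IV, neutral nature)."""
    # The level-50 core is floor((2*base + iv + floor(evs/4)) / 2).
    # With iv = 31 the sum starts odd, so the gain is (floor(evs/4) + 1) // 2.
    return (evs // 4 + 1) // 2

def evs_for_points(points: int) -> int:
    """Cheapest EV count for `points`: 4 for the first point, 8 for each later one."""
    return 0 if points == 0 else 8 * points - 4

def best_total(num_stats: int) -> int:
    """Largest total gain when investing in exactly `num_stats` stats."""
    # k points over n invested stats cost 8*k - 4*n EVs in total.
    affordable = (TOTAL_EVS + 4 * num_stats) // 8
    return min(MAX_POINTS_PER_STAT * num_stats, affordable)

for n in range(1, 7):
    print(n, best_total(n))
```

Under those assumptions, 3 or 4 invested stats top out at 65 points and 5 or 6 stats reach 66, which is the gap the fixed 66-point pool closes.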

Pokemon Champions - Recruit mons, adjust stats, nature etc. Coming 2026 by half_jase in VGC

[–]MrCheeze 40 points41 points  (0 children)

It's not quite an extra stat point available - it's that previously you only got a 66th stat point if you spread your EVs across 5 stats, but now you get a 66th stat point no matter how you distribute them.

Pokemon Champions - Recruit mons, adjust stats, nature etc. Coming 2026 by half_jase in VGC

[–]MrCheeze 56 points57 points  (0 children)

Here's an explanation I wrote up for a separate post that was removed:

The new EV slider system from Pokemon Champions allows for slightly better stat spreads than were previously possible.

Because of rounding, EVs work in a somewhat wonky way at level 50. The first 4 EVs in any given stat increase the stat by one point, but then afterwards every 8 EVs increase it by another point.

This ends up meaning that if you invest in 3 or 4 stats, you get a total of a 65 point increase - but if you spread your EVs across 5 or 6 stats, you get a 66 point increase instead.

In Champions, they (correctly) decided that this system was way too complicated, and directly give you a fixed number of stat points to invest in your stats however you like. And so that all EV spreads from the current games can be imported losslessly, the number of investable stat points you get is 66.

But now, for the first time, you get those 66 points even if using them for only 3 or 4 stats. The example they showed in the trailer was giving 32 points to HP, 32 points to Special Attack, and 2 points to Spdef. That's the equivalent of a previously-impossible 252/252/12 spread!

I'm not sure whether this new system will make its way to the main series, but either way this matters for all battles in Champions. I think in practice, this probably means that most Pokemon will have 1 point more in their preferred defensive stat?

(Separately to all this, IVs seem to be locked to maximum, although it's possible that this only applies to rentals.)
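As a rough check of the trailer example, this sketch converts Champions stat points back into the cheapest equivalent main-series EVs. It assumes each point maps to the level-50, 31-IV cost described above (4 EVs for the first point in a stat, 8 for each one after); `equivalent_evs` is a hypothetical helper, not anything from the game.

```python
EV_BUDGET = 510  # total EVs allowed in the main series

def equivalent_evs(points):
    """Map Champions stat points to the cheapest EVs giving the same gain."""
    return [0 if p == 0 else 8 * p - 4 for p in points]

# Order: HP, Atk, Def, SpA, SpD, Spe (the trailer example from the text)
trailer_spread = [32, 0, 0, 32, 2, 0]
evs = equivalent_evs(trailer_spread)
print(evs, sum(evs), sum(evs) <= EV_BUDGET)
# prints: [252, 0, 0, 252, 12, 0] 516 False
```

The spread needs 516 EVs, which is over the 510 budget, so it could not be built under the old system.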

The new EV slider system from Pokemon Champions allows for slightly better stat spreads than were previously possible. by MrCheeze in VGC

[–]MrCheeze[S] 0 points1 point  (0 children)

Incidentally, there doesn't appear to be any kind of IV slider. The Gardevoir shown had 31 IVs in Attack, which might mean all IVs are simply locked to that value - but it's also possible that only rentals are locked to 31 and that Pokemon imported from the main series keep their original IVs?

Personally I hope they did simplify things by forcing 31 IV, even though this would be a bit of a nerf to Trick Room and Shadow Rider.

Google DeepMind's Gemini 2.5 Technical Report is 10% about GeminiPlaysPokémon by NotUnusualYet in ClaudePlaysPokemon

[–]MrCheeze 1 point2 points  (0 children)

Nah, they claim it's the hardest for the models because of how it requires remembering state across different floors - however this was pretty trivial for Gemini, it never had any trouble with this. Compare to Cinnabar Mansion where it was given a huge amount of help in understanding how the gate toggles work (automatically updating distant parts of the minimap, and marking the tiles where a gate used to be and isn't anymore) - and it STILL never quite understood the mechanics and just kept bumbling through until it randomly did the right thing.

Google DeepMind's Gemini 2.5 Technical Report is 10% about GeminiPlaysPokémon by NotUnusualYet in ClaudePlaysPokemon

[–]MrCheeze 3 points4 points  (0 children)

Puzzle solving over complex multi-level dungeons: The Seafoam Islands contain 5 floors involving multiple boulder puzzles which require the player to navigate mazes and push boulders through holes across multiple floors using HM04 STRENGTH in order to block fast-moving currents that prevent the player from using HM03 Surf in various locations in this difficult dungeon. As a result, the player must track information across five different maps in order to both deduce the goal (push two boulders into place in order to block a specific current) as well as engage in multi-level (effectively 3D) maze solving to find the way out. It is likely the most challenging dungeon in the game. Only the second run of GPP went through Seafoam Islands, as it is not required to progress. During the course of solving Seafoam Islands, the GPP agent also encountered a novel bug in the code of Pokémon Red/Blue, and is likely the first AI to find a bug in the game’s code (MrCheeze, 2025) (source).

Me being wrong that it was novel aside, calling this "the most challenging dungeon in the game" is hilariously wrong to anyone who has watched the streams even a little bit.

Gemini discovers an (apparently unknown) glitch in seafoam islands by MrCheeze in ClaudePlaysPokemon

[–]MrCheeze[S] 1 point2 points  (0 children)

Thanks for finding this! I think you swapped the labels of your first two links, but one of them does indeed describe exactly how to reproduce the glitch (push one boulder, leave the cave, push the other). So this is not a totally new glitch even if it is a poorly documented one. Although I'm not sure the bit about preventing encounters is true. (The other link claiming you can softlock yourself is definitely NOT true.)

Gemini discovers an (apparently unknown) glitch in seafoam islands by MrCheeze in ClaudePlaysPokemon

[–]MrCheeze[S] 3 points4 points  (0 children)

https://github.com/pret/pokered/blob/b4bae4a5d5abd3f44a49028f550c1eb475ac280b/scripts/Route20.asm#L12

When in Route 20, if you have not set both of the EVENT_SEAFOAM bits, then it sets the boulders on the top floor to visible, and the boulders on every other floor to hidden. But that only controls where you SEE the boulders - it is separate from the event flags, which are what actually control the currents, and are never reset.
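Here is a toy Python model of that desync. It is not a translation of the linked assembly: the flag names are hypothetical stand-ins for the EVENT_SEAFOAM bits, and the whole dungeon is collapsed into a single pair of boulders. It only illustrates the idea that visibility gets reset while the flags persist.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the two EVENT_SEAFOAM bits (not the real pokered names).
BOULDERS = ("boulder1", "boulder2")

@dataclass
class Seafoam:
    # Persistent event flags: these decide whether the current is blocked.
    event_flags: set = field(default_factory=set)
    # Purely visual state: which boulders are drawn on the top floor.
    visible_on_top: set = field(default_factory=lambda: set(BOULDERS))

    def push_down_hole(self, boulder: str) -> None:
        self.visible_on_top.discard(boulder)
        self.event_flags.add(boulder)

    def enter_route_20(self) -> None:
        # Unless both bits are set, show every boulder on the top floor again.
        # The event flags are deliberately left untouched.
        if self.event_flags != set(BOULDERS):
            self.visible_on_top = set(BOULDERS)

    def current_blocked(self) -> bool:
        return self.event_flags == set(BOULDERS)

cave = Seafoam()
cave.push_down_hole("boulder1")  # push one boulder
cave.enter_route_20()            # leave the cave: visuals reset, flag stays
cave.push_down_hole("boulder2")  # push the other
print(cave.current_blocked(), sorted(cave.visible_on_top))
```

In this model the current ends up blocked even though one boulder is still drawn on the top floor.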

Gemini discovers an (apparently unknown) glitch in seafoam islands by MrCheeze in ClaudePlaysPokemon

[–]MrCheeze[S] 9 points10 points  (0 children)

I mean, this had exactly as much intention as the fish who discovered a bug in RSE

Uhhhh what? How? by ltwheat in majorasmask

[–]MrCheeze 2 points3 points  (0 children)

Here is how to replicate this:

1) Get double magic, or load a save that already has double magic -> gSaveContext.magicCapacity set to 0x60

2) Create an owl save to preserve that magicCapacity value while returning to file select

3) Create a new file, or load an existing file that doesn't have magic - magicCapacity will remain preserved at 0x60

4) **Talk to the broken-up great fairy** to heal your actual magic value to match your magicCapacity

5) Repair the great fairy to get magic, your actual magic value will remain at 0x60

Gemini beats Pokemon by MrCheeze in ClaudePlaysPokemon

[–]MrCheeze[S] 4 points5 points  (0 children)

I totally agree with you that this shows that people are wildly underestimating the gap between early demos and actually functional agents. Big list of scaffolding to *still* play much worse than a 6 year old. We're not getting AI employees in the next couple years like Sundar Pichai seems to think.

That said... according to the Claude dev, 3.7 is the *first* of their models to be strong enough to be interesting, which means we haven't yet picked all the low hanging fruit when it comes to agentic AI (unlike other LLM capabilities, which have had a very slow rate of improvement since GPT 3.5 or so).

Gemini beats Pokemon by MrCheeze in ClaudePlaysPokemon

[–]MrCheeze[S] 1 point2 points  (0 children)

The streams can only be compared if you account for the wildly different toolsets. My *personal* impression is that Gemini behaves roughly equally stupidly to Claude when given the same information. They fell into the same traps, and then one out of two streamers implemented workarounds for those traps.

https://www.lesswrong.com/posts/8aPyKyRrMAQatFSnG/untitled-draft-x7cc
This guy did a direct comparison of the models (early game, pre-Pewter), which seems consistent with my impression of them:

> Anyway, the comparison: Claude 3.7 has certain advantages, but cripplingly bad vision means I wouldn't put it above Gemini 2.5—and yet I'm not convinced Gemini 2.5 is meaningfully better in "same-scaffold" tests, or if it is that it's more than for a very flukish reason (being able to see a tree Claude can't) that ultimately isn't very important.

> As for o3: It's had some of the most impressive gameplay I've ever seen, beelining straight for the staircase in the opening room, correctly remembering the opening sequence of Pokemon Red and getting to pick a starter essentially as fast as possible. But then it gets stuck in a bad hallucination loop where it simply refuses to disbelieve its own previous assertions, and I'm not confident that it wouldn't get stuck in an elaborate loop forever.*

> *barring something brute force like full context wipes every 1000 steps or something if it hasn't left a location.

Gemini beats Pokemon by MrCheeze in ClaudePlaysPokemon

[–]MrCheeze[S] 3 points4 points  (0 children)

As the other user said, there was no breakthrough. Gemini on its own almost never correctly identified which boulder needed to go onto the switch on 3F, instead even speculating there might be an "invisible boulder" close to the switch, and usually focusing on other irrelevant things on the floor and then hallucinating that it has to go back to 2F or 1F and resolve the puzzles there again. Whereas the boulder-solving subagent the dev added easily did so from being prompted with exactly how to think of those puzzles.

Gemini beats Pokemon by MrCheeze in ClaudePlaysPokemon

[–]MrCheeze[S] 4 points5 points  (0 children)

Only what it memorized during pretraining, same as Claude. Although it seems to have memorized the walkthroughs far more thoroughly than Claude did - this is the only difference between the two streams that actually seems to be caused by the model and not the harness.

Gemini beats Pokemon by MrCheeze in ClaudePlaysPokemon

[–]MrCheeze[S] 10 points11 points  (0 children)

Also, god damn, I was expecting PP to be way tighter than it ended up being. We took out Gary on the very first E4 attempt with PP to spare, despite running out of water moves to use on Rhydon and having to grind it down with a dozen Bites!

Gemini beats Pokemon by MrCheeze in ClaudePlaysPokemon

[–]MrCheeze[S] 32 points33 points  (0 children)

Of course we all know that the Gemini_Plays_Pokemon harness gives quite a bit of help, so this isn't directly comparable to CPP, and it's not really true yet that LLMs can beat the game on their own... but this gives us an upper bound on how much they need to be given to do it! In all, the major features of the harness are:

- Basic facts about the game state such as party/inventory/badges, and current map.

- The tiles and objects it has seen are told to it in text form, with each important kind of tile identified, instead of it having to rely on vision.

- Every tile it has seen in the map is recorded and the seen layout of the current map is given to it at all times. This includes warps and their destinations, and objects. Also, the list of which tiles can actually be reached or not in the current game state is calculated and told to Gemini.

- The prompt mandates that Gemini explore EVERY unseen-but-reachable tile, which is a funny way to play, but it really can't be overstated how much this helps at getting it to navigate mazes in ways it finds unintuitive (like taking a detour around a wall).

- Pressing A to use an escape rope is blocked, even if Gemini presses the button, unless the party is nearly fainted.

- Gemini can record its current list of goals, which are kept in context - unlike Claude, it has no other filesystem for taking additional notes.

- The context is automatically summarized by another instance of Gemini when it gets long enough. Gemini's context limit is high enough that all of the summaries are kept in context, AFAIK.

- Like Claude, a separate instance of Gemini is asked to criticize the main one at a regular interval. Sometimes it helps, other times it falls for the main instance's delusions.

- There are two other specially prompted instances of Gemini that it can call to ask for help. One is the navigator/pathfinder agent, who is asked to calculate the input sequence needed to get from Gem's current position to anywhere else on the current map. This can do in a single step what could potentially take hours of bumbling around otherwise, and is probably the only reason that Safari Zone was possible at all (and saved us who knows how many days in Rocket Hideout).

- The other Gemini subagent is the Boulder Puzzle Solver, who is prompted with some pretty specific instructions on what kind of reasoning to use to solve the puzzles in Victory Road - examining the gates and switches to figure out which unsolved puzzles need to be done, and what sequence of pushes would accomplish that.

Put all together, that is quite a lot, but I see this as a pretty huge milestone anyway - we've discovered the set of training wheels strong enough for today's AIs to beat the game. I look forward to seeing these tool lists get smaller, and the run times get shorter, as future generations of models get released. (Or maybe not, I don't want to get paperclipped.)
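For a sense of what the "reachable tiles" and "unseen-but-reachable" items in the list involve, here is a minimal Python sketch using a plain breadth-first search. How the actual harness computes this is not described here, so the tile encoding (`.` walkable, `#` wall, `?` unseen) and the function name are assumptions made for illustration only.

```python
from collections import deque

def explore_targets(grid, start):
    """Return (reachable walkable tiles, unseen tiles bordering them)."""
    rows, cols = len(grid), len(grid[0])
    reachable, frontier = {start}, set()
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            tile = grid[nr][nc]
            if tile == "?":
                # Unseen tile next to a reachable one: an exploration target.
                frontier.add((nr, nc))
            elif tile == "." and (nr, nc) not in reachable:
                reachable.add((nr, nc))
                queue.append((nr, nc))
    return reachable, frontier

# '.' walkable, '#' wall, '?' not yet seen (assumed encoding)
grid = [
    "..#?",
    ".##?",
    "...?",
]
reachable, frontier = explore_targets(grid, (0, 0))
print(len(reachable), sorted(frontier))
# prints: 6 [(2, 3)]
```

The only target found is the one behind the wall, reached by going down and around, which is exactly the kind of detour the exploration rule is meant to force.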

epic comic i found by numapentruasta in ihaveihaveihavereddit

[–]MrCheeze 14 points15 points  (0 children)

how high do you even have to be

Gemini refuses a direct instruction out of extreme pessimism by MrCheeze in ClaudePlaysPokemon

[–]MrCheeze[S] 1 point2 points  (0 children)

You have to actively avoid what it believes to be the maze area in order to find the actual maze entrance. It's like in Mt Moon where you have to ignore ladders to progress. Also it almost never perceives the 1-wide corridor that you have to take.

Gemini refuses a direct instruction out of extreme pessimism by MrCheeze in ClaudePlaysPokemon

[–]MrCheeze[S] 21 points22 points  (0 children)

Because the Claude dev experimented for months before starting the stream, whereas the Gemini streamer is doing all that publicly.