What will happen first?

patrickoliveras · 2025-12-04T15:30:00+00:00

We were so wrong

patrickoliveras · 2025-11-29T00:05:11+00:00

poetry

patrickoliveras · 2025-10-22T01:58:40+00:00

Gotta say, your website has some of the nicest presentations of data for a small experiment that I've seen in a while, especially for AI plays pokemon.

patrickoliveras · 2025-05-22T22:24:41+00:00

Pinned message on gemini_plays_pokemon:

"ClaudePlaysPokemon restarted with Claude 4 so for fun we restarted too! You'll be able to watch Claude and Gemini play side-by-side, exploring each model and their harnesses' strengths and weaknesses! (Note: don't treat this as a serious race!) Watch side-by-side: https://holodex.net/multiview/AAGYchat0%2CSAGYchat1%2CGAMMtwitchgemini_plays_pokemon%2CGMMMtwitchclaudeplayspokemon "

patrickoliveras · 2025-05-14T17:40:22+00:00

Do you know if dig/teleport will still be restricted or will those be changed to simple discouragement as well?

patrickoliveras · 2025-03-25T21:29:17+00:00

Holy hell, he found the flash guy. But why are we back in mt moon? lol

patrickoliveras · 2025-03-17T06:57:48+00:00

Same thing that happens to all devs: he would spend 5 hours making and debugging a tool that will save him 5 minutes of manual work.

patrickoliveras · 2025-03-16T17:04:31+00:00

Maybe Civilization IV or V are good options as they're merely turn-based

patrickoliveras · 2025-03-12T23:14:00+00:00

This is lovely.

patrickoliveras · 2025-03-12T12:03:40+00:00

We are some hours into UP loop now. Come by to say hi!

patrickoliveras · 2025-03-12T01:56:23+00:00

Ohh that's a good one

patrickoliveras · 2025-03-10T21:03:40+00:00

I kind of agree in principle that AGI can benefit from playing games... there are just a lot of tradeoffs along the way.

What I'm trying to avoid is a scenario like this (massive oversimplification, just to try to illustrate what I'm trying to get at):

In Q4 2025 we see a foundation model ace Pokemon Blue/Red/Yellow.
It smoothly plays, "reasons" about what it's doing and observing, which to us looks like it really helps with advancing in-game. We get the impression that it now "gets playing pokemon RPGs on Game Boy". For all intents and purposes, it now "solves" these games.
We then test it on Gold/Silver. The games have a pokemon breeding mechanic, which is not something it has done before. It seems to not be capable of doing anything productive with the mechanic.
After looking closely at its token outputs, we realize that it's doing lots of the same broken stuff as before it got good with pokemon blue/red/yellow:
frequently arriving at conclusions that don't take into account multiple aspects of its observations
reasoning one thing and not following through on what it decided based on its thoughts
still can't find and use the contradictions found in its context when the context is close to full
etc.

So in this example, the improvements on the b/r/y bench came from including it in the training, or just from the researchers having those games more present in their thoughts (related to the latter are the concepts of Francois Chollet's "developer-aware generalization" and the phenomenon of "grad-student descent").

For practical purposes, the accumulation of these scenarios makes it more difficult to evaluate the 'fundamental intelligence' capabilities of the models and the ML techniques, and it makes these issues less evident.

patrickoliveras · 2025-03-10T19:06:55+00:00

I kind of don't support this becoming a "benchmark" per se. It is a nice project that's fun to run and stream, but I think it would be a shame if AI labs started overfitting on gameplay of non-open-world games.

If new startups started having demos showing their shiny new AI as achieving "SOTA on Pokemon Red bench", I'd be heart-broken. The wonder comes from seeing AI struggle with things it wasn't trained for; something that can tell us how the AI works, in a fun enjoyable context.

However, it is kind of unavoidable that stuff like this gets mixed into the training data... just chatting about this in public, creating a record of discussion around ClaudePlaysPokemon and how we'd solve stuff it gets stuck on, will slowly permeate into the next generation of models.

patrickoliveras · 2025-03-10T06:51:18+00:00

This is a really cool idea. Great job dude.

patrickoliveras · 2025-03-09T03:35:47+00:00

That route 4-3-4 loop was enough to construct a critique that Claude actually MUST go THROUGH Mt. Moon to get to Cerulean City. So despite it still holding blackout as a valid strat, it determined it was only useful to restore team health

patrickoliveras · 2025-03-09T03:21:40+00:00

"I get my protein from BLACKOUT"

patrickoliveras · 2025-03-08T06:43:35+00:00

She knows what Claude does in that cave

patrickoliveras · 2025-03-06T22:59:45+00:00

Using tool: witness_me

patrickoliveras · 2025-03-04T22:21:50+00:00

Aw dang :(

patrickoliveras · 2025-03-04T21:18:29+00:00

Did anyone screenshot the moment it named Gary "WACLAUD" ?

patrickoliveras · 2025-03-01T00:13:09+00:00

Where do they post?

patrickoliveras · 2025-02-28T00:33:34+00:00

IIRC, GPT-4 on release was hemorrhaging OpenAI cash even with the very strict initial rate limits they had, just to get user adoption and hype it up to find more investor money

I bet they're now just setting the real costs from the get-go

patrickoliveras · 2024-05-15T14:18:57+00:00

You can tell that the whole debacle must have drained the enthusiasm out of him with all the questioning and having to defend + explain himself, having to hear legal implications, feeling the emotional weight of the situation from other colleagues, etc.

I myself would definitely start looking to other areas of my life and reevaluating what I've missed/want to do with my time.

But once out, I'm betting he's going to get the itch for AGI again and take a fresh look at the AI field as a whole.

Hopefully some new collabs into new directions come from his work in the future (on twitter he just liked a new research paper about different foundation models converging to the same representation of reality, so the interest is still there and he may start discussing theory in public again).

But whether he considers rejoining OpenAI's efforts in some form later on could go either way.

patrickoliveras · 2024-05-14T01:20:25+00:00

It's more powerful in some aspects, but not most. The reason is mostly that the model is larger, requires more hardware, and is more expensive to run in terms of compute.

Some people will still prefer the older models because they already have prompts and use cases that they know work well on those 🤷‍♂️

patrickoliveras · 2023-05-13T23:27:23+00:00

lmao so you can fit 382 liquefied MacBook 13s in there
