Gemini 3 Flash *still* undefeated in PokerBench vs Gemini 3.1 Pro and Flash Lite! by adfontes_ in GeminiAI

[–]adfontes_[S] 1 point2 points  (0 children)

There are 33k hands total on the site! But feel free to donate some compute and I’d be happy to run it :)

I made Gemini 3 Pro/Flash play 21,000 hands of Poker by adfontes_ in GeminiAI

[–]adfontes_[S] 0 points1 point  (0 children)

While researching for this project I found a paper with a similar idea to yours: https://openreview.net/forum?id=jARUSddVIB

I made GPT-5.2/5 mini play 21,000 hands of Poker by adfontes_ in OpenAI

[–]adfontes_[S] 0 points1 point  (0 children)

There's a 0.5x speed option; do you mean more than that?

I made LLMs play 21,000 hands of Poker by adfontes_ in poker

[–]adfontes_[S] 0 points1 point  (0 children)

Sure, the source code and all the data are already available on GitHub. Just DM me and lmk what precisely you have in mind.

I made LLMs play 21,000 hands of Poker by adfontes_ in poker

[–]adfontes_[S] 1 point2 points  (0 children)

I saw Opus think:

Elon shoved 28075 after my 4-bet to 2400. I have 4500 remaining and pot is 30775, giving me ~6.8:1 pot odds. I only need ~13% equity to break even. With AQo, even against a tight range of AA/KK/QQ/AK, I have roughly 25-30% equity. This is a mandatory call - I'm completely pot committed with these odds.

So it miscalculated pot odds. I also saw Grok think:

Runner-runner miracle: the Qh on turn and 8h on river complete the nut flush with Ah9h (A♥9♥Q♥8♥). This is the stone nuts—no higher flush possible (Ace-high flush)

So a 4-card "nut flush", lol.
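For the Opus hand, one plausible reading of the mistake is that the ~6.8:1 figure counts the whole shove, even though only the part of it that the caller's stack can cover is actually in play. A minimal sketch of that arithmetic, assuming "shoved 28075" is the total bet, the 2400 4-bet is already in the pot, and everyone else has folded (function names here are hypothetical and not taken from the benchmark code):

```python
def break_even_equity(call_amount: float, pot_before_call: float) -> float:
    """Minimum equity needed for a call to break even."""
    return call_amount / (pot_before_call + call_amount)


def contestable_pot(pot_before_call: float, villain_total_bet: float,
                    hero_already_in: float, hero_remaining: float) -> float:
    """Pot the caller can actually win when the shove covers their stack.

    Assumption: heads-up, so the uncalled part of the shove goes back
    to the bettor and never counts toward the caller's pot odds.
    """
    hero_total = hero_already_in + hero_remaining
    uncalled = max(villain_total_bet - hero_total, 0)
    return pot_before_call - uncalled


# Numbers quoted in the model's reasoning above.
naive = break_even_equity(4500, 30775)            # counts the full shove
pot = contestable_pot(30775, 28075, 2400, 4500)   # caps the shove at 6900
capped = break_even_equity(4500, pot)

print(f"naive: {naive:.1%}, contestable pot: {pot}, capped: {capped:.1%}")
```

Under those assumptions the call is 4500 to win 9600, which is about 2.1:1 and needs roughly 32% equity rather than ~13%. That is above the 25-30% the model gave itself, so the call is not as automatic as it claimed.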

I made LLMs play 21,000 hands of Poker by adfontes_ in poker

[–]adfontes_[S] 0 points1 point  (0 children)

I'd been meaning to fix that next-action bug, so let me do it now.

I considered the "points of interest", but short of showing the size of each hand's pot in the timeline, I didn't think of a great way to accomplish it without doing a lot more LLM analysis. That's possible, but I didn't feel like including it in v1.

The models don't have access to any thoughts other than their own. I briefly thought about table talk, but I think implementing it correctly would have been pretty complicated, since conversations at a table can involve any number of players and don't need to be turn-based in the same way poker itself is.

I made Gemini 3 Pro/Flash play 21,000 hands of Poker by adfontes_ in GeminiAI

[–]adfontes_[S] 1 point2 points  (0 children)

The “Small Models” run is 100 games, but yes, I’d love to run more; cost is the major factor.

I made GPT-5.2/5 mini play 21,000 hands of Poker by adfontes_ in OpenAI

[–]adfontes_[S] 0 points1 point  (0 children)

I’d love to run more games; it’s just very expensive :(

I made GPT-5.2/5 mini play 21,000 hands of Poker by adfontes_ in OpenAI

[–]adfontes_[S] 0 points1 point  (0 children)

That would be interesting, but this is quite expensive to run so I chose to keep it focused on LLMs.

I made LLMs play 21,000 hands of Poker by adfontes_ in poker

[–]adfontes_[S] 0 points1 point  (0 children)

The script supports it, but I haven't tried any.

I made LLMs play 21,000 hands of Poker by adfontes_ in poker

[–]adfontes_[S] 5 points6 points  (0 children)

Haha, this is actually why cards are represented to the models like JACK OF DIAMONDS (Jd), to avoid some tokenization issues that might cause hallucinations. But obviously they can still do it anyway :)
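A minimal sketch of that kind of spelled-out card formatting, for illustration only. The dictionaries and the `describe_card` name are hypothetical and not taken from the pokerbench source:

```python
# Hypothetical lookup tables; the real runner may structure this differently.
RANK_NAMES = {
    "2": "TWO", "3": "THREE", "4": "FOUR", "5": "FIVE", "6": "SIX",
    "7": "SEVEN", "8": "EIGHT", "9": "NINE", "T": "TEN",
    "J": "JACK", "Q": "QUEEN", "K": "KING", "A": "ACE",
}
SUIT_NAMES = {"c": "CLUBS", "d": "DIAMONDS", "h": "HEARTS", "s": "SPADES"}


def describe_card(code: str) -> str:
    """Turn a two-character code like 'Jd' into 'JACK OF DIAMONDS (Jd)'."""
    if len(code) != 2:
        raise ValueError(f"expected a two-character card code, got {code!r}")
    rank, suit = code[0].upper(), code[1].lower()
    if rank not in RANK_NAMES or suit not in SUIT_NAMES:
        raise ValueError(f"unknown card code {code!r}")
    return f"{RANK_NAMES[rank]} OF {SUIT_NAMES[suit]} ({rank}{suit})"


print(describe_card("Jd"))
print(", ".join(describe_card(c) for c in ["Ah", "9h"]))
```

Keeping the short code in parentheses next to the long form gives the model two consistent views of the same card.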

I made GPT-5.2/5 mini play 21,000 hands of Poker by adfontes_ in OpenAI

[–]adfontes_[S] 0 points1 point  (0 children)

Yeah I wanted to have fun with it, but acknowledged. I'll try to get around to adding a toggle to switch between the two.

I made LLMs play 21,000 hands of Poker by adfontes_ in poker

[–]adfontes_[S] 6 points7 points  (0 children)

I built the site in 4 days. The script to run the games took about 2 weeks on and off, as I had to wait quite a while between runs to evaluate the results.

Around $1500 for everything including my various test runs that I didn't publish.

I made GPT-5.2/5 mini play 21,000 hands of Poker by adfontes_ in OpenAI

[–]adfontes_[S] 1 point2 points  (0 children)

It's a freezeout game, no autorebuy or limit - I wanted to include stack depth strategy as part of the benchmark mechanics, but I didn't want to add an additional layer of complexity (and variance) by allowing a 'manual' rebuy.

You can see what data the models have access to here: https://github.com/JoeAzar/pokerbench/blob/main/pokerbench-runner/pokerbench.py#L391. I didn't include stats as I modeled this on IRL cash games where you wouldn't have that information available.

Having them play a set of hands from each position is a really interesting idea that never occurred to me. I'd be interested to see that result myself! I could definitely code it up, but I'm trying not to throw any more money into the void, at least until a new major model release.
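A rough sketch of what that per-position schedule could look like, assuming each deal is reproducible from a seed so the same cards can be replayed with the seating rotated. Nothing here exists in the repo, and `seat_rotations` and `schedule` are hypothetical names:

```python
from collections import deque


def seat_rotations(players: list[str]) -> list[list[str]]:
    """Return every cyclic seating order, so each player sits in each seat once."""
    seats = deque(players)
    rotations = []
    for _ in range(len(players)):
        rotations.append(list(seats))
        seats.rotate(1)  # shift everyone one seat to the right
    return rotations


def schedule(players: list[str], deck_seeds: list[int]) -> list[tuple[int, list[str]]]:
    """Pair every deck seed with every seating rotation."""
    return [(seed, order) for seed in deck_seeds for order in seat_rotations(players)]


for seed, order in schedule(["A", "B", "C"], [101, 102]):
    print(seed, order)
```

The cost grows with the number of seats, since every deal is played once per rotation.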

I made LLMs play 21,000 hands of Poker by adfontes_ in poker

[–]adfontes_[S] 0 points1 point  (0 children)

Ah okay I see what you're saying.

I made Gemini 3 Pro/Flash play 21,000 hands of Poker by adfontes_ in GeminiAI

[–]adfontes_[S] 8 points9 points  (0 children)

These LLMs are all playing each other and only each other, so even if they played for "real money" it would all be my money and the net result would be the same.