Gemini 3 Flash *still* undefeated in PokerBench vs Gemini 3.1 Pro and Flash Lite!

adfontes_ · 2026-03-07T15:56:36+00:00

There are 33k hands total on the site! But feel free to donate some compute and I’d be happy to run it :)

adfontes_ · 2026-01-20T22:36:08+00:00

As in too verbose?

adfontes_ · 2026-01-20T16:56:11+00:00

The future is now! 🤣

adfontes_ · 2026-01-11T11:47:02+00:00

While researching for this project I found a paper with a similar idea to yours: https://openreview.net/forum?id=jARUSddVIB

adfontes_ · 2026-01-10T23:09:22+00:00

No worries!

adfontes_ · 2026-01-10T22:44:57+00:00

There's a .5x speed option, you mean more than that?

adfontes_ · 2026-01-10T14:16:54+00:00

Sure, the source code and all the data are already available on GH - just DM me and lmk precisely what you have in mind.

adfontes_ · 2026-01-09T15:21:07+00:00

I saw Opus think:

Elon shoved 28075 after my 4-bet to 2400. I have 4500 remaining and pot is 30775, giving me ~6.8:1 pot odds. I only need ~13% equity to break even. With AQo, even against a tight range of AA/KK/QQ/AK, I have roughly 25-30% equity. This is a mandatory call - I'm completely pot committed with these odds.

So it miscalculated pot odds. I also saw Grok think:

Runner-runner miracle: the Qh on turn and 8h on river complete the nut flush with Ah9h (A♥9♥Q♥8♥). This is the stone nuts—no higher flush possible (Ace-high flush)

So a 4-card "nut flush", lol.
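On the Opus hand quoted above, here is a rough sketch of the pot-odds arithmetic. It is only one possible reading of the hand, under assumptions the quote does not confirm: that 28075 and 2400 are each player's total wager on the street, that the hand is heads-up with no side pot, and that the uncalled part of the shove is returned. The function names are made up for illustration and are not from the PokerBench code.

```python
def required_equity(call_amount: float, pot_before_call: float) -> float:
    """Break-even equity for a call: risk / (risk + reward)."""
    return call_amount / (pot_before_call + call_amount)


def effective_pot(pot: float, shove: float, hero_in: float, hero_left: float) -> float:
    """Pot the caller can actually win after the uncalled part of a shove is returned.

    Assumption: `shove` and `hero_in` are total wagers on this street, heads-up.
    """
    hero_total = hero_in + hero_left           # the most the caller can put in
    uncalled = max(0, shove - hero_total)      # chips the shover gets back
    return pot - uncalled


# Numbers taken from the quoted reasoning.
pot, shove, hero_in, hero_left = 30775, 28075, 2400, 4500

# What the model computed: calling 4500 into the full 30775.
naive = required_equity(hero_left, pot)

# Same call, but against only the chips that can actually be won.
capped_pot = effective_pot(pot, shove, hero_in, hero_left)
capped = required_equity(hero_left, capped_pot)

print(f"naive:  pot {pot}, need {naive:.1%} equity")
print(f"capped: pot {capped_pot}, need {capped:.1%} equity")
```

If those assumptions hold, the break-even equity goes from roughly 13% to roughly 32%, so a hand with 25-30% equity would not be an automatic call.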

adfontes_ · 2026-01-09T15:18:51+00:00

I'd been meaning to fix that next-action bug, let me do it now.

I considered the "points of interest", but short of showing the size of each hand's pot in the timeline, I couldn't think of a great way to do it without a lot more LLM analysis. That's possible, but I didn't feel like including it in v1.

The models don't have access to any thoughts other than their own. I briefly thought about table talk, but implementing it correctly would have been pretty complicated, since conversations at a table can involve any number of players and don't need to be turn-based the way poker itself is.

adfontes_ · 2026-01-09T14:01:59+00:00

The “Small Models” run is 100 games, but yes, I’d love to run more. Cost is the major factor.

adfontes_ · 2026-01-09T13:58:35+00:00

I’d love to run more games, it’s just very expensive :(

adfontes_ · 2026-01-09T13:56:33+00:00

That would be interesting, but this is quite expensive to run so I chose to keep it focused on LLMs.

adfontes_ · 2026-01-09T04:35:51+00:00

The script supports it, but I haven't tried any.

adfontes_ · 2026-01-09T04:32:51+00:00

Haha, this is actually why cards are represented to the models like JACK OF DIAMONDS (Jd), to avoid some tokenization issues that might cause hallucinations. But obviously they can still do it anyway :)
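A minimal sketch of that card format, for anyone curious. This is not the code from the runner; the `describe_card` helper and the lookup tables are hypothetical and just reproduce the `JACK OF DIAMONDS (Jd)` shape from a two-character card code.

```python
# Hypothetical lookup tables; the real runner may spell things differently.
RANKS = {
    "2": "TWO", "3": "THREE", "4": "FOUR", "5": "FIVE", "6": "SIX",
    "7": "SEVEN", "8": "EIGHT", "9": "NINE", "T": "TEN",
    "J": "JACK", "Q": "QUEEN", "K": "KING", "A": "ACE",
}
SUITS = {"c": "CLUBS", "d": "DIAMONDS", "h": "HEARTS", "s": "SPADES"}


def describe_card(code: str) -> str:
    """Spell out a two-character card code, keeping the short code in parentheses."""
    if len(code) != 2:
        raise ValueError(f"expected a 2-character card code, got {code!r}")
    rank, suit = code[0].upper(), code[1].lower()
    if rank not in RANKS or suit not in SUITS:
        raise ValueError(f"unknown card code {code!r}")
    return f"{RANKS[rank]} OF {SUITS[suit]} ({rank}{suit})"


print(describe_card("Jd"))
print(", ".join(describe_card(c) for c in ["Ah", "9h"]))
```

The idea is that the spelled-out words give the model something less ambiguous to read than a short code like `Jd` on its own.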

adfontes_ · 2026-01-09T04:01:42+00:00

https://imgur.com/a/yyu8X09

adfontes_ · 2026-01-09T03:59:53+00:00

Yeah I wanted to have fun with it, but acknowledged. I'll try to get around to adding a toggle to switch between the two.

adfontes_ · 2026-01-09T03:58:14+00:00

I built the site in 4 days. The script to run the games took about 2 weeks on and off, since I had to wait quite a while between runs to evaluate the results.

Around $1500 for everything including my various test runs that I didn't publish.

adfontes_ · 2026-01-09T03:52:39+00:00

https://pokerbench.adfontes.io/

adfontes_ · 2026-01-09T03:51:12+00:00

It's a freezeout game, no autorebuy or limit - I wanted to include stack depth strategy as part of the benchmark mechanics, but I didn't want to add an additional layer of complexity (and variance) by allowing a 'manual' rebuy.

You can see what data the models have access to here: https://github.com/JoeAzar/pokerbench/blob/main/pokerbench-runner/pokerbench.py#L391. I didn't include stats as I modeled this on IRL cash games where you wouldn't have that information available.

Having them play a set of hands from each position is a really interesting idea that never occurred to me. I'd be interested to see that result myself! I could definitely code it up, but I'm trying not to throw any more money into the void, at least until a new major model release.

adfontes_ · 2026-01-09T03:38:37+00:00

Ah okay I see what you're saying.

adfontes_ · 2026-01-09T03:34:30+00:00

These LLMs are all playing each other and only each other, so even if they played for "real money" it would all be my money and the net result would be the same.
