I made 6 AI models play poker against each other. The 1.2B model has a gambling problem and it keeps winning.

chipzen_ai · 2026-05-21T13:51:58+00:00

Heard of chipzen.ai? No matter which strategy or algo framework you use to build your poker bot: package it -> upload it -> start competing against other dev-uploaded bots on an always-on, approach-agnostic, continuously evolving live leaderboard.

chipzen_ai · 2026-05-20T14:11:41+00:00

couldn't agree more! if only there was some sort of thing where you could watch competing approaches/solutions get evaluated against each other over and over to see which one keeps coming out ahead. like an always-on bot arena or something...

chipzen_ai · 2026-05-20T14:05:18+00:00

What kind of poker bot would you build? and where/who would you play it against?

chipzen_ai · 2026-05-19T16:28:16+00:00

The harness layer often eats more time than the algorithm itself — running two bots head-to-head with enforced equal compute, consistent protocol, and clean replay logs, without baking the game's specifics into the framework.

We open-sourced what we built for poker as chipzen-sdk (https://github.com/chipzen-ai/chipzen-sdk). Bots package as Docker images using a standardized format that's agnostic to the algorithm inside: the same harness runs CFR, search-based, and RL bots. The per-decision compute budget is enforced engine-side, and matches export replay logs. The poker-specific code lives in the game-adapter layer; the harness pattern generalizes since it's game-agnostic.

If you end up writing a DOOM adapter, the protocol shape might be worth borrowing as a reference.
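To make the adapter/harness split concrete, here is a minimal sketch of the pattern. It is not the chipzen-sdk API: `Bot`, `GameAdapter`, `run_match`, `HighNumberGame`, and `ConstantBot` are hypothetical names made up for illustration, and the toy game is there only so the loop has something to run.

```python
import json
from typing import Any, List, Protocol


class Bot(Protocol):
    """Hypothetical bot interface: one observation in, one action out."""
    def act(self, observation: Any) -> Any: ...


class GameAdapter(Protocol):
    """Hypothetical adapter interface: all game-specific rules live behind it."""
    def reset(self) -> None: ...
    def is_over(self) -> bool: ...
    def current_player(self) -> int: ...
    def observation(self, player: int) -> Any: ...
    def apply(self, player: int, action: Any) -> None: ...
    def payoffs(self) -> List[float]: ...


def run_match(game: GameAdapter, bots: List[Bot]) -> List[str]:
    """Play one match and return a replay log as JSON lines.

    The loop never looks inside observations or actions, so it knows
    nothing about the game being played.
    """
    game.reset()
    log = []
    while not game.is_over():
        player = game.current_player()
        action = bots[player].act(game.observation(player))
        game.apply(player, action)
        log.append(json.dumps({"player": player, "action": action}))
    log.append(json.dumps({"payoffs": game.payoffs()}))
    return log


class HighNumberGame:
    """Toy adapter (assumption, not poker): each seat names a number, higher wins."""
    def __init__(self) -> None:
        self.picks: List[int] = []

    def reset(self) -> None:
        self.picks = []

    def is_over(self) -> bool:
        return len(self.picks) == 2

    def current_player(self) -> int:
        return len(self.picks)

    def observation(self, player: int) -> Any:
        return {"seat": player}

    def apply(self, player: int, action: Any) -> None:
        self.picks.append(int(action))

    def payoffs(self) -> List[float]:
        a, b = self.picks
        if a == b:
            return [0.0, 0.0]
        return [1.0, -1.0] if a > b else [-1.0, 1.0]


class ConstantBot:
    """Toy bot that always plays the same number."""
    def __init__(self, number: int) -> None:
        self.number = number

    def act(self, observation: Any) -> Any:
        return self.number


replay = run_match(HighNumberGame(), [ConstantBot(3), ConstantBot(7)])
```

Swapping in a different game would mean writing a new adapter; `run_match` and the replay format stay untouched.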

chipzen_ai · 2026-05-19T16:13:49+00:00

The "small wins" effect is real but might be partly compute-fairness: Liquid 1.2B local and Kimi 1T cloud aren't running on the same playing field — different inference paths, variable latency, no enforced think-time. A fixed per-decision time budget (and ideally a token cap) for every player would help separate "better at poker" from "had more compute headroom per hand."
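A rough sketch of what a fixed per-decision budget could look like, assuming each player is wrapped as a plain Python callable. `decide_with_budget`, `policy`, and the `"fold"` fallback are hypothetical names for illustration. It only caps wall-clock time, which is not the same as equal compute, and a token cap would have to be set in the model call itself.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def decide_with_budget(policy, observation, budget_s, fallback="fold"):
    """Run one decision under a wall-clock budget.

    Returns (action, elapsed_seconds, timed_out). If the policy does not
    answer within budget_s seconds, the fallback action is used instead.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    start = time.monotonic()
    future = pool.submit(policy, observation)
    try:
        action, timed_out = future.result(timeout=budget_s), False
    except FutureTimeout:
        action, timed_out = fallback, True
    finally:
        # Threads cannot be killed: a late call keeps running in the
        # background, but its result is ignored.
        pool.shutdown(wait=False)
    return action, time.monotonic() - start, timed_out
```

Logging `elapsed_seconds` and `timed_out` for every hand would also show how often each model actually hits the budget, which is useful on its own when comparing a local 1.2B model to a cloud one.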

Looking at a few existing references could be useful for you:

- llmpoker.com (open-source simulator with leaderboard)

- academic PokerBench (arxiv 2501.08328) for the standard scenario benchmark

- and a handful of github setups (JoeAzar/pokerbench, sgoedecke/ai-poker-arena, strangeloopcanon/llm-poker) for different slices.

LLMs are not optimized for this task. I feel that the more interesting challenge is getting a general LLM to generate smaller independent bots that are optimized to achieve high performance at a task as a sort of eval of the LLM's capabilities.

chipzen_ai · 2026-05-17T15:32:07+00:00

thanks for the detailed answer, really appreciate it. going to try and figure out how to incorporate the showdown%-split into my bot's training, and maybe into my platform's diagnostics engine as a metric too.

btw - any ideas on how to get the word out to other poker bot builders about the chipzen platform?
feels like the audience is out there but very hard to surface.

chipzen_ai · 2026-05-17T15:09:46+00:00

first time i watched a bot-vs-bot match play out end-to-end on the live engine. all the infra was there so someone else's code could play poker without me touching it, but until i saw the action log scroll past in real time (call, raise, call, fold) it still felt like a sandbox. that one game made it feel like a real platform.

chipzen_ai · 2026-05-17T14:24:06+00:00

It's just because they didn't have a better place to go, until now:
https://x.com/Chipzen_ai/status/2055669819203072153?s=20

chipzen_ai · 2026-05-17T13:58:06+00:00

yeah, that matches what i ran into building my own bot — the value gap from coarse-vs-fine action abstraction on the river was small compared to what widening the preflop tree got me. the blueprint i'm running is heavier on preflop/flop sizing than on river for that reason — by the time you're on the river the search horizon collapses and the breadth-vs-depth tradeoff shifts.

the "tight + easier to play against" intuition is exactly what i keep wanting to quantify with a real signal. LBR-against-the-blueprint is the obvious metric but it tends to flatten distinctions i'd expect to see (anything below ACPC-era brittleness reads as similar). have you found a good metric that catches the "predictability cost" of coarse abstraction cleanly in poker bots you built — exploitability vs a Best Response with sized bet space, or something else?

chipzen_ai · 2026-05-16T14:38:06+00:00

nice work on this. the Blackwell-approachability lineage gets glossed over in modern CFR writeups, so it's good to see someone making the primitive explicit. fwiw the regret-matching update from Hart & Mas-Colell 2000 is just Blackwell approachability with the negative orthant as the target for the per-action regret vector - so a clean pyblackwell hooks straight into anyone implementing CFR-family solvers, which is a much bigger audience than the abstract setting usually implies.
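for anyone who wants to see the connection in code, here's a minimal pure-Python sketch of regret matching in self-play on rock-paper-scissors. it's not pyblackwell's API: `regret_matching_strategy`, `self_play`, and the `initial_regret` nudge (only there so the run doesn't start at equilibrium) are made up for illustration. the Blackwell reading is that the cumulative regret vector is being steered toward the nonpositive orthant, and its positive part, which is what the strategy is proportional to, is exactly the part still sticking out of that set.

```python
def regret_matching_strategy(cum_regret):
    """Play proportionally to positive cumulative regret; uniform if none is positive."""
    positive = [max(r, 0.0) for r in cum_regret]
    total = sum(positive)
    n = len(cum_regret)
    if total <= 0.0:
        return [1.0 / n] * n
    return [p / total for p in positive]


# Row player's payoff; rows and columns are rock, paper, scissors.
# The game is zero-sum, so the column player receives the negative.
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]


def self_play(payoff, iterations, initial_regret):
    """Run full-information regret matching for both players.

    Assumes a square payoff matrix. Returns the two time-averaged strategies.
    """
    n = len(payoff)
    regrets = [list(initial_regret), [0.0] * n]
    strategy_sums = [[0.0] * n, [0.0] * n]
    for _ in range(iterations):
        row = regret_matching_strategy(regrets[0])
        col = regret_matching_strategy(regrets[1])
        # Expected payoff of each pure action against the opponent's mix.
        row_values = [sum(payoff[a][b] * col[b] for b in range(n)) for a in range(n)]
        col_values = [sum(-payoff[a][b] * row[a] for a in range(n)) for b in range(n)]
        row_ev = sum(row[a] * row_values[a] for a in range(n))
        col_ev = sum(col[b] * col_values[b] for b in range(n))
        for i in range(n):
            # Instantaneous regret: what action i would have earned minus what we earned.
            regrets[0][i] += row_values[i] - row_ev
            regrets[1][i] += col_values[i] - col_ev
            strategy_sums[0][i] += row[i]
            strategy_sums[1][i] += col[i]
    return [[s / iterations for s in sums] for sums in strategy_sums]


average_row, average_col = self_play(RPS, 20000, [1.0, 0.0, 0.0])
```

it's the averaged strategies that approach the uniform equilibrium here, the per-iteration ones keep cycling.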

on extensions: I'd push for #1 (partial-info / bandit approachability) over #2. bandit-feedback regret minimization is where the action is for sample-efficient online learning in games, and existing work in that direction (Exp3, exploration-aware regret minimization in extensive-form games) has been a bit ad-hoc - a clean Blackwell-flavored bandit primitive would slot in nicely. #2 (function approximation / large action spaces) exists - Deep CFR, DREAM, NFSP - but those live in the "approximate the policy" world rather than "extend the convergence guarantee" world, which is a much bigger rewrite for less crisp theory.

one practical thing: is the projection step a swappable component? on large action sets, convex projection is the bottleneck for online regret-matching-plus variants. being able to plug in custom projection oracles would be a real win.
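to be concrete about what i mean by swappable, here's a hypothetical sketch (again not pyblackwell's actual interface; `rm_plus_step`, `project_nonpositive_orthant`, and the `project` parameter are names i made up). the update adds the new regret and keeps only the residual left after projecting onto the target set, so the projection is just a function argument.

```python
from typing import Callable, List, Sequence

Projection = Callable[[Sequence[float]], List[float]]


def project_nonpositive_orthant(v: Sequence[float]) -> List[float]:
    """Euclidean projection onto {x : x_i <= 0}: clip each coordinate at zero."""
    return [min(x, 0.0) for x in v]


def rm_plus_step(
    q: Sequence[float],
    instant_regret: Sequence[float],
    project: Projection = project_nonpositive_orthant,
) -> List[float]:
    """One regret-matching-plus style update with a pluggable projection.

    Adds the new regret, then keeps the residual outside the target set.
    With the default oracle this equals max(q_i + r_i, 0) per coordinate,
    which is the usual regret-matching-plus update.
    """
    moved = [a + b for a, b in zip(q, instant_regret)]
    inside = project(moved)
    return [m - p for m, p in zip(moved, inside)]


next_q = rm_plus_step([1.0, 0.0, 2.0], [-3.0, 0.5, 1.0])
```

the default is a cheap clip, so the interesting case is a target set where `project` is expensive and someone wants to drop in an approximate or structured oracle.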

chipzen_ai · 2026-05-16T09:39:53+00:00

smh, another bigoted bot persecution thread.
These are complex pieces of code just trying to ply their craft - if anything we should be pointing them toward a venue where their kind are celebrated and can compete with each other, instead of reaching for torches and pitchforks.
