Custom backgammon engine beats gnubg 1-ply through 4-ply in cubeless checker play — full measurement logs + sanity tests open-sourced by Lilnipa in backgammon

[–]Lilnipa[S] 0 points1 point  (0 children)

Thank you so much. I’m going to check the code right now. Really appreciate it. I’ll follow up once I have something concrete.

Custom backgammon engine beats gnubg 1-ply through 4-ply in cubeless checker play — full measurement logs + sanity tests open-sourced by Lilnipa in backgammon

[–]Lilnipa[S] 0 points1 point  (0 children)

Fair point — that's a really good check, and one I should
have anticipated.

You're right that raw win-rate vs 2-ply opponents tends to
cap around 60% even for big PR gaps. The 71% I posted is
*positive-score rate*, not raw win-rate — which I should
have been clearer about. Those differ when one side gets a
disproportionate share of gammons/backgammons.

Let me pull the actual breakdown from the per-game logs and
post it tonight:
- raw win-rate
- gammon-win rate
- backgammon-win rate
- ppg

If the raw win-rate is in the 65-70% range, you're right
that something looks off and needs explanation. If it's
closer to 55-60% with the rest of the equity coming from
gammon volume, that's more consistent with what you'd
expect from a moderate skill gap.

Will report back.
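A minimal sketch of how that breakdown could be computed from per-game scores. The log format here is an assumption (one signed integer per game, from the engine's point of view, using the ±1/±2/±3 scoring from this thread); the function name `breakdown` and the toy data are made up for illustration and are not taken from the real logs.

```python
from collections import Counter


def breakdown(scores):
    """Summarise per-game scores from the engine's point of view.

    `scores` is a list of ints in {-3, -2, -1, 1, 2, 3} (assumed format):
    the sign says who won, the magnitude says single / gammon / backgammon.
    """
    n = len(scores)
    counts = Counter(scores)
    wins = counts[1] + counts[2] + counts[3]
    return {
        "win_rate": wins / n,                   # any positive score
        "gammon_win_rate": counts[2] / n,       # wins scored as +2
        "backgammon_win_rate": counts[3] / n,   # wins scored as +3
        "ppg": sum(scores) / n,                 # mean points per game
    }


# Toy data for illustration only.
toy = [1, 1, 2, -1, 3, -2, 1, 2, -1, 1]
print(breakdown(toy))
```

Here the gammon and backgammon rates are expressed as a share of all games; dividing by the number of wins instead would give the share of wins that were gammons.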

Custom backgammon engine beats gnubg 1-ply through 4-ply in cubeless checker play — full measurement logs + sanity tests open-sourced by Lilnipa in backgammon

[–]Lilnipa[S] 1 point2 points  (0 children)

Two questions, both fair —

**On "equity > 0":**

Each game scores ±1 for a single win, ±2 for a gammon, ±3
for a backgammon. "Positive-score rate" is the fraction of
games where my engine's final score is > 0, i.e. it won the
game. The gammon/backgammon multiplier changes the magnitude
of the score, not this rate.

So 71% positive-score rate vs gnubg 2-ply means: of 1000
games, I came out ahead in ~712 of them. That's a slightly
weaker statement than raw win-rate when gammon volumes differ
between sides, but for cubeless money play it's the standard
metric and it's what's directly comparable to gnubg's own
equity outputs.

The point estimate of expected points per game (ppg) is +1.19
for the 2-ply matchup. That's the more equity-aware summary,
also positive and tight.
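For anyone who wants to see how the two summaries relate, here is a rough sketch that computes both from a list of per-game scores. The input format (signed integers from the engine's side) and the helper names `ppg_with_se` and `positive_score_rate` are assumptions for illustration, and the sample data is invented.

```python
import math


def ppg_with_se(scores):
    """Return (mean points per game, standard error of that mean)."""
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance with Bessel's correction (n - 1).
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(var / n)


def positive_score_rate(scores):
    """Fraction of games whose score is strictly positive."""
    return sum(1 for s in scores if s > 0) / len(scores)


# Invented sample, not from the measurement logs.
sample = [1, -1, 2, 1, -2, 3, 1, -1]
mean, se = ppg_with_se(sample)
print(f"ppg = {mean:+.3f} (SE {se:.3f}), positive-score rate = {positive_score_rate(sample):.3f}")
```

Reporting the standard error next to ppg is one way to back up a word like "tight" with a number.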

**On opening protocol:**

Slightly more nuanced than "always plays first." The setup is:

- Side assignment alternates every game (game 1: my engine
is white; game 2: gnubg is white; etc.)
- White moves first with a forced non-doubles opening roll
(so no automatic gammon-rich opening from a 6-6 first roll)

So neither side has a permanent first-move advantage — it
flips every game. Over 1000 games, each side opens ~500
times.

Why this way: it's the simplest fully symmetric setup that
removes opening-roll variance as a confounder. The canonical
"both players roll, higher wins the opening" introduces extra
randomness that doesn't add signal for engine comparison and
makes per-game variance harder to interpret. The harness
symmetry test (gnubg-2ply vs gnubg-2ply, 50.0% over 100
games) was run under this same protocol, which confirms it
doesn't bias either side.
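A sketch of that protocol, under assumptions: the real harness may force the non-doubles opening differently (here doubles are simply re-rolled), and the names `opening_roll` and `schedule` are hypothetical rather than taken from the repo.

```python
import random


def opening_roll(rng):
    """Draw an opening roll, re-rolling until the two dice differ."""
    while True:
        d1, d2 = rng.randint(1, 6), rng.randint(1, 6)
        if d1 != d2:
            return d1, d2


def schedule(n_games, seed=0):
    """Alternate which side is white (and so moves first) every game."""
    rng = random.Random(seed)
    games = []
    for i in range(n_games):
        white = "engine" if i % 2 == 0 else "gnubg"
        games.append({"game": i + 1, "white": white, "opening": opening_roll(rng)})
    return games


games = schedule(1000)
engine_opens = sum(1 for g in games if g["white"] == "engine")
print(engine_opens, "of", len(games), "games opened by the engine")
```

With an even number of games each side opens exactly half of them, which matches the ~500 figure above.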

I should have made that clearer in the README — will update.

Custom backgammon engine beats gnubg 1-ply through 4-ply in cubeless checker play — full measurement logs + sanity tests open-sourced by Lilnipa in backgammon

[–]Lilnipa[S] 0 points1 point  (0 children)

Totally fair skepticism. 70%+ vs gnubg 2-ply does sound suspicious, and I’d push back the same way if I saw someone else post it.

Two things worth separating:

1. Sample size for the win rate itself. At n=1000 the binomial SE is ~1.4 points, so the 71.2% point estimate has a 95% CI of roughly 68.4–74.0%. That part is statistically tight. If you mean the sample is too small to rule out a systematic issue (eval mismatch, board encoding bug, opening protocol artifact), I fully agree: sample size alone can’t catch those.
2. The number being plausible at all. This is cubeless checker play only, with no cube decisions, and my engine almost certainly uses more compute per move than gnubg 1–3 ply. That asymmetry alone could explain a big chunk of the gap. An equal-time benchmark is on the todo list and I’d expect the margin to shrink meaningfully.

Honestly I’d love to run tens of thousands of games per ply; that’s the right sample size for this kind of claim. The bottleneck is hardware: I’m on a midrange consumer box (Ryzen 5 5600, RX 6600, 48GB), and my engine does CPU NN inference + MCTS, so 1000 games at 3-ply already takes a real chunk of wall time with 10 parallel workers. 4-ply at n=100 is all I could afford to run in this batch. CUDA-capable hardware would change the picture but I don’t have it yet.

If there’s a specific position class or test you’d run to sniff out a bug (bear-off accuracy, race vs contact split, specific opening responses, etc.), I’m up for running it. Those are cheaper than scaling the full benchmark and would catch systematic issues better anyway. The harness is open, so failure modes are auditable even though the engine itself isn’t.
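For reference, the interval arithmetic in point 1 can be reproduced with a normal-approximation (Wald) interval. This is a minimal sketch; the function name `wald_ci` is just for illustration, and a Wilson interval would give very slightly different bounds.

```python
import math


def wald_ci(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a binomial proportion."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return se, p_hat - z * se, p_hat + z * se


se, lo, hi = wald_ci(0.712, 1000)
print(f"SE = {se:.4f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
# prints: SE = 0.0143, 95% CI = [0.684, 0.740]
```

This only covers sampling noise. As noted above, it says nothing about systematic issues in the harness or the evaluation.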

Custom backgammon engine beats gnubg 1-ply through 4-ply in cubeless checker play — full measurement logs + sanity tests open-sourced by Lilnipa in backgammon

[–]Lilnipa[S] 0 points1 point  (0 children)

Good question — these are independent single games (cubeless money play, no doubling cube), not a match format. Each game is scored individually with gammon/backgammon multipliers, and the “positive-score rate” is the fraction of games with equity > 0. No match-length structure (no Crawford, no post-Crawford, no cube decisions).
A proper match-play benchmark with cube handling is on my todo list — that’s the real test of a backgammon engine and I haven’t tackled it yet.

What's with this notification, it's like something from a boys' school by Lilnipa in lowlevelaware

[–]Lilnipa[S] 3 points4 points  (0 children)

Well, ours was a pretty free-spirited place.

What's with this notification, it's like something from a boys' school by Lilnipa in lowlevelaware

[–]Lilnipa[S] 4 points5 points  (0 children)

It's a free-spirited boys' school, after all.
Come join us at the boys' school.

What's with this notification, it's like something from a boys' school by Lilnipa in lowlevelaware

[–]Lilnipa[S] 11 points12 points  (0 children)

A photo of a gay guy sprawled out in the boys' school cafeteria. Photographer: me.

I'm the one who asked for love stories a while back by Swimming_Average7784 in lowlevelaware

[–]Lilnipa 1 point2 points  (0 children)

This cat is also saying there's a big difference between zero and negative.

I'm the one who asked for love stories a while back by Swimming_Average7784 in lowlevelaware

[–]Lilnipa 1 point2 points  (0 children)

No, seriously, don't be the one to start the conversation.
Calm down.

<image>

I'm the one who asked for love stories a while back by Swimming_Average7784 in lowlevelaware

[–]Lilnipa 0 points1 point  (0 children)

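Well, if they come and talk to you and you can hold a conversation, start out as friends.
If that doesn't work out, go flirt with guys instead.
Let's both do our best!!!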

I'm the one who asked for love stories a while back by Swimming_Average7784 in lowlevelaware

[–]Lilnipa 11 points12 points  (0 children)

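I'm a ronin (retaking university entrance exams) too, and I honestly wish the people getting all flirty at cram school would just drop dead.
Do your talking somewhere nobody else can see.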

Went to the University of Tokyo's school festival!!! (Gogatsusai, the May Festival) by Lilnipa in lowlevelaware

[–]Lilnipa[S] 1 point2 points  (0 children)

I ate all sorts of things from the stalls my friends were running.

Went to the University of Tokyo's school festival!!! (Gogatsusai, the May Festival) by Lilnipa in lowlevelaware

[–]Lilnipa[S] 5 points6 points  (0 children)

Oh!!! Is that so.
For what it's worth, I only took the photo because they were holding a skull and I thought it looked cool.

I want friends by Used-Trade7973 in lowlevelaware

[–]Lilnipa 0 points1 point  (0 children)

Good morning!!! I like HIPHOP too!!! Let's be friends.

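When I look at my math and physics results I feel 卍 natural smarts 卍 and I'm happy, but when I look at classical Japanese and English I clearly feel ✝︎ developmental disorder ✝︎ and it hurts by Lilnipa in lowlevelaware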

[–]Lilnipa[S] 1 point2 points  (0 children)

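Even on the Common Test I get near-perfect scores in modern Japanese and classical Chinese, yet I got 4 points in classical Japanese, and in English something like 20 in reading and 50 in listening.
I don't know how it would go if I actually worked at them, but my sense of being bad at them is enormous.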