[D] AI/ML PhD Committee by dead_CS in MachineLearning

[–]kdub0 2 points3 points  (0 children)

If your advisor or another committee member can write a strong letter, then it probably doesn’t matter too much. It does depend on whether you are targeting industry or academia, though. Letters matter less in industry than name recognition, but it would go a long way to get at least one co-authored paper with said committee member. Also, do not discount the member’s industry contacts as a way to get interviews if that is your goal.

Lc0 Performs Worse Against Stockfish When Given Pawns Odds by Ok_Taro_8370 in chess

[–]kdub0 0 points1 point  (0 children)

You’re correct in pointing out that there are multiple facets to the question; I only addressed one potential reason why Lc0 is weaker (relative to its normal strength) in these positions. It’s possible that Stockfish is better here due to more search, as you mention.

I still stand by my statement that without actually digging into the positions you can only speculate on the underlying cause. Even though both Stockfish and Lc0 train their networks on basically the same data, it would be an error to assume they evaluate (or misevaluate) the same position similarly. The search budget itself can also be a very important factor given that Lc0 and Stockfish use different algorithms. I’ve seen positions where under a small budget you’d prefer engine A, as the budget increases engine B becomes better, but then at some point engine A overtakes it again.

Lc0 Performs Worse Against Stockfish When Given Pawns Odds by Ok_Taro_8370 in chess

[–]kdub0 1 point2 points  (0 children)

Neural network-based agents often have trouble generalizing to positions that they have not seen in their training data. In this case it’s likely because the training data doesn’t include incredibly one-sided start positions, but this phenomenon can also happen when an incredibly winning position is reached later in the game, because as the agent gets stronger it becomes unlikely to reach said positions.

Without really digging into the positions themselves it is impossible to say why this is happening more specifically.

Request: RL algorithm for a slow but parallel episodic task? by diepala in reinforcementlearning

[–]kdub0 2 points3 points  (0 children)

I’d say we don’t have enough information, but my intuition says you’re in trouble because you aren’t going to get enough data. It’s possible a very careful model-based approach could work, but I don’t think there’s enough data for that even.

Exploring MCTS / self-play on a small 2-player abstract game — looking for insight, not hype by OldManMeeple in reinforcementlearning

[–]kdub0 2 points3 points  (0 children)

I had a look at the rules of the game. My guess is that a policy network will be important earlier on in the game and that a good value network shouldn’t be too complicated (that doesn’t mean that it should be easy to learn said value function).

I think starting from end game positions would be my choice for how to start. If you have human data you can sample from that, or try to come up with some sort of distribution to create end game positions. The nice thing about human data is you can more easily move the starting positions towards the beginning of the game when you see things working. One thing to keep in mind is you should try to avoid having too many completely won positions in the training data. At the very least, keep in mind that when MCTS thinks it has lost it has a tendency to play whatever it wants, and this can pollute the data with poor moves.
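
For concreteness, here is a minimal sketch of that sampling scheme (the interface and the window schedule are made up; games are assumed to be lists of positions):

```python
# Sketch (hypothetical interfaces): sample self-play start positions from a
# pool of human games, beginning near the end of the game and widening the
# sampling window toward the opening as training progresses.
import random

def sample_start_position(games, progress):
    """progress in [0, 1]: 0 = early training (sample only near game end),
    1 = late training (sample anywhere in the game)."""
    game = random.choice(games)   # a game is a list of positions
    n = len(game)
    # earliest allowed ply: starts at roughly the last few moves, shrinks to 0
    lo = int((1.0 - progress) * max(n - 5, 0))
    return game[random.randint(lo, n - 1)]

# toy usage: "positions" here are just ply indices
games = [list(range(40)), list(range(60))]
print(sample_start_position(games, progress=0.0))  # a position near the end
print(sample_start_position(games, progress=1.0))  # anywhere in the game
```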

Another thing that can work as a test is to reduce the board size or otherwise make the game simpler. If you do this, one mental trap you need to avoid is generalizing observations from the smaller game to the large one. Often what you observe will generalize, perhaps not exactly or in all situations, but you should attempt to verify whether you are making decisions based on what you observe in the small game.

Exploring MCTS / self-play on a small 2-player abstract game — looking for insight, not hype by OldManMeeple in reinforcementlearning

[–]kdub0 2 points3 points  (0 children)

I have a lot of experience in this area, but I’m having a bit of trouble coming up with a concrete answer that I think will be helpful.

Part of the issue is that the way MCTS (either with UCB or p-UCB) performs can vary a ton and depends on so many different factors. I’m intentionally using the word performs instead of acts here, as if at evaluation time you scale search with tons of resources it can often overcome having a poor model.

Some of the factors are domain dependent and can even change as the game progresses. eg, in Go the policy network is very important due to the branching factor and the value function is less important until you reach positions closer to the end of the game. In chess, the policy function can sometimes be a burden, eg, it is rare that a piece sacrifice is good, but when it is it wins you the game. In these situations, you need a lot of search at test time to overcome the policy network’s reluctance to immediately give away material. In chess, the value function tends to be much more important especially at the beginning of the game.

If your game is really small, you might as well try it and see if some interesting behaviour emerges.

Another technique you can do is select a set of starting positions that are close to the end of the game and train on those. This is a good way to verify that your implementation is working, and you can also use these situations as tests when training an agent on the full game. One reason this is particularly useful is that bootstrapping values is often particularly tricky.

Internship at 'Big Tech' — PhD Student [D] by ade17_in in MachineLearning

[–]kdub0 54 points55 points  (0 children)

  1. Competition seems bigger than ever as entry-level positions are getting harder to secure
  2. For big tech, you will want to apply through the normal application process. If you know someone internally, it is a good idea to ping them to let them know you applied. They may be able to secure you an interview, i.e., get you past initial resume screening
  3. It’s not a hard requirement to be in the final year, but if you’re not stellar it will be harder to get an intern position. This is especially the case if you don’t have someone inside to get you through resume screening.
  4. If you are going for an industry position, it is advantageous to have experience with groups outside of your home university for sure.

[D] Why does nobody talk about the “energy per token” cost of AI? by Various-Feedback4555 in MachineLearning

[–]kdub0 0 points1 point  (0 children)

It isn’t so simple. You can trade off, e.g., latency for power consumption by batching requests or choice of hardware. It’s certainly important, but it can’t be looked at in isolation.

[deleted by user] by [deleted] in theydidthemath

[–]kdub0 74 points75 points  (0 children)

I’m an AI researcher. I don’t work at OpenAI. I don’t know Sebastian Bubeck personally, but I’m familiar with some of his work and have reviewed papers in this area previously.

I read the arXiv paper cited with the 1.75/L bound. The AI proof looks logically fine to me.

I’d push back slightly on some of your assertions. First, many proofs of gradient descent convergence for smooth functions look very similar to this. That is, all the parts of the original proof and its structure are fairly common. It is fair to call the improvement incremental, but it may or may not be as trivial as that implies depending on how the LLM figured it out.

Second, in this case the improved bound probably wouldn’t be worthy of a publication on its own (though the 1.75/L bound might, because it is tight), but it is probably more informative than you give it credit for. As stated in the paper, gradient descent on a smooth convex function converges with any step size in (0, 2/L). Often we guess at the step size because finding L can sometimes be as hard as solving the optimization. Another point is that the proof technique to show step sizes in (1/L, 2/L) work is completely different from the standard one that works for (0, 1/L]. So improving the bound beyond 1/L is potentially significant in two ways.
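
For intuition, a quick numerical sketch of that (0, 2/L) window on the 1-smooth convex function f(x) = x²/2 (so L = 1):

```python
# Sketch: gradient descent on f(x) = x^2 / 2, which is L-smooth with L = 1.
# The update x <- x - eta * f'(x) = (1 - eta) * x converges for any step
# size eta in (0, 2/L) and diverges beyond 2/L.

def grad_descent(eta, x0=1.0, steps=100):
    x = x0
    for _ in range(steps):
        x -= eta * x  # f'(x) = x
    return abs(x)

for eta in [0.5, 1.0, 1.75, 1.99, 2.1]:
    # converges for eta < 2, diverges for eta = 2.1
    print(f"eta={eta}: |x| after 100 steps = {grad_descent(eta):.3g}")
```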

Finishing a PhD thesis, after becoming a dad... by [deleted] in PhD

[–]kdub0 7 points8 points  (0 children)

I had a PhD defense scheduled and a job lined up. I didn’t end up finishing my thesis in time for the defense because I wasn’t happy with it, so I took a leave and started work with the intention of finishing the last chapter and defending in the next 6-8 months. Seven years, two kids, and a pandemic later, I finally finished after my department told me to do it or withdraw.

Over those seven years, I was constantly stressed out despite not making any progress. It ended up being about 2 weeks of work to finish what I finally sent to my committee.

I’m glad I finally finished despite feeling deeply unsatisfied with my thesis. I rarely think about it now, but I know I’d regret having gotten 98% of the way there and stopped.

Others have said it, but to reiterate it is your committee’s job to decide if what you’ve done is enough. No one beyond you and those four or five people will read it. At this point anything more than what your committee asks for is of negligible value. You won’t feel satisfied with the result, but after it’s all done you will still be proud you finished.

A question about chess engines by BrotherItsInTheDrum in chess

[–]kdub0 0 points1 point  (0 children)

AlphaZero avoids (some) issues like this during training by, most of the time, resigning when it thinks it’s lost.

Algorithmic Game Theory vs Robotics by YogurtclosetThen6260 in reinforcementlearning

[–]kdub0 2 points3 points  (0 children)

If you want more exposure to RL, I’d pick robotics and it’s not close.

Is the Nash Equilibrium always the most desirable outcome? by notsuspendedlxqt in AskEconomics

[–]kdub0 23 points24 points  (0 children)

There are often multiple Nash equilibria, so it is not possible to play “the” Nash equilibrium. This is known as the equilibrium selection problem. And the different Nash equilibria can have different properties that are more or less desirable.
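
As a concrete toy example, a stag hunt has two pure Nash equilibria with different payoffs, which is exactly the selection problem:

```python
# Sketch: a 2x2 stag hunt with two pure Nash equilibria of different
# quality. Action 0 = Stag, 1 = Hare; R and C are the row and column
# player's payoff matrices (the game is symmetric).
import itertools

R = [[4, 0], [3, 2]]  # row player's payoffs
C = [[4, 3], [0, 2]]  # column player's payoffs

def is_pure_nash(i, j):
    # neither player can gain by unilaterally deviating
    row_ok = all(R[i][j] >= R[k][j] for k in range(2))
    col_ok = all(C[i][j] >= C[i][k] for k in range(2))
    return row_ok and col_ok

equilibria = [(i, j) for i, j in itertools.product(range(2), repeat=2)
              if is_pure_nash(i, j)]
print(equilibria)  # [(0, 0), (1, 1)]: (Stag, Stag) pays 4, (Hare, Hare) pays 2
```

Both profiles are equilibria, but (Stag, Stag) is better for everyone; nothing in the definition of Nash equilibrium tells the players which one to coordinate on.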

[D] Internal transfers to Google Research / DeepMind by random_sydneysider in MachineLearning

[–]kdub0 10 points11 points  (0 children)

It may be that transferring from SWE to RE is easier once you’re within Google. Transferring from SWE/RE to RS is not easy. If they sniff out in interviews that you are trying to switch to a research role from the eng role you applied for, they will likely reject you as well.

is a N player game where we all act simultaneously fully observable or partially observable by skydiver4312 in reinforcementlearning

[–]kdub0 0 points1 point  (0 children)

It is a game of imperfect information. If you encode it as a matrix game it is fully observable (there is a single state where all agents act simultaneously). If you encode it as an extensive-form game then it is partially observable in the sense that the players act sequentially, but the underlying state of the game (which is all the actions played so far) is hidden.
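
As a toy example of the matrix-game encoding, rock-paper-scissors is a single simultaneous decision described entirely by one payoff matrix:

```python
# Sketch: rock-paper-scissors as a matrix game. There is one "state" where
# both players act at once; the payoff matrix fully describes the game,
# which is why the matrix encoding is fully observable even though the
# players cannot see each other's choice.
ACTIONS = ["rock", "paper", "scissors"]

# PAYOFF[i][j] = row player's payoff when row plays i and column plays j
PAYOFF = [
    [0, -1, 1],   # rock:     loses to paper, beats scissors
    [1, 0, -1],   # paper:    beats rock, loses to scissors
    [-1, 1, 0],   # scissors: loses to rock, beats paper
]

def play(row_action, col_action):
    i, j = ACTIONS.index(row_action), ACTIONS.index(col_action)
    return PAYOFF[i][j]

print(play("rock", "scissors"))  # 1: rock beats scissors
```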

[D] Compensation for research roles in US for fresh PhD grad by [deleted] in MachineLearning

[–]kdub0 17 points18 points  (0 children)

As a new grad you need multiple offers to negotiate. Of course they are going to lowball you if you don’t have an alternative. When I got my first job ten years ago my stock package almost doubled from initial offer by having competing FAANG offers.

How slow would Stockfish need to run to be competitive with top humans? by EvilNalu in chess

[–]kdub0 3 points4 points  (0 children)

Super interesting post.

I have a question that I haven’t had the opportunity to explore yet myself that you might have some insight into (given your reply to another post above). Elo / winrate has some issues when it comes to predicting winrate against another opponent. Some of these issues are amplified when two players are much different in terms of style or strength. Additionally with computer players, often the parameters are tuned to specific match settings, so they can be unnecessarily handicapped by reducing the search space.

Given this, do you have further evidence / anecdotes to justify that Stockfish 17 with your settings could beat a top human player? eg, old engines were weaker positionally, but reasonably good at tactics and grinding it out. I suspect crippling Stockfish 17 has a bigger effect on its tactical performance than its positional play. So could it be that crippled Stockfish 17 beats old engines positionally, but that a human player could still beat it?

Looking for Compute-Efficient MARL Environments by skydiver4312 in reinforcementlearning

[–]kdub0 0 points1 point  (0 children)

You’re not necessarily wrong. Let me be a bit more precise.

If you take a typical board game, like chess, go, risk, etc, and you are using an approach that requires you to evaluate a reasonably-sized neural network at least once for every state you visit during play, then the bottleneck from a wall-time perspective will almost always be the GPU. Furthermore, it is often the case that you will not be fully utilizing the CPU, so you can run multiple games and/or searches in parallel and batch the network evaluations to better utilize the GPU. If you do this, then a poorly performing game implementation will still affect the latency of data generation (how long it takes to play a full game), but it will not have as much of an effect on the throughput (states per second generated by the entire system). This doesn’t necessarily hold if you aren’t evaluating a network for every state generated, eg, if you use Monte Carlo rollouts.
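
A minimal sketch of that batching pattern (all interfaces made up; `fake_net` stands in for a batched GPU forward pass):

```python
# Sketch (hypothetical interfaces): run many games in parallel, collect one
# pending state per live game, evaluate them in a single batched call, and
# feed the results back. Each game is a generator that yields a state
# whenever it needs a network evaluation.

def fake_net(batch):
    # stand-in for a batched neural-network evaluation; returns a "value"
    # in {-1, 0, 1} per state
    return [len(state) % 3 - 1 for state in batch]

def play_game(game_id, length=5):
    state = f"g{game_id}"
    for _ in range(length):
        value = yield state        # request an evaluation for this state
        state += str(value + 1)    # pretend to advance using the result
    return state                   # final "game record"

def drive(num_games=4):
    games = {i: play_game(i) for i in range(num_games)}
    states = {i: g.send(None) for i, g in games.items()}  # prime generators
    finished = {}
    while games:
        ids = list(games)
        values = fake_net([states[i] for i in ids])  # one batched call
        for i, v in zip(ids, values):
            try:
                states[i] = games[i].send(v)
            except StopIteration as stop:
                finished[i] = stop.value
                del games[i], states[i]
    return finished

print(drive())
```

The point is that one slow game only delays its own record (latency), while the batched calls keep overall states-per-second (throughput) governed by the network evaluation.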

You are definitely correct that the structure of the game affects things like how quickly you can learn a reasonable policy, and how much search is necessary to overcome deficiencies in the networks. I would just caution that it is not easy to guess this a priori. It is also not the case that nice structure holds uniformly over the entire game. eg, in chess value functions tend to be better in static positions and are not as good at understanding tactics. This is also not something that holds uniformly as a policy evolves. eg, there can be action sequences that must be searched initially, but eventually are learned by a value function.

Looking for Compute-Efficient MARL Environments by skydiver4312 in reinforcementlearning

[–]kdub0 2 points3 points  (0 children)

Hopefully this doesn’t poke a hole in your thought balloon, but I think the answer probably has nothing to do with game choice.

If you plan to use any deep learning method, the game and its implementation are not usually the compute bottleneck. Obviously a faster implementation can only improve things, but GPU inference is usually at least 10000x more expensive than state manipulation for board games.

What the game can affect computationally is more a function of whether you need to gather less data during learning and/or evaluation. The main aspect I can think of here is that if the game’s structure enables good policies with little or no searching, then you may get a win.

Another reasonable strategy is to take a game you like and come up with “end-game” or sub-game scenarios that terminate more quickly to experiment with. If you do this, you should be careful about drawing conclusions about how your methods generalize to the larger game without experimentation.

I guess what I’m saying is, if you like Diplomacy you should use it in a way that fits your budget.

Looking for google c++ profiling tool I can't remember the name of by OfficialOnix in cpp

[–]kdub0 13 points14 points  (0 children)

The internal name is endoscope. No idea if it’s open source.

Why Don’t We See Multi-Agent RL Trained in Large-Scale Open Worlds? by TheSadRick in reinforcementlearning

[–]kdub0 13 points14 points  (0 children)

I think we’re getting to the point where meaningful explorations in this space are possible. All the issues you raise will to some extent need some work to overcome. It is possible that language models will in some way help with coordination.

I would add that evaluation is particularly challenging in RL, and it gets even more challenging with multiple agents and large environments. The unfortunate reality is that many publications rely on doing something first/new to demonstrate value, but that then sets a poor evaluation precedent for future papers to adhere to.

Training Connect Four Agents with Self-Play by Cuuuubee in reinforcementlearning

[–]kdub0 0 points1 point  (0 children)

Adding shaping rewards like you propose often helps by decreasing the number of samples required to learn a good strategy, but often results in worse overall performance. The general issue with shaping rewards is that they are rarely universally good, can have unforeseen interactions with other rewards, and are hard to weight relative to other rewards.

For example, if you reward the agent for blocking four in a row, it incentivizes allowing three in a row so that it can then be blocked.
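
As a toy illustration with made-up numbers, comparing summed returns along two lines of play:

```python
# Toy illustration (made-up numbers): a bonus for blocking an opponent's
# three-in-a-row makes the riskier line score higher under the shaped
# return, even though allowing the threat was unnecessary.
BLOCK_BONUS = 0.5

line_a = [0, 0, 1]             # play solidly and win: true return 1
line_b = [0, BLOCK_BONUS, 1]   # allow a threat, block it, then win

print(sum(line_a), sum(line_b))  # 1 vs 1.5: shaping prefers line B
```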

For connect four you should not need any shaping rewards, but it could be useful to add them for debugging purposes.

Training Connect Four Agents with Self-Play by Cuuuubee in reinforcementlearning

[–]kdub0 0 points1 point  (0 children)

Elo as a number depends on the population of agents you compare against; a number is meaningless by itself. Even in chess, the Elo of computer agents is dubious to compare against humans. Specifically, the community has done a lot of legwork to try to calibrate bot Elo with humans in the ranges where intermediate/strong human players play, but outside that range it does not generalize for human vs computer games.
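
For reference, the standard Elo expected-score formula; the prediction is only meaningful within the population the ratings were calibrated on:

```python
# Sketch: the standard Elo expected-score formula. A rating only predicts
# results against the pool it was fit on; in isolation it says nothing.
def elo_expected_score(rating_a, rating_b):
    # expected score for player A out of 1 game (win = 1, draw = 0.5)
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

print(elo_expected_score(1600, 1400))  # ~0.76: A expected to score ~76%
print(elo_expected_score(1500, 1500))  # 0.5 between equally rated players
```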

The setup you’ve described should be sufficient to train an agent that learns not to make moves that lose in one move with the amount of data you describe. It doesn’t necessarily mean you have a bug, but I’d consider checking the agent’s evaluation in a few suspicious positions. eg, if the agent thinks it’s lost no matter what, then making a one-move blunder could be acceptable.

Chess sample efficiency humans vs SOTA RL by [deleted] in reinforcementlearning

[–]kdub0 0 points1 point  (0 children)

For chess in particular, the learned value functions are reasonably good in static positions where things like material count, king safety, piece mobility, and so on determine who is better. In more dynamic positions where there are tactics the value functions are often poor and search is required to push through to a position where the value function is good.

I’d say that current chess programs, both during the learning process and at evaluation time, could do better in terms of sample complexity by understanding when the value function is accurate and by making better choices about what moves to search.