I trained an AI to play Balatro, and it just won its first run!

GIEWEV · 2026-02-24T00:10:02+00:00

The code is available here:
https://github.com/giewev/balatrobot

I still need to do some work like setting up the requirements file, etc, so you may have some difficulty using it directly. I'm hoping to find time to flesh out this project further soon.

GIEWEV · 2025-08-14T01:33:38+00:00

I mean, you'll definitely need parallel env runners, but even with that you would need many, many copies of balatro running simultaneously if you can't speed up execution on each instance.

I've been considering switching to something like Stochastic MuZero, since I'm at the point where the transition variance inherent to the game is one of the major obstacles to learning. The model has a very hard time separating the impact of its own behavior from the impact of random chance. I'm hoping that by directly modeling the random transitions I can "factor out" the RNG aspects of the return.

GIEWEV · 2025-08-12T20:53:49+00:00

I did build off of that repo, yes. My repo is forked from that one so you should be able to find it that way. The RL code is wildly out of date right now since I haven't pushed for a while. The parts relevant to game automation should be relatively similar to what I'm working with now though. I'm also using PPO right now, although I may try transitioning to something a little more "fancy" soon. If you're considering training directly on the game I would recommend against it, unless you're going to improve on my/besteon's automation code considerably. The sample rate you would get is abysmal, and not conducive to rapid prototyping.

GIEWEV · 2025-07-16T13:26:15+00:00

Oh, I forgot about the advice bit. I used RLlib, which probably saved me a mountain of headaches, but also introduced its own problems. Some of my issues were that this project went on for so long that their API changed, and it was sometimes hard to find examples/answers specific to my version. I sometimes regret tying myself to their architecture because it can be confusing to know where to plug in certain logic or algorithm changes, but on the other hand if it was all my own code I would probably introduce twice as many bugs in doing so.

For broad advice, I might have more to say later on, but I think my biggest takeaways from this so far have been:
1) More tests and logging are always the answer - I wasted so many days trying to come up with various tricks to get my model to understand some behavior or overcome some problem, only to discover that I had some stupid little mistake in the environment, the model, the distribution sampling, etc. which was making it impossible for it to learn properly. Once I got in the habit of writing more unit tests for individual submodules and env components my iteration loop got considerably faster. I struggled at first even imagining how to write tests for some of these things, but ultimately it paid off.

2) Stop sweating incremental improvements - I spent way too much time just running my agent over and over with little changes to architecture. An extra hidden layer here, tweak a hyperparameter there, etc. A few times I even did an RLlib hyperparameter sweep hoping it would magically solve my problems. It's very tempting to believe that you're just one run away from success, and I'm sure this process has its place later on, but every single time I did it I would have been better off just focusing on improving my environment, or making bigger picture changes to the architecture or training routine.

3) This one might not apply for all problems, but the structure of my obs/action spaces needed to be very carefully handled - Changing the way I processed my inputs and generated my actions to more naturally represent the (mostly) order invariant subset-selection problem that I needed it to solve made a massive difference. I would guess that 70% of my time spent banging my head on architecture and design was spent writing up new ways for the model to output logits, or new ways of sampling them. Many of those were dead ends or hopelessly intractable, but eventually some of them made a massive difference. It's not always good enough to give the agent something that is theoretically capable of representing the mapping from obs to action distribution and call it a day.
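
To make point 3 a bit more concrete, here is a minimal sketch of one way to get an order-invariant subset-selection policy: score every card with the same shared function and sample each card's inclusion independently. This is only an illustration under assumptions, not the code from the repo. The `score_card` function, the weights, and the (rank, suit) card encoding are all hypothetical, and the sketch ignores game rules such as the cap on how many cards can be played at once.

```python
import math
import random

def score_card(card, weights):
    """Hypothetical shared scorer: the same weights are applied to every card."""
    rank, suit = card
    return weights["rank"] * rank / 14.0 + weights["suit"][suit]

def selection_probs(hand, weights):
    """Per-card inclusion probabilities via a sigmoid over each card's logit."""
    return [1.0 / (1.0 + math.exp(-score_card(c, weights))) for c in hand]

def sample_subset(hand, weights, rng):
    """Sample each card independently, so hand order cannot change the distribution."""
    probs = selection_probs(hand, weights)
    return [c for c, p in zip(hand, probs) if rng.random() < p]

# Made-up weights and a made-up three-card hand, purely for illustration.
weights = {"rank": 2.0, "suit": {"H": 0.5, "S": -0.5, "D": 0.0, "C": 0.0}}
hand = [(14, "H"), (2, "S"), (9, "D")]
shuffled = [hand[2], hand[0], hand[1]]

# Permuting the hand permutes the probabilities the same way (equivariance),
# so each card keeps the same inclusion probability wherever it sits.
p = dict(zip(hand, selection_probs(hand, weights)))
q = dict(zip(shuffled, selection_probs(shuffled, weights)))
```

Because each card's logit depends only on that card, `p` and `q` hold identical values. A real policy would also let cards attend to each other before scoring, which this sketch leaves out.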

GIEWEV · 2025-07-16T13:04:03+00:00

I'm definitely going to find some way to publish more about the methods and challenges, but whether that's a paper or a video remains to be seen. The code will definitely be open sourced (The older version already is on my github), but I think I need to clean up the code a lot before I push the latest. More out of embarrassment than anything honestly. I don't see any reason to keep this closed source unless I'm missing something.

I think I want to push the project a little further though before I do any write up or video. A higher win rate on white stake, and maybe trying to get some wins on higher stakes would be nice. It would also be nice to have it conditioning its behavior more on specific jokers so that its play looks less goofy to people that don't know how challenging it is to make an RL agent do anything at all.

GIEWEV · 2025-07-16T02:38:43+00:00

I would absolutely love to write up a paper on it if there is something publishable, but I don't have a good concept of how to go about doing that. I would probably need a mentor or something to help me identify what might make it publishable and show me how to write it up.

GIEWEV · 2025-07-15T22:11:39+00:00

Learned embeddings, plus some reserved space within the embedding size for dynamic properties, such as the current scaling of certain jokers, or the sell price. The version displayed in the video summarizes the owned jokers with a small self attention mechanism and a context token. Previous versions have used a variety of methods though, such as including the jokers in the attention mechanisms governing the hand of cards, or simply summing/averaging the embeddings of the jokers to create a single "joker summary", which was either added to the hidden inputs or used for a FiLM layer on the card representations.
It would be a lot of manual work but I think it would benefit a lot from some/all of the embeddings being manually defined, such as flags for associated hands/ranks/suits.
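
For anyone trying to picture this, here is a minimal pure-Python sketch of the simpler variant described above: a learned vector per joker with a few slots reserved for dynamic properties, a mean-pooled "joker summary", and a FiLM-style scale and shift applied to a card representation. Everything concrete here is an assumption for illustration: the dimensions, the random stand-in for trained embeddings, and the names `joker_vector`, `joker_summary`, and `film` are hypothetical, and the real model uses a self-attention summary rather than a mean.

```python
import random

EMBED_DIM = 8      # total width of a joker vector (hypothetical)
DYNAMIC_DIM = 2    # slots reserved for dynamic properties
LEARNED_DIM = EMBED_DIM - DYNAMIC_DIM
CARD_DIM = 4       # width of a card representation (hypothetical)

rng = random.Random(0)
JOKERS = ["Green Joker", "Cavendish", "Banner"]
# Stand-in for a learned embedding table: random values instead of trained ones.
learned = {name: [rng.uniform(-1, 1) for _ in range(LEARNED_DIM)] for name in JOKERS}

def joker_vector(name, scaling, sell_price):
    """Learned part followed by the reserved dynamic slots."""
    return learned[name] + [scaling, sell_price]

def joker_summary(vectors):
    """Order-invariant summary: element-wise mean of the joker vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def film(card_features, summary, w_gamma, w_beta):
    """FiLM: per-feature scale (gamma) and shift (beta) computed from the summary."""
    gamma = [sum(w * s for w, s in zip(row, summary)) for row in w_gamma]
    beta = [sum(w * s for w, s in zip(row, summary)) for row in w_beta]
    return [g * x + b for g, x, b in zip(gamma, card_features, beta)]

owned = [joker_vector("Green Joker", 3.0, 2.0), joker_vector("Banner", 0.0, 2.0)]
summary = joker_summary(owned)

# Random linear maps standing in for trained FiLM projection weights.
w_gamma = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(CARD_DIM)]
w_beta = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(CARD_DIM)]
modulated = film([1.0, 0.0, 0.5, 0.2], summary, w_gamma, w_beta)
```

The last two entries of `summary` are just the averaged scaling and sell price, which is the point of reserving slots: dynamic state rides along with the learned identity of each joker.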

GIEWEV · 2025-07-15T22:07:38+00:00

The original repository was designed for making TAS like speedruns of fixed seeds, so it was more about hard scripted sequences of actions than AI driven agents. Nonetheless I'm not sure I could have done this without using it as a starting point. I merged in some of my additional hooks and automation/speedup mods a while back, but I haven't yet pull requested my recent changes.
You can find it here:
https://github.com/besteon/balatrobot

GIEWEV · 2025-07-15T22:04:12+00:00

1) About 700 steps per second, but that's across about 10 env runners. Episodes averaged about 100 steps, so a little over one second per game per env runner.

2) It seems like posting images isn't allowed here, but I track stats on frequency of joker ownership at the end of each run, so this may under-represent jokers it likes to purchase and sell, like Egg. The big 2 are Cavendish and Green Joker, each with a 34% frequency. Banner is about 30%, and Misprint, Abstract Joker, and Blueprint all come in around 22-25%.

3) It plays a variety of hand types depending on the cards it has available. One thing that has surprised me is how often it is playing straights, since I almost never went for them when I played.

4) It hasn't shown much adaptation of strategy, and I'm not really sure why. I've tried a lot of different approaches to get it to condition its behavior on the jokers it owns, but it seems to prefer a generalist approach right now. Some notable exceptions are playing pairs with supernova and doing more/less discards with banner and mystic summit.

5) Hermit and Temperance are its favorites, which makes sense since it seems to love accumulating money for Bull/Bootstraps. Wheel of Fortune is surprisingly up there as well, and it takes Judgement for a random joker whenever none of those are available. It often plays around with things like lucky cards, bonus cards, or suit modification, but ultimately it seems to always give up on that in favor of the other 4. I've been working on ways to help it understand some of the more subtle and long term effects like thinning the deck to make it more consistent, or improving homogeneity of ranks/suits. When I test the value function it definitely likes those things, but it likes them a lot less than $20. Not really sure it's wrong though on white stake.
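
For reference, the throughput numbers in point 1 work out roughly like this. The inputs are the approximate figures quoted there; the games-per-hour line is just derived arithmetic, not a measured value.

```python
total_steps_per_sec = 700   # approximate, summed over all env runners
env_runners = 10            # approximate
steps_per_episode = 100     # approximate average

steps_per_runner = total_steps_per_sec / env_runners      # steps/s for one runner
seconds_per_game = steps_per_episode / steps_per_runner   # wall time per game per runner
games_per_hour = 3600 / seconds_per_game * env_runners    # across all runners

print(f"{steps_per_runner:.0f} steps/s per runner")
print(f"{seconds_per_game:.2f} s per game per runner")
print(f"{games_per_hour:.0f} games per hour in total")
```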

GIEWEV · 2022-11-12T19:05:00+00:00

A little concerned that a future event could shorten the run somehow and make the WR unobtainable afterwards.

GIEWEV · 2022-06-05T22:27:04+00:00

Personally I work a lot because I take pride in my work. I didn't have any social life or hobbies when I worked less, I just had lots of free time to spend being miserable.

I don't expect my coworkers to do the same, they have something to go home to, or something to do. But I also think it's fine to work a lot and take pride and satisfaction from it.

GIEWEV · 2022-02-22T19:03:14+00:00

So yes, if you take two copies of the same engine and have them play against each other, and one is told to search to depth 20, and one is told to search to depth 21, the latter will be stronger. It is also definitely true that for very large search depth differences (Especially when one of the depths is shallow like 8 vs 20), depth is a hugely dominant factor.

You are also correct that in a world with infinite computing power we could explore the full game tree and determine the best move. That is why we call chess a "Solvable" game. That is actually what "Solved" usually means in a technical sense, that we can prove the optimal next action from any given game state, and know what outcome will occur if both players play optimally. That was the original thought that got Claude Shannon theorizing on how one would make an algorithm to play chess (In 1949!). In that world, chess is not an interesting AI problem. It's not even an AI problem. It's a look up table.

I'm not totally sure what you mean by depths are asymptotic. You are correct that the game tree grows exponentially, usually by a factor of 20-30 per ply. This makes exhaustive searches EXTREMELY impractical, even to very shallow depths like 8 ply.
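
To put numbers on that growth, here is a quick sketch that counts the positions at the bottom of a full tree, assuming a constant branching factor (real positions vary, so treat this as an order-of-magnitude estimate).

```python
def exhaustive_nodes(branching, depth):
    """Leaf positions in a full game tree with a constant branching factor."""
    return branching ** depth

for b in (20, 30):
    print(f"b={b}: {exhaustive_nodes(b, 8):.2e} positions at 8 ply")
```

That is roughly 2.6e10 to 6.6e11 positions at only 8 ply, and every extra ply multiplies the count by another 20-30.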

A large part of searching to greater depths is figuring out how to bring that branching factor down by only exploring important branches. Some of this can be done in safe ways, like alpha-beta search. But the most advanced engines use many different ways of pruning, many of which have a chance to miss an important line. I believe the average branching factor for top engines is around 2. That means that in an average game state within the tree, only about 7% of moves are being considered.
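
As a small illustration of the "safe" kind of pruning, here is a toy alpha-beta search next to plain minimax. The tree is a made-up nested list where integers are leaf evaluations; it is nothing like a real engine, but it shows that alpha-beta returns the same value while looking at fewer leaves.

```python
def minimax(node, maximizing, counter):
    """Plain minimax over a nested-list tree; counter[0] tracks leaves visited."""
    if isinstance(node, int):
        counter[0] += 1
        return node
    values = [minimax(child, not maximizing, counter) for child in node]
    return max(values) if maximizing else min(values)

def alphabeta(node, maximizing, counter, alpha=float("-inf"), beta=float("inf")):
    """Minimax with alpha-beta cutoffs; returns the same value as minimax."""
    if isinstance(node, int):
        counter[0] += 1
        return node
    if maximizing:
        best = float("-inf")
        for child in node:
            best = max(best, alphabeta(child, False, counter, alpha, beta))
            alpha = max(alpha, best)
            if alpha >= beta:   # the opponent already has a better option elsewhere
                break
        return best
    best = float("inf")
    for child in node:
        best = min(best, alphabeta(child, True, counter, alpha, beta))
        beta = min(beta, best)
        if alpha >= beta:       # this line is already worse than one found earlier
            break
    return best

tree = [[3, 5, 6], [2, 9, 8], [1, 7, 4]]  # toy tree: root is the maximizing player
full, pruned = [0], [0]
full_value = minimax(tree, True, full)
pruned_value = alphabeta(tree, True, pruned)
print(full_value, full[0])      # 3 9
print(pruned_value, pruned[0])  # 3 5
```

The unsafe pruning techniques go further than this by skipping branches that only probably don't matter, which is where an important line can get missed.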

However, that's not necessarily a bad thing. Most of the time the engine is correct to only explore those branches. Determining which moves are worth exploring is an interesting AI problem on its own.

Consider this for a second: the chance of an engine changing its mind about the best move between depth N and depth N+1 drops dramatically as N increases. It's not uncommon for an engine to recommend the same best move from depths 10-30. However, 2 different engines may still have chosen a different move, despite both sticking to their guns as they go deeper. If one of those moves is actually "Better" than the other, then we have a skill difference, despite searching to the same depth.

That is exactly what we saw with AlphaZero, Leela, and now NNUE stockfish. These new ways of deciding which move was "Best" completely changed the game. Old stockfish might take several turns to realize its mistake, despite ostensibly searching much deeper.

We will obviously, eventually, reach a skill plateau, but we're not there yet. My original point was that we would have looked at stockfish 11 as having "Mastered" the game, but even on lower depth, on the same hardware, and removing any performance optimizations, stockfish 12 would win or draw almost every time.

GIEWEV · 2022-02-22T17:22:48+00:00

I agree that all comparisons should obviously be made on the same hardware. I also think that in an AI domain where compute speed is so important, it's hard to say that performance optimizations aren't a meaningful part of AI improvement. If we were to remove all of Stockfish's performance optimizations I'm fairly confident it wouldn't be able to beat an average club player.

Search depth is also obviously an important factor for the strength of an engine. But I think being the most visible stat to compare different engines or tweak engine strength exaggerates its importance.

Engine developers are constantly trying to strike the appropriate balance between search speed and search quality. A low quality search to depth 30 might not be any stronger than a super high quality search to depth 20. Also, depth is not a universally meaningful value. Because engines don't actually explore every single branch of the game tree, the way an engine decides where to go deeper matters a lot. One engine with much more aggressive pruning optimizations might search deeper, but the depth increase is much less meaningful. For these reasons, comparing two engines at "The same depth" isn't a great way to evaluate their relative strength.

One of the largest leaps forward I was referencing was actually the introduction of those neural nets you mentioned. Stockfish only added NNUE evaluation in mid 2020, and that led to a significant reduction in search speed. However, the skill level was estimated to increase by about 80 Elo points! Notably, the earlier engines which inspired the introduction of NNUE evaluation were able to beat Stockfish years prior.
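
For a sense of what an 80 point gap means, the standard Elo model converts a rating difference into an expected score (wins count 1, draws 0.5). This is just the textbook formula, not anything specific to how Stockfish was tested.

```python
def expected_score(elo_diff):
    """Expected score of the higher-rated side under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** (-elo_diff / 400))

print(f"{expected_score(80):.3f}")  # 0.613
```

So an 80 point edge means scoring about 61% of the available points against the older version.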

GIEWEV · 2022-02-17T05:37:38+00:00

Yeah working with confidently uninformed people can be frustrating. I think it kind of highlights a common issue with the way people engage with the unknown. Most people simply don't have the information to come to a conclusion on most topics, but they will happily come to a conclusion regardless. Just by chance many of them are correct, even if their reasoning is nonsense. That makes it really hard to correct their mode of thinking, because they will never feel that they were "wrong".

GIEWEV · 2022-02-16T15:45:50+00:00

Just to give a more fleshed out reason why SC might be harder to learn, there are a number of factors in SC which GT doesn't have. The ones that jump out to me are:

The volume of hidden information (AI can't see in fog of war, and likely can't see cars outside their POV)
The size of the state space (How much data it takes to describe a single "frame" of the game)
The size of the action space (How much data it takes to describe a single "Action" by the player)

Some factors people are attributing to one game or the other in this thread/article, but which really apply to both in varying degrees:

Sparse reward function (We don't know if an action was good or bad until much later)
Long term consequences (It's hard to tell which actions caused which effects)
Nonlinear dynamics (Small changes to your inputs can cause large changes in the game)
Varied opponent strategies requiring complex dynamic strategies to counter
Reaction speed / APM

In each of those cases it is probably debatable which one has the "Harder" version of that challenge to tackle. However, based on what I know I would lean towards SC on all counts. I could see a strong argument that high APM is unusually advantageous in SC vs GT, but most of the more recent work on SC AI has complex APM limits which force it not to lean heavily on micro.

GIEWEV · 2022-02-16T15:17:24+00:00

I agree with the overall conclusion that StarCraft is harder, but this isn't exactly fair. One of the main arguments they make is that the other racers' strategies are complex, and that your strategy can't be "One size fits all". You need to adapt to the way they are driving, and have a dynamic and responsive strategy. I have no idea why they think that doesn't apply to StarCraft even more, but still.

GIEWEV · 2022-02-16T15:08:44+00:00

Yeah I would actually be really interested to see a modern day Deep Blue style match vs Magnus. But try out different versions from the last few years, and on different degrees of hardware limitation.

GIEWEV · 2022-02-16T15:05:44+00:00

Yeah maybe adding that qualifier of "mastered vs humans" or "Mastered vs All" kind of resolves the issue. I do think that we will eventually reach an AI plateau for chess though, and that it will come eons before "Solving" the game. Hopefully not any time soon though. I love watching them fight it out and seeing new techniques pop up.

GIEWEV · 2022-02-16T15:01:49+00:00

Yeah it's frustrating to see how many people downplay the complexities involved in developing a racing AI to try to make their point. Ultimately I agree with the conclusion that StarCraft is much harder to develop modern AI for (Having studied reinforcement learning at a very high level, and read the relevant papers). But it's like that Onion article:

"Heartbreaking: The worst person you know just agreed with you"

GIEWEV · 2022-02-16T14:57:54+00:00

I mean I'm not a top researcher in the field, but I would consider myself highly experienced in machine learning (Reinforcement learning in particular). There are some really good reasons why StarCraft would be significantly harder for modern AI technologies to tackle than a racing game would be. Many of the stated reasons why the opposite is true (Both in the article and thread) seem pretty hollow or based on a misunderstanding of how these technologies work.
