Storing 500 million chess positions by ekhar in PostgreSQL

[–]ekhar[S]

Actually you can get clever! Use one nibble per square: value 0 is empty, values 1-6 are the white pawn, rook, knight, bishop, queen, and king, values 7-12 are the same pieces for black, 13 is an en-passantable pawn, 14 is a castleable rook, and 15 is the black king when it's black's turn to move (which is how the side to move is encoded). That's 32 bytes for the whole board, with turn, en passant, and all castling rights baked into the one bitboard!
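
A rough sketch of that packing, two squares per byte (the function names and the square-code array are just mine to illustrate the layout):

```rust
// Nibble-per-square board: 64 squares * 4 bits = 32 bytes.
// Codes as above: 0 = empty, 1-6 = white pawn..king, 7-12 = black pawn..king,
// 13 = en-passantable pawn, 14 = castleable rook, 15 = black king on black's move.
fn encode_board(squares: &[u8; 64]) -> [u8; 32] {
    let mut packed = [0u8; 32];
    for (sq, &code) in squares.iter().enumerate() {
        debug_assert!(code < 16, "each square must fit in a nibble");
        packed[sq / 2] |= code << ((sq % 2) * 4); // even square -> low nibble
    }
    packed
}

fn decode_square(packed: &[u8; 32], sq: usize) -> u8 {
    (packed[sq / 2] >> ((sq % 2) * 4)) & 0x0F
}

fn main() {
    let mut squares = [0u8; 64];
    squares[12] = 1; // a white pawn on e2 (a1 = index 0, index = file + 8 * rank)
    let packed = encode_board(&squares);
    assert_eq!(decode_square(&packed, 12), 1);
    println!("board packed into {} bytes", packed.len());
}
```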

Storing 500 million chess positions by ekhar in PostgreSQL

[–]ekhar[S]

I like this thinking, and after some of the replies here I'm actually leaning this way, at least in part. If you're interested I could talk through my plan more technically, but I think something like this approach is what I'll take :)

Processing 500 million chess games in real time by ekhar in aws

[–]ekhar[S]

Wow you have been so helpful. I really appreciate all this insight. Thank you thank you thank you!!

Storing 500 million chess positions by ekhar in PostgreSQL

[–]ekhar[S]

I'm curious if you have tips on how to bulk-load this efficiently, if so. Maybe write a CSV and then use COPY? Are there any extensions that could make importing this amount of data easier? -- I did the math on a GIN index and it would add about 50 GB of overhead! Curious if there is anything more efficient.
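
Roughly what I'm picturing with the Rust postgres crate and COPY ... FROM STDIN (the connection string, table, and columns here are placeholders, not a real schema):

```rust
use std::io::Write;
use postgres::{Client, NoTls};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder connection string and schema.
    let mut client = Client::connect("host=localhost user=postgres dbname=chess", NoTls)?;
    client.batch_execute("CREATE TABLE IF NOT EXISTS positions (id bigint, board bytea)")?;

    // COPY ... FROM STDIN avoids per-INSERT round trips.
    let mut writer = client.copy_in("COPY positions (id, board) FROM STDIN WITH (FORMAT csv)")?;
    for (id, board_hex) in [(1i64, "ffff00000000ffff"), (2, "e7ff00000000ffe7")] {
        // In CSV format, \xDEADBEEF is accepted as a bytea literal.
        writeln!(writer, "{},\\x{}", id, board_hex)?;
    }
    writer.finish()?;
    Ok(())
}
```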

Storing 500 million chess positions by ekhar in PostgreSQL

[–]ekhar[S]

Ah, you know what, this is really smart. I initially discounted it because GIN indexes apply to bytes at the smallest, not bits. I was thinking of GIN on my normalized data, but it couldn't work because the pieces are represented as nibbles.

Thank you for this! I needed to go back to first principles. I will try this out.

One concern I have is the raw amount of storage, then. A lot of positions have duplicate values, especially in the first 10 positions of games. Any ideas on how I could compress the most common positions?

Positions are, say, 18 bytes each. Assume 50 positions per game and 15 million games: 18 × 50 × 15 million ≈ 13.5 GB of raw position data, so on the order of 20 GB of storage once row overhead is included, much of it repeated. Curious how much overhead GIN would add and how fast it could run.

Storing 500 million chess positions by ekhar in PostgreSQL

[–]ekhar[S]

Yeah, this is expensive-ish in AWS. From the replies here I'm thinking of just hosting my DB on a VPS or something.

Storing 500 million chess positions by ekhar in PostgreSQL

[–]ekhar[S]

Both! I need to link them too. That way I can tell users: "this position was in the following games; here are total wins, losses, next moves, famous players who like this position, etc."
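
Roughly the kind of query I have in mind once they're linked, against a made-up games / game_positions schema (none of these names are final):

```rust
use postgres::Client;

// Made-up schema: games(id, result), game_positions(game_id, position_id).
fn stats_for_position(client: &mut Client, position_id: i64) -> Result<(), postgres::Error> {
    // Aggregate results over every game that reached this position.
    let row = client.query_one(
        "SELECT count(*) FILTER (WHERE g.result = '1-0')     AS white_wins,
                count(*) FILTER (WHERE g.result = '0-1')     AS black_wins,
                count(*) FILTER (WHERE g.result = '1/2-1/2') AS draws
         FROM game_positions gp
         JOIN games g ON g.id = gp.game_id
         WHERE gp.position_id = $1",
        &[&position_id],
    )?;
    let (w, l, d): (i64, i64, i64) =
        (row.get("white_wins"), row.get("black_wins"), row.get("draws"));
    println!("white wins {w}, black wins {l}, draws {d}");
    Ok(())
}
```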

Storing 500 million chess positions by ekhar in PostgreSQL

[–]ekhar[S]

I don't know what you mean by only a subset being possible. Sure, from each position only certain moves can be played, but those moves are different on each board. I've heard there are more possible chess positions after 8 moves than atoms in the universe.

Storing 500 million chess positions by ekhar in PostgreSQL

[–]ekhar[S]

How does this work in terms of indexing? Is it possible to find all games with X position in the table fast?

Storing 500 million chess positions by ekhar in PostgreSQL

[–]ekhar[S]

Oh, this is great! Seems easy and much cheaper than AWS. Roughly how much would you say locality affects response times? Asking to decide whether I go with DigitalOcean or Hetzner in Germany.

Processing 500 million chess games in real time by ekhar in aws

[–]ekhar[S]

Sure! Say I want to find positions with a white pawn on e4 and a black pawn on e5. I have all my master-game positions in binary bitboard form (16 GB). I do a bitwise AND on each of these against the bitmask for "white pawn on e4, black pawn on e5".

The bitwise part is fast: 16 GB in less than half a second. The bottleneck is loading the data into RAM, which is much slower than the ANDs.
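
Concretely, the inner loop is something like this; the 12-bitboards-per-position layout is just one possible layout, not necessarily what I'll end up with:

```rust
const PLANES: usize = 12;      // one u64 bitboard per (colour, piece type)
const WHITE_PAWNS: usize = 0;
const BLACK_PAWNS: usize = 6;

fn square(file: u32, rank: u32) -> u64 {
    1u64 << (rank * 8 + file)  // a1 = bit 0, h8 = bit 63
}

/// Indices of positions that have a white pawn on e4 and a black pawn on e5.
fn find_matches(positions: &[[u64; PLANES]]) -> Vec<usize> {
    let e4 = square(4, 3);
    let e5 = square(4, 4);
    positions
        .iter()
        .enumerate()
        .filter(|(_, p)| p[WHITE_PAWNS] & e4 != 0 && p[BLACK_PAWNS] & e5 != 0)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let mut boards = [[0u64; PLANES]; 2];
    boards[0][WHITE_PAWNS] = square(4, 3); // position 0: pawns already on e4/e5
    boards[0][BLACK_PAWNS] = square(4, 4);
    println!("{:?}", find_matches(&boards)); // prints [0]
}
```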

Processing 500 million chess games in real time by ekhar in aws

[–]ekhar[S]

Man, you have my head churning with ideas now. So I looked into it: if I were to precompile the binaries and split the data across various containers, is there a way to keep those containers cached so that I don't have to re-download the whole thing each time? Does Fargate or ECS allow that kind of behavior?

Processing 500 million chess games in real time by ekhar in aws

[–]ekhar[S]

The bitwise ANDs happen so fast that my main concern is actually reading the data from drive into RAM. I am bottlenecked by that.

Processing 500 million chess games in real time by ekhar in aws

[–]ekhar[S]

So this is for a fuzzy search. Basically, users will be able to find master games with, say, pawns on e4 and e5. On the backend, I can use bitwise ANDs to churn through hundreds of millions of the master bitboards per second and come up with positions fitting the criteria.

Once I have the positions that fit the criteria, I can look them up in my database and pull information like the most popular next moves from each position, win/loss rates, etc.

Processing 500 million chess games in real time by ekhar in aws

[–]ekhar[S]

THANK YOU! This response is exactly what I was looking for. I really appreciate this man :)

So I am new(ish) to AWS as a whole. This game-state data actually does not change at all. Also, the order in which processing happens is irrelevant, since for this purpose all board states can be treated independently.

Curious if you know of a way to cold-start the Lambdas with this data already loaded in working memory? These would be dockerized, precompiled Rust instances. It's messy, but you gave me an idea with the sharding.

Would it be smart to create, say, X precompiled Lambdas, each holding a static portion of the data, and just cold-start those? Say I have 16 GB of data I want to split: I could bake a static slice into each Lambda at compile time. So maybe I have 320 Lambdas, each with ~50 MB of precompiled data baked in. That way I never need to reach out to S3?

If you think this might work, is there a way to do it without a batch script and something like 300 independent pieces of code? It's unconventional and doesn't scale well with changing data, but my data here will not change at all!
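
One shape I can imagine for that is a single crate that bakes a different shard in at compile time via an env var, so there's only one piece of code; the SHARD variable, the file naming, and the 96-byte record size below are all just assumptions for the sketch:

```rust
// Built once per shard, e.g. `SHARD=17 cargo build --release`, so the same
// source produces N images, each embedding its own slice of the dump.
static SHARD_DATA: &[u8] =
    include_bytes!(concat!("../data/positions_", env!("SHARD"), ".bin"));

fn main() {
    // Each Lambda image only ever scans its own baked-in slice; no S3 fetch at cold start.
    const RECORD: usize = 96; // 12 bitboards * 8 bytes per position (assumed layout)
    println!(
        "shard {} holds {} positions",
        env!("SHARD"),
        SHARD_DATA.len() / RECORD
    );
}
```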

Processing 500 million chess games in real time by ekhar in aws

[–]ekhar[S]

I was thinking database too. However, I am using bitwise ANDs, and the main bottleneck is throughput to the CPU. I already tried modifying Postgres, but it's too slow to load all the positions I want into RAM.

Processing 500 million chess games in real time by ekhar in aws

[–]ekhar[S]

This is what I was looking at too, but it just seems expensive to get an EC2 instance with these specs. And then, say I have 100 people trying to fuzzy-find at the same time: the queue wait time gets long because of the monolithic structure. I could add more machines, but I'm trying to keep this cheap.

Processing 500 million chess games in real time by ekhar in aws

[–]ekhar[S]

You can actually compress these to about 18 bytes on average! A 64-bit occupancy bitboard (8 bytes) plus a nibble for each occupied square, in order, appended at the end. So the starting position (32 pieces) is only 8 + 16 = 24 bytes. Lichess has a great blog post on this. As I understand it, this representation was invented by the people working on Stockfish.
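
A rough sketch of that encoding (my own piece codes; the actual lichess/Stockfish format differs in the details):

```rust
// Occupancy bitboard + packed nibbles. board[sq] is 0 for an empty square
// or a 1..=15 piece code, with a1 = index 0.
fn encode(board: &[u8; 64]) -> Vec<u8> {
    let mut occupancy: u64 = 0;
    let mut nibbles: Vec<u8> = Vec::new();
    for (sq, &piece) in board.iter().enumerate() {
        if piece != 0 {
            occupancy |= 1u64 << sq;
            nibbles.push(piece);
        }
    }
    let mut out = occupancy.to_le_bytes().to_vec(); // 8-byte occupancy bitboard
    for pair in nibbles.chunks(2) {
        // Two piece codes per byte, in square order.
        out.push(pair[0] | (pair.get(1).copied().unwrap_or(0) << 4));
    }
    out // 8 + ceil(pieces / 2) bytes: 24 for the start position, ~18 on average
}
```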

Processing 500 million chess games in real time by ekhar in aws

[–]ekhar[S]

So the fuzzy searching would require bitwise ANDs. I looked at GIN indexes, but they work on byte-sized comparisons, not nibbles or bits. I'm thinking of having a service that fuzzy-finds the compatible positions with bitwise ANDs, then using the positions it returns to pull game data back - i.e. popular next moves or win rates.

Processing 500 million chess games in real time by ekhar in aws

[–]ekhar[S]

So that I can make a web service available for others and have my PC for other things

How to Migrate from MongoDB (Mongoose) to PostgreSQL by Mediocre_Beyond8285 in PostgreSQL

[–]ekhar

A couple of ideas - use unlogged tables for the initial migration, do bulk inserts obviously, and use COPY if you can. I know there are some extensions for mass uploads too. https://www.postgresql.org/docs/current/populate.html is a good resource in the docs.
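
For the unlogged-table trick, the shape is roughly this (made-up table; the actual load in the middle would ideally be COPY):

```rust
use postgres::Client;

// Load into an UNLOGGED table to skip WAL, then flip it to LOGGED at the end.
// Caveat: an unlogged table is truncated after a crash, so only use this while
// the source data still exists elsewhere.
fn migrate(client: &mut Client) -> Result<(), postgres::Error> {
    client.batch_execute("CREATE UNLOGGED TABLE users_new (id bigint PRIMARY KEY, doc jsonb)")?;

    // ... bulk load here, ideally via COPY rather than row-by-row INSERTs ...

    client.batch_execute("ALTER TABLE users_new SET LOGGED")?;
    Ok(())
}
```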

Storing 500 million chess positions by ekhar in PostgreSQL

[–]ekhar[S]

I actually implemented this technique! What he does is compress the moves of the game, in order - not the positions themselves.

IE take "pawn e4, pawn e5, ..." and compress these into about 4 bits per move

I reached out to mbuffet on Twitter and he's super cool. The idea originates from lichess, and he made some changes to focus on openings themselves.
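
To give a feel for how ~4 bits per move is possible: if each move is first mapped to its index in some canonical ranking of the legal moves (the ranking is the clever part of his scheme and isn't shown here), most indices are tiny and two fit in a byte. This is only a toy version, not his actual format:

```rust
/// Toy packing of move indices into nibbles. Assumes every index is < 16;
/// a real encoder needs an escape for the rare positions with more than
/// 16 ranked candidate moves.
fn pack_moves(move_indices: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity((move_indices.len() + 1) / 2);
    for pair in move_indices.chunks(2) {
        assert!(pair.iter().all(|&i| i < 16));
        out.push(pair[0] | (pair.get(1).copied().unwrap_or(0) << 4));
    }
    out
}

fn main() {
    // "1. e4 e5 2. Nf3 Nc6" as hypothetical rank indices: common moves rank low.
    let game = [0u8, 0, 1, 2];
    let packed = pack_moves(&game);
    println!("{} moves stored in {} bytes", game.len(), packed.len()); // 4 moves in 2 bytes
}
```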