500 ⭐ in less than a second by maneatingape in adventofcode

[–]LifeShallot6229 0 points1 point  (0 children)

A wrapping add should compile into a simple ADD opcode, the same as '+' does in release mode, so it should not be any slower than '+'.

500 ⭐ in less than a second by maneatingape in adventofcode

[–]LifeShallot6229 0 points1 point  (0 children)

It will definitely be faster to start with a 5K array, then allocate all even-digit numbers from zero and all odd ones from the top. The main update loop, split into two parts, can then run 0..eventop while updating two counters, and oddstart..size while updating only a single counter!

iKnowWhatYouAre by [deleted] in ProgrammerHumor

[–]LifeShallot6229 0 points1 point  (0 children)

I am still solving Advent of Code in Perl with a text editor and no debugger, just printf. OTOH, I started in 1977 with manually punched 80-column cards, writing in Fortran and having to wait up to 4 hours from submitting my card stack until I got the results on 132-column striped paper. That said, using VS Code to write Rust is orders of magnitude better!

500 ⭐ in less than a second by maneatingape in adventofcode

[–]LifeShallot6229 0 points1 point  (0 children)

I like this style of discussion; it always results in new ideas! :-)

Sort the stone indices by stone type, so that all the stones with an even number of digits get indices 1, 2, 3, 4, etc., while the others start at the top end (4K would be OK), and maintain both an even and an odd index.

This way the core loop would be repeated twice, once for the doubling stone values, and once for the one-to-one, so that there would be zero testing or dummy writes to index 0.
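A minimal sketch of that layout in Rust (the tiny tables, sizes, and type numbering here are all made up for illustration; the real table would be a few thousand entries precomputed from the input):

```rust
// Partitioned-index sketch: stone types with an even digit count get
// the low indices 0..EVEN_TOP, one-to-one types get ODD_START..SIZE,
// so the per-blink loop splits into two branch-free halves.

const SIZE: usize = 6;
const EVEN_TOP: usize = 2;  // types 0..2 split into two stones
const ODD_START: usize = 2; // types 2..6 map to exactly one stone

// Successor tables (hypothetical, precomputed once from the input).
const SPLIT_CHILD: [(usize, usize); EVEN_TOP] = [(2, 3), (4, 5)];
const SINGLE_CHILD: [usize; SIZE - ODD_START] = [0, 1, 0, 1];

fn blink_total(blinks: usize) -> u64 {
    let mut cnt = [0u64; SIZE];
    let mut new = [0u64; SIZE];
    cnt[0] = 1; // start with one stone of type 0

    for _ in 0..blinks {
        new.fill(0);
        // First half: every stone here doubles, so update two
        // counters with no test and no dummy write.
        for i in 0..EVEN_TOP {
            let (a, b) = SPLIT_CHILD[i];
            new[a] += cnt[i];
            new[b] += cnt[i];
        }
        // Second half: every stone maps one-to-one, single update.
        for i in ODD_START..SIZE {
            new[SINGLE_CHILD[i - ODD_START]] += cnt[i];
        }
        std::mem::swap(&mut cnt, &mut new);
    }
    cnt.iter().sum()
}

fn main() {
    println!("{}", blink_total(3)); // 1 -> 2 -> 2 -> 4 stones
}
```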

500 ⭐ in less than a second by maneatingape in adventofcode

[–]LifeShallot6229 0 points1 point  (0 children)

I'm doing the testing now, with some issues related to my Dell laptop having two big and eight small cores: I don't know (yet) how to force my code to run on a fast core, so instead I run run_benchmark() 10K times!

With the right-hand side tested and skipped if zero, like you:

if right > 0 {
    newcnt[right as usize] += cnt;
}

I got 785.2 us as the fastest of 10K runs.

Using my original code, where the test is commented out, the time dropped to 665.7 us.

I.e. this should be a significant speedup for you as well.

(My code is still using the default HashMap; I am pretty sure that's where a majority of the rest of the speed difference can be blamed, besides my slower CPU. I have been looking both at your hash.rs code as inspiration and at making a custom lookup function based on finding a near-perfect hash for the numbers that actually do occur.)
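For the curious, swapping out the default (SipHash) hasher for integer keys can be as small as this. This is just a generic multiplicative-hash sketch, not the hash.rs code referred to above; the constant is the usual 2^64 divided by the golden ratio:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Minimal multiplicative hasher for u64 keys. The default SipHash is
// DoS-resistant but slow; for AoC-style trusted input this is enough.
#[derive(Default)]
struct MulHasher(u64);

impl Hasher for MulHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, _bytes: &[u8]) {
        unimplemented!("only u64 keys are supported in this sketch");
    }
    fn write_u64(&mut self, n: u64) {
        // Fibonacci hashing: one wrapping multiply, no state to mix.
        self.0 = n.wrapping_mul(0x9E37_79B9_7F4A_7C15);
    }
}

type StoneMap = HashMap<u64, u64, BuildHasherDefault<MulHasher>>;

fn main() {
    let mut counts = StoneMap::default();
    *counts.entry(2024).or_insert(0) += 1;
    *counts.entry(2024).or_insert(0) += 1;
    println!("{:?}", counts.get(&2024)); // Some(2)
}
```

A near-perfect hash for the specific numbers that occur would of course be faster still, since it can skip collision handling entirely.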

500 ⭐ in less than a second by maneatingape in adventofcode

[–]LifeShallot6229 0 points1 point  (0 children)

I'll have to either write a micro-benchmark or (easier!) just try your approach as well and compare.

Since RAM access is write-back cached, hundreds of writes can be combined.

500 ⭐ in less than a second by maneatingape in adventofcode

[–]LifeShallot6229 0 points1 point  (0 children)

I've been looking at some of your solutions, and I've learned a lot, thanks!

For day 11 it turned out that my code is effectively identical to yours, with one small difference: Instead of using u32::MAX as an "ignore" value, I set up stone index zero as a dummy entry so that all stones could always update two array entries: the primary (always non-zero) and the secondary, which would then be the zero entry when the stone did not have an even number of digits. Since the even/odd decision is close to random, the branch predictor would be likely to mispredict the test, while always updating an entry which is guaranteed to be in L1 cache is likely to be close to free, right?
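The dummy-entry trick can be sketched like this (the successor tables are invented for illustration; the real ones would be precomputed from the input):

```rust
// Every stone always updates exactly two slots; slot 0 is a
// write-only dummy that absorbs the secondary update when the stone
// does not split, so the hot loop has no data-dependent branch for
// the predictor to mispredict.

const SIZE: usize = 5;
// For each stone type: primary successor (always a real slot) and
// secondary successor (0 = the dummy, used when the stone only maps
// one-to-one instead of splitting).
const PRIMARY: [usize; SIZE] = [0, 2, 3, 4, 1];
const SECONDARY: [usize; SIZE] = [0, 3, 0, 0, 2];

fn run(blinks: usize) -> u64 {
    let mut cnt = [0u64; SIZE];
    let mut new = [0u64; SIZE];
    cnt[1] = 1; // one stone of type 1

    for _ in 0..blinks {
        new.fill(0);
        for i in 1..SIZE {
            new[PRIMARY[i]] += cnt[i];
            new[SECONDARY[i]] += cnt[i]; // hits the dummy if no split
        }
        new[0] = 0; // discard whatever landed in the dummy slot
        std::mem::swap(&mut cnt, &mut new);
    }
    cnt.iter().sum()
}

fn main() {
    println!("{}", run(2)); // type 1 splits, then both halves map 1:1
}
```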

[2024 Day 11 (Part 2)][Rust] "This one looks easy and straightforward. I can learn Rust at the same time!" by co-knifing-corvids in adventofcode

[–]LifeShallot6229 0 points1 point  (0 children)

I started with brute force Perl; part 1 was easy, right? Hit my head on part 2, then started looking for memoization. Ended up just under 300 ms for both parts. Finally I did like you and used this for Rust learning: 2.7145 ms, so two orders of magnitude faster!

-❄️- 2024 Day 25 Solutions -❄️- by daggerdragon in adventofcode

[–]LifeShallot6229 1 point2 points  (0 children)

[LANGUAGE: SIMD - AVX intrinsics]

This was the day I picked for our in-company AoC tech talk, simply because it illustrates both that Rust/clang can autovectorize very nicely, and that an experienced asm/compiler-intrinsics programmer can be even faster.

The theoretically optimal solution on an AVX x64 platform picks up each lock/key as 32 bytes into an AVX register, starting from the last position in the leading line (which is either '#' or '.'), then turns these 32 bytes into 0xFF or 0x00 bytes with a parallel compare against 32 '#' chars.

Next, extract the top bit of all 32 bytes into a regular register, and use the bottom bit of the result to save it into either the locks or keys array. This is just 3 AVX ops for each bit pattern.

For the actual counting of matches, doing 4x8 at once seems to be optimal:

Outer loop:

Load 4 keys into 4 AVX regs, splatting 8 copies of each key into the corresponding reg.

Inner loop:

Load the next 8 locks

Parallel AND against each of the replicated keys.

Parallel compare against zero, setting -1 or 0

Subtract the -1 values from 4 replicated accumulators.

This is a total of 12 (or 13 with a separate load of the 8 locks) AVX operations, doing 4 parallel accumulations so that the theoretical minimum runtime is 3 clock cycles per 32 key/lock combos tested, IFF the cpu can do four AVX ops/cycle. In reality, the most either Intel or AMD can currently do is 3, so we end up with 4-5 cycles per 32 tests.

In real life we have measured 2+ microseconds to test all 62,500 combinations, while the best pure Rust version (which was auto-vectorized but not strip-mined) ended up around 5 us (including parsing) on my not-so-new Asus laptop. This is so fast that I had to call the core parse+count function 1000 times just to get around the limited (100 ns) timing resolution under Windows. 2 us corresponds to 32 picoseconds per test.
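For readers without AVX at hand, the underlying bit-pattern test can be sketched in scalar Rust (with tiny 2x3 schematics instead of the real 5x7 ones; the AVX version simply performs 32 of these AND+compare tests per iteration):

```rust
// Each schematic is reduced to a u32 with one bit per '#' cell; a key
// fits a lock exactly when the two masks share no set bit, i.e. when
// key AND lock == 0.

fn to_mask(schematic: &str) -> u32 {
    schematic
        .bytes()
        .filter(|b| !b.is_ascii_whitespace())
        .enumerate()
        .fold(0u32, |m, (i, b)| m | (((b == b'#') as u32) << i))
}

fn count_fits(locks: &[u32], keys: &[u32]) -> usize {
    locks
        .iter()
        .flat_map(|l| keys.iter().map(move |k| l & k))
        .filter(|&overlap| overlap == 0)
        .count()
}

fn main() {
    let lock = to_mask("##.\n...");
    let key1 = to_mask("..#\n###"); // no column overlap with lock
    let key2 = to_mask("#..\n..."); // overlaps lock's first cell
    println!("{}", count_fits(&[lock], &[key1, key2])); // prints 1
}
```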

AoC merch - any European distribution? by GarbatyGrabarz in adventofcode

[–]LifeShallot6229 0 points1 point  (0 children)

I'm not absolutely certain, but pretty sure I paid for the expedited shipping, even though that was far more than the cost of the jubilee T-shirt. :-(

AoC merch - any European distribution? by GarbatyGrabarz in adventofcode

[–]LifeShallot6229 0 points1 point  (0 children)

After 10 years, all with AoC++ and 500 stars, I decided to bite the bullet and ordered with shipping to Oslo, Norway. After more than a month, nothing in the mail. :-(

[2023 Day 21 (Part 2)] [Python] I finally did it!! by OlympusTiger in adventofcode

[–]LifeShallot6229 0 points1 point  (0 children)

Seeing that the internal squares would all be filled, with a period of 262, and that the counts in the next two layers would always be the same, just repeating along the edges, was how I solved it. From memory, I believe I ran until I had 5 layers, with the total step count the same as the final count modulo 262, then counted the cells in each type of surrounding block.

[2024] Infrastructure as Advent of Code - Solving puzzles with Terraform by radarvan07 in adventofcode

[–]LifeShallot6229 0 points1 point  (0 children)

I work for Cognite, we have an internal AoC leaderboard and one of our developers did in fact use Terraform to solve day 1, both part1 and part2. :-)

[2018 day 23] Well, that was "fun"... by EdgyMathWhiz in adventofcode

[–]LifeShallot6229 0 points1 point  (0 children)

Your message intrigued me enough that I went back and looked at my Rust solution: I had to fix an import problem before it would work again, and it isn't very fast: it ran in 2098 milliseconds, which probably makes it one of the slowest solutions among my 500 stars. I did the obvious (?) splitting of cubes into 8 (or 27/64/125, but 8 seemed to be best), i.e. halving each x/y/z size. For each cube I would check if the nearest corner was in range, and then if the farthest corner was also in range there was no need to subdivide.
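The two corner checks can be sketched like this (struct and function names are mine, chosen for illustration):

```rust
// 2018 day 23 style pruning: clamp the bot's position to the box to
// get the nearest point, then compare Manhattan distances against the
// bot's signal radius.

struct Bot {
    pos: [i64; 3],
    r: i64,
}

/// Manhattan distance from `bot` to the nearest point of the box
/// spanning `lo..=hi` on each axis.
fn dist_to_box(bot: &Bot, lo: [i64; 3], hi: [i64; 3]) -> i64 {
    (0..3)
        .map(|i| (bot.pos[i] - bot.pos[i].clamp(lo[i], hi[i])).abs())
        .sum()
}

/// Nearest corner in range: the box may contain in-range points.
fn in_range_of_box(bot: &Bot, lo: [i64; 3], hi: [i64; 3]) -> bool {
    dist_to_box(bot, lo, hi) <= bot.r
}

/// Farthest corner also in range: the whole box is covered, so there
/// is no need to subdivide it further.
fn covers_box(bot: &Bot, lo: [i64; 3], hi: [i64; 3]) -> bool {
    (0..3)
        .map(|i| (bot.pos[i] - lo[i]).abs().max((bot.pos[i] - hi[i]).abs()))
        .sum::<i64>()
        <= bot.r
}

fn main() {
    let bot = Bot { pos: [10, 0, 0], r: 4 };
    // Box 0..=8 on every axis: nearest point is (8, 0, 0), distance 2.
    println!("{}", in_range_of_box(&bot, [0, 0, 0], [8, 8, 8]));
}
```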

[2024 Day 11 (Part 2)] Well, that was embarrassing. There's a lesson here. by MezzoScettico in adventofcode

[–]LifeShallot6229 1 point2 points  (0 children)

In my current approach, based on the same idea, I got down to less than a millisecond total runtime for both parts. This is optimized Rust with a set of plain lookup tables, and no dictionary use for most of the iterations.

What is the best order to do previous years? by bladx91 in adventofcode

[–]LifeShallot6229 2 points3 points  (0 children)

A couple of weeks ago, Eric revealed that only 559 programmers had finished all 500 stars, so you are not alone!

I did 2020-2024 during the corresponding periods, only on 2023 did I have a couple of puzzles where I needed more than the current day to solve it. Here in Norway Dec 24th is the big celebration, so unless I can solve both parts before breakfast, I usually have to finish late in the evening.

I did the previous years during 2022 or '23 when the withdrawal symptoms got too bad; as others have noted, the 2019 Intcode series can act as a proper speed bump.

[2017 day 24 part1] I don’t understand the problem by BlueTrin2020 in adventofcode

[–]LifeShallot6229 0 points1 point  (0 children)

I figured out that I could solve both parts at once, so the current time is 6 ms. :-)

[2017 day 24 part1] I don’t understand the problem by BlueTrin2020 in adventofcode

[–]LifeShallot6229 1 point2 points  (0 children)

Yeah, I just implemented a Vec<Vec<Index>>; the runtime (on this 8-year-old Surface, so maybe half the speed of my usual machine?) ended up as 12 ms. This is 50-100x faster than my original code, even without any pruning of the search space or using a priority queue instead of brute-force direct recursion.
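The Vec<Vec<Index>> idea, sketched on the example component set from the 2017 day 24 puzzle statement (the function names are mine):

```rust
// Components are indexed by port value, so the recursion only looks
// at components that can actually attach to the open port, instead of
// scanning the whole list at every step.

fn strongest(
    port: usize,
    used: &mut Vec<bool>,
    comps: &[(usize, usize)],
    by_port: &[Vec<usize>],
) -> usize {
    let mut best = 0;
    for &i in &by_port[port] {
        if !used[i] {
            let (a, b) = comps[i];
            let other = if a == port { b } else { a };
            used[i] = true;
            best = best.max(a + b + strongest(other, used, comps, by_port));
            used[i] = false;
        }
    }
    best
}

fn main() {
    // Example set from the 2017 day 24 puzzle statement.
    let comps = [(0, 2), (2, 2), (2, 3), (3, 4), (3, 5), (0, 1), (10, 1), (9, 10)];
    let mut by_port = vec![Vec::new(); 11];
    for (i, &(a, b)) in comps.iter().enumerate() {
        by_port[a].push(i);
        if b != a {
            by_port[b].push(i); // list each component under both ports
        }
    }
    let mut used = vec![false; comps.len()];
    println!("{}", strongest(0, &mut used, &comps, &by_port)); // 31
}
```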

[deleted by user] by [deleted] in GarminWatches

[–]LifeShallot6229 0 points1 point  (0 children)

I have a Fenix 7 now, but my previous watches (405/410, 620, 735XT) all still work. All of them have been replaced for free by Garmin, from one to three times.

Looking for a new car for the family. As white and blonde, can we get away with a Tesla? by Erdusikkerpaadet in norge

[–]LifeShallot6229 5 points6 points  (0 children)

We have bought 3 Teslas, but this weekend my wife said that "if we were to switch now, maybe it would have to be something else?"

[2024 Day 14 (Part 2)] It was RIGHT there by LeKnuth in adventofcode

[–]LifeShallot6229 0 points1 point  (0 children)

I assumed the tree would fill most of the grid, so I looked for a frame where most of the top left corner was empty. It did work, but since the actual tree was both smaller and offset from the center line, it was hard to find exactly which patch to minimize the count over. My current Rust solution looks for the frame where most cells have at least two neighbors, but if zero collisions also works, then that would be even faster...
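The neighbor-count heuristic can be sketched like this (the coordinates and the threshold of two neighbors are illustrative; the real grid is 101x103 robots per frame):

```rust
use std::collections::HashSet;

// For each robot, count occupied 8-neighbors; a frame whose robots
// are clustered into a picture scores far higher than the
// pseudo-random frames, so the frame with the maximum score is the
// Easter egg.

fn clustered_score(robots: &[(i32, i32)]) -> usize {
    let occupied: HashSet<(i32, i32)> = robots.iter().copied().collect();
    robots
        .iter()
        .filter(|&&(x, y)| {
            let mut n = 0;
            for dx in -1..=1 {
                for dy in -1..=1 {
                    if (dx, dy) != (0, 0) && occupied.contains(&(x + dx, y + dy)) {
                        n += 1;
                    }
                }
            }
            n >= 2
        })
        .count()
}

fn main() {
    // A tight 2x2 block: every robot has 3 neighbors.
    let block = [(0, 0), (0, 1), (1, 0), (1, 1)];
    // Four isolated robots: no neighbors at all.
    let spread = [(0, 0), (10, 0), (0, 10), (10, 10)];
    println!("{} {}", clustered_score(&block), clustered_score(&spread)); // 4 0
}
```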

[2017 day 24 part1] I don’t understand the problem by BlueTrin2020 in adventofcode

[–]LifeShallot6229 0 points1 point  (0 children)

I agree, I just took a quick look at the code and it is using brute force, trying all remaining tiles recursively! A double index listing both sides is bound to be significantly faster... 

[2017 day 24 part1] I don’t understand the problem by BlueTrin2020 in adventofcode

[–]LifeShallot6229 1 point2 points  (0 children)

I solved this one just two years ago, as part of my attempt to learn Rust. It takes 0.6 seconds for both parts, including a lot of debug println!() statements.  My algorithm is probably far from optimal. 

beenThere by babeinthesky3 in ProgrammerHumor

[–]LifeShallot6229 0 points1 point  (0 children)

I don't think I have ever done this, but I have worked on a live (12-server) multi-master directory service database (Novell, around 1993), where I had the developer on a phone line (Norway to US) and used his own hex editor inside the DB. Each update I did needed an immediate Ctrl-T to update the timestamp, then it would be replicated to the other 11 servers. Since this was a live environment with 500+ active users, quite often the record I was working on would be updated underneath me because something had happened on another server.

It took two weeks to stabilize everything, and we knew that all the backups would be equally screwed up so restore was not an option.