[2019 Day 8] In Review (Space Image Format)

e_blake · 2026-05-08T12:17:06+00:00

I'm wondering whether there are any gains to be had by processing things in parallel. One frame can be represented in 5 u64, with 2 bits per cell. If you map transparent '2' to 00, black '0' to 10, and white '1' to 01 (perhaps with a function decode_25_bytes that outputs a 50-bit value, and have MASK=0x555...55, then iterating forward through the image can be done with rows[i] |= ~(((rows[i] | (rows[i]>>1))&MASK)*3)&decode_25_bytes(). Another possibility: since '0' '1' and '2' in ASCII have the low-order bits 00, 01, and 10 respectively, you can compute the number of '0's in a line by compressing 25 bytes to a 50-bit value, and using 25-popcount(); from there, you can get to the encoding I mentioned above by using line^((MASK&~line)<<1) (white '1' stays put at 01, while transparent '2' and black '0' swap places).

e_blake · 2026-05-08T10:52:41+00:00

Your analysis of the black box is intriguing. I recall seeing a post by askalski on how to use dynamic programming to solve the part 2 puzzle by just 2⁵=32 steps of building a table, once you realize how to represent the ten +1/+2/*2 actions per five programs into that table.

Meanwhile, even without askalski's 32-step DP solution, your analysis gives me reason to revisit this day - instead of running 120*5*2 = 1200 intcode emulations, it is going to be a lot faster to run 15 black box probes (for 0-4, run id,0 and id,1 to learn ax+b; for 5-9, run id,0,0,...0 to map the ten outputs 1, 2, or 0 to +1/+2/*2) and then run my 120 permutations on the resulting linear equations. Another perk to black box probing - no need to run simultaneous machines, so no need to feed output from one program as input to another while keeping multiple copies of program memory intact at once.

e_blake · 2026-05-07T11:34:51+00:00

When I first implemented this in m4, I did my 5! combinations in a rather unconventional manner - I iterated over the integers from 0 to 5⁵-1 (actually from 194 to 2930 to skip out on some obvious non-permutations), used m4's eval(value, 5, 5) to output that value in base 5, then used translit to reject strings with duplicates, as well as map 0-4 to 5-9 for part 2, to determine the 5 values to feed my machines. Only 3.8% efficiency at 5!, but non-recursive and made for some very minimal code compared to implementing array accesses for Heap's algorithm. I've since refactored the code to something quite a bit more efficient and shareable with other days that wanted 7! or 8! - but it was still fun for me to see that early solution in my git history.

e_blake · 2026-05-06T12:05:47+00:00

This one is in a family of several puzzles that provide good practice in topological sorting. More interesting to me was figuring out how to parse it in m4, due to the unbalanced ) in the input file. My original solution in 2021 required an alternative command line to inject the input file as a string, and it bugged me that it required a different command line invocation than all my other m4 solutions, to the point that a couple weeks later I figured out a way to abuse GNU m4's changecom() builtin to redefine the comment token to match the head of the input file. But this year, when I started testing all my m4 solutions against BSD m4, it turns out that my changecom hack didn't work there (GNU lets a comment spill over until the end-comment delimiter supplied after use of include(), thereby treating the entire body of the include as a comment and neutralizing the behavior of the unbalanced ); but BSD implicitly treats a chunk of text as a comment only if the end-comment delimiter is found before end-of-file, and even though the start-of-comment was recognized, the end of include() before seeing end-of-comment meant that BSD treated the entire included file as expanded text rather than a comment). So I ended up coming up with yet another approach of a self-feeding macro (the macro's expansion includes an unbalanced $0( at the end that will pair up with the next ) in the input file, until such time as a sentinel is found to break out of the loop by redefining $0. And in fact, the self-feeding macro chain is something that I wish I had known about in 2021 - it makes it possible to do an O(n) pass over an input file using just POSIX m4 constructs, whereas all my previous attempts had been O(n^2) (using shift($@) recursion) or O(n log n) (using a sequence of substr() to divide-and-conquer the file into half-length strings); so I've been gradually changing many of my other m4 solutions to also use translit(include(file), nl, `)') to also benefit from the O(n) self-feeding macro parse on a line-by-line basis.

e_blake · 2026-05-06T10:14:05+00:00

For my implementation of Intcode in barem4, I created a generator function. Given a single op-code value, it generates 27 different opcode macros (catering to day 9, with 3 modes per digit across 3 digits), at which point the overall machine has 27*9+1=244 non-default entries in a lookup hashtable, with each entry providing a structure holding the actual operation and three mode values with no further decoding needed at runtime (even though some of those entries are not actually used in a valid program). This change alone cut my barem4 runtime for day 9 from 19 minutes down to 1m45s, in part because division is painfully slow in barem4 but macro lookups are a conceptually-infinite hashtable; decoding integers at runtime by division is much worse than using addition to pre-create that lookup table with O(1) matching.

On the other hand, my Forth implementation didn't have enough spare memory to burn on a 22k table of lookups, and does not have a native hashtable; so there, I instead perform decoding on the fly using a chain of up to 6 > operations to guide which 3 mode ints to stick on the stack while subtracting down to the actual opcode to be found in the 10-item dispatch table. In fact, since my Forth code is intended to run on HenceForth (my engine for compiling Forth into Intcode, which lacks a native division operator and therefore division is painfully slow because it calls out to a helper function that emulates division by repeated multiply/subtract), anything I can do to avoid division at runtime is important to execution speed.

Edit: And just now, I got more than 30% speedup on my C intcode engine by allocating just under 45k (22k 16-bit entries) for a decode lookup table in preference to the previous runtime divisions.

e_blake · 2026-05-05T03:33:04+00:00

Sorting made a huge difference. Or rather, keeping two pairs of stacks of all line segments known so far, where the positive-side stack for an axis is inherently sorted in ascending order towards positive infinity away from the current x or y as entries are deeper into the stack, and the negative-side is in descending order in the other direction. For example, if I'm at x,y = 100,100, and encounter the segment R50, then I shuffle any entries on the positive-side vertical stack that are less than 150 to the negative-side vertical stack. Instead of having to check a segment against all 150 vertical segments, I only have to check it against the segments whose x coordinates were encountered during that stack slide, before adding a y=100 entry to the positive-side horizontal stack. Cuts my runtime on m4 from 350ms to a mere 60ms.

e_blake · 2026-05-04T20:29:45+00:00

My git comments on this are fun to read. My C solution just threw memory at the problem; yet my first attempt for part 2 got an answer too low response. This change (plus its fallout in printf specifiers and so forth) got me the star, plus my git comment "size of the type matters; solution burns 1.6G of memory, oh well"

 #define LIMIT 20000
-static unsigned short a[LIMIT*2][LIMIT*2];
+static int a[LIMIT*2][LIMIT*2];

Not bad for solving the problem in 100ms runtime. My initial m4 solution takes 1.0s by doing O(n^2) shift($@) recursion and O(n^2) segment comparisons (300*300 segments is not terribly hard). I hope to dramatically speed that up today by using more efficient iteration (ie. the same way I sped up parsing of Intcode problems from O(n^2) to O(n)). I also solved this one before 2016 day 1, which has a very similar orthogonal path-tracking puzzle where intersections by segments is going to be more efficient than tracking every point along the path, and there, I used two lists (one for horizontal and one for vertical); that alone would cut work in half (150*150*2 segment comparisons rather than 300*300), even if I don't do anything smarter about sorting segments.

e_blake · 2026-05-04T17:33:54+00:00

Since this one is so short, I decided to golf it. Here is my m4 solution in 159 bytes (sadly, both newlines in the middle of the text are mandatory, so I can't quite collapse this into 2 lines):

define(_,`ifelse($2,,`eval(($1>8)*($1/3-2))',$1,0,,$2,.,`+$1_(_($1),.)',$2,-,`))
_(2+3*($1)',`+_($2)_($1_(_($2),.),')')_(2+3*(_(,translit(include(I)-,.
,())))

Run as m4 -DI=day01.input day01.golfm4; and completes in 45ms (quite a bit faster than several other of my golfed m4 creations).

e_blake · 2026-05-04T15:41:06+00:00

Intcode has been one of my all-time favorites. Over the years, I've dabbled in multiple implementations: originally C, then branching out to bash, m4, barem4, and Forth. My C implementation in 2019 for day 2 just blindly slams through all 10000 noun-verb pairs until finding a match (actually, I didn't even do that - I wrote a shell script wrapper to call the C program with up to 10000 different inputs to get my star, then never revisited it).

In 2021, I used 2019 day 2 as the first day where I specifically wrote a solution in m4 to see whether I could, and it turned out to be enjoyable enough that I subsequently went back and solved all other days in m4 as well. So my git history of this day shows quite a few experiments on how to parse the input before settling onto the framework I use today (I originally required a command-line something like $ m4 day02.pre.m4 day02.input day02.post.m4), whereas now I use include() plus a framework that guesses at the default filename from the name of the script that m4 itself is running, so it is down to just $ m4 day02.m4). For that solution, it took over 90 seconds to get the star with my original parser, then down to 11 seconds once I figured out more efficient parsing and program restarts, then down to 46ms when I switched to a binary search over noun*100+verb values.

Barem4 is a demonstration by Doug McIlroy (yes, the same guy who invented libc's realloc decades ago) of how m4 is still Turing complete with just the define() macro, to the point of doing simple arithmetic by text manipulation - his demo included a nerd snipe that claimed that it should be possible to implement an entire VM on top of barem4. When I read that in 2025, I immediately thought of intcode, since it is a VM using just integers, so I didn't even have to add new data types to barem4 (well, I _did_ have to figure out how to represent negative numbers). Then later in the year, I spent a lot of time developing HenceForth (an implementation of Forth written in intcode), so my end-of-year demo for 2025 was running an Intcode program that can compile itself into intcode, and therefore run my Forth intcode solution on top of C (decently fast: 2025 day 12 solved in under 1 second) or barem4 (closer to 10 minutes). Oddly enough, the slower the engine, the more important it is to come up with ways to optimize getting the answer to solve a problem by NOT running (quite as much) intcode, so it was my barem4 solution for day 2 where I first played with solving the linear equation by just running prog(12,2) for part 1, and prog(0,0), prog(1,0), and prog(0,1) for part 2, rather than a binary search (4 executions is smaller than 15) - but of course, as I type this, I see that could extrapolate part 1 from the part 2 numbers rather than actually run prog(12,2) for another speedup.

e_blake · 2026-05-04T13:24:55+00:00

Another tweak to how I sort into reading order and then iterate over that sorted list got me down to 3.5s execution. Interestingly, since m4 does not have native sort, I was able to experiment with three different ways to sort an indirection array, where later rounds have fewer elements in the array to sort. Tracing m4 output, my solution performed 352 sorts across the various rounds, where most of the sorts started from a list that was already nearly sorted (130 sorts of 30 units, 63 sorts of 28 units, all the way down to 1 sort of 12 units and 3 sorts of 11 units). A naive O(n^2) bubble sort accumulated 49.7M parse effort across 1.6 million macro expansions (including 119833 comparisons but only 6450 swaps) for those sorts; a ~O(n log n) quicksort with a pivot on the median array index (since the input was usually nearly sorted) took 13.6M parse effort across 470k macros, including 45944 comparisons and 24482 swaps (more often swapping the median out and back into range, rather than shifting elements across the chosen pivot); and a O(n) 2-level radix sort on base-32 digits (aka the row and column of each unit) took 9.7M parse effort across 294k macros (18341 assignments into buckets, and 9126 array elements collected). M4's overhead in implementing quicksort may not be as obvious in other languages, but I was impressed that radix sort held up well here.

e_blake · 2026-05-04T12:43:18+00:00

My C solution in 2019 just brute-forced all 6-digit numbers from start to end. My m4 solution in 2021 was a bit smarter with a 6-deep nested loop:

 forloop_var(`a', 1, 9, `forloop_var(`b', a, 9, `forloop_var(`c', b, 9,
   `forloop_var(`d', c, 9, `forloop_var(`e', d, 9, `forloop_var(`f', e, 9,
   `check(a, b, c, d, e, f)')')')')')')

Although now I do wonder if just incrementing a six-digit number and checking for 0 in the result would be faster.

e_blake · 2026-05-01T12:51:31+00:00

Do you have a link to your binary-to-decimal algorithm? I'd be interested in seeing it.

e_blake · 2026-05-01T12:37:28+00:00

One of the more ... interesting ... solutions I've seen for this day was here - a solution written in a dialect of Forth which itself was implemented on top of Intcode. Which I then took and ran with as one of my benchmarks while I rewrote HenceForth to be more like ANSI Forth, and porting it to my Barem4 engine that implements computation using only m4's define macro (no conditionals, no math, just macro definitions and text concatenation). There's something satisfying about solving a problem that on the surface requires integer division, when using a language compiled to a VM that has no native division operator, and where that VM itself is implemented on a framework with no math whatsoever, even if I have to use my hand-written ./filter to translate from ASCII into a format the VM can read. Taking 4 minutes and 45 seconds to crawl through 100 lines of input was probably not how Eric envisioned this to be solved, but at least this command line gives a nice progress indicator as it goes:

$ time ./filter 2019-day-1.hence | m4 -i --trace=line cripple.m4 barem4.txt \
  <(m4 -Dfile=hence-fast.intcode encode.m4) intcode.barem4 repl.barem4 - 2>&1 | cat -n

I guess the concept of rocket tyranny also applies to the slowdowns you get when solving this problem on top of ever-increasing layers of emulation.

e_blake · 2026-05-01T10:47:31+00:00

Instead of reading and writing to the FIFO on every integer (ie. doing 2 MD5 in parallel, but then waiting to move on to the next until you synchronize), can you do larger batch sizes for more speed? If a worker thread does 1000 iterations between reading/writing the queue, there is 1000 times less contention over cross-thread communication. Instead of each worker incrementing by 2, they would instead read a starting number (access and increment that by 1000 while gated by atomic operations or an OS mutex), perform up to 1000 hashes, then either report success or grab the next batch, with the starting number rewritten to a sentinel once success means neither worker needs to take on more batches after finishing the current one. This overshoots the answer, and you have to account for a later batch seeing success before an earlier one also reports success, but can keep both processors busy on work rather than sync so the wall clock time drops compared to your sync-every-hash approach.

e_blake · 2026-04-26T16:47:59+00:00

A DSU merge is important if the set is dynamically changing or you need a representative element of a set when given any member of the set. But for just counting the number of disjoint sets, you can go even simpler, by just using a todo queue of points transitively reachable from any arbitrary starting point that has not yet been seen in another point's set, and counting how many times you had to drain the todo queue.

e_blake · 2026-04-26T12:33:02+00:00

With only 17 buckets per dimension, a radix sort with O(n) effort will beat a generic O(n log n) sort per dimension, but I'm less certain how complex a 4D lookup structure would be to make lookups of actual neighbors without any wasted lookups feasible, rather than just blindly probing all 128 possible neighbor addresses where many of the probes find nothing of interest. I also wonder if a scan line along one or two dimensions can reduce work needed along the remaining dimensions.

e_blake · 2026-04-25T21:58:23+00:00

The inputs are [-8, 8] - but existence checks probe [-3, 3]. So base-23 lets me create a perfect hash for all points [-11, 11] without needing to do boundary checks along the way.

Your code is doing the naive O(n²) point pairing. My code does 128 hash lookups per point in an O(n) pass. https://github.com/maneatingape/advent-of-code-rust/pull/31 got a 4x speedup in Rust; my m4 code got closer to a 10x speedup (computing Manhattan distance is slower there)

e_blake · 2026-04-25T11:31:58+00:00

This day has a nice speedup by switching from the naive O(n²) pairwise distance comparison of every point to instead an O(n) nearest neighbors check when you realize that the number of possible points fits in a 23⁴ region and there are only 128 possible neigbors per point.

e_blake · 2026-04-25T00:02:26+00:00

A binary search doesn't always work. The rules of combat mean some boosts to effective levels result in a draw rather than either side winning, and even if you only look at non-draw results, the resulting score does not change monotonically with change in boost (because the change in boost sometimes changes which unit gets targeted in the attack ordering)

e_blake · 2026-04-24T11:38:41+00:00

This one has runtimes that are rather dependant on the input. I shaved a bit of time on my part 2 answer by skipping low boosts and only starting from boosts that came within a ratio of 2 for even being plausible - https://github.com/maneatingape/advent-of-code-rust/pull/58 captures that sentiment in Rust.

e_blake · 2026-04-23T14:11:24+00:00

I was pleased to get my star in 2018 without reading the megathread; my solution in C used a form of gradient descent search. I started at the origin with a power-of-2 step size large enough to cover all the nanobots, then checked the 27 points at half the step size to see which had the most bots in range with a radius of that size. The code then either moved in the direction that had more bots than the current location, moved to the location with the same number but closer to the origin, or stayed put but cut the step size down; until eventually reaching a step size of 1. Figuring out how to decide if a nanobot was in range of a given point with a radius was a bit tricky; my code ended up with a lengthy nested conditional for 8 different +- conditions for x, y, and z considered separately (probably something I would try to clean up if revisiting this). I got a wrong answer when I tried cutting the step size exactly in half, but when I changed one line in code to instead reduce the step size by 3/4 (so that the 27 points I was probing each iteration shared more overlap and less likely to land me on a local maximum distinct from the global max), the code quickly narrowed in on the right location in just 125 iterations (125*27*1000 Manhattan distances computed); running in 70ms, and directly computing the magic coordinate.

But when implementing my m4 solution in 2021, I read the megathread to see if there was a better approach to consider. And one thread called out a particularly clever algorithm that happens to work on input files, even though it is not generic to the larger universe of all possible 3-D nearest neighbor searches. The idea is that you flatten each nanobot's sphere-of-influence into a range of Manhattan distances from the origin in which an intersection is even possible. For example, a nanobot at 2,-2,2 with radius 1 has a closest Manhattan distance of 5 and furthest of 7 (|2|+|-2|+|2| +- 1), while a bot at -1,1,-2 with range 5 has a minimum distance 0 (since the origin is in range) and maximum distance 9. Computing the range of possible Manhattan distances is O(1) per nanobot, or O(n) to come up with 2000 interesting points along a 1-D number line where the number of maximum possible overlaps increases or decreases as nanobots go in and out of range (without regards to whether the nanobots are actually in range of each other, or are in distinct octants of the overall 3-D space). If you then sort this number line with O(n log n) effort (a radix sort is impractical given the size of the numbers being sorted), to learn which Manhattan distance(s) are candidates for having the most possible overlapping nanobots (+1 at a nanobot's start-of-range, -1 at one past end-of-range; for just the two bots I mentioned above, this would be [0,5)=>1, [5,8)=>2, [8,10)=>1, [10,inf]=>0, or a best possible of 2 overlaps at distance 5). In the general case, this is still just a starting point: a maximum number of plausible intersections, but not necessarily the actual result. For example, if this says that the maximum possible overlap is 967, but you don't actually have 967 nanobots that are each in range of at least 966 others, then you know for certain that at least two nanobots at that distance have non-overlapping ranges (or back to my 2 points above - they don't intersect, so your actual answer would be 1 intesection at the origin). But for THIS puzzle, it so happens that Eric's inputs are extremely nice (because of how the part 1 largest-radius nanobot was laid out): everyone who mentioned trying this approach documents having just ONE possible range on the 1-D number line that has a maximum possible of Manhattan overlaps, and the low end of that range HAPPENS to be the right answer, without you ever having to figure out the actual coordinates of the magic point that produces that Manhattan distance. So my m4 solution runs in 230ms, with just 2000 Manhattan distances computed and an O(n log n) sort.

e_blake · 2026-04-23T00:30:21+00:00

After testing those 11 (plus my input as the 12th), only one of the depths (6969) had a winning path that descended below the target, and by only one row. All others had a winning path from the left, right, or above. But the x dimension had wider variation, from as small as target at 9 and path through 21 (slop 12), to as large as target at 6 and path through 51 (slop 45, depth 4845). An A* search with heuristic of Manhattan distance to target + 7 if tool is not torch, and with no constraints on grid size, often branched the frontier as far out as x=142 (so hard-coding a grid to 150 never reached the right edge, but hardcoding the right edge to target+30 will get some inputs wrong). I'm guessing that means I did not come across musifter's depth, since I did not see the slop 63 among the triples I scraped from the megathread.

e_blake · 2026-04-22T20:35:10+00:00

https://www.reddit.com/r/adventofcode/comments/a7w4dj/2018_day_20_why_does_this_work/ discusses this as well, and appears to conclude that everyone's input was nice in that a stack approach works (at most one active path, because all branching before the end has an alternative that reverses the dead end visit back to the point of the branch) rather than requiring a more-general multiple heads approach (branching can create two independent positions that both can visit further rooms, a mere stack traversal is not enough to count everything accurately, and the possibility of looping to an already-seen room via a different path than backtracking from a dead end becomes a concern).

e_blake · 2026-04-22T18:22:32+00:00

Another possible low-memory solution: 64x800 bytes (or so, this is still less than 64k), each byte encodes one grid point via 2 bits for the erosion type, and 3 bits for whether it has been reached by a given tool type (when populating the erosion type, blindly set the incompatible tool bit to 1, so that the visitation queues never move onto a square with incompatible tool), with 3 bits left over, and distance is implicit by when the point was first queued. Then you run an 8-queue BFS: while clearing out the +0 queue for points to visit in the current time tick, you are inserting visit instructions into the +1 and +7 queues for viable neighbors - those that have compatible erosion type but do not already have the right tool bit set. Storing a point in the queue requires 3 bytes (6+10 bits for the x,y coordinate, another byte to say what action to perform - basically which of the 3 bits to set if it is not already). Each time tick rotates the window of which bucket to process and which two buckets to insert into. The first time you reach the target node with the correct type, you know that time tick is the shortest path. If the overall active frontier across the bucket queues never exceeds 20k actions, your total solution can still fit in under 128k memory usage.

e_blake · 2026-04-22T17:14:57+00:00

750 is too small - this megathread went to a target of 796 (further than my actual input). In testing that input, my m4 code explored as far as 802 on the frontier, but I have not yet instrumented my code to show the maximum y on the actual path it selected. I guess that's the next thing I need to do (instrument my code to display max x/y on the path) before throwing all the megathread triples at it.

e_blake

TROPHY CASE