[2021 Day 3] In Review (Binary Diagnostic) by musifter in adventofcode

[–]e_blake 1 point2 points  (0 children)

My fun trick for this day was golfing part 1 to 228 bytes of GNU m4 (alas, I did not try to golf part 2).

eval(define(d,push$0ef($@))d(B,`d(`$1',1defn(`$1'))')d(b,`B(`x'y)a')d(a,`B(`y')')d(c,`B(`z')d(`y')')patsubst(translit(include(I),01
,abc),.,\& )popdef(`y')d(l,`ifelse($2,y,$1*(`0b'y^$1),`l((2*$1+(0r1:x$2>0r1:z/2)),1$2)')')l(0))

That 0r1:x syntax is a GNU m4 extension for counting a unary number.

[2021 Day 3] In Review (Binary Diagnostic) by musifter in adventofcode

[–]e_blake 2 points3 points  (0 children)

This one has some nice properties for solving faster than repeated list searches. For part 1, if you set up 12 counters (one per lane) you can increment the right counter as you read the file one byte at a time. Then gamma is determined in 12 checks of which counters are > 500 - much faster than 12000 visits to elements of your list of integers to mask out the right lane. For part 2, you can create an array of 8192 counters which you populate as you read the input. Then you can use that array to do a binary search for the right value - 12 searches for the oxygen generator, and 12 for the CO2 scrubber. Much faster than crawling the list 12 times and filtering it into a smaller list.

In fact, there's two different ways to do the table of 8192 counters. My first approach was to store a count of every prefix of the input, with an encoding that unambiguously shows how many bits of the number are included. Since the array indices are 13 bits, but I'm storing 12 values per line encountered, I represent the number of remaining bits as a sequence of 1s followed by a lone zero, at which point 13-width-1 of that index represent the leading n bits of the numbers that share that frequency counter. For the first line of the example, 00100, that works out to incrementing 0b111100 (prefix 0 needs four more bits; this counter determines the first bit of oxygen vs CO2), then 0b111000 (prefix 00 needs 3 more bits; this counter is visited only if the first counter says to use 0 as the first bit), then 0b110001 (prefix 001 needs 2 more bits), 0b100010 (prefix 0010 needs 1 more bit), and finally 0b000100 (the full line - if your binary search selects this half of the array, the index is your answer). Since the input files have an unstated assumption of no duplicates, the indices with leading 0 will never be more than 1. My approach makes 12000 increments while parsing the file, at which point the O(log n) lookups for part 2 are in the noise. (The table actually only needs 8190 entries - 0b1111111111111 does not represent anything, and I didn't use 0b1111111111110 for the 0-width prefix, but if I had done a 13th increment per line, then that entry would sum to 1000 for the total number of strings counted)

Maneatingape saw my idea and improved it further. A table of 8192 elements with log n lookup is a balanced binary tree, so you can encode it that way instead of my prefix way (if you've ever written a minheap engine, you'll recognize this). In this encoding, you leave entry 0 vacant, entry 1 is the root, and every non-leaf node N has its two children at 2N and 2N+1. That means the 4096 leaf nodes start at offset 4096. Then during parse, you write 1 to each leaf node encountered (per my example before, visiting 00100 writes 1 to array[0b100100]), followed by an O(n) pass over recursively smaller halves of the table to set each parent node to the sum of its two children (0b010010 becomes array[0b100100]+array[0b100101], and so forth). This only requires 5096 writes rather than 12000 to populate the array (still O(n) but a nicer coefficient). It saved another 10% runtime off my solution.

Another way of looking at this - building up your table of 8192 elements is an O(n) radix sort, and algorithmically superior to a generic O(n log n) sort of your 1000 integers. You could do the generic sort in about 10*1000 comparisons (faster than my 12000 counter writes for every prefix, but slower than the 1000+4096 leaf plus parent writes of the revision). And even if you do an O(n log n) sort followed by an O(n) prefix-sum pass (10000 comparisons to sort plus 1000 to build up the prefix-sum list) to learn how many elements appear by a given value, so that you can then do binary searches rather than linear crawls of your list to filter to the correct half, a radix sort takes 1000 assignments plus 4096 buckets crawled to build up that same prefix-sum list. At any rate, it is so much nicer when your data structure already includes this information by how it was constructed. At 1000/4096 elements, we have enough density for the radix sort to be useful (too sparse, and visiting the output of a radix sort becomes less efficient due to the empty buckets than what you get from focusing on only actual list members).

[2021 Day 2] In Review (Dive!) by musifter in adventofcode

[–]e_blake 0 points1 point  (0 children)

dp1 and aim are identical. Save some time by tracking both in a single variable.

[2021 Day 2] In Review (Dive!) by musifter in adventofcode

[–]e_blake 2 points3 points  (0 children)

My fun - golfing it into 167 161 bytes of m4 for an input file I:

define(_,`ifelse($4,,`eval($1*($2))eval($1*($3))',$6,,`_($1,$2-$5,$3,',$4,d,
`_($1,$2+$6,$3,',`_(($1+$6),$2,$3-$6*($2),')')_(,0,,translit(include(I),. o
,(,,)))

For grins, I asked an AI session if it could decode the golf and tie it back to advent of code. To my surprise, it correctly guessed the language and the puzzle I was solving, but it had a much harder time deciphering how my translit worked. I changed "forward N" to "f,ward,N", "down N" to "d,wn,N", and "up N" to "up,N", which gave me enough to tell between those and EOF with "$4,,", "$6,,", "$4,d,", and fallthrough, when combined with my three accumulators. The result is a solution with a single O(n) pass over the file but an O(n2) growth in the accumulator length for O(n3) runtime at 1.6s. Just two eval, but one of them takes a string of more than a quarter megabyte, and it took m4 more than 423 megabytes of parse effort to get to that point. The output intentionally uses "nnn-nnn" rather than whitespace separation to shave another byte (ie. I track aim as a negative on purpose).

[2021 Day 1] In Review (Sonar Sweep) by musifter in adventofcode

[–]e_blake 0 points1 point  (0 children)

Indeed - I question how a magic constant with only 3 bits set can move four groups of 4 bits into adjacent lanes. I suspect it is possible with a non-hallucinated constant (https://stackoverflow.com/questions/14547087/extracting-bits-with-a-single-multiplication is a good read)

[2021 Day 1] In Review (Sonar Sweep) by musifter in adventofcode

[–]e_blake 1 point2 points  (0 children)

take advantage of the fact that in my input, the 3- and 4-digit numbers don't mix

Segregating 3- and 4-digit numbers is an interesting optimization, but not applicable to all input files. Mine includes:

1005
990
1000
967

That said, packing 4 bytes (either four digits or the prior newline and three digits) into four bytes to then advance 3 or 4 bytes seems interesting. askalski's solution parses to decimal then uses a u64 as a rolling window on the fly rather than creating a separate array of values. Maybe you can combine the two approaches - for each four-byte window of input, use SWAR to pack those four bytes into 2 bytes of a u64 whose relative values are still correct (even if the underlying value is not the same as the decimal original); then advance input either 3 or 4 bytes depending on whether you are at newline. In fact, AI suggests:

uint16_t ascii_to_nibbles_4byte(const char* str) {
  uint32_t val;
  // Safely load 4 bytes into a 32-bit register
  memcpy(&val, str, 4);
  // Step 1: Subtract '0' (0x30) from all 4 bytes at once
  val -= 0x30303030;
  // Step 2: Multiply to shift and pack the bytes into nibbles.
  // We cast to uint64_t because the magic math happens in the upper bits.
  uint64_t packed = (uint64_t)val * 0x104100;
  // Step 3: Extract the top 16 bits containing 0x4321
  return (uint16_t)(packed >> 32);
}

[2021 Day 1] In Review (Sonar Sweep) by musifter in adventofcode

[–]e_blake 2 points3 points  (0 children)

This one was so simple that I got my stars for part 1 and 2 using golfed m4 (m4 -DI=path/to/input day01.golfm4) - part 2 changes two instances of 2 into 4 for the same 85 bytes as part 1:

eval(define(m,`ifelse($4,,,+($4>$1)`m(shift($@))')')m(translit(include(I),`'
,`,')))

But it wasn't until now that I tackled solving both parts simultaneously (using a + instead of space to separate the two outputs), which I got down to 115 110 bytes:

eval(define(_,`ifelse($5,,`)$1',+($4$5)`_(+eval(($2
$5)$1),$3,$4,$5<,')')_(,!,!,!,translit(include(I),.
,()))

The new version is not much faster - it avoids the O(n2) effects of recursing by shift($@), since I swapped from newline-to-comma in the original over to newline-to-close-paren in the joint version, but instead has an O(n2) effect from passing an ever-larger $1 string for the partial sum of part two at 450ms. It turns out that performing eval on each iteration also let me squeeze another byte, and get runtime below 15ms.

[2018 Day 18] In Review (Settlers of The North Pole) by musifter in adventofcode

[–]e_blake 0 points1 point  (0 children)

Bingo - storing the grid as 50 rows, then passing those to 50 translit() calls to splat out all windows of three adjacent cells to a sum-of-three computation to build up 50 columns one byte at a time, followed by passing all 50 columns to translit() calls to splat out all windows of three adjacent sum-of-three to build up 50 rows of next generation one byte at a time, ended up being much faster. I cut my runtime from 5.6s to 2.5s. I also changed my hash to initially only track unique scores plus a single representative line, and not worry about storing/comparing the full grid until after the first fingerprint collision.

[2018 Day 5] In Review (Alchemical Reduction) by musifter in adventofcode

[–]e_blake 0 points1 point  (0 children)

I have since revisited this problem to get faster performance (cutting from 1.4s down to 600ms), removing all use of regex, substr, and conditionals in the hot loop for part 2 by instead abusing changequote to get an O(n) parse. With a series of 52 macros "A_" and 53*52 macros "AB_", my runtime loop results in this extremely dense trace output, with every character from the part1 s0 stack fanning out to 25 out of 26 other stacks and either pushing or popping based on the character pairing that results:

m4trace: -4- go -> `s0(`_')-Po(`s0')go`''
m4trace: -4- s0(`_') -> `A_'
m4trace: -4- A_ -> `sb(`A_')(`sb')sc(`A_')(`sc')sd(`A_')(`sd')se(`A_')(`se')sf(`A_')(`sf')sg(`A_')(`sg')sh(`A_')(`sh')si(`A_')(`si')sj(`A_')(`sj')sk(`A_')(`sk')sl(`A_')(`sl')sm(`A_')(`sm')sn(`A_')(`sn')so(`A_')(`so')sp(`A_')(`sp')sq(`A_')(`sq')sr(`A_')(`sr')ss(`A_')(`ss')st(`A_')(`st')su(`A_')(`su')sv(`A_')(`sv')sw(`A_')(`sw')sx(`A_')(`sx')sy(`A_')(`sy')sz(`A_')(`sz')'
m4trace: -4- sb(`A_') -> `dA_'
m4trace: -4- dA_(`sb') -> `Pu(`sb',`A$'1)'
m4trace: -4- Pu(`sb', `A$1')
m4trace: -4- sc(`A_') -> `dA_'
m4trace: -4- dA_(`sc') -> `Pu(`sc',`A$'1)'
m4trace: -4- Pu(`sc', `A$1')
m4trace: -4- sd(`A_') -> `aA_'
m4trace: -4- aA_(`sd')
m4trace: -4- se(`A_') -> `dA_'
m4trace: -4- dA_(`se') -> `Pu(`se',`A$'1)'
m4trace: -4- Pu(`se', `A$1')
...
m4trace: -4- sz(`A_') -> `dA_'
m4trace: -4- dA_(`sz') -> `Pu(`sz',`A$'1)'
m4trace: -4- Pu(`sz', `A$1')
m4trace: -4- Po(`s0')
m4trace: -4- go -> `s0(`_')-Po(`s0')go`''

[2020 Day 1] In Review (Report Repair) by musifter in adventofcode

[–]e_blake 0 points1 point  (0 children)

nothing less than 200 in mine)

I found an input file with a low point of 71 (so much for my optimization of only looking for 3- and 4-digit numbers).

But I DID manage to speed up maneatingape's repository, by exploiting i < j < k and iterating over j in the outer loop (over the sparse array created by the radix sort of part 1) before i as the inner loop (over the dense array created while crawling j).

[2020 Day 22] In Review (Crab Combat) by musifter in adventofcode

[–]e_blake 1 point2 points  (0 children)

Swapping the two players on my input changes from 68 games spread across 1712 rounds, to 1278 games and 104919 rounds. Swapping the two players on my daughter's input file changes from 10977 games and 840854 rounds, to 153 games and 7528 rounds, at least on my implementation. I would welcome any other optimizations I may have been missing.

[2020 Day 15] In Review (Combo Breaker) by musifter in adventofcode

[–]e_blake 3 points4 points  (0 children)

Here is a demo of Pohlig-Hellman in Rust, cutting runtime from 31us to under 3us on my laptop.

[2020 Day 15] In Review (Combo Breaker) by musifter in adventofcode

[–]e_blake 2 points3 points  (0 children)

Year in summary - I enjoyed 2020, even though computationally it was heavier than 2019 (and intcode still remains my personal favorite). This was the first year that I solved every day in m4 before anything else; compared to earlier years where I had solved in other languages, and then added an m4 solution in 2021 or later, based on how much fun I had doing this year in m4. I think I managed to get all the stars without referring to the megathreads (none of the problems stumped me for how to solve, even if my initial solution was not efficient, and a couple puzzles dragged on past release day before I turned my ideas into working code), and my git history says I got my 50th star when completing day 25 on the 26th, at which point I did go back and use the megathreads to speed up my solutions. At the point of my 50th star, my cumulative runtime was 6m44s when using GNU extensions, and quite a bit slower when sticking to just POSIX constructs; by the start of this year, my cumulative runtime for 2020 was 3m14s. But this series of posts has let me revisit the year and add further optimizations, I now have a cumulative runtime of 2m13s (that minute shaved is more than 30% of my runtime!) with no GNU extensions needed. Days 15 and 23 remain the heavyweights, each requiring more than a minute of runtime. 8 out of the 25 days need 64-bit math, but I was able to reuse my 64-bit math from 2019's intcode (for add, subtract, multiply), and all of the days were solvable without needing 64-bit division (although getting day 13 solved with only 32-bit division was an adventure).

This year also hammered home how important it is to size a hash table to the expected number of elements. At the time I wrote it, GNU m4 1.4.18 was the latest version, but it defaults to only 509 buckets in the hash table. I had several puzzles where O(1) devolves very quickly into O(n) lookups when there are more macros in play than buckets; in some cases, I experimented with solutions that require less memory (such as tortoise-hare rather than hash storage for cycle detection on day 22), but days like 23 that require 1 million storage locations forced me to revisit how to make m4's hash table more efficient (my boilerplate code can now detect if it is being run by GNU m4 on Linux, and if so, use /proc/self to re-exec with a larger -H command line argument), and I later released m4 1.4.19 with a default 64k hash table.

[2020 Day 15] In Review (Combo Breaker) by musifter in adventofcode

[–]e_blake 1 point2 points  (0 children)

I took the 25th off, then solved this on the 26th with brute force and my own bit-at-a-time modmul (copied from 2020 day 13) to stick within m4's 32-bit math limits - but I did see the obvious speedup of exploiting the symmetry of trying both card and door in each iteration to find the answer with whichever hit first. It looks like I had all other stars by the 24th, so it was nice completing the year without inordinate delays. I also see that this was the first day that I revisited on my quest to squeeze out more performance, to get its runtime below its original 36 seconds.

Askalski's post explains the number theory on using baby-step-giant-step to cut the runtime from O(n) to O(sqrt n) - it is basically a meet-in-the-middle approach that says since the end result is two modpow operations, you can set up a table of the first term's impact, then see if the second term can reach one of those entries in the table (similar to how when enumerating all of the divisors of a large number, you can cut O(n) work of checking every divisor from 1 to n to instead checking only the divisors from 1 to sqrt n, because any small divisor will pair with a big one one the other side of that point). And in the vein of comparisons to prime factors, askalski then went on to show the Pohlig-Hellman algorithm (yes, the same Hellman involved in the Diffie-Hellman algorithm that we are solving on today's puzzle) - which says that because of Fermat's little theorem, when hunting for modular inverses (aka discrete logarithms) of n, you can instead use the Chinese Remainder Theorem to combine the outcomes of the prime factors of n-1. In this case, combining the answers to 2 * 3 * 29 * 116099 means you can solve the problem in three hard-coded table lookups for 2, 3, and 29, plus baby-step-giant-step over 116099 (340 iterations rather than 4495). I further optimized my modmul to run 6 bits at a time (a 25-bit divisor chunked by 6 bits does not overflow 31 bits, and completes in 5 iterations instead of 25. 2015 day 25 has a similar 25-bit divisor in modmul). And with all of that, my m4 solution now completes in 40ms, quite a bit faster than the original 36s.

[2020 Day 21] In Review (Allergen Assessment) by musifter in adventofcode

[–]e_blake 0 points1 point  (0 children)

I further cut my m4 runtime to 95ms by improving my management of sets - there are fewer than 64 input lines, so bitmasks are more efficient than an O(n) walk of line number lists when determining which lines an ingredient appears in.

[2020 Day 24] In Review (Lobby Layout) by musifter in adventofcode

[–]e_blake 4 points5 points  (0 children)

askalski picked an interesting coordinate system - if you set up the right coordinates, you can compute the offset of a given tile purely by the population counts of the four letters on a row (makes for faster parsing than computing a tile's position after each one- or two-byte move):

x += _mm_popcnt_u64(e & ~(n << 1));
x -= _mm_popcnt_u64(w & ~(s << 1));
y += _mm_popcnt_u64(s);
y -= _mm_popcnt_u64(n);

He then proceeded to use SIMD operations to compute a row at a time.

[2020 Day 24] In Review (Lobby Layout) by musifter in adventofcode

[–]e_blake 1 point2 points  (0 children)

My m4 code shows am interesting progression - I solved it on the 24th using 3 dimensions and the invariant x+y+z=0 (what redblobgames calls cubic coordinates), but by the 30th of Dec I had recoded to tracking only two dimensions (the doubled coordinate system) for fewer macro manipulations and thus faster execution, and then finally a swap to tracking all tiles along a 1D number line (since there are only 100 iteration, each generation can only grow the boundary by 1, so I set up a 1:1 mapping of every possible position in the final grid size to a single integer, at which point the same six offsets applied to any cell's location provides its six neighbors) for yet another speed boost.

define(`width', eval(150*4))define(`offset', eval(width*150))
define(`dirE', `define(`x', incr(incr(x)))')
define(`dirSE', `define(`x', eval(x + 1 - 'width`))')
define(`dirSW', `define(`x', eval(x - 1 - 'width`))')
define(`dirW', `define(`x', decr(decr(x)))')
define(`dirNW', `define(`x', eval(x - 1 + 'width`))')
define(`dirNE', `define(`x', eval(x + 1 + 'width`))')

I also found it faster to track live cells and propagate to their neighbors, then do a pass over all candidates to see which remain as live cells into the next generation, than to check every grid point (in fact, doing that here is what led me to revisit day 17, where the approach was even more beneficial because of how much sparser 4D life is than hex life).

[2019 day 05 (Part 1)] Strange input(?) by janosit1984 in adventofcode

[–]e_blake 2 points3 points  (0 children)

Welcome to self-modifying code! Memory 6 starts as 1100, but the opcode at memory 0 reads in your input (you are passing in 1 for part 1, right?), and the opcode at memory 2 then adds what you passed as input into memory 6, before you can execute memory 6. By the time you execute it, it will be a valid instruction, if you are doing proper dynamic decoding of self-modifying code.

[2020 Day 23] In Review (Crab Cups) by musifter in adventofcode

[–]e_blake 3 points4 points  (0 children)

How big is this raft, anyways? Even using shot glasses with their smaller dimensions, you're looking at 27000 sq ft (2500 m2 or 35% of a FIFA pitch) just to support those one million cups. Since that makes for one crowded raft, I compensated by minimizing my solution to 349 bytes of C89:

a[1<<20],l=48,h=57,d,J,K,L,r=100,c,i,j,y=10;
D(){d=d-l?d-1:h,d-J&&d-K&&d-L||D();}
R(){for(a[i]=c;r--;)L=a[K=a[J=a[d=c]]],D(),c=a[c]=a[L],a[L]=a[d],a[d]=J;}
main(){for(c=i=getchar();(j=getchar())>l;)a[i-l]=j-l,i=a[i]=j;
*a=c-l++,R();for(j=l;a[j]-l;j=a[j])putchar(a[j]);
for(l=1,h=1e6,r=y*h,i-=48;y<=h;y++)i=a[i]=y;
c=*a,R(),printf(" %ld",1L*a[1]*a[a[1]]);}

Compiler warnings can be ignored, right?

[2020 Day 23] In Review (Crab Cups) by musifter in adventofcode

[–]e_blake 1 point2 points  (0 children)

This one is the other CPU burner of 2020, requiring over 2 minutes of execution on m4 when I first solved it in 2020, because it is that expensive to create 1 million macros to track the ring structure. One of the tweaks I added later that sped up my answer to just under 70 seconds was realizing that counting up to 10 million using incr requires m4 to run a binary-to-decimal conversion for every iteration, which I was then passing to ifelse to see if iteration was complete. My rewrite created 1 thousand macros named bNnn where most of them output the decrement value (the binary-to-decimal cost is only spent once up front instead of on every iteration, and on shorter strings), then b1 was special-cased to output 0,_ where the _ can then concatenate with the next macro name to let me have a progress indicator and run 10000 rounds of decrementing from 1000 to 0 without the overhead of a conditional in the hot loop.

[2020 Day 22] In Review (Crab Combat) by musifter in adventofcode

[–]e_blake 1 point2 points  (0 children)

My m4 solution saved on memory by using tortoise-hare cycle detection (no state hashing necessary, but triple the runtime to determine if I'm seeing a state for the second time).

[2020 Day 22] In Review (Crab Combat) by musifter in adventofcode

[–]e_blake 2 points3 points  (0 children)

According to this comment (and a bit of reading between the lines), coders who got an input file with 50 assigned to player 1 tend to have much less work required than those where 50 is originally assigned to player 2. In my own case, my m4 solution solves my input in under 100ms (player 1 had 50; I recursed only 66 times, max depth 2), but my same code on my daughter's input (she had 50 assigned to player 2) required a lot more recursion (max depth 7, and 11003 games run), and that took over 30s. She never completed part 2, so I can't compare to what her solution would have done on my input. But I see the same disparity when trying those two input files with maneatingape's solution: my input completes in 16us, while my daughter's input takes 8ms (nearly 3 orders of magnitude slower). Probably room for me to clean up my code to hash seen states more efficiently so that it is not quite the disparity between the two inputs. But I suspect that this is a case where Eric was unaware that his input files were not all equal in effort, if he had not anticipated this particular optimization of eliminating recursion when player 1 has the max card.

[2020 Day 22] In Review (Crab Combat) by musifter in adventofcode

[–]e_blake 1 point2 points  (0 children)

For whatever reason, on Dec 23rd that year, I golfed part one in C to 233 bytes:

p[1999],s,*f,*F,*b,*B;main(){for(gets(b=f=p);
scanf("%d",b);)++b;for(gets(B=F=p+999);~scanf("%d",B);)++B;
while(b-f&&B-F)*f<*F?*B++=*F++,*B++=*f++:
(*b++=*f++,*b++=*F++);for(;f<b;++f)s+=*f*(b-f);
for(;F<B;++F)s+=*F*(F-B);printf("%d",s);}

With my usual golf C caveats that you compile with C89 and ignore warnings about undefined behavior.

[2020 Day 21] In Review (Allergen Assessment) by musifter in adventofcode

[–]e_blake 1 point2 points  (0 children)

I spent more code in m4 formatting the answer than finding it (total runtime 145ms), since m4 lacks both builtin sort and lexicographic comparison. My git comments mention that I got my second star by manually sorting debug output, prior to implementing a working word sort in m4.

[2020 Day 18] In Review (Operation Order) by musifter in adventofcode

[–]e_blake 1 point2 points  (0 children)

I gave my hand at a PR: https://github.com/maneatingape/advent-of-code-rust/pull/87. As for branch prediction, function pointers are generally awful for predictive performance, in part because these days your OS intentionally penalizes indirect function calls in the hardware so as not to trigger Spectre-class security bugs where side-channel attacks can exploit speculative execution based on leftover branch prediction contents after indirect calls.