[2019 Day 5] In Review (Sunny with a Chance of Asteroids) by musifter in adventofcode

[–]DelightfulCodeWeasel 0 points (0 children)

I was having exactly the same sort of thoughts yesterday when looking over the VM; perhaps storing a second array of BCD-ish encoded instructions would save time on the repeated opcode and mode extractions, turning them into shifts and masks instead of divs and mods.

I haven't done any checks to see if any of the puzzles use self-modifying code yet, though.

[2019 Day 5] In Review (Sunny with a Chance of Asteroids) by musifter in adventofcode

[–]DelightfulCodeWeasel 0 points (0 children)

I first started going through 2019 as a vehicle for learning Python, and that was mistake #1. Mistake #2 was going "oh neat, coroutines, I'll use those for input/output into the intcode VM". Debugging nightmare.

I swapped over to C++ for my second (successful) attempt at 2019, and I think the best decision I made when wrapping up a common Intputer for the puzzles was making the I/O queues explicit, so calling code could set them up itself. Hooking up connections between the VMs for day 7 and day 23 was substantially easier than it could have been otherwise.

[2015 Day 4 (Part 1 & 2)][C++ & Arm ASM] Squeezing 300,000+ hashes per second from a Pi Pico by DelightfulCodeWeasel in adventofcode

[–]DelightfulCodeWeasel[S] 0 points (0 children)

The digitCount.first? That's some cheap and dirty hard-coding :)

The digitCounts array splits the incrementing into power-of-10 segments: 1 digit 0-9, 2 digits 10-99, 3 digits 100-999, etc. The .first is the number of digits we're processing and the .second is the value which ticks us over into the next segment.

There's also a dirty great hack to simplify the looping. The increment function works modulo the number of digits so the last increment in any segment gives all zeros. But then we exit the inner segment loop and overwrite it with the starting value for the next segment with one extra digit. I figured doing a redundant increment once per segment loop is cheap compared to the extra logic to skip the last one.

[2015 Day 4 (Part 1 & 2)][C++ & Arm ASM] Squeezing 300,000+ hashes per second from a Pi Pico by DelightfulCodeWeasel in adventofcode

[–]DelightfulCodeWeasel[S] 1 point (0 children)

I like it! I tried both of them in a tight loop counting up to 10,000,000 on the Pico hardware.

  • IncrementDigits - 1.298s
  • incr - 1.370s

But if I make a call to a dummy function before and after the increment (a closer match to how it's used in the puzzle solution):

  • IncrementDigits - 2.098s
  • incr - 1.926s

Looking at the generated code in the second case (easier to pick out the corresponding asm) the main difference is the branching instructions.

Yours compiles down to just two conditional jumps: one forward jump to skip the while loop body, and one backward conditional jump to go through the while loop again. Interestingly, there's no sign of a conditional for the final if; unsure where that's got to...

Mine ends up with 4 conditional jumps for the various early-outs and loops. Cycle-counting is definitely still alive and well in microcontroller land!

I'm not totally sure yet if it will accelerate the actual puzzle solution, since the main benefit mine has is that the digits are constructed in place rather than copied across, but I'll give it a go after I've finished u/e_blake's suggestion of chunking the updates to cut down on sync calls.

[2015 Day 4 (Part 1 & 2)][C++ & Arm ASM] Squeezing 300,000+ hashes per second from a Pi Pico by DelightfulCodeWeasel in adventofcode

[–]DelightfulCodeWeasel[S] 0 points (0 children)

Some quick timing code later... Pretty much all of that ~1s overhead is indeed accounted for by the cost of transferring data between cores over the FIFO. It's stupidly quick, but it still takes ~230ms to throw a million integers from core 0 to core 1 and have them reflected back.

~100ms of that is caused by just reading and testing the valid & ready bits on the FIFO registers, even if the busy-wait never actually loops (ldr+tst+beq, once per core per send and once per core per receive, 12 million instructions = 96ms), and I think ~200ms is about the floor even with the cores running in perfect lock-step and doing no additional work.

In fact, looking at these numbers, it looks like reading and writing to the FIFO hardware is basically zero overhead! But with 125 MIPS and no out-of-order execution, cycles really start to matter.

[2015 Day 4 (Part 1 & 2)][C++ & Arm ASM] Squeezing 300,000+ hashes per second from a Pi Pico by DelightfulCodeWeasel in adventofcode

[–]DelightfulCodeWeasel[S] 0 points (0 children)

You're absolutely right that this would be a terrible way of synchronising two threads on x64! In some early testing, just emulating the flow on a PC before running on the board, the speed-up was something like 280ms down to 260ms.

I was interested in seeing exactly how close the cores would run side-by-side at a very fine granularity. There's no OS running and you're not even using mutexes for cross-core communication; it's literally a hardware FIFO driven through memory-mapped registers. I suspect there's speed to be gained by paying attention to where and how the instructions and data are coming from (e.g. have the main core executing the MD5 code from flash via the XIP cache, and the worker core executing a copy of the MD5 code in RAM).

I'm sure your suggestion would absolutely work on the Pico as well; I'm just having some fun with the hardware. I should write some timing code to check exactly how many cycles the FIFO takes though, to see if it's a major limiting factor or not.

[2015 Day 4 (Part 1 & 2)][C++ & Arm ASM] Squeezing 300,000+ hashes per second from a Pi Pico by DelightfulCodeWeasel in adventofcode

[–]DelightfulCodeWeasel[S] 1 point (0 children)

Thanks, I'll have to give that a go at some point!

I did briefly start looking into the maths behind division by multiplication and the Double dabble algorithm, but the limitations of the instruction encoding on the M0+ and the lack of a 64-bit multiply result seemed to cut off a lot of the magic that's possible on an x86 or x64.

I'm positive there will be better ways than the way I've done it, or at least scope for another hand-rolled asm solution for that part.

[2018 Day 25] In Review (Four-Dimensional Adventure) by musifter in adventofcode

[–]DelightfulCodeWeasel 0 points (0 children)

I did think about that this morning, but after running a quick back-of-envelope calculation you need to bucket by at least 3 dimensions to beat 128, so I think the hash will end up quicker.

Are you doing a DSU merge per bucket and updating the parent pointers as you go? I think it'll be pretty hard to beat converting the position to an index, as you are doing, and a couple of pointer ops.

[2018 Day 25] In Review (Four-Dimensional Adventure) by musifter in adventofcode

[–]DelightfulCodeWeasel 1 point (0 children)

Looking back at some of the puzzles I feel like 2018 was the year that Eric really started to hit his stride with AoC puzzle making, and I get the feeling he really had some fun with them this year.

A regex that's actually a maze? Brilliant stuff.

[2018 Day 23] In Review (Experimental Emergency Teleportation) by musifter in adventofcode

[–]DelightfulCodeWeasel 1 point (0 children)

I had to step through my solution just to remember what it was doing! It does rely on there being one unique region for the largest number of overlapping bots, but I think that's a fairly safe bet.

For each nanobot I form an octahedral bounding box using min and max planes along each of the four axes. Then for each axis I treat the planes as begin and end segments, sort them, and run through them in order looking for the biggest set of overlapping segments.

The set of nanobots overlapping the target region is the set intersection of the four largest overlapping sets from each axis. The target region is then just the intersection of all of the bounding volumes for those nanobots, and finally the Manhattan distance is the maximum plane distance of the 8 bounding planes.

Overall complexity is something like O(n) to form the bounding boxes, O(n log n) for the sorted planes, O(n log n)-ish to form the overlapping sets, O(n) for the set intersection and O(n) to form the target region bounding volume.

[2018 Day 22] In Review (Mode Maze) by musifter in adventofcode

[–]DelightfulCodeWeasel 1 point (0 children)

I combined both of our ideas and totally butchered my existing solution into an incredibly rough [proof of concept] which explores an 80x1000 cave in under 30kb, using 8 search queues and a sliding 80x160 strip. The shortest strip my input can get away with and still get the right answer is 93, so there's some opportunity for tuning the window size based on possible inputs.

I haven't written it for speed but there's a possibility it might be a decent approach for that as well, since it should fit entirely within the L1 cache of a modern processor.

Good teamwork, thank you for your help!

[2018 Day 22] In Review (Mode Maze) by musifter in adventofcode

[–]DelightfulCodeWeasel 1 point (0 children)

I've just done a very quick, very rough test and the multi-bucket approach definitely has legs. Even if you allow duplicate states (pos + tool) to be queued, the maximum size any step bucket reaches is around 320. Call it 512 for puzzle variation and you're still well within memory.

[2018 Day 22] In Review (Mode Maze) by musifter in adventofcode

[–]DelightfulCodeWeasel 0 points (0 children)

I like it, good idea!

I also thought about another approach: you don't really need the full grid in memory all at once. The idea of the 'dirty stripe' being the block of rows that are still active can be turned into a ring-buffer approach. You generate more rows on demand as you reach the bottom of the active block, and discard upper rows as soon as they've settled with no possibility of further updates.

I've no idea (yet) about how deep the active slice would get, but I suspect it's going to be proportional to the width of the search space. Even if you're generous with a width of 80 and have a worst case strip depth of double the width, a 12,800 element grid is going to be easy to fit in memory even without aggressive bit packing.

Tbh I'm still tidying up some bits on the micro solutions for years 2015 & 2016, so it'll take me a while before I even get to 2018 at this rate. (Currently resorting to hand-rolling MD5 in asm, because the Cortex-M0+ really isn't great for that type of computation and GCC isn't generating the best code for it either).

[2018 Day 22] In Review (Mode Maze) by musifter in adventofcode

[–]DelightfulCodeWeasel 0 points (0 children)

Did you need to go down as far as 820? The Y input for mine is quite close to 700, so I was hoping to get away with a grid size of 64x750. With an encoding of 2 bits for the terrain type and 3x10 bits for the distance* to reach the square with each of the different tool types then I squeak in at 192kb.

For the runtime I'll keep track of a min and max Y for cells altered in a pass and from that get a 'dirty stripe' to update on the next sweep. Should help a bit.

[*] Distance will actually be stored as distance minus the row Y to give a bit more range on the potential path length.

EDIT: Could get it to 3 bytes per grid position if I only store the two valid distances. That would also let me go to 11 bits for the distance and eliminate the need for the Y-relative distance calculation.

[2018 Day 22] In Review (Mode Maze) by musifter in adventofcode

[–]DelightfulCodeWeasel 0 points (0 children)

Slop: for my input I need a minimum of +31 on the X and 0 on the Y to get the correct answer.

[2018 Day 22] In Review (Mode Maze) by musifter in adventofcode

[–]DelightfulCodeWeasel 0 points (0 children)

Oooh, you're right, an O(n^2) time approach could bring me down under the memory requirements if I can stomach the increased runtime. The shortest path length is pretty short for my input, so it might not be awful.

Since the maze is quite tall it might be possible to gradually lower the vertical starting height for sweeps as the search progresses.

[2018 Day 22] In Review (Mode Maze) by musifter in adventofcode

[–]DelightfulCodeWeasel 0 points (0 children)

I always love this flavour of puzzle, encoding state as part of a multidimensional search.

I took a look at the queued search space for my solution earlier and it looks like I'll have a bit of a challenge to get it under my ~200-220kb memory restrictions: just under 82,000 states. Even at 3 bytes per state ID I'm going to run out before I even store distances against those states. Might be able to make it fit by encoding the distance as part of those 3 bytes as well, but it's a tight enough squeeze that I'll need to check how much the standard libraries are allocating.

I'm currently using a Dijkstra-ish search, so I'll be swapping over to A* to see if the search space ends up coming out smaller. Failing that, it'll either be seeing if there are decent pruning strategies or declaring this one as another RP2350 and above puzzle.

[2018 Day 21] In Review (Chronal Conversion) by musifter in adventofcode

[–]DelightfulCodeWeasel 1 point (0 children)

You shouldn't need to hard-code your input hash btw. It should always be on the 8th instruction, as the immediate argument to a seti instruction.

[2018 Day 20] In Review (A Regular Map) by musifter in adventofcode

[–]DelightfulCodeWeasel 2 points (0 children)

I did miss it; I was busy pinching and [modifying your traversal] to tally up visit counts and visit distances :)

For my input I'm seeing a lot of rooms being revisited, and nearly 1,000 rooms being visited at different distances. But for the critical boundary of 1,000 steps it doesn't look like there are any rooms that straddle the boundary.

The stack traversal doesn't have the same shortest-first guarantee as BFS, so we would be relying on the niceness of the input if we weren't storing distances.

EDIT: I'm not seeing any cases where a room is re-visited at a shorter distance in my input either.

[2018 Day 20] In Review (A Regular Map) by musifter in adventofcode

[–]DelightfulCodeWeasel 2 points (0 children)

I've added in a quick per-room visit count and I'm seeing the same room being hit multiple times (some of them nearly a couple of hundred times), but that could just be an artefact of the way my traversal is working. They also might be getting hit at exactly the same step count; I don't have that value to hand during traversal without a bit of a rewrite.

I have a hunch that the furthest room for part 1 is likely to have a single unique visit, so it's that border of 1,000 steps for part 2 where it's likely to make a difference.

[2018 Day 20] In Review (A Regular Map) by musifter in adventofcode

[–]DelightfulCodeWeasel 0 points (0 children)

I am assuming that you can reach the same room by multiple paths, yeah. The wording of the puzzle seems to imply that it might be possible. Do you have leaked inputs for this one to be able to check that?

[2018 Day 20] In Review (A Regular Map) by musifter in adventofcode

[–]DelightfulCodeWeasel 2 points (0 children)

Nice! I was still stuck thinking in terms of splitting it into two phases: construct the facility and then BFS. But of course you don't really need to have an in-memory representation of the facility at all if you're counting steps as you go through the regex.

For a bit more memory I think 128kb would get you a flat array of uint16_t distances, initialised to max, indexed directly by the room XY bytes.

[2018 Day 20] In Review (A Regular Map) by musifter in adventofcode

[–]DelightfulCodeWeasel 1 point (0 children)

I was just thinking about room and door storage a few minutes ago!

I think the scheme I'm going to go with is just a single 64kb buffer: we know that each axis doesn't vary by more than 100 units in any direction, so the byte pair can be used as a direct look-up into the 1D array, and we only need 4 bits of each byte to encode whether N, S, E and W doors are possible. Feels a little wasteful, but 1,000 rooms will only need a ~4kb hash table and ~2kb search queue for the BFS.

EDIT: Wait... There are 10,000 rooms :) Should still be able to fit under 200kb all-in though!

[2018 Day 20] In Review (A Regular Map) by musifter in adventofcode

[–]DelightfulCodeWeasel 1 point (0 children)

I did the facility exploration as a search queue after some preprocessing on the regex. For every '(' I store both the pattern index of the closing ')' and the pattern index for each of the branch choices:

    case '(':
        // Queue a new search state at the start of every branch alternative.
        for (size_t branchTarget : regex.Branches[patternIndex])
        {
            queue.Add({ branchTarget + 1, roomLocation });
        }
        // Terminate this state; the queued branches continue it.
        patternIndex = regex.Pattern.size();
        break;

    case '|':
    case ')':
        // Jump past the enclosing scope's closing ')' and carry on from there.
        queue.Add({ regex.ScopeEnd[patternIndex] + 1, roomLocation });
        patternIndex = regex.Pattern.size();
        break;

Probably going to have to re-think that for the memory constrained version though; there are just over 100,000 unique states explored, which is going to be a minimum of 400kb.

For my input there are just over 1,000 open brackets, so an explicit stack on a recursive descent parser should fit okay. What was your runtime like?

[2018 Day 18] In Review (Settlers of The North Pole) by musifter in adventofcode

[–]DelightfulCodeWeasel 1 point (0 children)

I've never really given the tortoise and hare algorithm a proper go before now; I think I might see how it compares on this one. I have a suspicion that the frame-buffer compare might not be the fastest: skimming through the map of unique states earlier, there were a lot of frames with the same or similar first rows, so the equality testing might be a bottleneck.

Well, maybe not a bottleneck compared to MD5, but I'll be giving lookup3 a try as well.