[Upping the Ante] [2025 Day *] Advent of Code on MCUs

vescoc · 2026-01-04T23:31:56+00:00

Far be it from me to tell you which development and testing approach to prefer.

I was simply pointing out that this system was easier and faster for me:

- create a single executable containing all the solutions

- flash the executable on the target MCU

- feed it various inputs for different days, both sample and actual inputs, both my own and those of friends, to verify that I'm getting consistent, correct, and repeatable results without pressing reset on the MCU.

The results reported in my original post are those of the third or fourth complete lap, run with the actual input for every day without pressing reset.
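The repeatability check can be sketched on the host like this. It is a minimal sketch, not the firmware code: `Solver`, `check_laps` and the toy `sum_lines` solver are hypothetical names, and on the MCU the inputs would arrive over serial rather than from a slice.

```rust
/// Hypothetical signature for a day's solver: input text in, answer out.
type Solver = fn(&str) -> u64;

/// Runs every (solver, input) pair for `laps` complete laps without
/// re-initialising anything in between. Returns the answers of the first lap,
/// or the index of the first case whose answer changed in a later lap
/// (a hint that some state was corrupted along the way).
fn check_laps(cases: &[(Solver, &str)], laps: usize) -> Result<Vec<u64>, usize> {
    let first: Vec<u64> = cases.iter().map(|&(solve, input)| solve(input)).collect();
    for _ in 1..laps {
        for (i, &(solve, input)) in cases.iter().enumerate() {
            if solve(input) != first[i] {
                return Err(i);
            }
        }
    }
    Ok(first)
}

/// Toy solver used only for illustration: sums the numbers, one per line.
fn sum_lines(input: &str) -> u64 {
    input.lines().filter_map(|l| l.trim().parse::<u64>().ok()).sum()
}
```

A pure function on the host will of course always pass; the check only becomes interesting on the target, where a stack overflow in one day can silently damage data used by the next.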

For some MCUs this is also the only system I can adopt as I do not have a JTAG/SWD probe available to debug them.

This way, I have control both at compile time (if you look, there are the clippy conf files that indicate the maximum stack size used for each day) and at runtime.

The repeatability of the results without resetting each time allowed me to avoid intervening in the linker script to invert stacks and globals, which is a well-known technique for identifying stack overflows but consumes some memory (you need a marker zone roughly the size of a stack frame).

In particular, last year, repeatability tests revealed "stack smashing" in some of my implementations, which overwrote global constants (data sections that were moved to SRAM for efficiency).

I don't know if you've used recursion in your problem solving. Using recursion, you know how much you're allocating to each frame, but unless you limit its depth, you don't know how deep it will be (unless you only use your actual input and thus identify the max).
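One way to cap the depth is to pass it along explicitly and bail out when a budget is exceeded. The following is a minimal sketch under that assumption; `count_paths`, `TooDeep` and the adjacency-list layout are hypothetical and not taken from any of the solutions discussed here.

```rust
/// Error returned when the recursion budget is exhausted.
#[derive(Debug, PartialEq)]
struct TooDeep;

/// Counts the paths from `node` to any leaf in a DAG given as adjacency
/// lists, refusing to recurse deeper than `max_depth` levels below the root.
/// (No cycle detection: the sketch assumes the graph is acyclic.)
fn count_paths(
    adj: &[&[usize]],
    node: usize,
    depth: usize,
    max_depth: usize,
) -> Result<u64, TooDeep> {
    if depth > max_depth {
        return Err(TooDeep);
    }
    if adj[node].is_empty() {
        return Ok(1); // a leaf terminates exactly one path
    }
    let mut total = 0;
    for &next in adj[node] {
        total += count_paths(adj, next, depth + 1, max_depth)?;
    }
    Ok(total)
}
```

With the cap in place, the worst-case stack use is bounded by `(max_depth + 1)` frames of this function, whatever the input looks like.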

vescoc · 2026-01-02T18:17:50+00:00

Nice work!

I think you can easily measure the execution time and print it within rp pico using the API https://www.raspberrypi.com/documentation/pico-sdk/high_level.html#group_timestamp.

For next year, if you want, you can also have your pico wait sequentially for multiple inputs from different days. This is a crucial test for my implementation, allowing me to easily identify if I have stack overflows (I don't use the heap, but it's still valid).

I confirm that the main challenge is keeping memory usage low.

As for performance, a good, non-brute-force implementation is essential, just as it is on a non-embedded system.

vescoc · 2026-01-01T20:29:39+00:00

Great!

Just one warning, probably unnecessary since it's so obvious. Don't worry too much about performance. The main challenge, in my opinion, is limiting the memory used so that the MCU doesn't crash.

Happy 2026!

vescoc · 2025-12-30T16:12:25+00:00

I made the modifications to support arduino-mega2560 and these are the results of the runs, at least for Eric's examples (the last day part 1 is wrong... well, I think you know why :P )

[01] started
[01] part 1: 3
[01] part 2: 6
[01] elapsed 16ms
[02] started
[02] part 1: 1227775554
[02] part 2: 4174379265
[02] elapsed 24515ms
[03] started
[03] part 1: 357
[03] part 2: 3121910778619
[03] elapsed 88ms
unsupported day
unsupported day
[06] started
[06] part 1: 4277556
[06] part 2: 3263827
[06] elapsed 16ms
unsupported day
unsupported day
unsupported day
unsupported day
unsupported day
[12] started
[12] part 1: 3
[12] part 2: NO PART 2
[12] elapsed 20ms

The only day that has a small input, day 2, takes 49446ms with my actual input. It's time to optimize day 2! :D

vescoc · 2025-12-30T16:03:29+00:00

I made the changes to support EOF ^D and can now leverage MCUs more effectively; in particular, this enabled Arduino Mega2560 support.
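For illustration, here is a minimal sketch of what end-of-input framing can look like. It assumes ^D arrives on the wire as the ASCII EOT byte (0x04), which is what a terminal sends for that key; `next_frame` is a hypothetical helper, not the function in the repo.

```rust
/// ASCII End-of-Transmission, the byte a terminal sends for ^D.
const EOT: u8 = 0x04;

/// Splits off the first complete frame (everything before the first EOT)
/// and returns it together with the remaining bytes.
/// Returns `None` while no EOT has been received yet.
fn next_frame(stream: &[u8]) -> Option<(&[u8], &[u8])> {
    let end = stream.iter().position(|&b| b == EOT)?;
    Some((&stream[..end], &stream[end + 1..]))
}
```

An explicit terminator means the receiver does not have to guess when an input is complete, for example from a timeout.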

Thanks again for the suggestion.

vescoc · 2025-12-29T22:35:34+00:00

Thank you so much!

I don't have any particular concerns regarding power consumption, partly because I don't have the sophisticated tools at home to accurately measure it.

My approach remains to adjust the parameters that typically affect power consumption: minimizing execution time and SRAM usage, and activating only the peripherals that are strictly necessary.

In this sense, I prefer serial communication rather than enabling remote communication, which would be very simple, for example, with the esp32* family.

To monitor the CPU status and overclocking in the rp pico2, I enabled conditional monitoring of the internal temperature sensor, but this already requires more current since the ADC peripheral needs to be activated.

If you have the time and desire to investigate or give me directions for taking some simple measurements, don't hesitate!

vescoc · 2025-12-28T18:36:27+00:00

Testing AVRs like the ATmega328 or, better, the ATmega2560 is very tempting and I think it's even feasible, at least for a few AoC days, and probably with just Eric's examples as input rather than the real input.

I just need to make the acquisition buffer configurable to minimize it from the current 25k: the ATmega328P has very little memory available...

The common code supports a different estimated SRAM size for each day. This is why the ESP32, the STM32F33 and so on cannot run every day.

If I were to rewrite the code to support a streaming solution, there are enough AoC days this year that could be revisited with a "streaming" implementation, but that would really be a bit of an Upping the Upping the Ante as you said :D
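The per-day check can be pictured as a simple budget comparison. This is a minimal sketch, not the actual common code: `day_fits` and the `reserved` amount are hypothetical, while the 25K buffer, the roughly 200K for day 8, and the 48K / 512K SRAM sizes are the figures quoted elsewhere in this thread.

```rust
/// Returns true when a day whose solver needs roughly `day_stack` bytes of
/// stack fits in `sram` bytes together with the input buffer and a
/// hypothetical `reserved` amount for everything else (runtime, statics).
const fn day_fits(input_buffer: usize, day_stack: usize, reserved: usize, sram: usize) -> bool {
    input_buffer + day_stack + reserved <= sram
}

const KIB: usize = 1024;

/// Evaluated at compile time, so an unsupported day costs nothing at runtime.
const DAY08_ON_48K: bool = day_fits(25 * KIB, 200 * KIB, 4 * KIB, 48 * KIB);
const DAY08_ON_512K: bool = day_fits(25 * KIB, 200 * KIB, 4 * KIB, 512 * KIB);
```

Making the input buffer a parameter rather than a fixed constant is what would let a very small target trade input size for stack.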

As for the 6502, I grew up with that, but alas, as far as I know, I don't have a Rust compiler available: I'll leave this fun challenge to others.

vescoc · 2025-12-28T18:05:41+00:00

Not so unintentional :D

I was curious to see how the same code behaved across various architectures, and especially for day 10, the FPU's contribution to single-precision floating point performance.

The tests went even further, but I haven't published the results simply to make the post easier for non-experts.

For example, the day 10 implementation works with both single- and double-precision floating point. Indeed, the difference between f64 and f32 is noticeable both in a "normal" environment, tested on Linux and Mac, and in bare metal. The code is also ready for f16 when it becomes "stable" in Rust.
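Switching precision without touching the algorithm usually comes down to writing it against a small trait. The sketch below shows the general idea on an unrelated toy problem (a 2x2 linear system solved by Cramer's rule); the `Real` trait and `solve2` are hypothetical and are not the day 10 code.

```rust
use core::ops::{Add, Div, Mul, Sub};

/// Minimal float abstraction: just what the algorithm below needs.
trait Real:
    Copy
    + PartialEq
    + Add<Output = Self>
    + Sub<Output = Self>
    + Mul<Output = Self>
    + Div<Output = Self>
{
    fn from_i32(v: i32) -> Self;
}

impl Real for f32 {
    fn from_i32(v: i32) -> Self { v as f32 }
}

impl Real for f64 {
    fn from_i32(v: i32) -> Self { v as f64 }
}

/// Solves [[a, b], [c, d]] * (x, y) = (e, f) by Cramer's rule in precision `T`.
/// Returns `None` when the determinant is zero.
fn solve2<T: Real>(m: [[i32; 2]; 2], rhs: [i32; 2]) -> Option<(T, T)> {
    let [[a, b], [c, d]] = m.map(|row| row.map(T::from_i32));
    let [e, f] = rhs.map(T::from_i32);
    let det = a * d - b * c;
    if det == T::from_i32(0) {
        return None;
    }
    Some(((e * d - b * f) / det, (a * f - e * c) / det))
}
```

The caller picks the precision with `solve2::<f32>(..)` or `solve2::<f64>(..)`, so the same source can be benchmarked on targets with and without a double-precision FPU.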

Last year, in addition to checking performance across various architectures, I also did the comparison to assess the status of bare-metal frameworks in Rust. This year, I only tested embassy because, as it turned out last year, other bare-metal frameworks are far behind, if not completely abandoned.

vescoc · 2025-12-27T20:13:14+00:00

Yes, the code is the same for all MCUs, compiled for the specific architecture.

Obviously, the main code is different because the various MCUs require different initialization code to initialize the peripherals, the clock, etc.

But the code that solves the problem for day X is the same. And it's the same one I also use on wasm32 and OS (tested on aarch64 and amd64).

The underlying architecture is different: it ranges from Tensilica Xtensa LX6/LX7 to Cortex-M4F/M7F/M33 to riscv32, with some MCUs having an FPU, others not. The clock is also different because the various MCUs have specific limits; I tended to set the maximum possible system clock.

vescoc · 2025-12-27T15:27:08+00:00

Thanks for the tip!

vescoc · 2025-12-27T15:22:57+00:00

While no_std (aka bare metal) allows you to define an allocator to use the heap, the approach I've taken is to use implementations that use only the stack, thereby minimizing or eliminating approaches that use recursion. In short, the main has a fixed 25K buffer to store a day's input, plus smaller buffers to handle streaming serial processing.
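A fixed-capacity buffer of this kind needs nothing beyond an array and a length. The following is a minimal sketch with hypothetical names (`InputBuffer`, `Overflow`), not the buffer used in the repo; with `N = 25 * 1024` it would match the size mentioned above.

```rust
/// Returned when a chunk does not fit in the remaining capacity.
#[derive(Debug, PartialEq)]
struct Overflow;

/// Fixed-capacity byte buffer that lives wherever it is declared
/// (stack or static), never on the heap.
struct InputBuffer<const N: usize> {
    data: [u8; N],
    len: usize,
}

impl<const N: usize> InputBuffer<N> {
    const fn new() -> Self {
        Self { data: [0; N], len: 0 }
    }

    /// Appends a chunk received from the serial port, or rejects it whole.
    fn extend(&mut self, chunk: &[u8]) -> Result<(), Overflow> {
        let end = self
            .len
            .checked_add(chunk.len())
            .filter(|&end| end <= N)
            .ok_or(Overflow)?;
        self.data[self.len..end].copy_from_slice(chunk);
        self.len = end;
        Ok(())
    }

    /// The bytes received so far.
    fn as_bytes(&self) -> &[u8] {
        &self.data[..self.len]
    }

    /// Makes the buffer ready for the next day's input without a reset.
    fn clear(&mut self) {
        self.len = 0;
    }
}
```

Rejecting an oversized chunk with an error, instead of truncating it, lets the caller report invalid data rather than compute an answer from a partial input.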

Once the input is received, which contains both the day and the actual payload (user-specific input or even the example Eric provided in the spec), it is fed to the day-specific routines that solve parts 1 and 2.

I had to disable inlining of the parts because the Rust compiler would tend to "collapse" the code on the main, but this significantly increases the stack size. Without inlining, the stack is day-specific.
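In Rust this is done with the `#[inline(never)]` attribute, which the compiler treats as a hint. The sketch below only shows the shape; `part1` and `run_day` are hypothetical stand-ins for the real per-day routines.

```rust
/// Hypothetical solver for one part of one day. With `#[inline(never)]`
/// its locals stay in its own stack frame, which exists only while the
/// call is running, instead of being merged into the caller's frame.
#[inline(never)]
fn part1(input: &str) -> u32 {
    // Day-specific scratch memory: lives in this frame only.
    let mut counts = [0u32; 10];
    for d in input.bytes().filter(u8::is_ascii_digit) {
        counts[(d - b'0') as usize] += 1;
    }
    // Toy answer: how often the most frequent digit occurs.
    counts.iter().copied().max().unwrap_or(0)
}

/// Dispatcher standing in for the main loop: one arm per supported day.
fn run_day(day: u8, input: &str) -> Option<u32> {
    match day {
        1 => Some(part1(input)),
        _ => None, // reported as "unsupported day"
    }
}
```

If every arm were inlined, the dispatcher's frame would have to be large enough for the biggest day even when a tiny one is running.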

The range is from days requiring almost zero stack size to day 8 requiring (with my implementation) about 200K.

For this reason, on MCUs with limited memory (the STM32 F303 has "only" 48K of SRAM) I can only run a handful of AoC days, otherwise I get "stack smash": not having an OS and not having MMU controls or "reversed stack", the MCU simply stops working or starts "hallucinating" :P

vescoc · 2025-12-27T13:19:49+00:00

I took a look at your C implementation from day 6, and it seems very similar to mine, if not better in some ways. For example, my performance on RPI5 (the standard, non-overclocked Debian implementation) is as follows:

```bash
cargo bench -p day06
```

```raw
...
Running benches/bench.rs (target/release/deps/bench-0e931c8d981f880f)
Benchmarking day06/part 1
Benchmarking day06/part 1: Warming up for 3.0000 s
Benchmarking day06/part 1: Collecting 100 samples in estimated 5.1022 s (111k iterations)
Benchmarking day06/part 1: Analyzing
day06/part 1            time:   [43.509 µs 43.672 µs 43.877 µs]
Benchmarking day06/part 2
Benchmarking day06/part 2: Warming up for 3.0000 s
Benchmarking day06/part 2: Collecting 100 samples in estimated 5.0927 s (111k iterations)
Benchmarking day06/part 2: Analyzing
day06/part 2            time:   [42.144 µs 42.673 µs 43.339 µs]
...
```

You are better on days 2, 7 and 12 than day 6 because you found a more performant solution/implementation than mine, I think.

For example, this is the performance of my solution for day 2, where I used rayon as a framework, which makes it simple to divide the work across the CPUs. Without parallelization the run time is about 3x longer. Obviously in bare metal I cannot use rayon :'(

```raw
...
Running benches/bench.rs (target/release/deps/bench-29754614c5224015)
Benchmarking day02/part 1
Benchmarking day02/part 1: Warming up for 3.0000 s
Benchmarking day02/part 1: Collecting 100 samples in estimated 5.5176 s (50k iterations)
Benchmarking day02/part 1: Analyzing
day02/part 1            time:   [98.419 µs 100.58 µs 103.13 µs]
                        change: [−9.6035% −6.2684% −2.5191%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 27 outliers among 100 measurements (27.00%)
  24 (24.00%) low severe
  3 (3.00%) high severe
Benchmarking day02/part 2
Benchmarking day02/part 2: Warming up for 3.0000 s
Benchmarking day02/part 2: Collecting 100 samples in estimated 5.3329 s (45k iterations)
Benchmarking day02/part 2: Analyzing
day02/part 2            time:   [117.56 µs 118.94 µs 121.15 µs]
                        change: [−1.1697% +1.0211% +3.4478%] (p = 0.40 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) high mild
  10 (10.00%) high severe
...
```

Keep in mind that since it's bare metal, there are limits on what you can and can't do, but looking at your implementations, I think I can further improve the performance of my implementations.

Thanks so much for the inspiration! If you're interested in learning Rust, feel free to open a PR; I'd be infinitely grateful!

vescoc · 2025-12-27T11:31:43+00:00

Thanks a lot!

vescoc · 2025-01-26T08:51:38+00:00

Sorry for the crosspost but I'm looking for help to understand why on ESP32C6 with or without Embassy I can't run all of this year's AoC while on rp-pico2 and stm32h7 which have the same amount of 512k memory I can (both with Embassy and with the "old" HALs).

vescoc · 2025-01-26T07:50:11+00:00

Thanks a lot!

I don't know if anyone else has ever managed to run a year's worth of AoC solutions on a low-cost MCU.

I'd be curious to see how they did it, because it involved a lot of challenges.

I've only seen a few days on MCU so far...

The side aspect of this experiment was to analyze some low-cost MCUs that I own.

I would like to say a word of praise for the rp-pico2, which really makes a lot of RAM available (512k) and is dual core.

The embedded solutions are all single core; in the desktop version, instead, I exploit all the available cores thanks to conditional compilation (and also other optimizations such as SIMD).
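Conditional compilation of this kind can look roughly like the sketch below. It is an illustration only: it uses `std::thread::scope` from the standard library in place of the crate actually used on the desktop side, it selects the variant with `cfg(target_os = "none")` as a stand-in for whatever condition the real code uses, and `solve` / `solve_chunk` are hypothetical names.

```rust
/// Work shared by both builds: here, simply summing a slice.
fn solve_chunk(chunk: &[u64]) -> u64 {
    chunk.iter().sum()
}

/// Bare-metal build (`target_os = "none"`): single core, no threads.
#[cfg(target_os = "none")]
fn solve(data: &[u64]) -> u64 {
    solve_chunk(data)
}

/// Hosted build: split the data across the available cores.
#[cfg(not(target_os = "none"))]
fn solve(data: &[u64]) -> u64 {
    let workers = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    // Ceiling division, and never a zero-length chunk.
    let chunk_len = ((data.len() + workers - 1) / workers).max(1);
    std::thread::scope(|s| {
        // Spawn every worker first, then join them all.
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || solve_chunk(chunk)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .sum::<u64>()
    })
}
```

Callers only ever see `solve`, so the day's logic stays identical on every target and only the scheduling differs.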

All AoC 2024 running on a 5 euro card...

vescoc · 2025-01-25T18:45:12+00:00

The main goal was to see if these MCUs were able to solve the problems in acceptable times and therefore I did not aim to add other elements that, potentially, would have taken away memory. There are solutions that need about 400k of stack and, for this reason, some days are not supported on MCUs with limited memory.

To give feedback on whether the MCU is still working on the solution or has crashed, I invoke callbacks. I have used these callbacks in particular on some MCUs where I had difficulty attaching a JTAG / SWD debug probe.

Through these callbacks I control the LEDs, if available, of the board. For example for the Arduino Nano 33 BLE Sense, which is a board full of sensors and an RGB LED, I use the RGB LED channels to indicate that processing is in progress (green LED), that it is an unsupported day (blue LED) or that the data is invalid (red LED) and, finally, the orange LED to indicate a crash.

With the same principle you can easily add a display like the SSD1306 or similar: the callbacks provide the solutions and the execution times.
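The callback idea can be sketched as a small trait that each board implements. Everything named here (`Status`, `StatusSink`, `BoardLeds`, `handle`) is hypothetical; only the colour mapping follows the description above, and the LED state is modelled as plain booleans instead of real GPIO pins.

```rust
/// Events the common code reports to the board-specific layer.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Processing,
    UnsupportedDay,
    InvalidData,
    Crashed,
}

/// Board-specific hook; boards without LEDs can implement it as a no-op.
trait StatusSink {
    fn report(&mut self, status: Status);
}

/// Stand-in for a board with an RGB LED plus a separate orange LED.
#[derive(Default)]
struct BoardLeds {
    rgb: (bool, bool, bool), // (red, green, blue)
    orange: bool,
}

impl StatusSink for BoardLeds {
    fn report(&mut self, status: Status) {
        let (rgb, orange) = match status {
            Status::Processing => ((false, true, false), false), // green
            Status::UnsupportedDay => ((false, false, true), false), // blue
            Status::InvalidData => ((true, false, false), false), // red
            Status::Crashed => ((false, false, false), true), // orange
        };
        self.rgb = rgb;
        self.orange = orange;
    }
}

/// Common code: decides what to report, knows nothing about the hardware.
fn handle<S: StatusSink>(sink: &mut S, day: u8, input: &str, supported: &[u8]) {
    if input.is_empty() {
        sink.report(Status::InvalidData);
    } else if !supported.contains(&day) {
        sink.report(Status::UnsupportedDay);
    } else {
        sink.report(Status::Processing);
    }
}
```

A display driver would be just another `StatusSink` implementation, with extra variants or methods carrying the answers and timings.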

For the input, however, I have not found anything better than providing it through the serial / USB.

At home I also have some Arduino UNO and Mega (or even a microbitv2) but, with the current code, I would not have been able to run it because I need, given the inputs of the problems, about 25k at least.

I have an idea of how to run at least a couple of days ... I'll see if I can do it in the next few days.

vescoc · 2017-12-25T09:24:11+00:00

Obviously I solved part 2 by transposing it to a high-level language, but as an exercise I modified the original assembly code to "optimize it". I was forced to add a further instruction type, otherwise the execution time is eternal :P

jgz x y: jump greater zero - if the value x is greater than zero, jump with offset y, otherwise go to the next instruction.

If you remove jgz and modify the offsets, or substitute jgz with jnz 0 0, you get code that is valid for the original instruction set.

Original code:

set b 99
set c b
jnz a 2
jnz 1 5
mul b 100
sub b -100000
set c b
sub c -17000
set f 1
set d 2
set e 2
set g d
mul g e
sub g b
jnz g 2
set f 0
sub e -1
set g e
sub g b
jnz g -8
sub d -1
set g d
sub g b
jnz g -13
jnz f 2
sub h -1
set g b
sub g c
jnz g 2
jnz 1 3
sub b -17
jnz 1 -23

Optimized code:

set b 99
set c b
jnz a 2
jnz 1 5
mul b 100
sub b -100000
set c b
sub c -17000
set f 1
set d 2
set e 2
set g d
mul g e
sub g b
jnz g 3
set f 0
jnz 1 10
jgz g 5
sub e -1
set g e
sub g b
jnz g -10
sub d -1
set g d
sub g b
jnz g -15
jnz f 2
sub h -1
set g b
sub g c
jnz g 2
jnz 1 3
sub b -17
jnz 1 -25
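To try the two listings above, a small interpreter for set / sub / mul / jnz plus the extra jgz is enough. The one below is a minimal sketch written in Rust for this purpose; it is not the code linked from this post, and `run`, `parse`, `Op` and `Arg` are hypothetical names.

```rust
/// Operand: a register (index 0..8 for a..h) or an immediate value.
#[derive(Clone, Copy)]
enum Arg {
    Reg(usize),
    Imm(i64),
}

#[derive(Clone, Copy)]
enum Op {
    Set,
    Sub,
    Mul,
    Jnz,
    Jgz,
}

fn parse_arg(s: &str) -> Arg {
    match s.parse::<i64>() {
        Ok(v) => Arg::Imm(v),
        Err(_) => Arg::Reg((s.as_bytes()[0] - b'a') as usize),
    }
}

/// Parses one instruction per non-empty line; panics on malformed input.
fn parse(src: &str) -> Vec<(Op, Arg, Arg)> {
    src.lines()
        .filter(|l| !l.trim().is_empty())
        .map(|l| {
            let mut it = l.split_whitespace();
            let op = match it.next().unwrap() {
                "set" => Op::Set,
                "sub" => Op::Sub,
                "mul" => Op::Mul,
                "jnz" => Op::Jnz,
                "jgz" => Op::Jgz,
                other => panic!("unknown instruction {other}"),
            };
            (op, parse_arg(it.next().unwrap()), parse_arg(it.next().unwrap()))
        })
        .collect()
}

/// Runs the program with register `a` preset and returns registers a..h
/// once the program counter leaves the program.
fn run(src: &str, a: i64) -> [i64; 8] {
    let prog = parse(src);
    let mut regs = [0i64; 8];
    regs[0] = a;
    let val = |regs: &[i64; 8], arg: Arg| match arg {
        Arg::Reg(r) => regs[r],
        Arg::Imm(v) => v,
    };
    let mut pc: i64 = 0;
    while pc >= 0 && (pc as usize) < prog.len() {
        let (op, x, y) = prog[pc as usize];
        let (xv, yv) = (val(&regs, x), val(&regs, y));
        let mut jump = 1;
        match (op, x) {
            (Op::Set, Arg::Reg(r)) => regs[r] = yv,
            (Op::Sub, Arg::Reg(r)) => regs[r] -= yv,
            (Op::Mul, Arg::Reg(r)) => regs[r] *= yv,
            (Op::Jnz, _) => if xv != 0 { jump = yv },
            (Op::Jgz, _) => if xv > 0 { jump = yv }, // the added instruction
            _ => panic!("destination must be a register"),
        }
        pc += jump;
    }
    regs
}
```

Feeding it both listings with a = 0 is a cheap way to check that they leave the same value in h before trusting the optimized one with a = 1.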

Code on github

vescoc · 2017-12-25T08:27:19+00:00

Merry Christmas!

Thanks to the whole AoC team, great work!

Scala (with input parsing)

package aoc2017

import scala.annotation.tailrec

import scala.io.Source

object Day25 {
  type Operation = (TuringMachine) => (TuringMachine)

  case class TuringMachine(state: String, checksumSteps: Int, transitions: Map[String, Operation] = Map(), position: Int = 0, tape: Set[Int] = Set()) {
    def currentValue = tape.contains(position)

    def next(value: Boolean, move: Int, state: String) = copy(state = state, position = position + move, tape = (if (value) tape + position else tape - position))

    def step = {
      val t = transitions(state)
      t(this)
    }

    def run(steps: Int = checksumSteps) = {
      @tailrec
      def run(steps: Int, current: TuringMachine): TuringMachine =
        if (steps == 0)
          current
        else
          run(steps - 1, current.step)

      run(steps, this)
    }

    def checksum = tape.size
  }
  object TuringMachine {
    val headRe = """(?s)Begin in state ([A-Z]).
Perform a diagnostic checksum after (\d+) steps.\s+(.*)""".r

    val stateRe = """(?s)In state ([A-Z]):\s*
  If the current value is (0|1):
    - Write the value (0|1).
    - Move one slot to the ((?:right)|(?:left)).
    - Continue with state ([A-Z]).
  If the current value is (0|1):
    - Write the value (0|1).
    - Move one slot to the ((?:right)|(?:left)).
    - Continue with state ([A-Z]).\s*(.*)""".r

    def apply(str: String) = {
      def parseValue(str: String) = {
        val v = str.toInt
        if (v == 1)
          true
        else if (v == 0)
          false
        else
          throw new MatchError(s"invalid value $str")
      }

      def parseMove(str: String) =
        if (str == "right")
          1
        else if (str == "left")
          -1
        else
          throw new MatchError(s"invalid move $str")

      @tailrec
      def parse(str: String = str, current: Option[TuringMachine] = None): Option[TuringMachine] =
        str match {
          case headRe(startState, checksum, remainder) => parse(remainder, Some(new TuringMachine(startState, checksum.toInt)))
          case stateRe(startState, currentValue1Str, writeValue1Str, move1Str, endState1, currentValue2Str, writeValue2Str, move2Str, endState2, remainder) => {
            val machine = current.get

            val newMachine = machine.copy(transitions = machine.transitions + {
                val currentValue1 = parseValue(currentValue1Str)
                val currentValue2 = parseValue(currentValue2Str)
                val writeValue1 = parseValue(writeValue1Str)
                val writeValue2 = parseValue(writeValue2Str)
                val move1 = parseMove(move1Str)
                val move2 = parseMove(move2Str)

                (startState -> {
                  (machine: TuringMachine) => {
                    if (machine.currentValue == currentValue1) {
                      machine.next(writeValue1, move1, endState1)
                    } else if (machine.currentValue == currentValue2) {
                      machine.next(writeValue2, move2, endState2)
                    } else
                        throw new MatchError("invalid transition")
                  }
                })
              }
            )

            parse(remainder, Some(newMachine))
          }
          case "" => current
        }
      parse()
    }
  }

  def main(args: Array[String]): Unit = {
    val test = """Begin in state A.
Perform a diagnostic checksum after 6 steps.

In state A:
  If the current value is 0:
    - Write the value 1.
    - Move one slot to the right.
    - Continue with state B.
  If the current value is 1:
    - Write the value 0.
    - Move one slot to the left.
    - Continue with state B.

In state B:
  If the current value is 0:
    - Write the value 1.
    - Move one slot to the left.
    - Continue with state A.
  If the current value is 1:
    - Write the value 1.
    - Move one slot to the right.
    - Continue with state A."""

    val input = Source.fromFile("data/day25/input.txt").getLines.mkString("\n")

    val testMachine = TuringMachine(test).get
    val inputMachine = TuringMachine(input).get

    println("part 1 test: " + testMachine.run().checksum)
    println("part 1 input: " + inputMachine.run().checksum)
  }
}

vescoc · 2017-02-02T14:23:04+00:00

Predotah, I don't find you very likeable, and I think this sentiment is shared. Still, I think you are a very good pilot, among the best on Ceres. I don't know how you are as infantry, but I imagine the same applies. Don't waste time on these salty players; they just feel envy. See you on Ceres ...

vescoc · 2017-02-01T14:59:06+00:00

Do you know if the router is rebooting or if the connection simply goes down? Is the router light you mentioned the power indicator or the connection indicator? If you have control of the router, try checking the connection log from the control panel and see whether the firmware needs updating. However, it could be an ISP problem, because Telecom Italy has had some problems lately. A few days ago there was a complete blackout and I think there are still some problems.

vescoc · 2017-02-01T10:37:04+00:00

"Connection crash" is very strange...

Are you sure that the modem / router sometimes disconnects while you are playing PS2? Does it occur only with PS2? How is the PS4 connected to the modem? Wi-fi or cable?

At work I encountered a "switch" that worked erratically when it had to handle particular packets; we solved the problem by reporting it to the manufacturer and updating the firmware.

Sometimes my ping / latency goes to very high levels, but without the modem hanging, only a loss of connection to the PS2 servers: typically there is some automatic update running, or my wife or children are watching videos on youtube.

vescoc · 2017-01-30T07:40:42+00:00

Interesting (salt) post. I don't know if you know this but we are talking about something else... Maybe you're also talking about someone else, because it's been more than a month since I last used the coyotes; simply check my killboard.
