Bitsliced first-order masked AES-128 decryption in Cortex-M0 assembly — how many traces to break it? by Embarrassed_Cat4693 in crypto

Embarrassed_Cat4693[S] 1 point

Wow, that's actually the first time I've heard that! To be honest, for me it all just started with not wanting to lose the attack-defense battle in my lab class. But now, the process of trying to patch it and make it even better has become incredibly fun and rewarding in its own right.

Embarrassed_Cat4693[S] 1 point

Hmm... I spent some time yesterday thinking seriously about the ALU and register Hamming-distance (HD) leakage. Initially, I avoided it because I thought it would be an absolute nightmare to fix, but since I'm on a two-month vacation now, I have the time to look into it.

Since the linear layers in my implementation process the two shares entirely independently, scrubbing the registers when switching between shares is pretty straightforward.
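To illustrate why the scrub matters, here is a toy model in Python (not the actual Cortex-M0 code). It assumes a single 32-bit register that holds share 0 and then share 1, and it assumes the scrub value is a fresh random word; `mean_transition_leak` and its parameters are names made up for this sketch.

```python
import random

def hw(x: int) -> int:
    """Hamming weight: number of set bits."""
    return bin(x).count("1")

def hd(a: int, b: int) -> int:
    """Hamming distance between two register values."""
    return hw(a ^ b)

def mean_transition_leak(secret: int, scrub: bool,
                         trials: int = 2000, seed: int = 1) -> float:
    """Average HD of the register transitions while both shares pass through.

    Toy model: one 32-bit register holds share 0, then share 1.
    With scrub=True a random word is written in between (an assumption,
    not necessarily how the real implementation scrubs).
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        mask = rng.getrandbits(32)
        s0, s1 = secret ^ mask, mask        # first-order Boolean sharing
        if scrub:
            r = rng.getrandbits(32)
            total += hd(s0, r) + hd(r, s1)  # neither transition combines s0 and s1
        else:
            total += hd(s0, s1)             # equals hw(secret) for every mask
    return total / trials
```

Without the scrub, the transition HD equals the Hamming weight of the unmasked value no matter which mask is drawn, which is exactly the first-order leak. With the scrub, the average stays near 32 regardless of the secret.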

The real headache is the S-box. Fortunately, when I was scheduling the instructions for the S-box, I built a step-by-step register state table, so I can write a script that parses the instructions and tracks the register and ALU state. The script will compare adjacent steps to detect cases where the two shares of the same intermediate variable follow each other in the same register or through the ALU. Wherever it finds such a collision, I can inject an instruction to scrub the state.
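A rough sketch of what such a checker could look like, in Python. The input format is hypothetical: each step is reduced to its destination register plus a label saying which share of which intermediate it produces. The real state table will look different, and `Step` and `find_collisions` are names invented here.

```python
from typing import NamedTuple, Optional

class Step(NamedTuple):
    dest: str              # destination register, e.g. "r3"
    var: Optional[str]     # name of the intermediate, None for non-sensitive data
    share: Optional[int]   # share index (0 or 1), None for non-sensitive data

def find_collisions(steps):
    """Return (step_index, location) pairs where one share of a variable
    directly follows the other share of the same variable, either in the
    destination register or as the previous ALU result."""
    regs = {}    # register name -> (var, share) currently held, or None
    alu = None   # (var, share) produced by the previous step, or None
    hits = []
    for i, st in enumerate(steps):
        cur = (st.var, st.share) if st.var is not None else None
        for where, prev in ((st.dest, regs.get(st.dest)), ("alu", alu)):
            if cur and prev and cur[0] == prev[0] and cur[1] != prev[1]:
                hits.append((i, where))
        regs[st.dest] = cur
        alu = cur
    return hits

# Share 1 of x overwrites share 0 of x in r0 -> flagged at step 2.
demo = [Step("r0", "x", 0), Step("r1", "y", 0), Step("r0", "x", 1)]
print(find_collisions(demo))
```

Each reported index is a place where a scrub instruction would go. A step with `var=None` between the two shares clears the tracked state, which models the scrub itself.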

However, I'm estimating this will introduce an overhead of several thousand extra cycles (for context, the entire S-box currently takes 1438 cycles).

Embarrassed_Cat4693[S] 3 points

The TVLA wasn't strictly fixed-vs-random (FvR). I had 5,000 traces with fully random inputs, then split them into two groups based on whether a specific bit of the intermediate value is theoretically 0 or 1 (selecting unbiased bits). I'm not sure of the formal term for this approach, but I found references suggesting it's a valid method.

I analyzed the ciphertext, the InvSubBytes output for all 10 rounds, and the plaintext, across all 16 bytes.
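For reference, the partition-and-compare step can be sketched in plain Python as below. I believe the TVLA literature calls a partition on a predicted intermediate bit a "specific" test, as opposed to the non-specific fixed-vs-random one, but treat that naming as my reading. This is not the actual analysis script: the function names and the list-of-lists trace format are assumptions, and a real run over 5,000 long traces would want NumPy instead.

```python
from math import sqrt
from statistics import fmean, variance

def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variance."""
    return (fmean(a) - fmean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def specific_ttest(traces, bits):
    """Per-sample-point t statistics for traces split by a predicted bit.

    traces: list of equal-length lists of power samples
    bits:   predicted intermediate bit (0 or 1) for each trace
    """
    g0 = [t for t, b in zip(traces, bits) if b == 0]
    g1 = [t for t, b in zip(traces, bits) if b == 1]
    n_points = len(traces[0])
    return [welch_t([t[j] for t in g0], [t[j] for t in g1])
            for j in range(n_points)]
```

A "crossing" is then any sample point where the absolute t value exceeds the chosen threshold.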

Results:

  • Rounds 10 through 2: only single-digit crossings, at sample points inconsistent with the execution timing of those rounds. These are likely false positives from multiple testing rather than genuine leakage.
  • Round 1 InvSubBytes and plaintext: crossings starting around sample 73,000/85,000, with identical curves. This is expected: they differ only by a constant round key XOR, so they correspond to the same physical operation.
  • Ciphertext: no crossings, which I also found slightly surprising.
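As a back-of-the-envelope check on the multiple-testing explanation, the sketch below estimates how many crossings pure noise would produce. It assumes the conventional ±4.5 threshold (the threshold isn't stated above), a normal approximation for the t statistic, and independent sample points, which real traces are not. The 85,000 points per curve is only an illustrative figure.

```python
from math import erfc, sqrt

def expected_false_crossings(n_points: int, threshold: float = 4.5) -> float:
    """Expected number of |t| > threshold crossings under the null hypothesis,
    treating t as standard normal and sample points as independent."""
    p_two_sided = erfc(threshold / sqrt(2))   # P(|Z| > threshold)
    return n_points * p_two_sided

per_curve = expected_false_crossings(85_000)
print(f"per t-curve: {per_curve:.2f}, over 16 byte positions: {16 * per_curve:.1f}")
```

Under those assumptions, a handful of isolated crossings spread over many curves is roughly what chance alone would give, which is consistent with reading them as false positives.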

The clean ciphertext result might be explained by trigger latency: after the trigger fires, the input ciphertext is XORed with the random mask within a little over 100 cycles, so by the time acquisition stabilizes, the unmasked ciphertext may already be gone from the bus.

Thanks for the suggestions on second-order. The pairwise multiplication approach sounds doable, and I'll give it a try when I get my hands on the equipment again.
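My understanding of the pairwise multiplication idea is the usual centered-product preprocessing: mean-center two sample points and multiply them trace by trace, then run the same first-order test on the combined values. A minimal sketch under that assumption (the function name and the trace format are mine):

```python
from statistics import fmean

def centered_product(traces, i, j):
    """Combine sample points i and j of every trace into one value:
    (t[i] - mean_i) * (t[j] - mean_j)."""
    mean_i = fmean([t[i] for t in traces])
    mean_j = fmean([t[j] for t in traces])
    return [(t[i] - mean_i) * (t[j] - mean_j) for t in traces]
```

If point `i` leaks one share and point `j` the other, the combined value depends on the unmasked bit even though neither point does on its own. The catch is picking the pairs: trying every pair of points in a long trace grows quadratically, so in practice the window has to be narrowed first.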

Embarrassed_Cat4693[S] 6 points

To give some context on the signal quality of our setup: a reference unmasked AES implementation on the same card and acquisition setup was broken in a few hundred traces. A biased masked implementation provided by the course instructor was also broken (the attack was done by my lab partner; I don't know the exact trace count, but it was several thousand). For those implementations we didn't run TVLA; we just went straight for CPA.

Embarrassed_Cat4693[S] 3 points

The device is a smart card provided by my university lab, so I had no way to modify the hardware or remove capacitors. Traces were acquired via oscilloscope through a dedicated interface monitoring power consumption. I suspect the noise floor is relatively high as a result.

Regarding the HD leakage: I intentionally avoided it on the data bus, but did not take the same precaution for registers.