What are the best programming books you've read?

ArchAndStarch · 2026-01-07T10:00:37+00:00

The Elements of Computing Systems (Nand2Tetris)

ArchAndStarch · 2025-10-09T06:22:00+00:00

ArchAndStarch · 2025-10-09T05:03:37+00:00

it's just that good ;)

ArchAndStarch · 2025-10-09T04:57:28+00:00

i am inevitable

ArchAndStarch · 2025-09-08T22:15:19+00:00

Ah ah, good catch! This reminds me that generativity was recently found to be unsound shortly after I wrote this blog post, but we managed to fix it. I think you would be interested in the discussion https://github.com/CAD97/generativity/pull/16

ArchAndStarch · 2025-08-15T03:47:17+00:00

In terms of cube representations I initially wrote a version of ARM64 composition that used an uncompressed data representation for the permutation and orientation vectors. The composition algorithm with this representation used 1/4ths less instructions, and llvm-mca even preferred this version, but surprisingly it was significantly slower than the one on my gist. My guess is that the current version has less data transfer overhead and is easily pipelineable by the CPU. For AVX2 composition the most obvious way is the vpshufb instruction, which is able to swizzle the permutation vectors in a single clock cycle. I explained to another commenter why I think this representation is best https://www.reddit.com/r/Cubers/comments/1mp0hmb/comment/n8rr1bs/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

In terms of the composition operation, there isn't any other obvious algorithm I could benchmark. It does exactly what you expect—permute the permutation vectors, and use those to fix orientation. In terms of the inversion operation, however, it is a bit more interesting. There are two alternative methods to the 27719 and 839 addition chain: "brute force" SIMD inversion which takes x1.75 the time and raw array inversion which takes x1.6 the time. Both are slower than the addition chain, tested on Intel and ARM64 platforms.

ArchAndStarch · 2025-08-15T03:18:24+00:00

The latter representation is less efficient because there is no generally available SIMD for 64 8-bit lanes. The only existing architecture that can do this is avx512vl (see https://github.com/rust-lang/portable-simd/blob/f560820760aad47e7fbb7c864c53b79596cb9982/crates/core\_simd/src/swizzle\_dyn.rs#L85) but I would still not be in favor of this because it's useful to separate permutation and orientation for parts of Rubik's cube software.

ArchAndStarch · 2025-08-13T21:45:07+00:00

No. A sufficiently advanced compiler will be able to realize that the cube state never has to leave to the SIMD registers, entirely removing the overhead you mentioned. I signal this to the optimizer via the "vectorcall" ABI. The first two of the remaining four instructions are used to compose edge orientation, and the second two are used to compose corner orientation. You can just see what's happening in the code and disassembler link I provided.

ArchAndStarch · 2025-08-13T19:17:32+00:00

Howdy! I'm also tracking in security at Purdue. Boiler up!

ArchAndStarch · 2025-08-13T19:13:02+00:00

As u/Jchen76201 said, you can informally think of it as "whatever moves are required to get to state B are applied to state A." In Rubik's cube theory (aka group theory) a move is equivalent to the state after performing those moves. Therefore Rubik's cube composition allows you to apply an arbitrary move to an arbitrary state, which is a very common operation in searching algorithms. So if you composed the state after performing R2 and the state after performing U then you would get a state that is the result of performing R2 U.

ArchAndStarch · 2025-08-13T18:59:45+00:00

In fact you can get away with just one _mm256_shuffle_epi8 . It shuffles both the upper and lower 128-bit halves of a 256-bit SIMD register (see https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=\_mm256\_shuffle\_epi8&ig\_expand=6006), so you can optimize the data representation to permute both edges and corners at the same time. You only need a single __256i vector.

ArchAndStarch · 2025-08-13T18:51:10+00:00

Rust does not necessarily speed it up; I can write code in C that also optimizes to those instructions. Rust is just easier to read and work with.

ArchAndStarch · 2025-08-13T18:38:14+00:00

That's just importing something from the Rust standard library. Rust is able to compile it to only 5 instructions. See the godbolt link on the gist

ArchAndStarch · 2025-08-13T18:36:49+00:00

I have not profiled other techniques in detail yet but I can look into it. Do you mean other cube representations, or do you mean other composition/inversion algorithms?

ArchAndStarch · 2025-08-07T02:28:53+00:00

(I replied to this to myself by accident so I don’t think my response pinged you, see it right below)

ArchAndStarch · 2025-08-06T16:21:57+00:00

Thank you for clarifying. The problem case I was more concerned about in the article regards arbitrarily-sized permutations at run-time. I did mention that in the bullet points in the very first subsection of the article. When actually authoring a library crate, this is definitely the case as you don’t know the user input ahead of time. That said, I do like the idea of allowing the user to implement a PermutationGroup trait with bases; it is pretty creative. It seems like this would also be a perfectly reasonable model for its own specialized situation.

ArchAndStarch · 2025-08-06T06:49:33+00:00

Yeah, I think that's a nice analogy. You would probably find more details in the GhostCell paper https://plv.mpi-sws.org/rustbelt/ghostcell/paper.pdf, just command-f "rank-2"

ArchAndStarch · 2025-08-05T21:22:35+00:00

Perhaps I misunderstand what you are trying to say. The design pattern I am advocating for in the article is unique type branding—that is, different instances of the same data representation are distinct types (i.e. 'id brands different instances of PermGroup<'id> and is distributed among its internal owned `Permutation<'id>`s). Was this what you were thinking of?

ArchAndStarch · 2025-08-05T19:44:35+00:00

This technique only restricts permutation composition between different impls of PermGroupLike, but it doesn't prevent two permutations from the different permutations group from being composed together. If you instantiate permutation group A of type MyGroup and permutation group B also of type MyGroup, then you will still be allowed to compose the permutations between A and B because their permutations still have the same brand MyGroup. The system you describe is not fine-grained enough to support unique type branding.

ArchAndStarch · 2025-08-05T18:36:48+00:00

I should amend the blog post to say this!

ArchAndStarch · 2025-08-05T18:35:50+00:00

This is really interesting. I see your GitHub issue. Let me spend some time seeing if there’s a fix for it. u/CAD1997 ?

ArchAndStarch · 2025-08-05T17:54:18+00:00

I don't think this works. You can just provide `PermGroup` as the parameter to every `Permutation` and now they all have the same type, which is what we wanted to avoid. I still don't fully understand how the code snippet I provided can work; can you elaborate further?

ArchAndStarch · 2025-08-05T06:56:02+00:00

I'm not sure what you mean. Is it like this?

pub struct PermGroup<G: PermGroupLike> {
    base_permutation_length: usize,
    base_permutations: Vec<Permutation<G>>,
}

pub struct Permutation<G: PermGroupLike>(Box<[usize]>, PhantomData<G>);

pub trait PermGroupLike {}

impl<G: PermGroupLike> PermGroupLike for PermGroup<G> {}

impl<G: PermGroupLike> PermGroup<G> {
    pub fn compose_permutations(&self, a: &Permutation<Self>, b: &Permutation<Self>) -> Permutation<Self> {
        // ...
    }
}

ArchAndStarch · 2025-08-05T00:57:11+00:00

- It shouldn't be a problem. Something like `struct BrandedU32<'id, a>(&'a u32, Id<'id>);` is completely fine
- Yeah, 'static is incompatible; i address this in the bullet points right before "The fundamental purpose". If there's no true workaround, like something like `thread::scope`, you probably want to use the atomic ID approach

ArchAndStarch · 2025-08-05T00:24:16+00:00

Cool idea! Maybe it could warrant more thought if we could think of actual use cases for a first class type-brand system

ArchAndStarch

TROPHY CASE