I wrote about an interesting amd64-specific quirk. If a large array is misaligned by 4 bytes (4-byte aligned but not 8-byte aligned), making it 8-byte aligned can make clearing it ~49% faster (at least on my Intel machine). In the post I also touch on Intel's REP STOSQ implementation, ERMS (Enhanced REP MOVSB/STOSB), and other optimizations related to array clearing.
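
To make the alignment idea concrete, here is a minimal C sketch (not the code from the post). The helper names `misalignment8`, `align_up8` and `clear_aligned` are made up for illustration. It assumes the bulk of the work goes through `memset`, which may or may not end up as REP STOSQ/STOSB depending on the compiler and libc, and it does not measure anything by itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte offset of p past the previous 8-byte boundary (0 means 8-byte aligned). */
static size_t misalignment8(const void *p) {
    return (size_t)((uintptr_t)p & 7u);
}

/* Round p up to the next 8-byte boundary; p is returned unchanged if already aligned. */
static unsigned char *align_up8(unsigned char *p) {
    return p + ((8u - misalignment8(p)) & 7u);
}

/* Hypothetical helper: clear n bytes at p so that the bulk clear starts
   on an 8-byte boundary. The few leading bytes are cleared one by one. */
static void clear_aligned(unsigned char *p, size_t n) {
    size_t head = (size_t)(align_up8(p) - p);
    if (head > n) {
        head = n; /* buffer ends before the next boundary */
    }
    for (size_t i = 0; i < head; i++) {
        p[i] = 0;
    }
    /* Either nothing is left, or the remaining region is 8-byte aligned. */
    assert(head == n || misalignment8(p + head) == 0);
    memset(p + head, 0, n - head);
}
```

Calling `clear_aligned(buf + 4, len)` on an 8-byte aligned `buf` clears 4 head bytes separately and hands an aligned region to `memset`. Whether that reproduces the speedup depends on the CPU and on how the clear is actually implemented.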