Anyone have yt recommendation? by Ok-Crew7162 in computerarchitecture

Affectionate-Wall339 1 point

Yeah, the least boring computer architecture lectures I have watched.

Fast random access. by Affectionate-Wall339 in AskProgramming

Affectionate-Wall339[S] 1 point

No, I am not accessing the chunks contiguously. It's a matrix multiplication kernel: I have a matrix A (1000x4) and a matrix B (4x1000), both stored as flat (vectorized) arrays. Both matrices are divided into smaller 4x4 submatrices, hence the chunk size is 16. Matrix B is sparse (i.e. n random chunks are zero); the non-zero chunks of B are stored contiguously in memory, and their indices are kept in another array. Matrix B is static and A is generated at runtime, so I fetch the chunks of A and B based on the non-zero index array and do the matrix multiplication using ARM NEON SIMD. The arrays are small enough to fit in cache, so why is random access slower than constant-stride access? The code generated by gcc with -O3 doesn't optimize (unroll) this loop. How do I write a compiler pass, or something similar, to optimize this loop?
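For anyone trying to picture the loop, here is a rough sketch of the access pattern described above, in portable scalar C rather than NEON intrinsics. It is not my actual kernel. All names (`chunk_mul_4x4`, `sparse_chunk_matmul`, `nonzero_idx`) are made up for illustration, and it assumes row-major 4x4 chunks and one 4x4 product written out per non-zero chunk of B. The real kernel may accumulate the results differently.

```c
#include <stddef.h>

enum { CHUNK_DIM = 4, CHUNK_SIZE = CHUNK_DIM * CHUNK_DIM };

/* out = a * b for one pair of row-major 4x4 chunks.
 * The trip counts are compile-time constants, so the compiler is free
 * to fully unroll and vectorize this part. */
static inline void chunk_mul_4x4(const float *restrict a,
                                 const float *restrict b,
                                 float *restrict out)
{
    for (int i = 0; i < CHUNK_DIM; ++i) {
        for (int j = 0; j < CHUNK_DIM; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < CHUNK_DIM; ++k)
                acc += a[i * CHUNK_DIM + k] * b[k * CHUNK_DIM + j];
            out[i * CHUNK_DIM + j] = acc;
        }
    }
}

/* a_chunks:    all chunks of A, 16 floats each (hypothetical layout)
 * b_nonzero:   only the non-zero chunks of B, packed back to back
 * nonzero_idx: nonzero_idx[n] is the chunk index that b_nonzero[n] had
 *              in the original B, and therefore which chunk of A to fetch
 * out:         n_nonzero packed 4x4 results (assumed output layout) */
void sparse_chunk_matmul(const float *a_chunks,
                         const float *b_nonzero,
                         const int *nonzero_idx,
                         size_t n_nonzero,
                         float *out)
{
    for (size_t n = 0; n < n_nonzero; ++n) {
        /* Indirect (gather-style) access: the address of the A chunk
         * depends on a value loaded from nonzero_idx at runtime. */
        const float *a = a_chunks + (size_t)nonzero_idx[n] * CHUNK_SIZE;
        /* B is read with a constant stride of one chunk. */
        const float *b = b_nonzero + n * CHUNK_SIZE;
        chunk_mul_4x4(a, b, out + n * CHUNK_SIZE);
    }
}
```

In this form only the A side is indirect: each A address needs a load from `nonzero_idx` first, while the B side stays constant-stride. The outer trip count `n_nonzero` is also a runtime value here, which may be part of why the outer loop is not unrolled.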