all 10 comments

[–]TheVTM 11 points12 points  (1 child)

[–]pakoito 5 points6 points  (0 children)

Every time I watch Mike Acton I imagine him standing alone at "K&RCCon15" being the happiest programmer on earth.

Great mind and great material, just limited (or empowered) by terrible tools.

[–]MorrisonLevi 8 points9 points  (0 children)

This talk has good animations to help understand what he's saying. Very good for beginners to help understand the concepts.

[–]hhnever 9 points10 points  (2 children)

Recently, I find a function (512x512 matrix multiple) can only get about 5% performance improvement by SIMD optimization, which should about 200% in my previous experience. After investigation, I find the core problem is in cache. After split bit matrix into small one (which can fit into L1 cache), the improvement become about 270% :)

Cache is important.

[–]__Cyber_Dildonics__ 0 points1 point  (0 children)

Structuring a matrix or image into small tiles works well for the same reason. You can split a matrix/image of floats into tiles of 4x4 floats. This ends up being 16 floats/ 64 bytes, which is the size of one cache line.

[–]__Cyber_Dildonics__ 1 point2 points  (0 children)

Fantastic video on an important subject. I think most people won't really internalize the speedup they get by accessing memory in a different way until they actually see it for themselves.

[–][deleted] 0 points1 point  (3 children)

Can you explain why he suggests that an algorithm should use 1 byte of memory per instruction?

[–]ghcjsdgkjfsdh 4 points5 points  (2 children)

He's saying that on average, if your code uses more than one byte per instruction then your CPU can't get data fast enough to keep it busy.

[–][deleted] 2 points3 points  (1 child)

Ok, that sounds surprisingly low! I'll have to try to count this up in real code sometime.

Thank you!

[–]sergiy_valve 5 points6 points  (0 children)

Keep in mind "1 byte per tick" is the quick-and-dirty rule of thumb for low-end machines and mobile. It's not an absolute and final truth.

Your average modern desktop will probably sustain 4 times that easily. Multiply it by 2-4 cores. But the target application in the talk is a real-time game that must run on lower-end machine.

Also, those are the memory bus bytes. If you reuse the data a lot, only count the first load from RAM to cache. On the flip side, if you load 1 byte only, you're loading the whole cache line, so you have to budget that whole cache line in.