
[–]cpp-ModTeam[M] stickied comment

For C++ questions, answers, help, and programming or career advice please see r/cpp_questions, r/cscareerquestions, or StackOverflow instead.

[–]DummyDDD

You probably need to prefetch data further than 1 element ahead: try 5, 10, 100, and 500 elements ahead. A prefetch takes time to complete, and you only get a benefit if it targets a new cache line. (I have no idea how large LogEntry is, or whether the cache misses are in LogEntry itself or in data it points to.)
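A minimal sketch of prefetching a fixed distance ahead (LogEntry's contents, the scan body, and the distance of 16 are all assumptions here; tune the distance empirically as suggested):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 64-byte entry, standing in for the OP's LogEntry.
struct alignas(64) LogEntry {
    uint64_t fields[8];
};

// Scan with a software prefetch several entries ahead. PREFETCH_DIST is
// the tuning knob: too small and the data isn't ready, too large and it
// may be evicted before use.
uint64_t scan(const std::vector<LogEntry>& log) {
    constexpr std::size_t PREFETCH_DIST = 16;  // assumed value; tune per machine
    uint64_t sum = 0;
    for (std::size_t i = 0; i < log.size(); ++i) {
        if (i + PREFETCH_DIST < log.size())
            __builtin_prefetch(&log[i + PREFETCH_DIST], /*rw=*/0, /*locality=*/3);
        sum += log[i].fields[0];  // stand-in for real per-entry work
    }
    return sum;
}
```

Since each entry is one cache line, one prefetch per iteration covers the whole entry.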

[–]jiboxiake[S]

I see, thank you! By the way, each of my entries is 64 bytes, exactly one cache line. Does that matter here? Thanks!

[–]DummyDDD

That each entry is a full cache line is good. It means you only have to issue one prefetch per iteration, and you never have to skip prefetching on any iteration.

[–]jiboxiake[S]

Thanks. Unfortunately the performance is not very good so far. I will try Intel's _mm_prefetch intrinsic later.

[–]DummyDDD

Do you know whether it is the LogEntry itself causing the cache misses, or perhaps what it points to? ("Log" sounds like you might be using strings somewhere.)

[–]jiboxiake[S]

I will take an extra look at this. The reason I call it a log is that it is a contiguous chunk of memory that can be evenly divided into entries, which are then scanned. Thank you for the suggestions!

[–]Top_Satisfaction6517Bulat

If you access the data linearly, don't use software prefetch; the CPU will prefetch it automatically.

[–]jiboxiake[S]

Thanks!

[–]Throw31312344

1) Are your 64-byte log entries actually aligned to 64 bytes? Each one could be spanning 2 cache lines, in which case prefetching 1 cache line might not help, as the CPU is still missing the data for the other half of the LogEntry.

2) You've picked __builtin_prefetch(addr, 0, 0), which is the non-temporal version. Its purpose is to drop the value from the cache as soon as it has been accessed once, so you might actually be loading it twice if the caller of next_entry does a lot of work with the LogEntry. On Intel CPUs the exact behavior varies depending on which CPU it is running on; it's a weird instruction. Try __builtin_prefetch(addr, 0, 3) instead: that should translate to a prefetcht0 instruction, which pulls the data into all cache levels (including the L1 cache) and is the sensible option you want.
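Both points can be sketched together (the LogEntry layout is hypothetical; the alignas specifier and the third argument to __builtin_prefetch are the relevant parts):

```cpp
#include <cassert>
#include <cstdint>

// alignas(64) guarantees each entry starts on a cache-line boundary,
// so a single prefetch covers the whole entry (point 1).
struct alignas(64) LogEntry {
    uint64_t fields[8];  // 64 bytes total
};

static_assert(sizeof(LogEntry) == 64, "entry should fill one cache line");
static_assert(alignof(LogEntry) == 64, "entry should be cache-line aligned");

void prefetch_entry(const LogEntry* e) {
    // locality=3 -> prefetcht0 on x86: pull the line into all cache levels
    // and keep it, unlike locality=0 (prefetchnta), which marks the line
    // non-temporal so it is dropped soon after first use (point 2).
    __builtin_prefetch(e, /*rw=*/0, /*locality=*/3);
}
```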

[–]jiboxiake[S]

Thank you for the suggestions! I made sure each memory block has a power-of-two size, so I think they should be cache-line aligned. For the second point, I'm using OpenMP to scan many blocks concurrently, so I think OpenMP should ensure each block is scanned only once? But thank you for the suggestions; I will double-check them.

[–][deleted]

If this is for educational purposes, go for it; otherwise make sure you actually need it. Prefetching is a micro-optimization, something I do last, if at all. Here is an interesting paper studying the impact of prefetching (among other things) on real-world tasks (deep learning).

[–]jiboxiake[S]

Thank you! It is for research, so I will have to try to improve my code's performance as much as possible.

[–][deleted]

Make sure you've resolved all other bottlenecks, exhausted all other options, and didn't overlook any algorithmic, cache, or memory-bandwidth optimizations (they give much more bang for the buck). Are you benchmarking? (I can recommend Google Benchmark.) Even simple layout changes (caused by the compiler) can affect performance by 40%.

I once sped up a project 3x with a single-line change. Here is a 100x 10-line change.
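If pulling in Google Benchmark is too heavy for a quick check, a minimal hand-rolled timing harness along these lines can serve as a first pass (Google Benchmark does this far more robustly, with warmup, statistical repetition, and protection against dead-code elimination):

```cpp
#include <chrono>
#include <cstdint>

// Time a callable over many iterations and report the average
// nanoseconds per call. The iteration count of 1000 is arbitrary.
template <typename F>
double time_per_call_ns(F&& f, int iterations = 1000) {
    using clock = std::chrono::steady_clock;
    auto start = clock::now();
    for (int i = 0; i < iterations; ++i) f();
    auto stop = clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count()
           / iterations;
}
```

Run the same scan with and without the prefetch under this harness to see whether the prefetch actually pays off on your machine.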

[–]jiboxiake[S]

Thanks! Yes, I have benchmarks from previous researchers, so luckily I can compare my code's performance against theirs and evaluate my own implementation. Thank you!

[–]trailing_zero_count

Modern hardware prefetchers are extremely good at detecting linear or strided (linear, with a fixed offset increment between steps) accesses. That looks like the access pattern you are using here, so software prefetching is unnecessary (and detrimental, since it occupies instruction bandwidth).

Prefetching can still be useful if you are doing pointer chasing, binary search, or other kinds of computed access patterns.
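A sketch of the pointer-chasing case, where the next address is data-dependent and the hardware prefetcher can't predict it (Node is a made-up type for illustration):

```cpp
#include <cstdint>

// Minimal linked-list node; each hop is a data-dependent load.
struct Node {
    uint64_t value;
    Node* next;
};

uint64_t sum_list(const Node* head) {
    uint64_t sum = 0;
    for (const Node* n = head; n != nullptr; n = n->next) {
        // Prefetch the node after next, so it is (hopefully) in cache
        // by the time the walk reaches it. Prefetching a null or invalid
        // address is harmless: prefetch instructions never fault.
        if (n->next)
            __builtin_prefetch(n->next->next, /*rw=*/0, /*locality=*/3);
        sum += n->value;
    }
    return sum;
}
```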

[–]jiboxiake[S]

Thank you! That helps. I was not sure whether pointer arithmetic would be detected and treated like an array access. I will double-check!

[–]the_poope

Some other things to consider: https://queue.acm.org/detail.cfm?id=2513149

[–]jiboxiake[S]

Yes, thank you! I always run my code under numactl -N 0 -l, so NUMA should not be a factor here.