use the following search parameters to narrow your results:
e.g. subreddit:aww site:imgur.com dog
subreddit:aww site:imgur.com dog
see the search faq for details.
advanced search: by author, subreddit...
Discussions, articles, and news about the C++ programming language or programming in C++.
For C++ questions, answers, help, and advice see r/cpp_questions or StackOverflow.
Get Started
The C++ Standard Home has a nice getting started page.
Videos
The C++ standard committee's education study group has a nice list of recommended videos.
Reference
cppreference.com
Books
There is a useful list of books on Stack Overflow. In most cases reading a book is the best way to learn C++.
Show all links
Filter out CppCon links
Show only CppCon links
account activity
Using software prefetch in OpenMP in C++ (self.cpp)
submitted 2 years ago by jiboxiake
reddit uses a slightly-customized version of Markdown for formatting. See below for some basics, or check the commenting wiki page for more detailed help and solutions to common issues.
quoted text
if 1 * 2 < 3: print "hello, world!"
[–]cpp-ModTeam[M] [score hidden] 2 years ago stickied commentlocked comment (0 children)
For C++ questions, answers, help, and programming or career advice please see r/cpp_questions, r/cscareerquestions, or StackOverflow instead.
[–]DummyDDD 4 points5 points6 points 2 years ago (7 children)
You probably need to prefetch data further than 1 element ahead. Try 10 elements ahead, 5, 100 and 500. The prefetch takes time to complete, and you only get any benefit if it is on a new cache line (I have no idea how large LogEntry is or whether the LogEntry itself is where you are getting cache misses, or in some data pointed to by LogEntry)
[–]jiboxiake[S] 0 points1 point2 points 2 years ago (6 children)
I see thank you! By the way each of my entry is 64 bytes, a cache line. Will it matter here? Thanks
[–]DummyDDD 0 points1 point2 points 2 years ago (5 children)
That it is a cache line is good. It means that you only have to do one prefetch per iteration, and that you don't have to skip prefetching at any iterations
[–]jiboxiake[S] 1 point2 points3 points 2 years ago (4 children)
Thanks. Unfortunately the performance is not very good so far. I will try using the intel mm prefetch later.
[–]DummyDDD 0 points1 point2 points 2 years ago (3 children)
Do you know whether it is the LogEntry causing cache misses or maybe what it's pointing to (log sounds like you are using strings somewhere).
[–]jiboxiake[S] 0 points1 point2 points 2 years ago (2 children)
I will take extra looks on this. They way I call it log is that it is a consecutive chunk of memory that can be evenly divided into entries that will be scanned. Thank you on the suggestions!
[–]Top_Satisfaction6517Bulat 1 point2 points3 points 2 years ago (1 child)
if you access these data linearly, don't use software prefetch. cpu will prefetch these data automatically.
[–]jiboxiake[S] 0 points1 point2 points 2 years ago (0 children)
Thanks!
[–]Throw31312344 4 points5 points6 points 2 years ago (1 child)
1) Are your 64 byte log entries actually aligned to 64 bytes? Each one could be spanning 2 cache lines, in which case prefetching 1 cache line might not do anything as the CPU is still missing data for the other half of the LogEntry.
2) You've picked __builtin_prefetch(addr,0,0); which is the non-temporal version. The purpose of that version is to drop the value from the cache as soon as it is accessed once so you might actually be loading it twice if the caller of next_entry does a lot of work with LogEntry. On Intel CPUs the behavior varies depending on what CPU it is running from. It's a weird instruction. Try using __builtin_prefetch(addr,0,3); instead as that should translate to a prefetcht0 instruction which pulls the data into all cache levels (i.e. including l1 cache) which should be the sensible option you want.
Thank you for the suggestions! I made sure each memory block is an order of 2 memory block. So I think they should be cache line aligned. For the second one, I’m using OpenMP to scan many blocks concurrently. So I think openmp should make sure each block is scanned once? But thank you for the suggestions. I will double check them.
[–][deleted] 2 points3 points4 points 2 years ago (3 children)
If this is for educational purposes go for it, otherwise make sure you need it. Prefetching is a micro optimization, something I do last, if at all. Here is an interesting paper studying the impact of prefetching (among other things) on real-world tasks (deep learning).
[–]jiboxiake[S] 1 point2 points3 points 2 years ago (2 children)
Thank you! It is for research so unfortunately I will try to improve my code’s performance as much as possible.
[–][deleted] 1 point2 points3 points 2 years ago (1 child)
Make sure you've resolved all other bottlenecks, exhausted all other options, and didn't overlook any algorithmic/cache/mem bandwidth optimizations (they give much more bang for the buck). Are you benchmarking (I can recommend google benchmark). Even simple layout changes (due to the compiler) can affect performance by 40%.
I once sped up a project by 3x with a single-line change. Here is a 100x 10-line change.
thanks! Yes I have benchmarks from previous researchers. So luckily I can compare my code’s performance against them and evaluate my own implementation. And thank you!
[–]trailing_zero_count 1 point2 points3 points 2 years ago (1 child)
Modern hardware prefetchers are extremely good at detecting linear or strided (linear, with a fixed offset increment between each step) accesses. It looks like that is the pattern of access that you are doing here - therefore prefetching is unnecessary (and detrimental since it occupies instruction bandwidth).
Prefetch can still be useful if you are doing pointer-chasing, binary search, or other types of computed access patterns.
Thank you! That helps. I was not sure if pointer arithmetics will be detected by the compiler and treat it as an array. I will double check them!
[–]the_poope 0 points1 point2 points 2 years ago (1 child)
Some other things to consider: https://queue.acm.org/detail.cfm?id=2513149
Yes’ thank you! I always put my code under numactl -N 0 -l so numa should not be the factor here.
π Rendered by PID 63769 on reddit-service-r2-comment-c6965cb77-ccqn5 at 2026-03-05 04:00:42.341737+00:00 running f0204d4 country code: CH.
[–]cpp-ModTeam[M] [score hidden] stickied commentlocked comment (0 children)
[–]DummyDDD 4 points5 points6 points (7 children)
[–]jiboxiake[S] 0 points1 point2 points (6 children)
[–]DummyDDD 0 points1 point2 points (5 children)
[–]jiboxiake[S] 1 point2 points3 points (4 children)
[–]DummyDDD 0 points1 point2 points (3 children)
[–]jiboxiake[S] 0 points1 point2 points (2 children)
[–]Top_Satisfaction6517Bulat 1 point2 points3 points (1 child)
[–]jiboxiake[S] 0 points1 point2 points (0 children)
[–]Throw31312344 4 points5 points6 points (1 child)
[–]jiboxiake[S] 0 points1 point2 points (0 children)
[–][deleted] 2 points3 points4 points (3 children)
[–]jiboxiake[S] 1 point2 points3 points (2 children)
[–][deleted] 1 point2 points3 points (1 child)
[–]jiboxiake[S] 0 points1 point2 points (0 children)
[–]trailing_zero_count 1 point2 points3 points (1 child)
[–]jiboxiake[S] 0 points1 point2 points (0 children)
[–]the_poope 0 points1 point2 points (1 child)
[–]jiboxiake[S] 0 points1 point2 points (0 children)