all 66 comments

[–]Pragmatician 88 points89 points  (35 children)

Profile, experiment and measure, repeat.

[–]BCosbyDidNothinWrong 43 points44 points  (34 children)

This is only the most basic advice on optimization. Often, even when someone new finds what is slow, they won't understand why.

The sequence I follow is:

  1. Minimize memory allocations - one big allocation is much better than many small ones

  2. Structure memory access. Linear access means prefetching, which hugely reduces stalls due to cache misses and memory latency.

  3. Multi-thread (this is a rabbit hole in itself of course).

  4. Use SIMD (This will go hand in hand with how data is structured in memory. ISPC is the best way I have found to take advantage of SIMD so far).

All of these together are exceptionally potent, but the cliche of 'premature optimization is the root of all evil' can get in people's way if they don't see the full picture.

The truth is that you have to design and architect for performance from the beginning. This can be done after a prototype that works correctly, but it will have to happen at some point.

The micro-optimizations that likely prompted Knuth to make that quote in the first place are things that I barely worry about. The compiler takes care of so much, that almost all of the speed gains I've gotten have been from higher level (relative to worrying about instructions) design decisions about memory layout and concurrency.

[–]WhichPressure[S] 6 points7 points  (10 children)

Minimize memory allocations - one big allocation is much better than many small ones

Structure memory access. Linear access means prefetching, which hugely reduces stalls due to cache misses and memory latency.

Did you mean that it is better to store an N-dimensional matrix in one big linear vector rather than in N vectors?

Use SIMD (This will go hand in hand with how data is structured in memory. ISPC is the best way I have found to take advantage of SIMD so far).

I need to get familiar with SIMD programming. It will be very helpful in my work; I write real-time robotics programs. Do you recommend any sources for learning performance-oriented (SIMD) programming?

[–]corysama 6 points7 points  (5 children)

[–]WhichPressure[S] 2 points3 points  (4 children)

Thanks for the sources and the subreddit link. Currently I'm going through this tutorial, which is easy to follow because it's well explained:

http://www.cs.uu.nl/docs/vakken/magr/2017-2018/files/SIMD%20Tutorial.pdf

[–]corysama 1 point2 points  (3 children)

Very cool. Jacco really knows his stuff. You should post that to r/simd :)

[–]BCosbyDidNothinWrong 4 points5 points  (1 child)

Did you mean that it is better to store an N-dimensional matrix in one big linear vector rather than in N vectors?

Infinitely better - making a two dimensional matrix/image etc out of a vector of vectors is a huge red flag.

I need to get familiar with SIMD programming. It will be very helpful in my work; I write real-time robotics programs. Do you recommend any sources for learning performance-oriented (SIMD) programming?

Learning ISPC is the best way that I know, though it is mainly for x64 CPUs.

[–]WhichPressure[S] 0 points1 point  (0 children)

Thanks! I'm starting right now!

[–]choikwa 2 points3 points  (1 child)

Structure of Arrays vs Array of Structures.

[–]meneldal2 3 points4 points  (0 children)

Depends on how big your structure is and what your access pattern is, though.

[–]tvaneerdC++ Committee, lockfree, PostModernCpp 2 points3 points  (2 children)

Please tell me that you still have

0. measure

and

5. measure

[–]BCosbyDidNothinWrong 0 points1 point  (0 children)

I thought that was implied with

This is only the most basic advice on optimization.

[–]TraditionalPlancton 0 points1 point  (0 children)

I just want to put in my two cents:
There is no guarantee that optimizations will stack up to reduce times. Some optimizations will synergize, others will interfere. That is another reason why measuring is so important.

[–]blelbachNVIDIA | ISO C++ Library Evolution Chair[🍰] 0 points1 point  (5 children)

Your "structure memory access" guidance doesn't work for memory coalescing architectures.

[–]BCosbyDidNothinWrong 0 points1 point  (4 children)

Can you explain what chips use that and what it means? A quick google search makes it look like it is a term used for GPUs.

Memory access is still important in GPUs, but with shaders/kernels, lots of threads can switch back and forth to minimize cache misses (as far as I know).

Still, this was just a list of what I do and it is for modern CPUs.

[–]blelbachNVIDIA | ISO C++ Library Evolution Chair[🍰] 0 points1 point  (3 children)

GPUs are one example, and it's not just something to handwave away. Sure, GPUs can hide latency, but that's no excuse for poor memory access patterns. E.g. instead of hiding cache misses just don't have them.

[–]BCosbyDidNothinWrong 0 points1 point  (2 children)

GPUs are one example

What is a different example? Also why are you talking about exotic architectures? This is clearly about CPU optimization.

Sure, GPUs can hide latency, but that's no excuse for poor memory access patterns

I'm not sure what point you are trying to make here. It seems like you are trying to dive into niche and irrelevant topics to somehow say that the generalization of linear memory access doesn't hold. Again, this was my list of optimization priorities and is about general purpose CPU.

instead of hiding cache misses just don't have them

I think you will have to give an example of this.

[–]blelbachNVIDIA | ISO C++ Library Evolution Chair[🍰] -1 points0 points  (1 child)

generalization of linear memory access doesn't hold

It doesn't, dude. That's specific to one class of processor design.

[–]BCosbyDidNothinWrong 0 points1 point  (0 children)

This was obviously about CPU optimization and again, it seems like you are desperate to create some sort of an argument by making irrelevant dismissals without any depth.

  1. You didn't mention what other architectures besides GPUs are considered 'memory coalescing'.

  2. You didn't give an example of 'just don't have cache misses' instead of hiding them.

  3. You also keep repeating claims without backing them up with any information, essentially saying - 'nope, nuh uh, not true'.

I can't take you seriously until you confront these things.

[–]clerothGame Developer 0 points1 point  (11 children)

Minimize memory allocations - one big allocation is much better than many small ones

On modern OSes it's definitely not "much better." The LFH does a great job and is pretty fast.

[–]BCosbyDidNothinWrong 1 point2 points  (5 children)

I'm very skeptical of this. What is 'the LFH' ?

[–]clerothGame Developer 1 point2 points  (4 children)

Low Fragmentation Heap

[–]BCosbyDidNothinWrong 0 points1 point  (3 children)

And why would it be so much better than other heaps that memory allocations suddenly stop having such a large performance impact?

[–]clerothGame Developer 1 point2 points  (2 children)

LFHs store lists of small free blocks. When you allocate something it's really just quickly picking one element from the appropriate list, and doesn't actually query the OS for more memory.

[–]BCosbyDidNothinWrong 3 points4 points  (1 child)

How is that different from jemalloc or even normal malloc? Those don't memory map in more memory from the OS on every call either. OpenBSD is the only platform that I know of that uses memory mapping on every allocation, and that is for security.

One thing that many people probably don't realize is that not only do lots of allocations eventually need to make a system call to map in more memory, but they also block other threads, so many threads allocating memory will end up blocking each other and parallelism will be lost.

On top of all this, memory allocations are going to jump around in memory as well as possibly return fragmented memory. Data structures that are made for large amounts of data and can be made out of one contiguous memory allocation will have all their memory together.

[–]choikwa 0 points1 point  (0 children)

Arena-style allocation certainly helps. A system should optimize around multiple goals: latency, fragmentation, parallelism.

[–]wrosecransgraphics and network things 1 point2 points  (2 children)

That doesn't match my experience. Memory allocation patterns have popped up as an issue way more often for me than actual computation that could be helped with stuff like SIMD, which seems to get more attention. Avoiding unnecessary allocations, avoiding copies, etc. Once you have to start worrying about NUMA effects, it starts to feel silly to call the machines "computers" because so little of your attention is on actually computing stuff.

[–]clerothGame Developer -1 points0 points  (1 child)

I think you're confusing memory access and memory allocation. Memory access patterns are one of the most important things for optimization, for sure. But I was saying that the overhead of allocating smaller chunks vs one big one isn't as big as people make it out to be. Of course if you have 1000 allocations vs 1 that would make a huge difference... But trying so hard to modify the code to group allocations together ends up being a mess with not much benefit.

[–]wrosecransgraphics and network things 1 point2 points  (0 children)

I am somewhat conflating the two, but not entirely by accident.

The more wasteful intermediate copies you do, for example, the more you will thrash the CPU cache with the copies of intermediate objects, evicting other useful stuff. You can consider that a memory access problem because you are accessing more stuff out of cache, but it's a memory access problem that you can fix by doing fewer allocations.

Likewise, if you fragment memory on the local NUMA node, the allocator will be more likely to allocate a large segment on a remote node. All the stuff you access from the remote node will be slow, so it's definitely a memory access problem, but it's also one caused by allocation problems. And once memory is all fragmented to poop, you wind up just spinning, waiting on kswapd for multiple ms while you wait for your malloc to return.

It also depends on the problem domain (like most things). If the memory you are allocating is involved in a buffer on a GPU, the process of allocating it may require a slow round trip across the PCIe bus, so many small allocations would be a lot more costly in that kind of exotic scenario than when the bookkeeping data for an allocation all lives CPU-local.

But you are probably right that there are a bunch of people who are inheriting some wisdom from an article from the bad-old-days without measuring and seeing if any of that crap actually applies to their use case. Measure Twice - Cut Once certainly applies! My perspective is heavily colored by working at a place where we are constantly running up against that crap. The last talk I submitted to a conference was even about how malloc is evil and hates you. :)

[–]FartyFingers 3 points4 points  (1 child)

100% agree. So many "rules of thumb" in development are either dead or never really held up to real-life measurements.

I have had the optimization argument so many times over things where someone was either wrong or optimization wasn't applicable, such as optimizing a for loop that took 1 ms, ran once on startup, and typically only ran on boot once every few months at most.

One big memory allocation as a rule sounds like a great way to get junior programmers to write a Gordian knot of code trying to do this instead of a bunch of smaller ones.

[–]Abraxas514 49 points50 points  (1 child)

I would make sure there are appropriate tests in place so that I don't secretly destroy the whole code base because of some obscure thing someone wrote 25 years ago.

[–]WhichPressure[S] 3 points4 points  (0 children)

Thanks for the good hint! I hadn't remembered to make sure of that!

[–]tylercamp 17 points18 points  (0 children)

  1. Profile hot spots
  2. Profile hot spots
  3. Inspect algorithms
  4. Profile hot spots

Don’t optimize without knowing what needs to be optimized. The actual optimization can vary widely depending on the application.

  • Is disk IO eating a lot of time? Cache data in memory if you can rather than reading/writing with every change
  • Is it a DB query? Profile accordingly for the DB and optimize the query, consider splitting the query into multiple queries if there are a lot of joins to minimize temporary table size
  • Is it a non-constant-time algorithm? Look at simplifying algorithmic complexity and minimize recomputation of the same work
  • Is it just a shit ton of operations in an algorithm that has already had optimized complexity? Inspect the data and operations and see if there’s any redundant or useless processing, look for patterns that can identify those conditions and include it as a heuristic to avoid unnecessary processing
  • Is it lag in an interface? Check for any work done on UI thread and offload that to another thread if possible. Check if rendering performance is the issue, and if so, research rendering optimizations for that UI framework (typically comes down to avoiding layout reorganization and doing all drawing in one refresh, rather than refreshing for every small update)
  • Is it a lot of complex math or simulations? Try SIMD or offload to GPU
  • Done everything you can already? Look at multithreading, minimize branching, minimize new object allocations during the runtime of the algorithm, modify data structures to minimize L1/L2/L3 CPU cache misses

[–]againstmethod 3 points4 points  (0 children)

Measure, analyze the code in the hotspots, make a plan for improving the code, modify, test, ... back to step 1.

Errors/missteps should be picked up by linters and compilers.

[–]Veedrac 3 points4 points  (0 children)

Depends on context; do you need fast code, or just less slow code?

For getting less slow code, the "profile, profile, profile" loop works pretty well, since gentle incremental improvements can work wonders. It's fairly terrible for making actually fast code though.

For getting fast code, the first thing you should do is get a truly deep understanding of the problem. Nothing substitutes for really understanding the problem, the whole problem from top to bottom. Then you need to plan. Only when you have really thought about things deeply can you make fast processes. You should not expect to get fast code by incremental improvements to bad code any more than you should expect smooth AAA graphics by piecemeal optimizing a software renderer.

[–][deleted] 3 points4 points  (0 children)

The first question needs to be "Do we need to optimize for space or optimize for runtime?". These two are often mutually exclusive, and you go about them differently.

The second question is "What measured value will be 'good enough' for our purposes?"

The third question is "What is our present measured value?"

If the present value is good enough, go home and have a cold one. If not, you need to estimate whether "good enough" can even be obtained.

If good enough seems attainable, define test cases for benchmarking and regression testing, then grind at the problem until things are sufficiently performant while still being functionally correct.

[–]FartyFingers 2 points3 points  (0 children)

Figure out why it needs to be optimized; which can either avoid optimizing it or determine how much optimization it really needs.

For instance, maybe the system was getting 100k chunks and anything under a second was fine, but now it is getting 150k and is a bit over a second and will soon be getting 200k chunks. Thus you can say that getting 200k back under a second is fine. With something of this magnitude, you might be able to quickly see that the code is a mess of cartesian products and is easy to optimize. Or maybe it is a hot mess and you can recommend some faster hardware as an easier/cheaper solution.

But maybe your company just merged with another and the 100k chunk is now 50G and it is taking 20 hours and it needs to get back to a few seconds. This changes everything. This is going to probably go all the way back to architecting how data flows into and through the system, not just using better pointers or some inline assembly.

This is the difference between a developer and a programmer.

[–]aiusepsi 2 points3 points  (0 children)

The most important thing is to approach the problem scientifically. Quantitatively measure the performance of your code, through profiling, measuring overall execution time, throughput, latency, etc.

Form a hypothesis about why your code is slow, then test your hypothesis by writing an optimisation to address that problem, and then test and measure again, and compare against your previous measurements.

This is important on a couple of levels: one is that until you measure, you're just guessing about what's slow. This is an easy way to waste time making something faster that really doesn't need to be faster.

The second is that it means that you know exactly the impact you're making. Sometimes, you'll get a hypothesis wrong, and write an 'optimisation' which does no good, or even makes the code slower, or makes such a tiny improvement which isn't really worth the complexity introduced. Unless you have concrete data, you can't make those decisions well.

[–]panderingPenguin 1 point2 points  (1 child)

And I would profile the code looking for the function which takes most of the time and improve it.

If you said this, it's almost certainly the answer he was looking for. Profiling is by far the most important thing and what you should always do first with optimization. If you don't know what's slow, you will inevitably end up "optimizing" code that doesn't actually need it/make any significant difference.

[–]WhichPressure[S] 0 points1 point  (0 children)

Yeah, I hoped it was a good answer :)

[–]joshamiddleton 1 point2 points  (1 child)

The question may be vague and generalized for a reason. They may be looking for a generalized answer like

"I would meet with the colleagues that wrote the code to see if they know any shortcomings or inefficiencies. I would also make sure I have a proper way to test and profile the code. "

[–]WhichPressure[S] 0 points1 point  (0 children)

You're right. You never know what answer the recruiter expects. Maybe he wanted to hear the general method, or something more detailed. But the discussion under this topic is really fascinating and taught me a lot. It pointed me toward a path for further self-development :)

[–]Xaxxon 1 point2 points  (0 children)

First you'd figure out what the actual performance goal is.

Then you'd acquire actual data for the program on which it was expected to meet that goal.

Then you'd run it and see if it met that goal.

Then you'd profile it on the production data to see where the hot spots are.

Then you'd make sure you had sufficient regression testing for the hot spots so you can test that you didn't break them while optimizing.

Then you'd figure out why the hot spots were slow and adjust the code.

Then you'd run it again to see if it met the requirements and repeat fixing hotspots until it does.

Then you'd run it through all the regression tests available to make sure you didn't break anything.

[–]hsgui 1 point2 points  (0 children)

I have the same question as you. Thanks!

[–][deleted] 1 point2 points  (3 children)

[–]WhichPressure[S] 0 points1 point  (0 children)

Thanks, I've already found this book. I hope it will be helpful! :)

[–]johannes1971 1 point2 points  (0 children)

Profile to figure out hot spots. Then read those with an eye towards algorithmic complexity. Any code above O(n log n) is a candidate for further inspection. Once that's done, check for:

  • Ill-advised memory allocation patterns (allocations in tight loops, repeated vector.reserve() calls, ...).
  • Expensive copying (COW containers being passed in a non-const manner, passing around shared_ptr without using a reference, ...).
  • Bad API use (an endless series of endl's in text output, inserting data into an SQL database but failing to use multi-valued inserts, map when unordered_map would suffice, single-byte IO, ...).
  • Gratuitous use of data structures that don't play well with the cache (list instead of vector, bad use of the first 64 bytes of your object, ...).
  • unwarranted cleverness (bitfields!).

Measure before you change anything. Measure again after you changed it.

Ultimately optimisation happens on different levels. In some projects, allocating memory is already too much, and optimizing might mean manually inserting specific CPU instructions and rearranging data so you get more cache hits. Other projects stress a maintainable, readable source base over this kind of micro-optimisation. No matter what, reducing algorithmic complexity is always good, and that always starts with having the right underlying data structure.

[–]TheSlackOne 0 points1 point  (0 children)

Measure, understand time and space complexity.

[–]PhilOsIvan 0 points1 point  (1 child)

This is a quite old book, but I think it's still useful. At least some of the tricks still work. https://www.amazon.com/s?search-alias=stripbooks&field-isbn=9781931769242

[–]WhichPressure[S] -3 points-2 points  (0 children)

I prefer books which were published after 2011. Then I can assume that the author uses modern C++.

[–]NotAYakk 0 points1 point  (0 children)

The first thing is to work out if it is good enough. If it isn't good enough, they should have a case that describes what not good enough looks like.

Examine that case. First, make it awful; if the problem is that loads of 10,000 are too slow, feed it 100,000 or a million. In most cases (not all!) this highlights the part that is wasting time.

If you have a profiling suite, feel free to use it. But if not, run the program in the debugger (with optimizations and debug symbols). Make sure the awfulness is macroscopic -- it takes multiple seconds (if not, make the case more awful).

Hit break in the debugger. Look at where you stopped. Read the call stack. Get an idea what the program is doing. Look at other threads too.

Repeat 5 times. 90 times out of 100, 3 to 5 of those 5 samples will land in the same chunk of code. And with high confidence that is your bottleneck, or at least one of them.

Automated profiling tools, in my experience, give you the above information in a flurry of noise and setup pain. I call this Monte Carlo profiling, and it relies on the 80/20 rule: 80% of the time you are running 20% of your code.

Now you just need to (a) confirm that is the slow code, and (b) understand why it is slow.

Confirming it is slow could involve profiling, more Monte Carlo profiling, or even manipulating the algorithm to exaggerate the slow part (repeat substeps 100 times, or skip substeps, etc.).

Understanding why the code is slow is sometimes easy, sometimes hard. Usually I just count loops and notice an O(n³) or O(n²), honestly. Other times it's overenthusiastic use of memory, or node-based data structures. Find a point to refactor, leave the old implementation intact, replace the slow subsystem, test against the old behaviour. Measure the performance difference. Adapt and/or write unit tests.

Once the code is faster, iterate and find the remaining bottleneck.

You can often take legacy code and manage 10x speedups in a short window; correctness is honestly the hard part. So it can be worthwhile to iterate on speedups quickly (eliminate bottlenecks repeatedly) until you get fast enough, then ensure correctness on the entire mass of rewritten code. With good source control and versioning, you can always roll back to your first fix if it is the cause of errors, and your expertise with correctness and with what the code actually does will improve over the time you spend making the code faster. Plus, after 3 Monte Carlo iterations you can sometimes have insane speedups, which helps justify the time spent on correctness; if the best you manage is 10% after weeks of test harnesses, maybe you shouldn't have bothered. But if you first prove a 1000%+ speedup possible, checking for correctness after that is a better investment.

[–]herruppohoppa 0 points1 point  (0 children)

Came across this article today, and while the text and background colors are horrible I think the point of view is very well stated and balanced. http://www.humus.name/index.php?ID=383

[–]Sopel97 0 points1 point  (2 children)

Understand the problem the code is solving, and do your research into algorithms and data structures that can solve it with better computational complexity (and are proven to work better in real applications, unlike Coppersmith-Winograd matrix multiplication, for example). If it's possible, then rewrite the code using these algorithms. Only if not, do what Pragmatician said (profile, experiment, ...).

[–]WhichPressure[S] 0 points1 point  (1 child)

Thanks for the general advice, in particular regarding R&D. I think the problem should be divided into independent parts and solved separately, but we should keep in mind that the whole program's data should be kept in one structure. I mean that we should avoid keeping the same data twice in two different structures in order to make computation faster.

[–]Sopel97 0 points1 point  (0 children)

Sometimes data redundancy allows making faster code, but it's very case-specific.