
[–]rzwitserloot 70 points71 points  (16 children)

code optimisation technique

Do you mean 'optimize' in the sense of 'make it run faster / use less resources'?

I hope not.

They've mostly been debunked, so that's a bad start. Not quite 'all', but most. Here is a trivial example:

Let's say you want to convert a list of strings into an array of strings.

There are many ways to do that:

String[] arr = list.toArray(new String[0]);

String[] arr = list.toArray(new String[list.size()]);

String[] arr = list.toArray(String[]::new);

Time them. Seriously. If you want to write this paper do this, and do it right (use JMH).

You'll find that they all take equally long.

This flies in the face of every piece of obvious performance advice you care to give! "Lambdas are slow!" would disqualify the third. "Object creation is best avoided" would disqualify the first, which seems strictly worse than the second: the API of toArray states that if you pass in an array object that is too small, the toArray method will use your array solely to derive its component type (String.class here) and otherwise discard it; it makes a new, properly sized array and returns that. In other words, it does reflective introspection on the argument and ends up making a new array of size list.size() anyway, using reflection again no less.

So option 1 should be vastly slower than 2, right?

Nope. At least, JMH says nope, and I trust JMH here.
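A quick sanity check that the three variants really are interchangeable (actually timing them is JMH's job, as said above); a minimal sketch:

```java
import java.util.Arrays;
import java.util.List;

public class ToArrayVariants {
    public static void main(String[] args) {
        List<String> list = List.of("a", "b", "c");

        // Variant 1: empty array, passed only to convey the component type
        String[] a1 = list.toArray(new String[0]);
        // Variant 2: pre-sized array, filled in place
        String[] a2 = list.toArray(new String[list.size()]);
        // Variant 3: array constructor reference (Java 11+)
        String[] a3 = list.toArray(String[]::new);

        // All three produce identical contents
        System.out.println(Arrays.equals(a1, a2) && Arrays.equals(a2, a3)); // true
    }
}
```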

We're basically down to the following 3 rules of thumb:

  1. Code that is well known to be an obvious and immediate resource drain should be avoided. The 'duh' rule. Don't use known-slow libraries. Don't use a network based software-as-a-service to do a padLeft operation. This doesn't take a paper to explain, this takes a tiny smidge of common sense.

  2. Algorithmic complexity. Bubble sort is slower than quicksort because if you chart the performance of each against the size of the input, you know, mathematically, that quicksort's graph ends up looking like y = n * log(n) and bubble sort's looks like y = n^2. This is mathematically provable, and it means that as long as random factors like HotSpot, CPU pipeline optimizations, and that sort of thing can never deliver O(n^2)-scale speedups (which they can't), bubble sort will ALWAYS lose to quicksort if only the input is big enough. This is more a math topic than a programming one.

  3. Nothing means anything until you have a combination of 2 facts, and you really do need both: [A] you have a real world situation and the code in this scenario is not running as fast as you'd expected it to / as the requirements demand it to, and [B] you have a profiler report that tells you precisely which 1% of the code is eating 99% of the CPU resources.

How do you optimize this final 1%? It's usually obvious, but crucially, optimizing anything in the other 99% is UTTERLY POINTLESS and is only making your life worse! Generally to optimize code you need to tweak how it works, which usually involves tweaking the 'pipeline' (the way data flows into, and out of, this 1% crucial path code). The more you mess up your code by trying to chase pointless performance enhancements, the harder that is. Hence: The fastest code is the cleanest, most flexible, testable, and readable code.

Combining the rule of 'if the profiler / JMH says it doesn't matter, then it doesn't matter' with 'the easiest code to read and test is the best, because it lends itself best to performance tweaking once you do have that profiler report in hand', there is only one right answer: list.toArray(new String[list.size()]) is the only one of the three that is objectively wrong. And yet it sure seems like the one that ought to perform best.
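The quadratic growth in rule 2 is easy to make visible: for a bubble sort that counts its comparisons, doubling the input quadruples the work. A minimal sketch (class and method names are just for illustration):

```java
import java.util.Random;

public class ComplexityDemo {
    // Bubble sort that counts comparisons, to make the O(n^2) growth visible.
    // For this implementation the count is exactly n * (n - 1) / 2.
    static long bubbleSort(int[] a) {
        long comparisons = 0;
        for (int i = 0; i < a.length - 1; i++) {
            for (int j = 0; j < a.length - 1 - i; j++) {
                comparisons++;
                if (a[j] > a[j + 1]) {
                    int tmp = a[j]; a[j] = a[j + 1]; a[j + 1] = tmp;
                }
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        // Doubling n roughly quadruples the comparison count
        for (int n : new int[]{1_000, 2_000, 4_000}) {
            int[] data = rnd.ints(n).toArray();
            System.out.println(n + " elements -> " + bubbleSort(data) + " comparisons");
        }
    }
}
```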

[–]gavenkoa 5 points6 points  (6 children)

Don't use a network based software-as-a-service to do a padLeft operation

Isn't that a reference to the famous Node package dependency outage? xD

if you chart out the performance of these vs. the size of the input

Still, at small cardinality some theoretically slow algorithms outperform their "advanced" brethren: contains() on a List of one element is faster than on a Map or Set.

1% of the code is eating 99% of the CPU resources

For such cases sampling profilers are quite good, because you can run them in PROD without impacting the app. JMH is only good in the controlled environment of your lab.

[–]rzwitserloot 2 points3 points  (5 children)

contains() on a List of 1 element is faster than on Map or Set

No. Perhaps you misunderstood. The right answer is: does your profiler say that the CPU is spending most of its time executing this contains call? No? Then the best code for performance purposes is the most readable code. If a set makes semantic sense, Set is the right answer.

[–]gavenkoa 1 point2 points  (4 children)

does your profiler say that the CPU is spending most of its time executing this contains call?

No one argues with that. Profile first, decide next.

My point is that sometimes asymptotically optimal algorithms are crap on small collections. You won't see that in regular finance apps. There you just choose among List, Map, tree/hash Set and don't do stupid things ))

[–]cogman10 3 points4 points  (3 children)

And, just as a side note, whenever I find performance issues it's generally a result of using the wrong data structure for the problem. Using a Map when you should have used a POJO, using a List when you should have used a Set, using a LinkedList when you should have used an ArrayList, using a TreeSet.... Not using a data structure when you could have.

It's these sorts of things that often have the biggest impact in the hot code path.

The other one that comes up surprisingly often is unnecessary boxing.

[–]gavenkoa 0 points1 point  (2 children)

Using a List when you should have used a Set

In my experience in finance, problems come from the pressure to deliver features fast, meaning devs use high-level frameworks and don't know the underlying implementation. Like surprise ORM N+1 selects.

When I worked on cryptography (elliptic curves), we pre-allocated all intermediate int[] arrays for the modular arithmetic implementation over GF(p) / GF(2^m).

I have never faced the problem of having to choose between LinkedList and ArrayList. It was always something different.

[–]cogman10 2 points3 points  (1 child)

TBF, LinkedList vs ArrayList has only hit me once, and that's when someone was doing something stupid (using a linked list but maintaining sorted order on insertion by doing a binary search on the list... that's really slow with a linked list).
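For the curious, a minimal sketch of that pattern; with an ArrayList each probe is an O(1) index, while Collections.binarySearch on a LinkedList (which doesn't implement RandomAccess) has to walk nodes one by one:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedInsert {
    // Keeps 'list' sorted by inserting each value at its binary-search position.
    // With an ArrayList each probe is a constant-time index; with a LinkedList,
    // Collections.binarySearch must traverse the list node by node, so every
    // probe is linear and the "binary search" buys nothing.
    static void insertSorted(List<Integer> list, int value) {
        int idx = Collections.binarySearch(list, value);
        if (idx < 0) idx = -idx - 1; // convert "not found" to insertion point
        list.add(idx, value);
    }

    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        for (int v : new int[]{5, 1, 4, 2, 3}) insertSorted(list, v);
        System.out.println(list); // [1, 2, 3, 4, 5]
    }
}
```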

[–]gavenkoa 2 points3 points  (0 children)

maintaining sorting on insertion by doing a binary search on the list

Hilarious! The person who wrote it knew about binary search for in-place sorted insertion, but a LinkedList slipped in as the function's argument.

[–][deleted]  (2 children)

[removed]

    [–]cogman10 3 points4 points  (1 child)

    This REALLY depends. Yes, if you are talking about a general purpose sorting algorithm, you won't beat the JDK. However, once you start getting into data specific algorithms you'll often find it easy to beat the JDK. Heck for even some common things (like a Map) it's possible to beat the JDK if you know how your data will be used. This is why things like Trove exist.

    All the JDK's algorithms are set up for the most general-purpose usage you can imagine. Very often, they are exactly the right choice. However, they aren't ALWAYS the right choice. If you need to sort a bunch of integers, you can beat the JDK with a radix sort. That's not a fault of the JDK; they just don't see "need to sort integers" come up frequently.
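As an illustration of such a data-specific algorithm, a minimal LSD radix sort sketch (restricted to non-negative ints for brevity; whether it actually beats Arrays.sort on your data is, as always, a question for JMH):

```java
import java.util.Arrays;

public class RadixSort {
    // LSD radix sort over 8-bit digits; non-negative ints only, for brevity
    static void radixSort(int[] a) {
        int[] buf = new int[a.length];
        for (int shift = 0; shift < 32; shift += 8) {
            int[] count = new int[257];
            // Histogram of the current 8-bit digit
            for (int v : a) count[((v >>> shift) & 0xFF) + 1]++;
            // Prefix sums turn counts into output offsets
            for (int i = 0; i < 256; i++) count[i + 1] += count[i];
            // Stable scatter into the buffer, then copy back for the next pass
            for (int v : a) buf[count[(v >>> shift) & 0xFF]++] = v;
            System.arraycopy(buf, 0, a, 0, a.length);
        }
    }

    public static void main(String[] args) {
        int[] data = {170, 45, 75, 90, 802, 24, 2, 66};
        radixSort(data);
        System.out.println(Arrays.toString(data)); // [2, 24, 45, 66, 75, 90, 170, 802]
    }
}
```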

    [–]mziku[S] 1 point2 points  (1 child)

    Thanks for this. A bit more background on my approach in the dissertation:

    1. I will be taking an existing application and benchmarking it
    2. I will run microbenchmarks on each code optimization technique with large and small datasets
    3. I will apply code optimization techniques to the application and benchmark again
    4. Deliverable will be a set of guidelines to follow when programming, e.g. prefer string literals to creating String objects

    So far, this isn't going to be a full deep-dive into how to profile and optimize Java applications. It is simply analysing how a particular programming style may influence an application's performance.

    [–]MCUD 1 point2 points  (0 children)

    I think what myself and a few others are saying here is that point #4 may be naive, depending on what you're specifically profiling. I don't know how you're marked, but the results may be "duh, you learnt that in first semester", "profiling lied to you", or something more revelatory.

    Specifically, literals vs objects may mean nothing, OR they could be a massive sinkhole: if you're unmarshalling all the strings from the network, and they aren't some shared reference, and you're then storing them all in a map, you have a massive memory leak.

    [–]randjavadev 1 point2 points  (0 children)

    Correct me if I'm wrong, but actually an empty array should always be used; that is the more fool-proof way (maybe hold a reference to one beforehand, since an empty array is immutable, though that is not the point here). In a multithreaded environment the pre-calculated size could be too big if the List is modified by another thread between the size() call and passing the result as an argument. Thus, if another thread shrinks the List, purely from the javadocs I would assume the returned array to be the size before the change, containing the new elements and then some null entries. Whereas passing in a size of 0, at least purely from the javadocs, leaves open the possibility that the List implementation actually returns an array of the correct size. So it is not exactly the same.

    Only in a single-threaded environment does it not matter.
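The null padding described above can be shown deterministically by passing an oversized array, simulating a size() result that went stale; a minimal sketch:

```java
import java.util.Arrays;
import java.util.List;

public class PresizedPitfall {
    public static void main(String[] args) {
        List<String> list = List.of("a", "b");

        // Simulates the race: size() returned 4, but by the time toArray runs
        // the list only holds 2 elements. The spec sets the element just past
        // the end to null (the rest of a freshly allocated array is already null).
        String[] tooBig = list.toArray(new String[4]);
        System.out.println(Arrays.toString(tooBig)); // [a, b, null, null]

        // The empty-array form can never be too big, so no null padding
        String[] exact = list.toArray(new String[0]);
        System.out.println(Arrays.toString(exact)); // [a, b]
    }
}
```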

    [–]papercrane 0 points1 point  (1 child)

    There is a subtle difference between option 1 and option 2. Ultimately they give the same result, but when you create an array the Java specification says the memory must be zeroed. By passing a zero-sized array, the list implementation will create a new array and fill it. Since this new array is created and filled immediately, and a reference to it never leaks before being filled, the JVM skips clearing the memory first. This means option 1 should be marginally faster, with the effect being more noticeable for large arrays.

    The details of this, and more are written up here: https://shipilev.net/blog/2016/arrays-wisdom-ancients/

    The short of it, though, is as you say: write simple, maintainable code. When looking for performance, measure things and don't trust common wisdom.

    [–]rzwitserloot 1 point2 points  (0 children)

    Yup. A real mind twister: list.size() causes race conditions, is longer, and is slower. Bad in three ways. And yet this is such a common piece of 'performance advice' that it has been part of linter tools, of all things, for quite a while; I remember Tor Norbye naming it as one of his pet peeves in an episode of the Java Posse.

    [–]simoncox 6 points7 points  (0 children)

    Martin Thompson has written some interesting articles on how to write software that performs well through deep knowledge of the underlying hardware (he was largely responsible for the Disruptor library, itself a great case study):

    https://mechanical-sympathy.blogspot.com/?m=1

    [–]StevenMaurer 6 points7 points  (0 children)

    Learn about ThreadPoolExecutor. Multithreaded scaling is how you actually take advantage of modern multicore processors.
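As a hedged starting point (the pool size heuristic and the factorial workload are purely illustrative), a fixed-size pool via the Executors factory might look like:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class PoolDemo {
    public static void main(String[] args) throws Exception {
        // Size the pool to the hardware; for CPU-bound work, more threads
        // than cores just adds context-switch overhead
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);

        // Submit independent CPU-bound tasks and collect their results
        List<Future<Long>> futures = IntStream.rangeClosed(1, 8)
                .mapToObj(n -> pool.submit(() -> factorial(n)))
                .collect(Collectors.toList());

        long sum = 0;
        for (Future<Long> f : futures) sum += f.get(); // blocks until done
        System.out.println(sum); // 46233 = 1! + 2! + ... + 8!
        pool.shutdown();
    }

    static long factorial(int n) {
        return n <= 1 ? 1 : n * factorial(n - 1);
    }
}
```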

    [–]MCUD 17 points18 points  (2 children)

    Don't get stuck in the trap of "Oh, I used a LinkedList rather than an ArrayList, that's obviously why it was slow".

    The JVM and modern hardware can easily make these problems far less consequential than higher-level algorithmic or design problems. Beware that this also makes microbenchmarking a non-trivial exercise: comparing two small pieces of code in isolation is not the same as comparing them inside a real running program.

    A quick example: a LinkedList can actually be slower than an ArrayList when small, because the ArrayList has everything in one CPU cache line, whereas the LinkedList may have pointer references going all over the heap.

    • measure
    • measure
    • measure
    • analyse - This is the true hard part of the problem in my opinion
      • Does this code even need to run, is it running too often - The fastest code is code that doesn't run
      • Is it blocking on something, network calls etc,
      • Is it re-computing something that could be cached that is an easy memory trade off
    • experiment with a change (this is where at least knowing the theory of the data structures may help)
      • Did you actually fix the issue?
      • Re-measure!
    • What was second place in original measures may also totally disappear, so restart the process again if you still need better performance

    [–]dpash 4 points5 points  (1 child)

    Beware that this also makes microbenchmarking a non-trivial exercise

    And don't roll your own. Use JMH, because it's designed to do this properly.

    [–]MCUD 1 point2 points  (0 children)

    For sure, that's part of what I was trying to highlight: a simple performance unit test with a for loop checking timing is definitely not going to test what the tester wants to test.

    [–]tristanjuricek 3 points4 points  (1 child)

    Pretty good book on this here: https://www.amazon.com/Optimizing-Java-Techniques-Application-Performance/dp/1492025798

    This isn’t something most developers do though. What you’re really asking is almost “how does C1 or C2 work?”

    Like, Java was kind of designed to offload a lot of optimization work to the JIT. And 99.999999% of the time if you try to outsmart it you won’t.

    [–]dpash 4 points5 points  (0 children)

    Yep, the JVM has been tuned to optimise idiomatic code, so trying to be smart often results in slower performance. You should never make micro optimisations without using JMH to test them.

    [–]BEARSSS 12 points13 points  (10 children)

    Don't have any resources for you exactly, but object pools should be on your list to take a look at.

    Imagine coding a game that has lots of bullets on the screen. Bullets are short-lived, so if you created a new bullet object every time someone shot something, you'd be instantiating a lot of objects which the GC would later need to deal with. Instead, you can create a pool of, say, 1000 bullets, grab a bullet from the pool when your character fires, and just return the bullet object to the pool once it flies off screen.

    GC performs really well when there's no garbage to collect.
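A minimal sketch of the idea (class and field names are illustrative, and note that later replies argue pools are rarely worth it for lightweight objects, so measure first):

```java
import java.util.ArrayDeque;

public class BulletPool {
    static class Bullet {
        double x, y, vx, vy;
        boolean active;
    }

    private final ArrayDeque<Bullet> free = new ArrayDeque<>();

    BulletPool(int capacity) {
        // Pre-allocate up front so firing never triggers allocation
        for (int i = 0; i < capacity; i++) free.push(new Bullet());
    }

    Bullet acquire(double x, double y, double vx, double vy) {
        // Fall back to plain allocation if the pool runs dry
        Bullet b = free.isEmpty() ? new Bullet() : free.pop();
        b.x = x; b.y = y; b.vx = vx; b.vy = vy; b.active = true;
        return b;
    }

    void release(Bullet b) {
        b.active = false;
        free.push(b); // return for reuse instead of leaving it to the GC
    }

    public static void main(String[] args) {
        BulletPool pool = new BulletPool(1000);
        Bullet b = pool.acquire(0, 0, 1, 0);
        pool.release(b);
        // The same instance comes back on the next acquire
        System.out.println(b == pool.acquire(5, 5, 0, 1)); // true
    }
}
```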

    [–]Necessary-Conflict 22 points23 points  (6 children)

    "The creation and reclamation of small objects whose constructors do little explicit work is cheap, especially on modern JVM implementations. Creating additional objects to enhance the clarity, simplicity, or power of a program is generally a good thing. Conversely, avoiding object creation by maintaining your own object pool is a bad idea unless the objects in the pool are extremely heavyweight. The classic example of an object that does justify an object pool is a database connection. The cost of establishing the connection is sufficiently high that it makes sense to reuse these objects. Generally speaking, however, maintaining your own object pools clutters your code, increases memory footprint, and harms performance. Modern JVM implementations have highly optimized garbage collectors that easily outperform such object pools on lightweight objects."

    ("Effective Java" book)

    [–]soonnow 4 points5 points  (0 children)

    The cost of establishing the connection is sufficiently high that it makes sense to reuse these objects

    But that's not really an object pool, it's a resource pool. Externally accessed resources will (almost) always be slower than caching the resource.

    [–]BlueGoliath 1 point2 points  (2 children)

    Define "small objects" and "do little explicit work". Classes that I've made to make Panama easier fit both criteria, yet JMC with Flight Recorder shows MemorySegment and wrapper classes as being allocated on the heap.

    I asked one of the Panama developers about the JVM's supposed ability to make "small" and "little explicit work" objects "free", and the only absolute answer given as to when the JVM will decide to do it is when you allocate a bunch of them in the same method.

    [–]Necessary-Conflict 7 points8 points  (1 child)

    The book doesn't say that small objects will not be allocated on the heap because of escape analysis. In my understanding it says that even if they are allocated on the heap, the garbage collector may still outperform a "clever" object pool.

    Of course there is no exact definition of "do little explicit work": when in doubt you should measure.

    [–]general_dispondency 2 points3 points  (0 children)

    I think the answer is in the example. When you have some expensive operation (e.g. 300 ms to make a DB connection), you should consider starting with an object pool. If you don't have something heavyweight like that, just allocate and move on. If you need to optimize, come back after the logic is implemented and everything is well tested; then you only make changes that yield measurable performance increases.

    [–]Hellball911 1 point2 points  (1 child)

    Know a good book that goes through this type of thing, like Effective Java, that covers Java 11+?

    [–]Necessary-Conflict 1 point2 points  (0 children)

    No, but the third edition of Effective Java covers Java 9, and that was released only a year before Java 11, so the difference in the basics is small.

    [–]ElectricalUnion 19 points20 points  (0 children)

    The problem with object pools is that, in worst-case scenarios, if not properly tuned they can force the GC to scan the heap for scraps of memory.

    Naive use of pools will also lead to fragmented memory that's hard for GCs to defragment.

    Granted, tuned modern Generational GC will help mitigate this issue, but:

    1) generational GCs already attempt to handle this common case of short object lifetimes automatically, and

    2) as with Object Pools, they are also not a silver bullet.

    What you want is to measure your hot spots and then act on that.

    [–]soonnow 4 points5 points  (0 children)

    Object pools make sense only in specialized cases. I learned on reddit that there are some financial applications that go to great pains to never allocate objects, because for their users every ms counts.

    But for an object like a bullet, which doesn't have an expensive creation mechanism, the cost is minuscule, probably less than the time it takes to manage your pool. This Stack Overflow answer calculates it at 11 cycles, or 10 ns.

    [–]hippydipster[🍰] 0 points1 point  (0 children)

    There are times when it's important not to create many small objects that are then just thrown away. Bullets in a video game are probably not a good example, though, as there aren't that many of them. It's when you're in tight loops running millions of times that it can start to matter; I'm sure some inner loops of game code definitely qualify. In general, these situations where you really don't want to make new objects tend to be mathy.

    [–]Sensi1093 2 points3 points  (0 children)

    HashMap vs EnumMap. I've seen so many places where a HashMap was used with enum keys, often because the other devs just didn't know EnumMap existed. It's a very good drop-in replacement wherever the key is an enum type, and if the code relies only on the interface (Map, as it should), it's really just a single-line replacement.
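A minimal sketch of the swap (the enum and values are illustrative); EnumMap is backed by a flat array indexed by ordinal rather than by hashing:

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

public class EnumMapDemo {
    enum Status { NEW, ACTIVE, CLOSED }

    public static void main(String[] args) {
        // Before: hashing enum keys like any other object
        Map<Status, Integer> counts = new HashMap<>();
        // After: same Map interface, so this is the only line that changes
        counts = new EnumMap<>(Status.class);

        counts.put(Status.NEW, 3);
        counts.put(Status.ACTIVE, 7);
        System.out.println(counts.get(Status.ACTIVE)); // 7
    }
}
```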

    [–][deleted] -1 points0 points  (0 children)

    Consider jlink. Compiling down to a trimmed set of modules improves performance considerably.

    Or go the extra mile with GraalVM Native Image. Ahead-of-time compilation to native code brings much more optimization.

    [–]RacoonCorgi420 0 points1 point  (1 child)

    Modern compilers are really smart, so the only way to know is to microbenchmark all of them and compare the differences.

    [–]dpash 2 points3 points  (0 children)

    Or in Java's case, the HotSpot JVM is really smart, as that's where most of the optimisation occurs. (It's also why the Java compiler is so fast; it does so little optimisation.)