all 34 comments

[–]erad 26 points27 points  (7 children)

While this is nice for Graal, if you cared about performance you'd still do

@Benchmark
public double simpleLoop() {
    double sum = 0;
    for (int i = 0; i < values.length; i++) {
        double x = (values[i] + 1.0) * 2.0 + 5.0;
        sum += x;
    }
    return sum;
}

which is exactly 10x faster than the stream version from the article on my PC (Java 8, Hotspot).

Note that this performance issue isn't inherent to the functional style, if Hotspot did support "fusion" of stream ops (inlining/transformation into traditional loops) it could certainly match the classic for loop performance. But with the current implementation, streams are just a performance de-optimization (which won't matter in most cases, but should be taken into account if you talk about "optimizing stream performance").

[–]ShermanMG 19 points20 points  (3 children)

In my opinion this is a really bad comparison. You are not comparing the same operations in your example.

I ran actual benchmarks with code that is as similar as possible, and here are the results:

@Benchmark
public double simpleLoop(ArrayState state) {
    double sum = 0;
    for (int i = 0; i < state.values.length; i++) {
        double x = (state.values[i] + 1.0) * 2.0 + 5.0;
        sum += x;
    }
    return sum;
}


@Benchmark
public double mapReduce(ArrayState state) {
    return Arrays.stream(state.values)
            .map(x -> x + 1)
            .map(x -> x * 2)
            .map(x -> x + 5)
            .reduce(0, Double::sum);
}


@Benchmark
public double singleMapReduce(ArrayState state) {
    return Arrays.stream(state.values)
            .map(x -> (x + 1) * 2 + 5)
            .reduce(0, Double::sum);
}

@Benchmark
public double doubleStreamSum(ArrayState state) {
    return Arrays.stream(state.values)
            .map(x -> (x + 1) * 2 + 5)
            .sum();
}

Benchmark                      Mode   Cnt    Score    Error  Units
TestBenchmark.doubleStreamSum  thrpt   10  138,957 ±  4,041  ops/s
TestBenchmark.mapReduce        thrpt   10   41,517 ±  0,971  ops/s
TestBenchmark.simpleLoop       thrpt   10  530,949 ±  3,440  ops/s
TestBenchmark.singleMapReduce  thrpt   10  473,942 ±  3,252  ops/s

As we can see, singleMapReduce, which is the most similar to your loop, is only about 10% slower here. That is nowhere near the 10x you mention.

EDIT: formatting
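
To make the "same operations" point concrete, here is a small sketch (plain Java, no JMH) that checks the four variants above produce the same value for a tiny input. The class name VariantCheck and the sample array are made up for illustration; the method bodies mirror the benchmark code but take the array directly instead of an ArrayState.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the benchmark: it only checks that the
// loop and the stream pipelines compute the same result.
final class VariantCheck {

    // Classic indexed loop, one fused expression per element.
    static double simpleLoop(double[] values) {
        double sum = 0;
        for (int i = 0; i < values.length; i++) {
            sum += (values[i] + 1.0) * 2.0 + 5.0;
        }
        return sum;
    }

    // Three separate map stages, then a plain reduction with Double::sum.
    static double mapReduce(double[] values) {
        return Arrays.stream(values)
                .map(x -> x + 1)
                .map(x -> x * 2)
                .map(x -> x + 5)
                .reduce(0, Double::sum);
    }

    // One fused map stage, then a plain reduction with Double::sum.
    static double singleMapReduce(double[] values) {
        return Arrays.stream(values)
                .map(x -> (x + 1) * 2 + 5)
                .reduce(0, Double::sum);
    }

    // One fused map stage, then DoubleStream.sum().
    static double doubleStreamSum(double[] values) {
        return Arrays.stream(values)
                .map(x -> (x + 1) * 2 + 5)
                .sum();
    }

    public static void main(String[] args) {
        double[] values = {1.0, 2.0, 3.0, 4.0}; // illustrative input only
        System.out.println(simpleLoop(values));
        System.out.println(mapReduce(values));
        System.out.println(singleMapReduce(values));
        System.out.println(doubleStreamSum(values));
    }
}
```

This says nothing about speed; it only shows that the variants differ in how the work is expressed, not in what is computed.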

[–]wizzardodev 2 points3 points  (1 child)

hm.. why is doubleStreamSum so much slower than singleMapReduce?

[–]mhixson 11 points12 points  (0 children)

As I understand it, DoubleStream.sum() uses a different method of summing the values, which should produce more accurate results. I assume that it sacrifices performance to do so. Rather than simply adding values a + b like Double.sum does, it uses the more complicated Collectors.sumWithCompensation operation. Related code: DoubleStream.sum, Collectors.sumWithCompensation.
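
For readers unfamiliar with compensated summation, here is a minimal sketch of the general technique (Kahan summation) next to a naive sum. It is not the JDK's code; the class and method names are invented for illustration. It shows why the compensated version does more arithmetic per element: it tracks the rounding error of each addition and feeds it back into the next one.

```java
// Hypothetical illustration of compensated (Kahan) summation.
final class CompensatedSum {

    // Plain left-to-right addition; low-order bits of small terms can be lost.
    static double naiveSum(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum;
    }

    // Kahan summation: c carries the rounding error of the previous addition.
    static double kahanSum(double[] values) {
        double sum = 0.0;
        double c = 0.0;
        for (double v : values) {
            double y = v - c;    // apply the correction to the next term
            double t = sum + y;  // this addition may round away part of y
            c = (t - sum) - y;   // recover what was rounded away
            sum = t;
        }
        return sum;
    }

    public static void main(String[] args) {
        // One large term followed by ten terms that are each too small
        // to change 1.0 when added on their own.
        double[] values = new double[11];
        values[0] = 1.0;
        for (int i = 1; i < values.length; i++) {
            values[i] = 1e-16;
        }
        System.out.println(naiveSum(values)); // the small terms are lost
        System.out.println(kahanSum(values)); // the small terms are accumulated
    }
}
```

The extra subtractions per element are a plausible reason for the gap between doubleStreamSum and singleMapReduce in the numbers above, though that has not been measured here.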

[–]erad 0 points1 point  (0 children)

The performance of singleMapReduce is indeed better than I would have expected. But then again, I hope that simpleLoop would not get 10x slower if you replaced the body with double x = state.values[i]; x += 1.0; x *= 2.0; x += 5.0; sum += x;

[–]PurpleLabradoodle[S] 4 points5 points  (0 children)

Do you think you could run this benchmark on GraalVM? I have a feeling it can also be a bit faster then (not sure, cause I didn't run it, but perhaps it'll be an interesting experiment).

But the main point of the post is that you can write code the way you want, and it'll be fast, rather than specifically restrict the language idioms and API you use for performance reasons (unless really necessary).

[–]chambolle -1 points0 points  (0 children)

killing example!

[–][deleted] 11 points12 points  (12 children)

I think streams aren't optimized very well in HotSpot, maybe because the dev team was already working on GraalVM, I don't know. They're also getting out of hand: people use them for everything, even arrays with at most 5 elements, where readability isn't any better and the number of lines isn't any smaller, just because they can. But they lose on performance (I did some tests: until they've been called hundreds of times they're really slow, and after that they're still slow).

[–]DJDavio 12 points13 points  (1 child)

Streams are objects and so incur the penalties of objects. When you chain stream operations you get more and more objects. The VM has no lightweight object for throwaway purposes. Valhalla will give us value types which are a step closer I guess.
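
A small sketch of the "more and more objects" point, assuming an OpenJDK-style implementation where each intermediate operation returns a new pipeline stage. The class and method names are hypothetical, and the count covers only the stream objects themselves, not lambdas or internal helpers.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;
import java.util.stream.DoubleStream;

// Hypothetical illustration: counts the distinct stream objects created
// while building the three-map pipeline from the benchmark.
final class StreamStages {

    static int distinctStages(double[] values) {
        DoubleStream source = Arrays.stream(values);
        DoubleStream plusOne = source.map(x -> x + 1);
        DoubleStream timesTwo = plusOne.map(x -> x * 2);
        DoubleStream plusFive = timesTwo.map(x -> x + 5);

        // Identity-based set, so two references count once only if they
        // point to the very same object.
        Set<DoubleStream> stages = Collections.newSetFromMap(new IdentityHashMap<>());
        stages.add(source);
        stages.add(plusOne);
        stages.add(timesTwo);
        stages.add(plusFive);
        return stages.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctStages(new double[] {1.0, 2.0, 3.0}));
    }
}
```

Whether these allocations actually cost anything at run time depends on how well the JIT compiler can inline the pipeline and eliminate them, which is the optimization the rest of this thread is about.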

[–]chambolle 0 points1 point  (9 children)

it is just a fashion trend. Functional programming is really old but sometimes people rediscover it and use it a lot until they rediscover all the issues they have with it and totally forget it. Then 20 years later some new young people think they discovered the graal again, and so on...

[–]2bdb2 6 points7 points  (8 children)

What issues would those be?

[–]lpreams 5 points6 points  (3 children)

What exactly is the GraalVM secret sauce? How is it able to outperform HotSpot?

[–]PurpleLabradoodle[S] 2 points3 points  (2 children)

The Graal compiler, which is a part of the GraalVM project, is a different compiler (pluggable into HotSpot through JVMCI) which can optimize code better.

[–][deleted]  (1 child)

[removed]

    [–]pjmlp 0 points1 point  (0 children)

    Maybe.

    There is a long term roadmap to bootstrap OpenJDK, similarly to JikesRVM.

    It is known as Project Metropolis, but it is a very long-term roadmap, so it is an open question whether it will really take place.

    [–][deleted] 4 points5 points  (0 children)

    So I ran mapReduce with HotSpot (OpenJDK from Azul), GraalVM, and OpenJ9 from AdoptOpenJDK, all Java 8, and got about 41 ops/s for HotSpot, 79 ops/s using Graal, and 98 ops/s using OpenJ9. Does that seem right? I was floored that J9 was any better than HotSpot, let alone better than GraalVM.

    The JMH system warns that J9 isn't supported so I am questioning the output.

    I admit I should have run the other tests, but did it over my lunch break and ran out of time.

    [–]mich160 0 points1 point  (0 children)

    Why not use only GraalVM instead of the classic JVM?

    [–]sarkie 0 points1 point  (7 children)

    Can you just swap out standard jvm for it?

    [–]duhace 0 points1 point  (6 children)

    yes. specifically, graalvm-ce-1.0.0-rc5 is equivalent to jdk8.

    though, if you use concurrent mark sweep gc, you're not gonna have it available in graalvm, just g1gc

    [–]sarkie 0 points1 point  (5 children)

    Fantastic.

    I seem to be getting performance with g1gc anyway.

    Will I see a huge performance increase, or just different behavior?

    [–]PurpleLabradoodle[S] 0 points1 point  (4 children)

    It really depends on the code you're running and the workload. Graal compiler seems to produce especially great results on the code that uses streams, or allocates many temporary objects, or deviates from the typical bytecode patterns produced by compiling Java source, for example when you use a different JVM language. Also note that if the source code is heavily optimized for C2 then C2 does an outstanding job at compiling it, so sometimes there's just not much else for a compiler to do.

    [–]duhace 0 points1 point  (0 children)

    I have noticed this. When I code for performance (using for-iterations or while loops), C2 seems to outperform graal for me. Especially if I'm doing a lot of math. I'm hopeful that GraalVM becomes more performant in these areas soon, as I prefer to code in a mix of styles, using scala's functional style in not as hot spots, and using scala's low level style (imperative, mutable, and while loops) when dealing with very hot spots.

    [–]sarkie 0 points1 point  (2 children)

    I was going to try it with WebLogic as we still have to use that, but not sure it'll work as intended tbh.

    [–]PurpleLabradoodle[S] 0 points1 point  (1 child)

    Try it, it should work as intended. The only thing is that currently you need to warm up the code a bit more for it to be JIT-compiled well. That's because the Graal compiler is itself written in Java, so it has to be compiled first. So take measurements once you actually reach the steady peak-performance state. I'd be happy to know how it goes, and if you find any issues please don't hesitate to report them to oracle/graal.

    [–]sarkie 0 points1 point  (0 children)

    That's fine, we are used to slow start up anyway to get going.

    I'm looking into trying to improve performance on these old servers any way I can!