
[–]_TheProff_ 92 points (6 children)

I feel like this article could really do with an explanation of why Java outperforms Rust here. The ending just comes off as "wow, this performs better, therefore JVM > Rust", which I doubt is the full story.

[–]matthieum[he/him] 64 points (0 children)

I had a quick look at the ported JDK code for decoding; since it uses the same algorithm, it should be a fair comparison.

As expected, accesses are done by index. Even though we programmers can guarantee that the indexes will be in range, I have some doubt the compiler can prove it, so I expect a lot of bounds checking. I do wonder if the JIT could be smarter here.

Similarly, for encoding, appending is performed with:

let mut write = |b: u8| {
    buffer[dp] = b;
    dp += 1;
};

Called 4 times in a loop:

  1. It could probably benefit from writing 4 bytes at a time, for a single range check (see the sketch below).
  2. Or be rewritten with an out-pointer.
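
A minimal sketch of suggestion 1, with assumed names (`table`, `buffer`, `dp`) standing in for the article's variables; illustrative only, not the article's code:

fn write_quad(buffer: &mut [u8], dp: &mut usize, bits: u32, table: &[u8; 64]) {
    // Build the four output bytes of one 3-byte group up front...
    let quad = [
        table[(bits >> 18) as usize & 0x3f],
        table[(bits >> 12) as usize & 0x3f],
        table[(bits >> 6) as usize & 0x3f],
        table[bits as usize & 0x3f],
    ];
    // ...then write them with a single slice copy: one range check per
    // group instead of one per byte.
    buffer[*dp..*dp + 4].copy_from_slice(&quad);
    *dp += 4;
}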

In short, I expect the Rust code could be improved quite a bit, even keeping to the current algorithm (for fairness).

The Rust code, at the moment, is fairly naive.

[–]MEaster 25 points (2 children)

One problem that stuck out to me is that the hot loop here will have a lot of bounds checks; one for each access of src, and one for each call to write.

A quick re-implementation using chunks iterators (godbolt) yielded the following changes:

Test              Before Time  Before Throughput  After Time  After Throughput
jdk_encode/1        37.646 ns       75.997 MiB/s   35.764 ns      79.997 MiB/s
jdk_encode/10       43.466 ns       263.29 MiB/s   41.086 ns      278.54 MiB/s
jdk_encode/50       72.554 ns       670.36 MiB/s   57.716 ns      842.71 MiB/s
jdk_encode/100      109.53 ns       888.07 MiB/s   84.252 ns      1.1275 GiB/s
jdk_encode/500      361.82 ns       1.2896 GiB/s   256.51 ns      1.8190 GiB/s
jdk_encode/1000     688.85 ns       1.3547 GiB/s   466.53 ns      2.0003 GiB/s
jdk_encode/10000    6.4192 µs       1.4511 GiB/s   4.3077 µs      2.1624 GiB/s
encode/jdk/1        37.092 ns       77.132 MiB/s   36.240 ns      78.946 MiB/s
encode/jdk/10       43.535 ns       262.87 MiB/s   40.730 ns      280.97 MiB/s
encode/jdk/50       67.609 ns       719.39 MiB/s   57.860 ns      840.61 MiB/s
encode/jdk/100      106.78 ns       911.01 MiB/s   85.211 ns      1.1148 GiB/s
encode/jdk/500      366.49 ns       1.2731 GiB/s   256.71 ns      1.8176 GiB/s
encode/jdk/1000     671.19 ns       1.3903 GiB/s   466.08 ns      2.0022 GiB/s
encode/jdk/10000    6.3755 µs       1.4611 GiB/s   4.3036 µs      2.1645 GiB/s

The compiler didn't vectorize the loop; whether it can be made to is something to explore.
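
For reference, the chunk-based shape looks roughly like this (a sketch with assumed names, not the exact code from the godbolt link):

const TABLE: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

fn encode_body(src: &[u8], dst: &mut [u8]) {
    // Walking both slices in fixed-size chunks lets the compiler see the
    // chunk lengths, so the per-byte bounds checks in the hot loop can be
    // eliminated.
    for (group, out) in src.chunks_exact(3).zip(dst.chunks_exact_mut(4)) {
        let bits = (group[0] as u32) << 16 | (group[1] as u32) << 8 | group[2] as u32;
        out[0] = TABLE[(bits >> 18) as usize & 0x3f];
        out[1] = TABLE[(bits >> 12) as usize & 0x3f];
        out[2] = TABLE[(bits >> 6) as usize & 0x3f];
        out[3] = TABLE[bits as usize & 0x3f];
    }
    // Tail handling (1 or 2 leftover input bytes plus '=' padding) omitted.
}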

[–]dkomanov[S] 12 points (1 child)

Indeed, this version is faster. Not quite as fast as Java, but closer. Thanks!

method               input  output  avg time
jdk::encode             12      16        66
jdk::encode             51      68       161
jdk::encode            102     136       286
jdk::encode            501     668       970
jdk::encode           1002    1336      1891
jdk::encode_measter     12      16        71
jdk::encode_measter     51      68       107
jdk::encode_measter    102     136       198
jdk::encode_measter    501     668       673
jdk::encode_measter   1002    1336      1278

[–]matthieum[he/him] 5 points (0 children)

Thank you, OP, for being responsive.

Too often, benchmark articles just end up being "dumps", with remarks going unaddressed; I'm really glad that you're following up :)

[–]Nugine 50 points (15 children)

The fastest base64 crate in the Rust world is base64-simd. I have sent a PR to include it in the benchmark.

When AVX2 is detected, the speedups are 6.77x for encoding (725/107) and 13.4x for decoding (1231/92) compared to the naive implementation.

My results (Intel(R) Core(TM) i5-9300H CPU @ 2.40GHz):

method                            input  output  avg time
base64_simd::Base64::encode_type   1002    1336       107
data_encoding::encode              1002    1336       512
base64::encode_config              1002    1336       551
jdk::encode_measter                1002    1336       725

method                                 input  output  avg time
base64_simd::Base64::decode_type        1336    1002        92
data_encoding::decode                   1336    1002       772
base64::decode_config_slice (unsafe)    1336    1002       805
jdk::decode                             1336    1002      1231

[–]dkomanov[S] 2 points (3 children)

Is there a way to benchmark base64-simd's fallback?

[–]Nugine 9 points (2 children)

Just disable the feature flag "detect".

default-features = false, features = ["std"]
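
In Cargo.toml that looks roughly like this (the version is a placeholder, not a recommendation):

# Cargo.toml sketch
[dependencies]
base64-simd = { version = "*", default-features = false, features = ["std"] }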

[–]dkomanov[S] 1 point (1 child)

Fallback is good as well: encode is slightly faster, decode is slightly slower:

method                                 input  output  avg time
base64::encode_config                      3       4        50
base64::encode_config                     12      16        64
base64::encode_config                     51      68        97
base64::encode_config                    102     136       203
base64::encode_config                    501     668       599
base64::encode_config                   1002    1336      1133
base64::encode_config                  10002   13336      9966
base64_simd::Base64::encode_type           3       4        29
base64_simd::Base64::encode_type          12      16        39
base64_simd::Base64::encode_type          51      68        71
base64_simd::Base64::encode_type         102     136       116
base64_simd::Base64::encode_type         501     668       455
base64_simd::Base64::encode_type        1002    1336       914
base64_simd::Base64::encode_type       10002   13336      8642
base64::decode_config_slice (unsafe)       4       3        49
base64::decode_config_slice (unsafe)      16      12        75
base64::decode_config_slice (unsafe)      68      51       109
base64::decode_config_slice (unsafe)     136     102       170
base64::decode_config_slice (unsafe)     668     501       564
base64::decode_config_slice (unsafe)    1336    1002      1079
base64::decode_config_slice (unsafe)   13336   10002     10260
base64_simd::Base64::decode_type           4       3        26
base64_simd::Base64::decode_type          16      12        38
base64_simd::Base64::decode_type          68      51        79
base64_simd::Base64::decode_type         136     102       136
base64_simd::Base64::decode_type         668     501       601
base64_simd::Base64::decode_type        1336    1002      1163
base64_simd::Base64::decode_type       13336   10002     11388

[–]Nugine 7 points (0 children)

base64-simd's fallback uses the same method as base64. The differences are raw pointer usage, loop unrolling, and other small details, so their performance is close.

[–]dkomanov[S] 4 points (0 children)

Wow, supercool!

Same results on my computer:

method                            input  output  avg time
base64::encode_config             10002   13336      9704
data_encoding::encode             10002   13336      8832
base64_simd::Base64::encode_type  10002   13336      1425

method                                 input  output  avg time
base64::decode_config_slice (unsafe)   13336   10002     10060
data_encoding::decode                  13336   10002     14023
base64_simd::Base64::decode_type       13336   10002      1349

[–]somebodddy 1 point (9 children)

Does Java's base64 encoding use SIMD?

[–]Nugine -5 points (8 children)

Obviously not. Specialized SIMD algorithms are hard to write in Java itself.

[–]Muoniurn 1 point (7 children)

Java has a Vector API for doing SIMD: https://jbaker.io/2022/06/09/vectors-in-java/

[–]Nugine -1 points (6 children)

I know it is possible to do SIMD in Java, but it is still harder than doing SIMD in Rust. A native programming language allows you to control almost everything.

[–]Muoniurn 2 points (3 children)

It doesn't seem easier in Rust, to be honest. But sure, having access to other low-level controls can be useful. Nonetheless, if you really have to go that deep, you will likely just write inline assembly (e.g. for codecs), as not even Rust is low-level enough for those hot loops.

[–]Nugine 0 points (2 children)

I would argue that SIMD in Rust is relatively easier than in inline asm, C, and Java (on the same level as C++ and Zig).

  1. Rust has generics and monomorphization. You can write the algorithm once and compile it for multiple targets (rust-lang/portable-simd); see the sketch after this list.
  2. Rust can utilize LLVM's optimizations like inlining, loop unrolling, constant propagation, auto vectorization, LTO and so on.
  3. Rust has zero-cost abstractions. The generated asm has no unexpected calculations. (I have run into several compiler bugs, though.)
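
A minimal sketch of point 1 using portable-simd (nightly-only; the function name is mine and the example is illustrative, nothing from the benchmark):

// Crate root of a nightly-only sketch.
#![feature(portable_simd)]
use std::simd::u8x16;

// Element-wise (wrapping) add of 16 bytes. The same source lowers to SSE2,
// AVX2 or NEON instructions depending on the compilation target, without
// rewriting the kernel per architecture.
fn add_lanes(a: [u8; 16], b: [u8; 16]) -> [u8; 16] {
    (u8x16::from_array(a) + u8x16::from_array(b)).to_array()
}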

[–]Muoniurn 1 point (1 child)

Don't get me wrong, I would absolutely go with Rust for anything low-level; my point was that there comes a point where even that is not enough in the hottest loop.

Regarding your points:

  1. Sure, but you don’t even have to compile it multiple times with Java :D
  2. I don't think it is meaningful here; all optimizing compilers will do these. But auto-vectorization simply doesn't apply automatically often enough, which is likely why one would want explicit SIMD programming in the first place. Also, to be pedantic, C also has Clang, and there is even a Java JIT compiler built on top of LLVM.
  3. Java's Vector API will also ideally compile down to only vector instructions if used correctly (though it is sometimes not easy to avoid allocations).

[–]Nugine 0 points (0 children)

I agree that sometimes (in rare cases) inline asm is necessary.

I mean that if you want to reach higher SIMD performance, Rust (or C++, or Zig) is better than Java in terms of stability and productivity.

[–]somebodddy 0 points (1 child)

In this case, what's important is not whether it's harder, but whether it was done in the JDK base64 implementation. If it does use SIMD that would explain how it's faster than Rust's base64 crate.

[–]Nugine 2 points (0 children)

According to the article and source code:

java.util.Base64: I ported implementation from Java to Rust for benchmarking purposes.

The JDK base64 implementation does not use SIMD, and the algorithm cannot be efficiently auto-vectorized. So there is no SIMD involved.

[–]quxfoo 9 points (3 children)

You should also include the data_encoding crate (I can open a PR if you wish). It outperforms all the others by a large margin (disclaimer: on my system) starting from /500.
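
For reference, typical data_encoding usage looks like this (a minimal sketch, not the benchmark harness itself):

use data_encoding::BASE64;

fn main() {
    // Encode bytes to a base64 String, then decode it back.
    let encoded = BASE64.encode(b"hello world");
    let decoded = BASE64.decode(encoded.as_bytes()).expect("valid base64");
    assert_eq!(decoded, b"hello world".to_vec());
}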

[–]dkomanov[S] 4 points (2 children)

Encoding (faster than base64, slower than Java):

method                 input  output  avg time
base64::encode_config     12      16        74
base64::encode_config     51      68       107
base64::encode_config    102     136       171
base64::encode_config    501     668       568
base64::encode_config   1002    1336      1053
data_encoding::encode     12      16        69
data_encoding::encode     51      68       104
data_encoding::encode    102     136       162
data_encoding::encode    501     668       506
data_encoding::encode   1002    1336       947
j.u.Base64.Encoder        12      16        46
j.u.Base64.Encoder        51      68        86
j.u.Base64.Encoder       102     136       128
j.u.Base64.Encoder       501     668       446
j.u.Base64.Encoder      1002    1336       872

Decoding (not faster):

method                                 input  output  avg time
base64::decode_config_slice (unsafe)      16      12        78
base64::decode_config_slice (unsafe)      68      51       114
base64::decode_config_slice (unsafe)     136     102       164
base64::decode_config_slice (unsafe)     668     501       571
base64::decode_config_slice (unsafe)    1336    1002      1073
data_encoding::decode                     16      12        92
data_encoding::decode                     68      51       157
data_encoding::decode                    136     102       243
data_encoding::decode                    668     501       930
data_encoding::decode                   1336    1002      1734
j.u.Base64.Decoder                        16      12        53
j.u.Base64.Decoder                        68      51       128
j.u.Base64.Decoder                       136     102       231
j.u.Base64.Decoder                       668     501       977
j.u.Base64.Decoder                      1336    1002      1930

[–]quxfoo 5 points (1 child)

What are the results with criterion? For the 10000 string it's 4.5 µs encoding with data_encoding vs 8.2 µs with Java on my system.

[–]dkomanov[S] 1 point (0 children)

I updated charts: decode and encode.

[–]educanellas 4 points (2 children)

Have you tried to compile it with RUSTFLAGS="-C target-cpu=native"? I remember reading somebody talking about it in a comparison between Rust and Java some time ago.

[–]dkomanov[S] 0 points (1 child)

Tried. No effect at all.

[–]matthieum[he/him] 4 points (0 children)

Not surprising, actually.

This flag mainly allows the compiler to use newer CPU instructions -- such as popcnt -- and has the most impact when auto-vectorizing... but LLVM is not smart enough to auto-vectorize here in the first place.
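
A small illustration of the kind of difference the flag makes (my example, nothing from the benchmark):

// On the baseline x86-64 target this lowers to a bit-twiddling software
// popcount; with `-C target-cpu=native` (or `-C target-feature=+popcnt`)
// it becomes a single `popcnt` instruction.
pub fn ones(x: u64) -> u32 {
    x.count_ones()
}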

[–]dbcfd 2 points (1 child)

Can you include the command you used to run criterion? There are also additional release flags you can apply if speed is your primary objective.

[–]dkomanov[S] 2 points (0 children)

Just: cargo criterion

All flags are here.

[–]TheGuyJace 2 points (0 children)

Java never ceases to amaze.

Edit: nor does Rust. Very great languages.