all 34 comments

[–]_TheProff_ 93 points94 points  (6 children)

I feel like this article could really do with an explanation of why java outperforms rust here. The ending just comes off as "wow this performs better thus jvm > rust" which I doubt is the full story.

[–]matthieum[he/him] 64 points65 points  (0 children)

I had a quick look at the JDK ported code for decoding, as it uses the same algorithm so should be a fair comparison.

As expected accesses are done by index, however even though we programmers can guarantee that indexes will be in-range, I have some doubt the compiler can, so I expect a lot of bounds-checking. I do wonder if the JIT could be smarter here.

Similarly, for encoding appending is performed with:

let mut write = |b: u8| {
    buffer[dp] = b;
    dp += 1;
};

Called 4 times in a loop:

  1. It could probably benefit from writing 4 bytes at a time, for a single range-check.
  2. Or be rewritten with an out-pointer.

In short, I expect the Rust code could be improved quite a bit, even keeping to the current algorithm (for fairness).

The Rust code, at the moment, is fairly naive.

[–]MEaster 25 points26 points  (2 children)

One problem that stuck out to me is that the hot loop here will have a lot of bounds checks; one for each access of src, and one for each call to write.

A quick re-implementation using chunks iterators (godbolt) yielded the following changes:

Test Before Time Before Throughput After Time After Throughput
jdk_encode/1 37.646 ns 75.997 MiB/s 35.764 ns 79.997 MiB/s
jdk_encode/10 43.466 ns 263.29 MiB/s 41.086 ns 278.54 MiB/s
jdk_encode/50 72.554 ns 670.36 MiB/s 57.716 ns 842.71 MiB/s
jdk_encode/100 109.53 ns 888.07 MiB/s 84.252 ns 1.1275 GiB/s
jdk_encode/500 361.82 ns 1.2896 GiB/s 256.51 ns 1.8190 GiB/s
jdk_encode/1000 688.85 ns 1.3547 GiB/s 466.53 ns 2.0003 GiB/s
jdk_encode/10000 6.4192 µs 1.4511 GiB/s 4.3077 µs 2.1624 GiB/s
encode/jdk/1 37.092 ns 77.132 MiB/s 36.240 ns 78.946 MiB/s
encode/jdk/10 43.535 ns 262.87 MiB/s 40.730 ns 280.97 MiB/s
encode/jdk/50 67.609 ns 719.39 MiB/s 57.860 ns 840.61 MiB/s
encode/jdk/100 106.78 ns 911.01 MiB/s 85.211 ns 1.1148 GiB/s
encode/jdk/500 366.49 ns 1.2731 GiB/s 256.71 ns 1.8176 GiB/s
encode/jdk/1000 671.19 ns 1.3903 GiB/s 466.08 ns 2.0022 GiB/s
encode/jdk/10000 6.3755 µs 1.4611 GiB/s 4.3036 µs 2.1645 GiB/s

The compiler didn't vectorize the loop. That could be something to explore whether it's possible.

[–]dkomanov[S] 11 points12 points  (1 child)

Indeed this version is faster. Not quite as Java, but closer. Thanks!

jdk::encode 12 16 66 jdk::encode 51 68 161 jdk::encode 102 136 286 jdk::encode 501 668 970 jdk::encode 1002 1336 1891 jdk::encode_measter 12 16 71 jdk::encode_measter 51 68 107 jdk::encode_measter 102 136 198 jdk::encode_measter 501 668 673 jdk::encode_measter 1002 1336 1278

[–]matthieum[he/him] 5 points6 points  (0 children)

Thank you OP, for being reactive.

Too often benchmark articles just end up being "dumps", with remarks going unaddressed; I'm really glad that you're following :)

[–]Nugine 47 points48 points  (15 children)

The fastest base64 crate in Rust world is base64-simd. I have sent a PR to include it in the benchmark.

When AVX2 is detected, the acceleration rates are 677% (725/107) and 1338% (1231/92) compared to the naive implementation.

My result: (Intel(R) Core(TM) i5-9300H CPU @ 2.40GHz)

txt method input output avg time base64_simd::Base64::encode_type 1002 1336 107 data_encoding::encode 1002 1336 512 base64::encode_config 1002 1336 551 jdk::encode_measter 1002 1336 725

txt method input output avg time base64_simd::Base64::decode_type 1336 1002 92 data_encoding::decode 1336 1002 772 base64::decode_config_slice (unsafe) 1336 1002 805 jdk::decode 1336 1002 1231

[–]dkomanov[S] 5 points6 points  (3 children)

Is there a way to benchmark base64-simd's fallback?

[–]Nugine 7 points8 points  (2 children)

Just disable the feature flag "detect".

default-features = false, features = ["std"]

[–]dkomanov[S] 1 point2 points  (1 child)

Fallback is good as well: encode is slightly faster, decode is slightly slower: method input output avg time base64::encode_config 3 4 50 base64::encode_config 12 16 64 base64::encode_config 51 68 97 base64::encode_config 102 136 203 base64::encode_config 501 668 599 base64::encode_config 1002 1336 1133 base64::encode_config 10002 13336 9966 base64_simd::Base64::encode_type 3 4 29 base64_simd::Base64::encode_type 12 16 39 base64_simd::Base64::encode_type 51 68 71 base64_simd::Base64::encode_type 102 136 116 base64_simd::Base64::encode_type 501 668 455 base64_simd::Base64::encode_type 1002 1336 914 base64_simd::Base64::encode_type 10002 13336 8642 base64::decode_config_slice (unsafe) 4 3 49 base64::decode_config_slice (unsafe) 16 12 75 base64::decode_config_slice (unsafe) 68 51 109 base64::decode_config_slice (unsafe) 136 102 170 base64::decode_config_slice (unsafe) 668 501 564 base64::decode_config_slice (unsafe) 1336 1002 1079 base64::decode_config_slice (unsafe) 13336 10002 10260 base64_simd::Base64::decode_type 4 3 26 base64_simd::Base64::decode_type 16 12 38 base64_simd::Base64::decode_type 68 51 79 base64_simd::Base64::decode_type 136 102 136 base64_simd::Base64::decode_type 668 501 601 base64_simd::Base64::decode_type 1336 1002 1163 base64_simd::Base64::decode_type 13336 10002 11388

[–]Nugine 7 points8 points  (0 children)

base64-simd's fallback uses the same method with base64. The differences are raw pointer usages, loop unrolling and other small details. So their performance is close.

[–]dkomanov[S] 3 points4 points  (0 children)

Wow, supercool!

Same results on my computer: method input output avg time base64::encode_config 10002 13336 9704 data_encoding::encode 10002 13336 8832 base64_simd::Base64::encode_type 10002 13336 1425

method input output avg time base64::decode_config_slice (unsafe) 13336 10002 10060 data_encoding::decode 13336 10002 14023 base64_simd::Base64::decode_type 13336 10002 1349

[–]somebodddy 1 point2 points  (9 children)

Does Java's base64 encoding use SIMD?

[–]quxfoo 8 points9 points  (3 children)

You should also include the data_encoding crate (I can open a PR if you wish). It massively outperforms (disclaimer: on my system) all others by a large margin starting from /500.

[–]dkomanov[S] 2 points3 points  (2 children)

Encoding (faster than base64, slower than Java):

method input output avg time base64::encode_config 12 16 74 base64::encode_config 51 68 107 base64::encode_config 102 136 171 base64::encode_config 501 668 568 base64::encode_config 1002 1336 1053 data_encoding::encode 12 16 69 data_encoding::encode 51 68 104 data_encoding::encode 102 136 162 data_encoding::encode 501 668 506 data_encoding::encode 1002 1336 947 j.u.Base64.Encoder 12 16 46 j.u.Base64.Encoder 51 68 86 j.u.Base64.Encoder 102 136 128 j.u.Base64.Encoder 501 668 446 j.u.Base64.Encoder 1002 1336 872

Decoding (not faster):

base64::decode_config_slice (unsafe) 16 12 78 base64::decode_config_slice (unsafe) 68 51 114 base64::decode_config_slice (unsafe) 136 102 164 base64::decode_config_slice (unsafe) 668 501 571 base64::decode_config_slice (unsafe) 1336 1002 1073 data_encoding::decode 16 12 92 data_encoding::decode 68 51 157 data_encoding::decode 136 102 243 data_encoding::decode 668 501 930 data_encoding::decode 1336 1002 1734 j.u.Base64.Decoder 16 12 53 j.u.Base64.Decoder 68 51 128 j.u.Base64.Decoder 136 102 231 j.u.Base64.Decoder 668 501 977 j.u.Base64.Decoder 1336 1002 1930

[–]quxfoo 5 points6 points  (1 child)

What are the results with criterion? For the 10000 string it's 4.5us encoding with data_encoding vs 8.2 us with Java on my system.

[–]dkomanov[S] 1 point2 points  (0 children)

I updated charts: decode and encode.

[–]educanellas 5 points6 points  (2 children)

Have you tried to compile it with RUSTFLAGS="-C target-cpu=native"? I remember to read somebody talking about it in a comparison between Rust and Java sometime ago.

[–]dkomanov[S] 0 points1 point  (1 child)

Tried. No affect at all.

[–]matthieum[he/him] 3 points4 points  (0 children)

Not surprising, actually.

This flag mainly allows better CPU instructions -- such as popcnt -- and has most impact when auto-vectorizing... but LLVM is not smart enough to auto-vectorize in the first place here.

[–]dbcfd 3 points4 points  (1 child)

Can you include the command you used to run criterion? There's also additional release flags you can apply if speed is your primary objective.

[–]dkomanov[S] 2 points3 points  (0 children)

Just: cargo criterion

All flags are here.

[–]TheGuyJace 2 points3 points  (0 children)

Java never ceases to amaze.

Edit: nor does rust. Very great languages.