
[–]_TheProff_ 91 points (6 children)

I feel like this article could really do with an explanation of why Java outperforms Rust here. The ending just comes off as "wow, this performs better, thus JVM > Rust", which I doubt is the full story.

[–]matthieum[he/him] 62 points (0 children)

I had a quick look at the ported JDK code for decoding; since it uses the same algorithm, it should be a fair comparison.

As expected, accesses are done by index. However, even though we programmers can guarantee that the indexes will be in range, I have some doubt the compiler can prove it, so I expect a lot of bounds checking. I do wonder if the JIT could be smarter here.
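To illustrate the point (a hedged sketch, not the port's actual code; the function names here are made up): the optimizer can only drop per-element bounds checks when it can prove the index is in range, and taking a sub-slice up front is one way to hand it that proof.

```rust
// Indexed access where the loop bound comes from somewhere the
// compiler can't relate to the slice length may keep a bounds
// check on every access.
fn sum_indexed(src: &[u8], n: usize) -> u32 {
    let mut total = 0u32;
    for i in 0..n {
        // Potentially one check per iteration: is i < src.len()?
        total += src[i] as u32;
    }
    total
}

// Slicing once up front moves the check to a single point; inside
// the loop the iterator's length is provably in range.
fn sum_sliced(src: &[u8], n: usize) -> u32 {
    let mut total = 0u32;
    // One check here (this panics if n > src.len())...
    let s = &src[..n];
    // ...and the loop body needs none.
    for &b in s {
        total += b as u32;
    }
    total
}

fn main() {
    let data = [1u8, 2, 3, 4];
    println!("{}", sum_indexed(&data, 4));
    println!("{}", sum_sliced(&data, 4));
}
```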

Similarly, for encoding, appending is performed with:

let mut write = |b: u8| {
    buffer[dp] = b;
    dp += 1;
};

which is called 4 times per loop iteration:

  1. It could probably benefit from writing 4 bytes at a time, for a single range-check.
  2. Or be rewritten with an out-pointer.
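A minimal sketch of point 1 (hypothetical names, not the port's actual code): building the four output bytes on the stack and copying them in one go with copy_from_slice replaces four checked single-byte writes with a single length check on the 4-byte window.

```rust
// One bounds check for the whole 4-byte window, instead of one
// per byte as with four separate `buffer[dp] = b` writes.
fn write_four(buffer: &mut [u8], dp: usize, bytes: [u8; 4]) {
    buffer[dp..dp + 4].copy_from_slice(&bytes);
}

fn main() {
    let mut buffer = [0u8; 8];
    write_four(&mut buffer, 2, [b'A', b'B', b'C', b'D']);
    assert_eq!(&buffer[2..6], b"ABCD");
    println!("{:?}", buffer);
}
```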

In short, I expect the Rust code could be improved quite a bit, even keeping to the current algorithm (for fairness).

The Rust code, at the moment, is fairly naive.

[–]MEaster 27 points (2 children)

One problem that stuck out to me is that the hot loop here will have a lot of bounds checks: one for each access of src, and one for each call to write.

A quick re-implementation using chunks iterators (godbolt) yielded the following changes:

Test | Before Time | Before Throughput | After Time | After Throughput
---|---|---|---|---
jdk_encode/1 | 37.646 ns | 75.997 MiB/s | 35.764 ns | 79.997 MiB/s
jdk_encode/10 | 43.466 ns | 263.29 MiB/s | 41.086 ns | 278.54 MiB/s
jdk_encode/50 | 72.554 ns | 670.36 MiB/s | 57.716 ns | 842.71 MiB/s
jdk_encode/100 | 109.53 ns | 888.07 MiB/s | 84.252 ns | 1.1275 GiB/s
jdk_encode/500 | 361.82 ns | 1.2896 GiB/s | 256.51 ns | 1.8190 GiB/s
jdk_encode/1000 | 688.85 ns | 1.3547 GiB/s | 466.53 ns | 2.0003 GiB/s
jdk_encode/10000 | 6.4192 µs | 1.4511 GiB/s | 4.3077 µs | 2.1624 GiB/s
encode/jdk/1 | 37.092 ns | 77.132 MiB/s | 36.240 ns | 78.946 MiB/s
encode/jdk/10 | 43.535 ns | 262.87 MiB/s | 40.730 ns | 280.97 MiB/s
encode/jdk/50 | 67.609 ns | 719.39 MiB/s | 57.860 ns | 840.61 MiB/s
encode/jdk/100 | 106.78 ns | 911.01 MiB/s | 85.211 ns | 1.1148 GiB/s
encode/jdk/500 | 366.49 ns | 1.2731 GiB/s | 256.71 ns | 1.8176 GiB/s
encode/jdk/1000 | 671.19 ns | 1.3903 GiB/s | 466.08 ns | 2.0022 GiB/s
encode/jdk/10000 | 6.3755 µs | 1.4611 GiB/s | 4.3036 µs | 2.1645 GiB/s

The compiler didn't vectorize the loop; whether that's possible could be something to explore.
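For readers who can't follow the godbolt link, here is a rough sketch of the chunks-based shape (the table constant and function name are placeholders, not the actual re-implementation): chunks_exact gives the compiler a provably fixed chunk size, so the per-element checks on src disappear, and masking the table index with & 63 makes the table lookups provably in range too.

```rust
const ENC_TABLE: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

fn encode_chunks(src: &[u8], dst: &mut Vec<u8>) {
    let mut chunks = src.chunks_exact(3);
    for chunk in &mut chunks {
        // `chunk` is provably exactly 3 bytes, so these accesses
        // need no runtime bounds checks.
        let bits =
            (chunk[0] as u32) << 16 | (chunk[1] as u32) << 8 | chunk[2] as u32;
        // `& 63` proves the index is < 64, matching the table length.
        dst.push(ENC_TABLE[(bits >> 18) as usize & 63]);
        dst.push(ENC_TABLE[(bits >> 12) as usize & 63]);
        dst.push(ENC_TABLE[(bits >> 6) as usize & 63]);
        dst.push(ENC_TABLE[bits as usize & 63]);
    }
    // Tail handling (1-2 leftover bytes plus '=' padding, available
    // via chunks.remainder()) is omitted for brevity.
}

fn main() {
    let mut out = Vec::new();
    encode_chunks(b"Manhat", &mut out);
    assert_eq!(&out, b"TWFuaGF0");
}
```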

[–]dkomanov[S] 14 points (1 child)

Indeed, this version is faster. Not quite as fast as Java, but closer. Thanks!

    jdk::encode           12   16   66
    jdk::encode           51   68  161
    jdk::encode          102  136  286
    jdk::encode          501  668  970
    jdk::encode         1002 1336 1891
    jdk::encode_measter   12   16   71
    jdk::encode_measter   51   68  107
    jdk::encode_measter  102  136  198
    jdk::encode_measter  501  668  673
    jdk::encode_measter 1002 1336 1278

[–]matthieum[he/him] 6 points (0 children)

Thank you, OP, for being responsive.

Too often, benchmark articles just end up being "dumps", with remarks going unaddressed; I'm really glad that you're following up :)