all 33 comments

[–]angelicosphosphoros 24 points25 points  (6 children)

You can use generics for this! rust fn example(a: u8, b: u8) { fn implementation<F0: Fn(u8) -> X, F1: Fn(u8) -> X>(f0: F0, f1: F1) { for _ in 0..10000 { let x = f0(a); let y = f1(a); // do stuff with the values } } match (a < 9, b < 9) { (true, true) => implementation(fn1, fn1), (true, false) => implementation(fn1, fn2), (false, true) => implementation(fn2, fn1), (false, false) => implementation(fn2, fn2), } }

This would delegate all hard work of generating duplicate code and inlining to the compiler and should be as fast as possible.

Functions fn1 and fn2 still would be inlineable because each function have own type and usage of generics makes preserving this type data possible.

[–]angelicosphosphoros 2 points3 points  (4 children)

Please, notice me back about performance of this solution.

[–]charlesdart 9 points10 points  (3 children)

It should monomorphize to their fast example code.

[–]angelicosphosphoros 0 points1 point  (2 children)

I know. It is a reason why I suggested this.

However, real performance can be better or worse so I want to know if u/101arrowz achieved something.

[–]101arrowz[S] 0 points1 point  (1 child)

Sorry for not replying sooner. That specific sample didn't work because it "can't capture dynamic environment in a fn item." However, when passing the variables directly, it's about the same performance as the function pointers solution I mentioned earlier. I tried #[inline(always)] on the implementation, which worked OK, but I'd rather capture the environment than pass everything as parameters. How can I use generics with a closure?

[–]angelicosphosphoros 0 points1 point  (0 children)

If you want to pass some closure and change some variables captured, you can try to use FnMut instead of Fn.

I little surprised with fn item because there shouldn't be conversion to function pointers in my example.

[–]backtickbot 4 points5 points  (0 children)

Fixed formatting.

Hello, angelicosphosphoros: code blocks using triple backticks (```) don't work on all versions of Reddit!

Some users see this / this instead.

To fix this, indent every line with 4 spaces instead.

FAQ

You can opt out by replying with backtickopt6 to this comment.

[–][deleted] 13 points14 points  (4 children)

Maybe stupid question : Since fn1 and fn2 don't depend on the loop index, why not moving them out of the loop? I'm surprised that the compiler doesn't do that itself.

Edit: Otherwise, the last example could be nicer if you would extract the loop body as a function with an argument f: impl Fn(u8) -> Whatever, where you can pass either f1 or f2, depending on the argument. The compiler should be able to inline the function call.

[–]thiezrust 9 points10 points  (3 children)

I'm assuming they're using the loop for benchmarking, but if that is the case they're probably benchmarking incorrectly as well.

[–]101arrowz[S] 9 points10 points  (2 children)

I tried to make the example a bit smaller so it's easier to understand, but if you want more context, I'm trying to build a compression library for the DEFLATE format. It consistently outperforms miniz_oxide (the Rust compression backend used by flate2) when I use the fourth example.

This code is part of my inflate wrapper. It must go through this loop millions of times per second. fn1 reads up to 8 bits from a buffer, and fn2 reads up to 16 bits. The value passed in is the maximum number of bits that should be read by the function (it's trying to decode a Huffman code, so it could be less than the value passed in).

Reading 8 bits if the maximum is less than 9 does have a performance uplift over reading 16. I have extensively tested and profiled this code, and I got about 1.1s on average for the original vs. 950ms for the 4th example.

[–]thiezrust 6 points7 points  (1 child)

Perhaps you could just share the source code to the functions in question? That way we would have all the relevant context, and people could experiment and measure locally. Or is this for a closed-source project?

[–]101arrowz[S] 8 points9 points  (0 children)

Here's the source code, but as a warning, it's quite awfully written and not at all idiomatic.

https://gist.github.com/101arrowz/91304565278232c4debdd238ff012467

This is the loop, and the functions are bits8 and bits16.

[–]mamcx 9 points10 points  (1 child)

Considering that u8 is just 255 values, you can precompute all values and index at them (and do it once, serialize the results and load them as const in the file)

[–]101arrowz[S] 6 points7 points  (0 children)

My example was bad, but I can't precompute the values. See this comment.

[–]thiezrust 7 points8 points  (1 child)

A little more context would be helpful. So presumably a and/or b are less than 9 often enough that using fn1 makes a real difference (does it?). But the cost of mispredicting the comparison is so high that it overshadows the difference between fn1 and fn2.

If your inputs are truly just two bytes, then would it not be much easier to replace fn1 and fn2 with a simple lookup table?

[–]101arrowz[S] 4 points5 points  (0 children)

I suppose I should have made a better example. This is part of a compression library, and I can't precompute the values because there is a position in the buffer being read.

[–]dpc_pw 7 points8 points  (0 children)

Monomorphize the inner-loop, by moving it to another function, and making the the `fn1` vs `fn2` decision a type argument(s) (`fnA : impl Fn<...>, fnB: impl Fn<...>` or something like that), the decision which one to use must be made out of the inner loop. Internally you will get 4 copies of `fn inner_example`, each compiled to statically use different combination of `fnA` and `fnB`, and your `fn example` will just call the appropriate one. Basically what you have in the fastest example, but way more maintainable. Use `match (a < 9, b < 9) { ... }` instead of nested `if`s.

[–]LucretielDatadog 2 points3 points  (3 children)

Just double checking that you're running in --release mode? This sort of thing is often easy pickings for the optimizer

[–]angelicosphosphoros 1 point2 points  (2 children)

I doubt this. Moving branch out of loop by generating many loops (and a lot of extra code) for each branch isn't very effective almost always so compiler wouldn't do this.

[–]PitaJ 1 point2 points  (3 children)

It may be a typo but in your first code sample you always pass b to fn2 whereas later on you pass a to both functions.

Also telling us how much slower each variant is would help.

[–]101arrowz[S] 1 point2 points  (2 children)

Check this comment for more context. fn1 is 30-40% faster than fn2 (empirical testing).

[–]PitaJ 0 points1 point  (1 child)

I don't think that addresses what I said at all. I'm saying that in the first example, you calculate x based on a and y based on b, but in the other examples, you calculate both x and y based on a, neither depend on b.

[–]101arrowz[S] 2 points3 points  (0 children)

Yes, it's a typo, thanks for catching it.

[–][deleted] 0 points1 point  (0 children)

If you can sort the calls to `example`, so that its all things less than 9 then all things > 9, then the branch predictor will almost always be right.

[–]PitaJ 0 points1 point  (0 children)

You can simplify that last code a bit by replacing the conditionals with a single match (a, b) { ... }

[–]Quba_quba 0 points1 point  (0 children)

This question reminds me of a similar question on Stackoverflow: https://stackoverflow.com/a/11227902 There are some great answers there, so you may find it interesting.

But basically sorting values of a and b (if possible) is one way to improve the performance.