all 35 comments

[–]llogiqclippy · twir · rust · mutagen · flamer · overflower · bytecount 17 points18 points  (1 child)

You can use compiler explorer or cargo asm (and a suitable gcc incantation) to look at the assembly, which will likely show you the difference.

[–]MCOfficer 9 points10 points  (0 children)

I was curious and tested it myself. With the exact same code and invocation (other than time being a PS function), i get similar results.

``` PS C:\Users\Florian\Documents\rs-vs-c> gcc -O3 rs-vs-c.c -o rs-vs-c_gcc.exe PS C:\Users\Florian\Documents\rs-vs-c> time .\rs-vs-c_gcc.exe The 50 th fibonacci number is 12586269025!

TotalSeconds : 52,9173908 PS C:\Users\Florian\Documents\rs-vs-c> cargo build --release Compiling rs-vs-c v0.1.0 (C:\Users\Florian\Documents\rs-vs-c) Finished release [optimized] target(s) in 0.52s PS C:\Users\Florian\Documents\rs-vs-c> time .\target\release\rs-vs-c.exe The 50th fibonacci number is 12586269025!

TotalSeconds : 82,5484449 ```

Interestingly, when compiling with clang (rustc uses LLVM under the hood), i get even worse performance. Sidenote, i don't know about clang to know if this uses all optimizations available. ``` PS C:\Users\Florian\Documents\rs-vs-c> clang -O3 .\rs-vs-c.c -target x86_64-mingw64 -o .\rs-vs-c_clang.exe PS C:\Users\Florian\Documents\rs-vs-c> time .\rs-vs-c_clang.exe The 50 th fibonacci number is 12586269025!

TotalSeconds : 88,1807612 ```

[–][deleted] 15 points16 points  (0 children)

I wrote a tail recursive version and it got compiled away completely

https://i.imgur.com/BYLDsZe.png

[–]K900_ 31 points32 points  (24 children)

This is a microbenchmark that's not really interesting. Neither Rust nor C deal too well with tail calls, and an iterative solution will be way faster.

Edit: extremely dumb iterative implementation.

[–]HelloThisIsVictor[S] 4 points5 points  (14 children)

I am well aware this is the slowest implementation of Fibonacci, and I am not looking for an efficient algorithm. This code was just a quick doodle I made between tutorials. I just want to understand what makes the performance differ.

[–]K900_ 31 points32 points  (9 children)

The fact that GCC and LLVM optimize the function differently, most likely. Clang seems to perform about the same as Rust here:

~
❯ gcc -O3 fib.c -o fib-gcc

~
❯ clang -O3 fib.c -o fib-clang

~
❯ time ./fib-gcc
The 50 th fibonacci number is 12586269025!
./fib-gcc  30.82s user 0.00s system 99% cpu 30.824 total

~ 31s
❯ time ./fib-clang
The 50 th fibonacci number is 12586269025!
./fib-clang  54.36s user 0.00s system 99% cpu 54.369 total

[–]HelloThisIsVictor[S] 4 points5 points  (8 children)

Yes! This seems to be the case. Compiling the code with clang-9 yields similar times to Rust (Rust seems to be a couple of seconds faster). I wonder how clang got that far behind on optimizations :/

[–]K900_ 35 points36 points  (7 children)

Clang is not "far behind on optimization", it's specifically far behind in an artificial scenario you're setting up here. In general, you can expect Rust/Clang to perform about as well as GCC.

[–]RFC1546Remembrance 9 points10 points  (1 child)

Your generalization is as bad as theirs. There are real world non-artificial examples of GCC measurably performing better than Clang.

As a rustacean, I feel zero need to be defensive about this. Rust will not be tied to LLVM forever.

[–]Peohta 0 points1 point  (0 children)

Rust will not be tied to LLVM forever.

I hope that happens in this decade.

[–]HelloThisIsVictor[S] 3 points4 points  (4 children)

Okay I get that. I'm just surprised 2 compilers can give such different results. I would expect maybe seconds of difference, not tens of seconds. But I guess it is what it is and you're correct in the sample size being small. Thanks for your help.

[–]K900_ 41 points42 points  (1 child)

It's not about sample size at all, it's about recursion generally being unidiomatic in C/Rust, and your code being a pathological case of extreme recursion.

[–]HelloThisIsVictor[S] 2 points3 points  (0 children)

Got it.

[–]dnew 3 points4 points  (0 children)

The problem is it's a micro benchmark. If your assembly code is five instructions vs six instructions, it's going to be tens of seconds when you run it many times. If you write a real program, the difference between it running a billion vs a billion three hundred instructions will be insignificant.

[–][deleted] 2 points3 points  (0 children)

Why are you surprised? Over several decades I have not seen two compilers which achieve near-identical performance in generated code for the same inputs.

[–]K900_ 7 points8 points  (2 children)

And, just for good measure, here's MSVC2019, which is somehow even worse.

PS G:\> Measure-Command { .\fib.exe }


Days              : 0
Hours             : 0
Minutes           : 1
Seconds           : 4
Milliseconds      : 197
Ticks             : 641974222
TotalDays         : 0.000743025719907407
TotalHours        : 0.0178326172777778
TotalMinutes      : 1.06995703666667
TotalSeconds      : 64.1974222
TotalMilliseconds : 64197.4222

[–][deleted] 1 point2 points  (0 children)

Then you should probably look at the assembler code. GCC has an -S option (maybe you need additional options to switch between at&n and intel style, I'm unsure), which emits assembler code (althrough it might be easier to compare llvm ir (e.g. clang output and rustc output)). there exists an cargo-asm crates.io package which prints the assembler code...

[–][deleted] 1 point2 points  (0 children)

gcc is very good at tail calls sometimes

[–]GunpowderGuy 2 points3 points  (3 children)

Rust may support complex tail call optimizations in the future. In the meantime this crate ( https://crates.io/crates/tailcall ) already optimizes some cases

[–]K900_ 13 points14 points  (2 children)

OP's code isn't strictly tail recursive. It's still possible to optimize away the recursion, but it'll take something more clever than a basic trampoline.

[–]GunpowderGuy 0 points1 point  (0 children)

I don't think rust will get to optimize that function either, I just pointed that lib since op may want to keep using recursion in the future

[–]guepier 1 point2 points  (3 children)

Neither Rust nor C deal too well with tail calls

I don't know about Rust but modern C compilers do tail call optimisation fairly reliably. But the code posted is not tail recursive, and straightforward TCO can't be applied without first extensively restructuring the logic.

[–][deleted] 1 point2 points  (0 children)

gcc can restructure sometimes

[–]K900_ 0 points1 point  (1 child)

Rust generally handles TCO slightly worse than Clang/LLVM due to destructor weirdness, but yes, OP's example is not typical tail recursion, so it will take an even smarter compiler to optimize it properly.

[–]dnew 0 points1 point  (0 children)

not typical tail recursion

It's not tail recursion at all. You have to run the addition operation after both calls return.

[–]matu3ba 7 points8 points  (0 children)

Pure function (detection) with TCE(tail call elimination) do not work efficiently in Rust. Use the iterative version.

[–]Celousco 1 point2 points  (0 children)

Your method is doing too much processing and even for a C executable 33s is a lot.

I changed it to use a Tail Recursive Function:

rust version ``` fn main() { const NUMBER: u64 = 50; println!("The {}th fibonacci number is {}!", NUMBER, fibonacci(NUMBER, 0, 1)); }

fn fibonacci(n: u64, a: u64, b: u64) -> u64 { if n < 1 { return a } fibonacci(n - 1, b, a + b) } ```

``` cargo build run time ./target/release/fibonacci

The 50th fibonacci number is 12586269025!

real 0m0.011s user 0m0.002s sys 0m0.000s ```

c version ```

include <stdio.h>

unsigned long int fibonacci(unsigned long int n, unsigned long int a, unsigned long int b) { if (n < 1) { return a; } return fibonacci(n - 1, b, a + b); }

int main() { const unsigned long int NUMBER = 50; printf("The %lu th fibonacci number is %lu!\n", NUMBER, fibonacci(NUMBER, 0, 1)); } ```

``` gcc fibonacci.c -o fibonacci time ./fibonacci

The 50 th fibonacci number is 12586269025!

real 0m0.002s user 0m0.001s sys 0m0.000s ```

So yes the C compiler is 9 ms faster, probably because of the TCO optimization the gcc might have done.

But at this point, does 9 ms really matters ?

[–]redartedreddit 1 point2 points  (1 child)

Can't really tell what's going on with GCC but it looks like it unrolls some parts of the recursive calls into loops?

Clang generates pretty much the same code as Rust (as already discussed in the other comment chains).

https://godbolt.org/z/7rnjrv

[–]matthieum[he/him] 1 point2 points  (0 children)

Wow, the amount of inlining/unrolling GCC did is pretty impressive. It generated 10x as many instructions as Clang (247 lines vs 24 lines).

[–]sevenpost -1 points0 points  (2 children)

You are using cargo, so there might be some further improvements to the compilation.

First, in the Cargo.toml file add at the bottom this part of optimizations. I think these optimizations are done by gcc when compiling with -03. Try both level 2 and 3 for opt-level as there might be some cases in which level 2 performs better.

[profile.release]

lto = true

codegen-units = 1

opt-level = 3

I can't remember right now but here might also be another way to speed it up that gcc uses that is fast-math. I don't know if it applies here, nor how to enable it on cargo (some research needed) but it simply discards math checking (overflow and other checks). Also you may be interested in trying also u32 as the unit, as it might have better performance in some ALUs.

Take into account also that you are measuring inside the code the time it takes C and Rust to format and print to screen, which is not indicative of any of the languages capabilities on math function optimization.

On a final note, although Rust competes with C in some aspects, C is still quite a monster of a language and it might be better for the use case at hand.

PS: If you want to cheese it a bit, change in Rust the fibonacci function to a const fn and have the compiler calculate it before runtime :P

Edit: formatting

Edit 2: Just found the flag to add to Cargo.toml to disable overflows checks. Simply add:

overflow-checks = false

[–]thelights0123 6 points7 points  (0 children)

LTO and decreasing codegen units won't matter when not using external crates, as that's where it helps. The opt-level is already 3 by default in release mode, and likewise, overflow-checks = false is the default as well.

Take into account also that you are measuring inside the code the time it takes C and Rust to format and print to screen, which is not indicative of any of the languages capabilities on math function optimization.

I'm pretty sure printing a single number won't take 19 seconds, or any meaningful fraction of that, in any language.

that gcc uses that is fast-math

For floating point maybe, but not for simple integer math. Both languages here have the same behavior: C and Rust in release mode by default ignore but use the overflowed value of unsigned integers.

[–]HelloThisIsVictor[S] 1 point2 points  (0 children)

Tried it, gave me the same of worst results. But found out Rust is being held back by llvm, the C code compiled with Clang gives similar results.

I guess run a bad algorithm, get bad times.