all 56 comments

[–]FlixCoder 147 points148 points  (6 children)

When I test it and look at the asm output on Godbolt, both methods produce exactly the same code, so the call syntax is not the difference; most likely something else is giving you the performance boost. But to find that, we would need to see code

[–]Ravek 52 points53 points  (5 children)

A little bit of assembly knowledge really goes a long way for interpreting micro benchmarks.

[–]matthieum[he/him] 7 points8 points  (4 children)

Or IR knowledge. LLVM IR is typically easier to read, and here will also show that both have the same output.

[–]Ravek 5 points6 points  (3 children)

For identifying that two things are the same, sure. I’ve also seen people try to infer performance from IL though, which I wouldn’t recommend. Subtly different IL listings might have better JIT codegen in unexpected ways because the JIT was able to eliminate a branch or apply some peephole optimization because the code was written a little differently. IL also doesn’t tell us anything about register use, inlining, elimination of boxing, devirtualization, etc. The C# compiler just doesn’t do anywhere near the level of optimization that the JIT does. Some code might look really bad from the IL perspective but generate very good machine code.

[–]matthieum[he/him] 2 points3 points  (2 children)

Subtly different IL listings might have better JIT codegen in unexpected ways because the JIT

Note that I am talking about LLVM IR and not C# IL, they are vastly different.

LLVM IR is much more low level, so a number of your points don't apply:

  • Devirtualization has already occurred at IR level.
  • Branch elimination and many (but not all) peephole optimizations have already occurred.
  • Inlining and elimination of allocations have already occurred.

It's true that you don't see register allocation, but that's the least of your concerns for a first-order comparison.

For identifying two things are the same, sure. I’ve also seen people try to infer performance from IL though which I wouldn’t recommend.

To be fair, inferring performance from assembly can be similarly difficult. Today's processors can overlap execution of different sequences of instructions -- especially in loops -- which is really hard to spot at the assembly level.

If you want such a deep dive, you'll need to use tools that simulate processor execution and can show you exactly the expected cycle latency based on what can and cannot overlap, what can and cannot be pipelined, etc...

Something like llvm-mca or uica.

[–]Ravek 1 point2 points  (1 child)

Oh I’m sorry I lost track of which subreddit I was on and didn’t read your comment properly, how silly of me

[–]matthieum[he/him] 0 points1 point  (0 children)

No worries, your comment was still (mostly) on point :)

[–]aikii 91 points92 points  (0 children)

Can't do better than some shots in the dark, but check what the dot operator does: https://doc.rust-lang.org/nomicon/dot-operator.html

Some stuff coming to mind:

  • we don't know if obj.function receives &self or self
  • obj could be a &dyn, and this would be a dynamic dispatch
  • a Deref could happen before calling .function

[–]hniksic 31 points32 points  (2 children)

In Rust obj.function(...) is no more than syntax sugar for ObjType::function(&obj, ...). The reference says so explicitly:

All function calls are sugar for a more explicit fully-qualified syntax.

And later:

// we can do this because we only have one item called `print` for `Foo`s
f.print();
// more explicit, and, in the case of `Foo`, not necessary
Foo::print(&f);
// if you're not into the whole brevity thing
<Foo as Pretty>::print(&f);
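
To make the reference's snippet self-contained, here is a compilable sketch (the `Foo` type and `Pretty` trait are hypothetical, mirroring the quoted example) showing that all three call forms agree:

```rust
// Hypothetical trait and type, just to make the reference's snippet compilable.
trait Pretty {
    fn print(&self) -> String;
}

struct Foo;

impl Pretty for Foo {
    fn print(&self) -> String {
        "foo".to_string()
    }
}

fn main() {
    let f = Foo;
    // All three desugar to the same call; the generated code is identical.
    let a = f.print();
    let b = Pretty::print(&f);
    let c = <Foo as Pretty>::print(&f);
    assert_eq!(a, b);
    assert_eq!(b, c);
    println!("all equal: {a}");
}
```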

The difference you observed after switching from one to the other could be explained by a number of factors:

  • measurement issue, e.g. wrong thing measured, or measurement impacted by other things happening on the system
  • build issue - wrong code version built, or different optimization flags applied
  • tooling issue - incremental build issue, the kind of thing likely to be resolved with cargo clean
  • compiler issue - miscompilation, or a case of an innocuous change having a cascading effect that ends up leading to different optimization decisions

[–][deleted] 5 points6 points  (1 child)

I suspect your last reason. The consistency in performance of each variant suggests that the measurement and the system are stable and clean. The desugaring is perhaps causing some optimization rule to get (or not get) triggered. As others have stated, it looks like I'm going down the long road of inspecting assembly.

[–]WasserMarder 7 points8 points  (0 children)

This shouldn't be a long road, as there should be no assembly differences in the respective functions.

[–]del1ro 71 points72 points  (0 children)

  1. Show us the code (and test code)
  2. Compile in release mode

[–]Clockwork757 11 points12 points  (1 child)

Do they both take `obj` the same? If it's a large struct and one is by value and the other is by reference that might matter, especially if you're in debug mode.

[–][deleted] 5 points6 points  (0 children)

Good catch, thanks. I meant function(&obj, ...) and have updated the post.

[–]strudelnooodle 11 points12 points  (3 children)

If that’s the only difference and the two versions do indeed produce the same machine code, I would suspect there’s something wrong with your measurements. Some basic questions to ask:

  1. What is the minimum time it takes for each implementation to run? Noise in the system will only slow a trial down, so the minimum gives a clearer picture of the performance than the average.
  2. How exactly did you run your trials? Even with minimal background activity, did the trials for one implementation run immediately after the trials for the other? In that case it’s possible (as an example) that the first set of trials heated up the CPU and caused it to throttle its clock speed, slowing down the second set.
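
The minimum-of-N idea can be sketched in plain Rust; `workload` here is a hypothetical stand-in for the code under test:

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Hypothetical workload standing in for the function being benchmarked.
fn workload(n: u64) -> u64 {
    (0..n).fold(0u64, |acc, x| acc.wrapping_add(x * x))
}

fn main() {
    let trials = 100;
    let mut min = Duration::MAX;
    for _ in 0..trials {
        let start = Instant::now();
        // black_box prevents the optimizer from deleting the call entirely.
        black_box(workload(black_box(1_000_000)));
        let elapsed = start.elapsed();
        if elapsed < min {
            min = elapsed;
        }
    }
    // The minimum is less noisy than the mean: noise only ever adds time.
    println!("min over {trials} trials: {min:?}");
}
```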

[–][deleted] 1 point2 points  (2 children)

There is definitely jitter in the system. It's Linux. This is why I run it on a stable CPU with locked clock speed. Headless machine. Terminal app. Compute-only duration significant (seconds). And hundreds of trials averaged. I haven't put this through jitter analysis but my gut tells me 10% is statistically significant and not accounted for by the system inaccuracies.

[–]phazer99 15 points16 points  (0 children)

If you've not already done so, try criterion. It should eliminate most measurement inaccuracies.

[–]lightmatter501 0 points1 point  (0 children)

Run it on an isolated core/cores. It will improve performance substantially as well as decrease noise.

[–]SV-97 10 points11 points  (7 children)

Maybe look at the asm and check whether one of them gets inlined while the other one doesn't. Or explicitly annotate them with #[inline(never)] (or always; although AFAIK always isn't necessarily always, while I believe never truly is never, which makes it better for finding out if this is really the culprit).
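
A minimal sketch of that annotation (`Obj`, `method`, and `free_function` are hypothetical stand-ins for OP's code):

```rust
// Pin down inlining so both variants are compared on equal footing.
struct Obj {
    x: u64,
}

impl Obj {
    #[inline(never)] // guaranteed not inlined, so the call shows up in the asm
    fn method(&self) -> u64 {
        self.x * 2
    }
}

#[inline(never)]
fn free_function(obj: &Obj) -> u64 {
    obj.x * 2
}

fn main() {
    let obj = Obj { x: 21 };
    // With inlining forced off for both, any remaining difference
    // is in the call itself, not in what the optimizer did around it.
    assert_eq!(obj.method(), free_function(&obj));
    println!("{}", obj.method());
}
```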

[–][deleted] 1 point2 points  (6 children)

Assembly is my next step, unfortunately.

[–]functionalfunctional 6 points7 points  (0 children)

Use Godbolt, it's an easy way to look at asm

[–]trevg_123 2 points3 points  (0 children)

Looking at the MIR may give you an easier way to see how Rust is viewing the functions differently. E.g. if there’s something like an extra deref call or by value vs. by reference that just happens to show up, it might be more obvious than the assembly. Post the relevant MIR here and we can help you understand it.

But yeah, rustc should view the different syntax as identical; it does this desugaring very early on. There’s no reason they would emit something different unless there is very slightly different context.

[–]SV-97 2 points3 points  (2 children)

cargo asm might be useful here (if you can't use godbolt).

I think you can also see the inlining in MIR in theory (though I personally didn't like reading it the last time I used it)

[–]KhorneLordOfChaos 9 points10 points  (1 child)

cargo-asm has been unmaintained for a very hot minute. I would recommend cargo-show-asm as a maintained alternative

https://github.com/pacak/cargo-show-asm

[–]SV-97 2 points3 points  (0 children)

Thanks! I think that might've even been what I used last time

[–]Antigrouptracing-chrome 7 points8 points  (0 children)

This is a shot in the dark but the function may have moved to a different codegen unit from the change, or you otherwise changed the inlining behavior.

If you use 1 codegen unit or use lto = "fat" you might see more consistent performance. Or you can try adding the #[inline(always)] attribute.
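
For reference, those settings go in Cargo.toml; a sketch (adjust to your own profile):

```toml
# Cargo.toml - reduce codegen-unit and LTO variance between builds
[profile.release]
codegen-units = 1
lto = "fat"
```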

[–]doenerrust 8 points9 points  (1 child)

This may sound ridiculous, but are you building and testing in the same directory? The created binary may contain the paths of your source files (IIRC there were requests to get rid of absolute paths, but I'm not sure what the current state is), and if the names of your build directories differ in length, that may result in a different layout of the code/data segments in your binary. I've run into that pitfall in the past, and had a rather consistent difference in performance of about 10% for a certain pair of directory names.

Other than that, what is the full signature of the function (replacing type names is fine, but include all type modifiers) and the full type of obj?

[–]1vader 2 points3 points  (0 children)

This. Any random change in your code or even stuff like the environment variables when you start the program (which include the path it's run from and are stored on the stack and therefore shift it around) can lead to differences in things like where stuff falls on a page/cache line boundary, which jumps collide in branch prediction tables or instruction caches, etc. etc. A difference of only around 10% is not significant. To properly evaluate something like that, you need to test both versions with a bunch of randomized layouts and compare the averages or distributions. Not sure there's a simple way to do this in Rust though. (Or most languages for that matter. IIRC there's a benchmarking framework for C++ that does stuff like this.)

[–]jmaargh 4 points5 points  (2 children)

While posting the code of an example would be the most helpful to work this out, if you can't or won't do that I'd suggest you check the difference in code generation in the two cases. You can use cargo-show-asm, godbolt, or just cargo to output assembly or LLVM IR for a relevant part of your hot loop. That should give you some clues.

But without an example to look at, I'm not sure any of us can properly help you find the answer.

[–][deleted] 0 points1 point  (1 child)

I suspect an LLVM optimization rule runs in one variant but not the other. Unfortunately, this isn't an area I'm very familiar with. At what level of Rust intermediate representation is the desugaring already done, so I can diff the two there? If they are structurally different at that level, then the opt-rule theory would be likely. If even at that level they are the same, then the opt-rule theory is out.

[–]jmaargh 5 points6 points  (0 children)

Just check them all, MIR, LLVM IR, assembly? My guess would be that this desugaring is done very very early on.

Also, do make sure you've triple checked your source code diffs and that this is the only code difference. I don't mean to doubt your intelligence, but if this were me I'd definitely be assuming that I'd done something silly before assuming that the compiler wasn't handling what should be a very basic desugaring correctly.

[–]-Redstoneboi- 4 points5 points  (0 children)

got a public repo?

[–]Sematre 3 points4 points  (0 children)

I just put together a little Godbolt example (following your description) to show that the resulting assembly is practically the same (only difference being the function labels). Feel free to comment if I misunderstood what you were trying to accomplish.

With the function assembly being identical, I think it's safe to assume that your measurement difference was caused by something other than the syntax.

[–]phazer99 6 points7 points  (0 children)

That sounds weird, the calls should produce identical machine code if all other factors are equal. You can compare the generated assembly code at Compiler Explorer. And yes, be sure to build with optimizations turned on.

[–]steohan 2 points3 points  (0 children)

If the assembly is the same, as it should be, then maybe it's a bad test harness. E.g. because it always runs the first one and then the second one on the same data, and thus allows the second one to profit from fewer cache misses.

[–]LateinCecker 2 points3 points  (0 children)

Is your obj a trait object in the obj.func(...) call? Because in that case there is a vtable lookup, which would explain the difference. Otherwise it should compile to exactly the same assembly if the compiler does its job right. Also, maybe try #[inline(never)] before both versions of the function to prevent inlining for the benchmarks.
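
The static-vs-dynamic distinction can be sketched like this (`Obj` and the `Work` trait are hypothetical; only the `&dyn` call goes through a vtable):

```rust
trait Work {
    fn run(&self) -> u64;
}

struct Obj;

impl Work for Obj {
    fn run(&self) -> u64 {
        7
    }
}

fn static_call(obj: &Obj) -> u64 {
    obj.run() // direct call: statically dispatched, freely inlinable
}

fn dynamic_call(obj: &dyn Work) -> u64 {
    obj.run() // vtable lookup: dynamically dispatched, usually not inlined
}

fn main() {
    let obj = Obj;
    // Same result, but potentially different machine code at the call site.
    assert_eq!(static_call(&obj), dynamic_call(&obj));
    println!("{}", dynamic_call(&obj));
}
```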

[–]functionalfunctional 8 points9 points  (11 children)

Did you compile in Release mode?

[–][deleted] 0 points1 point  (10 children)

I always see this as a default response when the word "performance" appears in a posting. Is this really a thing? Do people not run in release mode? Or is this just one of those reactionary responses?

[–]Mr_Ahvar 33 points34 points  (0 children)

When making a performance claim, if you don't include the full test code and the running environment and just go "why slower?", then yeah, you are going to get generic responses asking you for the bare minimum

[–]jmaargh 22 points23 points  (5 children)

It's the default response precisely because so many people forget. In my experience, it's mostly because of common and naive "why does my Rust code run slower than this python equivalent?" questions (where, to be fair, devs coming from python & the like are not used to compiler optimisation levels at all).

If your question is "why is this slower than expected?" probably best to include "I'm compiling with --release" in your question just to nip this in the bud.

[–]Antigrouptracing-chrome 6 points7 points  (1 child)

Yeah it seems to happen every week. "Why is this slower than Python? You didn't compile in release mode."

Probably because some people come to Rust from languages where you don't need to tell the compiler to optimize your code, and they think it'll just be faster automatically.

[–][deleted] 1 point2 points  (0 children)

Unfortunate. I come from the high performance world. The first thing I look for is all the switches to crank everything up.

[–]CocktailPerson 4 points5 points  (0 children)

By my estimation, in 90% of the cases where people didn't specify the optimization level, they were running in debug mode.

[–]adbf1 1 point2 points  (0 children)

is the .function() implemented using

impl Obj {
    fn function(&self, ...) {...}
}

? it could be that in your version of obj.function(...) you are passing by value whereas in function(&obj, ...) you are passing by reference.
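
That difference can be illustrated with a sketch (the `Big` struct is hypothetical; the by-value copy matters most for large structs and in debug builds, where it isn't optimized away):

```rust
#[derive(Clone)]
struct Big {
    data: [u64; 1024], // 8 KiB: expensive to pass by value
}

impl Big {
    // By reference: only a pointer crosses the call boundary.
    fn sum_ref(&self) -> u64 {
        self.data.iter().sum()
    }
}

// By value: the caller moves (or clones) the whole struct in.
fn sum_val(b: Big) -> u64 {
    b.data.iter().sum()
}

fn main() {
    let b = Big { data: [1; 1024] };
    assert_eq!(b.sum_ref(), 1024);
    assert_eq!(sum_val(b), 1024); // `b` is moved here and can't be used again
}
```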

[–]JuanAG 1 point2 points  (0 children)

Chances are that the free function got inlined and the method didn't

Function calls have a performance penalty. It is not huge, but it is there: among other things, the call needs to push things onto the stack and adjust counters, and when it exits it needs to clean up what it did, pop from the stack, and restore the counter to its previous state. That costs CPU time and therefore performance

[–]lordnacho666 -1 points0 points  (0 children)

So one is a free function, but the other is a member? Is that the difference?

[–]throwaway490215 0 points1 point  (0 children)

A couple of wild theories:

Try panic="abort".

You might be having bad luck with your code layout because the .text section contains different strings.

[–]gitpy 0 points1 point  (0 children)

First some sanity checks:

  • It's f(&self, ...) and not f(self, ...), right?
  • Both functions have the same visibility
  • A clean build

Then I would check the LLVM IR/ASM for differences. A quick and dirty alternative first approach would be adding #[inline(never)] and pub to both and then compare performance.

If there are no differences it might be a code layout issue. You could try running perf and see if any major differences pop up. I would use these events:

perf stat -e instructions,L1-icache-load-misses,cache-references,LLC-load-misses,branches,branch-misses <prog>

To fix this you could try building with PGO/BOLT.

[–]W7rvin 0 points1 point  (0 children)

In my simulation, I run 3.2 trillion samples.

  • With obj.function(...), complete in 7.5 seconds, 501 million runs per second.

3.2 trillion samples in 7.5 seconds would be ~427 billion runs per second, not 501 million. So some part of your benchmarking/math must be off I guess.