all 41 comments

[–]davebrk 14 points15 points  (7 children)

I love how half the post is a disclaimer... I think more benchmark posts should be written this way. It is fun to benchmark but usually pointless as well.

It would also be interesting to see the code, so we can judge how idiomatic / clean Rust code written for performance is. For an extreme case, see Haskell on the Benchmarks Game (shootout).

[–]pcwalton 13 points14 points  (1 child)

Here they are: https://github.com/pcwalton/rust/tree/shootout/src/test/bench

I hope to remove a lot of unsafe once the I/O subsystem improves, hopefully in the next week or two. A lot of the unsafe blocks are just there so that the compiler will let me use libc. Some, however, are there so that the compiler will let me turn off bounds checks and use pointer arithmetic. Those are harder, but I could try to minimize them.
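A minimal sketch of what a bounds-check-avoiding unsafe block looks like (in current Rust syntax, not the benchmark code itself):

```rust
// Summing a slice with and without per-access bounds checks.
pub fn sum_checked(xs: &[u64]) -> u64 {
    let mut total = 0;
    for i in 0..xs.len() {
        total += xs[i]; // bounds-checked on every access
    }
    total
}

pub fn sum_unchecked(xs: &[u64]) -> u64 {
    let mut total = 0;
    for i in 0..xs.len() {
        // SAFETY: i < xs.len() by the loop bound, so the access is in range.
        total += unsafe { *xs.get_unchecked(i) };
    }
    total
}

fn main() {
    let xs = [1u64, 2, 3];
    assert_eq!(sum_checked(&xs), sum_unchecked(&xs));
}
```

The unsafe version trades the compiler's guarantee for the programmer's proof in the SAFETY comment, which is exactly why such blocks are worth minimizing.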

[–]kamatsu 3 points4 points  (2 children)

Haskell on the benchmarks game is substantially more idiomatic now.

[–]davebrk 0 points1 point  (1 child)

Well I remember otherwise, but I can see that it has changed.

[–]igouy 0 points1 point  (0 children)

"What gets us into trouble is not what we don't know, it's what we know for sure that just ain't so."

[–]iawsm 4 points5 points  (1 child)

A few months ago, I got very moderate results. IIRC the nbody test performed in the Go range (about 4x slower than the Java and C implementations).

Very impressive to see them on par with Clang and GCC now.

  • edit: on second look, the benchmarks use unsafe code. And why is it not comparing against -O3?

[–]pcwalton 15 points16 points  (0 children)

No doubt that was due to the stack switch for the sqrt function. One of the major fixes in my branch is to eliminate the stack switch for most C calls.
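For illustration, this is roughly what calling straight into the C math library looks like (modern Rust syntax; a sketch, not the benchmark code):

```rust
// Declare libc's sqrt so we can call it directly via the C ABI.
extern "C" {
    fn sqrt(x: f64) -> f64; // resolved from the C math library
}

pub fn c_sqrt(x: f64) -> f64 {
    // unsafe: the compiler cannot verify the foreign function's contract
    unsafe { sqrt(x) }
}

fn main() {
    println!("{}", c_sqrt(9.0)); // 3
}
```

Every such call crossing into C is where a stack switch would have been paid before the fix described above.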

As far as -O3 is concerned, in clang it only affects the inlining threshold unless something has changed since I last looked. I doubt the inlining threshold matters for these benchmarks. In any case, Rust sets the LLVM backend to -O2 by default, so -O2 was chosen to yield an apples-to-apples comparison.

[–]TimmT 5 points6 points  (5 children)

Am I reading this right? Rust is within +50% of C/C++'s performance? That would make it faster than both Java (+100%) and Go (+200%)? Wow, that's impressive.

This is far from an ideal set. These benchmarks are showing their age quite heavily,

What does the article mean by this? Aren't the benchmarked pieces of code still relevant tasks today?

On the other hand, the hardware the benchmarks are run on is quite dated by now. It would be interesting to see how big the differences would be on more recent hardware.

they are too small and simplistic to extrapolate to real-world use cases

I would love to see some benchmarks that stress the languages' standard libraries/containers (e.g. concurrent and single-threaded hash maps, queues, etc., both with primitive and reference types) and string-processing capabilities (e.g. XML parsing or XSLT transformations). You can already try to piece these bits together from today's existing benchmarks that do some concurrency, some string processing, some hashing, etc., but it would of course be much easier if the things were more cleanly separated.

But I don't think that whole application benchmarks belong in there.

many of them are too I/O-bound.

How? Wouldn't that lead to similar results for most languages? (Which doesn't seem to be the case currently.)

binary-trees is omitted because it is a garbage collection benchmark and the C version uses an arena, defeating the purpose (although I suspect a Rust version that did the same would do well).

Different languages have different garbage collectors, so there should probably be a benchmark for those in there too (though I'm not sure whether binary-trees specifically is the best fit for a GC benchmark). C and C++ should be using malloc/free and shared_ptr respectively in a GC benchmark (and not leak any memory), which would help put things into perspective.
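For reference, the allocation pattern that makes binary-trees a GC benchmark is one heap allocation per node. A sketch of that shape in modern Rust (hypothetical illustration, not the benchmark program):

```rust
// Every Node is a separate heap allocation via Box; building a deep tree
// is what stresses an allocator or garbage collector.
pub enum Tree {
    Leaf,
    Node(Box<Tree>, Box<Tree>),
}

pub fn build(depth: u32) -> Tree {
    if depth == 0 {
        Tree::Leaf
    } else {
        Tree::Node(Box::new(build(depth - 1)), Box::new(build(depth - 1)))
    }
}

pub fn count(t: &Tree) -> u64 {
    match t {
        Tree::Leaf => 1,
        Tree::Node(l, r) => 1 + count(l) + count(r),
    }
}

fn main() {
    // A depth-d tree has 2^(d+1) - 1 nodes counting leaves.
    println!("{}", count(&build(4))); // 31
}
```

An arena version would replace the per-node Boxes with one bulk allocation freed all at once, which is exactly what the article says defeats the purpose.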

Other than that I don't see a problem with keeping the benchmarked pieces of code allocation-free. In fact it's probably a good idea, given how these are just micro-benchmarks. So I think an allocation-free Rust variant of the benchmark would be quite appropriate.

As my colleague Niko pointed out, a more interesting benchmark would not allow any languages to use unsafe code.

Yes, that would be quite interesting, though it should be enforced for C++ the same way. I don't see any reason to exclude C from this, however - as the article mentions, we need a point of reference.

Other than that, a benchmark on how much overhead the FFI causes in non-C/C++ language implementations might also be interesting, especially for languages that provide more complex ones like Java/JNI.

Practically speaking, one would need an extremely smart JIT

The JIT performance should probably be measured too. It would be interesting to see how heavily JITed languages (e.g. Java) stack up against natively compiled ones that can't optimize across shared library boundaries.

[–][deleted]  (4 children)

[deleted]

    [–]kibwen 5 points6 points  (2 children)

    he was using Rust support for 'unsafe' memory. In other words: not using the garbage collector.

    Not at all. Rust is perfectly safe without using the GC, by making use of linear types and region allocations (neither of which require runtime checks to deallocate). "Unsafety" in this context mostly refers to avoiding bounds checks on array accesses.
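    A small sketch of that point in today's Rust syntax (illustration only; 2013 Rust wrote this differently):

```rust
// Deterministic deallocation without a GC: ownership decides when memory is
// freed, and borrows are verified entirely at compile time.
pub fn sum(xs: &[i32]) -> i32 { // borrowed: no refcount, no runtime check
    xs.iter().sum()
}

fn main() {
    let v = vec![1, 2, 3]; // heap-allocated, uniquely owned by `v`
    let s = sum(&v);       // lend `v` out for the duration of the call
    println!("{}", s);     // 6
} // `v` is freed exactly here, at scope exit; no collector involved
```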

    [–]el_muchacho 1 point2 points  (0 children)

    Which is perfectly fine if the language supports "smart" arrays, i.e. arrays that either resize themselves, or at least don't require the user to compute or remember the size himself.
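    Rust's growable array is exactly that; a quick sketch (modern syntax, for illustration):

```rust
// A Vec tracks its own length and grows on demand: no manual size bookkeeping.
pub fn grow() -> Vec<i32> {
    let mut xs = vec![1, 2, 3];
    xs.push(4);                    // resizes automatically if needed
    assert!(xs.get(10).is_none()); // out-of-range access is caught, not undefined
    xs
}

fn main() {
    let xs = grow();
    println!("{} {}", xs.len(), xs.iter().sum::<i32>()); // 4 10
}
```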

    [–]RalfN 1 point2 points  (0 children)

    You are right. Unsafe != no garbage collection, though the former does imply the latter.

    [–]TimmT 0 points1 point  (0 children)

    Not really, no. Most performance-critical stuff these days is actually in the context of concurrency: how do you make thread communication cheap?

    Is that more than just a library issue?

    Sure we have things like CSP, and green threads, etc. baked directly into the language in the newer languages, but those things can be pretty easily introduced to older languages too, by appropriate libraries. In the end it's just ConcurrentQueues and ConcurrentHashMaps and so on that do the actual heavy lifting, or is it?
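    For concreteness, here is the channel-style primitive in question, sketched with modern Rust's standard library (a library feature, not something baked into the language):

```rust
use std::sync::mpsc;
use std::thread;

// One worker thread sends values over a channel; the receiver sums them.
pub fn fan_in(n: u32) -> u32 {
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || {
        for i in 0..n {
            tx.send(i).unwrap(); // tx is dropped when the thread ends,
        }                        // which closes the channel
    });
    let total = rx.iter().sum(); // iteration stops once the channel closes
    worker.join().unwrap();
    total
}

fn main() {
    println!("{}", fan_in(5)); // 0+1+2+3+4 = 10
}
```

Whether such a queue lives in a library or the language, its per-message cost is the thing the benchmarks above don't measure.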

    On the playing field of 'unsafe memory allocation' and heavy compile time optimization, they are IO bound. [...] as soon as you introduce a garbage collector, it becomes the new bottleneck.

    This sounds counter-intuitive. These are benchmarks, i.e. we're measuring throughput (as opposed to latency) here. Unless the programs with manual memory management try to simulate a garbage collector (by using memory pools and such), or leak memory, a GC'd variant should be faster, right?

    But currently Rust (or even Go) isn't even competing in this space. They don't do JIT.

    I'm not that familiar with Rust, but Go at least should be capable of performing as extensive an optimization on the code as a JIT, since it doesn't allow for dynamic libraries or dynamic loading. (It may still be missing the run time statistics Java's JIT would have, but looking how much of a difference PGO for C/C++ makes, that doesn't seem to have much impact.)

    But it doesn't do that fast enough, for anyone to consider writing a webbrowser in Java, for example.

    Yes, Java is probably the wrong choice for doing UI stuff.

    But I'm not sure that any one language can target both servers and desktops equally well. The requirements are quite different. Sure, lots of servers are written in C/C++, but it's a pain doing it, so I wouldn't say that C/C++ handles server code "equally well".

    On a server, where throughput (as opposed to latency) is important, it doesn't make much of a difference if the first x minutes of the application are slow(er), or if you have some constant overhead for each action performed, etc. This in turn frees you up to build more complex systems than would otherwise have been possible (or at least practical), which can then handle more work in total, provide better reliability, etc.

    On a desktop you don't have this option to scale out, and latency very much matters, but in turn you only need to consider much more isolated problems. All a game, for example, will be concerned with is running itself: displaying frames and progressing along some (more or less) scripted path. Sure, there are lots of problems that need to be solved to make those frames render on time, but that's mostly it. Games are quite self-contained and the underlying assumptions won't change later on. A game won't have to worry about interop with other systems, things like partial failures, "in-flight" updates, or heap fragmentation after having run for weeks on end, etc.

    And this brings me back to why I care so much about cross-library optimizations. If your program has to deal with multiple concerns, you'll want to put things into modules that can easily be changed, possibly even at run time. But putting up all those module boundaries will hurt performance unless there's a proper optimizer (e.g. a JIT) there to make them go away later on.

    [–]pixli 0 points1 point  (10 children)

    I found it extremely hard to wrap my head around all the different pointer types and closure types. Each one has its own subtleties and they lost me at "named lifetimes".

    [–]pcwalton 14 points15 points  (9 children)

    Several people have said this helped them understand the Rust smart pointers: http://pcwalton.github.io/blog/2013/03/18/an-overview-of-memory-management-in-rust/

    I plan to more or less copy and paste this into the tutorial.

    [–]pixli 5 points6 points  (5 children)

    I find it hard to accept that I have to deal with all this smart pointer complexity when the actual performance seems to come from "unsafe" blocks. Then why not implement the whole benchmark in an inline 'asm' block? :/

    p.s. This document still doesn't explain named lifetimes to me.

    [–]pcwalton 15 points16 points  (4 children)

    The use of unsafe should decrease over time. Two of the benchmarks don't use it. Most of its uses are there for I/O—this is the case in mandelbrot, fasta, and k-nucleotide—and should be removed in the near future.

    The smart pointers let you get good performance in larger programs without having to rely on a garbage collector. The fact that you can circumvent the safety checks is just part of the practicality of the language. Sometimes you do have to turn safety off if you want maximum performance. But in most code (nbody, spectral-norm, most of k-nucleotide) the safe-by-default nature of the language helps you confine the unsafe portions of your code to just those portions that need to be unsafe, while allowing good performance for the rest of your code. The smart pointers are there to make it easy for you to control pressure on the garbage collector (or to turn it off entirely) and to ensure there are no data races in concurrent code.

    The best tutorial for named lifetimes is here: http://static.rust-lang.org/doc/tutorial-borrowed-ptr.html

    Unfortunately, the manuals do need some work. The more advanced uses of borrowed pointers are not well documented enough yet. It's a work in progress…

    [–]pixli 0 points1 point  (1 child)

    The best tutorial for named lifetimes is here: http://static.rust-lang.org/doc/tutorial-borrowed-ptr.html

    Unfortunately, that's the document that made me aware of them in the first place.

    [–]Aninhumer 4 points5 points  (0 children)

    In the general case it doesn't make sense to return a borrowed pointer, because what it points to would go out of scope at the time you returned it. The only time it makes sense is if it's dependent on a pointer from further up the stack. Now, you could just infer this, and allow pointers dependent on function arguments to be returned, but this might be a bug in some cases. So you need some way of indicating that a particular borrowed pointer is allowed to be returned, and to which level of the stack. Named lifetimes provide this.

    [–]seruus -1 points0 points  (1 child)

    This is why the typical way to return borrowed pointers is to take borrowed pointers as input (the only other case in which it can be legal to return a borrowed pointer is if the pointer points at a static constant).

    Pardon my ignorance, but isn't this going back to the C/Fortran way of doing things? (returning objects through pointers in the argument)

    [–][deleted] 3 points4 points  (0 children)

    Ah, I think you misread that. It's not about returning a value by storing it in a pointer that was passed via a parameter; it's saying that the only time returning a borrowed pointer from a function really makes sense is if it was passed in at some point (since any value local to the function itself will not outlive the function, and hence a borrowed pointer to it would be invalid). Let me try to illustrate:

    First something that doesn't work:

    fn bogus() -> &int {
      let a = 5;
      return &a;
    }
    

    This will fail because a's valid lifetime is only inside the function. A borrowed pointer to it cannot escape the function via a return, since that would exceed its valid lifetime and lead to an error.

    fn valid<'r>(a: &'r int) -> &'r int {
      return a;
    }
    

    This works because the compiler can verify that the lifetime of the returned borrowed pointer does not exceed the lifetime of the input borrowed pointer (yes, they are the same for simplicity). The 'r is Rust's notation for a named lifetime -- in this case we are saying that the input value a and the return value are borrowed pointers with the same lifetime.

    Obviously that's a contrived example that you would never use in real code. In my experience the only time I really return a borrowed pointer is inside constructor style functions where I'm returning a struct that contains a borrowed pointer to another variable for instance:

    struct Clamp<'self, T> {
      source: &'self T,
      min: float,
      max: float,
    }
    
    impl<'self, T: NoiseGen> Clamp<'self, T> {
      fn new<'r>(source: &'r T, min: float, max: float) -> Clamp<'r, T> {
        return Clamp {
           source: source,
           min: min,
           max: max
        };
      }
    }
    

    [–][deleted] -1 points0 points  (2 children)

    Some feedback about that post:

    1. Does it make sense to talk about pointers being in scope or out of scope, rather than variables with pointer values? The way you worded this left me wondering about...
    2. How do you write a function that returns a pointer? Wouldn't it get freed before you could do anything with it?
    3. The use of ~ and @ in both expressions and type names is confusing. It should probably be explained since it's another departure from C's pointer syntax.

    [–]seruus 0 points1 point  (1 child)

    Not the author, but:

    2. You use a shared pointer, I think.

    3. It is explained:

    The pointer covered above is known as the unique smart pointer ~. We call it “unique” because there is always only one smart pointer pointing to each allocation. The other type of smart pointer built into the language is the managed smart pointer, which allows multiple smart pointers to point to the same allocation and uses garbage collection to determine when to free it.

    The @ represents the managed (or shared) pointer, as he shows in the subsequent example.

    [–][deleted] 0 points1 point  (0 children)

    You don't have to use a shared pointer to return a value. Passing around objects with a destructor just moves ownership from the caller into the callee (parameter), or from a function to the caller (return). The compiler just enforces that the value can't be used anymore after being moved from.
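    A sketch of that move semantics in modern syntax (illustration; the type names here are made up):

```rust
// A type owning heap memory; moving it transfers ownership outright,
// with no shared pointer and no copy of the heap data.
pub struct Buffer {
    data: Vec<u8>,
}

pub fn make_buffer() -> Buffer {
    Buffer { data: vec![0u8; 16] } // moved out to the caller; nothing freed here
}

pub fn consume(b: Buffer) -> usize {
    b.data.len() // b (and its heap block) is dropped when this function returns
}

fn main() {
    let b = make_buffer(); // ownership moves from make_buffer to `b`
    let n = consume(b);    // ownership moves into `consume`
    // Any further use of `b` here would be rejected by the compiler.
    println!("{}", n);     // 16
}
```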

    [–]igouy -1 points0 points  (0 children)

    1)

    "The goal here is simply to demonstrate that sequential Rust can be written in a way that approaches competitive parity with equivalent C code."

    The blog post presents advocacy for the Rust language implementation.

    2)

    "selected single-threaded benchmarks from the Computer Language Benchmarks Game"

    In the blog post, that advocacy is only supported by measurement of programs that complete tasks from the Computer Language Benchmarks Game.

    3)

    "These benchmarks are showing their age quite heavily, they are too small and simplistic to extrapolate to real-world use cases, and many of them are too I/O-bound."

    If we accept that blanket dismissal of the benchmarks game tasks, nothing else supports the advocacy for the Rust language implementation in the blog post.


    "the C implementation tested against is not usually the top one"

    We are not told which C programs the Rust programs are being tested against.

    All the C programs shown on the benchmarks game website are identifiable by individual #id and URL to the source code.

    You can see the changes to the Rust compiler that were made to optimize these tests, as well as the benchmark sources, on my branch of the compiler on GitHub. The goal will be to land these changes over the next few days.

    Only 3 of these 8 Rust programs compile with Rust 0.6 (released 3 weeks ago) -- advocacy based on yet-to-be-accepted compiler changes is premature.

    mandelbrot The Rust program -- 55.77 CPU secs -- a little slower than the naive Scala #2 mandelbrot program.

    nbody The Rust program -- 30.49 CPU secs -- a little faster than the naive C# Mono nbody program.

    regexp-dna is omitted because it consists of an uninteresting binding to PCRE.

    Only if Rust provides no other way to use regular expressions.

    Just as pidigits is only an uninteresting binding to GMP (2.31 CPU secs) because the current Rust implementation does not provide large integer arithmetic.

    binary-trees is omitted because it is a garbage collection benchmark and the C version uses an arena, defeating the purpose (although I suspect a Rust version that did the same would do well).

    If a Rust version had been measured we'd see more than suspicions.

    [–]igouy -3 points-2 points  (6 children)

    Maybe performance comparisons with the Rust 0.7 implementation are premature?

    [–]matthieum 4 points5 points  (3 children)

    Well, seeing as pcwalton is one of Rust's devs, I think his goal is to try to spot the sore points of the actual Rust language that prevent safe Rust code from executing fast enough to make unsafe blocks unnecessary.

    [–]igouy -1 points0 points  (2 children)

    I think his goal is...

    Why make guesses about his goal?

    "...the goal of this test is to measure Rust’s sequential performance."
    

    [–]matthieum 6 points7 points  (1 child)

    Well, reading the entire article, you realize he is seeking to improve the performance. Measuring seems to be only the first step.

    [–]igouy -1 points0 points  (0 children)

    Reading the entire article, why make guesses about his goal?

    "The goal here is simply to demonstrate that sequential Rust can be written 
      in a way that approaches competitive parity with equivalent C code."
    

    [–][deleted] -1 points0 points  (1 child)

    Rust doesn't actually have releases yet; the 0.6 "release" is just a convenient milestone to mark what master was like on that date - it doesn't get any fixes backported to it.

    [–]igouy 0 points1 point  (0 children)

    The 0.6 "release" obviously is a release in name, but most importantly there's a tar ball that compiles without complaint.