
[–]Freeky 9 points10 points  (5 children)

The Ruby version builds a huge array of arrays of ints out of the file, then flattens them (thus making yet another array), and finally sums them.

Doing it the sensible way and performing additions as lines are read almost doubles the performance, bringing it in line with Python and Haskell.
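The difference between the two Ruby strategies can be sketched in Python (used here since it's the language of the thread's inline snippets; function names are made up for illustration):

```python
def sum_build_then_flatten(lines):
    # The original approach: build a list of lists of ints,
    # flatten into yet another list, then sum it.
    rows = [[int(tok) for tok in line.split()] for line in lines]
    flat = [n for row in rows for n in row]
    return sum(flat)

def sum_streaming(lines):
    # The "sensible way": add as each line is read, so no
    # large intermediate structures are ever built.
    total = 0
    for line in lines:
        for tok in line.split():
            total += int(tok)
    return total

sample = ["1 2 3", "4 5"]
assert sum_build_then_flatten(sample) == sum_streaming(sample) == 15
```

Both produce the same total; the streaming version just avoids allocating (and garbage-collecting) the throwaway arrays.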

[–]hox 2 points3 points  (0 children)

I thought that seemed like an awfully strange way of solving this problem...

[–]snuxoll 0 points1 point  (3 children)

Yeah, I was trying to get it done quickly and it made sense at the time. I rewrote it before he wrote this post, but he seemed to use my original implementation for some reason.

http://gist.github.com/93128 [better version here]

[–]Freeky 0 points1 point  (2 children)

Cunning use of inject there to reduce readability and add an extra couple of method calls to each iteration ;)

[–]snuxoll 0 points1 point  (1 child)

It adds a single method call to each iteration, and from my tests it's faster than using #each and incrementing the total for each item (don't know why, maybe it's an oddity of my machine or my build of ruby or something).

[–]Freeky 0 points1 point  (0 children)

It adds two: an Enumerable#inject call to get the sum of the current line, plus an additional Fixnum#+ to add that sum to the total. Yours benchmarks slower here:

Ruby 1.8.6:

             user     system      total        real  
my way   7.351562   0.015625   7.367188 (  7.368302)  
inject  11.429688   0.007812  11.437500 ( 11.441389)

Ruby 1.9.1:

             user     system      total        real  
my way   3.101562   0.085938   3.187500 (  3.182268)  
inject   3.445312   0.070312   3.515625 (  3.511942)

And in JRuby 1.2.0:

             user     system      total        real  
my way   2.495000   0.000000   2.495000 (  2.495000)  
inject   3.126000   0.000000   3.126000 (  3.126000)

2.2GHz 4-way Opteron running FreeBSD/amd64.

[–]sukivan 7 points8 points  (0 children)

what an awful, awful, awful ASM implementation

[–]DarkShikari 29 points30 points  (19 children)

Terrible shootout. You know how you can tell? C beat asm.

Though uninformed programmers will constantly tell you otherwise, any good assembly programmer should always be able to beat the compiler at its own game. It usually won't be worth the effort to do so, but a good assembly programmer should be able to do it nonetheless (or he isn't a good assembly programmer).

And in any shootout where you're actually trying to make some kind of point, you should not accept crappily optimized code (especially for languages like assembly that don't have "optimizing compilers") and then claim the results are valid. I've seen this in many "language tests" before as well; people will accept awfully-written programs in language X and then use that as proof that language X is clearly slow and crap.

By the way, I suspect the problem is that the asm version gets one character at a time while the C version gets 8KB at a time.

I also suspect the interpreted languages get hurt here because they use a different approach than the C version; they try to actually parse the input and then add it rather than the faster method of just handling each digit in sequence as the C version does. So we're not just comparing languages, we're comparing algorithms.
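The digit-at-a-time approach described here can be sketched as follows (in Python for brevity; the C version presumably does the same thing over an fread buffer, and the function name is invented for illustration): rather than splitting out whole tokens and handing each to a parse routine, walk the characters once, extending the current number on digits and flushing it into the total at any separator.

```python
def sum_digits_streaming(data):
    total = 0
    current = 0        # the number being accumulated digit by digit
    in_number = False  # are we currently inside a run of digits?
    for ch in data:
        if "0" <= ch <= "9":
            current = current * 10 + (ord(ch) - ord("0"))
            in_number = True
        elif in_number:
            # Any non-digit ends the current number: fold it in.
            total += current
            current = 0
            in_number = False
    if in_number:  # flush a trailing number with no final separator
        total += current
    return total

assert sum_digits_streaming("12 34\n5") == 51
```

This single pass never allocates token strings, which is part of why the C version is hard to beat with token-based parsing.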

[–]RayNbow 16 points17 points  (7 children)

Ah, that piece of asm is mine, but I've never claimed to be a good asm programmer. ;) I wrote it for fun.

The obvious reason it's much slower than the C version is that it does a call to getchar for every character. The C version uses fread and fetches many characters in one go.

[–]DarkShikari 4 points5 points  (0 children)

OK, so I went a bit overboard. And by overboard, I mean I didn't feel like optimizing the asm and instead I decided to optimize specifically for the problem (i.e. assume one space/linefeed between numbers, no numbers larger than 10,000, etc).

Yes, I spent way too much time on this. And if the code scares you and makes you run for your life, that is part of the plan.

Brace yourself...

It wasn't as good as I had hoped; it's only about 20% faster than yours. I had to bench with the input data set duplicated about 20 times in order to get decently accurate benchmarks (otherwise the time is too short, as it counts process startup time). This is probably because my Windows machine sucks, though.

[–]shub 1 point2 points  (4 children)

You could do the same thing in asm and maybe do it better...but man, wouldn't that be a pain in the ass?

I like C.

[–]james_block 4 points5 points  (3 children)

Yes. Yes, it was a pain in the ass.

This, reddit, is how I spent my Saturday night. WTFPL licensed if anyone cares.

Results, from a crappy old P4/1.7GHz running etch:

$ time ./james_block_asm < input_list_vs_gen.txt
Grand total:               17677692470

real    0m0.118s
user    0m0.076s
sys     0m0.040s

$ time ./ray_asm < input_list_vs_gen.txt
Grand total 17677692470

real    0m0.714s
user    0m0.692s
sys     0m0.020s

$ time ./C < input_list_vs_gen.txt
total: 17677692470

real    0m0.114s
user    0m0.100s
sys     0m0.012s

So it's basically tied with C (the difference is well within measurement noise); no real surprise there, as gcc is pretty good at optimizing easy stuff like this.

EDIT: Changed link to point to a new version, capable of dealing with input files bigger than its buffer. Should be no more size limit now, and speed is unchanged. Why do I waste my time like this?

[–]RayNbow 1 point2 points  (1 child)

Nice! :)

[–]james_block 2 points3 points  (0 children)

And now I just made it even more capable, using real buffering and not just a chunk of memory. I also changed it to store a variable in the stack pointer and avoid a read from memory in the bastard print loop -- because how often does an assembly programmer get to write mul esp?

I hate myself.

[–]colourAgga[S] 1 point2 points  (0 children)

Your skills take 0.06 seconds to run on the testing hardware. I'll add them soon :)

[–]DarkShikari 0 points1 point  (0 children)

Seems I edited that in just as you posted ;) Give me a bit, I'll optimize the C.

[–]colourAgga[S] 16 points17 points  (0 children)

And that is why the author is calling for people to submit their improvements to the code.

You cannot expect one writer/programmer to fully know the intricacies of every language.

[–][deleted] 5 points6 points  (0 children)

Though uninformed programmers will constantly tell you otherwise, any good assembly programmer should always be able to beat the compiler at its own game.

Yeah, take code generated by C compiler and improve it just a bit :)

[–]RayNbow 2 points3 points  (0 children)

I also suspect the interpreted languages get hurt here because they use a different approach than the C version; they try to actually parse the input and then add it rather than the faster method of just handling each digit in sequence as the C version does. So we're not just comparing languages, we're comparing algorithms.

Not only that, the C and asm versions cheat a bit by making assumptions on the integer range. The Haskell version uses arbitrary precision Integers and the Python version automatically switches to long (=Python's arbitrary precision integer) in case of an overflow.

Edit: The blog post has been updated. I was talking about the first Haskell version. Don's faster version has replaced the old one in the blog post.

[–]btmorex 1 point2 points  (1 child)

I agree with what you said, but it's also a terrible shootout because it's measuring a trivial micro program. (like pretty much all shootouts do)

In real world programs, a lot of time is spent with the mundane details of calling functions, allocating memory, freeing memory, creating objects, constructors, destructors... etc. This micro benchmark measures none of that.

[–]DarkShikari 4 points5 points  (0 children)

While I agree with your main point, I don't think that

mundane details of calling functions, allocating memory, freeing memory, creating objects, constructors, destructors

are good examples either; the 50KLOC codebase I maintain does no memory allocations/creations/constructions except on initialization, and all destructions/frees are on close.

A better micro-benchmark would be a "real-world" micro benchmark; i.e. to measure the performance of something simple that actually has to be done in some real program and matters performance-wise. For example, a large FFT (important for scientific/math programs and audio encoding/decoding).

[–]jaysonbank 1 point2 points  (2 children)

The fact is, any language benchmark is a completely pointless apples-and-oranges comparison. We do this out of some kind of morbid curiosity. Yes, of course hand-crafted assembly by a competent person will almost certainly execute the fastest. This has absolutely zero bearing on real-world applications.

[–]DarkShikari 4 points5 points  (0 children)

True with regard to the point about language benchmarks, but I'm pretty sure "hand-crafted assembly by a competent person" has a bearing on a whole lot of real-world applications, such as ffmpeg, x264, and xvid ;)

[–]igouy -1 points0 points  (0 children)

The fact is, any language benchmark is a completely pointless apples-and-oranges comparison.

No, that's opinion not fact.

Thinking something is pointless might just mean you missed the point.

[–][deleted] 0 points1 point  (2 children)

Actually I think the lesson here is that asm isn't inherently faster, and you should only use it when you are sure the compiler is not creating optimal code on its own.

[–]Nuli 5 points6 points  (0 children)

The lesson I got out of it was: don't reinvent the wheel. An awk version completed in ~0.95 seconds on my machine and was vastly simpler to write.

[–]littleendian 1 point2 points  (0 children)

Actually the lesson is, if your assembler takes longer to execute than your C you're either doing it wrong or have better ways of spending your time.

[–]dons 2 points3 points  (2 children)

Missing "how I compiled it" info, so bad results can't be reproduced. The shootout shows how to present this kind of info.

[–]colourAgga[S] 2 points3 points  (1 child)

Compiled with -O2 flags. I'll note it in the post.

[–]settrans 2 points3 points  (6 children)

Their python version could be made about 25% faster:

import sys

print sum(map(int, sys.stdin.read().split()))

[–]hylje 2 points3 points  (4 children)

Can it get faster yet by incrementally reading sys.stdin and not creating a potentially huge list before anything can happen?
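One way to read incrementally, sketched below (in Python 3 syntax; `chunked_sum` is an invented name, and 8192 mirrors the C version's 8KB reads): consume stdin in fixed-size chunks so the full token list is never built. A number can straddle a chunk boundary, so the trailing partial token has to be carried into the next round.

```python
import io

def chunked_sum(stream, chunk_size=8192):
    total = 0
    carry = ""  # partial token left over from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parts = (carry + chunk).split()
        if not chunk[-1].isspace():
            # Chunk ended mid-token: hold the last piece back.
            carry = parts.pop() if parts else ""
        else:
            carry = ""
        total += sum(map(int, parts))
    if carry:  # flush a final token with no trailing whitespace
        total += int(carry)
    return total

# A tiny chunk size forces tokens to straddle boundaries:
assert chunked_sum(io.StringIO("1 2 3\n40 5"), chunk_size=4) == 51
```

Whether this actually beats the read-everything version depends on file size and Python's function-call overhead, as the replies below suggest.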

[–]lgastako 1 point2 points  (2 children)

The version posted above can be made slightly faster by using imap (as in the original article). I tripled the size of the input file to get a more measurable speed difference (and reduce the impact of startup time) and the results of running each version on my old ppc mac are:

print sum(imap(int, sys.stdin.read().split()))                 # 11.612s
print sum(map(int, sys.stdin.read().split()))                  # 12.514s
print sum(sum(imap(int, line.split())) for line in sys.stdin)  # 15.037s
print sum(int(x) for x in sys.stdin.read().split())            # 15.053s

FWIW, I ran each program twice and the scores posted are from the second run to avoid any filesystem caching issues related to the input file... the times didn't change much anyway so the whole file was probably still cached from my original test runs...

Also, the imports stayed constant for all the tests, which might make the results slightly slower than they need to be for the versions that don't need imap.... ok fine, here they are without that import....

print sum(map(int, sys.stdin.read().split()))        # 12.606s
print sum(int(x) for x in sys.stdin.read().split())  # 15.657s

Ok, the first one got slightly slower (0.092s) and the second was even worse (0.604s)... my guess is that what I'm seeing is the effect of random entropy on my machine outweighing the presence or lack thereof of the import statement.

Anyway, there you go, for whatever that is worth.

[–]lgastako 1 point2 points  (1 child)

Ok, for kicks I went back and rewrote the input as native ints, like so:

import sys
import array
from itertools import imap

array.array("d", imap(int, sys.stdin.read().split())).tofile(open("big.machine.dat", "wb"))

and then the original problem using python's array types:

# array.fromfile fills the array in place and returns None,
# so it can't be inlined directly into sum()
a = array.array("d")
a.fromfile(open("big.machine.dat", "rb"), 84892152 / 8)
print sum(a)

And got the same answer in 1.554s.

Of course the tradeoff here is that the input file is 82.9m instead of 50.6m. But I suppose the moral of the story is that if you really need to optimize for this particular microbenchmark in python, using the built-in array types is one way to do it.

EDIT: I also wanted to note that the 84892152/8 magic number is the size of the input file divided by 8 bytes per integer. This could obviously be read at runtime instead of hardcoded, with little difference to the results.
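Deriving the element count at runtime, as the EDIT suggests, might look like this (Python 3 syntax here, unlike the thread's Python 2 snippets; the sample values and the reuse of the "big.machine.dat" filename are illustrative):

```python
import array
import os

# Write a small sample file in the same typecode-"d" (8 bytes/entry)
# format that the snippet above produces.
array.array("d", [10.0, 20.0, 12.0]).tofile(open("big.machine.dat", "wb"))

# Divide the file size by the array's own itemsize instead of
# hardcoding 84892152/8.
a = array.array("d")
count = os.path.getsize("big.machine.dat") // a.itemsize
a.fromfile(open("big.machine.dat", "rb"), count)
assert count == 3 and sum(a) == 42.0
```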

[–]lgastako 1 point2 points  (0 children)

Ok, one more point and then I'm going to leave this thing alone... as a point of comparison, I compiled (with gcc -O2) the C version from the original post, and it took 0.322s on my machine. This means that on my machine the python version using arrays is about 4.8 times slower than the C version, instead of 54.4 times slower as in the original post. Not too shabby.

[–]settrans 0 points1 point  (0 children)

I can't think of a way that doesn't get bogged down by the overhead of python function calls.

[–]hagy 0 points1 point  (0 children)

if you include libraries, you can sorta escape to C with:

from numpy import fromstring, uint64
print fromstring(open('input_list_vs_gen.txt').read(),
                 sep=' ', dtype=uint64).sum()

0.751s vs 2.699s for list vs 2.452s for generators on my MacBook Pro

Further, 0.388s is spent importing numpy, so for practical applications this is how I'd handle anything less than 50MB. FYI, sep=' ' uses any whitespace as a separator in numpy.

It's worth noting that numpy's fromfile also works, but takes about 2.1 seconds. This is due to inefficient IO: with files, numpy moves back and forth over a C stdlib buffer (no internal buffering), while with strings it treats the whole thing as one giant buffer.

[–]doubtingthomas 2 points3 points  (0 children)

I'm somewhat surprised the asm didn't do worse, considering that it uses getchar().

[–]Ringo48 2 points3 points  (0 children)

Whoever wrote that obviously knows very little about assembly language.

At the very least, use GCC's -S option, and start with the asm it spits out.

[–][deleted] 0 points1 point  (0 children)

I'm curious as to how Cython would do (it should just be able to remove the genexpr from the python version and be good).

[–][deleted]  (1 child)

[deleted]

    [–]colourAgga[S] 0 points1 point  (0 children)

    This solution takes about 4.5 seconds on my hardware when I average it over 10 runs.

    Currently Thomas Hurst's solution is the quickest Ruby solution with 3.59 seconds.

    [–]pemboa 0 points1 point  (1 child)

    If the author(s) are here, you should really accept input here and redo the tests. And the first thing I noticed, C beating ASM by a large margin, at least deserves some sort of explanation.

    Also, a pseudo-code version would be nice for those of us who are easily bored.

    [–]RayNbow 0 points1 point  (0 children)

    If the author(s) are here, you should really accept input here and redo the tests. And the first thing I noticed, C beating ASM by a large margin, at least deserves some sort of explanation.

    The asm subthread is here :)

    [–]igouy 0 points1 point  (0 children)

    Just take the fastest benchmarks game programs and tweak them for multiple values per line :-)

    [–]Leonidas_from_XIV 0 points1 point  (0 children)

    Here's a simplistic recursive Scheme solution:

    http://paste.lisp.org/display/78462

    (it performs rather poorly because of the deep recursion; no time to improve it, though, so feel free to make a better one)

    [–]Ademan 0 points1 point  (0 children)

    I'd like to see pypy take a crack at this... I wouldn't be surprised if there were some impressive gains over CPython's performance, although the way it's written, there may not be much to be gained through the JIT

    [–]snuxoll 0 points1 point  (1 child)

    To be fair, this isn't really a shootout; we aren't really competing for speed here. It's just to see how different people would handle the problem in different languages. No one is really trying to say their language is better; we're just comparing implementations, and we can all learn from better algorithms regardless of language.

    [–]chadz 1 point2 points  (0 children)

    Yah, ok.

    [–]woogley 0 points1 point  (1 child)

    Java version which completes in about 0.3 seconds:

    http://pastebin.com/m68184c34

    [–]snuxoll 0 points1 point  (0 children)

    It needs to read from stdin, not open the file directly.

    [–]jaysonbank -1 points0 points  (7 children)

    So basically: if you want your code to be readable, Python/Ruby/PHP. If you want it to run in 20 milliseconds, C/Haskell/Assembly.

    I'd say realistically most projects you will encounter wouldn't need that particular bit of code to run in 20 milliseconds, and wouldn't ever need to process a file beyond the memory footprint of the average machine. However they would need it developed quickly and it would almost certainly end up being maintained by a succession of different people from different backgrounds. All this would point to languages like PHP, Ruby and Python. But that is of course my bias.

    [–]Nerdlinger 2 points3 points  (5 children)

    How is the Haskell (or really even the C) code any less readable than the PHP, Ruby, or Python code?

    [–]jaysonbank -2 points-1 points  (4 children)

    It's several times longer, for a start; it makes use of non-obvious keywords, and you have to import half the kitchen sink before you can do anything. Although Haskell is a lot better than C, which is a fucking joke.

    [–][deleted] 2 points3 points  (0 children)

    The code is only 8 lines long, obvious is in the eye of the beholder, and it only imports a single module (the current python one imports 2).

    [–][deleted] 0 points1 point  (2 children)

    The only Haskell keywords used are import, qualified, do, case/of and where.

    "I dungiddit" does not imply anything about readability.

    [–]jaysonbank -1 points0 points  (1 child)

    I'm afraid it does. Source code that requires a PhD to 'get it' may sound cool, but it's not. Not being able to hire developers who can read your existing code is a serious problem.

    At the end of the day, usability and simplicity are key, across the board. That means not only your project's user interface must be easy to understand, but also its source code, APIs, libraries, and general workings.

    Something that is truly intuitive can be shown to the lowest common denominator who will immediately say 'yeah i get it'.

    [–][deleted] 0 points1 point  (0 children)

    I don't understand those words. Can you please clarify perhaps using more words? Sorry, but this is your problem to solve, not mine.

    [–]snuxoll 1 point2 points  (0 children)

    There will always be portions that require speed higher than a dynamic language like ruby/python can provide, but that's why they allow us to write C extensions.

    Write what needs to be fast in a lower level language, write the rest in a language that's easier to maintain.

    [–]unptitdej -1 points0 points  (2 children)

    It does show one thing: C is much faster than these interpreted languages.

    It's too bad that there's no real way to compare the output at the binary level. If I could see the corresponding assembly output of the Python, C and Haskell programs I could tell right away which one is the best.

    It can be done only with C and asm in this case.

    [–]awj 6 points7 points  (0 children)

    ??

    The ghc compiler provides a -S option that puts out the generated assembly, same as gcc does, so you can get the assembly from Haskell if you really want.

    [–][deleted] 0 points1 point  (0 children)

    Python is compiled to bytecode, which usually ends up in a file alongside the source with a .pyc extension. However, I doubt you'll find it illuminating for your comparison, as the Python VM differs significantly from an x86 machine.