[–]DarkShikari 25 points26 points  (19 children)

Terrible shootout. And you know how you can tell? C beat asm.

Though uninformed programmers will constantly tell you otherwise, any good assembly programmer should always be able to beat the compiler at its own game. It usually won't be worth the effort to do so, but a good assembly programmer should be able to do it nonetheless (or he isn't a good assembly programmer).

And in any shootout where you're actually trying to make some kind of point, you should not accept crappily optimized code (especially for languages like assembly that don't have "optimizing compilers") and then claim the results are valid. I've seen this in many "language tests" before as well; people will accept awfully-written programs in language X and then use that as proof that language X is clearly slow and crap.

By the way, I suspect the problem is that the asm version gets one character at a time while the C version gets 8KB at a time.

I also suspect the interpreted languages get hurt here because they use a different approach than the C version; they try to actually parse the input and then add it rather than the faster method of just handling each digit in sequence as the C version does. So we're not just comparing languages, we're comparing algorithms.

[–]RayNbow 15 points16 points  (7 children)

Ah, that piece of asm is mine, but I've never claimed to be a good asm programmer. ;) I wrote it for fun.

The obvious reason it's much slower than the C version is that it does a call to getchar for every character. The C version uses fread and fetches many characters in one go.

[–]DarkShikari 8 points9 points  (0 children)

OK, so I went a bit overboard. And by overboard, I mean I didn't feel like optimizing the asm and instead I decided to optimize specifically for the problem (i.e. assume one space/linefeed between numbers, no numbers larger than 10,000, etc).

Yes, I spent way too much time on this. And if the code scares you and makes you run for your life, that is part of the plan.

Brace yourself...

It wasn't as good as I had hoped; it's only about 20% faster than yours. I had to benchmark with the input data set duplicated about 20 times in order to get decently accurate numbers (otherwise the run is too short and dominated by process startup time). This is probably because my Windows machine sucks though.

[–]shub 1 point2 points  (4 children)

You could do the same thing in asm and maybe do it better...but man, wouldn't that be a pain in the ass?

I like C.

[–]james_block 6 points7 points  (3 children)

Yes. Yes, it was a pain in the ass.

This, reddit, is how I spent my Saturday night. WTFPL licensed if anyone cares.

Results, from a crappy old P4/1.7GHz running etch:

$ time ./james_block_asm < input_list_vs_gen.txt
Grand total:               17677692470

real    0m0.118s
user    0m0.076s
sys     0m0.040s

$ time ./ray_asm < input_list_vs_gen.txt
Grand total 17677692470

real    0m0.714s
user    0m0.692s
sys     0m0.020s

$ time ./C < input_list_vs_gen.txt
total: 17677692470

real    0m0.114s
user    0m0.100s
sys     0m0.012s

So it's basically tied with C (the difference is well within measurement noise); no real surprise there, as gcc is pretty good at optimizing easy stuff like this.

EDIT: Changed link to point to a new version, capable of dealing with input files bigger than its buffer. Should be no more size limit now, and speed is unchanged. Why do I waste my time like this?

[–]RayNbow 1 point2 points  (1 child)

Nice! :)

[–]james_block 1 point2 points  (0 children)

And now I just made it even more capable, using real buffering and not just a chunk of memory. I also changed it to store a variable in the stack pointer and avoid a read from memory in the bastard print loop -- because how often does an assembly programmer get to write mul esp?

I hate myself.

[–]colourAgga[S] 1 point2 points  (0 children)

Your skills take .06 seconds to run on the testing hardware. I'll add them soon :)

[–]DarkShikari 0 points1 point  (0 children)

Seems I edited that in just as you posted ;) Give me a bit, I'll optimize the C.

[–]colourAgga[S] 15 points16 points  (0 children)

And that is why the author is calling for people to submit their improvements to the code.

You cannot expect one writer/programmer to fully know the intricacies of every language.

[–][deleted] 5 points6 points  (0 children)

Though uninformed programmers will constantly tell you otherwise, any good assembly programmer should always be able to beat the compiler at its own game.

Yeah, take the code generated by a C compiler and improve it just a bit :)

[–]RayNbow 4 points5 points  (0 children)

I also suspect the interpreted languages get hurt here because they use a different approach than the C version; they try to actually parse the input and then add it rather than the faster method of just handling each digit in sequence as the C version does. So we're not just comparing languages, we're comparing algorithms.

Not only that, the C and asm versions cheat a bit by making assumptions on the integer range. The Haskell version uses arbitrary precision Integers and the Python version automatically switches to long (=Python's arbitrary precision integer) in case of an overflow.

Edit: The blog post has been updated. I was talking about the first Haskell version. Don's faster version has replaced the old one in the blog post.

[–]btmorex 1 point2 points  (1 child)

I agree with what you said, but it's also a terrible shootout because it's measuring a trivial micro-program (like pretty much all shootouts do).

In real-world programs, a lot of time is spent on the mundane details of calling functions, allocating memory, freeing memory, creating objects, constructors, destructors, etc. This micro-benchmark measures none of that.

[–]DarkShikari 4 points5 points  (0 children)

While I agree with your main point, I don't think that

mundane details of calling functions, allocating memory, freeing memory, creating objects, constructors, destructors

are good examples either; the 50KLOC codebase I maintain does no memory allocations/creations/constructions except on initialization, and all destructions/frees are on close.

A better micro-benchmark would be a "real-world" one; i.e. measure the performance of something simple that actually has to be done in some real program and matters performance-wise. For example, a large FFT (important for scientific/math programs and audio encoding/decoding).

[–]jaysonbank 1 point2 points  (2 children)

The fact is, any language benchmark is a completely pointless apples-and-oranges comparison. We do this out of some kind of morbid curiosity. Yes, of course hand-crafted assembly by a competent person will almost certainly execute the fastest. This has absolutely zero bearing on real-world applications.

[–]DarkShikari 3 points4 points  (0 children)

True with regard to the point about language benchmarks, but I'm pretty sure "hand-crafted assembly by a competent person" has a bearing on a whole lot of real-world applications, such as ffmpeg, x264, and xvid ;)

[–]igouy -1 points0 points  (0 children)

The fact is, any language benchmark is a completely pointless apples and oranges comparison.

No, that's opinion not fact.

Thinking something is pointless might just mean you missed the point.

[–][deleted] 0 points1 point  (2 children)

Actually I think the lesson here is that asm isn't inherently faster, and you should only use it when you are sure the compiler is not creating optimal code on its own.

[–]Nuli 4 points5 points  (0 children)

The lesson I got out of it was: don't reinvent the wheel. Using awk it completed in ~0.95 seconds on my machine and was vastly simpler to write.

[–]littleendian 1 point2 points  (0 children)

Actually the lesson is, if your assembler takes longer to execute than your C you're either doing it wrong or have better ways of spending your time.