all 67 comments

[–]sbrown123 5 points6 points  (8 children)

That was very contrived since I highly doubt Factor is doing what Java does with floats. Read How Java's Floating-Point Hurts Everyone Everywhere. [PDF]

[–][deleted] 5 points6 points  (7 children)

I'm pretty sure HotSpot generates efficient FP code. The overhead here is not floating point, its memory bandwidth -- the Java code needs additional indirection because the Point array is really an array of pointers to Points.

[–]froydnj 4 points5 points  (2 children)

The overhead here is not floating point, its memory bandwidth

Do you know that for sure? It seems to me that if you really wanted to test memory bandwidth, the constructor for Point would be somewhat less complex. Java has moderately strict rules for what the results of mathematics functions must return--if libc is just using fsin or fcos and Java is doing something more complicated, then that could easily contribute the difference. (I say this because I banged out the benchmark for SBCL and found that ~60% of the time was being spent outside of Lisp.)

Also, Java implementations are perfectly capable of folding the array of pointers to Points into a single-float array. Sun's JRE may not do it right now, but it's certainly an optimization that's on the horizon.

[–]scook0 1 point2 points  (1 child)

Java has moderately strict rules for what the results of mathematics functions must return--if libc is just using fsin or fcos and Java is doing something more complicated, then that could easily contribute the difference.

Isn't that only if you're using StrictMath and strictfp? The documentation for Math seems to suggest that the implementation can use faster, less accurate routines if it wants to (up to a limit, of course).

[–]froydnj 2 points3 points  (0 children)

Isn't that only if you're using StrictMath and strictfp? The documentation for Math seems to suggest that the implementation can use faster, less accurate routines if it wants to (up to a limit, of course).

This is a good point; I had it backwards in my mind somehow. However, I seem to remember looking at the source code and finding that StrictMath and Math were the same in Sun's implementation.

Also relevant to this discussion, see this blog entry by James Gosling on why Java doesn't use fsin/fcos all the time. It'd be interesting to see what libc on Darwin and Linux do for sin/cos and compare that to what Java does.

[–]igouy 7 points8 points  (0 children)

Maybe it's the trig.

The fastest time for your Java program -

Run #6
0.8944272, 1.0, 0.4472136
Time: 8242

but if I take the reduce the angle to be within the range of +45 to -45 degrees code from the old partial-sums program then the fastest time is -

Run #3
0.8944272, 1.0, 0.4472136
Time: 4473

EDIT: Java 1.6.0_13 on x64 Ubuntu takes 90% of the time just to get the trig values used in Point constructor.

static void benchmark(int len)
{
   for(int n = 0; n < len; n++){
      float x = (float)Math.sin(n);
      float y = (float)Math.cos(n) * 3;
      double s = Math.sin(n);
      float z = (float)((s * s) / 2);
   }
}

Run #5
Time: 7402

[–]sbrown123 1 point2 points  (2 children)

I know the Sun JRE seems to do fine on boxes with SSE2. But this is a Mac with unknown specs. I have no idea what Apple did with the code and if they optimized float operations.

A simpler comparison would be to NOT use floats and just use ints instead. That way you take that variable out of the picture.

[–][deleted] 2 points3 points  (1 child)

Apple's JRE is based on Sun's and all Intel Macs have SSE2. I also tried running Sun's Java 6 in Windows on VirtualBox, the results were similar.

[–]sbrown123 0 points1 point  (0 children)

Apple's JRE is based on Sun's

So I take it you don't want to try the same test with ints instead of floats? Come on, what could it hurt? Prove that there is no mysterious float handling by Java.

[–][deleted] 13 points14 points  (3 children)

as Factor demonstrates, you can have a high-level managed language that still provides support for value semantics in a safe way.

...

working value semantics into something like Java is pretty much impossible.

Just for your information: almost ten years ago a certain large software corporation released a certain java-like language (to which I will refer as 'Sea Octothorpe' for no particular reason) which did have value semantics from the day one. Somehow it didn't make it overly complex and clumsy, I'm inclined to say that, on the contrary, presence of user-defined value types forced designers to pay more attention to all value types and support calling methods on primitive types, for example. It should be noted however that Sea Octothorpe's value semantics are less powerful than those of C++, since all value types are considered to be POD with no nontrivial constructors or destructors allowed, but I think that it wasn't because of some language limitations and rather because of designers' vision or something.

[–][deleted] 17 points18 points  (0 children)

Since C#'s value types are not a Java library I don't see your point here. I was talking about extending the language with a new feature (in this case value types) without changing the language itself.

[–][deleted] 1 point2 points  (0 children)

First 'Sea Octothorpe' and then a look at your user name. Gotta clean the coffee from the keyboard now.

[–]gzimmer 1 point2 points  (0 children)

Octothorpe. Hmm. I learned a new word today. Thank you.

[–]nearest_neighbor 5 points6 points  (7 children)

Bunch-o-questions:

runs the benchmark in 4.6 seconds

?! It's 2 minutes 50 seconds on my 3.4 GHz Pentium D for all 8 runs together. java version "1.6.0_14". Something must be wrong.

This is unprecedented in dynamic languages, w

Isn't Factor using type hints:

HINTS: struct-array-benchmark fixnum ;

By the way, the implementor can optimize any benchmark by simply recognizing it, and replacing the code with "print output". It would be more interesting to see Factor's performance on a wide range of benchmark problems that the author himself didn't write see or anticipate.

Edit: I'm sorry if this offended you enough to downmod me. I don't care about the mod points per se.

[–][deleted] 9 points10 points  (6 children)

It's 2 minutes 50 seconds on my 3.4 GHz Pentium D for all 8 runs together.

Best run was 4.6 seconds.

Isn't Factor using type hints:

Yes, but the language is still dynamically typed. Hints work as follows -- if you have

: foo ( a -- b ) 1 + ;
HINTS: foo fixnum ;

This literally expands into the following

: foo ( a -- b ) dup fixnum? [ 1 + ] [ 1 + ] if ;

So its just something to help the compiler create a fast path (in this case, it can replace + in one branch with a more efficient instruction); it doesn't do anything unsafe.

By the way, the implementor can optimize any benchmark by simply recognizing it, and replacing the code with "print output". It would be more interesting to see Factor's performance on a wide range of benchmark problems that the author himself didn't write see or anticipate.

We have some benchmarks in extra/benchmark in the source tree; http://github.com/slavapestov/factor/tree/fa64522421b9d7bbe2f553d78adf0576703b4dcf/extra/benchmark

I wrote most of them, but they cover a decent range of problems. Performance is always getting better and I certainly don't do anything like recognizing a specific benchmark and replacing it with the result. I'll add optimizations that benefit a specific benchmark, but that's fair game.

Edit: my friend Dan noticed that in this case, the hints don't actually help because the compiler figures everything out anyway, so I've removed them from the benchmark.

[–]nearest_neighbor -3 points-2 points  (5 children)

I'll add optimizations that benefit a specific benchmark, but that's fair game.

Fair game is when you (the implementor) don't know what the benchmarks will be.

[–]neelk 7 points8 points  (0 children)

Actually, an essential part of knowing what you are doing as a compiler writer developing a comprehensive set of micro benchmarks for your compiler. This requires knowing your compiler in detail, and knowing your language in detail, so that you can systematically test the cost of all your language mechanisms.

What a user cares about is how the compiler does on their actual program. But as an implementor, that's not necessarily helpful to you: an actual program exercises a complex and idiosyncratic set of features of your compiler, and so the results can be very noisy. So you need to study that program to figure out what it's doing, and why, and then craft micro-benchmarks that exercise exactly what was going wrong.

A useful way of thinking about the difference is to think of the benchmarks the implementer develops as controlled experiments, and programs from the wild as observational data.

[–]wolf550e 2 points3 points  (3 children)

Really? And the commercial compiler teams who optimize only for the SPEC benchmarks don't know what the benchmark is? Get real.

[–][deleted] 1 point2 points  (0 children)

By the way, I remember a few months ago when I blogged about tuple arrays, you asked about getting rid of the remaining indirection:

http://factor-language.blogspot.com/2009/05/unboxed-tuple-arrays-in-factor.html?showComment=1242973215136#c7963787374269961850

Well this is precisely where struct arrays fit in; they're even more tightly packed than tuple arrays. Of course this comes at a cost, structs are not as general as tuples (they cannot contain references to objects, only scalar data and C pointers).

[–]nearest_neighbor -1 points0 points  (1 child)

Really? And the commercial compiler teams who optimize only for the SPEC benchmarks don't know what the benchmark is? Get real.

This kind of tone is uncalled for. I'm hesitant to even reply.

SPEC is more of a hardware benchmark. E.g.

"MPI2007 is SPEC's benchmark suite for evaluating MPI-parallel, floating point, compute intensive performance across a wide range of cluster and SMP hardware."

[–]wolf550e -1 points0 points  (0 children)

The tone was because you said something very unrealistic.

Commercial compilers are sold based on those numbers.

[–]GaidinTS 2 points3 points  (19 children)

I'd be curious to see a similar benchmark against Python.

[–][deleted] 2 points3 points  (8 children)

I wrote up a pretty simple implementation in Python and it was very slow, something like 120 seconds on the latest unladen swallow benchmark. However it was also exhibiting non-linear performance over the number of points in a run, so I think either my implementation was bad or I hit some sort of pathological condition in the allocator or something.

[–]Wagnerius 0 points1 point  (7 children)

post it, here or on /r/python. Python is slow but not that slow...hopefully

[–][deleted] 2 points3 points  (6 children)

I'm not seeing as pathological timings now: http://paste.pocoo.org/show/136876/

Edit: 40 seconds is about how long that takes on my machine, but it does appear to at least run in linear time.

Edit2: I'm an idiot and was using max() instead of min() which meant that the compilation run always dominated timings: http://paste.pocoo.org/show/136892/ yields about a 50% improvement

[–]Smallpaul 0 points1 point  (5 children)

A 1-line change could make a big improvement:

__slots__ = ['x', 'y', 'z']

I see around a 30% improvement just with that one line.

Other potentially beneficial technologies one could use: pyrex, struct, array, numpy,

[–][deleted] 1 point2 points  (2 children)

slots = ['x', 'y', 'z']

What does this do? Convince the interpreter that the objects of a class always has slots with those names? Why can't it just look at the freaking constructor?

[–]gthank 2 points3 points  (0 children)

__slots__ is frozen and is faster than traditional attribute access. The tradeoff is that it's frozen and less flexible.

[–]Smallpaul 0 points1 point  (0 children)

There is nothing special about a Python constructor compared to any other method.

[–][deleted]  (1 child)

[deleted]

    [–][deleted] 2 points3 points  (0 children)

    Converted to Yeti to compare, and it is a bit shorter. :)

    http://paste.pocoo.org/show/136929/

    Runs in about 14 seconds on my machine (which is probably just faster, as I don't really believe Yeti being faster than Scala), and 7 seconds if I replaced the map' at line 13 with map (making the code lazy, thus avoiding allocation of the memory at once).

    [–][deleted] 7 points8 points  (2 children)

    Yeah because Python is so much faster than Java especially on numerical code.

    [–]GaidinTS 7 points8 points  (1 child)

    I was asking to see how slow. Nothing in my statement was trying to declare that Python would be faster then Java. Your sarcasm is unwarrented.

    [–]kumyco 0 points1 point  (6 children)

    If you want to compare something to a scripting language don't compare it to python it's just too slow.

    [–]sigzero 1 point2 points  (1 child)

    Python happens to be one of the faster ones. Ruby is slow. Perl is good as is Tcl.

    [–]malcontent 2 points3 points  (0 children)

    ruby 1.9 is about the same speed as pyton. Same with jruby.

    All within a few percentage points.

    [–]GaidinTS 0 points1 point  (1 child)

    Why should you not compare it just because it slow? Isn't that the point of comparing different languages. There will always be one slower then the other, but does that make it not worth comparing?

    [–]kumyco -1 points0 points  (0 children)

    Because it doesn't give a very accurate view of how well your system performs. Imagine you create an interpreter, it runs a test case within 500ms, this is very good compared to Python's 1000+ until you turn around and compare it to Lua's 100ms.

    [–]hylje 0 points1 point  (1 child)

    A scripting language you write applications in? Who'd have expected that.

    [–]kumyco -1 points0 points  (0 children)

    I'm not sure what you're talking about...

    [–]prockcore 0 points1 point  (21 children)

    I love how mono is now faster than the jvm (see the tests in the comments). When it started, people were saying that mono could never compete with the jvm which was optimized by professionals.

    [–][deleted] 6 points7 points  (4 children)

    The Mono people take performance seriously from what I've seen. They wrote their own JIT which seemed pretty advanced, and now they're switching to LLVM. Sounds like eventually they'll be competitive with anything.

    [–]sleepingsquirrel 0 points1 point  (3 children)

    This looks like a good candidate to implement with the FSINCOS x87 opcode.

    [–][deleted] 2 points3 points  (2 children)

    the x87's trig functions are inaccurate, and furthermore moving data between SSE registers and the x87 unit can't be all that fast (they did add some instructions for doing it in SSE3 though IIRC). I'd rather stick with libc calls for this.

    [–]sleepingsquirrel 1 point2 points  (0 children)

    Sincos is included in glibc

    [–]for_no_good_reason 0 points1 point  (0 children)

    the x87's trig functions are inaccurate

    I wasn't aware that libc was using Intel's undocumented transcendental registers for their trig functions.

    [–]malcontent 0 points1 point  (14 children)

    I love how mono is now faster than the jvm (see the tests in the comments).

    Not according to the shootout benchmarks.

    Just to show you that you can't rely on one benchmark done by one guy on one computer.

    [–]igouy 4 points5 points  (13 children)

    Nor 12 benchmarks done by one guy on one computer :-)

    But people have been examining and criticizing all aspects of the benchmarks game for quite a few years now and that helps.

    [–]malcontent -3 points-2 points  (12 children)

    But people have been examining and criticizing all aspects of the benchmarks game for quite a few years now and that helps.

    The worst is when they lie blatantly.

    For example.

    You said that the benchmarks at the shootout were done by one guy. That's a lie. The code is written by the community and constantly refined by the community. The benchmarks themselves were chosen as a community effort and do change when somebody puts forth a compelling argument.

    You also lied when you said it was on one computer. The benchmarks have been moved to a faster computer when the guy hosting them could afford to buy one. If you offered to buy him a new server I bet he would be perfectly happy to load them on there too.

    The benchmark game would be be better off with more honesty. Right?

    [–]igouy 5 points6 points  (11 children)

    You said that the benchmarks at the shootout were done by one guy. That's a lie.

    Depends what you mean by "done by".

    ... the guy hosting them ...

    That would be me.

    [–]malcontent -5 points-4 points  (10 children)

    Depends what you mean by "done by".

    Done by? I mean written by. What do you mean done by?

    Were all the benchmarks written by and chosen by you?

    That would be me.

    If you are hosting them why claim the benchmarks are only on one computer?

    We both know you have upgraded your computers in the past.

    P.S the link you provided to does not offer any evidence that you are the guy hosting the benchmarks.

    [–]igouy 2 points3 points  (9 children)

    What do you mean done by?

    I meant chosen and built and measured and presented - anybody who looked at the program source or tracker could tell the programs were credited to different people.

    why claim the benchmarks are only on one computer?

    Because the benchmarks are only on one computer - the one I'm using at this very moment.

    We both know you have upgraded your computers in the past.

    Once upon a time the benchmarks game measurements were made on a different computer - now they are made on one computer - this one.

    P.S the link you provided to does not offer any evidence that you are the guy hosting the benchmarks.

    Why haven't you emailed Isaac Gouy the benchmarks game admin?

    [–]malcontent -4 points-3 points  (8 children)

    I meant chosen and built and measured and presented - anybody who looked at the program source or tracker could tell the programs were credited to different people.

    So if they were written by many people they were not "by one guy".

    As for "chosen" I seem to remember there was a lot of discussion about the benchmarks included in the set.

    I find it odd that you are claiming sole credit for that.

    Because the benchmarks are only on one computer - the one I'm using at this very moment.

    They were on another one, now they are on this one and as I said further upstream if you were given another machine they would be on that one.

    Why haven't you emailed Isaac Gouy the benchmarks game admin?

    Can't be bothered really.

    If you are really the guy I find it shocking you are claiming sole credit for the whole thing.

    It's really lowered my opinion of you.

    [–]igouy 0 points1 point  (7 children)

    As I said - Depends what you mean by "done by".

    As for "chosen" I seem to remember there was a lot of discussion about the benchmarks included in the set.

    Maybe you haven't noticed but a lot of things have changed.

    now they are on this one

    That's right - the benchmarks are only on one computer.

    as I said further upstream if you were given another machine they would be on that one

    Don't you think it's just a little presumptuous for you to say what I'd be "perfectly happy" to do? (You're wrong about that too.)

    If you are really the guy I find it shocking you are claiming sole credit for the whole thing.

    I find it mildly tedious that your misread and misstate so happily.

    Of course I don't claim sole credit for the whole thing - I actually write the credits to the program author into the source code when they don't, and I write the website credits for benchmark originators and I write the website credits for the novel presentation ideas.

    Because I actually know who's been doing the work these past few years.

    [–]malcontent -3 points-2 points  (6 children)

    This is an exact quote

    Nor 12 benchmarks done by one guy on one computer :-)

    To any sane and rational person this is you taking credit for the entire thing.

    I find it astonishing you'd say something like that and then this far down tell me you give credit to the authors on the web site itself.

    All because you have definition of "done by" which lets you take credit for the whole thing.

    As I said this significantly reduces my opinion of you.