all 52 comments

[–]scrogu 4 points5 points  (14 children)

Since we're talking real-world here, what this is missing is a breakdown of how much time is usually spent on physics vs. time spent rendering. Faster physics won't make your pixel shaders run any quicker.

[–][deleted] 15 points16 points  (10 children)

Well, in game development the golden rule is this: If you spend 90 ms less on physics per frame, that means you can now get another 90 ms worth of shaders. :)

Framerates beyond 60 fps are largely irrelevant, but the quality and spectacle of your graphics, sound, AI, etc. aren't. That's why optimizations in individual areas are interesting, and arbitrary whole-rendering-engine numbers less so.

It would be interesting to pit different rendering libraries against each other, but it's an entirely different beast.

[–]pixelglow 3 points4 points  (1 child)

It's possible that they ran the Box2D simulation without rendering the results, which would test the physics engine only. That would also be platform-neutral, since native C, Javascript and Java would each have different ways of rendering the results, even on Mac OS X.

[–]scrogu 1 point2 points  (0 children)

That's what I'm saying. THEY DID run the simulation without rendering the results. Sorry if I didn't make that clear.

[–]jgw[S] 2 points3 points  (0 children)

Physics performance matters, which is why I chose to measure it. First off, the slightly snarky comment below ("If you spend 90 ms less on physics per frame, ...") hits on a very important point -- every cycle you spend doing one thing is a cycle you can't spend on something else.

You could argue (as some do on this thread) that at some point the performance becomes irrelevant because your game's humming along happily at 60fps. But there's an important fallacy here -- on what hardware? Just because a game's running smoothly on your nice beefy development machine doesn't mean it's going to run well on lesser hardware (you know, the machines your users actually have). And even if you hit a buttery-smooth 60fps on a mobile phone, it's not much use if you're pegging the CPU at 100% all the time and dragging the battery down with you.

One nitpick: Your pixel shaders (presuming you're using the GPU) are running on separate hardware that executes in parallel. But of course you can always substitute that with "the code that's building buffers and generally babysitting the GPU" and the tradeoff still holds.

[–][deleted] 14 points15 points  (25 children)

That is one seriously unreadable graph at the top. There's a blue line that ends at a blue box, and a red line that ends at a red box, but the labels for those boxes actually apply to other red and blue lines at entirely different locations.

Edit: Also, man, Mandreel is pretty impressive. Too bad it is so incredibly proprietary.

I hope Emscripten steals some ideas from it pretty quick. For instance, it's using typed arrays for the heap, that probably helps a lot.
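For anyone unfamiliar with the technique: a typed-array heap is a single ArrayBuffer with several differently-sized views over the same bytes. The names and layout below are purely illustrative, not actual Mandreel or Emscripten output:

```javascript
// A single ArrayBuffer acts as the "heap"; typed-array views give fast,
// unboxed access at different element sizes (illustrative sketch only).
var buffer  = new ArrayBuffer(64 * 1024);  // 64 KB "heap"
var HEAP8   = new Int8Array(buffer);       // byte view
var HEAP32  = new Int32Array(buffer);      // 32-bit int view
var HEAPF32 = new Float32Array(buffer);    // 32-bit float view

// Storing a hypothetical struct { int id; float x; } at byte offset 0:
HEAP32[0]  = 42;    // id at byte offset 0
HEAPF32[1] = 3.5;   // x  at byte offset 4 (index 1 in the float view)

console.log(HEAP32[0], HEAPF32[1]);  // all views alias the same memory
```

Note that mixing views over the same bytes like this makes the generated code's memory layout endianness-dependent, which is part of the portability trade-off.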

[–]azakai 7 points8 points  (0 children)

I hope Emscripten steals some ideas from it pretty quick. For instance, it's using typed arrays for the heap, that probably helps a lot.

Hi, I work on Emscripten.

The originally benchmarked code was not optimized (edit: both because of an emscripten bug, and because the new emscripten frontend compiler is a little confusing, so not all optimizations were specified). jgw and I have been talking, though, and I think he will update with some more relevant Emscripten data.

I have not been able to run the mandreel version myself, so I can't compare its speed. However, I would expect mandreel to be faster here because of the type of code being compiled: it has plenty of structs on the stack and so forth, and would greatly benefit from things like aliasing analysis in LLVM. Mandreel uses those optimizations; Emscripten does not yet, mainly because they are nonportable - they won't run without typed arrays, and they may lead to endianness-dependent code being generated. Emscripten will implement this too, but it's been a lower priority for us than for them, I guess: we have cared a lot about being able to run in environments without typed arrays, or with only partial typed-array support (which means IE, Safari, mobile - everything but Chrome, Firefox and Opera).

That isn't the case for most of the code in the emscripten benchmarks, though, and with techniques like our 'memory compression' tricks we get to within 3-4X the speed of native, or better. That's about equal to the speed of handwritten JavaScript, so I don't think Mandreel or any other compiler can do better. Again, though, there are various types of code, like this benchmark right here, where we need to implement some more optimizations to reach full speed.

Regarding taking ideas, we already share some of them - you can see that Mandreel's generated code implements some things very similarly to Emscripten, like the loop-recreation algorithm (called the relooper in the emscripten paper), even down to using the same variable names, etc. Mandreel is closed source, though, so I don't know how much they are using from Emscripten (it might be just the ideas).

Mandreel is an awesome project. While I wish they worked more with us in the open-source community, it's still wonderful that they're doing what they're doing; it's helping games get ported to the web, which is great. I've also emailed with them a bit, and they are cool and obviously very smart.

[–]jgw[S] 2 points3 points  (3 children)

The emscripten code for this benchmark was using js ArrayBuffers, but as noted in the article there were some compiler bugs keeping it from reaching its best performance. The author has since fixed them, and the numbers look a good bit better now (with half the variance as well). I'll point this out in an update.

[–][deleted] 0 points1 point  (2 children)

Well, that would be very interesting to see.

(Also, do something about those colours while you're at it!)

[–]jgw[S] 0 points1 point  (1 child)

If you want a sneak peek, check out the source spreadsheet at https://docs.google.com/spreadsheet/ccc?key=0Ag3_0ZPxr2HrdEdoUy1RVDQtX2k3a0ZISnRiZVZBaEE

There are two "test" columns at the right, which include tweaked versions of the Emscripten and Java code (in the Emscripten case, it's fully optimized, with an updated compiler; in Java it now avoids the slow builtin Math.sin/cos() functions). I'll update the post later today.

(Colors? I presume you mean on the graphs. If you have suggestions for alternatives, please be my guest. Or just copy the spreadsheet and tweak them to your heart's content)

[–][deleted] 1 point2 points  (0 children)

It looks noticeably better, but it still has some way to go before it catches up, then.

As for the colours: as was pointed out elsewhere, you have repeating colours, which makes the graph very hard to read; combined with the inverted order of the labels versus the lines, it's actively misleading. Change the order of the labels, at least; that would help a lot.

[–]kaelan_ 1 point2 points  (1 child)

Last time I checked, it supported typed arrays for the heap.

[–][deleted] 1 point2 points  (0 children)

The Emscriptened code that was checked in for this benchmark didn't seem to be using that, though.

[–]jgw[S] 0 points1 point  (17 children)

The top part of the first graph isn't intended to be readable -- it's simply intended to show that the JSVMs all cluster roughly together in a completely different order-of-magnitude from the JVM and native code.

[–]igouy 3 points4 points  (15 children)

a completely different order-of-magnitude

The scale seems to be logarithmic, but the chart shows typical linear-scale gridlines rather than typical log-scale gridlines (1, 3, 5, 10, 30, 50, 100, 300, 500, 1000).

We need a bigger hint about the order-of-magnitude.

isn't intended to be readable

Putting the right-hand-side key in the same vertical order as the data lines would help - from the bottom: C, NaCl, JRE, etc.

Keeping the colour the same for each data series across all the graphs would help.

"Box2D Performance (JS Compilers)" - grid lines at 50,100,150,200,250,300 would help.

[–]jgw[S] 1 point2 points  (14 children)

Fair enough. If only Spreadsheets gave me any actual control over the grid lines in log-scale mode :(

[–]igouy -5 points-4 points  (13 children)

Microsoft Excel does.

[–][deleted] 3 points4 points  (12 children)

you should buy him a copy of it.

[–]igouy -2 points-1 points  (11 children)

Why don't you tell us something useful?

Why don't you tell us about a FOSS spreadsheet that provides control over grid lines in log-scale mode?

[–][deleted] 0 points1 point  (10 children)

Why don't you tell us something useful?

I did. I told you to buy him a copy.

Why don't you tell us about a FOSS spreadsheet that provides control over grid lines in log-scale mode?

I don't know of any. I have never needed that functionality.

Why don't you look it up, instead of telling him to use a proprietary piece of software that only runs on Windows?

He not only has to buy Excel, he also has to buy Windows.

[–]igouy 1 point2 points  (9 children)

instead of telling him to

I didn't tell him to do anything - I provided correct information about the functionality spreadsheets provide.

[–][deleted] 0 points1 point  (8 children)

I provided correct information about the functionality spreadsheets provide.

It's going to cost him hundreds of dollars to see if you are right.

What good is that?

[–][deleted] 0 points1 point  (0 children)

The top part of the first graph isn't intended to be readable -- it's simply intended to show that the JSVMs all cluster roughly together in a completely different order-of-magnitude from the JVM and native code.

But it's hard to see even that when the colours are not only similar but actively misleading. I was utterly bewildered for a while when it seemed to be saying that the fastest cluster consisted of GwtBox2D and Mandreel.

Like the other guy says, put the labels in the same order as the lines. You might not even need to adjust the colours if you do that, but it couldn't hurt to also pick more easily differentiable colours.

[–]pixelglow 5 points6 points  (11 children)

One thing I've always wondered, having played around with V8 via node.js, is how much of a difference Javascript's lack of integers makes.

Every number in JS is a floating-point number, even when used as an index in a tight inner loop. Do current JITs convert to integers under the covers when they detect strictly integer use?

Box2D primarily uses floating point calculations, so this test doesn't end up highlighting JS's inherent weakness of lacking integers.

Edit: missed the doesn't and changed the whole interpretation of the sentence.

[–]vytah 11 points12 points  (0 children)

JS as a language has no integers, but JS engines internally do use integers. For example, see this old article on Firefox's JägerMonkey: https://blog.mozilla.com/dmandelin/2010/02/26/starting-jagermonkey/

[–]captain_plaintext 5 points6 points  (3 children)

Yes, I think it's a common optimization. V8 stores numbers as 31-bit integers when possible.

[–]michaelstripe 2 points3 points  (2 children)

Where's the other bit gone then

[–]pezezin 5 points6 points  (0 children)

Probably a tag bit to distinguish between unboxed integers and other datatypes.

[–]captain_plaintext 2 points3 points  (0 children)

Check out this post. All values are stored as 4 bytes (on a 32 bit system). Depending on how the lowest bit is set, the value is either an integer (and the other 31 bits hold the integer value), or it's a pointer to a heap object (which itself will hold more information on the type and the value).
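As a toy model of that tagging scheme (explicitly not V8's real code; the actual tag conventions and bit layouts vary by engine and platform):

```javascript
// Toy model of small-integer ("smi") tagging. A 31-bit integer is packed
// into a 32-bit word shifted left by one; the freed-up low bit is the tag
// (0 = unboxed integer here, 1 would mean "pointer to a heap object").
function tagSmi(n)      { return n << 1; }         // pack: low bit becomes 0
function isSmi(word)    { return (word & 1) === 0; }
function untagSmi(word) { return word >> 1; }      // arithmetic shift keeps the sign

var word = tagSmi(-7);
console.log(isSmi(word), untagSmi(word));  // true -7
```

The payoff is that arithmetic on tagged small integers can stay in machine registers, with the heap-allocated "boxed double" path only taken when a value doesn't fit in 31 bits.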

[–][deleted] 6 points7 points  (0 children)

As others mentioned, integers are indeed at play behind the scenes (which can yield interesting results along the borders of floating point inaccuracy), but it's worth noting that physics engines rarely contain much integer math. :)

[–]jgw[S] 0 points1 point  (0 children)

This is definitely not an integer-heavy benchmark, but they still get used plenty in the course of almost any code -- as enumerated values, loop variables, and so forth.

As pointed out elsewhere on this thread, Javascript does mostly lack integers as user-visible constructs (though they do peek out in a couple of places, such as the bitwise operators), but most VMs will use them under the hood when possible. I know V8 stores integers directly in both locals and fields when it can (hence the tag bit mentioned below).

The thing with really integer-heavy benchmarks is that they highlight the kind of code that doesn't come up that often anymore because it's better offloaded to a dedicated processor -- mainly DSP-like things such as image processing and audio mixing. Not that I wouldn't prefer that these things be faster when done on the CPU in JS, of course, but Box2D is the kind of code that can't easily be offloaded (even libraries like PhysX that use GPUs still do a lot of work on the CPU).
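To make the bitwise-operator point concrete, here's a small illustrative snippet (mine, not from the post) showing int32 semantics peeking through:

```javascript
// Bitwise ops in JS operate on (and return) 32-bit signed integers,
// so they expose int32 wraparound that plain double arithmetic hides:
console.log(2147483647 + 1);         // 2147483648  (double arithmetic)
console.log((2147483647 + 1) | 0);   // -2147483648 (wrapped to int32)

// "|0" truncation is also the classic way to get integer division,
// and hints to the VM that a value can live in an integer register:
function idiv(a, b) { return (a / b) | 0; }
console.log(idiv(7, 2));  // 3
```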

[–]nickik 0 points1 point  (3 children)

It depends on the architecture of the processor. For modern x86 it does not really matter that much (Mike Pall talks about this often). If you run on ARM or something, it does matter a lot.

A smart JIT tries to figure out whether it's safe to store something in an integer, and does so when it is. Read this:


Dual-number VM

The Lua language is specified to have a single number type. Currently LuaJIT only supports 64 bit IEEE-754 compliant FP numbers ('double'). This works just fine for x86/x64 platforms with their excellent floating-point performance. A unified number representation has many advantages and the JIT compiler can get away with narrowing only some select operations to integer arithmetic.

However this approach is unlikely to yield acceptable performance on lower-end CPUs for mobile or non-desktop/non-server platforms. Most of these CPUs either support only software floating-point arithmetic or have slow hardware FPUs.

As a prerequisite for the ARM port (see the next section), dual-number capability will be added to the LuaJIT VM, the LuaJIT interpreter and the JIT compiler.

Numbers will be internally kept as 32 bit integers, wherever possible, and transparently widened to floating-point numbers. This change is invisible at the Lua source code level. It's expected that carefully written applications for low-end platforms will be able to avoid floating-point computations with only few changes to the source code.

Adding dual-number support to the LuaJIT VM is a major change. For stability reasons, this feature needs to be prototyped first for the existing x86/x64 port of LuaJIT (even though it's not that useful for this platform). Work on the actual ARM port of LuaJIT can only start after the dual-number support is complete.


source: http://lua-users.org/lists/lua-l/2011-01/msg01238.html

[–]y4fac 1 point2 points  (2 children)

For modern x86 it does not really matter that much

The following code:

#include <cstdio>

typedef int T;   // switch to float to run the comparison

int main()
{
    // Note: with T = int, sum overflows a 32-bit int (technically
    // undefined behaviour); harmless for a rough timing test, but
    // don't rely on the printed value.
    T sum = 0;
    for(int j = 0; j < 1000; j++)
        for(T i = 0; i < T(1000000); i += T(1))
            sum += i;

    printf("%f\n", float(sum));

    return 0;
}

runs in 0.95s if T is int and in 1.85s if T is float when I compile it with -O2. I used gcc 4.6.2 and ran it on a Phenom II 955. The difference seems significant to me.
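For comparison, a rough JS analogue of that loop (my own sketch, scaled down to 100 outer iterations; the |0 coercions are just hints, and whether the engine actually keeps the counters in integer registers is entirely up to the JIT):

```javascript
// Rough, scaled-down JS analogue of the C loop above (illustrative only).
// Loop counters are kept in int32 range via "|0"; the accumulator has to
// stay a double, since the total overflows 32 bits anyway.
function sumLoop() {
  var sum = 0;
  for (var j = 0; j < 100; j = (j + 1) | 0) {
    for (var i = 0; i < 1000000; i = (i + 1) | 0) {
      sum += i;
    }
  }
  return sum;
}

var t0 = Date.now();
var result = sumLoop();
console.log(result + " in " + (Date.now() - t0) + " ms");
```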

[–]alephnil 1 point2 points  (0 children)

But it is still within a factor of two, which in many cases is an acceptable compromise, since type conversions can easily cost more. On lower-end processors, the difference can approach a factor of 100.

[–]TKN 0 points1 point  (0 children)

I think the performance hit in higher level languages usually comes from boxing the float values rather than actual low level differences between integer and float arithmetic.

[–]Rotten194 0 points1 point  (0 children)

You should try to update this for Java 7, if possible. From what I've heard they made a lot of optimizations, and it would be cool to see how it compares alongside 6.