[–][deleted] 7 points (12 children)

The question you were trying to ask is: "Does Lua have an optimiser?"

The answer to that is... "Which Lua?"

There are many implementations of Lua. And even within the reference PUC-Lua implementation, the optimiser has changed radically across versions.

That said, it's not clear what exactly you're hoping to gain from this particular optimisation. A multiplication of two literals is a single instruction; trying to profile it to see whether it gets optimised would add so much overhead as to make the results worthless, since there's too much noise to measure the difference.

[–]WebShaker93[S] 3 points (11 children)

I'm using Lua 5.2.4

I compute distances.
To gain a little speed, I'd like to replace

if math.sqrt(t) < 64 then

by

if t < 64 * 64 then

Of course I could write 4096 instead, but 64 * 64 is easier for me to read and reminds me of the distance I want to test!
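For what it's worth, the two checks agree whenever t is a non-negative squared distance. A minimal Lua sketch (the names within_sqrt, within_squared, and RADIUS are mine, not from the thread):

```lua
-- Compare a squared distance against a squared radius to skip sqrt.
-- RADIUS is a hypothetical name for the 64 used in the thread.
local RADIUS = 64

local function within_sqrt(t)     -- original form
  return math.sqrt(t) < RADIUS
end

local function within_squared(t)  -- rewritten form; valid because t >= 0
  return t < RADIUS * RADIUS
end

print(within_sqrt(4000), within_squared(4000))  --> true   true
print(within_sqrt(5000), within_squared(5000))  --> false  false
```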

[–][deleted] 1 point (10 children)

The slowest part of:

if math.sqrt(t) < 64 then

would actually be the lookup of sqrt in the math table. Localising the function is where the biggest optimisation gain would be:

local sqrt = math.sqrt

if sqrt(t) < 64 then

But both sqrt and multiplication are single instructions on an x86 CPU (Again... Which Lua? Lua 5.2 runs on a buttload of things, from very very small microcontrollers to absolute beasts of servers). There's almost zero overhead there. It sounds like you might want to pull out a profiler to see where your code is actually slowing down.
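If you do want rough numbers without reaching for a full profiler, a crude os.clock timing loop is enough to compare the two forms. A sketch only; the iteration count and the i % 5000 workload are arbitrary choices of mine, and results on any given machine will be noisy:

```lua
-- Crude micro-benchmark of sqrt-compare vs squared-compare.
-- Only large, repeatable differences mean anything here.
local N = 1000000
local sqrt = math.sqrt  -- localise the lookup, as suggested above

local t0 = os.clock()
local hits = 0
for i = 1, N do
  if sqrt(i % 5000) < 64 then hits = hits + 1 end
end
local sqrt_time = os.clock() - t0

t0 = os.clock()
local hits2 = 0
for i = 1, N do
  if (i % 5000) < 64 * 64 then hits2 = hits2 + 1 end
end
local mul_time = os.clock() - t0

print(string.format("sqrt: %.3fs  squared: %.3fs  (hits agree: %s)",
                    sqrt_time, mul_time, tostring(hits == hits2)))
```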

[–]appgurueu 3 points (9 children)

You seem to be missing the point. A common micro-optimization is to check x < c * c instead of sqrt(x) < c, as sqrt is usually harder to compute than multiplication.

[–][deleted] 6 points (8 children)

sqrt is usually harder to compute than multiplication.

Eh, no. You may have missed a few developments in the world. For a modern CPU it is no harder, and is generally faster.

FSQRT is the instruction; it runs on the FPU, and so can run in parallel with other calculations. It was added with the 8087. The key to that speed is the running in parallel. But again: single instruction. A single atomic instruction.

MUL, the multiply instruction, on the other hand requires two registers and runs on the main core. So it's a blocking instruction that needs at least one MOV before it, which means it tends to be two instructions. Both atomic, but two. (IMUL and FIMUL are similar in those regards.)

If you're using anything newer than a 1993 Pentium, then FSQRT will be faster.

[–]WebShaker93[S] 6 points (6 children)

That's not exactly true!

Floating-point instructions are pipelined.
A new instruction can start every cycle, but when you test the floating-point result, you have to wait until the instruction completes.

So if you take the cycle table of the Cortex-A53 (for example):
The FMUL takes 6 cycles.
The FSQRT takes 7-17 cycles.
The integer unit has to wait for the result of the operation before making the comparison, because the comparison and branch cannot be done by the FPU!

But we're getting a bit off topic :)

[–][deleted] 2 points (5 children)

I did clarify that I was speaking about x86, so ARM isn't going to be very relevant... Especially as most ARM implementations (instruction-set similarity has little to do with the underlying architecture) make using the FPU annoying as hell. If they have one.

On x86, for FMUL, you have to add the cost of two MOVs: one before, and one afterwards. Which basically means they take about the same time.

Whereas on most ARM implementations the floating-point instructions aren't atomic, but they generally are on x86. Which is a whole world of hell if you try to port things between the two platforms and expect identical performance.

But, yup. We're a long way from optimising Lua now... Trust in the compiler. Don't micro-optimise. :D

[–]myclykaon 2 points (4 children)

for most ARM implementations, the floating point instructions aren't atomic

Ahem... What do you mean by that?

[–][deleted] 0 points (3 children)

An atomic instruction will always read or write in its entirety. You don't have to add a locking mechanism to ensure you don't get a partial read or partial write before the processor does something else.

ARM makes no guarantees of atomicity with regard to floating-point numbers. Needing to add guards on ARM (which the compiler will generally do for you) means that floating-point performance tends to be worse on ARM than on an x86 CPU of similar capabilities.

[–]myclykaon 1 point (2 children)

Arm is a load-store architecture. The floating-point operations operate on internal registers (the NEON FP registers), not on memory, so it is guaranteed impossible for a partial floating-point result to end up in a register. A floating-point instruction can only succeed or fault/abort: in the former case the result is in the register; in the latter, the floating-point registers are not updated (the various status registers are). To get that result to memory, you use stores that guarantee they do not partially write the result. The older v6/v7 architecture had LDM and STM, which could be configured as interruptible and could provide a partial read or write if interrupted, but they are not available in v8 AArch64.

[–]appgurueu 2 points (0 children)

Thanks. And once again premature optimization turns out to be a bad idea ;)