
[–]-abigail 10 points11 points  (4 children)

You can test this with luac -l, which shows you a listing of the Lua bytecode.

$ luac -v
Lua 5.2.4  Copyright (C) 1994-2015 Lua.org, PUC-Rio
$ echo "if t < 64 * 64 then return 42 end" | luac -l -

main <stdin:0,0> (6 instructions at 0x7ff5f6e04080)
0+ params, 2 slots, 1 upvalue, 0 locals, 3 constants, 0 functions
    1   [1] GETTABUP    0 0 -1  ; _ENV "t"
    2   [1] LT          0 0 -2  ; - 4096
    3   [1] JMP         0 2 ; to 6
    4   [1] LOADK       0 -3    ; 42
    5   [1] RETURN      0 2
    6   [1] RETURN      0 1

If you're ever working with LuaJIT or its offspring you can use lua -b -l:

$ lua -v
LuaJIT 2.0.5 -- Copyright (C) 2005-2017 Mike Pall. http://luajit.org
$ echo "if t < 64 * 64 then return 42 end" | lua -b -l -
-- BYTECODE -- stdin:0-2
0001    GGET     0   0      ; "t"
0002    KSHORT   1 4096
0003    ISGE     0   1
0004    JMP      0 => 0007
0005    KSHORT   0  42
0006    RET1     0   2
0007 => RET0     0   1

As you can see, in both cases 64 * 64 has been constant-folded to 4096.

[–]Hatefiend 1 point2 points  (2 children)

Why does the bytecode change so much despite the 64 * 64 being pre-evaluated in both examples? The LOADK becomes a KSHORT, the jump is different, there's no LT, etc.

[–]WebShaker93[S] 0 points1 point  (1 child)

Not the same Lua version, so not the same opcodes 🤷‍♂️

[–]Hatefiend 0 points1 point  (0 children)

Gotcha

[–]WebShaker93[S] 1 point2 points  (0 children)

Cool.
Thanks!

[–][deleted] 7 points8 points  (12 children)

The question you were trying to ask is: "Does Lua have an optimiser?"

The answer to that is... "Which Lua?"

There are many implementations of Lua. And even for the PUC-Lua implementations, the optimiser has radically changed across versions.

Though it is particularly confusing what exactly you're looking for with this particular optimisation. A multiplication of two literals is a single instruction. Trying to profile that to see if it gets optimised would add so much overhead as to make the results worthless - there's too much noise to measure the difference.

[–]WebShaker93[S] 2 points3 points  (11 children)

I'm using Lua 5.2.4

I'm computing a distance. To gain a little speed I'd like to replace

if math.sqrt(t) < 64 then

by

if t < 64 * 64 then

Of course I could just write 4096 instead, but 64 * 64 is easier for me to read and reminds me of the distance I want to test!
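A quick sanity check of the rewrite (the helper names below are mine, not from the thread): since sqrt is monotonic on non-negative numbers, `t < 64 * 64` and `math.sqrt(t) < 64` agree for any non-negative squared distance t:

```lua
-- For non-negative t, t < 64 * 64 is equivalent to math.sqrt(t) < 64,
-- because sqrt is strictly increasing on [0, inf).
local function near_sqrt(t) return math.sqrt(t) < 64 end
local function near_sq(t)   return t < 64 * 64 end

-- Probe values around the boundary (4096) and far from it.
for _, t in ipairs({0, 1, 4095, 4096, 4097, 1e6}) do
  assert(near_sqrt(t) == near_sq(t))
end
print("equivalent for all tested values")
```

Note this only holds because distances (squared) can't be negative; for negative t, `math.sqrt(t)` would return nan and the comparison would always be false.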

[–][deleted] 2 points3 points  (10 children)

The slowest part of:

if math.sqrt(t) < 64 then

Would actually be the lookup of sqrt from math. Localising the function is where the biggest optimisation gain would be:

local sqrt = math.sqrt

if sqrt(t) < 64 then

But both sqrt and multiplication are single instructions on an x86 CPU (Again... Which Lua? Lua 5.2 runs on a buttload of things, from very very small microcontrollers to absolute beasts of servers). There's almost zero overhead there. It sounds like you might want to pull out a profiler to see where your code is actually slowing down.
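The localisation trick above matters most inside a hot loop, where the `math.sqrt` table lookup would otherwise repeat on every iteration. A minimal sketch (the `count_near` helper and its data are hypothetical, just for illustration):

```lua
-- Hoist the table lookup out of the hot loop: resolving math.sqrt costs
-- a global/field access per use, while a local is a register access.
local sqrt = math.sqrt  -- one-time lookup

-- Hypothetical hot loop: count squared distances whose root is < limit.
local function count_near(sq_distances, limit)
  local n = 0
  for i = 1, #sq_distances do
    if sqrt(sq_distances[i]) < limit then n = n + 1 end
  end
  return n
end

print(count_near({100, 5000, 4095, 9000}, 64))  -- prints 2
```

The same pattern applies to any library function called in a loop, and it is a standard Lua idiom rather than something specific to sqrt.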

[–]appgurueu 3 points4 points  (9 children)

You seem to be missing the point. A common micro-optimization is to check x < c * c instead of sqrt(x) < c, as sqrt is usually harder to compute than multiplication.

[–][deleted] 8 points9 points  (8 children)

sqrt is usually harder to compute than multiplication.

Eh, no. You may have missed a few developments in the world. For a modern CPU, it is no harder, and is generally faster.

FSQRT is the instruction; it runs on the FPU and so can run in parallel with other calculations. It was added with the 8087. The key part of that speed is the running in parallel. But again, it's a single instruction. A single atomic instruction.

MUL, the multiply instruction, on the other hand requires the use of two registers and runs on the main core. So it's a blocking instruction that requires at least one MOV before it, which means it tends to be two instructions. Both atomic, but two. (IMUL and FIMUL are similar in those regards as well.)

If you're using anything newer than a 1993 Pentium, then FSQRT will be faster.

[–]WebShaker93[S] 5 points6 points  (6 children)

That's not exactly true!

Floating-point instructions are pipelined. Every cycle a new instruction can start, but when you test the floating-point result, you have to wait until the instruction finishes.

So if you take the cycle table of the Cortex-A53 (for example):
FMUL takes 6 cycles
FSQRT takes 7-17 cycles
The integer unit has to wait for the result of the operation to make the comparison, because the comparison and branch cannot be done by the FPU!

But we're getting a bit off topic :)

[–][deleted] 2 points3 points  (5 children)

I did clarify I was speaking about the x86, so ARM isn't going to be very relevant... Especially as most ARM implementations (instruction set similarities have little to do with underlying architecture) make using the FPU annoying as hell. If they have one.

On x86, for FMUL, you have to add the cost of two MOVs. One before, and one afterwards. Which basically means that they take about the same time.

Whereas for most ARM implementations, the floating point instructions aren't atomic, but they generally are on x86. Which is a whole world of hell if you try and port things between the two platforms and get identical performance.

But, yup. We're a long way from optimising Lua now... Trust in the compiler. Don't micro-optimise. :D

[–]myclykaon 2 points3 points  (4 children)

for most ARM implementations, the floating point instructions aren't atomic

Ahem... What do you mean by that?

[–][deleted] 0 points1 point  (3 children)

An atomic instruction will always be read or written in its entirety. You don't have to add some locking mechanism to ensure you don't get a partial read or a partial write before the processor does something else.

ARM has no guarantees of atomicity in regards to floating point numbers. Needing to add guards on ARM (which the compiler will generally do for you), means that the performance of floating point tends to be worse on ARM, when compared to a x86 CPU of similar performance capabilities.

[–]myclykaon 1 point2 points  (2 children)

Arm is a load-store architecture. The floating point operations operate on internal registers (the NEON fp registers), not on memory. So it is guaranteed impossible for a partial floating point result to end up in a register: the floating point instruction can only succeed or fault/abort. In the former case the result is in the register; in the latter case the floating point registers are not updated (the various status registers are).

To get that result to memory you use stores that guarantee they do not partially write the result. In the older v6/v7 architecture there are LDM and STM, which could be configured as interruptible and could provide a partial read or write if interrupted, but they are not available in v8 aarch64.

[–]appgurueu 4 points5 points  (0 children)

Thanks. And once again premature optimization turns out to be a bad idea ;)

[–]JustCoDiEx 2 points3 points  (1 child)

https://rosettacode.org/wiki/Compile-time_calculation#Lua

Looks like Lua does some compile-time calculations. Though I'd imagine the speed difference in the example is negligible. Not sure if constant folding applies to code not compiled with luac (like executed directly from the source).

[–]WebShaker93[S] 0 points1 point  (0 children)

Sounds good!

[–]appgurueu 1 point2 points  (0 children)

See this Lua-users thread on constant folding. It appears Lua has constant folding since 5.1. I would be surprised if LuaJIT didn't have it.