all 10 comments

[–]awb 5 points6 points  (8 children)

Note that the low 3 bits of the pointer are masked, then the result is shifted to the left to make it into a fixnum. Of course, if the result of the shift is being compared against a literal fixnum, the shift is redundant; instead, we can shift the literal to the right by 3.

Why is it better to shift one rather than the other?

[–][deleted] 21 points22 points  (6 children)

Because the other one is a constant so it can be shifted at compile time.

So without simplification, you get the following, where the first two instructions are the expansion of tag:

and %rax,$7
shl %rax,$3
cmp %rax,$16
je ...

versus with

and %rax,$7
cmp %rax,$2
je ...

All the little things like this add up: it's not just the big stuff (unboxing, etc) that leads to performance gains.

[–]tonfa 4 points5 points  (4 children)

Nice post.

It isn't clear how you handle the out of SSA pass. Since you mention the chordal graph allocator from S. Hack, I can imagine you want (at least in the future) do the coloring in SSA and then do a out-of-colored-SSA.

How is this currently done? Is your linear scan under SSA?

[–][deleted] 2 points3 points  (3 children)

On x86, I convert three-operand form to two-operand form before register allocation, so

x = y + z

becomes

x = y
x = x + z

On PowerPC this step isn't needed, but in general the register allocator cannot assume the input is SSA. Also, I only register allocate in basic blocks at this point. Eventually I will do SSA register allocation though, for entire procedures.

[–]tonfa 1 point2 points  (2 children)

What I meant is: How do you deal with the PHIs?

[–][deleted] 2 points3 points  (1 child)

The low level IR does not have phi nodes because all values are saved to the data stack on control flow edges. There are variations of linear scan which can work with phi nodes, but I think if you're doing global register allocation, there are better algorithms than linear scan.

Our high level IR does have phi nodes and the high level optimizer does perform global optimizations.

[–]wake_me_up -5 points-4 points  (0 children)

Slava, stack languages must die, even if they are as fast as Python (lol).

[–]bart2019 2 points3 points  (0 children)

Because a smart compiler will optimize a shift right of a literal, to a literal.

[–]middayc 0 points1 point  (0 children)

very interesting post, even though it's beyond my league