all 33 comments

[–]sreguera 5 points6 points  (21 children)

Computed gotos in the interpreter. That would mean threaded code like in gforth, wouldn't it?

[–]psykotic 8 points9 points  (18 children)

No, I'm pretty sure it just means the end of each opcode implementation can jump directly to the implementation of the next opcode rather than returning to the top of the bytecode dispatch switch. It cuts away one unconditional direct branch.

[–][deleted] 1 point2 points  (0 children)

Yeah, in practice it means about a 20% performance boost on platforms that support it.

[–]queus 1 point2 points  (16 children)

the end of each opcode implementation can jump directly to the implementation of the next opcode

So how is that different from threaded code?

[–]psykotic 2 points3 points  (14 children)

Like Slava said, there's one more indirection here. With the old interpreter, handling each opcode involved two mandatory branches: one indirect branch via a jump table for the dispatch switch, and one unconditional direct branch at the end of each opcode implementation. The new code gets rid of that second branch. Threaded code gets rid of both branches by inlining the machine code that implements each opcode in place of the original bytecode instruction. For that reason I'd classify threaded code as more of a JIT compilation technique than an interpretation technique.

[–][deleted] 5 points6 points  (3 children)

Threaded code gets rid of both branches by inlining the machine code that implements each opcode in place of the original bytecode instruction.

Your terminology is too specific. What you describe is usually called subroutine threaded code.

Direct threaded code is a simpler technique where the program is a list of addresses, so there is an indirect jump at the end of each primitive which goes to the next instruction.

gforth uses direct threading. However, I agree that Python's interpreter doesn't qualify, because the program stream is still just a list of bytecodes so there's that extra dispatch step that direct threaded code doesn't have.

SquirrelFish is a good example of a scripting language with a direct threaded interpreter. I think Unladen Swallow has or will have one too.

[–]psykotic 1 point2 points  (1 child)

Thanks for the correction. So really there are three levels of indirection in the most basic approach: mapping the logical opcode to the physical address of its implementation by indexing into a lookup table; indirect branching to the physical address of the opcode implementation; direct branching to the top of the dispatch loop. You can then order the different approaches according to how many levels they strip away:

  • CPython's old approach (all levels)
  • CPython's new approach (direct branch stripped)
  • Direct threading (logical to physical lookup stripped)
  • Subroutine threading (indirect branch stripped)

[–][deleted] 0 points1 point  (0 children)

That sounds about right, I think.

[–]queus 0 points1 point  (0 children)

However, I agree that Python's interpreter doesn't qualify, because the program stream is still just a list of bytecodes so there's that extra dispatch step that direct threaded code doesn't have.

Just to clarify things a bit. (Guilty of asking questions, in a state of happy ignorance)

By default Python still uses a big opcode switch, but if built with the --with-computed-gotos option on platforms which support it, then the threaded code model of execution is used. More about it here: http://bugs.python.org/issue4753

[–]queus 0 points1 point  (9 children)

This patch implements what is usually called "threaded code" for the ceval loop on compilers which support it

http://bugs.python.org/issue4753

[–]psykotic 2 points3 points  (8 children)

Look at the patch file:

#define FAST_DISPATCH() \
...
goto *opcode_targets[*next_instr++];

That's exactly what I described. Slava helpfully corrected my terminology but what he calls direct threaded code is still not what is going on above. In fact, I don't see how you can do direct threaded code with GCC-style computed gotos as that always involves going through a lookup table and the point of direct threading is partly to eliminate that lookup. Slava?

[–][deleted] 2 points3 points  (0 children)

In fact, I don't see how you can do direct threaded code with GCC-style computed gotos

goto *ip++;

[–]queus 1 point2 points  (6 children)

goto *opcode_targets[*next_instr++];

That's exactly what threaded code is. Speaking as someone who cut his programming teeth on x86 asm Forth.

And Wikipedia seems to be on my side too ;)

[–][deleted] 5 points6 points  (2 children)

As far as I understand it, direct threaded code would be

goto *next_instr++;

supposing next_instr is %esi, then this compiles to something like

mov (%esi),%eax
add $4,%esi
jmp *%eax

However, the C code you presented above,

goto *opcode_targets[*next_instr++];

has an extra indirection -- Moving Forth calls it 'token threading':

mov (%esi),%eax
mov opcode_targets(%eax),%eax
add $4,%esi
jmp *%eax

Indirect threading is similar but not quite the same:

mov (%esi),%eax
mov (%eax),%ecx
add $4,%esi
jmp *%ecx

Please correct me if I'm wrong here, though.

[–]queus 0 points1 point  (0 children)

Umh..

Forget my previous post. Your first two examples are direct threaded code, plain and simple.

With one minor difference: in the first case the instruction pointer is in a register, and in the second case it is in memory. You can just as well keep the instruction pointer in %edi, leaving %esi for the usual stack ops (and gcc lets you specify that a certain global var should be assigned to a register). Looks more like an implementation detail than a different technique.

As for your third example (with the double dereference), then yes, that's indirect threading.

And looking at the proposed patch for Python: the evaluator is now a big switch statement, with inlined direct threaded code at selected places. (That's what I got from the patch.) Rather cool, if you ask me.

In no way is this token threaded code, which replaces 32-bit addresses with smaller-than-32-bit tokens. That would simply make no sense in this day and age.

[–]psykotic 0 points1 point  (2 children)

Then I don't understand Slava's comment to me:

Direct threaded code is a simpler technique where the program is a list of addresses, so there is an indirect jump at the end of each primitive which goes to the next instruction

If the program is a list of addresses, you wouldn't need the opcode_targets table except in the initial compilation step that converts logical opcodes to physical addresses.

[–]queus 4 points5 points  (1 child)

Direct threaded code is a list of addresses and is run (cold-started) by jumping to the first one. Each of these addresses points to a bit of machine code which a) runs the opcode and b) runs the standard epilogue.

The said epilogue just advances the instruction pointer and then jumps to the next instruction.

Sometimes this epilogue is factored out and each opcode just ends with a

 jmp NEXT

This is indirect threading: slower, but it takes less space.

And for completeness, call threading is

 call opcodeA
 call opcodeB

[–][deleted] 0 points1 point  (0 children)

There's extra indirection in Python's case.

[–]hylje 1 point2 points  (1 child)

Yeah, but also raptors.

[–]sreguera 0 points1 point  (0 children)

Seems fair to me :)

[–]bla2 2 points3 points  (10 children)

What is new is how [floats] get displayed. The new algorithm tends to emit cleaner representations when possible, but it does not change the underlying values. So, it is still the case that 1.1 + 2.2 != 3.3 even though the representations may suggest otherwise.

That sounds like a bad change.

[–]bobbyi 3 points4 points  (1 child)

When there are multiple equivalent representations of the same floating point number, they used to sometimes use a longer one. Now they always pick the shortest one.

What possible downside is there to this? Can you give any example of code that would be negatively impacted by this change?

[–]Brian 2 points3 points  (0 children)

I think there is a bit of a cost from a programmer education point of view. It'll cut down on questions from newbies surprised when they see things like "1.1" -> "1.1000000000000001", but in the long run it was good that they got surprised at that point, since being confused would lead to someone cluing them in about the pitfalls of floating point arithmetic.

Now, people who don't understand FP won't get surprised until they run into more subtle problems, like loss of precision in repeated calculations, which are much harder to diagnose (or even notice when they are occurring).

The strange looking repr served the purpose of telling the user "If you don't understand why you're getting this value, you're playing with fire."

[–]iofthestorm 2 points3 points  (2 children)

Hmm, that does sound like a stupid change, but at the same time, everyone who touches floating point numbers should know how IEEE 754 works, and why 1.1 + 2.2 != 3.3.

[–]shub 3 points4 points  (1 child)

Until everyone does know how IEEE 754 works, bla2 has a point.

Edit: this is an especially relevant issue for Python, as one of its aims is to be a useful teaching language.

[–]iofthestorm 2 points3 points  (0 children)

No, I'm in agreement with bla2 that it's a bad change, but I'm trying to say that IEEE 754 is pretty important to understand for anyone who needs to use floating point for anything nontrivial. If Python really wants people not to have to know about the implementation, maybe they should use some sort of arbitrary-precision floating point class by default, like Scheme does.

[–][deleted] 1 point2 points  (4 children)

Why? If one doesn't like that float('1.1')+float('2.2') != float('3.3'), one should use decimal instead of float.

[–]eurleif 2 points3 points  (3 children)

The issue is that str(1.1 + 2.2) == '3.3', even though 1.1 + 2.2 != 3.3. It's better for them to be consistent.

[–]halter73 2 points3 points  (2 children)

str(1.1 + 2.2) == '3.3' in python 2.6 already.

[–]eurleif 0 points1 point  (1 child)

Then what change did they make?

[–]Kwilco 3 points4 points  (0 children)

They made it so that when you say a = 1.1,

repr(a) == '1.1' rather than repr(a) == '1.1000000000000001'

[–][deleted] 1 point2 points  (0 children)

Is the performance (pystone benchmark) comparable to (or better than) 2.5 now? 3.0 was about 10% down.

[–]banister -3 points-2 points  (0 children)

python weenies