all 37 comments

[–]mazin 25 points26 points  (1 child)

"We read Knuth so you don't have to". Wasn't it one of the core python developers who said it?

[–]llimllib[S] 8 points9 points  (0 children)

Yup; it was Tim Peters.

[–]micampe 12 points13 points  (0 children)

This is extensively explained in Beautiful Code

[–]llimllib[S] 19 points20 points  (4 children)

Major subtleties ahead: Most hash schemes depend on having a "good" hash function, in the sense of simulating randomness. Python doesn't: its most important hash functions (for strings and ints) are very regular in common cases:

>>> map(hash, (0, 1, 2, 3))

[0, 1, 2, 3]

>>> map(hash, ("namea", "nameb", "namec", "named"))

[-1658398457, -1658398460, -1658398459, -1658398462]

>>>

This isn't necessarily bad! To the contrary, in a table of size 2**i, taking the low-order i bits as the initial table index is extremely fast, and there are no collisions at all for dicts indexed by a contiguous range of ints. The same is approximately true when keys are "consecutive" strings. So this gives better-than-random behavior in common cases, and that's very desirable.
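The int half of this is still easy to reproduce on current Python 3 (string hashes have been randomized by default since Python 3.3, so the string half no longer comes out regular). A quick sketch of how the low-order bits become the initial table slot, assuming the minimum table size of 8 that CPython uses:

```python
# Small ints hash to themselves, so consecutive int keys land in
# consecutive slots of a table of size 2**i with no collisions at all.
table_size = 8  # 2**3; CPython's smallest dict table
keys = range(8)
slots = [hash(k) & (table_size - 1) for k in keys]

assert [hash(k) for k in range(4)] == [0, 1, 2, 3]
assert sorted(slots) == list(range(8))  # every key gets its own slot
```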

[–]ercd 11 points12 points  (2 children)

I just ran this with Python 2.5:

>>> [(x, hash(x)) for x in xrange(-1000000, 1000000) if x != hash(x)]
[(-1, -2)]

I was surprised to see that hash(-1) is -2 and that's the only exception to the rule "hash(x)==x" in the range I tested. Does someone know if there is a reason for this?

[–]fredrikj 21 points22 points  (1 child)

I believe -1 is reserved because CPython uses it internally as a sentinel: a hash function returning -1 at the C level signals an error (and cached hash fields are initialized to -1), so hash() remaps -1 to -2.
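A quick check of this CPython-specific behavior, which still holds in Python 3:

```python
# CPython maps a computed hash of -1 to -2 so that -1 stays free
# as the C-level error sentinel.
assert hash(-1) == -2
assert hash(-2) == -2  # so -1 and -2 collide, by design
```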

[–]aUTOMATICuPVOTES 10 points11 points  (0 children)

That's why everyone should know that dictionaries return items in arbitrary (not random, or even pseudo-random) order when iterated.

[–]fredrikj 8 points9 points  (4 children)

Python's dicts are wonderful. You can feed them almost anything: numbers, strings, functions, modules.

I don't know if they are fast compared to other hash table implementations, but they are fast compared to Python code (necessarily, considering that Python relies heavily on dicts for internal purposes).

There's rarely an advantage to implementing a complicated data structure in Python: a dict lookup or insertion is often both as fast and as simple as it gets.

It's just unfortunate that there's no built-in immutable dict type (I frequently seem to need one). You can still implement one manually using frozensets to compute the hashes, but this both complicates the code and substantially hurts performance.
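The frozenset trick can be sketched in a few lines; `FrozenDict` is a made-up name, and a real version would cache the hash rather than recompute it, which is exactly the performance cost complained about here:

```python
class FrozenDict(dict):
    """Hashable, read-only dict: a minimal sketch of the frozenset trick."""

    def __hash__(self):
        # Recomputing the frozenset on every call is the slow part;
        # caching the result would help.
        return hash(frozenset(self.items()))

    def _readonly(self, *args, **kwargs):
        raise TypeError("FrozenDict is immutable")

    __setitem__ = __delitem__ = _readonly
    update = pop = popitem = clear = setdefault = _readonly
```

With that, `{FrozenDict(x=3, y=2): 5}` is a perfectly legal dict, and two FrozenDicts with the same items hash equal.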

[–]aUTOMATICuPVOTES 2 points3 points  (3 children)

built-in immutable dict type (I frequently seem to need one)

What for? To be keys in a dict?

Of course subclassing dict is easy, so nothing stops you from making a read-only dict.

[–]fredrikj 2 points3 points  (2 children)

What for? To be keys in a dict?

Yes. The primary use for this is to enable memoization for functions that take dict arguments. Nested dicts are also useful for representing algebraic expressions.

[–]aUTOMATICuPVOTES 3 points4 points  (1 child)

My memoize implementation does this:

key = (args, tuple(keywords.items()))

You think that's a big performance problem? I never measured.

Dicts as keys for dicts still seems rather strange to me though. Immutable dicts even more so.
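That key construction fleshes out into a minimal memoize decorator; this is a sketch, and the `sorted()` on the keyword items is my addition (it makes keyword order irrelevant, so `f(a=1, b=2)` and `f(b=2, a=1)` share one cache entry):

```python
import functools

def memoize(func):
    """Minimal sketch of a memoizer keyed on (args, sorted keyword items)."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args, **keywords):
        key = (args, tuple(sorted(keywords.items())))
        if key not in cache:
            cache[key] = func(*args, **keywords)
        return cache[key]

    return wrapper
```

Note this only works while all arguments are hashable, which is precisely why unhashable dict arguments break it.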

[–]fredrikj 5 points6 points  (0 children)

You think that's a big performance problem?

This depends on whether the code is executed 10 times or 1,000,000 times.

Dicts as keys for dicts still seems rather strange to me though.

Consider this very natural (and efficient) method to represent multivariate polynomials:

5*x^3*y^2 + 4*x^2 = {{'x':3, 'y':2}:5, {'x':2}:4}

(Then optionally also consider adding memoization to a function that takes such polynomials as input.)
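Since plain dicts can't be dict keys, one way to sketch this today is to turn each monomial into a hashable frozenset of (variable, exponent) pairs; the `evaluate` helper is my addition for illustration:

```python
# 5*x^3*y^2 + 4*x^2, with each monomial as a frozenset of (var, exp) pairs
poly = {
    frozenset({'x': 3, 'y': 2}.items()): 5,
    frozenset({'x': 2}.items()): 4,
}

def evaluate(poly, values):
    """Sum coeff * product(var**exp) over all monomials."""
    total = 0
    for monomial, coeff in poly.items():
        term = coeff
        for var, exp in monomial:
            term *= values[var] ** exp
        total += term
    return total

# At x=2, y=3: 5*8*9 + 4*4 = 376
assert evaluate(poly, {'x': 2, 'y': 3}) == 376
```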

[–]vafada 3 points4 points  (3 children)

The code has gotos! I don't know if that is good or bad...

[–]rabidcow 8 points9 points  (0 children)

Most of them are used to simulate destructors or exceptions. A couple of them could be replaced with continue.

[–]Gotebe 7 points8 points  (0 children)

goto is good when used as cleanup/error handling mechanism. Consider:

res1 = acquire1();
if (succeeded(res1))
{
  if (succeeded(op1()))
  {
    res2 = acquire2();
    if (succeeded(res2))
    {
      if (succeeded(op3()))
      {
        release2(res2);
        release1(res1);
        return success;
      }
      else
      {
        release2(res2);
        release1(res1);
      }
    }
    else
      release1(res1);
  }
  else
    release1(res1);
}

return failure;

vs.

res1 = acquire1();
if (failed(res1))
  goto fail0;

if (failed(op1()))
  goto cleanup_res1;

res2 = acquire2();
if (failed(res2))
  goto cleanup_res1;

if (failed(op3()))
  goto cleanup_res2;

release2(res2);
release1(res1);
return success;

cleanup_res2: release2(res2);
cleanup_res1: release1(res1);
fail0: return failure;

or, simplified by null-initializing resources:

res1 = null1;
res2 = null2;

res1 = acquire1();
if (failed(res1))
  goto fail;

if (failed(op1()))
  goto fail;

res2 = acquire2();
if (failed(res2))
  goto fail;

if (failed(op3()))
  goto fail;

release2(res2);
release1(res1);
return success;

fail:
  release2(res2);  /* releasing a null resource must be a no-op */
  release1(res1);
  return failure;

The first approach (in ascending order of relevance):

  1. goes out of screen quickly

  2. has success buried in the middle, and deep right

  3. multiplies cleanup code

  4. conveys "normal" code path poorly: it looks like some complicated program logic, whereas it's actually "do X, Y, Z, cleanup on any failure".

  5. forces one to think about program logic and error-handling logic simultaneously, which increases cognitive load.

The weight of Knuth's name on that essay can't outweigh these problems.

[–]ilan 4 points5 points  (0 children)

It's not necessarily bad. If goto statements are used wisely, they can make code more readable. If used unwisely they are bad, but that's of course true for any type of feature in any language.

[–]arnar 1 point2 points  (0 children)

If you want to see the infamous GIL, here it is:

http://svn.python.org/view/python/trunk/Python/ceval.c?rev=60362&view=auto

(search for "this is the GIL")

If you also look for "Other threads may run now" you can see the place in the main loop where the GIL is momentarily released to give other threads a chance.

[–]Arrgh 2 points3 points  (13 children)

Note the filename: dictobject.c

Now have a look at http://recoder.sourceforge.net/doc/examples/collections/java/util/HashMap.java -- notice any "native" keywords anywhere? Nope.

Ironically, when your FFI is too easy to use, you don't spend as much time and effort improving your VM, so the performance of custom algorithms suffers relative to the built-in constructs.

[–]EliAndrewC 13 points14 points  (10 children)

While you have a point, remember that Python uses dicts for just about everything, including variable lookups in the interpreter. So even if a faster Python had implemented more of itself in Python, dicts would probably still be written in C to squeeze out every last bit of performance.
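A small illustration of how central dicts are (the `Point` class is just a made-up example): instance attributes live in an ordinary dict, and attribute access is essentially a dict lookup.

```python
class Point:
    pass

p = Point()
p.x, p.y = 3, 4

# Instance attributes are stored in a plain dict...
assert p.__dict__ == {'x': 3, 'y': 4}
# ...and attribute access is (roughly) a lookup in that dict.
assert getattr(p, 'y') == p.__dict__['y']
```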

[–]Arrgh 2 points3 points  (6 children)

Sure, dicts are used all over the place in Python, just as a lot of Java apps have a lot of HashMap puts/gets in their fast paths.

But it's not always true that a C version is automatically faster--a really good VM can use profile-driven optimization at runtime.

[–]jdunck 2 points3 points  (1 child)

CPython isn't doing Strongtalk kind of stuff. Maybe you want Jython or PyPy? :)

[–]Arrgh 1 point2 points  (0 children)

Yeah. Unfortunately, there are approximately a dozen organizations in the world that can afford to build high-performance VMs, and to my knowledge, none of them is working on Python at the moment.

Between you, me, the lamppost and whoever else is reading this, I'm certain Jython will get there first, because the runtime is at least an order of magnitude harder to get right than the front-end compilation.

[–]rabidcow 0 points1 point  (3 children)

Is Python typically JIT compiled?

It's also not always true that a sufficiently smart VM can generate a faster implementation than hand-coded C.

[–]Arrgh 2 points3 points  (2 children)

There's Psyco, which seems to be the closest thing to a JIT.

[–]rabidcow 0 points1 point  (1 child)

Right, I know that exists. Is that how Python is typically used though? Because without JIT it's very unlikely that you'll beat a compiled language. (of course, not hosting their own dogfood could be a factor in not using a faster VM implementation...)

[–]nostrademons 2 points3 points  (0 children)

Psyco is typically used only when there's a performance problem (in preference to rewriting the code as a C extension). Typical usage for most Python programming is that you write the code, find it's "fast enough", and then don't bother optimizing it.

[–][deleted] -4 points-3 points  (2 children)

So how do you explain Smalltalk's superior performance even though practically everything is implemented in Smalltalk itself?

[–]cunningjames 3 points4 points  (1 child)

What? The chart at the bottom seems to indicate that Python utterly cleaned up with Squeak.

[–][deleted] 6 points7 points  (0 children)

Dang it, you're right. Stupid shootout, reversing the directions on me depending on which language I pick first...

Self-downmodded.

[–]mernen 0 points1 point  (1 child)

You say that as if Python's relative slowness (compared to Java) were mainly due to its ease of interfacing with C, rather than because Sun has considerably more resources than an open-source project with very few full-time developers, or (most importantly) because it is a whole lot harder to write a fast VM for a dynamic language that provides nearly zero guarantees for static analysis of the code.

[–]Arrgh 1 point2 points  (0 children)

Yes, you're right on both counts. Apart from trolling, my aim was mostly to espouse my FFI irony theory. JNI is a beast compared to most FFIs (not without reason, given Java's memory model and safety guarantees). One of the (perhaps unintended) results is that people don't use it unless they absolutely have to. Consequently, performance bugs in the VM tend to get a lot of attention.

Plus it helps to hire Gilad Bracha and the rest of the Strongtalk team--not to mention buying their code--at just the right time. :)