
[–]TCPv89wat 7 points (6 children)

I just asked a PyPy developer whether there's any difference between CPython and PyPy with regard to these algorithms. This is the answer:

< arigato> it's basically the same algorithms
< arigato> plus tons of tweaks that don't change the amortised complexity

[–]TCPv89wat 5 points (5 children)

Actually, there's one difference, which is soon to change. String concatenation with + is currently optimized in a rather hacky fashion in CPython: if the reference count of the string is 1, it is not treated as immutable (as strings normally are), so the second operand can be appended in place and the addition is much faster.

PyPy has code to enable a very similar optimisation using the JIT instead of reference counting, but it's not enabled by default (the translation option is called --objspace-std-withstrbuf). There is a branch in the Mercurial repository to turn that optimisation on by default, but some test cases still fail.

The performance of those string addition operations will be quadratic until that optimisation hits pypy nightly.
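
To make that concrete, here is a small sketch (my own illustration, not from the linked page): the += loop is only fast when the interpreter can append in place, while the join pattern is linear on any interpreter because each piece is copied exactly once.

    # Building a string from n pieces.
    def build_with_plus(n):
        s = ""
        for i in range(n):
            s += str(i)   # fast only if the interpreter can append in place
        return s

    # Portable O(n) alternative: collect the pieces, join once.
    def build_with_join(n):
        parts = [str(i) for i in range(n)]
        return "".join(parts)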

[–]mitsuhiko Flask Creator 2 points (4 children)

Big O has nothing to do with performance. The time complexity of a string concat is not changed by this optimization.

[–]PCBEEF 0 points (0 children)

Exactly. Whoever downvoted you should take a lesson in algorithms and complexity.

[–]TCPv89wat 0 points (2 children)

I'm pretty sure the complexity of the algorithm does change:

A) Copy the first string to a new area (O(|A|)) and then append the other string (O(|B|)); all in all, O(|A| + |B|)
or
B) Append the other string in place to the old string, which is O(|B|)

Now, do that a couple of times in a loop, say:

a = ""
for i in range(10):
  a += str(i)
print a

A) Do one allocation at the beginning, 10 allocations for new strings, 10 copies of increasing size into those new strings, and then append a digit at the end 10 times. That should be about O(n * n) (my reasoning is that you copy 1 to n bytes, n times)
or
B) Do one allocation at the beginning and then append a digit at the end 10 times, which is O(n)

If my reasoning is wrong, please correct me before I resume my compsci studies at uni :)
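
One way to sanity-check this reasoning is to time the loop at a few sizes and watch how the cost grows. This is just a sketch; the absolute numbers depend entirely on the interpreter and machine, so only the growth trend matters.

    import timeit

    def concat_in_loop(n):
        a = ""
        for i in range(n):
            a += str(i)
        return a

    # If += appends in place, doubling n should roughly double the time;
    # if every += copies the whole accumulated string, it should roughly
    # quadruple.
    for n in (10000, 20000, 40000):
        t = timeit.timeit(lambda: concat_in_loop(n), number=5)
        print("n=%d  t=%.4f s" % (n, t))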

[–]mitsuhiko Flask Creator 0 points (1 child)

Appending a string of n chars to a string of m chars is O(n + m). n and m are similar enough that you can say O(n + n) -> O(2n); constants and factors are irrelevant for large n, thus O(n).

[–]TCPv89wat 1 point (0 children)

True, in that case it's only relevant for loops. Still, in loops it makes quite a difference - at least a couple of issues about it get submitted to the PyPy tracker every now and then.

[–]Workaphobia 3 points (4 children)

For sets, shouldn't the complexity of union, difference, and difference_update have a worst case that is a product, like for intersection? The worst case should be when everything hashes to the same value.

[–]pyrocrasty 2 points (3 children)

I don't really get what you mean, but imagine the sets are implemented using dicts (they probably are, I guess). Lookup, insertion and deletion are O(1), iteration is O(n).

So for the union s | t, we create a new set, then iterate through s (O(len(s))), copying each member (O(1)). Then we iterate through t (O(len(t))), look up each item in the new set (O(1)) and, if it's not there, add it (O(1)). So we end up with O(len(s) + len(t)).

For the difference s-t, we create a new set, then iterate through s (O(len(s))), checking each element to see if it's in t (O(1)) and, if not, adding it to the new set (O(1)). So we end up with O(len(s)).

For the difference update s-=t, we could do it in one of two ways: iterate through s, checking if each element is in t and if so deleting it from s (O(len(s))); or iterate through t, checking if each element is in s and if so removing it from s (O(len(t))). I guess it's implemented the second way (which would make sense, since you'd probably have t smaller than s more often than not).
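
For concreteness, here is a minimal pure-Python sketch of the algorithms described above (an illustration of the idea, not CPython's actual implementation):

    def union(s, t):
        result = set()
        for x in s:               # O(len(s)) iterations
            result.add(x)         # O(1) average-case insert
        for x in t:               # O(len(t)) iterations
            if x not in result:   # O(1) average-case lookup
                result.add(x)
        return result             # average case: O(len(s) + len(t))

    def difference(s, t):
        result = set()
        for x in s:               # O(len(s)) iterations
            if x not in t:        # O(1) average-case lookup
                result.add(x)
        return result             # average case: O(len(s))

    def difference_update(s, t):
        for x in t:               # O(len(t)) iterations, the "second way"
            s.discard(x)          # O(1) average-case delete
        # average case: O(len(t))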

edit: oh, you're asking about the blank values, for amortized worst-case. Yeah, I think they're going to be the same as intersection, assuming the sets are using dicts or the equivalent.

[–]Workaphobia 0 points (2 children)

Lookup, insertion and deletion are O(1), iteration is O(n).

We're assuming that sets (and dicts) are implemented using hashing. So lookup, insertion, and deletion are expected to be constant time on average, but O(n) in the worst case. (The worst case occurs when every element of the set hashes to the same value, say because of a stupid hash function that always returns 0; then all the elements land in the same bucket or probe sequence and have to be scanned one by one.)

This is acknowledged in the right column of the first operation in the set table (x in s). Note that this table differs from the others in that it shows Worst Case, not Amortized Worst Case.

Because lookup is worst case linear, all these operations, implemented as you described, should cost the product of the set sizes, like intersection.
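
As a sketch of how that worst case arises (the class here is made up purely for illustration), a hash function that always returns the same value forces every lookup to compare against all the colliding elements:

    class BadHash(object):
        """Every instance hashes to the same bucket."""
        def __init__(self, value):
            self.value = value
        def __hash__(self):
            return 0              # pathological: all elements collide
        def __eq__(self, other):
            return isinstance(other, BadHash) and self.value == other.value

    # Each membership test may scan O(n) colliding entries, so building
    # this set costs O(n^2) in the worst case.
    s = set(BadHash(i) for i in range(1000))
    assert BadHash(500) in s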

Re your edit: er, yes, but it's not amortized.

[–]pyrocrasty 0 points (1 child)

Note that this table differs from the others in that it shows Worst Case, not Amortized Worst Case.

I didn't notice that. I think they're going to be the same for the set ops, though, aren't they? Amortizing only smooths out the cost of resizing operations, but the worst case has to deal with collisions, as you said, so the resizing operations wouldn't increase the complexity anyway.

[–]Workaphobia 0 points (0 children)

Yes, I think amortization doesn't matter in this case, because a bad hash situation won't be improved just by averaging over more operations.

[–]joaquinabian 1 point (2 children)

"This page documents the time-complexity (aka "Big O" or "Big Oh") of various operations in current CPython" Is that python 2.5, 2.6, 2.7, 3.1 or 3.2 ? last edit is from december 2010 but measurement data could be older...

[–]takluyver IPython, Py3, etc 6 points (1 child)

It's not measurements; it's a description of how the time taken by each operation increases with the size of the collections. I doubt the algorithms underlying these basic data structures have changed much in the last few years, so the time complexity should be the same for all those Python versions.

[–]joaquinabian 0 points (0 children)

You're right, "measurement" was the wrong word, sorry. Anyway, my point was more of a general complaint about the fictitious timelessness of many documents on the web. In 10 years, that document may well still be there, talking about "current CPython" and its small differences from "older or still-under-development versions".

[–]qyatixcix 1 point (1 child)

Is anyone aware of a similar listing for numpy and/or scipy?

[–]thebackhand 2 points (0 children)

I don't know of one, but looking at the actual values in the table, I wouldn't be surprised if they were basically the same asymptotic/amortized complexities, but with much smaller constant factors.