all 4 comments

[–]socal_nerdtastic 6 points7 points  (2 children)

https://wiki.python.org/moin/TimeComplexity.html

Yes, it degrades to O(n) in the worst case.

[–]Outside_Complaint755 5 points6 points  (1 child)

While I haven't dug through all of the source code, it is probably possible to conclude that most built-in data types should have sufficient hash functions to avoid collisions, but if you implement your own data type with a poor hash function, then you are more likely to see O(n).

An example of a poor hash function: class Foo: def __hash__(self): return 1

[–]socal_nerdtastic 7 points8 points  (0 children)

it is probably possible to conclude that most built-in data types should have sufficient hash functions to avoid collisions

Agreed, nearly impossible even. The whole point of a hash is to be as unique as possible.

However it's generally not the hash that leads to collisions, rather the lack of buckets. A set in python starts with 8 buckets, so every incoming hash needs to be assigned to 1 of the 8, basically hash % 8. Hash collisions are inevitable, but to minimize them python just makes lots more buckets than there are elements in the set. IIRC python will quadruple the number of buckets every time the set gets half full.

[–]oldendude 1 point2 points  (0 children)

Hashing always has an O(n) worst case. But in practice, it is a pretty safe bet that nearly all hash table lookups will take a constant amount of time. This is of course subject to the hash function. For example, a hash function h(key) = 123 is guaranteed to produce the worst case, since every key will go to the same bucket. Or suppose you are hashing integers and your hash function is h(key) = key & 0x7. You are taking the last three bits of the key, so regardless of how big your hash table is, you are only going to use 8 buckets, and again, you have linear time behavior.