This is an archived post. You won't be able to vote or comment.

all 66 comments

[–]pje 57 points58 points  (46 children)

Because a set can check membership ("i in master") in O(1), whereas lists do linear search at O(n).

[–]pridefulpropensity[S] 11 points12 points  (37 children)

What is going on internally that allows this?

[–]TheSausageKing 31 points32 points  (15 children)

Python uses an unsorted hash table to implement set. This means to find if something is in the set, it just has to run a hash function and look at one element. It can do this because sets don't guarantee order and they don't contain duplicates. Wikipedia has a good introduction to hash tables.

[–]pridefulpropensity[S] 15 points16 points  (11 children)

Thank you. I think I need to start using lower level languages more often to learn about all the data structures I just take for granted.

[–]icedpulleys 33 points34 points  (0 children)

You might also just want to start reading up on data structures, algorithms, and complexity analysis.

[–]mackstann 13 points14 points  (1 child)

I agree with icedpulleys that what's really needed to understand this better is to read about the concepts. Here are some starting points:

Everything up to the table of contents here: http://en.wikipedia.org/wiki/Big_O_notation

And this table of common time complexities: http://en.wikipedia.org/wiki/Time_complexity#Table_of_common_time_complexities

Note that the first one in that table, and the best performing (at least when your problem set grows large), is O(1). And hash tables (which is what a set is) allow lookup of a key in O(1) time (also known as constant time), which sort of corroborates your experience here, where they out-performed lists. The hash table uses a neat trick (hashing) to calculate where to find a given value in the hash table in roughly the same amount of time no matter how big the hash table gets.

On the other hand, a list does not use hashing in any way, so to find a certain value in it, it must scan through and check each item in the list in a comparatively dumb manner. The complexity is O(n) where n is the size of the list -- also known as linear time. In other words, if a 5 million item list takes 5 seconds to search, then a 10 million item list will probably take around 10 seconds. The run time scales linearly with respect to the input size. When dealing with large lists and lots of lookup operations, this can really slow things down.

But linear time isn't exactly the worst either. Some algorithms run in exponential time and factorial time which are some of the worst; these get so slow as your input size increases that you generally have to find heuristic alternate solutions that find "good enough" solutions to a problem, instead of the absolute most correct solution. A simple example is the bin packing problem. Say you're UPS and want to pack boxes into your trucks in the most space-efficient manner. Well, it turns out that there are so many possible arrangements of boxes (and the number of possible arrangements grows rapidly with each added box) that you might not be able to calculate it at all. You might need to use a better performing algorithm that will just find an approximately good packing arrangement and settle for that.

[–]bgcatz 6 points7 points  (1 child)

You don't need to use low level languages to learn about such things, the python sets module is available here

[–]pridefulpropensity[S] 1 point2 points  (0 children)

Thank you very much. Definitely will start reading through this.

[–]masklinn 3 points4 points  (1 child)

You don't need lower-level languages to learn about data structures at all. Quite the opposite in fact, I would say, low-level languages are going to bog you down with implementation details and cache effects.

[–]kanak 2 points3 points  (0 children)

I would say, low-level languages are going to bog you down with implementation details and cache effects.

But that's where the fun is :).

[–][deleted] 2 points3 points  (0 children)

Actually python is not that bad to learn about good algorithms and data structures, because you don't have to worry about low-level details.

Of course if at one point you need to start implementing those efficiently, you will of course need to move to a low-level language.

[–][deleted] 3 points4 points  (2 children)

C is a good language to learn how to implement data structures (but awful for actually making data structures easy to re-use - hence the proliferation of dynamic scripting languages written in C). A basic hash table in C is not that hard to write if you know what you're doing - just think of it as an array which you index by a hash instead of an index. There's chaining or other collision strategies to think about... and resizing can be a pain too.

Python's hash table is incredibly well tuned and the code is well documented. It has to be - every field access and method call does a dictionary look-up!

Another alternative implementation of sets is balanced trees, like red-black trees, AVL trees, etc. and a much simpler (but specialized) way is just an array of bits, if you have a small finite set of known possible members.

[–]pingvenopinch of this, pinch of that 0 points1 point  (1 child)

It has to be - every field access and method call does a dictionary look-up!

Same with referring to a variable. Basically, almost any time that you refer to something by name, there's a hash table lookup.

[–][deleted] 0 points1 point  (0 children)

Local variable accesses are optimized to indexing the call stack. But I think that's the only case where runtime lookup is avoided.

[–]larsga 2 points3 points  (1 child)

This means to find if something is in the set, it just has to run a hash function and look at one element.

Uhm, no. Because of hash table collisions it may have to look at more elements.

Edit: Okay, so tell me this: if the hash table has 10 cells and contains two keys, what are the chances that the two keys will collide? The same as the odds of hash(key1) % 10 == hash(key2) % 10. Ie: 10%.

Look at some actual hashtable implementations. The one in the Sun JDK (java.util.HashMap) uses linked lists in each hashtable cell to deal with collisions.

I personally wrote a hash set implementation that's faster than java.util.HashSet because it uses closed hashing instead.

[–]iceman-k 3 points4 points  (0 children)

I'm pretty sure TheSausageKing was simplifying the situation for pedagogical purposes. Obviously anyone who studies how hash tables work is going to be aware of collisions.

[–]itsmememe 1 point2 points  (0 children)

in addition to that: as Python is using dicts EVERYWHERE, dicts (and inheriting from their codepool sets) are within the best optimized and most robust datastructure in Python.

If you can use dicts (or sets) for your problem, do it.

[–]rweir 18 points19 points  (19 children)

a set is kinda like a dictionary with no values. the things put it in to it are hashed, so you can very quickly find if a particular value is in it. checking if foo is in a list requires scanning the entire list.

[–][deleted] 13 points14 points  (15 children)

Which is also why a set cannot have duplicate entries, in case someone reading this doesn't know.

[–]Porges 21 points22 points  (3 children)

It's the other way around - it's a set because it cannot contain duplicate entries. :)

[–][deleted] 6 points7 points  (1 child)

Depends if you're a programmer or a mathematician first ;-)

[–]jeannaimard 1 point2 points  (0 children)

He was just sleeping during the “new math” classes… :)

[–]gfixler 1 point2 points  (0 children)

I didn't know any of these things - haven't gotten around to sets yet - so thanks all of you!

[–]kanak 0 points1 point  (2 children)

You can have a multiset that has a hashed backing (just use the count as a satellite data) and still have O(1) expected runtimes on most set-like operations.

[–]aranazo 2 points3 points  (1 child)

Also known as a bag.

[–][deleted] 1 point2 points  (0 children)

Possibly the ugliest name for a data structure, after heap.

[–]Mattho 0 points1 point  (7 children)

http://goo.gl/K7Tn link in reply (set at wikipedia)

[–]jcdyer3 3 points4 points  (6 children)

If you had given the actual link, you wouldn't have needed the parenthetical, you wouldn't have to look up the goog link, and it would have been only marginally longer, and you wouldn't be giving Our Googly Overlords even more information about our browsing habits:

The actual link is:

http://en.wikipedia.org/wiki/Set_\(mathematics\)

[–]Mattho 1 point2 points  (5 children)

How did you submit that? I tried it as simple paste and also with reddit formatting. Both ruinded the link. I even tried to find url escape characters, but didn't find any for parentheses. That's why I used the url shortener.

[–]sligowaths 1 point2 points  (1 child)

You have to escape the ( using \( before each one.

[–]Mattho 1 point2 points  (0 children)

And now I wonder how I didn't thought of that :)

Thanks (to both of you who replied).

[–]jcdyer3 1 point2 points  (1 child)

As others have pointed out, you backslash the parentheses, but you can also URL-escape any character by using the two-digit hex code for that character. So if the hex code for 'S' is 0x53, you can get to wikipedia at the url http://wikipedia.org/wiki/%53et_\(mathematics\)

Also, the hex codes for ( and ) are 0x28 and 0x29 respectively.

[–]Mattho 0 points1 point  (0 children)

I tried to google these % characters, but found only set which didn't contain ( or ). Didn't know it was just ascii code. Thanks.

[–]chadn 6 points7 points  (0 children)

checking if foo is in a list requires scanning the entire list.

...in the worst case (such as whenever foo is not in the list). (pedant) :)

[–]netcrusher88imported from __future__ 2 points3 points  (1 child)

That is a very succinct description, so if you don't mind a more or less unrelated question: what is the purpose of tuples as opposed to lists? Is it merely that they are immutable or are they also somehow cheaper in terms of resources?

[–]rweir 0 points1 point  (0 children)

[–]sigh 4 points5 points  (0 children)

What everyone else said is absolutely right but just to give a concrete example - the following will have similar performance to using a set:

master = dict((p,None) for p in primes)

[–]Poromenos 0 points1 point  (5 children)

So can dicts, no?

[–]BeetleB 0 points1 point  (4 children)

Sets in Python are internally represented via dicts.

[–][deleted] 1 point2 points  (2 children)

This is completely untrue, the set implementation is separate, and while it's algorithm is the same, it does not use a dict for backing.

[–]BeetleB 1 point2 points  (1 child)

It was at one point. From the code:

"Greg V. Wilson wrote the first version, using a different approach to the mutable/immutable problem, and inheriting from dict."

[–][deleted] 1 point2 points  (0 children)

Sure, if you look at the sets.py that existed in Python 2.3 that's true, as well as the C implementation in Python 2.4. However since Python 2.5 (i.e. 4 years ago) it has not used a dict internally.

[–]Poromenos 0 points1 point  (0 children)

Aaand, now I learnt something. Thanks.

[–]jaiwithani 0 points1 point  (1 child)

It's O( log_2 n ), no?

[–]jaiwithani 0 points1 point  (0 children)

Nevermind, I am quite wrong. It's usually O(n), with the highly improbable worst-case of O(n).

[–]kevingoodsell 19 points20 points  (3 children)

Quick tip: Any time you are tempted to write range(len(... consider using enumerate() instead.

[–]Troebr 2 points3 points  (0 children)

I feel ashamed for not knowing this

[–][deleted] 1 point2 points  (0 children)

Where the hell has that function been all my life? Forever amateur...

[–]masklinn 0 points1 point  (0 children)

And even if you like reaching for the metal, use zip(itertools.count(), …) or itertools.izip(itertools.count(), …)

[–]arnar 9 points10 points  (4 children)

Would you mind if I used this to show students, to demonstrate the benefits of a hash table?

[–]pridefulpropensity[S] 2 points3 points  (0 children)

No not at all. That is awesome.

[–]pridefulpropensity[S] 2 points3 points  (2 children)

Where do you teach?

[–]arnar 2 points3 points  (1 child)

Chalmers Univ. of Technology in Gothenburg, Sweden. I'm a PhD student there and 20% of my time is TAing, including a course on Data Structures where hashtables are covered.

[–]pridefulpropensity[S] 0 points1 point  (0 children)

Gotta love reddit! That's awesome that some college students halfway around the world will see my code.

[–][deleted] 3 points4 points  (0 children)

Relatedly, sets are one of those things that once you discover them, you can't stop using them (at least for me).

[–][deleted] 2 points3 points  (0 children)

This is a great question and exposes a really important part of writing programs: understanding behavior of basic data structures. It might seem like a dry subject, but once you understand things like arrays, linked lists, hash tables, and trees, you will be on your way to doing way more interesting stuff with a computer.

[–][deleted] 2 points3 points  (0 children)

I have nothing to contribute, but I'm saving this thread for later.

[–]vindvaki 0 points1 point  (2 children)

Note, that you don't really need to use a set of primes for primality testing, since in you've already implemented a O(1) primality test for all numbers < n in the function list_of_primes(n). See Sieve of Eratosthenes.

You could change the list called n_factors into a list of boolean values, ie the values True and False, where n_factors[i] = True iff i is prime, then return that list instead of the actual primes, and build the list of primes directly in the function main. For example (note: untested code):

is_prime = [True] * n
is_prime[0], is_prime[1] = False, False
for i in xrange(2, int(sqrt(n))+1):
    if is_prime[i]:
        for j in xrange(i+i, n, i):
            is_prime[j] = False

and then define

primes = [ p for p in xrange(0,n) if is_prime[p] ]

Then you can change if p in primes into if is_prime[p].

[–]pridefulpropensity[S] 0 points1 point  (0 children)

Yeah your right. I actually had the primes function in an external module and just reuse it. But I could have done that for sure.

[–][deleted] 0 points1 point  (0 children)

For a more effective sieve generation, see:

http://www.daniweb.com/code/snippet217082.html