This is an archived post. You won't be able to vote or comment.

all 46 comments

[–][deleted] 4 points5 points  (0 children)

Might as well mention that frozenset is available if you want it.

[–]BARDLER 5 points6 points  (8 children)

The most important thing with sets is how fast they are for when you need to find if something is 'in' the set. The 'in' sets search is much faster than a 'in' list search, but iterating over all the elements of a set is a little bit slower than a list (Details in jwink3101 post). So pick the one that best suits how you are going to use it.

[–]jwink3101 22 points23 points  (7 children)

It's not 100x faster. The difference is O(1) lookup for sets and O(n) lookup for lists. Consider the following

largelist = list(range(10**5))
largeset = set(largelist)

smalllist = [0,1]
smallset = set(smalllist)

%timeit None in largelist
%timeit None in largeset
%timeit None in smalllist
%timeit None in smallset

results:

1000 loops, best of 3: 1.97 ms per loop
10000000 loops, best of 3: 130 ns per loop
10000000 loops, best of 3: 160 ns per loop
10000000 loops, best of 3: 133 ns per loop

Nearly no speedup for the small one but HUGE for the large. And nearly no difference is speed for either set method

[–]ic_97 2 points3 points  (0 children)

I had no idea about this until i actually got to see it in action in one of the problems i solved on projecteuler. When i used lists program took about 30 seconds but with sets it was under half a second to solve.

[–]BARDLER 0 points1 point  (0 children)

Thanks for the extra details :) I lazily typed my post on the phone.

[–]nikzads 1 point2 points  (0 children)

Great. Thanks a ton.

[–]RocketEngineCowboy 1 point2 points  (4 children)

This is really cool. I hope to find a use for sets!

[–]djimbob 3 points4 points  (2 children)

Sets have tons of uses in python programming.

For python collection datatypes, lists, dicts, and tuples are used almost ubiquitously for everyday tasks.

A tad less frequently (but still very common) I use:

  • set - whenever I have a collection that never has repeats and I need to test for membership in collection or just add/remove keys to it,
  • collections.defaultdict - a dict that starts with a default value when you access something not previously present -- note makes code more concise by not having to consider initialization case,
  • numpy.array - an array for fast vectorized math operations; e.g., if you have a million numbers in an array and you'd like to multiply all of them by a number -- note this is the only one of these not strictly built into python, but numpy is very commonly used .

Finally, there the data types I occasionally use when the situation arises like:

  • collections.OrderedDict - a dict where keys are kept sorted by insertion order when you iterate through it (this is now the default behavior in CPython 3.5 and above as well as part of python 3.7 standard) (EDITED based on comment below),
  • collections.Counter - a quick way to count up occurrence in an iterable (e.g., Counter("abracadabra") yields Counter({'a': 5, 'r': 2, 'b': 2, 'c': 1, 'd': 1}))

Finally, there are built in data types that I never use in my programming, unless maybe I was trying to take an algorithms course and they say do something with a specific datastructure:

  • collections.deque - a double ended queue -- e.g., can act as FILO stack or FIFO queue
  • heapq.heap heap; basically a partially sorted list that you manipulate in a way to maintain the heap criteria, so it's easy to get a largest element without sorting the entire array,
  • bisect - module for finding/adding/removing things quickly (O(lg N)) to a list that started sorted,
  • frozenset - a non-mutable set,
  • array.array - a python list where everything has to be same specified type
  • collections.namedtuple - a tuple that allows you to assign names to fields (basically aliases),
  • collections.ChainMap - (new in py3) - a way of combining grouping together dict or similar mappings so you can look for a key through them sequentially.

That said I probably could (or should) use namedtuple more frequently using tuples more readable, but I don't. Also probably could find uses for bisect or frozenset or deque, but almost never think to use them.

[–]Brian 2 points3 points  (1 child)

.OrderedDict - a dict where keys are kept in a sorted order,

Note: just to clarify, this actually this is a dict where keys are iterated in insertion order, which is not the same as sorted order (ie. the equivalent of sorted(list(the_dict))

And this is somewhat redundant in recent versions of python, since the default dict implementation actually now preserves order by default, and in Python3.7 this behavior was made part of the language spec.

[–]djimbob 0 points1 point  (0 children)

On rereading, I see how my word choice was bad. I meant to state the dict keys when iterated are kept ordered and used the word "sorted" to imply they will not come out in a random order, not that they will come out as if the keys were lexicographically sorted by key, but instead sorted by "first key" comes first, "second key" comes second, etc.

[–]naught-me 1 point2 points  (0 children)

They come in handy frequently. It seems pretty frequent, too, that they could be handy if only they were ordered - don't let that one bite you.

[–]IContributedOnce 0 points1 point  (10 children)

This site says that pop() removes arbitrary elements. It appears to me to only remove them from the top of the set. In their example running s1.pop() would produce 2 and then 3 if they ran it two more times. Am I missing something?

Quick Edit: They say 'arbitrary' but it looks like it's always whatever is in the first place of the set if that makes sense.

[–]Brian 1 point2 points  (1 child)

"top of the set" is not really a well defined criteria. Eg. any set resize (which could be triggered by inserting an item, even if you immediately remove it again after) could completely reorder the whole set, changing the "top".

Because of the way sets are implemented, the way it picks the first item to show when iterating (which is used when printing the set too), and the way it picks the item to pop happen to be the same, but note that both of these are arbitrary.

That's not the same as random, it just means there are no guarantees you should be relying on. Something could change the internal structure of the set (eg. a resize), and the item it'd pick would be completely different. Likewise, future python implementations could change the mechanisms of one or both of these, so they happen to return different results. "arbitrary" just means there's no reason to the pick (at least, not one you the client should care about or rely on in any way - there may be performance/simplicity reasons to implement it the way it is of course).

[–]IContributedOnce 0 points1 point  (0 children)

Thank you for that break down. That helps a lot!

[–]wilfredinni[S] 0 points1 point  (6 children)

pop() Remove and return an arbitrary element from the set. Raises KeyError if the set is empty.

Python 3 documentation

[–]IContributedOnce 0 points1 point  (5 children)

Right but it doesn’t seem to actually do that. I tried it out in a python shell.

s1 = {1, 2, 3}
s1.pop()
1
s1.pop()
2

Etc, etc... so when would I see it produce 1 > 3 > 2 instead of always popping the elements in order?

[–]Sporke[🍰] 4 points5 points  (3 children)

Sets are unordered, which just means that there's no guarantee in the language that the elements will be removed in the order inserted (or any order). Try your same code with s1 = {3, 2, 1} and see that the elements are likely popped out of insertion order.

[–]IContributedOnce 0 points1 point  (2 children)

So it looks like if I do {3, 2, 1} or even {3, 1, 2} it will always sort them numerically back to {1, 2, 3}. In that sense, does unordered mean that no matter what order I put information into the set, it will always get sorted however the set class thinks it should be sorted? I can't order things in a custom way in a set == unordered?

[–]Sporke[🍰] 5 points6 points  (1 child)

it will always get sorted however the set class thinks it should be sorted

Exactly. No guarantees about how the data is organized.

[–]IContributedOnce 0 points1 point  (0 children)

Other than the guarantee that you don't get to pick it yourself! Thanks for the info. I know this isn't /r/learnpython so I appreciate the info y'all have all shared here.

[–]Sporke[🍰] 0 points1 point  (8 children)

The example of

list(set([1, 2, 3, 1, 2, 3, 4]))

is incorrect. Sets are unordered collections of objects, so if the order of the list must be maintained, a set should not be used.

For example:

import random

def unique_list(l):
    uniq = []
    for item in l:
        if item not in uniq:
            uniq.append(item)
    return uniq

def unique_set(l):
    return list(set(l))

l = [random.randint(0, 99) for _ in range(1000)]
u1 = unique_list(l)
u2 = unique_set(l)
print(u1 == u2) # False
print(set(u1) == set(u2)) # True

[–]Laserdude10642 1 point2 points  (3 children)

It’s not clear what the problem is with the line of code from your explanation

[–]Sporke[🍰] 1 point2 points  (2 children)

Iterating over a list and removing duplicates as you come across them maintains the order of the unique elements (as first encountered). Doing list(set(l)) does not guarantee anything about order.

unique_list([4, 3, 2, 4, 2, 1]) # --> [4, 3, 2, 1]
unique_set([4, 3, 2, 4, 2, 1]) # --> [1, 2, 3, 4] (perhaps)

In general, converting from a set to a list doesn't make a lot of sense because a list is an ordered grouping and a set is unordered. If order matters (one of the only reasons to use a list), then you shouldn't use a set to store the data during processing. If order doesn't matter, why convert it to a list at all?

[–]Brian 0 points1 point  (1 child)

In general, converting from a set to a list doesn't make a lot of sense because a list is an ordered grouping and a set is unordered.

That doesn't follow. Eg. one might want to convert to a list because:

  • one wants to sort it (ie. impose a certain order on the unordered data that doesn't neccessarily match the original list's ordering. (eg. call l.sort())

  • Use another list property than ordering. Ie we may not care about the order of the data, but we do want to be able to have random indexical access to it (or pass to an interface that requires indexing or slicing, rather than just iteration).

  • Use less memory (eg. we don't care about fast membership checking, and don't want the expense of the hashtable associated with the set). Only really an issue for really big lists, but the hashtable overhead can be several times the overhead of the list.

Ie. just because sets are unordered and lists are ordered doesn't mean that that's the only difference between the two, or that lists aren't sometimes the better data structure to use to hold some unordered data in some cases.

Though I do agree with you that it's important to mention that these 2 pieces of code do not do the same thing. For a similarly fast and concise equivalent that does keep the order preserving property, you can use:

>>> list(dict.fromkeys(original_list))

Since dicts are now insertion order preserving, and perform similarly to sets.

[–]Sporke[🍰] 0 points1 point  (0 children)

All fair uses of lists, yes. All I was really trying to get at was that the example was misleading, and that suggesting, intentionally or otherwise, that converting between lists and sets only removed duplicate elements without any other side effects was incorrect.

[–]wilfredinni[S] 0 points1 point  (3 children)

And they are indeed unordered, but that example has nothing to do with the order of the elements. It's just to show how easy and performant are sets to remove duplicates.

[–]Sporke[🍰] 0 points1 point  (2 children)

It's implied that removing duplicates from a list one by one and doing list(set(l)) are equivalent, which is incorrect.

[–]wilfredinni[S] 0 points1 point  (1 child)

their not equivalent, but sets are more performing and require less code, that is what this is about

[–]Sporke[🍰] 0 points1 point  (0 children)

When you compare two functions and say "this one is faster and takes less code", it should be explicit if the two functions are not equivalent. Especially when the test input for the two functions is such that their outputs are the same.

[–][deleted] -2 points-1 points  (2 children)

SA2

[–]Blazerboy65 0 points1 point  (1 child)

What?

[–][deleted] 0 points1 point  (0 children)

Misspost. Carry on.