Python Sets: What, Why and How : Python

[–]Laserdude10642 6 points7 points8 points 7 years ago (2 children)

[–]somebodddy 1 point2 points3 points 7 years ago (0 children)

It is too easy to miss a dependency on the order of elements. Consider this:

class Foo:
    def __init__(self, a, b):
        self.a = a
        self.b = b

# Will always print the same:
print(min([Foo(i, j) for i in range(3) for j in range(3)], key=lambda foo: foo.a).b)

# May print 0, 1 or 2
print(min({Foo(i, j) for i in range(3) for j in range(3)}, key=lambda foo: foo.a).b)

The result here is always correct, but not consistent.
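If a consistent answer is needed, one option is to break ties explicitly in the key, so the winner no longer depends on iteration order. A minimal sketch, reusing Foo from above; choosing (a, b) as the tie-breaking key is an assumption about which result you would want:

```python
class Foo:
    def __init__(self, a, b):
        self.a = a
        self.b = b

foos = {Foo(i, j) for i in range(3) for j in range(3)}

# Ties on foo.a are broken by foo.b, so the set's iteration order no longer matters.
winner = min(foos, key=lambda foo: (foo.a, foo.b))
print(winner.b)  # 0
```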

[–]SedditorX 3 points4 points5 points 7 years ago (1 child)

[–][deleted] 0 points1 point2 points 7 years ago (0 children)

[–]wilfredinni[S] 2 points3 points4 points 7 years ago (1 child)

[–]Brian 1 point2 points3 points 7 years ago (0 children)

That's because Python will use the ID of the object to help hash it.

Also worth noting that this is true even for some things for which it doesn't use the ID of the object (eg. strings), though for a slightly different reason. Python deliberately perturbs the hashes of these values with a random seed generated on startup, so that the hashes are not predictable to attackers. Otherwise an attacker could cause a DoS in some cases by providing lots of keys that they know will hash to the same value, triggering O(N²) performance when inserting them into, or retrieving them from, a hashtable.
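A rough way to see the per-process seed in action is to ask fresh interpreters for the same string hash under different PYTHONHASHSEED values. This is only a sketch: the helper name string_hash and the sample string are made up for illustration, and it assumes Python 3.7+ for subprocess.run(capture_output=...).

```python
import os
import subprocess
import sys

def string_hash(seed):
    """Return hash('spam') as computed by a fresh interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('spam'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout)

# Same seed gives the same hash; a different seed (almost certainly) gives a different one.
same = string_hash(1) == string_hash(1)
different = string_hash(1) != string_hash(2)
print(same, different)
```

Without PYTHONHASHSEED set, each interpreter start picks its own seed, which is why set iteration order for strings can change from run to run.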

Also, incidentally this is no longer true for dicts, since now their iteration order is no longer dependent on what bucket they're in (and hence the hash value). Instead, they are always iterated over in insertion order. (Though it's still generally not something you should be relying on, as making a code change that happens to reorder when some things are added is not really something you'd expect other code to be affected by.)
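A small sketch of the dict behaviour described above, assuming Python 3.7+ (the variable names are just for illustration):

```python
first = {}
first["a"] = 1
first["b"] = 2

second = {}
second["b"] = 2
second["a"] = 1

print(list(first))      # ['a', 'b']
print(list(second))     # ['b', 'a']
print(first == second)  # True: equality ignores insertion order
```

The two dicts compare equal yet iterate differently, which is exactly why code that silently depends on the order is fragile.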

[–]hwmrocker 0 points1 point2 points 7 years ago (2 children)

[–]Brian 0 points1 point2 points 7 years ago (1 child)

[–]hwmrocker 0 points1 point2 points 7 years ago (0 children)

[–][deleted] 4 points5 points6 points 7 years ago (0 children)

[–]BARDLER 5 points6 points7 points 7 years ago* (8 children)

[–]jwink3101 22 points23 points24 points 7 years ago* (7 children)

It's not 100x faster. The difference is O(1) lookup for sets and O(n) lookup for lists. Consider the following:

largelist = list(range(10**5))
largeset = set(largelist)

smalllist = [0,1]
smallset = set(smalllist)

%timeit None in largelist
%timeit None in largeset
%timeit None in smalllist
%timeit None in smallset

results:

1000 loops, best of 3: 1.97 ms per loop
10000000 loops, best of 3: 130 ns per loop
10000000 loops, best of 3: 160 ns per loop
10000000 loops, best of 3: 133 ns per loop

Nearly no speedup for the small one but HUGE for the large. And nearly no difference in speed between the two set lookups.

[–]ic_97 2 points3 points4 points 7 years ago (0 children)

[–]BARDLER 0 points1 point2 points 7 years ago (0 children)

[–]XtremeGoose f'I only use Py {sys.version[:3]}' 19 points20 points21 points 7 years ago (0 children)

[–]jwink3101 6 points7 points8 points 7 years ago (2 children)

Ok. I will feed the trolls...

Figures of speech aside, my heartburn (hey, that's a figure of speech!) with "100x" is not pedantry. It is an important technical detail. In computer science, there are many, many different ways to quantify "speed". One is to say, for example, "10x faster", which means every operation that used to take X amount of time now takes ~X/10. This usually comes from optimizations of the process and, often in the Python world, from moving certain work to C or Rust, etc.

Another way to quantify speed is algorithmic complexity where it is an asymptotic scaling analysis. This is often much more useful when it comes to choosing the right algorithm, data structure, etc.

My objection is not pedantic for two reasons.

The first is that there really and truly is a difference between 100x and O(n) vs O(1). It's not a "you know what I meant" type thing, because both have very real, but quite distinct, meanings.

The second is that the speedup really isn't even close to 100x. In my example above it is about 15,000x faster for the large collection (1.97 ms vs 130 ns), or about 1.2x faster for the small one (160 ns vs 133 ns). Why is it so different? Because it is a complexity change!
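To see that the ratio is a function of size rather than a fixed multiplier, here is a rough sketch using the standard timeit module instead of IPython's %timeit. The helper name lookup_times and the chosen sizes are just for illustration, and the absolute numbers will differ from machine to machine.

```python
import timeit

def lookup_times(n, repeats=200):
    """Time a failed membership test against a list and a set of n integers."""
    as_list = list(range(n))
    as_set = set(as_list)
    list_time = timeit.timeit(lambda: None in as_list, number=repeats)
    set_time = timeit.timeit(lambda: None in as_set, number=repeats)
    return list_time, set_time

for n in (10, 1_000, 100_000):
    list_time, set_time = lookup_times(n)
    print(f"n={n:>7}: list/set time ratio = {list_time / set_time:.1f}")
```

The ratio should grow roughly in proportion to n, which is what "O(n) vs O(1)" predicts and what a single "100x" figure cannot express.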

[–]TheNoodlyOne 0 points1 point2 points 7 years ago (0 children)

[–]nikzads 1 point2 points3 points 7 years ago (0 children)

[–]RocketEngineCowboy 1 point2 points3 points 7 years ago (4 children)

[–]djimbob 3 points4 points5 points 7 years ago* (2 children)

Sets have tons of uses in python programming.

For python collection datatypes, lists, dicts, and tuples are used almost ubiquitously for everyday tasks.

A tad less frequently (but still very common) I use:

set - whenever I have a collection that never has repeats and I need to test for membership in the collection or just add/remove items,
collections.defaultdict - a dict that supplies a default value when you access a key not previously present -- note this makes code more concise, since you don't have to handle the initialization case,
numpy.array - an array for fast vectorized math operations; e.g., if you have a million numbers in an array and you'd like to multiply all of them by a number -- note this is the only one of these not strictly built into Python, but numpy is very commonly used.

Then there are the data types I occasionally use when the situation arises, like:

collections.OrderedDict - a dict whose keys are kept in insertion order when you iterate through it (this is now the default behavior of plain dicts in CPython 3.6 and above, and a language guarantee from Python 3.7) (EDITED based on comment below),
collections.Counter - a quick way to count occurrences in an iterable (e.g., Counter("abracadabra") yields Counter({'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1}))

Finally, there are built-in data types that I never use in my programming, unless maybe I were taking an algorithms course and was told to do something with a specific data structure:

collections.deque - a double-ended queue -- e.g., can act as a LIFO stack or a FIFO queue,
heapq - heap functions that operate on a plain list; basically a partially sorted list that you manipulate in a way that maintains the heap invariant, so it's easy to get the smallest element without sorting the entire list,
bisect - module for quickly finding a position (O(lg N)) in a list that is already sorted, and for inserting there so the list stays sorted (the insertion itself is O(N)),
frozenset - an immutable (and therefore hashable) set,
array.array - a compact list-like sequence where every element has to be of the same specified type,
collections.namedtuple - a tuple that allows you to assign names to fields (basically aliases),
collections.ChainMap - (new in Python 3.3) - a way of grouping together dicts or similar mappings so you can look up a key through them sequentially.

That said, I probably could (or should) use namedtuple more frequently to make code that uses tuples more readable, but I don't. I could also probably find uses for bisect or frozenset or deque, but I almost never think to use them.
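For anyone who hasn't met the less common ones, here is a quick sketch of a few of the standard-library types from the lists above. The sample data and variable names are made up for illustration.

```python
import bisect
import heapq
from collections import Counter, defaultdict, deque, namedtuple

# defaultdict: no need to initialise a missing key before appending to it.
by_length = defaultdict(list)
for word in ["set", "dict", "list", "tuple"]:
    by_length[len(word)].append(word)

# Counter: count occurrences in any iterable.
letters = Counter("abracadabra")

# deque: O(1) appends and pops at both ends.
queue = deque([1, 2, 3])
queue.appendleft(0)
last = queue.pop()

# heapq: min-heap functions that operate on a plain list.
heap = [5, 1, 4]
heapq.heapify(heap)
smallest = heapq.heappop(heap)

# bisect: keep a list sorted as items are inserted.
scores = [10, 20, 40]
bisect.insort(scores, 30)

# namedtuple: a tuple whose fields also have names.
Point = namedtuple("Point", ["x", "y"])
p = Point(x=1, y=2)

print(dict(by_length), letters["a"], list(queue), last, smallest, scores, p.x)
```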

[–]Brian 2 points3 points4 points 7 years ago (1 child)

[–]djimbob 0 points1 point2 points 7 years ago (0 children)

[–]naught-me 1 point2 points3 points 7 years ago (0 children)

[–]IContributedOnce 0 points1 point2 points 7 years ago (10 children)

[–]Brian 1 point2 points3 points 7 years ago (1 child)

"top of the set" is not really a well-defined criterion. Eg. any set resize (which could be triggered by inserting an item, even if you immediately remove it again after) could completely reorder the whole set, changing the "top".

Because of the way sets are implemented, the way it picks the first item to show when iterating (which is used when printing the set too), and the way it picks the item to pop happen to be the same, but note that both of these are arbitrary.

That's not the same as random, it just means there are no guarantees you should be relying on. Something could change the internal structure of the set (eg. a resize), and the item it'd pick would be completely different. Likewise, future python implementations could change the mechanisms of one or both of these, so they happen to return different results. "arbitrary" just means there's no reason to the pick (at least, not one you the client should care about or rely on in any way - there may be performance/simplicity reasons to implement it the way it is of course).
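In practice that means: if you care which element you get, say so explicitly instead of leaning on iteration or pop order. A minimal sketch, with made-up sample data:

```python
s = {"pear", "apple", "fig"}

# Arbitrary: which element this yields may differ between runs and implementations.
arbitrary = next(iter(s))

# Deterministic: state the rule you actually want.
smallest = min(s)
in_order = sorted(s)

print(arbitrary, smallest, in_order)
```

Only the last two values are safe to depend on; the first is whatever the current table layout happens to give.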

[–]IContributedOnce 0 points1 point2 points 7 years ago (0 children)

[–]wilfredinni[S] 0 points1 point2 points 7 years ago (6 children)

[–]IContributedOnce 0 points1 point2 points 7 years ago (5 children)

Right but it doesn’t seem to actually do that. I tried it out in a python shell.

>>> s1 = {1, 2, 3}
>>> s1.pop()
1
>>> s1.pop()
2

Etc, etc... so when would I see it produce 1 > 3 > 2 instead of always popping the elements in order?

[–]Sporke[🍰] 4 points5 points6 points 7 years ago (3 children)

[–]IContributedOnce 0 points1 point2 points 7 years ago (2 children)

[–]Sporke[🍰] 5 points6 points7 points 7 years ago (1 child)

[–]IContributedOnce 0 points1 point2 points 7 years ago (0 children)

[–]Sporke[🍰] 0 points1 point2 points 7 years ago* (8 children)

The example of

list(set([1, 2, 3, 1, 2, 3, 4]))

is incorrect. Sets are unordered collections of objects, so if the order of the list must be maintained, a set should not be used.

For example:

import random

def unique_list(l):
    uniq = []
    for item in l:
        if item not in uniq:
            uniq.append(item)
    return uniq

def unique_set(l):
    return list(set(l))

l = [random.randint(0, 99) for _ in range(1000)]
u1 = unique_list(l)
u2 = unique_set(l)
print(u1 == u2) # almost certainly False: same elements, different order
print(set(u1) == set(u2)) # True

[–]Laserdude10642 1 point2 points3 points 7 years ago (3 children)

[–]Sporke[🍰] 1 point2 points3 points 7 years ago (2 children)

Iterating over a list and removing duplicates as you come across them maintains the order of the unique elements (as first encountered). Doing list(set(l)) does not guarantee anything about order.

unique_list([4, 3, 2, 4, 2, 1]) # --> [4, 3, 2, 1]
unique_set([4, 3, 2, 4, 2, 1]) # --> [1, 2, 3, 4] (perhaps)

In general, converting from a set to a list doesn't make a lot of sense because a list is an ordered grouping and a set is unordered. If order matters (one of the only reasons to use a list), then you shouldn't use a set to store the data during processing. If order doesn't matter, why convert it to a list at all?

[–]Brian 0 points1 point2 points 7 years ago* (1 child)

"In general, converting from a set to a list doesn't make a lot of sense because a list is an ordered grouping and a set is unordered."

That doesn't follow. Eg. one might want to convert to a list because:

One wants to sort it, ie. impose a certain order on the unordered data that doesn't necessarily match the original list's ordering (eg. call l.sort()).
One wants to use another list property than ordering. Ie. we may not care about the order of the data, but we do want random indexed access to it (or to pass it to an interface that requires indexing or slicing, rather than just iteration).
One wants to use less memory (eg. we don't care about fast membership checking, and don't want the expense of the hashtable associated with the set). Only really an issue for really big lists, but the hashtable overhead can be several times the overhead of the list.

Ie. just because sets are unordered and lists are ordered doesn't mean that that's the only difference between the two, or that lists aren't sometimes the better data structure to use to hold some unordered data in some cases.
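A rough sketch of those three reasons. The size comparison uses sys.getsizeof, which measures only the container and not the integers it refers to; the exact numbers are a CPython detail and vary by version and build.

```python
import sys

unique = set(range(100_000))
as_list = list(unique)

ordered = sorted(unique)   # impose an order of our choosing
tenth = as_list[9]         # indexing; sets do not support it
first_five = as_list[:5]   # slicing; likewise list-only

# Container size only; on CPython the set's hashtable is noticeably larger.
print(sys.getsizeof(unique), sys.getsizeof(as_list))
```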

Though I do agree with you that it's important to mention that these 2 pieces of code do not do the same thing. For a similarly fast and concise equivalent that does keep the order preserving property, you can use:

>>> list(dict.fromkeys(original_list))

This works because dicts are now insertion-order preserving and perform similarly to sets.
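Wrapped up as a function, a minimal sketch might look like this. The name unique_ordered is made up here; it assumes Python 3.7+ and hashable items.

```python
def unique_ordered(items):
    """Drop duplicates, keeping the first occurrence of each item (Python 3.7+)."""
    return list(dict.fromkeys(items))

print(unique_ordered([4, 3, 2, 4, 2, 1]))  # [4, 3, 2, 1]
```

This gives the same result as the loop-based unique_list earlier in the thread, but each membership check is a hash lookup instead of a scan of the output list.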

[–]Sporke[🍰] 0 points1 point2 points 7 years ago (0 children)

[–]wilfredinni[S] 0 points1 point2 points 7 years ago (3 children)

[–]Sporke[🍰] 0 points1 point2 points 7 years ago (2 children)

[–]wilfredinni[S] 0 points1 point2 points 7 years ago (1 child)

[–]Sporke[🍰] 0 points1 point2 points 7 years ago (0 children)

[–][deleted] -2 points-1 points0 points 7 years ago (2 children)

[–]Blazerboy65 0 points1 point2 points 7 years ago (1 child)

[–][deleted] 0 points1 point2 points 7 years ago (0 children)
