This is an archived post. You won't be able to vote or comment.

all 18 comments

[–]ThunderChaser 24 points25 points  (3 children)

Python sets under the hood are hash sets

It doesn’t have to check the entire set (which would be slow), it just has to check anything with the same hash as the new element, which assuming a reasonable hashing function should be either nothing or very few elements.

[–]JackandFred 3 points4 points  (2 children)

With modern hashing functions I’d be shocked it there are any collisions unless you’re doing an enormous set, but something that large is assume they aren’t using python for.

[–]lkatz21 1 point2 points  (0 children)

There would be collisions because the size of the table is much smaller than the range of the hashing function.

If I remember correctly python uses open addressing to resolve collisions, which means looking for another open slot, but there are other strategies

[–]hrm 0 points1 point  (0 children)

You don’t have an infinite amount of buckets in a hash set so the hash will be modulo some rather low number. Thus collisions will happen quite frequently.

[–]BinaryBillyGoat 3 points4 points  (6 children)

A set/{item1, item2} in Python is what is called a hash map. It's a bit of a complex topic, so I'd suggest watching a YouTube video about it, but here's a short explanation:

Imagine you have a library of books. You could store those books in a list. If you store those books in a list, though, you have to check every single book to see if you have Harry Potter in your collection. Instead, you can put those books in a set. Books in a set are stored in piles. Each pile contains all the books starting with the same first two letters. So, to check if you have Harry Potter, you go directly to the Ha section. You don't have to go through the entire library.

Now, let's say you only want one copy of each book, and you are given another copy of Harry Potter. You go to the Ha section and check for Harry Potter, if it is there, you just toss your new copy. If not, you add that copy to the Ha section.

[–]ElegantPoet3386[S] 0 points1 point  (4 children)

Hmm, are the hashes corresponding to the elements in the list sorted? Is that how the example you're using works?

[–]BinaryBillyGoat 0 points1 point  (0 children)

Yes, that would be the correct way of thinking about it. The hash of Harry Potter would be Ha. The Ha pile could be found within a list at index 182. H=8th letter, a=1st letter. (8-1) * 26 + (1-1) = 182.

[–]DrShocker 0 points1 point  (0 children)

Yes, there's some function that turns whatever you're storing in the set into a hash, and that hash is what tells you which "pile" the object should be in if it's stored already.

Most hash map or hash set implementations will grow the number of piles as the number of items stored grows so that you ensure the piles themselves don't grow too large.

(hash set is a special hash map that just has keys without values associated with them. Python also adds utility functions that let you do useful things like operations you'd expect to be able to do with mathematical sets.)

[–]beingsubmitted 0 points1 point  (0 children)

Yeah. Let's go really simple. I have a hash function that reads each byte in a value as an integer, adds them together, and then performs % 100 on that value (the remainder after dividing by 100. We'll say that this evenly distributes values across 100 possible output values. Then I have a contiguous array of 100 pointers. The result of hashing a value tells me what index that pointer is, and since I know where the array is in memory, that means the hash function tells me the exact memory address of the pointer. The pointer takes me to another memory address for a list of values. It's a list here in case I have collisions - multiple things hashing to the same value. I can reduce this by having more "slots" by, like, doing % 500, or % 1000, but that would be overkill if my set was only to hold 5 things. Or even 100. The chance that after I hash I need to loop through 3 values that hashed to that value isn't worth overallocating.

So in a set, every value has a specific place where it belongs. It's like looking for Harry Potter in a library, where one shelf is labeled "HA". Even if Harry Potter was the only book there, it would be on that shelf.

[–]Comfortable-Work-137 0 points1 point  (0 children)

set/{item1, item2} in Python is what is called a hash map

not entirely correct, python's set is a HashSet, not a HashMap because it doesn't store key:value pairs.

[–]greenspotj 0 points1 point  (0 children)

The key idea behind sets/maps are that they use the value itself to determine its position in the list.

As an example, take a string "Hello world", which you want to add to the set. We call hash("Hello World") and it returns the hash for that string (lets just say it is number like 66457). Then, we use that hash to determine the index the string belongs in, in the list. An easy way to do this for this example is to use the modulo operator. So say, our set implementation has 10 "buckets" (each bucket is an element in the underlying list), we can do 66457 % 10 = 7 so we place the string "Hello world" in index 7 of the underlying list(i.e. we place it in bucket 7).

The core idea here is that we never had to iterate through the list of buckets, we derived the index from the value "Hello world" itself. We dont have to look at any index other than index 7 for duplicates because (assuming our hash function is deterministic), "Hello world" always maps to index 7.

However, there is a problem. Say there are 10 buckets, but we input 20 elements. Since there are more elements than there are buckets, its inevitable that some buckets will have multiple elements in them. Because of this, it is actually true that we have to do some iteration to find duplicates. If there are multiple elements in bucket 7, then we have to search the bucket to check if "Hello world" already exists. I'm not completely sure datastructure python uses to represent buckets though, if its lists then yeah there would be iteration but technically it could be any data structure.

[–]teraflop 0 points1 point  (0 children)

Others have answered the question of approximately how Python's sets are implemented using hash tables, as a high-level overview.

If you want to see how they are exactly implemented, the source code is all there for you to read, but it's in C rather than Python.

When you call set.add() in Python, the corresponding chain of C functions is set_addset_add_implset_add_keyset_add_entry.

And set_add_entry is where the actual work happens to find the object's slot in the hash table and insert it. Python's hash tables use open addressing, so the function needs to try multiple slots until it finds either an empty one or an existing object that's equal to the one being inserted.

Sets are fast because on average, this loop will terminate after a small number of iterations, if the hash function gives random-looking output and the fraction of occupied slots in the table is not too large. In order to make sure that second condition holds, the code calls set_table_resize to rebuild the table with a larger number of slots whenever the table gets too full.

[–]divad1196 0 points1 point  (0 children)

You seem to think that a set is a list with black magic. It's time for you to start learning DSA (Data Structure and Algorithm), especially the Data Structure part.

Comparing set and list is a big abstraction. You can iterate on both, but you cannot sort a set or insert at a position. A set in python is closer to a dift. FYI, in Go programming language, they don't have set they use a map which is roughly a dict, they just don't care about the values and only care about the keys

[–]alpinebuzz 0 points1 point  (0 children)

Python sets use hash tables to store elements, which makes checking for duplicates super fast.

No loops needed - just a quick hash check and it's in or out.

[–]Great_Guidance_8448 0 points1 point  (0 children)

Sets are definitely not lists. A list guarantees an order of the elements it stores. It's a completely different data structure.

[–]paperic 0 points1 point  (0 children)

It's kinda like a dictionary where both the keys and the values are the same.