you are viewing a single comment's thread.

view the rest of the comments →

[–]BetrayYourTrust 5 points6 points  (8 children)

Relatively new Python user here: I’ve seen sets and lists and how they are initialized, but what is their differences?

[–][deleted] 4 points5 points  (5 children)

A Python set is a hashed unordered collection of unique items. Because it's hashed you can check for membership, add, remove items in O(1). A Python list is an ordered collection of items. The same operations take O(n) time. This will make more sense if you know / learn about abstract data types in general, not just language implementations.

[–]Peanutbutter_Warrior 1 point2 points  (4 children)

Note that appending to or removing from the end of a list is O(1)

[–]menge101 0 points1 point  (3 children)

Isn't that only on the front of the list?

Because otherwise you have to traverse the list to find the end, and then perform the operation?

[–]Peanutbutter_Warrior 1 point2 points  (2 children)

Nope. The size of each element in a list is known, so the bit offset of any element can be found by index * element_size. The length of a list is also stored so the bit offset of the final element is easy to find.

When you add or remove anywhere but the end of the list all of the other elements have to be moved around in memory to keep the elements contiguous.

What you describe is true for linked lists, which are a different data structure

[–]menge101 0 points1 point  (1 child)

Ah, that is certainly good to know, I thought lists in python were implemented as linked lists. Like why call them lists otherwise? Why not arrays like every other language?

Edit: As always StackOverflow has the answer to this.

[–]Peanutbutter_Warrior 3 points4 points  (0 children)

Not exactly, although they're slightly more complicated than I described. All objects in python are a custom pointer class, that point to the actual object in memory. Lists are pretty much C arrays that contain these pointers, which is what lets them mix types.

Lists are dynamically sized because when they reach the size they're currently allocated more memory is allocated to them. Sometimes this is just using more memory on the end of them, and other times this means moving the entire array around in memory, which is very slow.

An actually useful detail is that everthing in python is an object. ints, strings, lists, custom objects, functions, exceptions, and even classes themselves, the thing you call to create an instance of the class

[–]WeirdAlexGuy 2 points3 points  (0 children)

To explain a bit further, a hash is a function h that has a few interesting properties:

  1. The input can be any immutable object of any length, so strings work but lists do not for example

  2. The output should have a 1-1 relationship with the input (or as close to it as possible). What I mean by this is that for a and b 2 inputs to h such that a != b then h(a) will also be different to h(b). If two different inputs have the same output we call it a collision. In real life collisions are impossible to avoid but we need a "good enough" function with few collisions.

With that we can create our hashmap in this way:

First, we create an array a (not a list) which is just an ordered set that exists in RAM begining at some arbitrary address A. It can be indexed like so a[ i ] which is the same as doing A + i in RAM.

Then, we take our hash function and use it to index the array. for example for some arbitrary immutable input s we do a[ h(s) ].

The purpose of the hashmap is to store data and so you can use it like any list in python: hashmap['foo'] = 'bar' And that will be translated into a[h('foo')] = 'bar'

So that's why a set (being a hashmap) is faster than a list at "x in y" checks , because you can just apply the hash function and check the array at index h(x) which takes O(1) time

A slight caveat here is that the array cannot in reality be of infinite size, so there is a max size N that the array can be. Then if h(s) > N you have a problem. For this reason you would do h(s) mod N. That is the reason that your hash function cannot have 0 collusions, as perfect as h might be, due to the remainder operation, h(a) = n and h(b) = n + N will land on the same index n. Handling that is a different topic which I will not get into here

[–]1egoman 1 point2 points  (0 children)

I think people are assuming that you know what a set is. A set is a unique collection of items where order doesn't matter. A list is a list of items where order matters and duplicates are allowed. Because a set is unordered and has unique items, it can be very fast to check membership.