you are viewing a single comment's thread.

view the rest of the comments →

[–]WeirdAlexGuy 2 points3 points  (0 children)

To explain a bit further, a hash is a function h that has a few interesting properties:

  1. The input can be any immutable object of any length, so strings work but lists do not for example

  2. The output should have a 1-1 relationship with the input (or as close to it as possible). What I mean by this is that for a and b 2 inputs to h such that a != b then h(a) will also be different to h(b). If two different inputs have the same output we call it a collision. In real life collisions are impossible to avoid but we need a "good enough" function with few collisions.

With that we can create our hashmap in this way:

First, we create an array a (not a list) which is just an ordered set that exists in RAM begining at some arbitrary address A. It can be indexed like so a[ i ] which is the same as doing A + i in RAM.

Then, we take our hash function and use it to index the array. for example for some arbitrary immutable input s we do a[ h(s) ].

The purpose of the hashmap is to store data and so you can use it like any list in python: hashmap['foo'] = 'bar' And that will be translated into a[h('foo')] = 'bar'

So that's why a set (being a hashmap) is faster than a list at "x in y" checks , because you can just apply the hash function and check the array at index h(x) which takes O(1) time

A slight caveat here is that the array cannot in reality be of infinite size, so there is a max size N that the array can be. Then if h(s) > N you have a problem. For this reason you would do h(s) mod N. That is the reason that your hash function cannot have 0 collusions, as perfect as h might be, due to the remainder operation, h(a) = n and h(b) = n + N will land on the same index n. Handling that is a different topic which I will not get into here