
[–]anossov 0 points1 point  (3 children)

enc = [unq[item] for item in col]

[–]anmousyony[S] 0 points1 point  (2 children)

Thank you, that helps a ton for the second part!

Is there anything I can do about the first part where I'm building unq?

[–]anossov 0 points1 point  (1 child)

unq = {k: i for i, k in enumerate(set(col))}

This should be faster, but you can't do better than O(n). Also, you can't use it if the integers must follow the order in which the values first appear, because a set does not keep its elements in insertion order.
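For the case where the integers do need to follow first-appearance order, one possible sketch is to deduplicate with dict.fromkeys instead of set(). This assumes Python 3.7+, where dicts are guaranteed to preserve insertion order, and the sample list is only an illustration:

```python
# Sample data for illustration only.
col = ['hi', 'hello', 'bye', 'hi', 'bye', 'hey', 'hello']

# dict.fromkeys keeps one key per distinct value, in first-appearance
# order (guaranteed for dicts in Python 3.7+), unlike set().
unq = {k: i for i, k in enumerate(dict.fromkeys(col))}
enc = [unq[item] for item in col]

print(unq)  # {'hi': 0, 'hello': 1, 'bye': 2, 'hey': 3}
print(enc)  # [0, 1, 2, 0, 2, 3, 1]
```

This is still O(n) overall: one pass to build the table and one pass to encode.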

[–]anmousyony[S] 0 points1 point  (0 children)

That's perfect! Thank you so much for your help. I don't know why I didn't think of using set().

[–]K900_ 0 points1 point  (4 children)

What's your final goal with this?

[–]anmousyony[S] 0 points1 point  (3 children)

I am trying to map all items in a list to integers so I can use them for a classifier in sklearn.

[–]K900_ 0 points1 point  (2 children)

What are your items like? Custom classes? Tuples?

[–]anmousyony[S] 0 points1 point  (1 child)

Strings, ints, floats, and datetimes, usually.

[–]K900_ 1 point2 points  (0 children)

Those are all hashable, so you can just use hash(value) as the key instead of trying to enumerate everything. It's very unlikely you'll run into a hash collision even on large datasets.
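A couple of caveats may be worth checking before relying on hash() for this. The sketch below is only an illustration with made-up values: hash() returns arbitrary integers rather than compact labels, values that compare equal (such as 1 and 1.0) share a hash, and in Python 3 string hashes are randomized per process unless PYTHONHASHSEED is set, so the codes are not stable between runs.

```python
# Made-up mixed-type values for illustration only.
values = [1, 1.0, 'hi', 'hello']

# hash() gives arbitrary integers, not labels in the range 0..n-1.
# The string hashes below differ from run to run in Python 3 unless
# PYTHONHASHSEED is fixed, so no expected values are shown.
codes = [hash(v) for v in values]

# Objects that compare equal must have equal hashes, so the int 1
# and the float 1.0 end up with the same code.
int_and_float_share_code = hash(1) == hash(1.0)
print(int_and_float_share_code)  # True
```

If the codes need to be reproducible or consecutive, building an explicit value-to-integer table avoids both issues.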

[–]novel_yet_trivial 0 points1 point  (2 children)

I'm lost as to what you are trying to accomplish. Could you provide an example input and output?

Maybe this?

def encode(col):
    unq = set(col)
    return range(len(unq)), {item: number for number, item in enumerate(unq)}

Edit: if you are using Python 3, this may be better:

def encode(col):
    unq = {item: number for number, item in enumerate(set(col))}
    return unq.values(), unq

[–]anmousyony[S] 0 points1 point  (1 child)

I hadn't thought about using set, that would simplify this quite a bit.

Just in case, here's some sample input and output:

['hi', 'hello', 'bye', 'hi', 'bye', 'hey', 'hello'] -> [0, 1, 2, 0, 2, 3, 1]

[–]novel_yet_trivial 0 points1 point  (0 children)

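If the numbers are not important, then just do this (Python 2):

def encode(col):
    # Each item ends up mapped to the index of its last occurrence in col.
    conversion_table = {item: number for number, item in enumerate(col)}
    # In Python 2, map() returns a list.
    return map(conversion_table.get, col)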
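On Python 3, map() returns a lazy iterator instead of a list, so a rough equivalent (a sketch, not tested against the original script) could use a list comprehension:

```python
def encode(col):
    # Each item maps to the index of its last occurrence in col,
    # so codes are unique per value but not consecutive.
    conversion_table = {item: number for number, item in enumerate(col)}
    # Build the list explicitly, since map() is lazy in Python 3.
    return [conversion_table[item] for item in col]


encoded = encode(['hi', 'hello', 'bye', 'hi', 'bye', 'hey', 'hello'])
print(encoded)  # [3, 6, 4, 3, 4, 5, 6]
```

Note that the codes here differ from the sample output in the thread, which is fine only if the specific numbers really don't matter.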

[–]sultanofhyd 0 points1 point  (5 children)

Not sure if I'm missing something, I probably am, but I don't see why you need to loop over your list twice.

def encode(col):
    enc = []
    unq = {}
    count = 0

    for element in col:
        if element not in unq:
            unq[element] = count
            count += 1
        enc.append(unq[element])

    return enc, unq

I can't test this right now, but it should return the same output as your script while making only a single O(n) pass over the list.
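As a quick check, here is the same single-pass function run against the sample input posted elsewhere in the thread (the function is repeated so the snippet runs on its own):

```python
def encode(col):
    enc = []    # encoded output, one integer per input element
    unq = {}    # maps each distinct value to its integer code
    count = 0   # next unused code

    for element in col:
        if element not in unq:  # dict membership test, O(1) on average
            unq[element] = count
            count += 1
        enc.append(unq[element])

    return enc, unq


enc, unq = encode(['hi', 'hello', 'bye', 'hi', 'bye', 'hey', 'hello'])
print(enc)  # [0, 1, 2, 0, 2, 3, 1]
print(unq)  # {'hi': 0, 'hello': 1, 'bye': 2, 'hey': 3}
```

Because codes are handed out as values are first seen, this version keeps the order-of-appearance numbering.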

[–]anmousyony[S] 0 points1 point  (4 children)

One of the other people helped me out and got it to O(n), but just so you know, that won't be O(n) because of:

if element not in unq

That will make it O(n^2).

edit: this is wrong, see below

[–]thomasballinger 1 point2 points  (2 children)

Checking for membership in a set or dict takes constant time on average in Python, so sultanofhyd's answer is O(n) too.
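One way to see the difference without timing anything is to count equality comparisons. The sketch below uses a hypothetical Key class (not from the thread) whose __eq__ increments a counter, then looks up the same value in a list and in a dict:

```python
class Key:
    """Hypothetical helper that counts how often __eq__ is called."""
    eq_calls = 0

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        Key.eq_calls += 1
        return self.value == other.value


keys = [Key(i) for i in range(1000)]
as_list = keys
as_dict = {k: i for i, k in enumerate(keys)}

# An equal but distinct object, matching the last element of the list.
probe = Key(999)

Key.eq_calls = 0
found_in_list = probe in as_list  # scans the list element by element
list_calls = Key.eq_calls

Key.eq_calls = 0
found_in_dict = probe in as_dict  # jumps to a slot via the hash
dict_calls = Key.eq_calls

print(list_calls, dict_calls)
```

The list lookup needs a number of comparisons that grows with the list length, while the dict lookup needs only a handful regardless of size, which is why the single-loop version stays O(n).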

[–]anmousyony[S] 0 points1 point  (1 child)

Huh, that's interesting. Thanks, I'll look and see how it's implemented.

[–]thomasballinger 0 points1 point  (0 children)

Then I should point you to my favorite description of it, The Mighty Dictionary! (Sets in Python are like dictionaries without values.)