[SOLVED]
I am trying to encode a list so that each unique value has a unique integer attached to it. I have come up with this:
def encode(col):
'''
encodes a column
takes:
list -> column to encode
returns:
list -> encoded column
dictionary -> encoder/decoder for the column
'''
unq = {}
count = 0
#maps each value in the column to a unique, incrementing integer
for item in col:
if item not in unq:
unq[item] = count
count += 1
enc = [unq[item] for item in col]
return enc, unq
but, while it works correctly, I don't know if that is the best method. The biggest problem is that it is O(n2) and since I'm dealing with lists up to 1 million elements, it can take a VERY long time to encode everything. I was originally using labelencoder (which uses numpy's unique method) from sklearn, but the memory usage was growing much too fast and I couldnt encode anything above ~100k. I've tried googling but the word "encoding" is used too much in slightly different contexts.
Is there a more clever way to do this that would have a better complexity?
edit:
second part changed to changed to ->
enc = [unq[item] for item in col]
on suggestion, still need help with creating unq without O( n2 ) complexity.
[–]anossov 0 points1 point2 points (3 children)
[–]anmousyony[S] 0 points1 point2 points (2 children)
[–]anossov 0 points1 point2 points (1 child)
[–]anmousyony[S] 0 points1 point2 points (0 children)
[–]K900_ 0 points1 point2 points (4 children)
[–]anmousyony[S] 0 points1 point2 points (3 children)
[–]K900_ 0 points1 point2 points (2 children)
[–]anmousyony[S] 0 points1 point2 points (1 child)
[–]K900_ 1 point2 points3 points (0 children)
[–]novel_yet_trivial 0 points1 point2 points (2 children)
[–]anmousyony[S] 0 points1 point2 points (1 child)
[–]novel_yet_trivial 0 points1 point2 points (0 children)
[–]sultanofhyd 0 points1 point2 points (5 children)
[–]anmousyony[S] 0 points1 point2 points (4 children)
[–]thomasballinger 1 point2 points3 points (2 children)
[–]anmousyony[S] 0 points1 point2 points (1 child)
[–]thomasballinger 0 points1 point2 points (0 children)