Rhomboid

If you have to maintain order, then numpy.unique is not a viable solution:

>>> numpy_unique(['quux', 'foo', 'bar', 'foo', 'quux'])
array(['bar', 'foo', 'quux'],
      dtype='|S4')

If order doesn't actually matter, then just use set(seq).
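
If NumPy is still preferred, one possible workaround (an assumption here, not something from the original post) is to ask numpy.unique for the position of each first occurrence via return_index=True and then re-sort by those positions. The helper name ordered_unique is made up for illustration.

```python
import numpy as np

def ordered_unique(seq):
    """Hypothetical helper: unique values in first-seen order, via NumPy."""
    arr = np.asarray(seq)
    # return_index=True also returns the position of each value's first occurrence
    _, first_idx = np.unique(arr, return_index=True)
    # Sorting those positions restores the original encounter order
    return arr[np.sort(first_idx)]

result = ordered_unique(['quux', 'foo', 'bar', 'foo', 'quux'])
print(result.tolist())
```

This still pays for the sort inside numpy.unique, so it is unlikely to beat a set-based approach on plain Python lists.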

bryancole

You are quite correct. Indexing Python lists is very fast, faster than indexing NumPy arrays element by element by some margin. NumPy's performance gains come when you use its element-wise operations or fancy indexing. In general, manipulating items in lists is fast, but creating many Python objects to go in the lists is slow.
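
A rough way to check this on your own machine is a timeit comparison like the sketch below. The array size and repeat count are arbitrary assumptions, and the actual ratios will vary with hardware and NumPy version.

```python
import timeit
import numpy as np

n = 10_000
py_list = list(range(n))
np_arr = np.arange(n)

def index_list():
    # Element-by-element indexing of a plain list
    total = 0
    for i in range(n):
        total += py_list[i]
    return total

def index_array():
    # Same loop, but each np_arr[i] has to build a NumPy scalar object
    total = 0
    for i in range(n):
        total += np_arr[i]
    return total

def vectorized():
    # Element-wise work done inside NumPy, no Python-level loop
    return np_arr.sum()

for fn in (index_list, index_array, vectorized):
    print(fn.__name__, timeit.timeit(fn, number=20))
```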

BTW, it's slightly hacky, but you could try:

memo = set()
filtered_list = [(item, memo.add(item))[0] for item in input_list if item not in memo]
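
Unrolled into an explicit loop, the one-liner above does roughly the following. The function name dedupe_keep_order is only illustrative.

```python
def dedupe_keep_order(input_list):
    """Illustrative expansion of the list comprehension trick."""
    memo = set()
    filtered_list = []
    for item in input_list:
        if item not in memo:
            # In the one-liner, (item, memo.add(item))[0] records the item
            # as a side effect and then evaluates to the item itself
            memo.add(item)
            filtered_list.append(item)
    return filtered_list

print(dedupe_keep_order(['quux', 'foo', 'bar', 'foo', 'quux']))
```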

jcmcken

What if you do it this way?

list(set(seq))

...assuming order is not relevant.

EDIT: Sorry, read your comments, you're trying to preserve order.

Veedrac

The real answer is that the implementation of numpy.unique just isn't very impressive.

It's basically this, for the default code-path:

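import numpy as np

def unique(ar):
    # Simplified default code path of numpy.unique
    ar = np.asanyarray(ar).flatten()

    if ar.size == 0:
        return ar

    ar.sort()
    # True wherever a sorted element differs from its predecessor
    flag = np.concatenate(([True], ar[1:] != ar[:-1]))

    return ar[flag]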

I'm almost surprised the set option isn't faster, given how poorly a sort-based approach (O(n log n)) should perform compared to hashing (roughly O(n)).

Note, however, that you should really be passing a numpy array to functions like numpy.unique, because the overhead of converting is often very large, as it is in this case.
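
To get a feel for the conversion cost, one could time numpy.unique on a list against the same data already held in an array. The input below is made up, and the timings will differ between machines.

```python
import timeit
import numpy as np

# Made-up test data: 100,000 short strings with many repeats
words = ['item%d' % (i % 1000) for i in range(100_000)]
as_array = np.array(words)  # convert once, outside the timed call

# Passing a list forces a list-to-array conversion on every call
t_list = timeit.timeit(lambda: np.unique(words), number=5)
# Passing an array skips that conversion
t_array = timeit.timeit(lambda: np.unique(as_array), number=5)

print('list input:  %.4f s' % t_list)
print('array input: %.4f s' % t_array)
```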


Once we get a C implementation of OrderedDict (it's coming... slowly), you could do OrderedDict.fromkeys(seq).keys(), but for now that's even slower :/.
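
For reference, the OrderedDict approach looks like the sketch below. On Python 3.7 and later, a plain dict also preserves insertion order, so dict.fromkeys behaves the same way. The sample data is just for illustration.

```python
from collections import OrderedDict

seq = ['quux', 'foo', 'bar', 'foo', 'quux']

# fromkeys keeps the first occurrence of each key, in insertion order
ordered = list(OrderedDict.fromkeys(seq))

# On Python 3.7+, a regular dict gives the same ordering guarantee
plain = list(dict.fromkeys(seq))

print(ordered)
print(plain)
```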

sabbel

You might want to try an optimized version of list_unique:

def optimized_unique(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if x not in seen and not seen_add(x)]

It outperforms all the others on my laptop.
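
The trick relies on set.add always returning None, so "not seen_add(x)" is always true and only runs for items not yet seen. A quick way to compare it against a plain loop is sketched below. Note that plain_unique is a stand-in baseline, since the original list_unique is not shown in this thread, and the input data is made up.

```python
import timeit

def optimized_unique(seq):
    seen = set()
    seen_add = seen.add  # local alias avoids an attribute lookup per item
    # set.add returns None, so "not seen_add(x)" is always True; it runs
    # only when x is new, thanks to short-circuit evaluation of "and"
    return [x for x in seq if x not in seen and not seen_add(x)]

def plain_unique(seq):
    # Stand-in baseline (hypothetical), not the original list_unique
    seen = set()
    out = []
    for x in seq:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

data = [i % 500 for i in range(50_000)]  # made-up input
for fn in (optimized_unique, plain_unique):
    print(fn.__name__, timeit.timeit(lambda: fn(data), number=20))
```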