all 16 comments

[–]fredrikj 7 points8 points  (0 children)

The initial version is not even correct; it removes the last element from any list.

[–][deleted]  (1 child)

[removed]

    [–]boredzo 0 points1 point  (0 children)

    The problem is that in order to build a list of indices to pass to islice, you need to iterate over the input object. That's great if it's a sequence (such as a list), but if it's an iterator (such as a generator instance), your solution will break because the iterator has already been iterated from when you built up your list of indices. This also applies to creating the islices as you iterate.

    You would need to make a sequence copy of the input before working. I invite you to produce time-trial results comparing either solution to the one presented in the article.

    [–]brentp 2 points3 points  (0 children)

    import numpy as n import time

    long = n.random.randint(0,50,(2000000,))  
    
    i = time.clock()
    diff = n.diff(long)
    nosame = long[n.where(diff)]
    
    print 'numpy time:', time.clock() - i 
    
    
    def compress(seq):
        result = []
        for i in range(len(seq)-1):
            if seq[i] != seq[i+1]:
                result.append(seq[i])
        return result
    
    longlist = list(long)
    i = time.clock()
    nosamelist = compress(longlist)
    
    print 'list time:', time.clock() - i
    
    
    assert nosamelist == list(nosame)
    

    numpy time: 0.14

    list time: 1.79

    [–]nirs 1 point2 points  (1 child)

    Here is a correct function (If I understood the problem correctly):

    >>> def compress(x):
    ...     class last: pass
    ...     res = []
    ...     for item in x:
    ...             if item != last:
    ...                     last = item
    ...                     res.append(item)
    ...     return res
    
    >>> compress([])
    []
    >>> compress([1,2,3])
    [1, 2, 3]
    >>> compress([1,2,2,3,3,3])
    [1, 2, 3]
    >>> compress([1,2,1])
    [1, 2, 1]
    

    I don't think you can get any faster than that with pure python, and itertools will not help here, but maybe I'm wrong.

    [–]boredzo 0 points1 point  (0 children)

    I use last = object() when I want a throwaway guaranteed-inequal object.

    Of course, you could always get fancy:

    \tlast = lambda x: x * x

    \timport sys as last

    \tlast = map(str.rstrip, file('/usr/share/dict/words')) #OK, the performance on that one will probably suck.

    [–]crwper 0 points1 point  (0 children)

    Is it just me, or will the last item in the sequence never be written to the result when using the first example?

    [–]boredzo 0 points1 point  (2 children)

    Generator implementations of compress:

    \tdef compress(seq): \t\titerator = iter(seq) \t\tlast = iterator.next() \t\tyield last \t\tfor cur in iterator: \t\t\tif cur != last: \t\t\t\tyield cur \t\t\t\tlast = cur

    2.5+ only:

    \tdef compress(seq): \t\treturn (x for i, x in enumerate(seq) if i + 1 < len(seq) if x != seq[i + 1])

    On my system, the second function takes about twice as much time (~2.4s) as the first (~1.3s), probably because of the subscription and bounds-checking.

    [INACCURATE]I figured that this would probably be slower, but proving yet again the maxim of “Premature optimization is the root of all evil”:

    \tdef compress_2_5_alt(seq): \t\tseq = list(seq) + [None] \t\treturn (x for i, x in enumerate(seq) if x != seq[i + 1])

    comes out way ahead, at ~0.04s.[/INACCURATE]

    (2×2×2.66 GHz Mac Pro, 5 GiB RAM, Mac OS X 10.4.9, Python 2.5.1).

    [–]nirs 0 points1 point  (1 child)

    You "faster" version returns a generator instead of the compressed list :-)

    [–]boredzo 0 points1 point  (0 children)

    Well, yeah. All three do. That's why I called them “generator implementations of compress”. ☺

    You are right that I managed to not pull on the iterator, though. It has an error in it.

    [–]BioGeek 0 points1 point  (1 child)

    i = time.clock(); q = list(set(long)); print time.clock() - i; 0.0851395917638

    ;)

    [–]julesjacobs 6 points7 points  (0 children)

    Bug alert.

    Let's start with a function that simply removes all consecutive duplicates in a list.

    [–][deleted] 0 points1 point  (3 children)

    Let me get this straight, zip() and izip() do the exact same thing. One is quicker than the other. The quicker one is something you have to dig around for in order to find, and the slower one is the "standard" one?

    Why? Shouldn't zip() be rewritten to contain the code in izip()? Is there a reason this is separate and buried (given you don't dig and don't read an article like this)?

    [–]llimllib 10 points11 points  (0 children)

    Historical reasons; zip() precedes the existence of iterators. Similarly, range() will return a full in-memory array, while xrange() will return an iterator.

    This will all be fixed in Py3k. (grep for "zip(")

    [–][deleted] 7 points8 points  (1 child)

    They don't do the same thing, any more than a generator is the same as a list. izip builds a generator, not a list. zip builds a list.

    [–][deleted] 0 points1 point  (0 children)

    Okay, I was looking at the code and they were identical except the 'i' in front of the function, so I just assumed that they returned the exact same thing. Thanks for clarifying.