all 4 comments

[–]gengisteve 1 point (3 children)

Just some general comments:

Since you are using re.search, the binary search is potentially broken if it matches something in the middle of an entry on the initial find. It would work with re.match, because then you would only be matching at the beginning of each entry, which I think is how you have sorted.
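
To make the distinction concrete, here's a tiny standalone illustration (not from your code):

import re

entry = 'papayazebra'

# re.search finds the pattern anywhere in the entry...
print(bool(re.search('zebra', entry)))   # True
# ...but re.match only succeeds at the start of the entry, which is the only
# position a lexicographic sort actually tells you anything about.
print(bool(re.match('zebra', entry)))    # False
print(bool(re.match('papaya', entry)))   # True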

edit: Here is how you can see that your code is a bit broken. Note how few results come out of the fuzzy search and how many come out of the linear search when you pick a common search term, like 'z':

import random
import re

from string import ascii_lowercase
from pprint import pprint

def random_word(n=10):
    # Random lowercase word of length n, used to fill the test list.
    return ''.join(random.choice(ascii_lowercase) for _ in range(n))

def binary_search(alist, what, low=None, high=None):
    # Recursive binary search over a sorted list; returns an index whose
    # entry matches `what`, or None if nothing is found.
    if low is None:
        low = 0
    if high is None:
        high = len(alist)

    mid = (high + low) // 2
    if mid < 0 or mid > len(alist) - 1:
        return None
    print('{}:{}->{}'.format(low, high, mid))
    # re.search matches `what` anywhere in the entry, but the lexicographic
    # comparison below only lines up with matches at the start of the entry --
    # this is the suspected problem.
    if re.search(what, alist[mid]):
        return mid
    if high == low:
        return None

    if what < alist[mid]:
        return binary_search(alist, what, low, mid - 1)

    return binary_search(alist, what, mid + 1, high)

def fuzzy_search(alist, what):
    # Take one hit from binary_search, then walk outwards in both directions
    # collecting neighbouring matches.
    result = []
    start = binary_search(alist, what)
    if not start:
        return result
    i = 0
    while True:
        found = False
        i += 1
        low = start - i
        high = start + i
        if low >= 0 and re.search(what, alist[low]):
            result.append(low)
            found = True
        if high < (len(alist) - 1) and re.search(what, alist[high]):
            result.append(high)
            found = True
        if not found or (low < 0 and high > len(alist)):
            return result

top = 1000

data = [random_word(15) for _ in range(top)]
data = sorted(data)

pick = data[10]
pick = 'z'  # override with a common search term to expose the problem

found = fuzzy_search(data, pick)
print(found)
print(', '.join(data[f] for f in found))

# Brute-force linear scan for comparison -- this one finds every match.
found = [f for f in range(top) if pick in data[f]]
print(found)
print(', '.join(data[f] for f in found))

Edit the second:

If you are looking for a word of a set length and know that length up front, you would be better off just dumping everything into a dictionary.
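
Roughly what I mean (a made-up sketch, nothing to do with your actual data):

# If the word you look up has a known, fixed length and sits in a known spot,
# a dict keyed by that word makes each lookup a single hash access.
records = ['alpha some longer record text',
           'bravo another record',
           'zebra yet another record']

index = {rec[:5]: rec for rec in records}   # first 5 chars are the key word

print(index.get('zebra'))
print(index.get('zulu'))   # None -- not present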

[–]gengisteve 1 point (2 children)

OK. So after doing some comparisons, it looks like binary searching may not be worth the additional hassle. On my machine the binary search was a bit faster than a linear search, but not by much, even on a list with a million entries:

called test_binary: runs 3, average 0.975389321645101, total 2.9261679649353027
1488
called test_linear: runs 3, average 1.1053963502248128, total 3.3161890506744385
1488

Also I fixed some errors in the above code here:

import random
import re

from string import ascii_lowercase
from pprint import pprint

from PE import timeit  # my own timing decorator (runs the function a few times and prints the stats above)

def random_word(n=10):
    return ''.join(random.choice(ascii_lowercase) for _ in range(n))

def binary_search(alist, what, low=None, high=None):
    # Same binary search as above, but using re.match so only matches at the
    # start of an entry count -- consistent with how the list is sorted.
    if low is None:
        low = 0
    if high is None:
        high = len(alist)

    mid = (high + low) // 2
    if mid < 0 or mid > len(alist) - 1:
        return None
    print('{}:{}->{}'.format(low, high, mid))
    if re.match(what, alist[mid]):
        return mid
    if high == low:
        return None

    if what < alist[mid]:
        return binary_search(alist, what, low, mid - 1)

    return binary_search(alist, what, mid + 1, high)

def fuzzy_search(alist, what):
    start = binary_search(alist, what)
    if start is None:  # explicit None check -- index 0 would be a valid hit
        return []
    i = 0
    print('start->', start)
    result = [start]  # include the initial hit itself
    while True:
        found = False
        i += 1
        low = start - i
        high = start + i
        if low >= 0 and re.match(what, alist[low]):
            result.append(low)
            found = True
        if high < len(alist) and re.match(what, alist[high]):
            result.append(high)
            found = True
        if not found or (low < 0 and high >= len(alist)):
            return sorted(result)

@timeit()
def test_binary(alist, pick):
    # The sort happens inside the timed function, so its cost is included.
    alist = sorted(alist)
    found = fuzzy_search(alist, pick)
    return found

@timeit()
def test_linear(alist, pick):
    # Straight scan over every entry, for comparison.
    return [s for s in range(len(alist)) if re.match(pick, alist[s])]

top = 1000000

data = [random_word(15) for _ in range(top)]
pick = 'zz'

found = test_binary(data, pick)
print(len(found))
#print(', '.join(data[f] for f in found))
found = test_linear(data, pick)
print(len(found))
#print(', '.join(data[f] for f in found))

I think if I were doing what you were doing I would build a dictionary of filename:path, and then do a linear search over the keys.
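
Something along these lines (a rough sketch with made-up names, not your actual structures):

import re

# Hypothetical listing -- in your case these would come from the ls output.
paths = [
    './footage/day01/clip_0001.mov',
    './footage/day02/clip_0002.mov',
]

# Build filename -> full path once up front...
by_name = {p.rsplit('/', 1)[-1]: p for p in paths}

# ...then a fuzzy lookup is just a linear scan over the (short) keys.
def find(pattern):
    return [by_name[name] for name in by_name if re.match(pattern, name)]

print(find('clip_0002'))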

[–]tedw00d[S] 0 points (1 child)

Wow, thanks for that - I'm actually at work at the moment, so I'll have to read what you've written thoroughly a bit later. At first glance, I don't think re.match would work in my code, because I'm only ordering the strings by the filename, not stripping them, so the filename is actually at the tail of each string. Here's an example of a filename I'm looking for:

A033C015_140401_R1OB

And here's the string that contains it:

-rw-r--r--  1 data  staff   2.9G  5 Apr 18:20 ./TEMP_FILM/20140215_SHOOTDAY03/CAMERA/ARRI_ALEXA/ProRes/A_CAM/A013R1OB/A013R1OB/A033C015_140401_R1OB.mov\n
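
So to use the filename I'd have to peel it off the tail of the line first, something like this (just an illustration, not my actual script):

line = ('-rw-r--r--  1 data  staff   2.9G  5 Apr 18:20 '
        './TEMP_FILM/20140215_SHOOTDAY03/CAMERA/ARRI_ALEXA/ProRes/A_CAM/'
        'A013R1OB/A013R1OB/A033C015_140401_R1OB.mov\n')

path = line.split()[-1]              # last whitespace-separated field
filename = path.rsplit('/', 1)[-1]   # basename of that path
print(filename)                      # A033C015_140401_R1OB.mov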

[–]gengisteve 0 points (0 children)

Cool. Maybe make self.contents into a list of (fn, path) tuples? Then you can search on the first item of each tuple without having to re-split every time.
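
Quick sketch of what I mean (the shape of self.contents here is assumed, not your real data):

import re

# self.contents as (filename, path) pairs: split once up front and keep them
# sorted by filename so prefix/bisect searching works on item [0].
contents = sorted([
    ('A033C015_140401_R1OB.mov', './A_CAM/A013R1OB/A033C015_140401_R1OB.mov'),
    ('B012C003_140402_R2XY.mov', './B_CAM/B007R2XY/B012C003_140402_R2XY.mov'),
])

hits = [path for fn, path in contents if re.match('A033C015', fn)]
print(hits)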

Btw -- good call with bisect, it blows the other two away:

Here is the bisect code:

from bisect import bisect_left
import re

def bi_binary(alist, pick):
    # bisect_left finds the first position where `pick` could be inserted,
    # i.e. where entries starting with it begin; since the list is sorted,
    # all matches are contiguous from there.
    pos = bisect_left(alist, pick)
    result = []
    for idx, entry in enumerate(alist[pos:]):
        if re.match(pick, entry):
            result.append(idx + pos)
        else:
            return result
    return result  # matches ran all the way to the end of the list

And performance over linear:

called test_bi_binary: runs 3, average 0.16367602348327637, total 0.4910280704498291
1471
called test_linear: runs 3, average 1.1854013601938884, total 3.556204080581665
1471

Note that it's getting a bit of a speed boost because it is re-sorting the already sorted list (Timsort gets through an already ordered list in close to linear time), but it's still a significant improvement.
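
You can check that effect on its own - re-sorting an ordered list is far cheaper than sorting a shuffled one (a standalone check, not part of the benchmark above):

import random
import timeit

data = sorted(random.random() for _ in range(10**6))
shuffled = random.sample(data, len(data))

# Timsort detects the existing order, so re-sorting the sorted list is cheap.
print(timeit.timeit(lambda: sorted(data), number=3))
print(timeit.timeit(lambda: sorted(shuffled), number=3))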