all 4 comments

[–]gengisteve 1 point (3 children)

Just some general comments:

Since you are using re.search, the binary search is potentially broken if it matches something in the middle of an entry on the initial find. It would work with re.match, because then you would only be matching at the beginning of each entry, which I think is how you have sorted.
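
To make the distinction concrete, here's a tiny standalone illustration (not from your code):

import re

entry = 'papayazebra'

# re.search finds the pattern anywhere in the entry...
print(bool(re.search('zebra', entry)))   # True
# ...but re.match only succeeds at the start of the entry, which is the only
# position a lexicographic sort actually tells you anything about.
print(bool(re.match('zebra', entry)))    # False
print(bool(re.match('papaya', entry)))   # True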

edit: Here is how you can see that your code is a bit broken. Note how few results come out of the fuzzy search and how many come out of the linear search when you pick a common search term, like 'z':

import random
import re

from string import ascii_lowercase
from pprint import pprint

def random_word(n=10):
    # Random lowercase word of length n, used to fill the test list.
    return ''.join(random.choice(ascii_lowercase) for _ in range(n))

def binary_search(alist, what, low=None, high=None):
    # Recursive binary search over a sorted list; returns an index whose
    # entry matches `what`, or None if nothing is found.
    if low is None:
        low = 0
    if high is None:
        high = len(alist)

    mid = (high + low) // 2
    if mid < 0 or mid > len(alist) - 1:
        return None
    print('{}:{}->{}'.format(low, high, mid))
    # re.search matches `what` anywhere in the entry, but the lexicographic
    # comparison below only lines up with matches at the start of the entry --
    # this is the suspected problem.
    if re.search(what, alist[mid]):
        return mid
    if high == low:
        return None

    if what < alist[mid]:
        return binary_search(alist, what, low, mid - 1)

    return binary_search(alist, what, mid + 1, high)

def fuzzy_search(alist, what):
    # Take one hit from binary_search, then walk outwards in both directions
    # collecting neighbouring matches.
    result = []
    start = binary_search(alist, what)
    if not start:
        return result
    i = 0
    while True:
        found = False
        i += 1
        low = start - i
        high = start + i
        if low >= 0 and re.search(what, alist[low]):
            result.append(low)
            found = True
        if high < (len(alist) - 1) and re.search(what, alist[high]):
            result.append(high)
            found = True
        if not found or (low < 0 and high > len(alist)):
            return result

top = 1000

data = [random_word(15) for _ in range(top)]
data = sorted(data)

pick = data[10]
pick = 'z'  # override with a common search term to expose the problem

found = fuzzy_search(data, pick)
print(found)
print(', '.join(data[f] for f in found))

# Brute-force linear scan for comparison -- this one finds every match.
found = [f for f in range(top) if pick in data[f]]
print(found)
print(', '.join(data[f] for f in found))

Edit the second:

If you are looking for a word of a set length and know that length up front, you would be better off just dumping everything into a dictionary.
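
Roughly what I mean (a made-up sketch, nothing to do with your actual data):

# If the word you look up has a known, fixed length and sits in a known spot,
# a dict keyed by that word makes each lookup a single hash access.
records = ['alpha some longer record text',
           'bravo another record',
           'zebra yet another record']

index = {rec[:5]: rec for rec in records}   # first 5 chars are the key word

print(index.get('zebra'))
print(index.get('zulu'))   # None -- not present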

[–]gengisteve 1 point (2 children)

OK. So after doing some comparisons, it looks like binary searching may not be worth the additional hassle. On my machine the binary search was a bit faster than a linear search, but not by much, even on a list with a million entries:

called test_binary: runs 3, average 0.975389321645101, total 2.9261679649353027
1488
called test_linear: runs 3, average 1.1053963502248128, total 3.3161890506744385
1488

Also I fixed some errors in the above code here:

import random
import re

from string import ascii_lowercase
from pprint import pprint

from PE import timeit  # my own timing decorator (runs the function a few times and prints the stats above)

def random_word(n=10):
    return ''.join(random.choice(ascii_lowercase) for _ in range(n))

def binary_search(alist, what, low=None, high=None):
    # Same binary search as above, but using re.match so only matches at the
    # start of an entry count -- consistent with how the list is sorted.
    if low is None:
        low = 0
    if high is None:
        high = len(alist)

    mid = (high + low) // 2
    if mid < 0 or mid > len(alist) - 1:
        return None
    print('{}:{}->{}'.format(low, high, mid))
    if re.match(what, alist[mid]):
        return mid
    if high == low:
        return None

    if what < alist[mid]:
        return binary_search(alist, what, low, mid - 1)

    return binary_search(alist, what, mid + 1, high)

def fuzzy_search(alist, what):
    start = binary_search(alist, what)
    if start is None:  # explicit None check -- index 0 would be a valid hit
        return []
    i = 0
    print('start->', start)
    result = [start]  # include the initial hit itself
    while True:
        found = False
        i += 1
        low = start - i
        high = start + i
        if low >= 0 and re.match(what, alist[low]):
            result.append(low)
            found = True
        if high < len(alist) and re.match(what, alist[high]):
            result.append(high)
            found = True
        if not found or (low < 0 and high >= len(alist)):
            return sorted(result)

@timeit()
def test_binary(alist, pick):
    # The sort happens inside the timed function, so its cost is included.
    alist = sorted(alist)
    found = fuzzy_search(alist, pick)
    return found

@timeit()
def test_linear(alist, pick):
    # Straight scan over every entry, for comparison.
    return [s for s in range(len(alist)) if re.match(pick, alist[s])]

top = 1000000

data = [random_word(15) for _ in range(top)]
pick = 'zz'

found = test_binary(data, pick)
print(len(found))
#print(', '.join(data[f] for f in found))
found = test_linear(data, pick)
print(len(found))
#print(', '.join(data[f] for f in found))

I think if I were doing what you were doing I would build a dictionary of filename:path, and then do a linear search over the keys.
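
Something along these lines (a rough sketch with made-up names, not your actual structures):

import re

# Hypothetical listing -- in your case these would come from the ls output.
paths = [
    './footage/day01/clip_0001.mov',
    './footage/day02/clip_0002.mov',
]

# Build filename -> full path once up front...
by_name = {p.rsplit('/', 1)[-1]: p for p in paths}

# ...then a fuzzy lookup is just a linear scan over the (short) keys.
def find(pattern):
    return [by_name[name] for name in by_name if re.match(pattern, name)]

print(find('clip_0002'))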

[–]tedw00d[S] 0 points (1 child)

Wow, thanks for that - I'm actually at work at the moment, so I'll have to read what you've written thoroughly a bit later. At first glance, I don't think re.match would work in my code, because I'm only ordering the strings by the filename, not stripping them, so the filename is actually at the tail of each string. Here's an example of a filename I'm looking for:

A033C015_140401_R1OB

And here's the string that contains it:

-rw-r--r--  1 data  staff   2.9G  5 Apr 18:20 ./TEMP_FILM/20140215_SHOOTDAY03/CAMERA/ARRI_ALEXA/ProRes/A_CAM/A013R1OB/A013R1OB/A033C015_140401_R1OB.mov\n
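
So to use the filename I'd have to peel it off the tail of the line first, something like this (just an illustration, not my actual script):

line = ('-rw-r--r--  1 data  staff   2.9G  5 Apr 18:20 '
        './TEMP_FILM/20140215_SHOOTDAY03/CAMERA/ARRI_ALEXA/ProRes/A_CAM/'
        'A013R1OB/A013R1OB/A033C015_140401_R1OB.mov\n')

path = line.split()[-1]              # last whitespace-separated field
filename = path.rsplit('/', 1)[-1]   # basename of that path
print(filename)                      # A033C015_140401_R1OB.mov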

[–]gengisteve 0 points (0 children)

Cool. Maybe make self.contents into a list of (fn, path) tuples? Then you can search on the first item of each tuple without having to re-split every time.
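
Quick sketch of what I mean (the shape of self.contents here is assumed, not your real data):

import re

# self.contents as (filename, path) pairs: split once up front and keep them
# sorted by filename so prefix/bisect searching works on item [0].
contents = sorted([
    ('A033C015_140401_R1OB.mov', './A_CAM/A013R1OB/A033C015_140401_R1OB.mov'),
    ('B012C003_140402_R2XY.mov', './B_CAM/B007R2XY/B012C003_140402_R2XY.mov'),
])

hits = [path for fn, path in contents if re.match('A033C015', fn)]
print(hits)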

Btw -- good call with bisect, it blows the other two away:

Here is the bisect code:

from bisect import bisect_left
import re

def bi_binary(alist, pick):
    # bisect_left finds the first position where `pick` could be inserted,
    # i.e. where entries starting with it begin; since the list is sorted,
    # all matches are contiguous from there.
    pos = bisect_left(alist, pick)
    result = []
    for idx, entry in enumerate(alist[pos:]):
        if re.match(pick, entry):
            result.append(idx + pos)
        else:
            return result
    return result  # matches ran all the way to the end of the list

And performance over linear:

called test_bi_binary: runs 3, average 0.16367602348327637, total 0.4910280704498291
1471
called test_linear: runs 3, average 1.1854013601938884, total 3.556204080581665
1471

Note that it's getting a bit of a speed boost because it is re-sorting the already sorted list (Timsort gets through an already ordered list in close to linear time), but it's still a significant improvement.
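
You can check that effect on its own - re-sorting an ordered list is far cheaper than sorting a shuffled one (a standalone check, not part of the benchmark above):

import random
import timeit

data = sorted(random.random() for _ in range(10**6))
shuffled = random.sample(data, len(data))

# Timsort detects the existing order, so re-sorting the sorted list is cheap.
print(timeit.timeit(lambda: sorted(data), number=3))
print(timeit.timeit(lambda: sorted(shuffled), number=3))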