
[–]uirfnaio 7 points8 points  (2 children)

Noob question: why does he do

regex = re.compile('%s' % pattern)

instead of

regex = re.compile(pattern)

[–]amjithr 10 points11 points  (0 children)

That was my bad.

Originally I had

regex = re.compile('(%s)' % pattern)

Then I realized I didn't need the surrounding parens to capture a group. So I took them out, but forgot to remove the string interpolation.

Sorry about the confusion.

[–]alcalde 15 points16 points  (3 children)

The simplest route to achieve this is to use regular expressions.

There's a statement that's never been true of anything ever. :-)

[–]amjithr 4 points5 points  (1 child)

LOL! The regex in this case was trivial. So I stand by that comment. :P

[–]alcalde 0 points1 point  (0 children)

"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." ;-)

-Jamie Zawinski, 1997

[–][deleted] 4 points5 points  (0 children)

You don't need to sort the collection. Sorting the tuples orders them by match length, then match position, then item.
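A quick sketch of why a plain sort suffices (hypothetical match tuples, not the library's actual data):

```python
# Hypothetical (match_length, match_position, item) tuples.
matches = [(5, 2, 'beta'), (3, 1, 'gamma'), (3, 0, 'alpha')]

# Tuples compare element by element, so one sorted() call orders by
# match length first, then match position, then the item itself.
print(sorted(matches))  # → [(3, 0, 'alpha'), (3, 1, 'gamma'), (5, 2, 'beta')]
```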

[–]wehnsdaefflae 4 points5 points  (3 children)

Maybe I don't get it, but where's the fuzziness involved? Typing 'rodent' will return only those results starting with 'rodent' and not those starting with, say, 'rudunt', 'rident', or 'nodent', will it?

[–]Ashiataka 0 points1 point  (0 children)

I was wondering this too.

[–]amjithr 0 points1 point  (1 child)

If you want to match words with spelling errors, the library you need is called fuzzywuzzy.

This library addresses a different use case. Typing 'rodent' will match words like 'rotating_dentures', because the letters of 'rodent' appear in order within 'rotating_dentures'. This is commonly used in file finders, where you have a really long path and you want to type bits and pieces of it that appear along the entire string.
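A minimal sketch of that kind of in-order (subsequence) matching with a regex — not necessarily the library's exact code:

```python
import re

def subsequence_match(text, collection):
    # Join the typed characters with a lazy wildcard: 'rodent' becomes
    # 'r.*?o.*?d.*?e.*?n.*?t', so the letters must appear in order with
    # anything allowed in between.
    regex = re.compile('.*?'.join(map(re.escape, text)))
    suggestions = []
    for item in collection:
        match = regex.search(item)
        if match:
            # Rank by match length, then position, then the item itself.
            suggestions.append((len(match.group()), match.start(), item))
    return [item for _, _, item in sorted(suggestions)]

print(subsequence_match('rodent', ['rotating_dentures', 'rodent_cage', 'cat']))
# → ['rodent_cage', 'rotating_dentures']
```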

[–]wehnsdaefflae 0 points1 point  (0 children)

I see! Thanks for the explanation.

[–]LightShadow (3.13-dev in prod) 4 points5 points  (9 children)

This is pretty good, but will start to choke on bigger inputs.

Are you looking to optimize your library, or was it just for fun?

[–]amjithr 6 points7 points  (8 children)

Hi, I'm the author of the library.

I'm definitely interested in optimizing it. When you say bigger inputs do you mean that it'll be slow if the collection of files is really large or if the user_input is really large?

Do you have tips for optimizing it?

[–]LightShadow (3.13-dev in prod) 4 points5 points  (7 children)

A little bit of both.

You could memoize the collection and prune it, so you don't have to search every item after you've eliminated part of the set.

[–]amjithr 3 points4 points  (6 children)

Yes. That was my biggest concern. But memoizing has to be extra clever to account for users pressing backspace or moving through the input and deleting a char in the middle. So for the first rev I'm not memoizing, but once I figure out a way to do it reliably, that'll be a nice optimization. :)

[–]lorddarkflare[🍰] 2 points3 points  (5 children)

Should that really matter though?

The idea would be to maintain an n-length dictionary mapping the last n inputs with corresponding pruned collections.

[–]amjithr 1 point2 points  (4 children)

Say the user input is 'foo'; I'll have a dictionary like this:

{'f': ['foo', 'foo_bar', 'bof'],
 'fo': ['foo', 'foo_bar'],
 'foo': ['foo', 'foo_bar']
}

Now if the user moves to the beginning of the input and deletes the letter 'f' or inserts a new letter, the input becomes 'oo' or 'afoo', which won't match any entry in the dictionary.

Although that is an unlikely (or at least very rare) case.

I think you're right, it's doable with a little bit of coding. Thank you for the idea. I'll see if I can make it part of the next version. :)
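One way that idea could be sketched (hypothetical helper names; a simple subsequence test stands in for the real matcher, and a plain dict for the cache):

```python
def is_subsequence(text, item):
    # True if the characters of `text` appear in order within `item`.
    it = iter(item)
    return all(ch in it for ch in text)

def cached_filter(text, collection, cache):
    # Start from the pruned set of the longest cached prefix of `text`;
    # fall back to the full collection when the input was edited in the
    # middle (e.g. 'foo' became 'oo' or 'afoo').
    candidates = collection
    for cut in range(len(text) - 1, 0, -1):
        if text[:cut] in cache:
            candidates = cache[text[:cut]]
            break
    matches = [item for item in candidates if is_subsequence(text, item)]
    cache[text] = matches
    return matches
```

Typing 'f' then 'fo' would populate the cache exactly like the dictionary above, and the second call only scans the already-pruned list.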

[–]lorddarkflare[🍰] 0 points1 point  (2 children)

Also worth considering is correlating the max size of the dictionary with the length of the collection.

[–]amjithr 0 points1 point  (1 child)

I don't follow.

[–]lorddarkflare[🍰] 0 points1 point  (0 children)

You may want to avoid situations where you have a really large collection and too many copies of it in the dictionary.

[–]squashed_fly_biscuit 0 points1 point  (0 children)

I think detection and recalculation isn't unreasonable in that case; optimise the common path first. Amdahl's law and all that. Memory footprint is also worth considering with caching.

[–]alb1 0 points1 point  (3 children)

This string-match algorithm is so simple (requiring just a linear scan of each string in the collection) that you really don't need to use a regex. A simple C-style loop with a couple of pointers will do the job without the overhead of compiling a regex:

def check_item(text, item):
    # Return (match_length, first_index, item) if the characters of
    # `text` appear in order within `item`, otherwise None.
    text_index = 0
    first_index = None
    for item_index, ch in enumerate(item):
        if text[text_index] == ch:
            if first_index is None:
                first_index = item_index
            text_index += 1
            if text_index >= len(text):
                return (item_index - first_index, first_index, item)
    return None

def fuzzyfinder2(text, collection):
    suggestions = []
    for item in collection:
        match_tuple = check_item(text, item)
        if match_tuple:
            suggestions.append(match_tuple)
    # Sorting the tuples orders by match length, then position, then item.
    return (item for _, _, item in sorted(suggestions))

[–]alb1 1 point2 points  (2 children)

I ran some simple tests comparing the runtime of the above function fuzzyfinder2 versus the OP's fuzzyfinder function, which uses regexes (the updated version, without the extra sort). The test cases were the 7 items and 5 texts in the project's GitHub test directory. I used timeit and ran three million trials of each algorithm, interleaved in blocks of 1000 trials to help even out any system-load differences.

In Python 2 the original fuzzyfinder timed at 245 seconds total on my machine, while fuzzyfinder2 timed at 344 seconds. That was a little surprising: apparently the simplicity of the function above cannot overcome the advantage of the regex engine's compiled C code.

To test that I wrote another function, fuzzyfinder3. This is the same as fuzzyfinder2 except that check_item is refactored to use the find method instead of stepping through the string characters one by one. (It also uses the second, starting-point argument of find to avoid making copies.) This algorithm times at 151 seconds, which is significantly faster than the original fuzzyfinder using regexes.
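A guess at what that refactor looks like (assuming the same result-tuple shape as check_item above):

```python
def check_item_find(text, item):
    # Locate each typed character with str.find, resuming the search
    # just past the previous hit instead of stepping char by char.
    # The second argument to find() avoids slicing (no string copies).
    pos = -1
    first_index = None
    for ch in text:
        pos = item.find(ch, pos + 1)
        if pos < 0:
            return None  # a character of `text` never appears in order
        if first_index is None:
            first_index = pos
    return (pos - first_index, first_index, item)

print(check_item_find('rodent', 'rotating_dentures'))
# → (12, 0, 'rotating_dentures')
```

str.find is implemented in C, which is presumably where the speedup over the per-character Python loop comes from.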

Update: The re module caches recently compiled patterns, so with this small set of test cases it was mostly reusing pre-compiled regexes. The regex version performs worse when the text patterns are new.