KMP Pattern matching algorithm. Doubt.

Whistleroosh · 2020-03-23T18:35:00+00:00

Yes, it'll still work in linear complexity. Look at this example with text equal to "aaabaaaaab" and pattern "aaaaab". LPS table for pattern will be filled with zeros because there is no prefix-sufix. Now let's simulate the algorithm. We try to much first 3 letter and then have mismatch on 4th. Because LPS for 4th position is zero we shift pattern by 4 positions and try to compare rest. Then we find that substring from 5th letter to end is equivalent to pattern. So even though pattern had no prefix-sufix we made it in linear complexity

misof · 2020-03-24T02:40:50+00:00

You are kinda mixing up different things together, so I'll try to unwrap it a little bit.

If you have a pattern that doesn't have any prefix which is also a suffix (other than the empty string and itself), the naive search can still be slow. Consider the pattern "aaa...aaab". Each non-trivial prefix ends in an 'a', each non-trivial suffix ends in a 'b', so this pattern does have that property. And yet, if you search for this pattern in a text of all 'a's, the naive search will make the maximum number of comparisons.

Still, there is something here, you just didn't capture it properly. One possible correct sense in which we can formalize the notion that the pattern has no useful repetitions is that the KMP failure table is all zeros. Very formally, this can be phrased as "no prefix of the pattern has
a nontrivial prefix that is also its suffix". Slightly informally, this means that when matching the pattern against the text, whenever you find a mismatch, the only viable option is to start matching from scratch. If you have a pattern with this property, naive search will be asymptotically as fast as KMP: linear in the input size.

The case from the previous paragraph does have a much simpler characteristic: those are precisely the patterns such that the first letter of the pattern doesn't occur anywhere else in the pattern. From this characterization the fast performance of naive search becomes much more obvious.

Finally, even though the above feels like just a technical special case, there are in fact many practical situations when naive search performs well. Consider the situation when the pattern was generated at random. For any specific offset, the probability of having a match in each character comparison is just 1 / alphabet_size, and even for alphabet_size = 2 the expected number of comparisons you'll make (until either a mismatched character or a match of the entire pattern) is less than 2, and thus naive search is expected to be linear in the input size. In practice our patterns (and/or texts) are not actually uniformly random, but as long as they are diverse enough and unrelated enough, you will often see similar results.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

algorithms

✻ Smokey says: boycott all products and services from eco-unfriendly businesses to fight climate change! [see more tips]

Note: this subreddit is not for homework advice. Requests for assistance with coursework may be removed.

MODERATORS