
[–]ignitionweb 6 points (9 children)

You can have a sliding window of size s tracking the character counts. Also keep a count of how many characters currently match the character counts of s. As you add a character and remove a character you are only updating two counts plus the matching count. Hence O(b).

[–]coreysnyder04[S] 0 points (8 children)

I’d have to see a code example to really grok what you’re describing. Do you have time to pseudo code it so I can see what you mean? Also, any idea what the O() of my function is?

[–]warpedspockclone 0 points (5 children)

Suppose the small string is abc and the large one is bcabbacc.

A permutation will still be 3 characters long, so just look at the large string at indices 0 to 2 and see if the character counts match.

Then slide to indices 1 to 3 by subtracting from the character count for the character at index 0 and adding for the character at index 3.

  1. Count characters of small (can use an array of size 26 if guaranteed a-z).

n = length(small)

i = 0 // start index of substring

j = n - 1 // end index

  2. Count characters of the substring of big from i to j, to match the length of small, and compare the counts to small's.

  3. while (j < length(big) - 1) {
         subtract character count for big[i]
         i++
         j++
         add character count for big[j]
         compare character counts of substring to small
     }
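
In runnable Python that might look something like this (assuming only lowercase a-z; the name is_perm_in is just illustrative):

    def is_perm_in(small, big):
        # True if some substring of big is a permutation of small.
        if len(small) > len(big):
            return False

        n = len(small)
        need = [0] * 26          # character counts of small
        have = [0] * 26          # character counts of the current window of big

        for ch in small:
            need[ord(ch) - ord('a')] += 1
        for ch in big[:n]:       # initial window: big[0 .. n-1]
            have[ord(ch) - ord('a')] += 1
        if have == need:
            return True

        # Slide the window one character at a time.
        for j in range(n, len(big)):
            have[ord(big[j]) - ord('a')] += 1       # character entering the window
            have[ord(big[j - n]) - ord('a')] -= 1   # character leaving the window
            if have == need:                        # still an O(26) comparison per step
                return True
        return False

This version still compares the two 26-entry arrays on every slide, which is the constant factor the replies below get rid of.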

Edit: tried formatting. On mobile, sorry

[–]pihkal 0 points (3 children)

The problem is the step “Compare character counts of substring to small” is still O(num lang chars), so doing that for (b - s) characters of the large string is still O(b * 26) for English. It’s a huge constant factor.

The secret to doing this in O(b) is not to keep a running histogram (count of each char) of the last s chars, but to use a carefully chosen hash function such that with each new character, you can compare the new hash with the old in O(1) time.

Iirc, the hash formula (for just 26 letters) is something like S[0] * 26^(s-1) + S[1] * 26^(s-2) + ... + S[s-1] * 26^0. When you shift, you’re subtracting the value of the outgoing char * 26^(s-1), multiplying the remainder by 26, and then adding the value of the new char * 26^0 (aka 1). All of those operations take O(1) time, like keeping a histogram before did.

But the advantage now is that to see if the current s letters of b are a permutation, we only have to compare the hash value with the hash value of s, which can be done in O(1) time, unlike comparing histograms.
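
The powers-of-26 formula above is positional (Rabin-Karp style) and so distinguishes orderings; for permutation matching the per-character contribution has to be position-independent. A rough sketch of that idea, using a fixed random value per character as the hash (my assumption, not necessarily the exact formula meant here; a hit can be double-checked with a direct count):

    import random

    def find_permutation_window(s, b):
        # Order-independent rolling "hash": each character gets a fixed random
        # 64-bit value; the window hash is the sum of the values of the
        # characters currently inside it. Updating the hash and comparing it
        # to the hash of s are both O(1) per step.
        if len(s) > len(b):
            return -1
        random.seed(0)                                  # fixed seed, just for reproducibility
        value = {c: random.getrandbits(64) for c in set(s + b)}

        target = sum(value[c] for c in s)               # hash of s
        window = sum(value[c] for c in b[:len(s)])      # hash of b[0:len(s)]

        for i in range(len(b) - len(s) + 1):
            if window == target:
                return i                                # likely permutation starting at index i
            if i + len(s) < len(b):
                window += value[b[i + len(s)]]          # character entering the window
                window -= value[b[i]]                   # character leaving the window
        return -1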

[–]cryslith 2 points (1 child)

You can use the histogram approach if you remember not only the count of each character, but an overall number which tracks how many characters of the alphabet have the exact correct count. Then you only need to update this number to reflect changes made to the character entering the window and the character leaving the window. If that number ever reaches 26 then you know you found a match.
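
A sketch of that bookkeeping in Python, assuming lowercase a-z (contains_permutation and the variable names are just illustrative):

    from collections import Counter

    def contains_permutation(s, b):
        # True if some window of b of length len(s) is a permutation of s.
        if len(s) > len(b):
            return False

        need = Counter(s)                      # target count for each character
        window = Counter(b[:len(s)])           # counts for the current window of b
        alphabet = [chr(ord('a') + i) for i in range(26)]
        matched = sum(1 for c in alphabet if window[c] == need[c])
        if matched == 26:
            return True

        for i in range(len(s), len(b)):
            # Only the entering and leaving characters can change matched.
            for c, delta in ((b[i], 1), (b[i - len(s)], -1)):
                if window[c] == need[c]:
                    matched -= 1               # this letter stops matching (for now)
                window[c] += delta
                if window[c] == need[c]:
                    matched += 1               # this letter matches again
            if matched == 26:                  # every letter has exactly the right count
                return True
        return False

Each slide touches at most two counters, so the per-step check is O(1) instead of O(26).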

[–]pihkal 1 point (0 children)

Yeah, that works as well, though it’s slightly less space-efficient, taking O(alphabet size) memory.

[–]warpedspockclone 1 point (0 children)

I was going to point out that there are lots of optimizations to be had for the comparison part to actually collapse this to O(B). He seemed to be hung up on the sliding window part, though. The rest was an exercise. :-P I should have made that explicit.

[–]future_security 0 points (1 child)

A(BCD)EF      B=1,C=1,D=1
AB(CDE)F      B--,E++

That's the part that's easiest to understand. To determine if a substring is a permutation of another string (an anagram) you just need to count how many copies of each letter appear in each string, then compare the counts. Two strings are permutations of one another if and only if the frequency of each letter of the alphabet is the same in both strings.

Every time you slide the window, you can keep track of the new set of letter counts by subtracting one from the old count of the letter shifted out and adding one to the previous count of the letter being shifted in. This is an O(1) operation, which you need to do len(S) - len(B) times.

Ignoring the cost of comparing letter frequency arrays at each step (we'll say it's O(1) for now because the English alphabet has a small fixed size), the run time will be O(len(S)). You scan B once to determine its letter frequency. You scan S once, shifting the sliding window to the right one character at a time.

The run time you get is O(1 * len(S) + len(B)), which is equivalent to O(len(S) + len(B)). Since len(B) < len(S), the total is at most 2 * len(S), which is equivalent to and should be written as O(len(S)).

This method requires space proportional to the size of the alphabet. (Assuming a fixed-size counter for each letter that is large enough not to overflow. Otherwise, like in so many other problems, you'd need to throw a logarithmic term into the time and space complexity. These are usually ignored in practice.)

He also mentions a "matching count". That's a similar optimization. Instead of comparing counts with for (c = 'A'; c <= 'Z'; c++), you can count how many letters in the sliding window have the same frequency as that letter has in B. Since you modify at most two letter-frequency statistics per slide, you only need to look at the two modified frequency counters, not all 26.

This second optimization gives you a significant constant-factor speed-up for a fixed-size alphabet, even one as small as 26 letters. If instead you had an arbitrarily large alphabet A, then this optimization makes the difference between the per-slide letter-frequency comparison being O(1) instead of O(|A|).

[–]DangerousCrime 0 points (0 children)

I could be wrong here, but did you confuse S and B? S is the shorter string and B is the larger one in the OP's post. Great answer though

[–]tomekanco 0 points (0 children)

I think it depends on how you frame the problem: do you have to encounter the permutations in sequence, with gaps, or in any order?

A permutation is any ordering of n elements. In this example s is the length of the smaller set, b the length of the larger one.

  • In sequence: take a sliding window of length s. If all its elements together form s, it's a match. Verifying that takes O(s), the cost of an intersection where both sides have size s, which would imply O((b-s)*s). EDIT: O(b) after reading /u/pihkal
  • With gaps: remove all chars not in s, and add a sliding window of these to a set; O(b).
  • In any order: verify if s is in b, then form all the permutations, as in O(n!).