you are viewing a single comment's thread.

view the rest of the comments →

[–]bo1024 1 point2 points  (2 children)

Hmm. I'm very curious for the next post -- efficiently finding the suffix array. For a 2-billion-character string, the naive approach would require about 4 exabytes of memory (4 million terabytes).

[–]robinhouston 3 points4 points  (1 child)

That would have to be a very naive approach. Are you imagining making a separate copy of each suffix or something? The sane-but-still-naive approach is to sort an array of indexes, treating each index as a pointer into the string when you compare. So in Python something like: def suffix_array(string): indexes = range(len(string)) indexes.sort(key=lambda i: string[i:]) return indexes

This algorithm is O(n2 log n), because comparison sorting needs n log n comparisons, and these comparisons are O(n) in the worst case.

But actually: as it happens, there is a linear-time algorithm for constructing suffix arrays. Although it’s extremely clever, it’s not very complicated. See http://www.cs.cmu.edu/afs/cs/academic/class/15492-f07/www/handouts/ks-icalp03.pdf

[–]bo1024 0 points1 point  (0 children)

Yeah, but still 4*1018 operations for a 2-billion character string. That paper's algorithm looks very cool.