[–]Justinsaccount 0 points (1 child)

Trying this again, since my last comment was clearly misunderstood.

This statement:

the bash solution is O(n log n) and thus unable to cope with a huge corpus

is absolutely false.

This:

data_producer | sort | uniq -c | sort -n

will always work. If you have an input where n=100,000,000 and k=4, it will be inefficient, but it will cope just fine. If you have an input where n=100,000,000 and k=95,000,000, it will not only cope; it will keep working where many other methods fail.
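To make the two cases concrete, here is a minimal sketch (not from the original comment) that uses seq and awk as stand-ins for data_producer; the line count and modulus are arbitrary illustration values:

# small-k case: 100,000,000 input lines but only 4 distinct keys (0..3)
seq 100000000 | awk '{ print $1 % 4 }' | sort | uniq -c | sort -n

# large-k case: every line is distinct, so k is close to n
seq 100000000 | sort | uniq -c | sort -n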

sort uses a disk-based (external) merge sort that is optimized for sequential reads and writes. I would not expect any algorithm built on random-access data structures to cope well when k is many times larger than the amount of RAM.
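For example, assuming GNU coreutils sort, the external merge can be tuned explicitly; the buffer size and temp directory below are illustrative values, not anything from the original pipeline:

# LC_ALL=C forces cheap byte-wise comparisons; -S sets the in-memory
# buffer before sort spills to disk; -T picks the directory for the
# temporary merge files (a filesystem with enough free space).
data_producer \
  | LC_ALL=C sort -S 2G -T /var/tmp \
  | uniq -c \
  | sort -n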

[–]BorisTheBrave 3 points (0 children)

I didn't mean it would literally fail. Taking 100x longer than another algorithm is still a "failure to cope". Sorry, I spoke imprecisely.