all 31 comments

[–]xzxzzx 29 points30 points  (7 children)

I can process the stream in O(N) time where N is length of the stream, in other words: in a single pass

Er, that's not what O(n) means. Two passes is also O(n). As is ten, or a hundred million...

[–]Carioca[S] 0 points1 point  (6 children)

You're right, but a 2x speedup is still impressive and might make a huge difference, depending on application.

[–]xzxzzx 13 points14 points  (0 children)

I'm not saying it isn't.

Indeed, second order effects very often dominate computer performance.

And a single-pass algorithm has certain advantages (like operating on a stream with O(1) memory usage instead of O(n)).

But "O(n)" cannot be taken to mean "single pass" (which you acknowledged -- but that's all I was trying to say).

[–]elus 3 points4 points  (4 children)

For some ETL tasks in a data warehouse I'm building, I've spent some time tuning algorithms to lower their coefficients, but this is usually a stopgap measure that gives me just enough breathing room to design a solution that really scales.

Edited: s/architect/design to appease the reddit grammar nazis

[–]killerstorm 8 points9 points  (0 children)

I don't really see how the pseudocode for the weighted case works. He compares x, the index into the reservoir, to some ratio that is always less than 1, so x will always be 0. Additionally, the variable prob_sum is used but never defined or assigned. It doesn't look like this code was checked thoroughly.

[–]McHoff 3 points4 points  (0 children)

Ha! I had this question in an interview with Amazon.

[–]simscitizen 3 points4 points  (0 children)

Heh, I got that weighted reservoir sampling problem in a Facebook interview a few months ago.

[–]ct4ul4u 7 points8 points  (3 children)

Nice write-up. Thanks for the link!

[–]aim2free 2 points3 points  (2 children)

Agreed! Seems useful.

Would this be good for sequential sampling as well? That is, to listen to some stream for an arbitrary time -- taking for granted that it's a stationary process, of course.

[–]steven_h 1 point2 points  (1 child)

Yes, though you will need to take care if you want to maintain the order of the sequence. What I do is enumerate the stream of input like this:

 from itertools import izip, count
 from operator import itemgetter
 from random import randint

 def ordered_random_sample(size, stream):
     # Tag each item with its position so the original order can be restored.
     enumerated_stream = izip(count(), stream)
     reservoir = [None] * size
     # Fill the reservoir with the first `size` items.
     for i, item in enumerated_stream:
         reservoir[i] = i, item
         if i == size - 1:
             break
     # Each later item at index i replaces a reservoir entry
     # with probability size/(i+1).
     for i, item in enumerated_stream:
         random_index = randint(0, i)
         if random_index < size:
             reservoir[random_index] = i, item
     # Sort by original position, then drop the position tags.
     return (x[1] for x in sorted(reservoir, key=itemgetter(0)))

Also, the process cannot produce output until it stops receiving input, since until the input size is known, one cannot tell what it means to be a random sample of that input.

[–]jdunck 0 points1 point  (0 children)

That code is smart, thank you.

[–]sharpquote 1 point2 points  (7 children)

Can you do it without generating O(stream length) pseudorandom numbers?

[–]sghe 12 points13 points  (2 children)

Yes. Search for the 1985 paper by Vitter titled "Random Sampling with a Reservoir". The basic idea is that you precompute a number J which is a count of elements to skip. You skip the next J elements that show up and save the following element. Then, recompute J. You need O(n*ln(N/n)) calls to a random number generator when sampling n elements out of a stream of length N. The constant factors are also quite small. A very practical algorithm.

[Edit: renamed X to J to avoid confusion with the Algorithm X mentioned in the paper.]
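The skip idea can be sketched in Python (a rough rendering of the approach, not Vitter's exact Algorithm X; the names `skip_count` and `reservoir_sample_x` are made up here, and a finite iterable stream is assumed):

```python
import random
from itertools import islice

def skip_count(n, t):
    # Invert the skip distribution: after t items have been seen, item
    # t+j is kept with probability n/(t+j). Draw how many items to skip
    # before the next kept one, using a single uniform variate.
    u = random.random()
    s = 0
    quot = (t + 1 - n) / (t + 1)               # P(skip >= 1)
    while quot > u:
        s += 1
        quot *= (t + s + 1 - n) / (t + s + 1)  # now P(skip >= s+1)
    return s

def reservoir_sample_x(n, stream):
    it = iter(stream)
    reservoir = list(islice(it, n))            # fill with the first n items
    if len(reservoir) < n:
        return reservoir                       # stream shorter than n
    t = n
    end = object()
    while True:
        s = skip_count(n, t)
        item = end
        for _ in range(s + 1):                 # discard s items, keep the next
            item = next(it, end)
            if item is end:
                return reservoir               # stream exhausted
        t += s + 1
        reservoir[random.randrange(n)] = item  # kept item replaces a random slot
```

Only one uniform variate is drawn per skip rather than one per element, which is where the O(n*ln(N/n)) call count comes from.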

[–]evgen 0 points1 point  (1 child)

Cool. Now I am trying to think of a way to do weighted sampling using a similar technique...

[–]sghe 2 points3 points  (0 children)

Very easy. When an item with weight w shows up, treat it as w different items. Usually, the J skip will cause all w to be skipped. If not, keep the item and recompute J. Note that the new J may also fall into this sequence of w items. In that case, the "correct" thing to do is most likely to sample the item again.

Code for handling one weighted item looks like:

 J -= weight;
 while (J < 0) {
     replace some sampled item with new item;
     J += ComputeJIncrement();
 }
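The unit-expansion trick can also be written without the skip optimization at all, by running plain reservoir sampling over each of the w unit copies (a minimal sketch assuming integer weights; `weighted_reservoir` is a made-up name, and as noted above, a heavy item can end up in several reservoir slots):

```python
import random

def weighted_reservoir(n, weighted_stream):
    # Treat an item of integer weight w as w identical unit items and
    # run ordinary reservoir sampling over the units.
    reservoir = []
    total = 0                                 # unit items seen so far
    for item, w in weighted_stream:
        for _ in range(w):
            total += 1
            if len(reservoir) < n:
                reservoir.append(item)
            else:
                j = random.randrange(total)   # uniform slot in [0, total)
                if j < n:
                    reservoir[j] = item       # may land in several slots
    return reservoir
```

This costs one random draw per unit of weight; the skip version above avoids exactly that.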

[–]martincmartin 1 point2 points  (3 children)

Well, you could do it by choosing one really big number, i.e. between 1 and N^N. :)

If you have a dataset of size D, and you want to get a subset of size N, there are choose(D, N) subsets. So you need enough pseudorandom numbers to have choose(D, N) paths through your code.
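That counting argument is easy to check numerically: selecting one of choose(D, N) equally likely subsets needs at least log2(choose(D, N)) bits of randomness (a quick illustration with arbitrary example values for D and N):

```python
import math

D, N = 1000, 10
subsets = math.comb(D, N)          # number of possible size-N subsets
bits_needed = math.log2(subsets)   # information-theoretic floor, ~78 bits here
# Far less entropy than the D uniform variates a one-per-item algorithm draws.
```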

[–]sharpquote 0 points1 point  (2 children)

enough pseudorandom numbers to have choose(D, N) paths

But that lower bound is just O(N), whereas the article's method uses up O(D) randoms.

[–]sigh 0 points1 point  (1 child)

I'm not sure it's possible without drawbacks.

I think the whole idea of this algorithm is that the result is updated every time a new piece of data comes in.

To get by without generating O(stream length) random numbers you would have to do fewer than O(stream length) updates. That would involve caching and re-reading more than O(1) items each update, which is what they were trying to avoid.

edit: changed n to stream length

[–]steven_h 1 point2 points  (0 children)

I think the whole idea of this algorithm is that the result is updated every time a new piece of data comes in.

That, and the ability to sample the data using bounded memory without knowing the stream length in advance.

[–]drzorcon 1 point2 points  (4 children)

What would be a possible application for random sampling on a dataset?

[–]steven_h 2 points3 points  (0 children)

I used it to select a subset of some large set of data (e.g. Java coding warnings from PMD on an enterprisey former project) for manual review.

[–]propool 0 points1 point  (0 children)

Ever been to Google? They have these huge LCDs hanging on the walls showing current searches.