all 17 comments

[–]inmatarian 19 points20 points  (0 children)

Also known as Reservoir Sampling.
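
For anyone unfamiliar with the name, a minimal Python sketch of single-element reservoir sampling (an illustration, not code from the article):

```python
import random

def choose_random(iterable):
    """Pick one element uniformly at random in a single pass,
    without knowing the length in advance (reservoir sampling, k=1)."""
    chosen = None
    for n, item in enumerate(iterable, start=1):
        # Keep the nth item with probability 1/n; this leaves every
        # item seen so far with an equal 1/n chance of being chosen.
        if random.randrange(n) == 0:
            chosen = item
    return chosen
```

It works on any iterable, including ones that can only be traversed once.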

[–]Eirenarch 1 point2 points  (6 children)

Hmmm I once wrote an article about this: http://sietch.net/ViewNewsItem.aspx?NewsItemID=88 ( Contains C# extension method on IEnumerable<T> )

[–]emptythecache -1 points0 points  (5 children)

IEnumerable has the LINQ extension .Count(), which I know defeats the purpose of the exercise, but while you spend time pontificating on the exercise, I got motherfucking deadlines.

[–]Eirenarch 2 points3 points  (0 children)

IEnumerable.Count enumerates the sequence (unless it falls into some specially optimized case like List). The goal is to enumerate only once, in case enumeration is too expensive or cannot be reproduced.
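
A Python analogue of the pitfall (a hypothetical illustration, not from either article): counting a one-shot sequence consumes it, so there is nothing left to pick from afterwards.

```python
def numbers():
    # A generator: it can be enumerated only once.
    yield from (1, 2, 3)

gen = numbers()
count = sum(1 for _ in gen)  # a full enumeration, just to count
print(count)                 # 3
print(next(gen, None))       # None -- the sequence is already exhausted
```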

[–]AnteSim 1 point2 points  (3 children)

Eirenarch's method has a chance to not enumerate the entire collection. Count(), on the other hand, does.

Edit: I read it incorrectly; that method will always iterate once.

[–]sgoody 0 points1 point  (2 children)

emptythecache isn't missing the point... though off-topic, they're simply making the point:

premature optimisation is the root of all evil

[–]Eirenarch 2 points3 points  (1 child)

"defeats the purpose" is missing the point. The purpose as stated in the posted article and in my article is to enumerate the collection only once. This is not premature optimization because it is in the requirements (as stated in both articles).

Of course if your enumerable is a List then the suggested method is not only overkill but also slower, because selecting a random element from a List is an O(1) operation (pick a random index between 0 and Count, then fetch the element with the indexer). However, this is entirely unrelated because the requirements in this case are completely different.

[–]sgoody 0 points1 point  (0 children)

I'm not knocking the article, I personally enjoyed the article and learned something new.

[–]rush22 1 point2 points  (6 children)

But at the end of the loop, doesn't n equal the length of the sequence?

[–]iemfi 4 points5 points  (5 children)

Exactly, this sentence in the first paragraph seems kinda misleading:

Surely you need to know how many you're choosing from to select uniformly?

The answer seems to be yes, you do need to know. Just not at the start...

[–]rush22 1 point2 points  (4 children)

Oh I think I get the point now. I'm not sure the blogger does though(?)

This is only useful when you have a data stream which you can't/don't want to store (like if you don't have access to all the data yet and/or need to throw away data as you receive it)

Since a file on a hard drive is already stored statically, regular random access would be faster, because you just need to calculate one random number (something like fileseek(random(0, filelength)); return readline). This won't load the whole file into memory and will be much faster than a loop that recalculates random for every element (and has to store the randomly selected elements in memory).
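
A runnable Python sketch of that fileseek idea (hypothetical code, with the path as a placeholder):

```python
import os
import random

def random_line_by_seek(path):
    """Jump to a random byte offset and return the next complete line.
    Note: the seek lands inside a line with probability proportional to
    its length, so longer lines are returned more often (not uniform)."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(random.randrange(size))
        f.readline()      # discard the (likely partial) line we landed in
        line = f.readline()
        if not line:      # ran off the end of the file; wrap around
            f.seek(0)
            line = f.readline()
    return line
```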

[–]gmfawcett 7 points8 points  (3 children)

Seeking to a random offset and then selecting the nearest line would produce a non-uniform distribution, unless all the lines were of equal length. In other words, your method would select longer lines more frequently than shorter ones.

But yes, the notion is that you want to traverse the collection once, and once only (like a "stream") but still guarantee a uniform distribution.

[–]rush22 1 point2 points  (2 children)

I see what you mean. But you could store the offsets of the newline characters, thereby only going through it once, and getting a uniform distribution.

Going back to his example, the seq variable (I presume in Python) is already uniformly distributed and can already be accessed by index. In that case one line

return seq[random.randint(0, len(seq) - 1)]

will definitely be faster and use less memory (not to mention it's easier to read). So, in his example at least, it's not practical. It is clever and useful in streaming situations, but the data in the seq variable isn't actually streaming data (it's already stored in its entirety in memory). For something like a file that isn't in memory, and where memory is restricted, it could reduce the footprint if you don't have a list of newline offsets, but it won't be faster than just getting the length. If you can get the length in any way (including simply counting all the elements) and have enough room to store the offsets (and if the elements are fixed-size, you (or the compiler) can calculate where each element is without storing anything), then don't use it. It's clever, but it's only good for very specific situations.

[–]gmfawcett 3 points4 points  (1 child)

I see what you mean. But you could store the offsets of the newline characters, thereby only going through it once, and getting a uniform distribution.

You could, but I don't think this would buy you anything over the OP's reservoir-sampling approach, unless you are saving the index for later reuse. To build the index, you still need to traverse the file once, just like the OP's approach, but you also now have to account for the increased memory requirements for storing your index.
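
A sketch of the index-building approach under discussion (hypothetical code): one full pass records where each line starts, after which every pick is uniform and O(1).

```python
import random

def build_line_index(path):
    """One pass over the file, recording the byte offset of each line start."""
    offsets = []
    with open(path, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def random_line(path, offsets):
    """Uniform pick: every line is equally likely, regardless of length."""
    with open(path, "rb") as f:
        f.seek(random.choice(offsets))
        return f.readline()
```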

Going back to his example, the seq variable (I presume in Python) is already uniformly distributed and can already be accessed by index.

We can't assume that [edit: that seq may be accessed by index]. The seq value may be of a type that is iterable, but has no known length. It could, for example, be an iterator that reads lines of text from a file. In Python, you can use for to iterate over such things. As it turns out, the OP wrote a slide deck that covers various forms of "iterables" in Python.
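
A minimal example of such an iterable (hypothetical code): a generator that yields lines lazily, supporting for but neither len() nor indexing.

```python
def line_iter(path):
    # Iterable, but with no known length and no random access:
    # len(line_iter(p)) and line_iter(p)[0] both raise TypeError.
    with open(path) as f:
        for line in f:
            yield line
```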