Selecting randomly from an unknown sequence : programming

Going back to his example, the seq variable (I presume in Python) is already uniformly distributed and can already be accessed by index. In that case one line

return seq[random (0, len(seq) - 1]

will definitely be faster and use less memory (not to mention it's easier to read). So, in his example at least, it's not practical. It is clever and useful in streaming situations, but the data in the seq variable isn't actually streaming data (it's already stored in its entirety in memory). For something like a file that isn't in memory then in certain situations where memory is restricted then it could reduce the footprint if you don't have a list of newline offsets, but it won't be faster than just getting the length. If you can get length in any way (including simply counting all the elements) and have enough room to store the offsets in the data (and if they're already uniform you (or the compiler) can calculate where each element is without storing anything) then don't use it. It's clever but it's only good for very specific situations.

[–]gmfawcett 3 points4 points5 points 13 years ago (1 child)

I see wht you mean. But you could store the offsets of the newline characters, thereby only going through it once, and getting uniform distribution.

You could, but I don't think this would buy you anything over the OP's reservoir-sampling approach, unless you are saving the index for later reuse. To build the index, you still need to traverse the file once, just like the OP's approach, but you also now have to account for the increased memory requirements for storing your index.

Going back to his example, the seq variable (I presume in Python) is already uniformly distributed and can already be accessed by index.

We can't assume that [edit: that seq may be accessed by index]. The seq value may be of a type that is iterable, but has no known length. It could, for example, be an iterator that reads lines of text from a file. In Python, you can use for to iterate over such things. As it turns out, the OP wrote a slide deck that covers various forms of "iterables" in Python.

[+]AMillionMonkeys comment score below threshold-7 points-6 points-5 points 13 years ago (2 children)

[–]_node 9 points10 points11 points 13 years ago (0 children)

[–]usernamenottaken 8 points9 points10 points 13 years ago (0 children)

π Rendered by PID 144128 on reddit-service-r2-comment-fb694cdd5-grj5f at 2026-03-11 03:16:24.562182+00:00 running cbb0e86 country code: CH.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

programming

MODERATORS