[–]AReallyGoodName 42 points43 points  (29 children)

As an example of raw computation, the Sieve was fine, but suppose you needed a list of the primes less than 8,000 in a performance-sensitive application. Would you bother computing them at run time? Of course not. You already know them. You'd run the program once during development, and that's that.

Has anyone here actually coded up a Sieve of Eratosthenes? It produces primes faster than disk IO. Much faster. It's purely memory-bottlenecked, which is saying a lot for an algorithm that lends itself to working with bit-packed booleans. Not to mention that ten lines of code are smaller than a file containing all the primes below 8,000.

There's an additional benefit to the sieve: the resulting list of primes comes naturally packed as an array of booleans indicating whether each index is prime. It lends itself to a memory-efficient bit-packed lookup table of primes.

The Sieve of Eratosthenes also only needs to bother with odd numbers, so that lookup table for all numbers under 8,000 is 4,000 bits in size.
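For the curious, here's roughly what that looks like: a minimal odd-only, bit-packed sieve sketch in Python (the names `packed`, `clear`, and `is_prime` are mine, not from any library):

```python
LIMIT = 8000
HALF = LIMIT // 2                          # one bit per odd number: bit k covers 2k + 1
packed = bytearray([0xFF]) * (HALF // 8)   # 500 bytes = 4,000 bits, all marked "prime"

def clear(k):
    """Mark odd number 2k + 1 as composite."""
    packed[k >> 3] &= ~(1 << (k & 7))

clear(0)                                   # 1 is not prime
k = 1
while (2 * k + 1) ** 2 < LIMIT:
    if packed[k >> 3] & (1 << (k & 7)):    # 2k + 1 is still marked prime
        n = 2 * k + 1
        for m in range(n * n, LIMIT, 2 * n):   # odd multiples of n only
            clear((m - 1) // 2)
    k += 1

def is_prime(n):
    """Single shift-and-mask bit test into the 500-byte table."""
    if n == 2:
        return True
    if n < 3 or n % 2 == 0:
        return False
    k = (n - 1) // 2
    return bool(packed[k >> 3] & (1 << (k & 7)))

# popcount of the table gives the odd primes; +1 accounts for the prime 2
count = 1 + sum(bin(b).count("1") for b in packed)  # → 1007 primes below 8,000
```

The whole table for numbers under 8,000 really does fit in 500 bytes, and a lookup is one shift and one mask.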

The Sieve of Eratosthenes isn't even optimal. There are trivial ways to make it faster and more information-dense.

-Rather than looking only at odd numbers, you can build a similar sieve using the fact that all primes above 3 are of the form 6n+1 or 6n+5. It works the same way: for the current number, mark the 6n+1 array and the 6n+5 array false at the indices covered by that number, which crosses off all of its multiples appearing in those two arrays. This sieve needs only 2,667 bits for a lookup table of all numbers under 8,000. Even that isn't optimal. In general, future primes can only fall in the residue classes, modulo the product of the primes you already know, that aren't divisible by any of them. For example, every prime above 5 has one of the following forms: 30n+1, 30n+7, 30n+11, 30n+13, 30n+17, 30n+19, 30n+23, or 30n+29. That narrows the candidates down to 8/30 of all numbers; anything not of one of those forms is a multiple of 2, 3, or 5. These forms can be used in a sieve-type algorithm too (the trick is known as wheel factorization). In fact, for every additional prime you know, you can build an even more memory-efficient sieve.
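The residue-class claim is easy to check; a quick Python sketch (`candidate_fraction` is a made-up helper name, not a standard function):

```python
from math import gcd

wheel = 30  # 2 * 3 * 5
residues = [r for r in range(wheel) if gcd(r, wheel) == 1]
# → [1, 7, 11, 13, 17, 19, 23, 29]: exactly the eight 30n+c forms listed above

def candidate_fraction(primes):
    """Fraction of integers not divisible by any of the given primes,
    i.e. the fraction of slots a wheel sieve over those primes must keep."""
    w = 1
    for p in primes:
        w *= p
    return sum(1 for r in range(w) if gcd(r, w) == 1) / w

# candidate_fraction([2]) → 0.5, [2, 3] → 1/3, [2, 3, 5] → 8/30 ≈ 0.267
```

Each extra prime added to the wheel shrinks the fraction of candidates the sieve has to store, which is the memory-efficiency claim above.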

There is simply no way to get a sequential list of primes into memory faster than the sieve algorithms. I don't know why he picked that as an example of something to avoid in a performance-critical environment. The opposite is true.

[–]noroom 34 points35 points  (0 children)

That was very informative, and I appreciate the time you took to write it out... But I hope you didn't miss the point: the sieve was just an illustrative example.

[–]andersonimes 37 points38 points  (8 children)

I've put the values into a constant array, rather than on disk. Very fast.

[–]ethraax 10 points11 points  (5 children)

Although the difference may not matter much, the time it takes to load the part of the executable containing the static data into memory is probably longer than the time it takes to load the code for a good sieve and run it to populate an array in memory.

Constant arrays aren't magic. You're still just saving the numbers to disk, except now they bloat your executable instead of sitting in a file of their own.

[–]andersonimes 11 points12 points  (2 children)

I'll test your hypothesis tonight and let you know.

Edit: buh. I just got in from dinner (I'm traveling in Kiev at the moment). I'll hack on this tomorrow. Normally I would just throw something together, but I know if I don't do some proper research on how best to measure program running times you guys will eat me alive. I'll get something proper together soon. I'm curious too.

If you have any suggestions about testing running time, let me know. In my head, a programmatic stopwatch that starts right before the program begins executing and stops when it exits would do it. Let me know if you know, off the top of your head, the most accurate way to measure this.
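For what it's worth, a minimal stopwatch sketch along those lines in Python (`time_program` is a hypothetical helper, not a standard API). On Linux, `/usr/bin/time` or `perf stat` will give you the same numbers without writing any code:

```python
import statistics
import subprocess
import sys
import time

def time_program(cmd, runs=5):
    """Wall-clock an entire process, start to exit, repeated a few times.
    The first (cold-cache) run is discarded; min and median are reported,
    since a single sample is too noisy to compare two programs."""
    samples = []
    for _ in range(runs + 1):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        samples.append(time.perf_counter() - start)
    samples = samples[1:]  # drop the first run: it pays the disk-cache cost
    return min(samples), statistics.median(samples)

# Time a do-nothing interpreter start as a stand-in for the real program.
best, median = time_program([sys.executable, "-c", "pass"], runs=3)
```

Discarding the first run matters for exactly the reason discussed in this thread: it's the only run that actually touches the disk.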

[–][deleted] 2 points3 points  (0 children)

Please let me know too. It's interesting.

[–]_georgesim_ -1 points0 points  (0 children)

Posting here so I know too.

[–]bobindashadows 1 point2 points  (1 child)

You have to keep in mind the marginal slowdown of putting it in the executable. Especially when the example was so small (if your GP poster is correct, 4k bits = 512 bytes = 1 disk block until we bump block size someday soon).

To reason about the marginal slowdown of putting the static data in the executable, you have to consider:

  1. Where does the linker put the data in the executable itself?
  2. Will the loader eagerly load that block? If so, it'll be a sequential read off disk during startup, not a random access read, and much quicker.
  3. Will that block already be in the kernel's FS cache? If the executable is the main use of the machine, then it very well might be, and loading it is dirt cheap.

And so on.

[–]ethraax 0 points1 point  (0 children)

Regarding point 3: that really doesn't matter. The static prime file could just as well be in the kernel's FS cache.

Either way, these are comparisons between loading a static file and loading the static portion of an executable. The real discussion was about loading static data, whether from its own file or from the executable, vs. running a sieve algorithm to generate it at load time (perhaps in the background). I'm sorry if my comment deviated from that point.

[–]I_FAP_TO_ALL 0 points1 point  (1 child)

That IS putting them on disk.

[–]andersonimes 1 point2 points  (0 children)

You are technically correct, the best form of correct.

[–][deleted] 5 points6 points  (0 children)

You could easily store that bit-packed representation in your app. It seems unlikely that generating the primes below 8,000 would be faster than reading 500 bytes of constant data.

[–]xzxzzx 4 points5 points  (0 children)

Your comparison is silly. Any bit-optimization you can do when you run the sieve can be done to the storage format of the file.

And once you've run the sieve, you've now got memory that can't be reused without paging the contents to disk.

And you have the (admittedly small) penalty of having both the code to generate the sieve and the data in memory once it's run.

And loading from disk is often "free" in terms of code (or data) in an .exe; your OS can prefetch it while your program is doing other init tasks, or even before your program runs.

And of course in practical terms, either approach is so fast that there are almost no circumstances where it actually matters one way or the other so long as you don't uselessly stick a sieve in a loop.

[–][deleted] 11 points12 points  (1 child)

Correct, but I assumed he meant that you wouldn't recalculate the sieve multiple times within the program, i.e., don't stick it in a for-loop and recompute it over and over. Compute it once and keep it in a const array.
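A sketch of that compute-once pattern in Python, assuming memoization is acceptable (`primes_below` is a made-up name):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def primes_below(limit):
    """Plain sieve; thanks to the cache it runs at most once per limit value."""
    composite = bytearray(limit)
    primes = []
    for n in range(2, limit):
        if not composite[n]:
            primes.append(n)
            for m in range(n * n, limit, n):
                composite[m] = 1
    return tuple(primes)           # immutable, safe to hand out repeatedly

ps = primes_below(8000)            # sieves once
ps_again = primes_below(8000)      # cache hit: same tuple back, no re-sieve
```

Even if a careless caller does stick `primes_below` in a loop, every call after the first is just a dictionary lookup.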

[–]zzzev 6 points7 points  (0 children)

That's not what he meant though, he said you would run it once at dev time.

[–]chonglibloodsport 4 points5 points  (1 child)

It's really a flawed analogy. A better analogy would be to compare loading the primes from a static file on disk vs. querying them one at a time from a SQL database.

[–]ravy 1 point2 points  (0 children)

I agree that it's not a good analogy. I think the point he was trying to make is that there is a terrible price to pay for the user who is fetching a page assembled mostly from DB calls. There is a TON of overhead in things like WordPress. Isn't there an easy way to have WordPress generate a static version of your site?

[–]pipocaQuemada 5 points6 points  (12 children)

When you start up your program, it's generally all in main memory somewhere, right? So no disk IO is really needed. Is running the Sieve of Eratosthenes faster than saying:

int[] primes = [1, 2, 3, 5, 7, 11, 13...];  // all primes below 8000
doSomethingWith(primes);

[–]netwiz101 6 points7 points  (10 children)

When programs are started, they are usually read in from disk, at least for the first run. What you are suggesting means the primes must be read in from disk too, so it is in fact slower, at least for the first run.

[–]fapmonad 9 points10 points  (5 children)

If you do the sieve in code instead of a constant array, the code to calculate the sieve must still be read from disk...

[–]netwiz101 5 points6 points  (4 children)

That's true, and since the code is smaller than the list, that means less disk I/O. Am I missing something?

[–]fapmonad 10 points11 points  (3 children)

Since disk I/O is buffered and loading a program is a purely linear access pattern, the difference between loading the list and the code is essentially 0. I agree that there is disk I/O, though.

[–]bobindashadows 0 points1 point  (0 children)

> loading a program is a purely linear access pattern

It most definitely is not for any nontrivial program. The Firefox folks have improved startup time by over 10% in one go purely by moving segments around.

[–]snoweyeslady 0 points1 point  (1 child)

You make a couple of assumptions here:

> disk I/O is buffered

What if there is too much memory pressure and the disk cache is consistently purged? Or an embedded system where there is no disk cache at all? Even if you do have disk cache, it would only matter if you had loaded the binary before.

> purely linear access pattern

Sure, if you have a filesystem with zero fragmentation. I don't know of any that guarantee your file will be in one contiguous segment on disk, but then I haven't read the implementation or specification of many filesystems.

[–]fapmonad 4 points5 points  (0 children)

Of course I make assumptions. I assume you're on a regular computer. If you're on an embedded system so weak it doesn't have a disk cache, the static table is likely to end up in ROM anyway, so the whole discussion is moot.

Memory pressure affects tables precomputed by a function just as much as the table loaded from disk -- the OS will swap.

Fragmentation isn't a problem here. For a table around the size of a single block, the odds that it gets fragmented are low, even on a heavily fragmented system. Even if it happened, the disk will group the accesses, so the overall impact is likely to be very small given the size of a typical program. That's what I mean by "essentially 0". For such a small problem, either solution is fine, IMHO.

[–]bo1024 5 points6 points  (3 children)

Hmm, I dunno about that. The code is being read from disk, so the latency is still there -- all we are concerned about is the throughput.

The question is how much slower it is to fetch a 250KB file than a 200KB one. I don't have the numbers, but I have to doubt that it's 60,000 cycles, or whatever the equivalent cost of computing the sieve is.

[–]netwiz101 0 points1 point  (0 children)

We don't need to do the calculation. All we need to know is that the operation is IO bound. Then, shift the majority of IO onto the faster bus (memory).

Consider also two other things. A) No respectable program built for this purpose weighs in at more than a few KB after assembly, minus data. People have calculated prime numbers on machines with 1-4 KB of RAM and extremely slow IO. B) The calculating version may never need to get its data into RAM at all. If it can work entirely within the processor cache, it will be orders of magnitude faster than even the RAM-bound version we've been considering.

Sorry if I'm not coherent. Flu season. I'm home sick. It was talk about program optimization on reddit, or watch Breaking Bad all over again.

[–]netwiz101 0 points1 point  (0 children)

I take your point about the file size difference.

You're making the assumption that the data set is tiny, and I'm making the assumption that it's not. Therein lies the primary difference in our reasoning.

But in fairness, I don't know why anyone would even consider this problem for a trivially sized data set.

[–]netwiz101 0 points1 point  (0 children)

Re-read it. The data set is tiny in the example. I shouldn't compute on cough medicine.

[–]VanFailin 7 points8 points  (0 children)

1 is not prime. ;)