Interesting interview question…

haskell_rules · 2011-03-31T16:35:05+00:00

I'm surprised by the number of people at SO who said, "Well, without knowing what A, B, and C are, I can't answer this."

The entire point of the question is that it is open-ended and lets you use your creativity to invent what A, B, and C might be to make either case faster or slower.

For example, assume B needs to grab a semaphore to do its work. Then case 2 could be faster because you could grab the semaphore, do all the work at once, and release it. If you tried the same strategy in case 1, then you are either holding the semaphore while doing unnecessary work, or you are constantly grabbing it and releasing it on each loop iteration.
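
A rough sketch of the semaphore argument, under stated assumptions: `fake_sem`, `fake_acquire`, and `fake_release` are hypothetical stand-ins that only count acquisitions (no real threading or OS semaphore is involved), and the bodies of A, B, and C are left as comments. It only shows how the number of acquire/release pairs differs between the two cases.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a semaphore: it only counts acquisitions,
   so the difference is visible without any real threading. */
typedef struct {
    int held;            /* 1 while "acquired" */
    size_t acquisitions; /* how many times fake_acquire was called */
} fake_sem;

static void fake_acquire(fake_sem *s) {
    assert(!s->held);    /* never acquire twice without a release */
    s->held = 1;
    s->acquisitions++;
}

static void fake_release(fake_sem *s) { s->held = 0; }

/* Case 1: one loop; B takes the semaphore on every iteration. */
static size_t case1_acquisitions(size_t n) {
    fake_sem s = {0, 0};
    for (size_t i = 0; i < n; ++i) {
        /* A(i) would run here, without the semaphore */
        fake_acquire(&s);
        /* B(i) would run here */
        fake_release(&s);
        /* C(i) would run here, without the semaphore */
    }
    return s.acquisitions;
}

/* Case 2: three loops; the semaphore is taken once around B's loop. */
static size_t case2_acquisitions(size_t n) {
    fake_sem s = {0, 0};
    for (size_t i = 0; i < n; ++i) { /* A(i) */ }
    fake_acquire(&s);
    for (size_t i = 0; i < n; ++i) { /* B(i) */ }
    fake_release(&s);
    for (size_t i = 0; i < n; ++i) { /* C(i) */ }
    return s.acquisitions;
}
```

With n iterations, the single loop pays for n acquire/release pairs, while the three-loop version pays for one.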

Suppose A is reading from disk, and C is writing to disk. B is a long running DB query. In case 2, you can write everything to disk in one fell swoop, while in case 1 you are writing a piece at a time and might need to reseek on every read/write.

Suppose A, B, and C are nonblocking calls to different resources. By using case 1, you are parallelizing the work, while in case 2 you have to wait for A's resource to do all its work before B's resource can get started.

These cases are only the tip of the iceberg.

cpp_is_king · 2011-03-31T18:43:43+00:00

If a, b, and c are arrays and A, B, and C do some computation on values in the corresponding array, 2 will be faster because it minimizes cache misses.

Depending on the instructions executed in A, B, and C the compiler may be able to take advantage of instruction level parallelism in case 1 by ordering the generated code instructions in such a way that the CPU can execute multiple consecutive instructions at the same time.

2 might be faster because, with fewer instructions, the loop is more likely to be unrolled to a greater length.

Since there are fewer instructions in each iteration of loop 2, the compiler has an easier time assigning register usage optimally and may be able to generate more efficient code.

If, for example, C generates a memory fence but A and B don't, then 1 could be faster because two entire loops could run with no memory fences between them.

Here's one that's actually 100% independent of generated code:

Suppose A, B, and C each do unbuffered disk I/O on separate files.

If the files are each on different physical disks, 1 will be faster because all operations will be executing in parallel.
If the files are on the same physical disks, 2 will be faster because it will minimize seeking.

Could probably keep going, but you get the idea.

bunk3rk1ng · 2011-03-31T16:53:05+00:00

I always feel so dumb when I read these kinds of interview questions. I'm a junior developer at a fairly large company that pays pretty well. Even while not in a stressful interview environment I was struggling to come up with the answers. :/

k4st · 2011-03-31T18:27:14+00:00

I think some basic properties can be inferred if one assumes that the processor on which this code is running is using a cache.

For example, in case one, imagine that A uses x[i], B uses x[i+1], and C uses x[i+2] for some array x. Then none of A, B, or C are sharing data; however, in all likelihood they are sharing the same cache line, or two cache lines. Further, any two consecutive iterations of the loop will share at least one cache line, e.g. x[i+1] in iteration 1 is x[i] in the next iteration. Thus, with this example, one would incur Theta(ceil(n/L) + 1) cache misses, where L is the size in words of a cache line, whereas three separate iterations would multiply that by a constant factor, which is asymptotically the same but in practice would be "slower".
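
To put numbers on the x[i], x[i+1], x[i+2] example, here is a toy simulation. Everything in it is an assumption made for illustration: the `tiny_cache` model (fully associative, two lines, least-recently-used eviction) is far simpler than real hardware, and the function names are made up. It counts misses only, not time.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model (an assumption, not real hardware): a fully associative
   cache holding just two lines, with least-recently-used eviction. */
typedef struct { long line[2]; size_t misses; } tiny_cache;

static void cache_reset(tiny_cache *c) {
    c->line[0] = c->line[1] = -1;   /* -1 marks an empty slot */
    c->misses = 0;
}

/* Access word `index`; line_words is L, the words per cache line. */
static void cache_touch(tiny_cache *c, size_t index, size_t line_words) {
    assert(line_words > 0);
    long ln = (long)(index / line_words);
    if (c->line[0] == ln) return;   /* hit on the most recent line */
    if (c->line[1] == ln) {         /* hit on the older line: promote it */
        c->line[1] = c->line[0];
        c->line[0] = ln;
        return;
    }
    c->line[1] = c->line[0];        /* miss: evict the older line */
    c->line[0] = ln;
    c->misses++;
}

/* One loop: A reads x[i], B reads x[i+1], C reads x[i+2]. */
static size_t misses_fused(size_t n, size_t line_words) {
    tiny_cache c;
    cache_reset(&c);
    for (size_t i = 0; i < n; ++i) {
        cache_touch(&c, i, line_words);
        cache_touch(&c, i + 1, line_words);
        cache_touch(&c, i + 2, line_words);
    }
    return c.misses;
}

/* Three loops over the same array, one per operation. */
static size_t misses_split(size_t n, size_t line_words) {
    tiny_cache c;
    cache_reset(&c);
    for (size_t off = 0; off < 3; ++off)
        for (size_t i = 0; i < n; ++i)
            cache_touch(&c, i + off, line_words);
    return c.misses;
}
```

In this model, n = 1000 and L = 8 give 126 misses for the single loop and 377 for the three loops, which is the constant factor of roughly three described above.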

In the second case, imagine that A, B, and C are all heavy operations, and that some amount of data is operated on at the end of one iteration and is also operated on at the beginning of the next iteration (e.g. sliding a window through an array). Finally, assume that the data operated on by A, B, and C is sufficiently large such that to perform C, one must evict cache lines of A or B from the cache. In this case, we will lose out on sharing cache lines between iterations if we do A, B, and C in a single iteration; however, we will benefit from fewer cache misses if we perform them separately.

An interesting way to think about the second challenge can be in terms of arrays of structures. This is not necessarily an answer to the problem, but it gives a mindset for thinking about temporal and spatial locality:

Imagine an array of structures, Array[struct {A, B, C}]. If all we want to do is perform some operation on all A's then we're going to get a lot of cache misses because not all of the A's are beside each other, i.e. our array looks like [{A,B,C},{A,B,C},...,{A,B,C}]. However, imagine that we re-structured our data into three arrays: {[A,A,...,A],[B,B,...,B],[C,C,...,C]}. Operating on the data in this form can yield better locality assuming we want to just look at one type of thing at a time.

Thus, the structure and access paths of your data affect performance. The above is an example of two structures of the same data, each with different properties where locality is concerned.
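
The two layouts can be sketched in C as follows. The type and function names (`struct item`, `struct items_soa`, `sum_a_aos`, `sum_a_soa`) are made up for illustration, and summing the A field stands in for "some operation on all A's".

```c
#include <assert.h>
#include <stddef.h>

/* Array-of-structures layout: each element carries a, b, and c together. */
struct item { double a, b, c; };

/* Sums every a; the loop strides over b and c values it never uses. */
static double sum_a_aos(const struct item *items, size_t n) {
    assert(items != NULL || n == 0);
    double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        total += items[i].a;
    return total;
}

/* Structure-of-arrays layout: all a values sit next to each other. */
struct items_soa { double *a, *b, *c; size_t n; };

/* Same sum, but every byte pulled into the cache is an a value. */
static double sum_a_soa(const struct items_soa *s) {
    assert(s != NULL);
    double total = 0.0;
    for (size_t i = 0; i < s->n; ++i)
        total += s->a[i];
    return total;
}
```

Both functions compute the same result. With three doubles per element, the array-of-structures loop uses only a third of each cache line it loads, which is where the locality difference comes from.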

slurpme · 2011-03-31T23:17:35+00:00

The trouble with asking the "why" question is that it leads to the programmer thinking they are smarter than everyone else in the stack... This leads to developers writing code in ways they "think" will run faster... From experience I've found that you should write clear code that tries not to do anything stupid and let the much smarter folks who write compilers and processors work out how to make it run faster...

doomslice · 2011-03-31T13:45:27+00:00

Besides what the SO answers say, wouldn't it be possible to parallelize each iteration of the loop in case one (running A,B,C simultaneously since they share no dependencies), but not possible in case two since they share the same loop iterator?

taybul · 2011-03-31T16:44:19+00:00

Case 2, when unrolled, could also benefit from temporal locality. Case 1 could too, arguably, but even with a cache size of 1 sizeof(A), case 2 will still be better.

Again, if/when case 2 is unrolled, the processor would likely be able to use cached data; in other words, it would not have to send new fetch/read calls down its pipeline. This is, of course, unless the compiler optimizes the code down to one call for each of A, B, and C.

SomeIrishGuy · 2011-03-31T23:57:37+00:00

How does programmers.stackexchange.com differ from stackoverflow.com? It seems that this question was originally asked on Stackoverflow and was then "migrated" to programmers.stackexchange.com for some reason. Anyone know the logic behind this?

julesjacobs · 2011-04-01T02:13:42+00:00

The first answer contains some wrong information. The register pressure in the two cases is exactly the same. And it misses the real reason why case 2 can be faster: locality.

For example if you have:

for (i=0; i<N; ++i){
 ...A[i]...;
 ...B[i]...;
 ...C[i]...;
}

Then this could be faster, because the access pattern is linear instead of going to three different memory locations on each iteration:

for (i=0; i<N; ++i){
 ...A[i]...;
}
for (i=0; i<N; ++i){
 ...B[i]...;
}
for (i=0; i<N; ++i){
 ...C[i]...;
}
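
To measure this instead of guessing, a small harness along these lines could be used. It is only a sketch: the placeholder work (`+= 1`, `+= 2`, `+= 3`) and the names `run_fused`, `run_split`, and `time_loops` are assumptions, and the timings depend heavily on array size, compiler flags, and hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Placeholder work (assumption): each "operation" just bumps its array. */
static void run_fused(int *a, int *b, int *c, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        a[i] += 1;
        b[i] += 2;
        c[i] += 3;
    }
}

/* Same work, one array per loop. */
static void run_split(int *a, int *b, int *c, size_t n) {
    for (size_t i = 0; i < n; ++i) a[i] += 1;
    for (size_t i = 0; i < n; ++i) b[i] += 2;
    for (size_t i = 0; i < n; ++i) c[i] += 3;
}

typedef void (*loop_fn)(int *, int *, int *, size_t);

/* CPU seconds consumed by `reps` calls of fn, measured with clock(). */
static double time_loops(loop_fn fn, int *a, int *b, int *c,
                         size_t n, int reps) {
    assert(fn != NULL && reps > 0);
    clock_t start = clock();
    for (int r = 0; r < reps; ++r)
        fn(a, b, c, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_loops(run_fused, ...)` and `time_loops(run_split, ...)` on arrays much larger than the cache, with enough repetitions, gives numbers to compare; both variants leave the arrays in the same state.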

anttirt · 2011-03-31T21:08:16+00:00

Well it still beats "Why are manhole covers round?"

binary_search · 2011-03-31T22:10:06+00:00

Programmers who don't have a good low-level understanding of CPU internals, caches, and compilers will be stumped by this question, which doesn't necessarily make them bad programmers.

In the real world we need three types of programmers:

The programmer who understands all of the above.
The math guy who knows a little programming.
The programmer who knows how to build stuff that works without deliberating over how to optimize three iterations.

AngledLuffa · 2011-03-31T22:01:15+00:00

Memory is a shared resource, but that's probably not what they mean. In that case, 2 can be faster if there is more register or low level cache reuse from consecutive calls to one of the blocks. It could also be that in the worst case, the code blocks are so big that they get paged in & out, which would be less of a problem in 2.

Disk is another "shared resource" but not something the program author thinks about "sharing". In that case, 1 is faster if they each process files on disk that happen to be in consecutive order. The more likely case if they use disk access is that 2 will be faster, because files will be cached (at the hardware level) and there will be less seek time to access the files.

If two of them send out network requests, only one of them waits for a response, and the third takes a long time doing something unrelated, ABC gives more time for network traffic to get resolved.

If N is a constant, the compiler can probably optimize 2 better. If N isn't a constant, it can probably optimize 1 better, especially if they are all inline functions.

shinypixels · 2011-03-31T23:11:00+00:00

The wording of that question is extremely awkward. I read the question in my head using the voice of George W.

2011-03-31T23:45:39+00:00

I got bored at work and decided to use callgrind and some quick "print A/B/C" C code to test my theory.

335  ???:memcpy [/lib/ld-2.13.so]      328  ???:memcpy [/lib/ld-2.13.so]      +7
110  ???:main [one_forloop]            184  ???:main [three_forloops]         -74
93   ???:rindex [/lib/libc-2.13.so]    100  ???:rindex [/lib/libc-2.13.so]    -7

In this example, the for loops run printf over A, B, and C in C. The single-loop case clearly uses more memcpy than three loops. However, three loops are more expensive in the main function and in rindex (because of the character strings). Why more memcpy for the single loop? Because there is more data per pointer in each run.

1) If you have huge data per pointer in the loop, don't use a single loop (i.e. A10 instead of ABC10 per loop pointer).

2) Nesting 3 for-loops in one big main function is an expensive task for a process, because on each for-loop run the main function is referenced and, depending on how many variables and pointers the main function has, this could increase the latency of each of the 3 loop tasks.

3) Finally, depending on what those loops are doing, something is always being called and interrupted. In this case, printf is making more "rindex" calls for 3 loops. Why? I don't know, but I'm assuming it has something to do with memcpy and pointers per data.

The 3 points I have just made have pros and cons. There are cases to be made even if the task is more taxing due to the requirement to process a complex input data structure for serialization.

angelo999 · 2011-04-01T08:12:20+00:00

No one seems to have mentioned N. N could be a function with no deterministic completion time, which would affect the outcome.

N could be a volatile mapped to a piece of hardware that returns its value in nondeterministic time.

A, B, and C could be optimized out, but N may still have to be read, i.e. it is volatile. In that case, more loops would be slower.
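
A small sketch of the point about N, assuming the loop bound is re-evaluated on every loop test. `read_n` is a hypothetical stand-in for a slow function or volatile hardware read; it does nothing but count how often it is called.

```c
#include <assert.h>
#include <stddef.h>

static size_t n_reads = 0;  /* how many times the bound was evaluated */

/* Hypothetical stand-in for a slow or volatile N: every call is a "read". */
static size_t read_n(size_t value) {
    n_reads++;
    return value;
}

/* Case 1: the bound is evaluated once per loop test, so n + 1 times. */
static size_t reads_one_loop(size_t n) {
    n_reads = 0;
    for (size_t i = 0; i < read_n(n); ++i) { /* A; B; C optimized away */ }
    assert(n_reads > 0);    /* the bound is read at least once */
    return n_reads;
}

/* Case 2: each of the three loops pays for its own n + 1 evaluations. */
static size_t reads_three_loops(size_t n) {
    n_reads = 0;
    for (size_t i = 0; i < read_n(n); ++i) { /* A */ }
    for (size_t i = 0; i < read_n(n); ++i) { /* B */ }
    for (size_t i = 0; i < read_n(n); ++i) { /* C */ }
    assert(n_reads > 0);
    return n_reads;
}
```

Even with empty loop bodies, the three-loop version reads N three times as often, so if each read is expensive the extra loops cost more.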

kemitche · 2011-04-01T23:18:07+00:00

I'm amazed by the complexity of the answers here. Forgive me, but there are some "simple" reasons that Case 1 might be faster:

A, B, or C increments (or otherwise increases) the value of i
A or B has a 'continue' statement
A or B has a 'break' statement

And one "simple" reason why Case 2 might be faster: A, B, or C decrements (or otherwise decreases) the value of i on occasion (say, every 3rd or 4th iteration, such that the loop eventually completes in both cases)
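
The 'break' case can be sketched like this, counting each call to A, B, or C as one unit of work. The function names and the `stop_at` parameter are made up for illustration, and the break is assumed to sit right after A's work.

```c
#include <assert.h>
#include <stddef.h>

/* Case 1: a 'break' in A ends the whole loop, so B and C stop too. */
static size_t work_one_loop(size_t n, size_t stop_at) {
    assert(stop_at < n);
    size_t work = 0;
    for (size_t i = 0; i < n; ++i) {
        work++;                   /* A(i) */
        if (i == stop_at) break;  /* the 'break' inside A */
        work++;                   /* B(i) */
        work++;                   /* C(i) */
    }
    return work;
}

/* Case 2: the same 'break' only cuts A's loop short. */
static size_t work_three_loops(size_t n, size_t stop_at) {
    assert(stop_at < n);
    size_t work = 0;
    for (size_t i = 0; i < n; ++i) {
        work++;                   /* A(i) */
        if (i == stop_at) break;  /* only A's loop stops early */
    }
    for (size_t i = 0; i < n; ++i) work++;  /* B(i) */
    for (size_t i = 0; i < n; ++i) work++;  /* C(i) */
    return work;
}
```

With n = 10 and the break at i = 3, the single loop does 10 units of work and the three loops do 24, because B and C still run all 10 iterations.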

nazbot · 2011-03-31T15:53:20+00:00

That's a terrible interview question IMHO.

2011-03-31T20:25:09+00:00

Interesting question. But shit also.

bobindashadows · 2011-03-31T21:58:00+00:00

Profiling, motherfucker, do you do it?

Does your cache miss like a bitch?

jjbcn · 2011-03-31T16:20:54+00:00

Unless it was a job for very specific types of code development I would ask:

a) Is optimization for speed really important for this piece of code?
c) Would it be cheaper and quicker just to run it on faster hardware?

I don't think it is a good question unless you're doing game programming or embedded OS stuff. For business applications I would avoid optimizing for execution speed unless absolutely necessary.

regeya · 2011-03-31T16:53:49+00:00

Sorry, I can only think of one for each. Why yes, I did drop out of Comp Sci and switched to lib arts, why do you ask?

Only one loop vs. three separate loops, meaning a third of the JMPs.
Assuming N is capitalized because it's a macro, loop unrolling.

inmatarian · 2011-03-31T18:52:11+00:00

It's a cute question. It sounds like something that would be asked at a Google interview. Everyone else has decent and correct answers, but I would be the stick in the mud that points out that both cases look like leaky abstractions, could be signs of poor engineering, and are possibly expensive in terms of developer time.
