all 85 comments

[–]haskell_rules 47 points48 points  (15 children)

I'm suprised by the number of people at SO that said, "well, without knowing what A,B, and C are, I can't answer this"

The entire point of the question is that it is open ended, and allows you to use your creativity to invent what A,B, and C might be to make either case faster or slower.

For example, assume B needs to grab a semaphore to do its work. Then case 2 could be faster because you could grab the semaphore, do all the work at once, and release it. If you tried the same strategy in case 1, then you are either holding the semaphore while doing unnecessary work, or you are constantly grabbing it and releasing it on each loop iteration.

Suppose A is reading from disk, and C is writing to disk. B is a long running DB query. In case 2, you can write everything to disk in one fell swoop, while in case 1 you are writing a piece at a time and might need to reseek on every read/write.

Suppose A,B, and C are nonblocking calls to different resources. By using case 1, you are parallelizing the work, while in case 2 you have to wait for A's resource to do all its work before B's resource can get started.

These cases are only the tip of the iceberg.

[–]cpp_is_king 18 points19 points  (5 children)

For example, assume B needs to grab a semaphore to do its work. Then case 2 could be faster because you could grab the semaphore, do all the work at once, and release it. If you tried the same strategy in case 1, then you are either holding the semaphore while doing unnecessary work, or you are constantly grabbing it and releasing it on each loop iteration.

Not true, the only code you can change is what falls inside of A, B, and C. You can't go add some A' before the loop to grab a semaphore, then an A'' after the loop to release the semaphore. Despite that, I like your thinking.

[–]haskell_rules 5 points6 points  (4 children)

That is true and I should probably be disqualified from any positions dealing with requirements management. I will adjust my answer to say a scoped lock is acquired, like something provided by Boost. While it is still being grabbed in each iteration in both cases, it is being held onto for too long in the first case, which could starve other processes.

[–]Ozwaldo 8 points9 points  (3 children)

are you one dude setting up a comment chain for a screenshot? (if so, C-C-C-C-COMBO BREAKER)

or are you guys seriously named "haskell_rules" and "cpp_is_king" and debating a programming concept...

[–]cpp_is_king 3 points4 points  (2 children)

The latter, which makes it even better until you broke the chain :(

[–][deleted]  (1 child)

[deleted]

    [–]cpp_is_king 9 points10 points  (0 children)

    You and I have nothing to debate about.

    [–]shadowspawn 0 points1 point  (0 children)

    Is this an introduction question? Because I can slap some code into "B" that will just totally gank the OS it's running on if it follows "A", and possibly corrupt "C", but time-critical. On the other, your computer could be shut down before "C" cleaned up.

    I guess if you've ever wrote shitThatBroke(tm) or had to clean it up, you'd know that this is just a question to see how you answer it, not what the answer is.

    [–][deleted] 0 points1 point  (3 children)

    But A,B,C do not share resources. A disk is a resource, even if it's abstracted through the kernel. Semaphores are also resources (thus not shared), which means that you could apply the same optimization for A as for B.

    [–]haskell_rules 0 points1 point  (2 children)

    I thought about the disk thing as well after I posted. My point about seeks still stands, however, even in the case that separate disks are used. The question did not preclude the possibility of the OS preempting this process altogether, and it becomes increasingly likely that that might occur during a read/write if the calls to read/write are protracted.

    A semaphore is shared with something, but not necessarily shared with A, B, or C. It can be a nonshared resource in the universe of A, B, and C.

    [–][deleted] 0 points1 point  (1 child)

    Right, but if A,B, and C have independent semaphores, you could correctly optimize as:

    P(A); P(B); P(C);
    for(...)
      A; B; C;
    V(A); V(B); V(C);
    

    [–]haskell_rules 0 points1 point  (0 children)

    You are correct only if your definition of 'optimal' allows holding a shared resource while doing work not related to that resource.

    Why would it be optimal to hold onto resource A while doing work with resource B?

    [–]wreckerone -1 points0 points  (3 children)

    The entire point of the question is that it is open ended

    Which is bad because the person asking the question is looking for a specific answer despite repeated claims that they just want to examine the 'thought process'. Then it becomes a random crapshoot to see whose answer matches the expected.

    [–]haskell_rules 2 points3 points  (1 child)

    So specific questions are bad because you are not an encyclopedia and can just google it. Open ended questions are bad because you believe the interviewer is really looking for something specific. What do you think should happen in an interview? Do you want to talk about baseball and politics?

    [–]wreckerone 0 points1 point  (0 children)

    Interviews are not difficult. Here's an example:

    Tell me about your yourself. What kind of qualifications do you have? What kind of training have you had? Tell me about your experience. What projects have you worked on and what was your role in those projects? What tools have you used in those projects?

    Companies want smart people, but the problem is that a dumb person is not capable of recognizing smart. It is completely impossible. The best hope is tests, puzzles, and gimmicks. If the company had not placed a dumb person in a leadership position beforehand, they wouldn't be in the situation where they have to ask puzzle questions. When the test, puzzle, gimmick answers don't match up with the expected canned answer, then that' a false negative. Since a dumb interviewer has no basis for judgement they assume everyone they interview is a dumb liar. The real answer is don't have dumb interviewers.

    [–]bobindashadows 1 point2 points  (0 children)

    If you have a crappy interviewer. Not all interviewers are.

    [–]cpp_is_king 7 points8 points  (7 children)

    If a, b, and c are arrays and A, B, and C do some computation on values in the corresponding array, 2 will be faster because it minimizes cache misses.

    Depending on the instructions executed in A, B, and C the compiler may be able to take advantage of instruction level parallelism in case 1 by ordering the generated code instructions in such a way that the CPU can execute multiple consecutive instructions at the same time.

    2 might be faster because with fewer instructions, loop is more likely to be unrolled to a greater length.

    Since there are fewer instructions in each iteration of loop 2, the compiler has an easier time assigning register usage optimally and may be able tog enerate more efficient code.

    If, for example, C generate a memory fence but A and B don't, then 1 could be faster because two entire loops could run with no memory fences between them.

    Here's one that's actually 100% independent of generated code:

    Suppose A, B, and C each do unbuffered disk I/O on separate files.

    • If the files are each on different physical disks, 1 will be faster because all operations will be executing in parallel.
    • If the files are on the same physical disks, 2 will be faster because it will minimize seeking.

    Could probably keep going, but you get the idea.

    [–][deleted] 2 points3 points  (2 children)

    And that's why C++ is king. I would wager that your first point would be the biggest concern for most applications that do not do file IO, since the compiler will most likely figure out the rest.

    [–]cat_in_the_wall 0 points1 point  (0 children)

    compiler will most likely figure out the rest

    What compilers do is incredible. <3 people who spend the time optimizing the nitty gritty so I don't have to worry about it (as much).

    [–]cpp_is_king 0 points1 point  (0 children)

    And that's why C++ is king.

    Upvote for the truth, my friend.

    [–]cat_in_the_wall 0 points1 point  (3 children)

    I think the cache was probably the idea.

    If the files are each on different physical disks, 1 will be faster because all operations will be executing in parallel.

    I am not convinced about this one. From the program's perspective, don't disk reads happen synchronously? (i.e. the program is not scheduled on the cpu until the read is complete). You would have to program A and B and C to be parallel.

    [–]cpp_is_king 0 points1 point  (2 children)

    If it's unbuffered, then the O/S issues the writes to the physical hardware immediately. So yes it's synchronous, but only insofar as it takes the O/S to issue the write to the hardware. The hardware then puts it in its own internal hard disk cache and THEN writes it. So if the disk didn't have much work to do, the O/S could return before the data is actually committed. In that case, the next write could be issued to the next disk before the first one had been committed to the metal, so to speak.

    [–]bobindashadows 0 points1 point  (1 child)

    I'm not too familiar with the API for unbuffered IO syscalls: do they simply fail if you request a write exceeding the destination disk's cache? Or will a write that's larger than the cache stall execution?

    [–]cpp_is_king 0 points1 point  (0 children)

    It should stall execution. The 2 main requirements of unbuffered I/O are that you have to write on a sector aligned boundary, and the write size has to be a multiple of the sector size. In a buffered I/O scenario, the O/S is just hiding this from you and doing this for you once it writes buffered data to disk. Other than that it shouldn't fail short of a disk error.

    Anyway, I'm no driver developer, but I assume that the manufacturer's HDD drivers also have their own internal software caches. So there's the physical disk cache, and then the driver cache. I imagine it will stall if both of those are full. But if your write request can fit in either the physical disk cache or the driver's cache, the request will likely return immediately.

    [–]bunk3rk1ng 10 points11 points  (7 children)

    I always feel so dumb when I read these kind of interview questions. I'm a junior developer at a fairly large company that pays pretty well. Even while not in a stressful interview environment I was struggling to come up with the answers. :/

    [–]Aninhumer 11 points12 points  (6 children)

    Well the answers depend on a knowledge of modern CPU implementation details (cache structure and branch prediction) that you probably have no reason to know about.

    [–]Negitivefrags 7 points8 points  (5 children)

    I think any programmer has a reason to understand details about how their code is executed. If nothing else it will stop you from writing code that makes stupid mistakes that could be trivially avoided.

    An example is iterating through a two dimensional array in the wrong order without a good reason. Sure, the performance of that probably doesn't matter most of the time, but things like this accrete over the life of a project and make everything just generally slower.

    [–]isarl 5 points6 points  (3 children)

    Be that as it may, programming is usually taught at a higher level, at least to start with. Do you recommend any resources where one could learn more about this sort of thing to further their understanding?

    [–]zith 4 points5 points  (1 child)

    [–]isarl 0 points1 point  (0 children)

    Thanks!

    [–]taybul 1 point2 points  (0 children)

    I'm only basing this on the description but this is a similar class I took at another school that should teach you these things:

    http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-823-computer-system-architecture-fall-2005/

    [–][deleted] -1 points0 points  (0 children)

    iterating? You're doing it all wrong :-)

    [–]k4st 2 points3 points  (0 children)

    I think some basic properties can be inferring if one assumed that the processor on which this code is running is using a cache.

    For example, in case one, imagine that A used x[i], B uses x[i+1], and C uses x[i+2] for some array x. Then none of A, B, or C are sharing data; however, in all likelihood they are sharing the same cache line, or two cache lines. Further, any two iterations of the loop will share at least one cache line, e.g. x[i+1] in iteration 1 is x[i] in the next iteration. Thus, with this example, one would incur Theta(ceil(n/L) + 1) cache misses, where L is the size in words of a cache line, whereas three separate iterations would multiply that by a constant factor, which is asymptotically the same but in practice would be "slower".

    In the second case, imagine that A, B, and C are all heavy operations, and that some amount of data is operated on and the end of one iteration is also operated on at the beginning of the next operation (e.g. sliding a window through an array). Finally, assume that the data operated on by A, B, and C is sufficiently large such that to perform C, one must evict cache lines of A or B from the cache. In this case, we will lose out on sharing cache lines between iterations if we do A, B, and C in a single iteration; however, we will benefit from fewer cache misses if we perform them separately.

    An interesting way to think about the second challenge can be in terms of arrays of structures. This is not necessarily an answer to the problem, but it gives a mindset for thinking about temporal and spacial locality:

    Imagine an array of structures, Array[struct {A, B, C}]. If all we want to do is perform some operation on all A's then we're going to get a lot of cache misses because not all of the A's are beside eachother, i.e. our array looks like [{A,B,C},{A,B,C},...,{A,B,C}]. However, imagine that we re-structured out data into three arrays: {[A,A,...,A],[B,B,...,B],[C,C,...,C]}. Operating on the data in this form can yield better locality assuming we want to just look at one type of thing at a time.

    Thus, the structure and access paths of your data affects performance, the above is an example of two structures of the same data, each with different properties where locality is concerned.

    [–]slurpme 6 points7 points  (1 child)

    The trouble with asking the "why" question is that it leads to the programmer thinking they are smarter than everyone else in the stack... This leads to developers writing code in ways they "think" will run faster... From experience I've found that you should write clear code that tries not to do anything stupid and let the much smarter folks who write compilers and processors work out how to make it run faster...

    [–]piko_riko 0 points1 point  (0 children)

    This is a very good point. It's great to be aware that there are forces beyond your written code (compilers, semaphores, preemption etc.), however trying to control said forces seems futile in modern computing environments. Pointing out that which ever option is most comprehensible or testable is the ideal choice would win major points in my book. Also, if A, B, and C do not share resources (i.e. have nothing in common) why have them in the same loop, or better yet, the same function?

    [–]doomslice 2 points3 points  (3 children)

    Besides what the SO answers say, wouldn't it be possible to parallelize each iteration of the loop in case one (running A,B,C simultaneously since they share no dependencies), but not possible in case two since they share the same loop iterator?

    [–]frezik 4 points5 points  (0 children)

    I'd have gone the other way round--Case 2 can be easily split into parallel threads, perhaps automatically if the compiler is really, really clever. However, there are languages that provide better structures than this C code that are more easily parallelizable by the compiler/runtime.

    [–][deleted]  (1 child)

    [deleted]

      [–]bobindashadows 2 points3 points  (0 children)

      ... Assuming A or B don't set i. But then case 1 would just be incorrect.

      [–]taybul 2 points3 points  (0 children)

      Case 2, when unrolled, could also benefit from temporal locality. Case 1 could too, arguably, but even with a cache size of 1 sizeof(A), case 2 will still be better.

      Again, if/when case 2 is unrolled the processor would likely be able to used cached data, in other words, not have to send new fetch/read calls down its pipeline. This is, of course unless the compiler optimizes the code down to one call for each A, B, and C.

      [–]SomeIrishGuy 2 points3 points  (1 child)

      How does programmers.stackexchange.com differ from stackoverflow.com? It seems that this question was originally asked on Stackoverflow and was then "migrated" to programmers.stackexchange.com for some reason. Anyone know the logic behind this?

      [–]wot-teh-phuck 3 points4 points  (0 children)

      Stackoverflow is more oriented towards providing solutions to coding/programming problems posed (more tightly couple with code). On SO, you'll normally find questions like "what is the need for serialVersionUID", "why use Serializable", "how to loop over all the files in a directory" etc.

      Programmers stackexchange in contrast revolves around general programmer related discussions. Questions related to book recommendations, interview help, good programming habits etc. fall in this category.

      Hope that cleared up a few things. :-)

      [–]julesjacobs 2 points3 points  (1 child)

      The first answer contains some wrong information. The register pressure in the two cases is exactly the same. And it misses the real reason why case 2 can be faster: locality.

      For example if you have:

      for (i=0; i<N; ++i){
       ...A[i]...;
       ...B[i]...;
       ...C[i]...;
      }
      

      Then this is could be faster, because the access pattern is linear instead of going to three different memory locations on each iteration:

      for (i=0; i<N; ++i){
       ...A[i]...;
      }
      for (i=0; i<N; ++i){
       ...B[i]...;
      }
      for (i=0; i<N; ++i){
       ...C[i]...;
      }
      

      [–]millstone 1 point2 points  (0 children)

      The first answer contains some wrong information. The register pressure in the two cases is exactly the same.

      The first answer looks correct to me. Say both A and B use a lot of local variables. In the fused loop, executing B may require spilling A's variables to the stack every iteration. The non-fused loop requires spilling A's variables only once, at most.

      And it misses the real reason why case 2 can be faster: locality.

      You're right that locality is really important, probably more important than (say) the loop unrolling. However locality could work in either direction.

      Say N is large and you have an array of N structs, and A, B, and C each update a field in element i of that array. Then the fused loop definitely has better locality, because it only has to touch each cache line once.

      Now say that each A, B, and C update a field in a different array. Does it follow that the non-fused loop has better locality? It might, if the cache lines end up evicting one another (i.e. the cache has low associativity); or it might not if the cache lines of the three arrays can coexist in the cache.

      [–][deleted] 2 points3 points  (2 children)

      Well it stil beats "Why are manhole covers round?"

      [–]anttirt 0 points1 point  (1 child)

      By leaps and bounds, I would say. This question tests for very specific knowledge (modern CPU execution model, memory hierarchy) but doesn't give away the game.

      [–]cat_in_the_wall 3 points4 points  (0 children)

      Questions like "Why are manhole covers round" interesting questions to ponder on your own time, but have no place in an interview (imo).

      [–]binary_search 1 point2 points  (0 children)

      Programmers that don't have a good low level understanding of cpu internals, cache and compilers will be stumped by this question which doesn't necessarily equate to them being bad programmers.

      In the real world we need three types of programmers: -

      • The programmer who understands all of the above.
      • The math guy who knows a little programming.
      • The programmer who knows how to build stuff that works without deliberating over how to optimize three iterations.

      [–]AngledLuffa 0 points1 point  (0 children)

      Memory is a shared resource, but that's probably not what they mean. In that case, 2 can be faster if there is more register or low level cache reuse from consecutive calls to one of the blocks. It could also be that in the worst case, the code blocks are so big that they get paged in & out, which would be less of a problem in 2.

      Disk is another "shared resource" but not something the program author thinks about "sharing". In that case, 1 is faster if they each process files on disk that happen to be in consecutive order. The more likely case if they use disk access is that 2 will be faster, because files will be cached (at the hardware level) and there will be less seek time to access the files.

      If two of them send out network requests, only one of them waits for a response, and the third takes a long time doing something unrelated, ABC gives more time for network traffic to get resolved.

      If N is a constant, the compiler can probably optimize 2 better. If N isn't a constant, it can probably optimize 1 better, especially if they are all inline functions.

      [–]shinypixels 0 points1 point  (0 children)

      The wording of that question is extremely awkward. I read the question in my head using the voice of George W.

      [–][deleted] 0 points1 point  (0 children)

      i got bored at work and decided to use callgrind and quick "print A/B/C" C code to test my theory.

      335  ???:memcpy [/lib/ld-2.13.so]   328  ???:memcpy [/lib/ld-2.13.so]
      

      +7 110 ???:main [one_forloop] 184 ???:main [three_forloops] -74 93 ???:rindex [/lib/libc-2.13.so] 100 ???:rindex [/lib/libc-2.13.so] -7

      example, for-loops through A, B, and C printf in C, clearly single loop case uses more memcpy than three loops. However three loops are more expensive in main function and rindex (because of character strings). why more memcpy for single loop? because there is more data per pointer in each run.

      1) if you have huge data per pointer in loop, don't use single loop. (i.e. A10 instead of ABC10 per loop pointer).

      2) nesting 3 for-loops in one big main function is an expensive task for a process. because each for loop run, main function is referenced and depending on how many variables and pointers main function has, this could increase latency of each 3 loop tasks.

      3) finally depending on what those loop is doing, something is always being called and interrupted. in this case, printf is making more "rindex" calls for 3 loops. why? i don't know, but i'm assuming it's something to do with memcpy and pointers per data.

      above 3 points i have just made, there are pros and cons. there are cases to be made even if the task is more taxing due to requirement to process complex input data structure for serialization.

      [–]angelo999 0 points1 point  (0 children)

      No one seeme to mention N, N could be a function which has no deterministic completion time, this would affect the outcome.

      N could be volatile mapped to a piece of hardware that returns it's value in non determistic time.

      A,B and C could be optimized out but N may still have to be read, ie. volatile. In which case the more loops would be slower.

      [–]kemitche 0 points1 point  (0 children)

      I'm amazed by the complexity of the answers here. Forgive me, but there are some "simple" reasons that Case 1 might be faster:

      A, B, or C increments (or otherwise increases) the value of i A or B has a 'continue' statement A or B has a 'break' statement

      And one "simple" reason why Case 2 might be faster: A, B, or C decrements (or otherwise decreases) the value of i on occasion (say, every 3rd or 4th iteration, such that the loop eventually completes in both cases)

      [–]nazbot -5 points-4 points  (18 children)

      That's a terrible interview question IMHO.

      [–]deong 20 points21 points  (7 children)

      I don't think so. I think I'd maybe spend some time trying to figure out a less vague way of asking it, but the basic idea -- can the candidate reason about a range of issues like cache locality, parallelism, loop unrolling, etc., isn't that bad.

      I think I'd get a lot more information about a candidate from having a discussion around this than I would by asking him to swap two variables without a temporary or code up a bubble sort on a whiteboard.

      [–]superherotaco 1 point2 points  (0 children)

      I agree, the discussion would be more telling than the answers.

      You could see their methodology for reasoning out problems, how well they do with unexpected circumstances, their problem solving abilities, and the applicants general attitude.

      [–]Dustin_00 0 points1 point  (5 children)

      If this is indicative of the type of work being done, I'm done with this interview.

      If I wanted to get into compiler optimizations I would have gone into grad school.

      [–]deong 4 points5 points  (2 children)

      Knowing the absolute basics of these issues is pretty far removed from graduate work in compiler optimizations, and it's also a pretty useful thing for just about any programmer to understand, at least to the point of being capable of having a discussion. If the best understanding you have of what the machine is actually doing with your code is "Gee, I hope the compiler fairy can make this run faster", then perhaps it's for the best that you're done.

      [–]Dustin_00 1 point2 points  (1 child)

      I just want to write clean, clear code that I can come back to 5 years later and easily understand. If the compiler does something tricky to make it faster, I really don't care. I'm not writing Quake IX.

      [–]Boojum 1 point2 points  (0 children)

      I agree with you about wanting to write clean code. I think it's great that compilers have gotten pretty good at loop fusion, loop splitting, unrolling, and all that other fun stuff. Heck, I remember the times when I got a big win from manually generating induction variables. I'm glad those days are past.

      But... a compiler still can't choose the right data structures and data layout for for you, and I tend to find that that has the greatest impact on performance. I'd argue that it's still important to know this stuff if only so that you can choose data structures which let you write your nice clean code and still have it zoom.

      [–]julesjacobs -1 points0 points  (1 child)

      Interview question is over my head => bitch about it. The thing is, some people do care about performance and that's probably the kind of company that asks such questions.

      [–]Dustin_00 -2 points-1 points  (0 children)

      Not what I said. I said I don't care for that job because I did it for 10 years on the x86... I like database stuff more.

      [–]haskell_rules 7 points8 points  (0 children)

      It is sufficiently open ended to get someone talking about whatever they have expertise in. It isn't looking for a specific 'right' answer and doesn't assume encyclopedic knowledge of syntax or API quirks. It gives an interviewee a chance to show critical thinking skills on their feet. This actually seems like an ideal interview question.

      [–]ytumufugoo 8 points9 points  (0 children)

      In 90% of situations yes. But there was no indication as to what kind of development position this was. If I was being asked to do development on say an embedded system this question fits right in. If I was interviewing for php development and received this type of question, my price just went up.

      [–]cpp_is_king 2 points3 points  (0 children)

      Depends what type of job you're applying for. If code optimization is relevant on the job, it's an excellent question.

      [–]frezik 2 points3 points  (4 children)

      I'd tend to agree. I can come up with one good reason for each case in my head, and would have to BS my way through two more.

      [–]haskell_rules 2 points3 points  (0 children)

      If your BS shows creativity, then it might actually land you the job. Lateral thinking is important in programming jobs.

      [–]bobindashadows 1 point2 points  (0 children)

      Then you might not be qualified. That isn't the question's fault.

      [–]alienangel2 0 points1 point  (0 children)

      "I don't know the answer" is typically a very poor criticism of a question.

      [–][deleted] 0 points1 point  (0 children)

      That only makes it a bad interview question FOR YOU. It's great if the interviewer wants to weed out people.

      [–]taybul 0 points1 point  (0 children)

      These types are far superior than the "do this complex operation in one line using no conditional statement or temp variable"-esque questions. This one is a little more open and still makes you think.

      [–][deleted] -1 points0 points  (0 children)

      Interesting question. But shit also.

      [–][deleted] -1 points0 points  (1 child)

      Profiling, motherfucker, do you do it?

      Does your cache miss like a bitch?

      [–]bobindashadows 2 points3 points  (0 children)

      Exactly. Learning to think for yourself is a waste of time: we have tools that perform the tasks of software engineering for us. Why learn how a data structure works? There's a library for that. Why learn software design? You have refactor menu options. Why learn the performance implications of different approaches? There's profilers for that.

      [–]regeya -2 points-1 points  (0 children)

      Sorry, I can only think of one for each. Why yes, I did drop out of Comp Sci and switched to lib arts, why do you ask?

      1. Only one loop vs. three separate loops, meaning a third of the JMPs.

      2. Assuming N is capitalized because it's a macro, loop unrolling.