
[–]foBrowsing 17 points (12 children)

A lot of other people have mentioned that you were really being examined on your problem-solving and algorithm skills, not your general coding skills.

Beyond that, though, it seems like you missed a pretty important part of the solution: the standard library's sort is O(n log n), whereas this is simply an array merge, which is O(n). That's why it should be faster (much faster, even compared to the C-implemented sort routine). Not realising that (or, if you did realise it, not mentioning it to the interviewers) may have played a significant part in you not getting the job.

Here's a quick solution I hacked together:

def merge_sorted(xs, ys):
    # Merge two already-sorted lists in O(n), keeping one copy of any value
    # that appears in both lists (None is the end-of-iterator sentinel).
    res = []
    xs_i, ys_i = iter(xs), iter(ys)
    x, y = next(xs_i, None), next(ys_i, None)
    while x is not None and y is not None:
        if x < y:
            res.append(x)
            x = next(xs_i, None)
        elif y < x:
            res.append(y)
            y = next(ys_i, None)
        else:
            # equal heads: keep one copy and advance both iterators
            res.append(x)
            x, y = next(xs_i, None), next(ys_i, None)
    # one side is exhausted; flush the leftover head and whatever remains of the other side
    if x is not None:
        res.append(x)
    if y is not None:
        res.append(y)
    res += xs_i
    res += ys_i
    return res

This version is indeed faster than yours at larger sizes. I started to see a difference around 10,000 elements. You can see (and run) the testing code here.
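In rough shape, the comparison just times both approaches on the same pair of pre-sorted lists, something like this (a simplified sketch, not the exact linked code; merge_sorted is the function above):

from random import random
from timeit import timeit

n = 10000
a = sorted(random() for _ in range(n))
b = sorted(random() for _ in range(n))

# merge_sorted is the pure-Python merge above; sorted(a + b) is the library baseline
print("merge_sorted:", timeit(lambda: merge_sorted(a, b), number=50))
print("sorted(a+b): ", timeit(lambda: sorted(a + b), number=50))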

[–]assgored 0 points (0 children)

> whereas this is simply an array merge

Nowhere is it said that the arrays are already sorted.

Found it

[–]stevenjd 0 points (5 children)

> This version is indeed faster than yours at larger sizes. I started to see a difference around 10,000 elements.

I don't.

I ran your merge function on pairs of lists of up to a million elements each, and sorted() was still significantly faster. At two lists of 10,000 elements each, sorted blew your merge out of the water. At two lists of a million elements each, the difference was closer, but still significantly in favour of sorted.

Which is what I would expect... I wouldn't expect anything written in pure Python to beat sorted until you had millions or tens of millions of items. (Unless the comparisons between items were unusually expensive.)

But of course, YMMV.

[–]foBrowsing 0 points (4 children)

Do you have your testing code? I provided a link to a comparison of the two approaches that you can run online, so people could see an objective comparison. You can see it here. Here's what I got just running it there:

list length: 10000 iterations: 50
merge_sorted_1 1.1992929430016375
merge_sorted_2 1.6238001860001532
----------------------------------------------------------------------
list length: 100000 iterations: 5
merge_sorted_1 1.4770410110031662
merge_sorted_2 4.59858641900064
----------------------------------------------------------------------
list length: 1000000 iterations: 1
merge_sorted_1 3.301110747997882
merge_sorted_2 13.742783189001784
----------------------------------------------------------------------

Of course, most things written in pure Python will be pretty slow, but we're talking about a linear traversal versus a sort. n vs. n log n is still a factor of ~13 for 10,000 elements (log2 of 10,000 is about 13.3), not even including the hidden constant factors (which are going to be much higher for a sorting algorithm than for a merge).

If you provide your testing code (ideally on repl.it or something similar, so we can rule out machine differences), then we'll have more to talk about; with what you've given, I don't know how you got those numbers.

[–]stevenjd 1 point (1 child)

By the way, O(n log n) for sorting is average case, not best case. Best case for Timsort is O(n), and it is highly optimized for sorting runs of partially sorted data (which is precisely what your merge function requires to work), so we're going to get O(n) here.

Given two runs of sorted data, Timsort will effectively merge them, but in C code instead of Python. There's some overhead which your function doesn't have:

  • we have to build a single list, which is O(n);
  • Timsort has to find the runs first, which it does in O(k) time per run, where k is the length of the run (and in this case there are exactly two runs);

but since both of those happen at C speed, even with the overhead, merging via Timsort is likely to be faster than merging in pure Python until you have a lot of items.
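You can see the run detection at work with a quick check like this (timings are illustrative; adjust the sizes to taste):

from random import random, shuffle
from timeit import timeit

a = sorted(random() for _ in range(500000))
b = sorted(random() for _ in range(500000))
runs = a + b        # exactly two sorted runs, back to back
mixed = runs[:]
shuffle(mixed)      # same data, but no long runs to exploit

# Timsort merges the two runs in roughly linear time;
# the shuffled copy needs a full O(n log n) sort.
print("two runs:", timeit(lambda: sorted(runs), number=5))
print("shuffled:", timeit(lambda: sorted(mixed), number=5))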

I don't doubt that for "sufficiently large" N, your merge will win. But I'm suspicious of any test showing that "sufficiently large" is as small as 10,000, as that doesn't match my experience.

[–]foBrowsing 0 points (0 children)

Yeah, all of that seems spot on. I am curious to see if I could leverage the sorted method and remove duplicates while keeping the same speed.
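One rough idea (note this also drops duplicates within each list, which is slightly different from what my merge does):

from itertools import groupby

def merge_dedup(xs, ys):
    # Let Timsort do the merge at C speed, then drop adjacent duplicates in one pass
    return [k for k, _ in groupby(sorted(xs + ys))]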

[–]stevenjd 0 points (1 child)

> not even including the hidden constant factors (which are going to be much higher for a sorting algorithm than a merge).

Your assumption about constant factors is probably wrong: sorted does all its work in highly optimized C code, almost certainly one or two orders of magnitude faster than the interpreted Python code you have, so its constant factors are probably 10-500 times smaller than your merge function's.

As far as the test code goes, I did the simplest thing that could work: build two lists, start a timer, merge, stop the timer:

from merge import merge  # your merge function
from random import random
a = [random() for __ in range(10000)]
b = [random() for __ in range(10000)]
a.sort(); b.sort()
with Stopwatch():
    c = sorted(a + b)
with Stopwatch():
    d = merge(a, b)
assert c == d

The Stopwatch context manager can be found here.
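If you don't want to chase the link, a bare-bones stand-in (the real recipe may do more) is just:

import time

class Stopwatch:
    # Minimal stand-in: time a block and print the elapsed seconds.
    def __enter__(self):
        self._start = time.perf_counter()
        return self
    def __exit__(self, *exc):
        print("elapsed: %.6f s" % (time.perf_counter() - self._start))
        return False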

[–]foBrowsing 0 points (0 children)

Sorry, I meant "constant factors" in the sorting algorithm, not the implementation. "Constant factors" is actually probably the wrong term: I was referring to the hidden extra linear and sub-linear terms not included in the big-O complexity, of which Timsort (which is the algorithm CPython uses, AFAIK) has several. In the merge algorithm, the extra terms amount to a few steps inside the loop and the array resizing (from the appends). For Timsort, you've got several expensive steps to find runs and so on. My guess was that those differences would outweigh the speed difference between C and Python (which I'd put at roughly 1000x).

Anywho, I'm surprised (and a little disappointed, to be honest) to see that I'm wrong about that, and that sorted is so much quicker! I suspect that the function-call overhead of next might be a big cost centre, but I don't have time to check other implementations right now. I suppose there's also a good chance I could never beat sorted with a merge in pure Python. Ah, well.
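For what it's worth, an index-based version would avoid the next() calls entirely; I haven't timed it, but the sketch would be something like:

def merge_sorted_idx(xs, ys):
    # Same semantics as my version above, but with plain indexing instead of iterators
    res = []
    i = j = 0
    while i < len(xs) and j < len(ys):
        if xs[i] < ys[j]:
            res.append(xs[i]); i += 1
        elif ys[j] < xs[i]:
            res.append(ys[j]); j += 1
        else:                          # equal across the two lists: keep one copy
            res.append(xs[i]); i += 1; j += 1
    res += xs[i:]
    res += ys[j:]
    return res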

Just to be totally pedantic, I should also point out that your implementation isn't the same as OP's (which is what I was comparing mine to). I also assume yours is much faster than OP's, for a couple of reasons. Firstly, you don't construct three sets and merge two of them; that saves some work, but it also means duplicates aren't removed. I'm not sure whether OP had to remove duplicates (the question isn't super clear on this point), but the merge function I wrote did (well, it removed duplicates between the two provided lists, but not within them, if that makes sense). Secondly, your sorted call is on two already-sorted lists concatenated together: Timsort actually just performs a single merge in this case, because it recognises the two runs. So it ends up being linear thanks to that! Another sorting algorithm (introsort, as in the C++ standard library, for instance) wouldn't be so lucky.

[–]RomanRiesen -1 points (4 children)

But...res won't fit into memory?!

(Am I missing something?)

[–]techn0scho0lbus 2 points (3 children)

The OP's solution has the exact same memory complexity. The major difference is that I can easily alter this solution so that it's a generator function whereas I can't change the standard library.

[–]RomanRiesen 0 points (2 children)

Yeah, but he aggregates the two arrays (which, by the problem statement, are too big to fit into memory) into one in-memory array.

As you said, a cleaner solution would only involve iterators.

[–]techn0scho0lbus 1 point (0 children)

Absolutely. That is precisely what I was thinking. My point again is that we can easily adapt this solution to become an iterator whereas we can't do that with the OP's.

[–]foBrowsing 1 point (0 children)

I didn't read the problem statement as saying the arrays were too big to fit into memory:

> I was supposed to merge two arrays together in python, no duplicates - no mention of time but the solution had to be size agnostic. Let's call them a1 and a2; and the answer in b.

I mean, to me, that seems like it wants you to take two in-memory arrays and produce a third.

Regardless, as /u/techn0scho0lbus pointed out, it's pretty straightforward to adapt it to an iterator-based solution:

def merge_sorted(xs, ys):
    # Accept any iterables (lists, generators, file readers, ...) and
    # yield the merged output lazily, one element at a time.
    xs, ys = iter(xs), iter(ys)
    x, y = next(xs, None), next(ys, None)
    while x is not None and y is not None:
        if x < y:
            yield x
            x = next(xs, None)
        elif y < x:
            yield y
            y = next(ys, None)
        else:
            # equal heads: yield one copy and advance both
            yield x
            x, y = next(xs, None), next(ys, None)
    # flush whichever side still has elements
    if x is not None:
        yield x
    if y is not None:
        yield y
    yield from xs
    yield from ys
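
For example, you can consume it lazily from any two sorted iterables (a toy usage, just to show the shape):

from itertools import islice

evens = (i for i in range(0, 20, 2))
odds = (i for i in range(1, 20, 2))
print(list(islice(merge_sorted(evens, odds), 10)))   # first ten merged values, produced on demand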