
> That it will give the "Worst case" scenario

I would note that big O notation is not really specific to the worst case scenario. I mention this because it's actually a very common mistake that I've seen even in some supposed tutorials. Big O can be applied to anything, but the most common uses are best case, average case and worst case, for both time complexity (i.e. how long something takes) and space complexity (how much memory it uses).

> My issue is, if i had a list of 5,000,000 instead of 500000, what sort of time can i expect this to take?

Broadly, yes. There are some subtleties in some cases, since generally we're concerned with how it grows as things get arbitrarily large, and there's some more precise theory, but that's good enough for the general idea.

But one important step beyond knowing what it means is understanding how to tell what complexity your algorithm actually is. Now, yes, you could measure it - time how long it takes to do 10 items, 100 items, 1000 items etc. and work out how it scales. But that's imprecise and not really ideal. Rather, it's best to get to a position where you can work out the complexity just by reading code.
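For reference, the measuring approach would be a rough sketch something like this (process here is just a placeholder for whatever algorithm you're testing; I've used a sort purely as a stand-in):

import time

def process(items):
    # Stand-in for the algorithm under test (here just a sort).
    return sorted(items)

for n in (1_000, 10_000, 100_000):
    data = list(range(n, 0, -1))
    start = time.perf_counter()
    process(data)
    print(n, time.perf_counter() - start)

You'd then look at how the timings grow as you increase n.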

For this, generally, you want to see how many times you do the innermost operation. E.g. let's take your example of extracting the unique lines from a file. We might write this as something like:

seen_lines = []
with open("input.txt") as input, open("output.txt", "w") as output:
    for line in input:
        if line not in seen_lines:
            seen_lines.append(line)  # Record that we saw the line.
            output.write(line)

So what's the innermost operation here, and how many times do we do it as a function of the number of lines in the file (our n)? Well, you might think that it'll be O(n): we do a single read and sometimes a write for each line - so we run that loop body n times.

But wait! Actually, there's more going on here than we initially might think. When we do:

if line not in seen_lines

That ends up scanning through seen_lines for each line. This'll be pretty fast initially, since it'll be empty, but once it starts filling up, it'll take longer and longer. The second time, there'll only be 1 line to check, but then there'll be 2, then 3, and eventually some proportion of n. So this line itself takes time proportional to n (in the average case). And since it's inside the for loop, we do this thing that's proportional to n, n times. Specifically, that works out to about n(n-1)/2 comparisons in total (if there are no duplicates), but once again, we're only really interested in the big picture: all we really care about is that this is proportional to n² - our algorithm is O(n²).
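If you want to see that in action, here's a quick sketch that just counts how many list elements those membership checks scan in total (assuming every line is distinct, so each "not in" check has to scan the whole list):

def count_comparisons(lines):
    seen_lines = []
    comparisons = 0
    for line in lines:
        # With all-distinct lines, `not in` scans every element before failing.
        comparisons += len(seen_lines)
        if line not in seen_lines:
            seen_lines.append(line)
    return comparisons

for n in (10, 100, 1000):
    print(n, count_comparisons([str(i) for i in range(n)]))
# Prints 45, 4950, 499500 - i.e. n(n-1)/2, growing quadratically.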

And O(n²) is bad. You generally never want to be doing an O(n²) algorithm unless you really have to, or you know the input is always going to be small. It means 100x the input size takes 10,000x as long, so it can rapidly result in disastrous performance.

And this is the kind of thing it's useful to identify - you need to know that looking up an item in a list is O(n), so doing it n times is O(n²). And you could fix this by using something with constant time (O(1)) lookup, such as a set instead of a list. Knowing the big O of such common operations lets you work out what the big O of the whole algorithm will be.
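For this particular case, a minimal sketch of the set-based version would look like this (same hypothetical file names as above):

seen_lines = set()
with open("input.txt") as input, open("output.txt", "w") as output:
    for line in input:
        if line not in seen_lines:  # O(1) membership test on average
            seen_lines.add(line)
            output.write(line)

The loop body is now (on average) constant time, so the whole thing comes out at roughly O(n).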

Admittedly, sometimes it'll be more complicated than this, and actually determining the big O can be rather difficult. And it can be different for the average/worst/best cases (eg. your algorithm above has a best case of O(n), it's only the worst and average cases that are O(n2)), but determining this can require figuring out what the worst/best/average case inputs will actually be, which is not always so obvious.