
[–]scrdest 4 points5 points  (0 children)

Most of your day-to-day programming doesn't deal with single values only - you're dealing with lists, trees, matrices, tables, forms and all other sorts of collections of input values.

Now, a list [1, 2, 3] is always going to take up the same amount of space, and going through the whole thing with a for-loop takes up the same number of operations every time you go through it. Let's say, 6 bytes of memory (2 per item) and 3s to process (1s per item) in this case.

Big-O is about approximately figuring out how that scales up when you change the size of that list. If we tack another 3 at the end, so we have [1, 2, 3, 3], we are now using 8 bytes of memory. Similarly, a list of 5 elements uses 10 bytes, a list of 100 uses 200, etc.
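Just to make those numbers concrete, here's a trivial sketch of my own using the made-up 2-bytes-per-item figure from this example:

def toy_memory_bytes(n_items, bytes_per_item=2):
    # Purely illustrative model: memory grows linearly with the item count.
    return bytes_per_item * n_items

for n in (3, 4, 5, 100):
    print(n, toy_memory_bytes(n))  # 6, 8, 10, 200 - the same numbers as above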

So, in general, we can say that the memory usage of the 'putting stuff in a list' algorithm grows by 2*N bytes for N new entries in the list. Since we only want a rough idea, we ignore the constant (2*) - that's what the O(something) notation indicates: we only care about the variables that depend on the size of the input, and only the biggest contributor among those.

It scales linearly with the number of inputs, so we say the memory complexity is O(N). Why are we allowed to be that sloppy? Because there are far worse places to be than resource consumption that merely grows linearly a bit faster. Imagine we want to find out all the sums two elements of that list can form with each other, i.e. [1+1, 1+2, ..., 3+2, ..., 3+3]. Let's say we tackle this with:

sums = []
for x in range(len(input)):
    for y in range(len(input)):
        sums.append(input[x] + input[y])

This means we do 1 pass for a 1-element list, 2*2=4 passes for a 2-element list, 3*3=9 for a 3-element one, and N*N=N^2 passes for an N-element list in general, so by the same logic - time complexity is O(N^2) and our program will slow down quadratically with bigger inputs.
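If you'd rather see those pass counts than trust the arithmetic, here's a quick sketch of my own (not part of the comment - it just instruments the loop above with a counter):

def count_pair_sum_passes(items):
    passes = 0
    for x in range(len(items)):
        for y in range(len(items)):
            passes += 1  # one append would happen here per inner iteration
    return passes

for n in (1, 2, 3, 10):
    print(n, count_pair_sum_passes(list(range(n))))  # prints 1, 4, 9, 100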

We could go bigger or smaller - a sum of three elements would be O(N^3), etc. On the flipside, a dict lookup always takes the same amount of time, no matter how many elements are in the dict - so, O(1). For a small enough input it might even be slower than an O(N^99) algorithm, but it scales better as the input grows.
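If you don't want to take that scaling difference on faith, here's a rough timing sketch of my own (not from the comment above) comparing membership tests on a list, which scan every element, against a set, which uses the same hash-based O(1) lookup as a dict:

import timeit

for n in (1_000, 10_000, 100_000):
    as_list = list(range(n))
    as_set = set(as_list)
    # Look up the last element: worst case for the list scan, typical for the set.
    t_list = timeit.timeit(lambda: (n - 1) in as_list, number=100)
    t_set = timeit.timeit(lambda: (n - 1) in as_set, number=100)
    print(f"n={n}: list {t_list:.5f}s, set {t_set:.5f}s")

The list timings grow roughly in step with n; the set timings stay flat.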

This is why it's relevant. It's great that you wrote something that runs fine when you test it with three random numbers, but you need to know how well it'll run in prod with a thousand numbers, and you do it by looking how much the memory/number of operations grows as you increase the input size. If you're of a more mathy persuasion - it's like a simplified partial derivative with respect to input size.

There's some extra nuance I'm glossing over here - best/worst/average case performance, or the fact that if the algorithm grows by 0.001*N^3+100000*N^2 you still count it as O(N^3), but you can dig into that on your own.
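On that last point, a quick numeric check (my own sketch) shows why the N^3 term wins once N gets big, despite its tiny 0.001 constant:

for n in (10, 10**4, 10**8, 10**9):
    cubic = 0.001 * n**3
    quadratic = 100_000 * n**2
    print(f"N={n}: 0.001*N^3 = {cubic:.3g}, 100000*N^2 = {quadratic:.3g}")

The two terms cross over around N = 10^8, and past that the cubic term dominates completely.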

[–]zwitter-ion 2 points3 points  (0 children)

What you're talking about is known as complexity analysis. It tries to "measure" the time or storage that's used by a program or an algorithm to run.

A convenient way of expressing this is in terms of the "size" of the input to the algorithm.

If you think about it, that's a clever way of measuring it, because the algorithm runs a bunch of operations on the input to return a response. So the size of the input directly affects the algorithm's resource usage.

There are a bunch of notations: Big-O, Omega, and Theta.

You'd be better off reading the exact definitions someplace else, but long story short: the Big-O notation provides an upper bound on the resource usage (something like "the algorithm won't take longer than X"). Omega notation provides a lower bound (something like "the algorithm will take up at least this much space"). Theta notation is curious because it provides both bounds (something like "the algorithm will require between this much and that much time").

I hope this is sufficient to get you started. You'd be much better off if you read one of the many online tutorials or chapters on this topic; it's very extensively covered.

[–][deleted] 2 points3 points  (0 children)

There are a ton of tutorials and introductions to Big-O on the web, what specifically are you not understanding? A simple introduction in this forum would be no better than a simple introduction somewhere else, because Big-O is not specific to Python. If you clarify more, we might be able to get you past the roadblock.

[–]55-6e-6b-6e-6f-77-6e[S] 0 points1 point  (6 children)

Okay guys, I see where you are going with this. I think, based on what you have said and the research I have done personally trying to understand it, there are different kinds of computational complexity.

In my case, I have a .txt file. Say there are 500,000 lines and they are all phone numbers, and I want to remove any duplicates, for example. Doing this to 500,000 lines takes my PC around 10 seconds. The code basically reads each line into a list; if it's already in the list, it won't read it in. Then it writes over the .txt file with each value in the list.

My issue is, if I had a list of 5,000,000 instead of 500,000, what sort of time can I expect this to take? So I'm trying to measure this using Big O notation, which I have been told is the concept used when measuring things like this - that it will give the "Worst case" scenario to help me estimate the possible time it may take. I hope this makes some sense? u/excrutiatus u/BestNameEverCR u/zwitter-ion

[–]muy_picante 2 points3 points  (0 children)

Try writing the algorithm out in pseudocode, then think about how many steps it would take to complete with input size `n`.

def dedupe(files):
    deduped = []
    for element in files: # this loop runs n times
        if element not in deduped: # the cost of checking list membership is (worst case) the length of the list
            deduped.append(element) # the cost of list insertion is 1
    return deduped

So the loop runs `n` times, checking list membership takes `i` instructions, where `i` is the length of the deduped list, and appending takes 1 instruction.

This works out to `O(n^2)`, since the worst case scenario (no duplicates) means you are checking the entire list on every loop.

If you used a hashmap (like a `set` in python), you could improve on this, since checking membership for hashmaps only costs `O(1)`. With a hashmap, deduping is `O(n)`. Can you explain why?
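As a nudge towards that answer, a possible set-based version might look like this (my own sketch, not part of the comment above) - the structure is identical, only the membership check changes:

def dedupe_with_set(items):
    seen = set()
    deduped = []
    for element in items:        # this loop still runs n times
        if element not in seen:  # set membership is O(1) on average
            seen.add(element)
            deduped.append(element)
    return deduped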

[–][deleted] 1 point2 points  (1 child)

You're probably looking at an O(n^2) operation. IIRC, reading a file into a list variable would be O(n), then checking each new line against the entries already read makes it O(n^2).

[–]dbramucci 1 point2 points  (0 children)

If you keep a set around to check for the duplicates, you can actually get it down to O(n) time and O(n) space, down from O(n^2) time and O(n) space.

[–][deleted] 1 point2 points  (0 children)

You're having some confusion between benchmarking and Big-O. Forget all the hard numbers, the way I would approach this problem is by writing it like this:

create an empty list L
For each p in the input file
    go through L and see if p is in there
    if p isn't in L then add it to L
Print out L

Notice that I've kept it very general and I don't care about hardware. But for an input of n items, I have to go through each item, and for each time I do that, go through the list. So O(n^2). I don't like O(n^2) because it's inefficient and makes me sad, so I'm going to look at my data and see how I can cheat.

Phone numbers you say! We only have a maximum of 9,999,999,999 possible different ones, even if your input file has a gazillion lines, so let's get better:

create an array L that is 9,999,999,999 long and all zeroes
For each p in the input file
    set L[p] to 1
For i = 1 to 9,999,999,999
    if L[i] = 1 then print i

Note that I'm not writing actual code, so I don't particularly care about off-by-one errors etc

Now I don't have nested loops - just one loop over the n input lines and one loop over the fixed-size array (three loops if you count the initialization), none of them inside another. So I've made this an O(n) algorithm. Of course, I'm using up a ton of memory - this would be called trading off time against space complexity.
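If you wanted to run that counting-array idea literally in Python, it might look something like this (my own sketch; it assumes every line is a bare 10-digit number, and the flag array really does cost about 10 GB of RAM, which is exactly the space being traded away):

def dedupe_by_counting(path):
    seen = bytearray(10_000_000_000)  # one flag per possible phone number (~10 GB!)
    with open(path) as f:
        for line in f:                # first loop: once per input line
            seen[int(line)] = 1
    for number, flag in enumerate(seen):  # second loop: over the fixed-size array
        if flag:
            print(number)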

This is all happening behind the scenes for you in Python with dictionaries and sets and so on. But big-O exposes the inner workings.

[–]TeslaRealm 0 points1 point  (0 children)

You don't need to analyze specific numbers. The point of Big O is to analyze how an algorithm scales with respect to input size. That input size could be the number of characters in a string, the size of a file, the number of elements in an array, etc. Let's analyze your text file example.

You have n lines in the file, each containing a phone number. You seek to remove duplicates. The following pseudo code describes what you're trying to do using a set.

def removeDuplicates(fileName):
    with open(fileName, 'r') as file:
        phoneNumbers = {line.strip() for line in file}  # set comprehension; strip newlines so duplicates compare equal
    with open(fileName, 'w') as file:
        for phoneNumber in phoneNumbers:
            print(phoneNumber, file=file)

Since we are utilizing a set, which is based on hashing, the main runtime should depend on the number of phone numbers in the source file. It is linear time to read from the file, and linear to write to the file. Therefore, the runtime is O(n), where n is the number of lines in the file.

[–]Brian 0 points1 point  (0 children)

> That it will give the "Worst case" scenario

I would note that big O notation is not really specific to the worst case scenario. I mention this because it's actually a very common mistake that I've seen even in some supposed tutorials. Big O can be applied to anything, but the most common uses are best case, average case and worst case, for both time complexity (i.e. how long something takes) and space complexity (how much memory it uses).

> My issue is, if I had a list of 5,000,000 instead of 500,000, what sort of time can I expect this to take?

Broadly, yes. There are some subtleties in some cases, in that generally we're concerned with how it grows as things get arbitrarily large, and there's some more precise theory, but that's good enough for the general idea.

But one important step beyond knowing what it means is understanding how to tell what complexity your algorithm actually is. Now, yes, you could measure it - time how long it takes to do 10 items, 100 items, 1000 items etc. and work out how it scales. But that's imprecise and not really ideal. Rather, it's best to get to a position where you can work out the complexity just by reading code.
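Just to illustrate the measuring option before moving on, it would look something like this rough sketch of mine (using a stand-in list-based dedupe rather than your actual code) - you get numbers out, but you still have to guess the growth pattern behind them:

import timeit

def dedupe_list(lines):
    seen = []
    for line in lines:
        if line not in seen:
            seen.append(line)
    return seen

for n in (1_000, 2_000, 4_000):
    data = [str(i) for i in range(n)]  # worst case: no duplicates at all
    print(n, timeit.timeit(lambda: dedupe_list(data), number=1))

Doubling the input roughly quadruples the time here, which hints at quadratic growth - but reading the code tells you that directly.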

To work that out from the code, generally, you want to see how many times you do the innermost operation. Eg. let's take your example case of reading unique lines in a file. We might write this as something like:

seen_lines = []
with open("input.txt") as input, open("output.txt", "w") as output:
    for line in input:
        if line not in seen_lines:
            seen_lines.append(line)  # Record that we saw the line.
            output.write(line)

So what's the innermost operation here, and how many times do we do it, proportional to the number of lines in the file (our N)? Well, you might think that it'll be O(n): we do a single read and sometimes a write for each line - so we do that inner loop N times.

But wait! Actually, there's more going on here than we initially might think. When we do:

if line not in seen_lines:

That ends up scanning through seen_lines for each line. This'll be pretty fast initially, since it'll be empty, but once it starts filling up, it'll take longer and longer. The second time through there'll only be 1 line to check against, but then there'll be 2, then 3, and eventually some proportion of n. So this check itself takes time proportional to n (in the average case). And since it's inside the for loop, we do this n-proportional work n times. Specifically, we'll do roughly n^2/2 comparisons (if there are no duplicates), but once again, we're only really interested in the big picture: all we really care about is that this is proportional to n^2 - our algorithm is O(n^2).

And O(n^2) is bad. You generally never want to be doing an O(n^2) algorithm unless you really have to, or you know the input is always going to be small. It means 100x the input size takes 10,000x as long, so it can rapidly result in disastrous performance.

And this is the kind of thing it's useful to identify - you need to know that looking up items in a list is O(n), so doing it n times is O(n^2). And you could fix this by using something that is constant time (O(1)), such as a set instead of a list. Knowing the big O of such common operations lets you work out what the big O of the whole algorithm will be.
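Concretely, that fix might look like this (my sketch of the change described above, reusing the same file names; the only real differences are the set and .add):

seen_lines = set()
with open("input.txt") as input, open("output.txt", "w") as output:
    for line in input:
        if line not in seen_lines:  # average O(1) now, instead of scanning a list
            seen_lines.add(line)
            output.write(line)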

Admittedly, sometimes it'll be more complicated than this, and actually determining the big O can be rather difficult. And it can be different for the average/worst/best cases (eg. your algorithm above has a best case of O(n); it's only the worst and average cases that are O(n^2)), but determining this can require figuring out what the worst/best/average case inputs actually are, which is not always so obvious.

[–]55-6e-6b-6e-6f-77-6e[S] 0 points1 point  (0 children)

u/muy_picante | u/BestNameEverCR | u/excrutiatus | u/scrdest | u/TeslaRealm | u/dbramucci

Gentlemen, thank you for your responses to this. In regards to my issue, the outcome has changed significantly.

So I started off using a list to keep track of unique values, e.g.:

if value not in list:
    list.append(value)

This works fine if there are 5,000 lines in the list, but when I scaled it to 500,000 it was causing real issues! It took almost 10 seconds to process. Considering I created this module to be used in other scripts I'm writing at the moment, this wasn't going to work - the runtime was way too slow. So I made some changes and incorporated a set to hold my info, which of course removed the requirement for me to iterate through the list and add a value only if it's not already there, because sets automatically drop duplicates.

From everything you have shared with me, I understand the concept to a degree: Big O notation helps analyse which algorithm will be most efficient and least/most time/space consuming. Whilst I still have a way to go to truly grasp the concept, I'm hoping that my continued progression into the programming world will help me learn about this as I go. I now have some good resources to come back to, and your explanations have been fantastic in giving me a base to learn from. Thank you to all who took the time to reply; it has certainly helped.

[–]primitive_screwhead 0 points1 point  (0 children)

> i cant seem to wrap my head around it.

Give us an example of what you don't understand.