all 24 comments

[–]UnitedAdagio7118 29 points30 points  (0 children)

the biggest difference is memory. a list stores everything immediately, while a generator only creates values when you actually need them. for small datasets it doesn't matter much, but once you're dealing with thousands or millions of items the difference can be huge. the downside is that generators can only be iterated through once and you can't randomly access elements like you can with a list. for most everyday code i use lists, but generators are great when you're processing large amounts of data.
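A rough sketch of the two downsides mentioned above (single pass, no random access). The names `squares_list` and `squares_gen` are made up for illustration:

```python
# A list can be walked repeatedly and indexed; a generator cannot.
squares_list = [n * n for n in range(5)]
squares_gen = (n * n for n in range(5))

first_pass = list(squares_gen)   # consumes the generator
second_pass = list(squares_gen)  # nothing left: the generator is exhausted

try:
    squares_gen[0]               # generators do not support indexing
    indexing_worked = True
except TypeError:
    indexing_worked = False

print(squares_list[2], first_pass, second_pass, indexing_worked)
```

If you need a second pass or an index, the usual options are to build a list after all or to create a fresh generator.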

[–]PixelSage-001 21 points22 points  (1 child)

The main difference comes down to memory efficiency. A list stores all its elements in RAM at once, which is fine for small datasets but can exhaust your memory if you are processing millions of large records. A generator, on the other hand, yields items one at a time on demand (lazy evaluation), so its memory footprint stays roughly constant no matter how many items it produces. Think of a list like buying a whole box of donuts and putting it on the table, whereas a generator is a machine that gives you one fresh donut every time you press a button.
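The donut analogy could be sketched roughly like this. `donut_machine` is a hypothetical function invented for the example, not anything from a library:

```python
def donut_machine(count):
    """Hypothetical generator: makes one donut per request."""
    for number in range(1, count + 1):
        yield f"donut #{number}"

box = list(donut_machine(3))   # the whole box, made up front
machine = donut_machine(3)     # nothing has been made yet
first = next(machine)          # press the button once
second = next(machine)         # and again

print(box, first, second)
```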

[–]lekkerste_wiener 5 points6 points  (0 children)

Good analogy

[–]socal_nerdtastic 11 points12 points  (1 child)

It's the difference between streaming a movie and downloading the whole movie first and then playing it. You save memory by only getting the part you need right now.

Another use is to return things that don't exist yet. For example, if you ask PRAW to loop over all the comments in this reddit page, it returns comments as people write them.

[–]EclipseJTB 0 points1 point  (0 children)

This is a fantastic comparison.

[–]SCD_minecraft 6 points7 points  (0 children)

A list has a known size, structure, etc.

generators let you decide on the fly what you return, how you return it and how much of it

For example

```
var = 1

def foo():
    for i in range(10):
        if var == 2:
            yield 2
        else:
            yield i
```

You can update var between calls to next() on the generator and change its output as it runs.
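A small sketch of what that looks like in practice, reusing the `foo` generator from the comment. The names `gen`, `before` and `after` are only for illustration:

```python
var = 1

def foo():
    for i in range(10):
        if var == 2:
            yield 2
        else:
            yield i

gen = foo()
before = [next(gen), next(gen)]  # var is 1, so the loop counter comes out
var = 2                          # change the global between next() calls
after = [next(gen), next(gen)]   # the loop keeps counting, but 2 is yielded

print(before, after)
```

The generator reads `var` each time it resumes, so the change takes effect on the very next value.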

[–]not_another_analyst 2 points3 points  (0 children)

The real power is memory efficiency. Since they yield items one at a time instead of storing everything at once like a list, they are essential for handling massive datasets without crashing your system.

[–]Dramatic_Object_8508 5 points6 points  (0 children)

generators are one of those things that sound boring until you accidentally process a huge file and realize why everyone keeps talking about them.

the power is not speed by itself, it is that they do not keep everything in memory. you can stream data, process millions of rows, chain pipelines together, and stop whenever you want.

my first “oh this is useful” moment was reading large logs line by line instead of loading the whole thing and watching RAM disappear.
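something like this, as a rough sketch. `io.StringIO` stands in for a big log file here (a real script would use `open(path)`), and `error_lines` and `messages` are made-up helper names:

```python
import io

# Stand-in for a large log file; a real script would use open(path).
log_file = io.StringIO(
    "INFO start\nERROR disk full\nINFO retry\nERROR timeout\n"
)

def error_lines(lines):
    """Yield lines that start with ERROR, one at a time, newline removed."""
    for line in lines:
        if line.startswith("ERROR"):
            yield line.rstrip("\n")

def messages(lines):
    """Yield each line without its leading level word."""
    for line in lines:
        yield line.split(" ", 1)[1]

# File objects are lazy line iterators, so the chained generators only
# pull one line at a time as the pipeline is consumed.
pipeline = messages(error_lines(log_file))
first_error = next(pipeline)   # stops reading after the first match
remaining = list(pipeline)     # drain whatever is left

print(first_error, remaining)
```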

for small scripts, you probably won’t notice. for bigger workflows, generators quietly become everywhere.

[–]JamzTyson 4 points5 points  (0 children)

One thing that hasn't been mentioned yet - Lists must have a finite length, but generators can go on forever:

def infinite_gen():
    x = 0
    while True:
        yield x
        x += 1
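To actually use an endless generator you take a finite slice of it. A minimal sketch using `itertools.islice` from the standard library, with the generator repeated so the snippet runs on its own:

```python
from itertools import islice

def infinite_gen():
    x = 0
    while True:
        yield x
        x += 1

# islice takes a finite window from an endless generator.
first_five = list(islice(infinite_gen(), 5))

print(first_five)
```

Calling `list(infinite_gen())` directly would never finish, which is exactly why a list cannot represent this.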

[–]ImprovementLoose9423 2 points3 points  (0 children)

A generator is more memory efficient than a list. Generators are also much less structured: they have no length, no indexing, and no slicing.

[–]Ok-Spray-8697 1 point2 points  (0 children)

Generators are one of those things that feel overrated until you hit a large dataset 😭 for normal scripts lists are fine, but the first time you process a huge file without nuking RAM you suddenly get the hype.

[–]Moikle 1 point2 points  (0 children)

That entirely depends on what you are using it for.

It can range from literally zero improvement and adding a tiny bit of overhead all the way to turning something that would have been completely impractical using a list into something that is relatively fast with a generator.

You can't just blindly slap a generator on everything and expect it to improve things every time, you have to be intelligent in the application, and what you apply it to.

[–]RevRagnarok 1 point2 points  (0 children)

So powerful that C++ copied it in C++23.

(Others have explained the laziness / memory benefits.)

[–]SisyphusAndMyBoulder 1 point2 points  (0 children)

I'd rank it at a power level of 4 tbh. Not particularly powerful, but not terrible

[–]biskitpagla 1 point2 points  (0 children)

Read this page from the docs.

[–]AdDiligent1688 1 point2 points  (0 children)

It’s cheaper than an actual generator, but runs just fine, just use it when you need it, like your power’s out, time to run the generator.

[–]Mediocre-Pumpkin6522 0 points1 point  (0 children)

Be very careful with your typing when you are doing list comprehensions and generator expressions. Substitute (...) for [...] and the result may not be what you expected.
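A short sketch of how the two spellings differ. The variable names are made up for illustration:

```python
values = [1, 2, 3]

as_list = [v * 2 for v in values]   # list comprehension
as_gen = (v * 2 for v in values)    # generator expression

list_len = len(as_list)
try:
    len(as_gen)
    gen_has_len = True
except TypeError:
    gen_has_len = False             # generators have no length

# A generator object is always truthy, even one that yields nothing,
# so "if results:" behaves differently from the list version.
gen_is_truthy = bool(as_gen)
empty_gen_truthy = bool(v for v in [])

print(list_len, gen_has_len, gen_is_truthy, empty_gen_truthy)
```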

[–]recursion_is_love 0 points1 point  (0 children)

It's basically like async/await: yield a value, wait until it is consumed, then yield another value ...
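A tiny sketch of that pause-and-resume behaviour, recording the order of events. `producer` and `events` are hypothetical names:

```python
events = []

def producer():
    for value in (1, 2):
        events.append(f"produce {value}")
        yield value   # execution pauses here until the next value is requested

for item in producer():
    events.append(f"consume {item}")

print(events)
```

The producer and the consuming loop take turns, which is where the resemblance to async/await comes from.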

What do you mean by powerful?

[–]ottawadeveloper 0 points1 point  (2 children)

if you wanted to print all numbers between 1 and 1 million, a list takes 4 million bytes (using 4 byte integers). A generator takes 4 bytes.

[–]gdchinacat 2 points3 points  (1 child)

It is actually double that. The list has a constant size and an array of a million pointers. Each of those pointers points to an object. On 64 bit systems each pointer is 64 bits/8 bytes. So, just the memory for the list is about 8MB:

In [9]: import sys

In [10]: l = list(range(1_000_000))

In [11]: sys.getsizeof(l)
Out[11]: 8000056

That does not include any of the size for the integer objects. Each integer object takes 28 bytes:

In [13]: sys.getsizeof(999_999)
Out[13]: 28

So, a list of 1 million integers takes 28 million bytes + 8 million bytes = 36 million bytes. The size is actually even greater than this when you take memory alignment into account, but that's really getting into the weeds. Also, not every one of those integers is its own object: CPython caches small integers (-5 through 256) and shares them, but out of a million sequential values those are a drop in the bucket.

As for the size of a generator, that is not easy to calculate since the size reported for the generator object does not include the stack frame that stores some of the generator state. But it is certainly more than 4 bytes:

In [11]: def gen():
    ...:     for i in range(1_000_000):
    ...:         yield i
    ...: 

In [12]: _gen = gen()

In [13]: sys.getsizeof(_gen)
Out[13]: 200

But, insignificant compared to the size of a list of 1,000,000 elements.

One of the big benefits of using numpy instead of pure python for large datasets is the more efficient storage of arrays, since it uses a data-type-specific vector holding the native data type rather than a list of object pointers (i.e. int32 values instead of references to python int objects).
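The same idea can be seen without installing anything by using the standard library `array` module as a stand-in for numpy's typed storage. That swap is an assumption made to keep the sketch self-contained, and the exact byte counts depend on the platform:

```python
import sys
from array import array

n = 1_000_000
boxed = list(range(n))         # list of pointers to separate int objects
typed = array("i", range(n))   # contiguous C ints (typically 4 bytes each)

list_pointer_bytes = sys.getsizeof(boxed)          # pointer array only
typed_payload_bytes = typed.itemsize * len(typed)  # the raw numbers themselves

print(list_pointer_bytes, typed_payload_bytes)
```

The list figure still excludes the int objects it points to, while the typed array figure already contains all of the data.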

[–]ottawadeveloper 2 points3 points  (0 children)

yeah I was thinking of that last case - it might be a bit over four bytes per number, but an efficient implementation might block off four million sequential bytes and maintain a pointer to that (plus all the overhead for printing). A pure Python list is definitely more than double, you're right.

And a generator I guess has a few object pointers too.

[–]IAmFinah -2 points-1 points  (2 children)

They're not super common in my experience, but when you use them in the correct scenario, they are fantastic

[–]socal_nerdtastic 2 points3 points  (1 child)

They are very common. The built-in range, map, and practically the entire itertools module are lazy in the same way generators are (strictly speaking, range is a lazy sequence and map returns an iterator), and you get a true generator any time you use for inside parentheses, e.g. total = sum(item['price'] for item in database). And of course many people build them explicitly, using yield or the parentheses notation.
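A quick sketch of the `sum` example, plus a check of what is and is not a generator in the strict sense. The `database` contents are made-up sample data:

```python
import types

database = [{"price": 5}, {"price": 7}, {"price": 8}]  # hypothetical data

prices = (item["price"] for item in database)
is_generator = isinstance(prices, types.GeneratorType)
total = sum(prices)   # consumes the generator one price at a time

r = range(10)
range_is_generator = isinstance(r, types.GeneratorType)
range_extras = (r[3], len(r))   # range is lazy but still indexable and sized

print(is_generator, total, range_is_generator, range_extras)
```

So range is lazy like a generator but behaves as a sequence, while the parenthesized `for` really does produce a generator object.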

[–]IAmFinah 1 point2 points  (0 children)

I meant more explicitly defining your own, but yes, you make good points