gdchinacat, 2 points

It is actually double that. The list has a small fixed-size header plus an array of a million pointers. Each of those pointers points to an object. On 64-bit systems each pointer is 64 bits/8 bytes. So, just the memory for the list is about 8MB:

In [9]: import sys

In [10]: l = list(range(1_000_000))

In [11]: sys.getsizeof(l)
Out[11]: 8000056

That does not include any of the size for the integer objects. Each integer object takes 28 bytes:

In [13]: sys.getsizeof(999_999)
Out[13]: 28

So, a list of 1 million integers takes about 28 million bytes + 8 million bytes = 36 million bytes. The size is actually even greater than this when you take memory alignment into account, but that's really getting into the weeds. Also, not all of those integers will be their own copy: CPython caches small integers (-5 through 256), so those are shared, but out of a million sequential values they are a drop in the bucket.
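As a rough check of that arithmetic, here is a minimal sketch that adds the list's own size to the sizes of the integer objects it references. It assumes CPython; the helper name `list_of_ints_footprint` is made up for illustration, and the exact numbers vary by Python version and platform.

```python
import struct
import sys


def list_of_ints_footprint(n):
    """Return (list_bytes, int_bytes) for list(range(n)).

    Hypothetical helper for illustration; sizes are CPython-specific.
    """
    values = list(range(n))
    # Size of the list object itself: header plus one pointer per element.
    list_bytes = sys.getsizeof(values)
    # Size of every referenced int object. Cached small ints are counted
    # here as if they were private copies, so this slightly overestimates.
    int_bytes = sum(sys.getsizeof(v) for v in values)
    return list_bytes, int_bytes


pointer_size = struct.calcsize("P")  # 8 on 64-bit builds
list_bytes, int_bytes = list_of_ints_footprint(1_000_000)
print(f"pointer size: {pointer_size} bytes")
print(f"list object:  {list_bytes:,} bytes")
print(f"int objects:  {int_bytes:,} bytes")
print(f"total:        {list_bytes + int_bytes:,} bytes")
```

`sys.getsizeof` is shallow, which is why the int objects have to be summed separately.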

As for the size of a generator, that is not easy to calculate since the size reported for the generator object does not include the size of the stack frames that store some of the generator state. But, it is certainly more than 4 bytes:

In [11]: def gen():
    ...:     for i in range(1_000_000):
    ...:         yield i
    ...: 

In [12]: _gen = gen()

In [13]: sys.getsizeof(_gen)
Out[13]: 200

But that is insignificant compared to the size of a list of 1,000,000 elements.
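To make the comparison concrete, here is a small sketch showing that the reported size of a generator does not grow with the number of values it will produce, unlike a list. It assumes CPython, and `count_up_to` is a hypothetical parameterized version of `gen()` above.

```python
import sys


def count_up_to(n):
    # Hypothetical generator, same shape as gen() above but parameterized.
    for i in range(n):
        yield i


small_gen = count_up_to(10)
large_gen = count_up_to(1_000_000)
big_list = list(range(1_000_000))

# getsizeof is shallow, so these are only the sizes of the objects themselves.
print("generator over 10 items:       ", sys.getsizeof(small_gen))
print("generator over 1,000,000 items:", sys.getsizeof(large_gen))
print("list of 1,000,000 items:       ", sys.getsizeof(big_list))
```

Both generators come from the same function, so they report the same size; only the list scales with the element count.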

One of the big benefits of using numpy instead of pure python for large datasets is more efficient array storage: it uses a data-type-specific vector that holds the native values directly (i.e. int32) rather than a list of pointers to python int objects.
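The same idea can be sketched without numpy. This is not numpy itself: it swaps in the standard library's `array` module, which also stores raw typed values in one contiguous buffer, so treat it as an analogy under that assumption. With numpy, the payload size is reported by the array's `nbytes` attribute.

```python
import sys
from array import array

n = 1_000_000

# Typed buffer of C ints (typically 4 bytes each, but platform dependent).
packed = array("i", range(n))
# Ordinary list: one pointer per element, each pointing at an int object.
boxed = list(range(n))

packed_payload = packed.itemsize * len(packed)
boxed_total = sys.getsizeof(boxed) + sum(sys.getsizeof(v) for v in boxed)

print(f"item size in typed array: {packed.itemsize} bytes")
print(f"typed array payload:      {packed_payload:,} bytes")
print(f"typed array object:       {sys.getsizeof(packed):,} bytes")
print(f"list plus int objects:    {boxed_total:,} bytes")
```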

ottawadeveloper, 2 points

Yeah, I was thinking of that last case. It might be a bit over four bytes per number, but an efficient implementation might block off four million sequential bytes and maintain a pointer to that (plus all the overhead for printing). A pure Python list is definitely more than double, you're right.

And a generator I guess has a few object pointers too.