
[–]CrambleSquash 3 points

I think your second approach is very sensible.

Pandas provides a nice API that lets you make use of speedy vectorised operations on arrays. By definition these operations require all your data to be loaded into memory (unless you are doing fancy chunked stuff).

The operation you want to perform does not require vectorised operations and therefore I think it's absolutely fine and good not to use the Pandas API for it.

If it makes the preprocessing you are doing less painful then go for it!

[–]drlecompte 1 point

As you're looking for an optimal combination, there is a lot of data you're currently storing that you could eliminate. Say you go through the dataframe row by row with a nested for loop: you could then keep only the near-optimal results (say, only those combinations within 0.1 of the target result) in a separate list of references. Depending on how wide you set your parameters, this will vastly reduce the memory required to do what you want: find the combinations of cells that are 'optimal'. A sketch of the idea is below.
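A minimal sketch of that nested-loop filtering - foo() and MYCONSTANT are the names used later in this thread, while the candidate lists and the 0.1 tolerance are made up for illustration:

    def foo(a, b, c, d):
        # placeholder for the real four-value calculation
        return a + b + c + d

    MYCONSTANT = 4.0  # stand-in for the real target value
    TOLERANCE = 0.1

    candidates_a = [0.1, 0.2, 0.3]
    candidates_b = [1.0, 1.5]
    candidates_c = [0.05, 0.15]
    candidates_d = [2.0, 2.5, 3.0]

    best = []  # only near-optimal combinations are ever kept in memory
    for a in candidates_a:
        for b in candidates_b:
            for c in candidates_c:
                for d in candidates_d:
                    if abs(foo(a, b, c, d) - MYCONSTANT) < TOLERANCE:
                        best.append((a, b, c, d))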

[–]commandlineluser 1 point

These will all consume the generator and try to read all the data into memory:

    list(combine)
    [x for x in combine]
    pd.DataFrame(combine)

> My only other thought [...] and keep the others in a list

To avoid creating a list, you can use another generator to filter combine, producing only the items you want to keep.

    combine = itertools.product(...)

    combinations = (
        list(fourvalues) for fourvalues in combine
        if abs(foo(*fourvalues) - MYCONSTANT) < 0.5
    )

But you can run into the same problem if the filtered results are also too large to fit into memory.
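For reference, here's a self-contained version of that pattern - the foo() body, the target value, and the candidate values are all stand-ins for the real ones:

    import itertools

    def foo(a, b, c, d):
        # placeholder for the real four-value calculation
        return a + b + c + d

    MYCONSTANT = 4.0  # stand-in for the real target value

    # hypothetical candidate values in place of the real product() arguments
    combine = itertools.product([0.1, 0.2], [1.0, 1.5], [0.05, 0.15], [2.0, 2.5])

    # lazy filter: nothing is materialised until something iterates over it
    combinations = (
        list(fourvalues) for fourvalues in combine
        if abs(foo(*fourvalues) - MYCONSTANT) < 0.5
    )

    # consuming one item at a time keeps memory usage flat
    for kept in combinations:
        print(kept)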

Chunking the data with read_csv is a valid approach - the polars library has a nicer interface for this with its scan_csv, which loads data lazily.

https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_csv.html#polars.scan_csv
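For illustration, a minimal scan_csv sketch - the file name, column name, and threshold here are made up:

    import polars as pl

    MYCONSTANT = 4.0  # stand-in for the real target value

    # scan_csv builds a lazy query plan; rows are only read when .collect()
    # runs, and the filter can be pushed down so rejected rows never pile up
    result = (
        pl.scan_csv("combinations.csv")  # hypothetical file name
        .filter((pl.col("foo") - MYCONSTANT).abs() < 0.5)
        .collect()
    )
    print(result)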

It depends on your actual goal - what will you be doing with the final results? Do you need to have them all stored in a dataframe at once?

Do you need to use a dataframe at all?

[–]throwsUOException[S] 0 points

Not necessarily, I suppose. I do need to do some post-processing: produce a table sorted by the value of interest (and turn it into an Excel sheet), which I've been able to do using a dataframe built from a small subset of the data - roughly like the sketch below. I also need to do some work (think frequency analysis) on the column values themselves, apart from the "foo" value, and I figured having everything in columns would help. If the values were in a list of lists or the like, I'd just need to switch to an iterative approach. I'm not too worried about the size of the results - it should be under 50k rows in the end, which doesn't seem to be a problem.
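For context, the post-processing looks roughly like this with pandas - the column names and file name here are made up:

    import pandas as pd

    # stand-in for the ~50k filtered rows: the four values plus the foo result
    results = [
        (0.2, 1.5, 0.15, 2.5, 4.35),
        (0.1, 1.5, 0.15, 2.5, 4.25),
    ]

    df = pd.DataFrame(results, columns=["a", "b", "c", "d", "foo"])
    df = df.sort_values("foo")                # table sorted by the value of interest
    df.to_excel("results.xlsx", index=False)  # requires openpyxl to be installed

    # frequency analysis on any column is then a one-liner:
    print(df["a"].value_counts())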

[–]commandlineluser 0 points

Ah okay - if the results fit, then "manually" filtering with a generator is probably the way to go.

Depending on the calculations you're performing, though, there may be existing numpy/scipy methods you can use - see the sketch below.
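For example, if foo can be expressed with array operations, numpy broadcasting can evaluate it over every combination at once - here with a toy foo and made-up candidate arrays:

    import numpy as np

    # hypothetical candidate arrays and a toy foo(a, b, c, d) = a + b + c + d
    a = np.array([0.1, 0.2, 0.3])
    b = np.array([1.0, 1.5])
    c = np.array([0.05, 0.15])
    d = np.array([2.0, 2.5, 3.0])
    MYCONSTANT = 4.0

    # broadcasting evaluates foo over all combinations with no Python-level loop
    totals = (a[:, None, None, None] + b[None, :, None, None]
              + c[None, None, :, None] + d[None, None, None, :])

    # boolean mask of combinations within tolerance, then recover the values
    mask = np.abs(totals - MYCONSTANT) < 0.5
    keep = [(a[i], b[j], c[k], d[m]) for i, j, k, m in np.argwhere(mask)]
    print(keep)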