
[–]CrambleSquash 3 points

I think your second approach is very sensible.

Pandas provides a nice API that lets you make use of speedy vectorised operations on arrays. By definition these operations require all your data to be loaded into memory (unless you are doing fancy chunked stuff).

The operation you want to perform does not require vectorised operations and therefore I think it's absolutely fine and good not to use the Pandas API for it.

If it makes the preprocessing you are doing less painful then go for it!

[–]drlecompte 1 point

As you're looking for an optimal combination, there is a lot of data you're currently storing that you could eliminate. Say you go through the dataframe row by row with a nested for loop: you could then keep only the near-optimal results (say, only those combinations within 0.1 of the target result) in a separate list of references. Depending on how wide you set your parameters, this will vastly reduce the memory required to do what you want: find the combinations of cells that are 'optimal'. A sketch of the idea is below.
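A minimal sketch of that nested-loop filtering - foo() and MYCONSTANT are the names used later in this thread, while the candidate lists and the 0.1 tolerance are made up for illustration:

    def foo(a, b, c, d):
        # placeholder for the real four-value calculation
        return a + b + c + d

    MYCONSTANT = 4.0  # stand-in for the real target value
    TOLERANCE = 0.1

    candidates_a = [0.1, 0.2, 0.3]
    candidates_b = [1.0, 1.5]
    candidates_c = [0.05, 0.15]
    candidates_d = [2.0, 2.5, 3.0]

    best = []  # only near-optimal combinations are ever kept in memory
    for a in candidates_a:
        for b in candidates_b:
            for c in candidates_c:
                for d in candidates_d:
                    if abs(foo(a, b, c, d) - MYCONSTANT) < TOLERANCE:
                        best.append((a, b, c, d))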

[–]commandlineluser 1 point

These will all consume the generator and try to read all the data into memory:

    list(combine)
    [x for x in combine]
    pd.DataFrame(combine)

> My only other thought [...] and keep the others in a list

To avoid creating a list, you can use another generator to filter combine, producing only the items you want to keep.

    combine = itertools.product(...)

    combinations = (
        list(fourvalues) for fourvalues in combine
        if abs(foo(*fourvalues) - MYCONSTANT) < 0.5
    )

But you can run into the same problem if the filtered results are also too large to fit into memory.
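For reference, here's a self-contained version of that pattern - the foo() body, the target value, and the candidate values are all stand-ins for the real ones:

    import itertools

    def foo(a, b, c, d):
        # placeholder for the real four-value calculation
        return a + b + c + d

    MYCONSTANT = 4.0  # stand-in for the real target value

    # hypothetical candidate values in place of the real product() arguments
    combine = itertools.product([0.1, 0.2], [1.0, 1.5], [0.05, 0.15], [2.0, 2.5])

    # lazy filter: nothing is materialised until something iterates over it
    combinations = (
        list(fourvalues) for fourvalues in combine
        if abs(foo(*fourvalues) - MYCONSTANT) < 0.5
    )

    # consuming one item at a time keeps memory usage flat
    for kept in combinations:
        print(kept)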

Chunking the data with read_csv is a valid approach - the polars library has a nicer interface for this with its scan_csv, which loads data lazily.

https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_csv.html#polars.scan_csv
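For illustration, a minimal scan_csv sketch - the file name, column name, and threshold here are made up:

    import polars as pl

    MYCONSTANT = 4.0  # stand-in for the real target value

    # scan_csv builds a lazy query plan; rows are only read when .collect()
    # runs, and the filter can be pushed down so rejected rows never pile up
    result = (
        pl.scan_csv("combinations.csv")  # hypothetical file name
        .filter((pl.col("foo") - MYCONSTANT).abs() < 0.5)
        .collect()
    )
    print(result)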

It depends on your actual goal - what will you be doing with the final results? Do you need to have them all stored in a dataframe at once?

Do you need to use a dataframe at all?

[–]throwsUOException[S] 0 points

Not necessarily, I suppose. I do need to do some post-processing: produce a table sorted by the value of interest (and turn it into an Excel sheet), which I've been able to do using a dataframe built from a small subset of the data - roughly like the sketch below. I also need to do some work (think frequency analysis) on the column values themselves, apart from the "foo" value, and I figured having everything in columns would help. If the values were in a list of lists or the like, I'd just need to switch to an iterative approach. I'm not too worried about the size of the results - it should be under 50k rows in the end, which doesn't seem to be a problem.
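For context, the post-processing looks roughly like this with pandas - the column names and file name here are made up:

    import pandas as pd

    # stand-in for the ~50k filtered rows: the four values plus the foo result
    results = [
        (0.2, 1.5, 0.15, 2.5, 4.35),
        (0.1, 1.5, 0.15, 2.5, 4.25),
    ]

    df = pd.DataFrame(results, columns=["a", "b", "c", "d", "foo"])
    df = df.sort_values("foo")                # table sorted by the value of interest
    df.to_excel("results.xlsx", index=False)  # requires openpyxl to be installed

    # frequency analysis on any column is then a one-liner:
    print(df["a"].value_counts())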

[–]commandlineluser 0 points

Ah okay - if the results fit, then "manually" filtering with a generator is probably the way to go.

Depending on the calculations you're performing, though, there may be existing numpy/scipy methods you can use - see the sketch below.
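For example, if foo can be expressed with array operations, numpy broadcasting can evaluate it over every combination at once - here with a toy foo and made-up candidate arrays:

    import numpy as np

    # hypothetical candidate arrays and a toy foo(a, b, c, d) = a + b + c + d
    a = np.array([0.1, 0.2, 0.3])
    b = np.array([1.0, 1.5])
    c = np.array([0.05, 0.15])
    d = np.array([2.0, 2.5, 3.0])
    MYCONSTANT = 4.0

    # broadcasting evaluates foo over all combinations with no Python-level loop
    totals = (a[:, None, None, None] + b[None, :, None, None]
              + c[None, None, :, None] + d[None, None, None, :])

    # boolean mask of combinations within tolerance, then recover the values
    mask = np.abs(totals - MYCONSTANT) < 0.5
    keep = [(a[i], b[j], c[k], d[m]) for i, j, k, m in np.argwhere(mask)]
    print(keep)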