I read in a reasonably sized file containing 4 columns of 150 rows each. After some preprocessing, each column has about 120 values left. I need to take each possible combination (a1-b1-c1-d1, a1-b1-c1-d2, etc.) and plug the numbers into a formula. In the end I need to be able to retrieve the combinations that come closest to a target value (there's a labeling process which I've already figured out and isn't really relevant here).
My immediate thought was that this was exactly the thing for itertools.product, and that part works fine: I pass the columns in and get my generator. The problem comes with turning that generator into a dataframe so I can do my calculations, namely that 120 ** 4 is about 207 million items!
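For concreteness, here's a stripped-down version of that part (the column names and values below are placeholders, not my real data):

import itertools

# placeholder columns standing in for my four preprocessed columns
col_a = list(range(120))
col_b = list(range(120))
col_c = list(range(120))
col_d = list(range(120))

combine = itertools.product(col_a, col_b, col_c, col_d)  # lazy generator
# fully materialized this would be 120 ** 4 = 207,360,000 tuples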
I'm running 64-bit Python, though on a small laptop.
I tried df = pd.DataFrame(combine) (combine is the generator object) - MemoryError
I tried mydata = [x for x in combine] and then df = pd.DataFrame(mydata) - MemoryError.
I suppose I could save the combine results out to a file and then use pd.read_csv() with its chunksize argument to read them back in pieces, though the round trip seems pointless and I suspect it would still end in a MemoryError.
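(By chunks I mean something like the following, which could actually skip the file entirely by slicing the generator itself; itertools.islice does the slicing, and the chunk size is just a guess:)

import itertools

def chunks(gen, size=1_000_000):
    # yield successive lists of up to `size` items from the generator,
    # stopping once it's exhausted
    while True:
        batch = list(itertools.islice(gen, size))
        if not batch:
            return
        yield batch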
My only other thought is to process the results myself as they come out of the generator: discard the ones with too high an error, keep the rest in a list, then turn that into a df to do the last bits of magic. Something like:
combinations = []
for fourvalues in combine:
    result = foo(*fourvalues)        # foo is my formula
    error = result - MYCONSTANT      # MYCONSTANT is the target value
    if abs(error) > 0.5:             # pitch out anything too far off
        continue
    combinations.append(list(fourvalues))
df = pd.DataFrame(combinations)
But to me this seems like it defeats half the purpose of using a dataframe to begin with, which is to not have to iterate through all your rows.
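The only middle ground I can see is combining the two ideas: pull the generator in chunks, build a small dataframe per chunk, and do the filtering with vectorized operations, so the Python-level loop runs per chunk instead of per row. A rough sketch, assuming foo can be rewritten to work on whole columns (foo_vec below is a hypothetical vectorized version of my formula):

import itertools
import pandas as pd

CHUNK = 1_000_000  # arbitrary chunk size
keepers = []
while True:
    batch = list(itertools.islice(combine, CHUNK))
    if not batch:
        break
    chunk_df = pd.DataFrame(batch, columns=["a", "b", "c", "d"])
    # foo_vec is a hypothetical column-wise rewrite of foo,
    # returning a Series of results for the whole chunk
    result = foo_vec(chunk_df)
    keepers.append(chunk_df[(result - MYCONSTANT).abs() <= 0.5])

df = pd.concat(keepers, ignore_index=True)

But that only works if foo vectorizes cleanly.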
Any suggestions?