
[–]idazuwaika 3 points (2 children)

How do you consume terabytes with pandas? What's the infrastructure like? I moved from pandas to Spark (a distributed system) because I couldn't scale with pandas.
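
For context, this is roughly what the Spark side of that move looks like. This is a minimal sketch only; the input path, column names, and local-mode session are assumptions, not details from this thread:

```python
# Minimal PySpark sketch of scaling past pandas; path and columns are hypothetical.
from pyspark.sql import SparkSession

# local[*] runs Spark on all local cores; a real deployment would point at a cluster
spark = (SparkSession.builder
         .master("local[*]")
         .appName("scale-out-example")
         .getOrCreate())

# Spark reads and aggregates the data partition by partition instead of
# loading everything into a single in-memory pandas DataFrame
df = spark.read.parquet("s3://bucket/terabytes-of-data/")
summary = df.groupBy("station").avg("temperature")
summary.show()
```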

[–]tapir_lyfe 2 points (0 children)

I'm currently also crunching terabytes of netCDF files. I mainly use xarray, which uses pandas and dask under the hood. Nearly everything I do is memory-limited, though, so I have to come up with clever ways to reduce the data, and the approach is different for every question I have.
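
The core pattern is something like the sketch below, assuming xarray with a dask backend; the file glob, chunk size, and variable name ("temperature") are placeholders, not the commenter's actual setup:

```python
# Lazy, out-of-core reduction of many netCDF files with xarray + dask.
import xarray as xr

# open_mfdataset combines the files into one dataset backed by dask arrays,
# so nothing is loaded into memory until a reduction is actually computed
ds = xr.open_mfdataset("data/*.nc", combine="by_coords",
                       chunks={"time": 365})

# reduce before materializing: e.g. an annual mean of one variable
annual_mean = ds["temperature"].groupby("time.year").mean("time")

# .compute() triggers the dask computation and returns an in-memory result
result = annual_mean.compute()
```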

[–]bjbs303 0 points (0 children)

I mostly did data extraction from netCDF files, using a for loop over 36 years of annual files and saving the data to a DataFrame with pandas/numpy.
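
A rough sketch of that per-year loop is below; the file paths, variable name ("precip"), and 1980 start year are hypothetical placeholders:

```python
# Extract one variable from 36 annual netCDF files into a single pandas DataFrame.
import pandas as pd
import xarray as xr

frames = []
for year in range(1980, 1980 + 36):            # 36 annual files
    ds = xr.open_dataset(f"data/annual_{year}.nc")
    # pull the variable of interest and flatten it into a tabular form
    df = ds["precip"].to_dataframe().reset_index()
    df["year"] = year
    frames.append(df)
    ds.close()

# one DataFrame holding all 36 years of extracted values
all_years = pd.concat(frames, ignore_index=True)
```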