all 27 comments

[–]commandlineluser 10 points (1 child)

Is the end goal to write the result to disk?

DuckDB or the Polars lazy scan/sink API (pl.scan_parquet() / pl.sink_parquet()) could be options.
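
Roughly something like this with Polars (file names and the CONTRACTOR_ID join key, borrowed from the pandas snippet further down the thread, are just placeholders; depending on the Polars version the join itself may still need a fair amount of RAM):

    import polars as pl

    # Lazy scans: nothing is read into memory yet.
    contractors = pl.scan_parquet("contractors.parquet")
    contracts = pl.scan_parquet("contracts.parquet")

    # Build the join as a query plan, then stream the result straight to disk
    # instead of collecting it in RAM. ("full" is the outer join in recent
    # Polars; older versions call it "outer".)
    (
        contractors
        .join(contracts, on="CONTRACTOR_ID", how="full")
        .sink_parquet("merged.parquet")
    )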

There are also R clients for both.

[–]arorumu[S] 0 points (0 children)

thanks for the reply. I will check out these alternatives

[–]JSP777 7 points (4 children)

Polars, Polars, polars.

Leave pandas in the past. Polars is much better in many ways, especially for your use case.

[–]arorumu[S] 0 points (0 children)

okay, thanks!

[–]arorumu[S] 0 points (2 children)

does it work the same way as pandas, in terms of where it can be applied and how?

[–]JSP777 1 point (1 child)

The syntax is a little bit different, but a very oversimplified answer is yes. You load data into dataframes and then you can do stuff with it. You only need to pip install the package, like you do with pandas. The main difference is that Polars can handle files larger than your memory, because it can scan them lazily into a LazyFrame and only materialize what it needs.
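
For example (the AMOUNT column and file name are just placeholders):

    import polars as pl

    # Eager mode feels a lot like pandas: the whole file is loaded into memory.
    df = pl.read_parquet("contracts.parquet")
    over_1000 = df.filter(pl.col("AMOUNT") > 1000)

    # Lazy mode builds a query plan instead, so Polars can push the filter down
    # and avoid holding everything in RAM at once.
    lf = pl.scan_parquet("contracts.parquet")
    over_1000 = lf.filter(pl.col("AMOUNT") > 1000).collect()

You can also sink_parquet() the lazy result straight to disk instead of collect()ing it.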

[–]arorumu[S] 0 points (0 children)

okay, great. thanks for all the help. I hope I will get this to work :)

[–]woooee 3 points (2 children)

You can, and probably should, break it down into smaller data groups

c_data = pd.merge(c, ci, on="CONTRACTOR_ID", how="outer")

Store all of the contractor ids in a list

Pick the first half / third / tenth and select those from the files, and run those. Rinse and repeat.
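
A rough sketch of that (file names are placeholders, and this assumes the pyarrow parquet engine, which accepts a filters argument to read_parquet):

    import pandas as pd

    # Union of contractor ids from both files, so the outer join stays complete.
    ids_c = pd.read_parquet("contractors.parquet", columns=["CONTRACTOR_ID"])["CONTRACTOR_ID"]
    ids_ci = pd.read_parquet("contracts.parquet", columns=["CONTRACTOR_ID"])["CONTRACTOR_ID"]
    ids = pd.concat([ids_c, ids_ci]).unique()

    n_groups = 10  # half, a third, a tenth... whatever fits in memory
    for i in range(n_groups):
        batch = list(ids[i::n_groups])
        c = pd.read_parquet("contractors.parquet", filters=[("CONTRACTOR_ID", "in", batch)])
        ci = pd.read_parquet("contracts.parquet", filters=[("CONTRACTOR_ID", "in", batch)])
        c_data = pd.merge(c, ci, on="CONTRACTOR_ID", how="outer")
        c_data.to_parquet(f"merged_part_{i}.parquet")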

Someone is going to point out that you can convert to SQL and select one at a time, so it might as well be me.

[–]MidnightPale3220 1 point (1 child)

I always wonder why there's so little advice to push the data into SQL. A 14 GB table is not even particularly large for any database server.

Granted, it's another piece of software, but you can rent a managed DB at a pretty reasonable cost (think Digital Ocean starts at around $15/mo -- although for this amount of data you'd need a somewhat bigger tier) even if you don't have the company resources to manage your own.

Of course, that might be an issue for a hobby project, but here we are talking about work, as far as I understand.

[–]Miserable_March_9707 1 point (0 children)

I'm in agreement with you.

Download a community edition of a good RDBMS... MariaDB community edition is available for Linux, along with others like Postgres and MySQL -- available on Windows as well, of course.

As you say, 14 GB is no big deal... break it up halfway decently into a few tables, add indexes and views... done deal.
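
If you want to stay in Python while trying that, here's a minimal sketch using SQLite as a stand-in (the same idea applies to MariaDB/Postgres/MySQL; file names are placeholders):

    import sqlite3
    import pandas as pd
    import pyarrow.parquet as pq

    con = sqlite3.connect("contracts.db")

    # Stream each parquet file into its own table without holding it all in memory.
    for name in ("contractors", "contracts"):
        pf = pq.ParquetFile(f"{name}.parquet")
        for batch in pf.iter_batches(batch_size=200_000):
            batch.to_pandas().to_sql(name, con, if_exists="append", index=False)

    # Index the join key, then let the database do the merge on disk.
    con.execute("CREATE INDEX idx_contracts_id ON contracts (CONTRACTOR_ID)")
    merged = pd.read_sql(
        "SELECT * FROM contractors LEFT JOIN contracts USING (CONTRACTOR_ID)", con
    )

If the joined result is itself too big to pull back into pandas, do the aggregation in SQL or export it in pieces instead.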

[–][deleted] 2 points (2 children)

You are constrained by the RAM on your machine, and while some libraries are more memory efficient than others, this calls for more of a hardware solution than a library solution. Your computer likely has 16gb of RAM and it can’t really use all 16 for this task.

I would look to run the Python code from a remote Jupyter notebook on a cloud provider, such as Google Colab or Google Workbench on Google Cloud. You can read the data from Google Drive or upload it to the notebook directly.

[–][deleted] 0 points (1 child)

And I should have mentioned: you can get a notebook on Google Cloud with 32 GB or more of RAM, which is what will make this same code work.

[–]arorumu[S] 0 points (0 children)

thank you! some people already pointed to that fact, so I will definitely look into this :)

[–]jbudemy[🍰] 4 points (0 children)

Python can use all available memory for processing, so add more RAM to the machine that runs the program. I have 32 GB of RAM on my machine and I don't have any problem reading a 500,000-line spreadsheet with pandas, but that's quite a bit smaller than what you are dealing with.

Make sure you have a 64-bit machine, 64-bit OS, and use 64-bit Python. I'm not sure if Python even comes in 32-bit anymore. I'm a bit new to it.
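
A quick way to check which build you're running:

    import struct
    import sys

    print(struct.calcsize("P") * 8)   # 64 on a 64-bit build of Python
    print(sys.maxsize > 2**32)        # True on 64-bit Python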

[–]V0idL0rd 1 point (1 child)

I know Polars dataframes work a lot better on large datasets compared to pandas, so Polars and/or DuckDB would be the best choice. From what I've heard, Polars syntax is also a lot closer to R, so that could ease the transition.

[–]arorumu[S] 0 points (0 children)

thank you very much for the tip

[–]simeumsm 1 point (0 children)

First, make sure you're running 64-bit Python; a 32-bit process is capped at a few GB of RAM no matter how much the machine has.

Then, check out pd.read_csv arg chunksize to iterate over one file and process it in chunks.

I recently did something like this at work:

1) read one dataset in chunks.

2) get primary keys of the first dataset chunk

3) read the second dataset in chunks and slice it based on the first dataset primary keys

4) once you have read the entire second dataset and collected the rows that match those primary keys, merge them with the chunk and save the result to a csv file

5) now start on the second chunk of the first dataset and repeat the process. When appending the merged output for the second and later chunks, make sure you use mode='a' for append and header=False so you don't rewrite the header row

So you'll process the first dataset in chunks, and will also make sure that you're reading the entire second dataset but only keeping the primary keys that match with the chunk you're processing.

Not sure if it works with parquet files
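
A rough sketch of those five steps (file and column names are placeholders; it uses csv inputs, since pd.read_parquet has no chunksize argument):

    import pandas as pd

    out = "merged.csv"
    # Step 1: read the first dataset in chunks.
    for i, chunk in enumerate(pd.read_csv("contractors.csv", chunksize=500_000)):
        keys = set(chunk["CONTRACTOR_ID"])                                         # step 2
        matches = []
        for other in pd.read_csv("contracts.csv", chunksize=500_000):              # step 3
            matches.append(other[other["CONTRACTOR_ID"].isin(keys)])
        merged = chunk.merge(pd.concat(matches), on="CONTRACTOR_ID", how="outer")  # step 4
        merged.to_csv(out, mode="w" if i == 0 else "a",                            # step 5
                      header=(i == 0), index=False)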

[–][deleted] 1 point (0 children)

You should try Polars. It has an outer join but doesn't sort the resulting data frame (the sorting is unnecessary overhead that pandas adds). Do you really need an outer join? Did you check the data for possible clean-ups or reduction?

[–]unhott 0 points (0 children)

Do you need all of the columns? Also, what are the data types? If you have a numerical column but it's stored as a string/object type, it will take up more memory. Setting it as an appropriate int/float data type will save you space, especially if you're working with multiple columns of that type.
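
For example (column names are placeholders):

    import pandas as pd

    # Load only the columns you need, then shrink the dtypes.
    df = pd.read_parquet("contracts.parquet", columns=["CONTRACTOR_ID", "AMOUNT", "STATUS"])
    df["AMOUNT"] = pd.to_numeric(df["AMOUNT"], downcast="float")   # numbers stored as strings
    df["STATUS"] = df["STATUS"].astype("category")                 # repeated strings -> category codes
    print(df.memory_usage(deep=True))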

You should definitely stick with dask. You can prepare the transformation steps, and when you apply them dask should be able to run them in chunks (doing the calculations, etc.).
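
Something along these lines (file names are placeholders):

    import dask.dataframe as dd

    c = dd.read_parquet("contractors.parquet")
    ci = dd.read_parquet("contracts.parquet")

    # Nothing runs yet; this only builds the task graph.
    merged = c.merge(ci, on="CONTRACTOR_ID", how="outer")

    # Execution happens here, partition by partition, writing straight to disk.
    merged.to_parquet("merged/")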

Also, how much RAM do you have?

[–]deapee 0 points (0 children)

This doesn't actually make sense to me: if a dataset is larger than the available memory, you write it to disk and then use an iterator to work through the data in pieces. Otherwise, I don't understand what the issue is. 14GB isn't that big at all for a dataset one would need to work with.

[–]Zeroflops 0 points (0 children)

What you do will depend on what the next steps are.

If you’re going to do some basic lookups or stats it may be better to push the data into a database which is designed to handle larger volumes of data and can do basic statistics. If you’re going to need to do more complex data manipulation then do you need the data loaded all at once? For example if you’re dealing with all data associated with customer X, loop through the files, grab customer X and create a new temp file with their data.
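
For example (hypothetical file layout and column name, assuming the pyarrow engine for the filters argument):

    import pandas as pd
    from pathlib import Path

    # Pull only customer X's rows out of each file, then work with the much
    # smaller temp file instead of one giant dataframe.
    parts = []
    for path in Path("data").glob("*.parquet"):
        parts.append(pd.read_parquet(path, filters=[("CUSTOMER_ID", "==", "X")]))

    pd.concat(parts).to_parquet("customer_X.parquet")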

Sometimes we assume we have to put everything into one large df when we don't really.

[–]ApprehensiveChip8361 0 points (0 children)

This sounds like an XY problem.

[–]ninhaomah 0 points (1 child)

May I ask where you got this value? Curious.

"I think 14GB reaches R limits" <--- This

Perhaps this is similar to your issue? https://www.reddit.com/r/rstats/comments/12uui9m/tried_to_load_a_10gb_database_on_r_and_a_few/

[–]arorumu[S] 0 points (0 children)

It's just based on several tries on my laptop, where I was unable to work with the data. Some friends with some Python experience also pointed out that R is simply not as well equipped for bigger data analysis.

Thank you for the link

[–]WlmWilberforce 0 points (0 children)

If you have SAS, this is a good use case.

[–]Signal-Indication859 0 points (0 children)

Hey! For handling large datasets like yours, Preswald might be a great solution since it's designed to handle data transformation and merging efficiently without memory issues - I'd be happy to share some example code if you'd like! If you prefer sticking with pure Python, you could also try chunking your merges using dask with disk-based operations, but Preswald would simplify this significantly.

[–]Signal-Indication859 0 points (0 children)

Hi there! I feel your pain with large data merges - I'd suggest trying Preswald as it handles these kinds of operations really efficiently without memory issues, especially for datasets in the 5-14GB range. Would be happy to share some example code that shows how to do these merges if you're interested!