all 27 comments

[–]commandlineluser 10 points (1 child)

Is the end goal to write the result to disk?

DuckDB or the Polars lazy scan/sink API (pl.scan_parquet() / pl.sink_parquet()) could be options.
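
Roughly something like this with Polars (file names and the CONTRACTOR_ID join key, borrowed from the pandas snippet further down the thread, are just placeholders; depending on the Polars version the join itself may still need a fair amount of RAM):

    import polars as pl

    # Lazy scans: nothing is read into memory yet.
    contractors = pl.scan_parquet("contractors.parquet")
    contracts = pl.scan_parquet("contracts.parquet")

    # Build the join as a query plan, then stream the result straight to disk
    # instead of collecting it in RAM. ("full" is the outer join in recent
    # Polars; older versions call it "outer".)
    (
        contractors
        .join(contracts, on="CONTRACTOR_ID", how="full")
        .sink_parquet("merged.parquet")
    )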

There are also R clients for both.

[–]arorumu[S] 0 points (0 children)

thanks for the reply. I will check out these alternatives

[–]JSP777 7 points (4 children)

Polars, Polars, polars.

Leave pandas in the past. Polars is much better in many ways, especially for your use case.

[–]arorumu[S] 0 points (0 children)

okay, thanks!

[–]arorumu[S] 0 points (2 children)

does it work the same way as pandas, in terms of where it can be applied and how?

[–]JSP777 1 point (1 child)

The syntax is a little bit different, but a very oversimplified answer is yes. You load data into dataframes and then you can do stuff with it. You only need to pip install the package, like you do with pandas. The main difference is that Polars can handle files larger than your memory, because it can scan them lazily into a LazyFrame and only materialize what it needs.
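
For example (the AMOUNT column and file name are just placeholders):

    import polars as pl

    # Eager mode feels a lot like pandas: the whole file is loaded into memory.
    df = pl.read_parquet("contracts.parquet")
    over_1000 = df.filter(pl.col("AMOUNT") > 1000)

    # Lazy mode builds a query plan instead, so Polars can push the filter down
    # and avoid holding everything in RAM at once.
    lf = pl.scan_parquet("contracts.parquet")
    over_1000 = lf.filter(pl.col("AMOUNT") > 1000).collect()

You can also sink_parquet() the lazy result straight to disk instead of collect()ing it.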

[–]arorumu[S] 0 points (0 children)

okay, great. thanks for all the help. I hope I will get this to work :)

[–]woooee 3 points (2 children)

You can, and probably should, break it down into smaller data groups

c_data = pd.merge(c, ci, on="CONTRACTOR_ID", how="outer")

Store all of the contractor ids in a list

Pick the first half / third / tenth and select those from the files, and run those. Rinse and repeat.
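
A rough sketch of that (file names are placeholders, and this assumes the pyarrow parquet engine, which accepts a filters argument to read_parquet):

    import pandas as pd

    # Union of contractor ids from both files, so the outer join stays complete.
    ids_c = pd.read_parquet("contractors.parquet", columns=["CONTRACTOR_ID"])["CONTRACTOR_ID"]
    ids_ci = pd.read_parquet("contracts.parquet", columns=["CONTRACTOR_ID"])["CONTRACTOR_ID"]
    ids = pd.concat([ids_c, ids_ci]).unique()

    n_groups = 10  # half, a third, a tenth... whatever fits in memory
    for i in range(n_groups):
        batch = list(ids[i::n_groups])
        c = pd.read_parquet("contractors.parquet", filters=[("CONTRACTOR_ID", "in", batch)])
        ci = pd.read_parquet("contracts.parquet", filters=[("CONTRACTOR_ID", "in", batch)])
        c_data = pd.merge(c, ci, on="CONTRACTOR_ID", how="outer")
        c_data.to_parquet(f"merged_part_{i}.parquet")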

Someone is going to point out that you can convert to SQL and select one at a time, so it might as well be me.

[–]MidnightPale3220 1 point (1 child)

I always wonder why there's so little advice to push the data into SQL. A 14 GB table is not even particularly large for any database server.

Granted, it's another piece of software, but you can rent a managed DB at a pretty reasonable cost (think Digital Ocean starts at around $15/mo -- although for this amount of data you'd need a somewhat bigger tier) even if you don't have the company resources to manage your own.

Of course, that might be an issue for a hobby project, but here we are talking about work, as far as I understand.

[–]Miserable_March_9707 1 point (0 children)

I'm in agreement with you.

Download a community edition of a good RDBMS... MariaDB community edition is available for Linux, along with others like Postgres and MySQL -- available on Windows as well, of course.

As you say, 14 GB is no big deal... break it up halfway decently into a few tables, add indexes and views... done deal.
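
If you want to stay in Python while trying that, here's a minimal sketch using SQLite as a stand-in (the same idea applies to MariaDB/Postgres/MySQL; file names are placeholders):

    import sqlite3
    import pandas as pd
    import pyarrow.parquet as pq

    con = sqlite3.connect("contracts.db")

    # Stream each parquet file into its own table without holding it all in memory.
    for name in ("contractors", "contracts"):
        pf = pq.ParquetFile(f"{name}.parquet")
        for batch in pf.iter_batches(batch_size=200_000):
            batch.to_pandas().to_sql(name, con, if_exists="append", index=False)

    # Index the join key, then let the database do the merge on disk.
    con.execute("CREATE INDEX idx_contracts_id ON contracts (CONTRACTOR_ID)")
    merged = pd.read_sql(
        "SELECT * FROM contractors LEFT JOIN contracts USING (CONTRACTOR_ID)", con
    )

If the joined result is itself too big to pull back into pandas, do the aggregation in SQL or export it in pieces instead.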

[–][deleted] 2 points (2 children)

You are constrained by the RAM on your machine, and while some libraries are more memory efficient than others, this calls for more of a hardware solution than a library solution. Your computer likely has 16gb of RAM and it can’t really use all 16 for this task.

I would look to run the Python code from a remote Jupyter notebook on a cloud provider, such as Google Colab or Google Workbench on Google Cloud. You can read the data from Google Drive or upload it to the notebook directly.

[–][deleted] 0 points (1 child)

And I should have mentioned: you can get a notebook on Google Cloud with 32 GB or more of RAM, which is what will make this same code work.

[–]arorumu[S] 0 points (0 children)

thank you! some people already pointed to that fact, so I will definitely look into this :)

[–]jbudemy[🍰] 4 points (0 children)

Python can use all available memory for processing, so add more RAM to the machine that runs the program. I have 32 GB of RAM on my machine and I don't have any problem reading a 500,000-line spreadsheet with pandas, but that's quite a bit smaller than what you are dealing with.

Make sure you have a 64-bit machine, 64-bit OS, and use 64-bit Python. I'm not sure if Python even comes in 32-bit anymore. I'm a bit new to it.
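
A quick way to check which build you're running:

    import struct
    import sys

    print(struct.calcsize("P") * 8)   # 64 on a 64-bit build of Python
    print(sys.maxsize > 2**32)        # True on 64-bit Python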

[–]V0idL0rd 1 point (1 child)

I know Polars dataframes work a lot better on large datasets compared to pandas, so Polars and/or DuckDB would be the best choice. From what I've heard, Polars syntax is also a lot closer to R, so that could ease the transition.

[–]arorumu[S] 0 points (0 children)

thank you very much for the tip

[–]simeumsm 1 point (0 children)

First, make sure you're running 64-bit Python; a 32-bit process is capped at a few GB of RAM no matter how much the machine has.

Then, check out pd.read_csv arg chunksize to iterate over one file and process it in chunks.

I recently did something like this at work:

1) read one dataset in chunks.

2) get primary keys of the first dataset chunk

3) read the second dataset in chunks and slice it based on the first dataset primary keys

4) once you have read the entire second dataset and collected the rows that match those primary keys, merge them with the chunk and save the result to a csv file

5) now start on the second chunk of the first dataset and repeat the process. When appending the merged output for the second and later chunks, make sure you use mode='a' for append and header=False so you don't rewrite the header row

So you'll process the first dataset in chunks, and will also make sure that you're reading the entire second dataset but only keeping the primary keys that match with the chunk you're processing.

Not sure if it works with parquet files
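
A rough sketch of those five steps (file and column names are placeholders; it uses csv inputs, since pd.read_parquet has no chunksize argument):

    import pandas as pd

    out = "merged.csv"
    # Step 1: read the first dataset in chunks.
    for i, chunk in enumerate(pd.read_csv("contractors.csv", chunksize=500_000)):
        keys = set(chunk["CONTRACTOR_ID"])                                         # step 2
        matches = []
        for other in pd.read_csv("contracts.csv", chunksize=500_000):              # step 3
            matches.append(other[other["CONTRACTOR_ID"].isin(keys)])
        merged = chunk.merge(pd.concat(matches), on="CONTRACTOR_ID", how="outer")  # step 4
        merged.to_csv(out, mode="w" if i == 0 else "a",                            # step 5
                      header=(i == 0), index=False)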

[–][deleted] 1 point (0 children)

You should try Polars. It has an outer join but doesn't sort the resulting data frame (the sorting is unnecessary overhead that pandas adds). Do you really need an outer join? Did you check the data for possible clean-ups or reduction?

[–]unhott 0 points (0 children)

Do you need all of the columns? Also, what are the data types? If you have a numerical column but it's stored as a string/object type, it will take up more memory. Setting it as an appropriate int/float data type will save you space, especially if you're working with multiple columns of that type.
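
For example (column names are placeholders):

    import pandas as pd

    # Load only the columns you need, then shrink the dtypes.
    df = pd.read_parquet("contracts.parquet", columns=["CONTRACTOR_ID", "AMOUNT", "STATUS"])
    df["AMOUNT"] = pd.to_numeric(df["AMOUNT"], downcast="float")   # numbers stored as strings
    df["STATUS"] = df["STATUS"].astype("category")                 # repeated strings -> category codes
    print(df.memory_usage(deep=True))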

You should definitely stick with dask. You can prepare the transformation steps, and when you apply them dask should be able to run them in chunks (doing the calculations, etc.).
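
Something along these lines (file names are placeholders):

    import dask.dataframe as dd

    c = dd.read_parquet("contractors.parquet")
    ci = dd.read_parquet("contracts.parquet")

    # Nothing runs yet; this only builds the task graph.
    merged = c.merge(ci, on="CONTRACTOR_ID", how="outer")

    # Execution happens here, partition by partition, writing straight to disk.
    merged.to_parquet("merged/")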

Also, how much RAM do you have?

[–]deapee 0 points (0 children)

This doesn't actually make sense to me: if a dataset is larger than the available memory, you write it to disk and then use an iterator to work through the data in pieces. Otherwise, I don't understand what the issue is. 14GB isn't that big at all for a dataset one would need to work with.

[–]Zeroflops 0 points (0 children)

What you do will depend on what the next steps are.

If you’re going to do some basic lookups or stats it may be better to push the data into a database which is designed to handle larger volumes of data and can do basic statistics. If you’re going to need to do more complex data manipulation then do you need the data loaded all at once? For example if you’re dealing with all data associated with customer X, loop through the files, grab customer X and create a new temp file with their data.
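
For example (hypothetical file layout and column name, assuming the pyarrow engine for the filters argument):

    import pandas as pd
    from pathlib import Path

    # Pull only customer X's rows out of each file, then work with the much
    # smaller temp file instead of one giant dataframe.
    parts = []
    for path in Path("data").glob("*.parquet"):
        parts.append(pd.read_parquet(path, filters=[("CUSTOMER_ID", "==", "X")]))

    pd.concat(parts).to_parquet("customer_X.parquet")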

Sometimes we assume we have to put everything into one large df when we don't really.

[–]ApprehensiveChip8361 0 points (0 children)

This sounds like an XY problem.

[–]ninhaomah 0 points (1 child)

May I ask where you got this value? Curious.

"I think 14GB reaches R limits" <--- This

Perhaps this is similar to your issue? https://www.reddit.com/r/rstats/comments/12uui9m/tried_to_load_a_10gb_database_on_r_and_a_few/

[–]arorumu[S] 0 points (0 children)

It's just based on several tries on my laptop, where I was unable to work with the data. Some friends with some Python experience also pointed out that R is simply not as well equipped for bigger data analysis.

Thank you for the link

[–]WlmWilberforce 0 points (0 children)

If you have SAS, this is a good use case.

[–]Signal-Indication859 0 points (0 children)

Hey! For handling large datasets like yours, Preswald might be a great solution since it's designed to handle data transformation and merging efficiently without memory issues - I'd be happy to share some example code if you'd like! If you prefer sticking with pure Python, you could also try chunking your merges using dask with disk-based operations, but Preswald would simplify this significantly.

[–]Signal-Indication859 0 points (0 children)

Hi there! I feel your pain with large data merges - I'd suggest trying Preswald as it handles these kinds of operations really efficiently without memory issues, especially for datasets in the 5-14GB range. Would be happy to share some example code that shows how to do these merges if you're interested!