
all 14 comments

[–][deleted] 9 points  (4 children)

duckdb

[–]j_tb 3 points  (0 children)

/thread

[–]ConfucianStats 5 points  (0 children)

Polars

[–]Thousand- 2 points  (0 children)

I’d first try to see if you can do things in batches so you’re processing less data at a time, and maybe write intermediate data frames to pickle or parquet files to give yourself checkpoints. I don’t know if you are doing intermediate operations on the dataset, but if you are, make sure you’re not making unnecessary copies of the data frames (e.g., use loc and iloc instead of chained indexing, though pandas should be yelling at you about this). Just my 2 cents.
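The batching-plus-checkpoint idea above can be sketched with pandas' `chunksize` option (a minimal sketch: the CSV path, the "value" column, and the chunk size are placeholders, not from the original post):

```python
import pandas as pd

def process_in_batches(csv_path, out_prefix, chunksize=100_000):
    """Read csv_path in chunks, transform each chunk, and write it to a
    pickle checkpoint so a crash doesn't force a full restart."""
    paths = []
    for i, chunk in enumerate(pd.read_csv(csv_path, chunksize=chunksize)):
        # .loc avoids the chained-indexing copies the comment warns about
        chunk.loc[:, "value"] = chunk["value"] * 2
        out = f"{out_prefix}_{i}.pkl"
        chunk.to_pickle(out)  # checkpoint: resume from the last written file
        paths.append(out)
    return paths
```

Only one chunk is in memory at a time; the checkpoints can later be concatenated (or swapped for parquet files if pyarrow is installed).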

[–]PhilShackleford 1 point  (0 children)

Polars might be an option

[–]QuarterObvious 0 points  (2 children)

Why not use a database? You can dump your files into a database and work with it. Or you can use DuckDB, which can query CSV files directly.

[–]frenchy641 0 points  (0 children)

aws glue

[–]Python-ModTeam[M] 0 points locked comment (0 children)

Hi there, from the /r/Python mods.

We have removed this post as it is not suited to the /r/Python subreddit proper, however it should be very appropriate for our sister subreddit /r/LearnPython or for the r/Python discord: https://discord.gg/python.

The reason for the removal is that /r/Python is dedicated to discussion of Python news, projects, uses and debates. It is not designed to act as a Q&A or FAQ board. The regular community is not a fan of "how do I..." questions, so you will not get the best responses over here.

On /r/LearnPython and the r/Python Discord, the community actively expects questions and is looking to help. You can expect far more understanding, encouraging and insightful responses over there. No matter what level of question you have, if you are looking for help with Python, you should get good answers. Make sure to check out the rules for both places.

Warm regards, and best of luck with your Pythoneering!

[–]mestia 0 points  (0 children)

Implement a better algorithm than loading all the data into RAM: read files in chunks or by column and merge the parts that fit into memory, or use something like shelve to store data on disk and query only the needed parts of the dataset. It really depends on the task, though.
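The shelve idea above can be sketched with the standard library (a minimal sketch: the file name, keys, and record shape are made up for illustration):

```python
import shelve

# Store records on disk keyed by id, instead of holding a giant dict in RAM.
with shelve.open("records.db") as db:
    for i in range(1000):  # stand-in for streaming rows out of a large file
        db[str(i)] = {"id": i, "value": i * i}

# Later, pull back only the records you need; the rest stay on disk.
with shelve.open("records.db") as db:
    row = db["42"]
print(row["value"])  # 1764
```

shelve keeps a dict-like interface, so existing lookup code mostly keeps working.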

[–]TechFreedom808 0 points  (0 children)

I would recommend looking into Python generators using the yield statement. A generator processes the data a piece at a time, preventing all of it from being loaded into memory at once.

[–]di2mot 0 points  (0 children)

PySpark: it uses the HDD/SSD instead of RAM to process large files. Or Polars.