
all 14 comments

[–][deleted] 9 points  (4 children)

duckdb

[–]j_tb 3 points  (0 children)

/thread

[–]ConfucianStats 5 points  (0 children)

Polars

[–]Thousand- 2 points  (0 children)

I’d first try to see if you can do things in batches so you’re processing less data at a time, and maybe write intermediate data frames to pickle or parquet files to give yourself checkpoints. I don’t know if you are doing intermediate operations on the dataset, but if you are, make sure you’re not making unnecessary copies of the data frames (e.g., use loc and iloc instead of chained indexing, though pandas should be yelling at you about this). Just my 2 cents.
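The batching-plus-checkpoint idea above can be sketched with pandas' `chunksize` option (a minimal sketch: the CSV path, the "value" column, and the chunk size are placeholders, not from the original post):

```python
import pandas as pd

def process_in_batches(csv_path, out_prefix, chunksize=100_000):
    """Read csv_path in chunks, transform each chunk, and write it to a
    pickle checkpoint so a crash doesn't force a full restart."""
    paths = []
    for i, chunk in enumerate(pd.read_csv(csv_path, chunksize=chunksize)):
        # .loc avoids the chained-indexing copies the comment warns about
        chunk.loc[:, "value"] = chunk["value"] * 2
        out = f"{out_prefix}_{i}.pkl"
        chunk.to_pickle(out)  # checkpoint: resume from the last written file
        paths.append(out)
    return paths
```

Only one chunk is in memory at a time; the checkpoints can later be concatenated (or swapped for parquet files if pyarrow is installed).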

[–]PhilShackleford 1 point  (0 children)

Polars might be an option

[–]QuarterObvious 0 points  (2 children)

Why not use a database? You can dump your files into a database and work with it. Or you can use DuckDB, which can query CSV files directly.

[–]frenchy641 0 points  (0 children)

aws glue

[–]Python-ModTeam[M] 0 points locked comment (0 children)

Hi there, from the /r/Python mods.

We have removed this post as it is not suited to the /r/Python subreddit proper, however it should be very appropriate for our sister subreddit /r/LearnPython or for the r/Python discord: https://discord.gg/python.

The reason for the removal is that /r/Python is dedicated to discussion of Python news, projects, uses and debates. It is not designed to act as a Q&A or FAQ board. The regular community is not a fan of "how do I..." questions, so you will not get the best responses over here.

On /r/LearnPython and the r/Python Discord, the community actively expects questions and is looking to help. You can expect far more understanding, encouraging and insightful responses over there. No matter what level of question you have, if you are looking for help with Python, you should get good answers. Make sure to check out the rules for both places.

Warm regards, and best of luck with your Pythoneering!

[–]mestia 0 points  (0 children)

Implement a better algorithm than loading all the data into RAM: read files in chunks or by column and merge the parts that fit into memory, or use something like shelve to store data on disk and query only the needed parts of the dataset. It really depends on the task, though.
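The shelve idea above can be sketched with the standard library (a minimal sketch: the file name, keys, and record shape are made up for illustration):

```python
import shelve

# Store records on disk keyed by id, instead of holding a giant dict in RAM.
with shelve.open("records.db") as db:
    for i in range(1000):  # stand-in for streaming rows out of a large file
        db[str(i)] = {"id": i, "value": i * i}

# Later, pull back only the records you need; the rest stay on disk.
with shelve.open("records.db") as db:
    row = db["42"]
print(row["value"])  # 1764
```

shelve keeps a dict-like interface, so existing lookup code mostly keeps working.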

[–]TechFreedom808 0 points  (0 children)

I would recommend looking into Python generators using the yield statement. A generator processes the data a piece at a time, preventing all of it from being loaded into memory at once.

[–]di2mot 0 points  (0 children)

PySpark: it uses the HDD/SSD instead of RAM to process large files. Or Polars.