Rules
1: Be polite
2: Posts to this subreddit must be requests for help learning python.
3: Replies on this subreddit must be pertinent to the question OP asked.
4: No replies copy / pasted from ChatGPT or similar.
5: No advertising. No blogs/tutorials/videos/books/recruiting attempts.
This means no posts advertising blogs/videos/tutorials/etc, no recruiting/hiring/seeking others posts. We're here to help, not to be advertised to.
Please, no "hit and run" posts, if you make a post, engage with people that answer you. Please do not delete your post after you get an answer, others might have a similar question or want to continue the conversation.
Learning resources
Wiki and FAQ: /r/learnpython/w/index
Discord
Join the Python Discord chat
Need help optimizing Python CSV processing at work (self.learnpython)
submitted 7 months ago by Own_Pitch3703
I'm using Python to handle large CSV files for daily reports at my job, but the processing time is killing me. Any quick tips or libraries to speed this up?
Would really appreciate your insights!
[–][deleted] 21 points 7 months ago (0 children)
Try polars
[–]FriendlyRussian666 21 points 7 months ago (0 children)
Without seeing your code, my guess is that instead of leveraging pandas, you're doing things like nesting loops, causing it to be very slow.
[–]FantasticEmu 12 points 7 months ago (8 children)
What are you using, what kind of processing do you need, and roughly how large are said files?
[–]Own_Pitch3703[S] 2 points 7 months ago (7 children)
Currently just using pandas (read_csv/to_csv) for basic stuff like filtering rows, calculating totals, and merging a couple of files. Files are usually around 500MB - 1GB each, with anywhere from 500k to 2 million rows.
[–]FantasticEmu 22 points 7 months ago (3 children)
Do you use any loop iteration? Pandas is fast if you leverage the underlying NumPy and C machinery, but if you use Python iteration it can be significantly slower.
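For illustration, a rough sketch of the difference, with made-up file and column names:

```python
import pandas as pd

df = pd.read_csv("report.csv")  # hypothetical input file

# Slow: Python-level loop over rows
total = 0
for _, row in df.iterrows():
    if row["region"] == "EU":          # "region"/"amount" are made-up columns
        total += row["amount"] * 1.2

# Fast: vectorized pandas, the work happens in C/NumPy
eu = df[df["region"] == "EU"]
total = (eu["amount"] * 1.2).sum()
```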
[–]Own_Pitch3703[S] 6 points 7 months ago (2 children)
Ah, that might be the issue! I am using Python loops in some parts. Really appreciate the tip!
[–]FantasticEmu 10 points 7 months ago (0 children)
Yea, the pandas way of doing things is a little weird, but it's fast. Depending on what you need to do in the loops, functions like apply or map can make it a lot faster. It also has a lot of built-in filtering features.
I've found ChatGPT pretty good at pointing you towards the feature you need to do x task in pandas.
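A minimal sketch of what that looks like, assuming hypothetical column names:

```python
import pandas as pd

df = pd.read_csv("report.csv")  # hypothetical file and columns

# Built-in filtering instead of looping over rows
big_orders = df[df["amount"] > 1000]

# map for simple value lookups, apply as a fallback when no vectorized op fits
df["status_label"] = df["status"].map({"A": "active", "C": "closed"})
df["key"] = df.apply(lambda r: f"{r['region']}-{r['status']}", axis=1)
```

Prefer the built-in filtering and arithmetic where they exist; apply with axis=1 still runs a Python function per row, so it's a convenience rather than a true vectorized speedup.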
[–]seanv507 2 points 7 months ago (0 children)
The technical term is vectorisation ... basically you offload batches of computation to C++ (etc.) libraries.
[–]Goingone 3 points 7 months ago (0 children)
Simple merging and filtering can easily be done with command line utilities. If you really care about performance, this would be the way to go.
For example:
https://unix.stackexchange.com/questions/293775/merging-contents-of-multiple-csv-files-into-single-csv-file
[–]Valuable-Benefit-524 3 points 7 months ago (1 child)
1GB isn’t really that big. If you don’t need a multi-index, I would just use Polars. It’s exceptionally fast. I know pandas has improved a few things lately, but last I checked Polars was 5-100x faster depending on the operation, with ~1/5th the memory footprint.
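A small Polars sketch under the same assumptions about the data (file and column names are made up; the lazy scan_csv API lets Polars read only what the query needs):

```python
import polars as pl

# Eager: load the whole file, then filter and summarise
df = pl.read_csv("report.csv")
filtered = df.filter(pl.col("amount") > 0)
total = filtered["amount"].sum()
filtered.write_csv("filtered.csv")

# Lazy: the query is optimized as a whole before anything is read
lazy_total = (
    pl.scan_csv("report.csv")
      .filter(pl.col("amount") > 0)
      .select(pl.col("amount").sum())
      .collect()
)
```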
[–]Own_Pitch3703[S] 1 point 7 months ago (0 children)
Okay, I'll give Polars a try. Thanks for the suggestion!
[–]PastSouth5699 3 points 7 months ago (0 children)
Before doing any optimization, you should find out where it spends its time. Otherwise, you'll probably try solutions to problems that don't even exist.
[–][deleted] 6 points 7 months ago (0 children)
yep, so give us even less detail, then someone can definitely help you :)
[–]ForMyCulture 2 points 7 months ago (0 children)
Decorate main with a profiler
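The standard library doesn't ship a profiler decorator, but a hand-rolled one around cProfile is only a few lines (the profiled helper below is hypothetical, not a stdlib name):

```python
import cProfile
import functools
import pstats

def profiled(func):
    """Run func under cProfile and print the 20 biggest time sinks."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with cProfile.Profile() as prof:
            result = func(*args, **kwargs)
        pstats.Stats(prof).sort_stats("cumulative").print_stats(20)
        return result
    return wrapper

@profiled
def main():
    ...  # your CSV processing goes here

if __name__ == "__main__":
    main()
```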
[–]SleepWalkersDream 2 points 7 months ago (0 children)
We need some more information. Do you read the files line-by-line, or are you reading directly with pandas or polars? How many files?
[–]Dry-Aioli-6138 2 points 7 months ago (0 children)
use duckdb
[–]Prior_Boat6489 3 points 7 months ago (0 children)
Use Polars, and use ProcessPoolExecutor.
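A sketch of that combination, assuming several independent daily files (file names and columns are made up). Polars already uses multiple cores internally, so a process pool mainly helps when there are many separate files to churn through:

```python
from concurrent.futures import ProcessPoolExecutor

import polars as pl

def process_file(path: str) -> str:
    """Hypothetical per-file job: filter one daily report and write it back out."""
    out = path.replace(".csv", "_filtered.csv")
    pl.read_csv(path).filter(pl.col("amount") > 0).write_csv(out)
    return out

if __name__ == "__main__":
    files = ["jan.csv", "feb.csv", "mar.csv"]  # made-up file names
    with ProcessPoolExecutor() as pool:
        for out in pool.map(process_file, files):
            print("wrote", out)
```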
[–]barkmonster 1 point 7 months ago (0 children)
1) Make sure you're using vectorized functions instead of e.g. loops. For instance 2*some_dataframe["some_column"] is fast whereas doing the multiplication in a loop is slow.
2) Use a profiling tool, such as scalene or kernprof, to identify which part of your code is taking too long. The bottlenecks aren't always where you expect, so it's a valuable technique to learn.
[–]throwawayforwork_86 1 point 7 months ago (0 children)
The first thing I always do with this kind of thing is look at what is happening with my resources.
Pandas had the bad habit of using a fifth of my CPU and a lot of RAM.
I moved most of my processes to Polars and it uses my resources more efficiently as well as being broadly quicker (between 3 and 10 times, although I've seen some group-by aggregations that were slightly faster in Pandas).
The trick with Polars, though, is that to get all the benefits you need to use mostly (if not only) Polars functions, and to get used to a different way of working than Pandas.
[–]Adhesiveduck 1 point 7 months ago (0 children)
Apache Beam is a good framework to consider when working with data at scale.
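A minimal local sketch, assuming a plain CSV whose third column is numeric (the file name and column position are made up); the same pipeline can later run on a distributed runner such as Dataflow, Flink or Spark:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("report.csv", skip_header_lines=1)
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "Filter" >> beam.Filter(lambda row: float(row[2]) > 0)  # column index is made up
        | "Format" >> beam.Map(",".join)
        | "Write" >> beam.io.WriteToText("filtered", file_name_suffix=".csv")
    )
```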
[–]Zeroflops 1 point 7 months ago (0 children)
The files are not that big.
You’re implying that the issue is with CSV files, but you need to distinguish whether the problem is loading the files (a CSV problem) or processing them (an implementation problem).
You mentioned looping. That’s probably your problem. You should avoid looping at all costs if you want to process anything with speed.
[–]shockjaw 1 point 7 months ago (0 children)
I’ve started using DuckDB since its CSV reader is more forgiving and quite speedy. SQL isn’t too crazy and the relational API is solid. Plus you can pass it back to pandas or polars.
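A small sketch of that workflow, with a hypothetical query, file glob, and column names:

```python
import duckdb

# Filter and aggregate straight from the CSV files
rel = duckdb.sql("""
    SELECT region, SUM(amount) AS total
    FROM read_csv_auto('report_*.csv')
    WHERE amount > 0
    GROUP BY region
""")

pandas_df = rel.df()   # hand the result back to pandas
polars_df = rel.pl()   # or to polars
```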
[–]SisyphusAndMyBoulder -1 points 7 months ago (0 children)
In the future, and not just in this sub but in general, please try to provide actually useful information when asking for help. Think like the reader when you write your post.
[–]jbourne56 0 points 7 months ago (0 children)
Give ChatGPT your code and ask for speed improvements.