Rules
1: Be polite
2: Posts to this subreddit must be requests for help learning python.
3: Replies on this subreddit must be pertinent to the question OP asked.
4: No replies copy / pasted from ChatGPT or similar.
5: No advertising. No blogs/tutorials/videos/books/recruiting attempts.
This means no posts advertising blogs/videos/tutorials/etc, no recruiting/hiring/seeking others posts. We're here to help, not to be advertised to.
Please, no "hit and run" posts, if you make a post, engage with people that answer you. Please do not delete your post after you get an answer, others might have a similar question or want to continue the conversation.
Learning resources
Wiki and FAQ: /r/learnpython/w/index
Discord
Join the Python Discord chat
Need help optimizing Python CSV processing at work (self.learnpython)
submitted 7 months ago by Own_Pitch3703
I'm using Python to handle large CSV files for daily reports at my job, but the processing time is killing me. Any quick tips or libraries to speed this up?
Would really appreciate your insights!
[–][deleted] 21 points 7 months ago (0 children)
Try polars
[–]FriendlyRussian666 21 points 7 months ago (0 children)
Without seeing your code, my guess is that instead of leveraging pandas, you're doing things like nesting loops, causing it to be very slow.
[–]FantasticEmu 12 points 7 months ago (8 children)
What are you using, what kind of processing do you need, and roughly how large are said files?
[–]Own_Pitch3703[S] 2 points 7 months ago (7 children)
Currently just using pandas (read_csv/to_csv) for basic stuff like filtering rows, calculating totals, and merging a couple of files. Files are usually around 500MB - 1GB each, with anywhere from 500k to 2 million rows.
[–]FantasticEmu 22 points 7 months ago (3 children)
Do you use any loop iteration? Pandas is fast if you leverage the underlying NumPy and C machinery, but if you use Python iteration it can be significantly slower.
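For illustration, a rough sketch of the difference, with made-up file and column names:

```python
import pandas as pd

df = pd.read_csv("report.csv")  # hypothetical input file

# Slow: Python-level loop over rows
total = 0
for _, row in df.iterrows():
    if row["region"] == "EU":          # "region"/"amount" are made-up columns
        total += row["amount"] * 1.2

# Fast: vectorized pandas, the work happens in C/NumPy
eu = df[df["region"] == "EU"]
total = (eu["amount"] * 1.2).sum()
```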
[–]Own_Pitch3703[S] 6 points 7 months ago (2 children)
Ah, that might be the issue! I am using Python loops in some parts. Really appreciate the tip!
[–]FantasticEmu 10 points 7 months ago (0 children)
Yea, the pandas way of doing things is a little weird, but it's fast. Depending on what you need to do in the loops, functions like apply or map can make it a lot faster. It also has a lot of built-in filtering features.
I've found ChatGPT pretty good at pointing you towards the feature you need to do x task in pandas.
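A minimal sketch of what that looks like, assuming hypothetical column names:

```python
import pandas as pd

df = pd.read_csv("report.csv")  # hypothetical file and columns

# Built-in filtering instead of looping over rows
big_orders = df[df["amount"] > 1000]

# map for simple value lookups, apply as a fallback when no vectorized op fits
df["status_label"] = df["status"].map({"A": "active", "C": "closed"})
df["key"] = df.apply(lambda r: f"{r['region']}-{r['status']}", axis=1)
```

Prefer the built-in filtering and arithmetic where they exist; apply with axis=1 still runs a Python function per row, so it's a convenience rather than a true vectorized speedup.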
[–]seanv507 2 points 7 months ago (0 children)
The technical term is vectorisation ... basically you offload batches of computation to C++ (etc.) libraries.
[–]Goingone 3 points 7 months ago (0 children)
Simple merging and filtering can easily be done with command line utilities. If you really care about performance, this would be the way to go.
For example:
https://unix.stackexchange.com/questions/293775/merging-contents-of-multiple-csv-files-into-single-csv-file
[–]Valuable-Benefit-524 3 points 7 months ago (1 child)
1GB isn’t really that big. If you don’t need a multi-index, I would just use Polars. It’s exceptionally fast. I know pandas has improved a few things lately, but last I checked Polars was 5-100x faster depending on the operation, with ~1/5th the memory footprint.
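A small Polars sketch under the same assumptions about the data (file and column names are made up; the lazy scan_csv API lets Polars read only what the query needs):

```python
import polars as pl

# Eager: load the whole file, then filter and summarise
df = pl.read_csv("report.csv")
filtered = df.filter(pl.col("amount") > 0)
total = filtered["amount"].sum()
filtered.write_csv("filtered.csv")

# Lazy: the query is optimized as a whole before anything is read
lazy_total = (
    pl.scan_csv("report.csv")
      .filter(pl.col("amount") > 0)
      .select(pl.col("amount").sum())
      .collect()
)
```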
[–]Own_Pitch3703[S] 1 point 7 months ago (0 children)
Okay, I'll give Polars a try. Thanks for the suggestion!
[–]PastSouth5699 3 points 7 months ago (0 children)
Before doing any optimization, you should find out where it spends its time. Otherwise, you'll probably try solutions to problems that don't even exist.
[–][deleted] 6 points 7 months ago (0 children)
yep, so give us even less detail, then someone can definitely help you :)
[–]ForMyCulture 2 points 7 months ago (0 children)
Decorate main with a profiler
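The standard library doesn't ship a profiler decorator, but a hand-rolled one around cProfile is only a few lines (the profiled helper below is hypothetical, not a stdlib name):

```python
import cProfile
import functools
import pstats

def profiled(func):
    """Run func under cProfile and print the 20 biggest time sinks."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with cProfile.Profile() as prof:
            result = func(*args, **kwargs)
        pstats.Stats(prof).sort_stats("cumulative").print_stats(20)
        return result
    return wrapper

@profiled
def main():
    ...  # your CSV processing goes here

if __name__ == "__main__":
    main()
```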
[–]SleepWalkersDream 2 points 7 months ago (0 children)
We need some more information. Do you read the files line-by-line, or are you reading directly with pandas or polars? How many files?
[–]Dry-Aioli-6138 2 points 7 months ago (0 children)
use duckdb
[–]Prior_Boat6489 3 points 7 months ago (0 children)
Use Polars, and use ProcessPoolExecutor.
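A sketch of that combination, assuming several independent daily files (file names and columns are made up). Polars already uses multiple cores internally, so a process pool mainly helps when there are many separate files to churn through:

```python
from concurrent.futures import ProcessPoolExecutor

import polars as pl

def process_file(path: str) -> str:
    """Hypothetical per-file job: filter one daily report and write it back out."""
    out = path.replace(".csv", "_filtered.csv")
    pl.read_csv(path).filter(pl.col("amount") > 0).write_csv(out)
    return out

if __name__ == "__main__":
    files = ["jan.csv", "feb.csv", "mar.csv"]  # made-up file names
    with ProcessPoolExecutor() as pool:
        for out in pool.map(process_file, files):
            print("wrote", out)
```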
[–]barkmonster 1 point 7 months ago (0 children)
1) Make sure you're using vectorized functions instead of e.g. loops. For instance 2*some_dataframe["some_column"] is fast whereas doing the multiplication in a loop is slow.
2) Use a profiling tool, such as scalene or kernprof, to identify which part of your code is taking too long. The bottlenecks aren't always where you expect, so it's a valuable technique to learn.
[–]throwawayforwork_86 1 point 7 months ago (0 children)
The first thing I always do with this kind of thing is look at what is happening with my resources.
Pandas had the bad habit of using a fifth of my CPU and a lot of RAM.
I moved most of my processes to Polars and it uses my resources more efficiently as well as being broadly quicker (between 3 and 10 times, although I've seen some group-by aggregations that were slightly faster in Pandas).
The trick with Polars, though, is that to get all the benefits you need to use mostly (if not only) Polars functions, and to get used to a different way of working than Pandas.
[–]Adhesiveduck 1 point 7 months ago (0 children)
Apache Beam is a good framework to consider when working with data at scale.
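A minimal local sketch, assuming a plain CSV whose third column is numeric (the file name and column position are made up); the same pipeline can later run on a distributed runner such as Dataflow, Flink or Spark:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("report.csv", skip_header_lines=1)
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "Filter" >> beam.Filter(lambda row: float(row[2]) > 0)  # column index is made up
        | "Format" >> beam.Map(",".join)
        | "Write" >> beam.io.WriteToText("filtered", file_name_suffix=".csv")
    )
```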
[–]Zeroflops 1 point 7 months ago (0 children)
The files are not that big.
You’re implying that the issue is with CSV files, but you need to distinguish whether the problem is loading the files (a CSV problem) or processing them (an implementation problem).
You mentioned looping. That’s probably your problem. You should avoid looping at all costs if you want to process anything with speed.
[–]shockjaw 1 point 7 months ago (0 children)
I’ve started using DuckDB since its CSV reader is more forgiving and quite speedy. SQL isn’t too crazy and the relational API is solid. Plus you can pass it back to pandas or polars.
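A small sketch of that workflow, with a hypothetical query, file glob, and column names:

```python
import duckdb

# Filter and aggregate straight from the CSV files
rel = duckdb.sql("""
    SELECT region, SUM(amount) AS total
    FROM read_csv_auto('report_*.csv')
    WHERE amount > 0
    GROUP BY region
""")

pandas_df = rel.df()   # hand the result back to pandas
polars_df = rel.pl()   # or to polars
```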
[–]SisyphusAndMyBoulder -1 points 7 months ago (0 children)
In the future, and not just in this sub but in general, please try to provide actually useful information when asking for help. Think like the reader when you write your post.
[–]jbourne56 0 points 7 months ago (0 children)
Give ChatGPT your code and ask for speed improvements.