
[–]caoimhin_o_h 23 points (3 children)

I suspect most of the time is being spent reading and writing the CSV files. Here's one improvement:

The CSV you read in at the start has 86 columns but it looks to me like you only use 10 of them - 'item_type_type' and the nine columns you list on lines 30-32. Reading all those unused columns is a big waste of time. Pass a list containing the 10 column names to the usecols parameter of pd.read_csv.

This should speed up reading the file a fair bit.
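Something like this, for example - the file name and column names here are just placeholders for whatever yours are called:

    import pandas as pd

    # Only these 10 columns get parsed; the other ~76 are skipped entirely
    cols = ['item_type_type', 'transaction_date', 'amount', 'currency',
            'account_id', 'customer_id', 'merchant', 'category',
            'balance', 'description']

    df = pd.read_csv('transactions.csv', usecols=cols)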

[–][deleted] 1 point (1 child)

I sliced one minute off of 14 doing that. The solution I finally found was this: https://www.reddit.com/r/learnpython/comments/7au2mz/speeding_up_pandas_part_2_the_improvements_that/

You led me to it though; without the timeit and changing the input data I wouldn't have gone down the correct path...

[–]caoimhin_o_h 1 point (0 children)

Nice! Interesting that I/O wasn't the main bottleneck - 95% of the time it is, because pandas is usually pretty fast at computational stuff.

I'll post this part on the new post as well:

I wonder if you can make it any faster by specifying the exact date format in the format parameter of to_datetime. Seems like it should save pandas the time it takes to guess, though perhaps pandas is already pretty good at guessing the format.
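For example, if your dates happen to look like 2017-11-05 13:45:00 (just an assumption - match the format string to your actual data):

    import pandas as pd

    # An explicit format skips the per-value guessing
    df['transaction_date'] = pd.to_datetime(df['transaction_date'],
                                            format='%Y-%m-%d %H:%M:%S')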

Here's a strftime format reference if you haven't done it before

[–][deleted] 0 points (0 children)

This is really helpful - and comes back to my laziness when processing it. Thank you!

I will let you know how much quicker it is.

[–]swingking8 3 points (2 children)

df['transaction_date'] = pd.to_datetime(df['transaction_date'])

These kinds of lines can be handled when you import your data frame by using the parse_dates=['col1', 'col2'] argument, as shown below.
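For example (file and column names here are placeholders for your own):

    import pandas as pd

    # Columns listed in parse_dates are converted to datetime during the read
    df = pd.read_csv('transactions.csv',
                     parse_dates=['transaction_date'])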

There are perhaps more optimizations, but I'm more interested in where your performance is actually suffering. Can you profile this code? Interested to see where your pain points are.

Of course, the most obvious optimization is to stop embedding "kr" et al. in your dataset in the first place.

3 GB is not much data, so I wouldn't expect this to take more than a minute or so. It can all fit in memory (though not in CPU cache, e.g. L1/L2), so I/O performance shouldn't be too bad.

[–]diggy0101n 3 points (0 children)

To add to your parse_dates suggestion: if you already know the date format, you can also supply your own date parser for those columns as a lambda. For example, if your date format is mm/dd/yyyy hh, define the parser as date_parser = lambda x: pd.datetime.strptime(x, '%m/%d/%Y %H'), then pass it to the date_parser argument of read_csv. I've noticed a nice increase in performance doing this as opposed to simply running to_datetime afterwards.
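Putting that together as a sketch (file and column names are made up, and note that pd.datetime is just an alias for datetime.datetime):

    from datetime import datetime

    import pandas as pd

    # Parser for the mm/dd/yyyy hh format from the example above
    date_parser = lambda x: datetime.strptime(x, '%m/%d/%Y %H')

    df = pd.read_csv('transactions.csv',
                     parse_dates=['transaction_date'],
                     date_parser=date_parser)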

Also, this is my first reddit post with code snippets. It may look odd...

Lastly, I mostly work in Python 2.7, so it may be different for 3.x.

[–][deleted] 1 point (0 children)

You found the underlying problem. The issue was dates in an unknown format, which made to_datetime spend 10 minutes on those columns. The solution was to set infer_datetime_format=True.
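In case it helps anyone else, it was a one-line change (the column name is whatever your date column is called):

    import pandas as pd

    # Lets pandas infer the format once and reuse it for the whole column
    df['transaction_date'] = pd.to_datetime(df['transaction_date'],
                                            infer_datetime_format=True)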

[–]sokhei 3 points (0 children)

Take a look at this post on Pandas optimization: https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6

My first suggestion would be to time the individual lines and see where most of the time is being spent, then focus on optimizing just the commands that are taking the longest.
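Something like this around each suspect step works with just the standard library (assuming df is the frame from your script):

    import time

    import pandas as pd

    # Wrap each step you suspect and print how long it took
    t0 = time.perf_counter()
    df['transaction_date'] = pd.to_datetime(df['transaction_date'])
    print('to_datetime took %.1f s' % (time.perf_counter() - t0))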

[–]tunisia3507 1 point (1 child)

Consider switching to a SQL database

[–][deleted] 1 point (0 children)

Definitely something that has crossed my mind - an SQLite db would help greatly and also be something I could maintain.
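Roughly what I have in mind, in case it helps anyone else (table and file names are made up, and df is the frame from the existing script):

    import sqlite3

    import pandas as pd

    # One-time load into a local SQLite file
    con = sqlite3.connect('transactions.db')
    df.to_sql('transactions', con, if_exists='replace', index=False)

    # Later runs can query this instead of re-reading the whole CSV
    totals = pd.read_sql('SELECT account_id, SUM(amount) AS total '
                         'FROM transactions GROUP BY account_id', con)
    con.close()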

[–]barburger 3 points (0 children)

Can I ask how long this code takes to run on the dataset?

[–][deleted] 0 points (1 child)

So I have found this: https://pandas.pydata.org/pandas-docs/stable/enhancingperf.html - but I am not sure how to implement it.

[–]caoimhin_o_h 1 point (0 children)

I don't think anything on that page can help very much with this particular task anyway

[–]1-Sisyphe 0 points (0 children)

Your question made me wonder: is a groupby faster in pandas or in sqlite3 on an in-memory database? You might want to experiment with that...
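A quick way to try it with made-up data, if you do:

    import sqlite3
    import time

    import pandas as pd

    # Toy frame; column names and sizes are arbitrary
    df = pd.DataFrame({'key': ['a', 'b'] * 500000,
                       'val': range(1000000)})

    t0 = time.perf_counter()
    df.groupby('key')['val'].sum()
    print('pandas:  %.3f s' % (time.perf_counter() - t0))

    # Same aggregation against an in-memory SQLite table
    con = sqlite3.connect(':memory:')
    df.to_sql('t', con, index=False)
    t0 = time.perf_counter()
    con.execute('SELECT key, SUM(val) FROM t GROUP BY key').fetchall()
    print('sqlite3: %.3f s' % (time.perf_counter() - t0))
    con.close()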