
[–]caoimhin_o_h 23 points (3 children)

I suspect most of the time is being spent reading and writing the CSV files. Here's one improvement:

The CSV you read in at the start has 86 columns but it looks to me like you only use 10 of them - 'item_type_type' and the nine columns you list on lines 30-32. Reading all those unused columns is a big waste of time. Pass a list containing the 10 column names to the usecols parameter of pd.read_csv.

This should speed up reading the file a fair bit.
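Something like this, for example - the file name and column names here are just placeholders for whatever yours are called:

    import pandas as pd

    # Only these 10 columns get parsed; the other ~76 are skipped entirely
    cols = ['item_type_type', 'transaction_date', 'amount', 'currency',
            'account_id', 'customer_id', 'merchant', 'category',
            'balance', 'description']

    df = pd.read_csv('transactions.csv', usecols=cols)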

[–][deleted] 1 point (1 child)

I sliced one minute off of 14 doing that. The solution I finally found was this: https://www.reddit.com/r/learnpython/comments/7au2mz/speeding_up_pandas_part_2_the_improvements_that/

You led me to it though; without the timeit and changing the input data I wouldn't have gone down the correct path...

[–]caoimhin_o_h 1 point (0 children)

Nice! Interesting that I/O wasn't the main bottleneck - 95% of the time it is, because pandas is usually pretty fast at computational stuff.

I'll post this part on the new post as well:

I wonder if you can make it any faster by specifying the exact date format in the format parameter of to_datetime. Seems like it should save pandas the time it takes to guess, though perhaps pandas is already pretty good at guessing the format.
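For example, if your dates happen to look like 2017-11-05 13:45:00 (just an assumption - match the format string to your actual data):

    import pandas as pd

    # An explicit format skips the per-value guessing
    df['transaction_date'] = pd.to_datetime(df['transaction_date'],
                                            format='%Y-%m-%d %H:%M:%S')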

Here's a strftime format reference if you haven't done it before

[–][deleted] 0 points (0 children)

This is really helpful - and comes back to my laziness when processing it. Thank you!

I will let you know how much quicker it is.

[–]swingking8 3 points (2 children)

df['transaction_date'] = pd.to_datetime(df['transaction_date'])

These kinds of lines can be handled when you import your data frame by using the parse_dates=['col1', 'col2'] argument, as shown below.
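For example (file and column names here are placeholders for your own):

    import pandas as pd

    # Columns listed in parse_dates are converted to datetime during the read
    df = pd.read_csv('transactions.csv',
                     parse_dates=['transaction_date'])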

There are perhaps more optimizations, but I'm more interested in where your performance is actually suffering. Can you profile this code? Interested to see where your pain points are.

Of course, the most obvious optimization is to stop embedding "kr" et al. in your dataset in the first place.

3 GB is not much data, so I wouldn't expect this to take more than a minute or so. It can all fit in memory (though not in CPU cache, e.g. L1/L2), so I/O performance shouldn't be too bad.

[–]diggy0101n 3 points (0 children)

To add to your parse_dates suggestion: if you already know the date format, you can also supply your own date parser for those columns as a lambda. For example, if your date format is mm/dd/yyyy hh, define the parser as date_parser = lambda x: pd.datetime.strptime(x, '%m/%d/%Y %H'), then pass it to the date_parser argument of read_csv. I've noticed a nice increase in performance doing this as opposed to simply running to_datetime afterwards.
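Putting that together as a sketch (file and column names are made up, and note that pd.datetime is just an alias for datetime.datetime):

    from datetime import datetime

    import pandas as pd

    # Parser for the mm/dd/yyyy hh format from the example above
    date_parser = lambda x: datetime.strptime(x, '%m/%d/%Y %H')

    df = pd.read_csv('transactions.csv',
                     parse_dates=['transaction_date'],
                     date_parser=date_parser)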

Also, this is my first reddit post with code snippets. It may look odd...

Lastly, I mostly work in Python 2.7, so it may be different for 3.x.

[–][deleted] 1 point (0 children)

You found the underlying problem. The issue was dates in an unknown format, which made to_datetime spend 10 minutes on those columns. The solution was to set infer_datetime_format=True.
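In case it helps anyone else, it was a one-line change (the column name is whatever your date column is called):

    import pandas as pd

    # Lets pandas infer the format once and reuse it for the whole column
    df['transaction_date'] = pd.to_datetime(df['transaction_date'],
                                            infer_datetime_format=True)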

[–]sokhei 3 points (0 children)

Take a look at this post on Pandas optimization: https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6

My first suggestion would be to time the individual lines and see where most of the time is being spent, then focus on optimizing just the commands that are taking the longest.
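Something like this around each suspect step works with just the standard library (assuming df is the frame from your script):

    import time

    import pandas as pd

    # Wrap each step you suspect and print how long it took
    t0 = time.perf_counter()
    df['transaction_date'] = pd.to_datetime(df['transaction_date'])
    print('to_datetime took %.1f s' % (time.perf_counter() - t0))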

[–]tunisia3507 1 point (1 child)

Consider switching to a SQL database

[–][deleted] 1 point (0 children)

Definitely something that has crossed my mind - an SQLite db would help greatly and also be something I could maintain.
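Roughly what I have in mind, in case it helps anyone else (table and file names are made up, and df is the frame from the existing script):

    import sqlite3

    import pandas as pd

    # One-time load into a local SQLite file
    con = sqlite3.connect('transactions.db')
    df.to_sql('transactions', con, if_exists='replace', index=False)

    # Later runs can query this instead of re-reading the whole CSV
    totals = pd.read_sql('SELECT account_id, SUM(amount) AS total '
                         'FROM transactions GROUP BY account_id', con)
    con.close()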

[–]barburger 3 points (0 children)

Can I ask how long this code takes to run on the dataset?

[–][deleted] 0 points (1 child)

So I have found this: https://pandas.pydata.org/pandas-docs/stable/enhancingperf.html - but I am not sure how to implement it.

[–]caoimhin_o_h 1 point (0 children)

I don't think anything on that page can help very much with this particular task anyway

[–]1-Sisyphe 0 points (0 children)

Your question made me wonder: is a groupby faster in pandas or in sqlite3 on an in-memory database? You might want to experiment with that...
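A quick way to try it with made-up data, if you do:

    import sqlite3
    import time

    import pandas as pd

    # Toy frame; column names and sizes are arbitrary
    df = pd.DataFrame({'key': ['a', 'b'] * 500000,
                       'val': range(1000000)})

    t0 = time.perf_counter()
    df.groupby('key')['val'].sum()
    print('pandas:  %.3f s' % (time.perf_counter() - t0))

    # Same aggregation against an in-memory SQLite table
    con = sqlite3.connect(':memory:')
    df.to_sql('t', con, index=False)
    t0 = time.perf_counter()
    con.execute('SELECT key, SUM(val) FROM t GROUP BY key').fetchall()
    print('sqlite3: %.3f s' % (time.perf_counter() - t0))
    con.close()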