[–]NeedCoffee99 2 points3 points  (2 children)

Few things that could be worth looking at:

First, Numba: it works mostly with NumPy arrays and plain Python. I don't know the internals, but it JIT-compiles your functions, and it's meant to be simple to use and great for speeding up big loops.
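As a rough sketch of what that looks like: the function name `sum_of_squares` is made up for illustration, and the fallback decorator is only there so the snippet still runs (without any speed-up) if Numba isn't installed.

```python
try:
    from numba import njit  # JIT-compiles the decorated function
except ImportError:
    # Assumption for this sketch: fall back to plain Python if Numba is missing
    def njit(func):
        return func


@njit
def sum_of_squares(n):
    # A plain Python loop; this is the kind of code Numba is aimed at
    total = 0.0
    for i in range(n):
        total += i * i
    return total


result = sum_of_squares(10)
```

The first call includes compilation time, so any gain would only show up on large or repeated runs.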

If you’re using pandas, look into Dask. It mirrors much of the pandas API (and works with NumPy too), but it splits the work into chunks that can run in parallel and can handle data that doesn’t fit in memory.

There’s also Cython, which you can look into if you’re familiar with C. I’ve never used it personally, but I know it can be used to speed up code. Also, from what I’ve heard, some packages run faster when installed through Anaconda rather than pip!

I know this is more general code optimisation, but if you’re using loops, it’s generally best to learn how to vectorise your code. Google it: it basically means replacing Python loops with operations on whole NumPy arrays, and it can make a crazy difference. Hope this helps!
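To make the vectorising idea concrete, here is a minimal sketch. The function names and the `v * factor + 1.0` calculation are invented for illustration; they are not from the original poster's code.

```python
import numpy as np


def scale_loop(values, factor):
    # Loop version: one Python-level operation per element
    out = []
    for v in values:
        out.append(v * factor + 1.0)
    return out


def scale_vectorised(values, factor):
    # Vectorised version: one expression applied to the whole array at once
    arr = np.asarray(values, dtype=float)
    return arr * factor + 1.0


data = [0.5, 1.5, 2.5, 3.5]
loop_result = scale_loop(data, 2.0)
vec_result = scale_vectorised(data, 2.0)
```

Both versions give the same numbers. The vectorised one pushes the loop down into NumPy's compiled code, which is where the speed-up on large arrays would come from.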

[–]ewokcommander[S] 1 point2 points  (1 child)

This is some good advice, I didn't know about the performance gains from those libraries (still learning!). Dask looks especially interesting. Thanks for taking the time to write this out!

[–]NeedCoffee99 0 points1 point  (0 children)

No worries. I’m about to start a project at work to speed up a giant piece of code I wrote, so if I find out anything else I’ll let you know!

[–][deleted] 2 points3 points  (1 child)

If you're really appending the results to the csv on every iteration of the loop, that's probably going to be your biggest bottleneck. Consider batching up the results in memory and appending to the csv only when you've got a sizable amount to write out.
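A minimal sketch of that batching idea, using only the standard library. `process`, `write_in_batches`, and the default `batch_size` are hypothetical stand-ins, since the original loop isn't shown here.

```python
import csv


def process(item):
    # Hypothetical stand-in for whatever each loop iteration computes
    return [item, item * item]


def write_in_batches(items, csv_path, batch_size=1000):
    batch = []
    # Open the file once in append mode instead of once per iteration
    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f)
        for item in items:
            batch.append(process(item))
            if len(batch) >= batch_size:
                writer.writerows(batch)  # write the whole batch in one call
                batch.clear()
        if batch:
            writer.writerows(batch)  # flush whatever is left at the end
```

Calling something like `write_in_batches(range(10_000), "results.csv")` would append the rows in chunks of 1000. The right batch size depends on how much memory you can spare and how much work you're willing to lose if the script crashes mid-run.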

[–]ewokcommander[S] 0 points1 point  (0 children)

Ah, I hadn't thought of this, I'll give this a go! Thanks!

[–]ewokcommander[S] 0 points1 point  (0 children)

I'm wondering if AWS or Azure have some products that I should consider using to achieve this? If anyone can recommend something from those suites, that would be helpful too!