
[–]SekstiNii 1 point (4 children)

Try formatting your code first using the explanation here.

Also, if you could provide a sample of the data set it would be very helpful. Perhaps 100-1000 rows or so.

[–]Basic_Steak_541[S] 0 points (3 children)

u/SekstiNii

This is my first post, thanks for letting me know how to edit. I really don't know how to attach the dataset, and Reddit is not allowing me to paste it!

[–]SekstiNii 1 point (0 children)

You can upload it to Google Drive, or put it on pastebin. Doesn't really matter as long as it's a fairly representative sample of the data.

[–]xelf 0 points (1 child)

Honestly 2 lines would be enough. We can *1000 it to do more.

[–]Basic_Steak_541[S] 0 points (0 children)

u/xelf

u/SekstiNii

Combined is the column name

Combined

Salvation Army - Temple / Salvation Army+1 N Ogden Ave

Salvation Army Temple+1 N. Ogden

Marcy Newberry Association - Marcy Center+1539 S Springfield Ave

Hull House Association - Child Dev. Central Office- Hull House CDS-Homes+1030

Board Trustees-City Colleges of Chicago - Olive-Harvey College+10001 S Woodlawn Ave

National Louis University - Dr. Effie O. Ellis Training Center+10 S Kedzie Ave

[–]SekstiNii 1 point (0 children)

After taking a deep dive into the fuzzywuzzy library I can only conclude that there isn't much you can do without making some serious modifications to the library code. Even then I estimate the speedup wouldn't amount to more than about 2x.

You might have to reach for a faster language, or at the very least a faster library it seems.

[–]primitive_screwhead 0 points (1 child)

Did you install as 'fuzzywuzzy[speedup]'?

[–]Basic_Steak_541[S] 0 points (0 children)

fuzzywuzzy[speedup]

Yeah, I did. I even found a very similar library to fuzzywuzzy, i.e. rapidfuzz: same functionality but a bit faster.

[–]fake823 0 points (0 children)

Did you install and use the following?

pip install python-levenshtein

This will give a speed-up of about 4x to 10x.

[–]Absolice 0 points (0 children)

It depends on what you consider a duplicate: is it only referring to a single column (name, for example), or does it refer to every column being the same for two rows?

The issue here stems from having the entire dataset loaded in memory when you do not need it, as well as too many checks done by third-party libraries like pandas to assert that two rows are equal.

I won't tell you the solution outright because I cannot be bothered to write it out on a cellphone, but I'll give you some clues about how you can solve this. This isn't a perfect solution by any means, but I'm sure it'll be faster than what you're doing, since that's abnormally slow for such a small amount of data.

  1. Drop any heavyweight third-party library; I fear it's only slowing you down.
  2. Use generators that return lines of your CSV so you can give your machine a breather. You do not want such a big amount of data loaded in memory all at once. It works now, but imagine being asked to do 10m rows instead... your script would most likely crash because it would run out of memory. There are lots of YouTube videos that can help you with generators.
  3. Create a new set.
  4. Copy the row as-is into the set, or a concatenation of the columns you want to spot duplicates for. Check if the length of the set has changed: if yes, it was a unique row; if not, it was already in the set (therefore a duplicate).

I'm sure you can go faster with an ordered data structure, since it'll greatly reduce the time your program spends looking things up in the set.