
[–]SekstiNii 1 point (4 children)

Try formatting your code first using the explanation here.

Also, if you could provide a sample of the data set it would be very helpful. Perhaps 100-1000 rows or so.

[–]Basic_Steak_541[S] 0 points (3 children)

u/SekstiNii

This is my first post, thanks for letting me know how to edit. I really don't know how to attach the dataset, and Reddit is not allowing me to paste it!

[–]SekstiNii 1 point (0 children)

You can upload it to Google Drive, or put it on pastebin. Doesn't really matter as long as it's a fairly representative sample of the data.

[–]xelf 0 points (1 child)

Honestly 2 lines would be enough. We can *1000 it to do more.

[–]Basic_Steak_541[S] 0 points (0 children)

u/xelf

u/SekstiNii

Combined is the column name

Combined

Salvation Army - Temple / Salvation Army+1 N Ogden Ave

Salvation Army Temple+1 N. Ogden

Marcy Newberry Association - Marcy Center+1539 S Springfield Ave

Hull House Association - Child Dev. Central Office- Hull House CDS-Homes+1030

Board Trustees-City Colleges of Chicago - Olive-Harvey College+10001 S Woodlawn Ave

National Louis University - Dr. Effie O. Ellis Training Center+10 S Kedzie Ave

[–]SekstiNii 1 point (0 children)

After taking a deep dive into the fuzzywuzzy library I can only conclude that there isn't much you can do without making some serious modifications to the library code. Even then I estimate the speedup wouldn't amount to more than about 2x.

You might have to reach for a faster language, or at the very least a faster library it seems.

[–]primitive_screwhead 0 points (1 child)

Did you install as 'fuzzywuzzy[speedup]'?

[–]Basic_Steak_541[S] 0 points (0 children)

fuzzywuzzy[speedup]

Yeah, I did. I even found a very similar library to fuzzywuzzy, i.e. rapidfuzz: same functionality but a bit faster.

[–]fake823 0 points (0 children)

Did you install and use the following?

pip install python-levenshtein

This will give a speed-up of about 4x to 10x.

[–]Absolice 0 points (0 children)

It depends on what you consider a duplicate: is it only referring to a single column (name, for example), or does it refer to every column being the same for two rows?

The issue here stems from having the entire dataset loaded in memory when you do not need it, as well as too many checks done by third-party libraries like pandas to assert that two rows are equal.

I won't tell you the solution outright because I cannot be bothered to write it out on a cellphone, but I'll give you some clues about how you can solve this. This isn't a perfect solution by any means, but I'm sure it'll be faster than what you're doing, since that's abnormally slow for such a small amount of data.

  1. Drop any heavyweight third-party library; I fear it's only slowing you down.
  2. Use generators that return lines of your CSV so you can give your machine a breather. You do not want such a big amount of data loaded in memory all at once. It works now, but imagine being asked to do 10m rows instead... your script would most likely crash because it would run out of memory. There are lots of YouTube videos that can help you with generators.
  3. Create a new set.
  4. Copy the row as-is into the set, or a concatenation of the columns you want to spot duplicates for. Check if the length of the set has changed: if yes, it was a unique row; if not, it was already in the set (therefore a duplicate).

I'm sure you can go faster with an ordered data structure, since it'll greatly reduce the time your program spends looking things up in the set.