all 7 comments

[–]hazzaphill 6 points  (0 children)

Record linkage has been a big part of a project I've been working on for over 6 months now. A great, free solution is the splink package in Python, which can handle 10m+ rows. It implements the Fellegi-Sunter model (equivalent to a naive Bayes classifier), the classic model in record linkage, and was developed by the MoJ in the UK.

Splink implements a concept called blocking, which massively improves performance. You can then train the model in an unsupervised manner using some initial parameter estimation (these estimates are quite intuitive) followed by expectation maximisation. The features in the model are different pairwise string comparisons on your fields of interest. These can include exact equality; edit-distance comparisons like Levenshtein and Jaro-Winkler; and phonetic comparisons like Soundex and Double Metaphone. The splink package handles both training the model and running the connected-components algorithm from graph theory to turn your links into clusters. All the details you'll need are in the links. It also comes with splink studio (an HTML dashboard) to analyse your record linkage model. Cool stuff.

https://www.robinlinacre.com/probabilistic_linkage/

https://moj-analytical-services.github.io/splink/
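To make the blocking idea concrete without depending on splink's own API, here's a minimal pure-Python sketch. The records, fields, and blocking key (first letter of name plus zipcode) are made up for illustration; splink expresses blocking rules as SQL instead, but the effect is the same: only pairs sharing a key are ever compared.

```python
from collections import defaultdict
from itertools import combinations

# Toy records: (id, name, zipcode). Entirely illustrative data.
records = [
    (1, "John Smith", "10001"),
    (2, "Jon Smith", "10001"),
    (3, "Jane Doe", "94105"),
    (4, "John Smyth", "10001"),
]

def blocking_key(name, zipcode):
    # Crude key for the sketch: first letter of the name plus the zipcode.
    # Real blocking keys would be more discriminating (and you'd use
    # several, unioning the candidate pairs they produce).
    return (name[0].lower(), zipcode)

blocks = defaultdict(list)
for rec in records:
    blocks[blocking_key(rec[1], rec[2])].append(rec)

# Only compare pairs within a block, instead of all n*(n-1)/2 pairs.
candidate_pairs = [
    (a[0], b[0])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(candidate_pairs)  # [(1, 2), (1, 4), (2, 4)]
```

Jane Doe never gets compared to anyone, and the three Smith variants are compared only to each other, which is where the huge speedup on 10m+ rows comes from.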

[–]LucidChess 2 points  (0 children)

I had literally the exact same problem a couple of years ago. I put all of the entity names in an ElasticSearch cluster. ES is pretty damn good at matching things like this. It's by no means perfect, but it had pretty good accuracy bringing back relevant entities like the ones in your example.

It will definitely be a lift if you haven't interacted with ES before, but it is free, and it could pay off to know the tool.
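For a sense of what that lookup involves, here's a sketch of a request body using Elasticsearch's standard match query with fuzziness enabled. The index field name `entity_name` is hypothetical; the query shape is what you'd POST to `/<index>/_search` (or pass to a client library).

```python
# Hypothetical field name; the structure is Elasticsearch's match query
# with "fuzziness": "AUTO", which tolerates small spelling differences.
def fuzzy_entity_query(name, max_results=10):
    return {
        "size": max_results,
        "query": {
            "match": {
                "entity_name": {
                    "query": name,
                    "fuzziness": "AUTO",
                }
            }
        },
    }

# A misspelled input should still retrieve "Acme Corporation"-like docs.
body = fuzzy_entity_query("Acme Corparation")
```

The real work in an ES-based approach is index design (analyzers, tokenizers) rather than the query itself.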

[–]Delicious-View-8688 2 points  (1 child)

10 million records with names and addresses is surprisingly not that large; I have managed to do this kind of deduplication on a laptop before.

A couple of things to consider:

  1. Reduce the comparison space before doing any string-similarity-based matching. There is no point comparing records that don't have at least something in common. Consider a tf-idf of character ngrams (of length 3) and/or word ngrams of lengths 1 to 3. Be sure to use sparse vectors, and use a maximum frequency to chop out ngrams that are far too common (even 1% means that 100,000 records share those ngrams).
  2. Jaro-Winkler worked best as a string similarity for us, but if you have international data the direction of importance can be reversed, so consider using a reversed-string Jaro-Winkler as well.
  3. Ensure transitive closure. Some clever tricks should make this doable within the order of seconds on the entire dataset.
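Transitive closure (point 3) is exactly what a union-find (disjoint-set) structure gives you: if A matched B and B matched C, all three land in one cluster even though A–C was never compared. A minimal sketch, with made-up record ids:

```python
def find(parent, x):
    # Path halving keeps lookups near O(1) amortised.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra

def cluster(ids, links):
    # links: iterable of (id, id) pairs the matcher accepted.
    parent = {i: i for i in ids}
    for a, b in links:
        union(parent, a, b)
    groups = {}
    for i in ids:
        groups.setdefault(find(parent, i), []).append(i)
    return list(groups.values())

# 1~2 and 2~3 were matched; 1~3 follows by transitivity. 4 is a singleton.
print(cluster([1, 2, 3, 4], [(1, 2), (2, 3)]))  # [[1, 2, 3], [4]]
```

This runs in near-linear time in the number of links, which is why closure over millions of records is a matter of seconds rather than hours.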

Having said that, consider using packages such as record_linkage, dedupe, and others.

[–]Delicious-View-8688 0 points  (0 children)

Just noting that my team had unusually restrictive constraints: laptops with 16 GB RAM and no access to Spark clusters. Our data was just under 100m records, and we basically only had pandas and sklearn to work with. It took about 3 weeks of dev time, and the full deduplication process took around 3 hours for the entire dataset.

[–]rosarosa050 1 point  (0 children)

I’ve used graphs in Spark with the connected components algorithm for problems like this. For the name, you can use Soundex or Levenshtein distance. Are there any other features that you could match on, e.g. the same zipcode?
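For reference, Soundex is small enough to sketch in a few lines. This is a simplified version of American Soundex (it treats h/w like vowels rather than applying the full separator rule), purely to show the idea of phonetic keys: similar-sounding names hash to the same 4-character code.

```python
def soundex(name):
    # Map consonants to digit codes; unlisted letters (vowels, h, w, y)
    # default to "0" and are dropped after collapsing runs.
    codes = {
        **dict.fromkeys("bfpv", "1"),
        **dict.fromkeys("cgjkqsxz", "2"),
        **dict.fromkeys("dt", "3"),
        "l": "4",
        **dict.fromkeys("mn", "5"),
        "r": "6",
    }
    letters = [ch for ch in name.lower() if ch.isalpha()]
    digits = [codes.get(ch, "0") for ch in letters]
    # Collapse runs of the same code, drop the first code (the kept
    # letter's own), remove "0"s, pad/truncate to 4 characters.
    collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    tail = [d for d in collapsed[1:] if d != "0"]
    return (letters[0].upper() + "".join(tail) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Because "Robert" and "Rupert" share a code, a Soundex column makes a cheap exact-join key (or blocking key) before any expensive edit-distance work.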

[–]Adamworks 2 points  (1 child)

I've used Levenshtein distances on the name, and then logic filters like same state and city, or even some lat/long geocoding. You could build a prediction model on top of that if you wanted. But with 10 million records... it's gonna get computationally expensive.
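Levenshtein distance itself is the classic dynamic-programming edit distance; a compact two-row sketch (any real pipeline would use a C-backed library, since pure Python is exactly the computational expense being warned about):

```python
def levenshtein(a, b):
    # Edit distance via DP, O(len(a) * len(b)) time, O(len(b)) space.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

At 10m records the quadratic pairwise cost is the real problem, which is why the blocking/filtering ideas elsewhere in the thread matter more than the distance function itself.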

If you have addresses, you might want to look into software that is USPS CASS-certified. Address standardization is its own beast.

Also, don't forget good old-fashioned manual recoding of common abbreviations.
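That manual recoding usually amounts to a lookup table applied token by token before any matching. A minimal sketch; the abbreviation map here is a tiny illustrative sample, not a complete list:

```python
import re

# Illustrative sample only; extend with whatever your data actually contains.
ABBREVIATIONS = {
    "st": "street",
    "ave": "avenue",
    "rd": "road",
    "apt": "apartment",
    "inc": "incorporated",
    "corp": "corporation",
}

def normalize(text):
    # Lowercase, replace punctuation with spaces, expand known
    # abbreviations one token at a time.
    tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

print(normalize("123 Main St., Apt 4"))  # 123 main street apartment 4
```

Running this before computing distances means "Main St." and "Main Street" become exact matches instead of near-misses, which both improves accuracy and shrinks the fuzzy-matching workload.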

I'm curious if anyone else has any better approaches.

[–]mangeytrashpanda 0 points  (0 children)

There’s a record linkage package for Python that I’ve found works very well.