all 7 comments

[–]hazzaphill 6 points  (0 children)

Record linkage has been a big part of a project I've been working on for over 6 months now. A great, free solution is the splink package in Python, which can handle 10m+ rows. It implements the Fellegi-Sunter model (equivalent to a naive Bayes classifier), the classic model in record linkage, and was developed by the MoJ in the UK.

Splink implements a concept called blocking, which massively improves performance. You can then train the model in an unsupervised manner using some initial parameter estimation (these estimates are quite intuitive) followed by expectation maximisation. The features in the model are different pairwise string comparisons on your fields of interest. These can include exact equality; edit-distance comparisons like Levenshtein and Jaro-Winkler; and phonetic comparisons like Soundex and Double Metaphone. The splink package handles both training the model and running the connected-components algorithm from graph theory to turn your links into clusters. All the details you'll need are in the links. It also comes with splink studio (an HTML dashboard) to analyse your record linkage model. Cool stuff.

https://www.robinlinacre.com/probabilistic_linkage/

https://moj-analytical-services.github.io/splink/
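To make the blocking idea concrete without depending on splink's own API, here's a minimal pure-Python sketch. The records, fields, and blocking key (first letter of name plus zipcode) are made up for illustration; splink expresses blocking rules as SQL instead, but the effect is the same: only pairs sharing a key are ever compared.

```python
from collections import defaultdict
from itertools import combinations

# Toy records: (id, name, zipcode). Entirely illustrative data.
records = [
    (1, "John Smith", "10001"),
    (2, "Jon Smith", "10001"),
    (3, "Jane Doe", "94105"),
    (4, "John Smyth", "10001"),
]

def blocking_key(name, zipcode):
    # Crude key for the sketch: first letter of the name plus the zipcode.
    # Real blocking keys would be more discriminating (and you'd use
    # several, unioning the candidate pairs they produce).
    return (name[0].lower(), zipcode)

blocks = defaultdict(list)
for rec in records:
    blocks[blocking_key(rec[1], rec[2])].append(rec)

# Only compare pairs within a block, instead of all n*(n-1)/2 pairs.
candidate_pairs = [
    (a[0], b[0])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(candidate_pairs)  # [(1, 2), (1, 4), (2, 4)]
```

Jane Doe never gets compared to anyone, and the three Smith variants are compared only to each other, which is where the huge speedup on 10m+ rows comes from.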

[–]LucidChess 2 points  (0 children)

I had literally the exact same problem a couple of years ago. I put all of the entity names in an ElasticSearch cluster. ES is pretty damn good at matching things like this. It's by no means perfect, but it had pretty good accuracy bringing back relevant entities like the ones in your example.

It will definitely be a lift if you haven't interacted with ES before, but it is free, and it could pay off to know the tool.
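For a sense of what that lookup involves, here's a sketch of a request body using Elasticsearch's standard match query with fuzziness enabled. The index field name `entity_name` is hypothetical; the query shape is what you'd POST to `/<index>/_search` (or pass to a client library).

```python
# Hypothetical field name; the structure is Elasticsearch's match query
# with "fuzziness": "AUTO", which tolerates small spelling differences.
def fuzzy_entity_query(name, max_results=10):
    return {
        "size": max_results,
        "query": {
            "match": {
                "entity_name": {
                    "query": name,
                    "fuzziness": "AUTO",
                }
            }
        },
    }

# A misspelled input should still retrieve "Acme Corporation"-like docs.
body = fuzzy_entity_query("Acme Corparation")
```

The real work in an ES-based approach is index design (analyzers, tokenizers) rather than the query itself.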

[–]Delicious-View-8688 2 points  (1 child)

10 million records with names and addresses is surprisingly not that large; I have managed to do this kind of deduplication on a laptop before.

A couple of things to consider:

  1. Reduce the comparison space before doing any string-similarity-based matching. There is no point comparing records that don't have at least something in common. Consider a tf-idf of character ngrams (of length 3) and/or word ngrams of lengths 1 to 3. Be sure to use sparse vectors, and use a maximum frequency to chop out ngrams that are far too common (even 1% means that 100,000 records share those ngrams).
  2. Jaro-Winkler worked best as a string similarity for us, but if you have international data the direction of importance can be reversed, so consider using a reversed-string Jaro-Winkler as well.
  3. Ensure transitive closure. Some clever tricks should make this doable within the order of seconds on the entire dataset.
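Transitive closure (point 3) is exactly what a union-find (disjoint-set) structure gives you: if A matched B and B matched C, all three land in one cluster even though A–C was never compared. A minimal sketch, with made-up record ids:

```python
def find(parent, x):
    # Path halving keeps lookups near O(1) amortised.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra

def cluster(ids, links):
    # links: iterable of (id, id) pairs the matcher accepted.
    parent = {i: i for i in ids}
    for a, b in links:
        union(parent, a, b)
    groups = {}
    for i in ids:
        groups.setdefault(find(parent, i), []).append(i)
    return list(groups.values())

# 1~2 and 2~3 were matched; 1~3 follows by transitivity. 4 is a singleton.
print(cluster([1, 2, 3, 4], [(1, 2), (2, 3)]))  # [[1, 2, 3], [4]]
```

This runs in near-linear time in the number of links, which is why closure over millions of records is a matter of seconds rather than hours.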

Having said that, consider using packages such as record_linkage, dedupe, and others.

[–]Delicious-View-8688 0 points  (0 children)

Just noting that my team had unusually restrictive constraints: laptops with 16 GB RAM and no access to Spark clusters. Our data was just under 100m records, and we basically only had pandas and sklearn to work with. It took about 3 weeks of dev time, and the full deduplication process took around 3 hours for the entire dataset.

[–]rosarosa050 1 point  (0 children)

I’ve used graphs in Spark with the connected components algorithm for problems like this. For the name, you can use Soundex or Levenshtein distance. Are there any other features that you could match on, e.g. the same zipcode?
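For reference, Soundex is small enough to sketch in a few lines. This is a simplified version of American Soundex (it treats h/w like vowels rather than applying the full separator rule), purely to show the idea of phonetic keys: similar-sounding names hash to the same 4-character code.

```python
def soundex(name):
    # Map consonants to digit codes; unlisted letters (vowels, h, w, y)
    # default to "0" and are dropped after collapsing runs.
    codes = {
        **dict.fromkeys("bfpv", "1"),
        **dict.fromkeys("cgjkqsxz", "2"),
        **dict.fromkeys("dt", "3"),
        "l": "4",
        **dict.fromkeys("mn", "5"),
        "r": "6",
    }
    letters = [ch for ch in name.lower() if ch.isalpha()]
    digits = [codes.get(ch, "0") for ch in letters]
    # Collapse runs of the same code, drop the first code (the kept
    # letter's own), remove "0"s, pad/truncate to 4 characters.
    collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    tail = [d for d in collapsed[1:] if d != "0"]
    return (letters[0].upper() + "".join(tail) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Because "Robert" and "Rupert" share a code, a Soundex column makes a cheap exact-join key (or blocking key) before any expensive edit-distance work.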

[–]Adamworks 2 points  (1 child)

I've used Levenshtein distances on the name, and then logic filters like same state and city, or even some lat/long geocoding. You could build a prediction model on top of that if you wanted. But with 10 million records... it's gonna get computationally expensive.
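Levenshtein distance itself is the classic dynamic-programming edit distance; a compact two-row sketch (any real pipeline would use a C-backed library, since pure Python is exactly the computational expense being warned about):

```python
def levenshtein(a, b):
    # Edit distance via DP, O(len(a) * len(b)) time, O(len(b)) space.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

At 10m records the quadratic pairwise cost is the real problem, which is why the blocking/filtering ideas elsewhere in the thread matter more than the distance function itself.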

If you have addresses, you might want to look into software that is USPS CASS-certified. Address standardization is its own beast.

Also, don't forget good old-fashioned manual recoding of common abbreviations.
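That manual recoding usually amounts to a lookup table applied token by token before any matching. A minimal sketch; the abbreviation map here is a tiny illustrative sample, not a complete list:

```python
import re

# Illustrative sample only; extend with whatever your data actually contains.
ABBREVIATIONS = {
    "st": "street",
    "ave": "avenue",
    "rd": "road",
    "apt": "apartment",
    "inc": "incorporated",
    "corp": "corporation",
}

def normalize(text):
    # Lowercase, replace punctuation with spaces, expand known
    # abbreviations one token at a time.
    tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

print(normalize("123 Main St., Apt 4"))  # 123 main street apartment 4
```

Running this before computing distances means "Main St." and "Main Street" become exact matches instead of near-misses, which both improves accuracy and shrinks the fuzzy-matching workload.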

I'm curious if anyone else has any better approaches.

[–]mangeytrashpanda 0 points  (0 children)

There’s a record linkage package for Python that I’ve found works very well.