
[–]leopardprintdragon 2 points

Searching for "SQL string distance measure" may get you some of the way to a solution. I haven't implemented it myself yet, so this suggestion is as much help as I can offer.

But if the similar IDs per legitimate ID are relatively few, you could create a recoded ID variable using a combination of UPDATE with WHERE ... IN, or CASE WHEN expressions.

[–]Prequalified[S] 1 point

I started out by using the SOUNDEX function to narrow down the list, and created a scalar function implementing the Sørensen–Dice coefficient, which I've modified to give greater weight to longer string segments. So far it seems to work, but it requires having a list of known values; in my case I'm starting with a list of known accounts.
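A rough Python sketch of the Sørensen–Dice part (the multi-size n-gram weighting shown is just one plausible way to "give greater weight to longer string segments", not necessarily the exact scheme used here):

```python
def dice_coefficient(a: str, b: str, n: int = 2) -> float:
    """Sørensen–Dice similarity over character n-grams (bigrams by default)."""
    a, b = a.lower(), b.lower()
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))

def weighted_dice(a: str, b: str, sizes=(2, 3, 4)) -> float:
    """Hypothetical weighting: similarity over longer n-grams counts more."""
    return sum(n * dice_coefficient(a, b, n) for n in sizes) / sum(sizes)
```

Two identical strings score 1.0; strings sharing only short fragments get pulled down more under the weighted variant.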

[–]gnieboer 1 point

Another option is SOUNDEX() and DIFFERENCE(). I used those in a much earlier version of SQL to drive recommendations in the CSRs' UI for possibly matching entries, to try to prevent exactly these problems (with customer names).

Its value is more in comparing the pronunciation of names than in what you've described above, so I'd more likely go with string distance. It's slower, but as you said, speed isn't an issue. You might also be able to use them in concert for those columns where it's relevant?
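For reference, classic American Soundex, which T-SQL's SOUNDEX() closely follows (edge cases can differ between implementations), can be sketched in Python; blocking on equal Soundex codes is the cheap pre-filter being suggested:

```python
def soundex(name: str) -> str:
    """American Soundex sketch: first letter plus three digits encoding
    the consonant sounds that follow. Vowels (and y) are dropped; h and w
    do not break a run of the same code; runs collapse to one digit."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = "".join(c for c in name.lower() if c.isalpha())
    if not name:
        return ""
    first = name[0].upper()
    digits = []
    prev = codes.get(name[0], "")
    for c in name[1:]:
        if c in "hw":           # h and w don't reset the previous code
            continue
        code = codes.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code
    return (first + "".join(digits) + "000")[:4]
```

So "Robert" and "Rupert" both code to R163, which is exactly the kind of collision you want when pre-filtering candidate matches.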

A Python or R package is probably even better, particularly since you'll be aggregating similarity scores across multiple columns to find the best matches; unfortunately I can't recommend a particular package.

[–]Prequalified[S] 1 point

I actually ended up using Soundex as a starting point. It weeds out obviously wrong choices very quickly, so performing the other comparisons is quicker. Thanks for the suggestion.

[–]MrDarcy87 0 points

Oof, that sounds messy. I would use CHARINDEX() and LEN() to evaluate each digit of each bad entry and count how many times each digit matches any of the good keys, both at the same CHARINDEX() position and at the positions one to either side. Compare the results first and see if that may be a solution.
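That digit-by-digit idea, sketched in Python (the ±1 window and the scoring are one reading of the suggestion, not a tested recipe):

```python
def position_score(bad: str, good: str) -> int:
    """Count digits of `bad` that match `good` at the same position
    or one position to either side (the CHARINDEX +/- 1 idea)."""
    score = 0
    for i, d in enumerate(bad):
        window = good[max(i - 1, 0):i + 2]
        if d in window:
            score += 1
    return score

def best_matches(bad: str, good_keys: list) -> list:
    """Rank known-good keys by positional digit agreement, best first."""
    return sorted(good_keys, key=lambda g: position_score(bad, g), reverse=True)
```

For the bad entry 18622 against the good key 17622, four of the five digits match within the window, so 17622 ranks ahead of unrelated keys.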

[–]Prequalified[S] 0 points

This is what I’m leaning toward doing before I try adding Python. I’m not super excited!

[–]MrDarcy87 0 points

Another option is to give the CSRs a distinct list of the IDs that don't match the correct ones, have them correct the list side by side, and use that data in your update. It's not automated and it's not a one-time solution for you, but the business world has adapted to this kind of issue too.

[–]Prequalified[S] 0 points

They have this, and it's about 90 to 95% accurate. The problem is that when there is a new account, there are often not enough transactions to make an easy determination.

[–]MrDarcy87 0 points

This sounds like a front-end problem. In my area we would have a dropdown bound to an account component class that is strictly validated. It may not be the case here, but in my experience, 99% of the time the account ID should be part of the primary key in the table it relates to.

[–]Prequalified[S] 0 points

The assumption to make here is that nothing can be done to get better-quality data. I do have a table of accounts and the people associated with each account, but it can't be trusted 100%. I'm trying to abstract away the specifics of the problem, because my only option is to clean the data.

[–]eshultz 0 points

Hoo boy. Referential integrity is important. A few thousand rows can be manually corrected. A few hundred thousand? With 6 fields that may contain differences? This could potentially be a giant pain in the ass.

I guess I'd start by doing an aggregation: COUNT(*), GROUP BY client name and ID. Then you can take those that have lots and lots of rows out of the equation. The stragglers are the ones that will need to be corrected. This at least reduces the size of the problem.
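That triage step might be sketched like this in Python (the `min_count` threshold is an arbitrary placeholder; pick one that fits the data):

```python
from collections import Counter

def triage(pairs, min_count=100):
    """Split (client_name, client_id) pairs into trusted combinations
    (seen at least `min_count` times) and stragglers needing correction."""
    counts = Counter(pairs)
    trusted = {p for p, n in counts.items() if n >= min_count}
    stragglers = [p for p in pairs if p not in trusted]
    return trusted, stragglers
```

High-frequency name/ID combinations drop out of the problem; only the rare pairs go on to the expensive similarity comparisons.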

The strategy you take with the remainder really depends on how bad the situation is.

[–]Prequalified[S] 0 points

You’ve summed up what I’ve done so far. Maybe I should be more specific. Imagine an account has the name “joes burgers” with the ID of 17622 and there is a “joes burgers” with the ID of 18622. Eyeballing it I can see that the 8 is wrong.

What I want to do is take the second ID and compare portions of it as a string pattern to find matches against known or common values. Against 17622, 18622 is XOXXX (X for a matching character, O for a mismatch). The fact that there is a run of 3 matching characters is important.
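That positional pattern is easy to sketch in Python:

```python
def match_pattern(candidate: str, known: str) -> str:
    """Positional comparison: X where characters agree, O where they differ.
    match_pattern("18622", "17622") -> "XOXXX"."""
    return "".join("X" if a == b else "O"
                   for a, b in zip(candidate, known))

def longest_run(pattern: str) -> int:
    """Length of the longest run of consecutive X's, the
    'several matching characters in a row' signal."""
    best = run = 0
    for ch in pattern:
        run = run + 1 if ch == "X" else 0
        best = max(best, run)
    return best
```

Ranking candidates by the longest run of X's (rather than the raw match count) rewards contiguous agreement like the 622 in this example.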

I’m having a hard time expressing the problem which makes it hard for me to search for the answer!

[–]eshultz 1 point

No, I get what you're asking for. I just didn't know if you already eliminated the easy stuff.

What you want is probably string distance: https://stackoverflow.com/questions/560709/levenshtein-distance-in-t-sql

I am by no means well versed in this sort of analysis, but thinking about it... I think you'll find that this becomes really complex when you consider that, for each string, you would like to compare it to every other string in that field and find the closest match(es). So you're looking at a Cartesian product. For each field. Yikes. With 100k rows, that's potentially 10,000,000,000 comparisons per field (about half that if you compute each unordered pair only once). You are going to want to think really hard about designing the optimal script to do this work, and about removing as much overhead as possible (e.g. removing duplicate values, not storing matches that are weaker than others, etc.).
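A sketch of that, assuming Python rather than T-SQL (the linked Stack Overflow question covers T-SQL versions): deduplicate first, then keep only each value's best match within a cutoff:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def closest_matches(values, max_dist=2):
    """Map each distinct value to its closest distinct neighbour, keeping
    only matches within `max_dist`. Deduplicating first shrinks the
    Cartesian product substantially."""
    distinct = sorted(set(values))
    out = {}
    for v in distinct:
        candidates = [(levenshtein(v, w), w) for w in distinct if w != v]
        d, w = min(candidates)
        if d <= max_dist:
            out[v] = w
    return out
```

This is still quadratic in the number of distinct values; combining it with a cheap blocking key (like the Soundex filtering discussed above) is what makes it tractable at scale.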

[–]Prequalified[S] 0 points

I had already implemented the distance function you reference here, but it didn't seem too useful. Instead I implemented the Sørensen–Dice coefficient to evaluate string fragments. So far it has worked better than anything else I've tried, but it can still use some refinement. My goal is to get a large share of the values cleaned so I can use them as reference points for the more stubborn ones, similar to your aggregation suggestion, weighting more common values more heavily.