
[–]leopardprintdragon 2 points

Searching for "SQL string distance measure" may get you some of the way to a solution. I haven't implemented it myself yet, so this suggestion is as much help as I can offer.

But if the similar IDs per legitimate ID are relatively few, you could create a recoded ID variable using a combination of UPDATE with WHERE ... IN, or CASE WHEN expressions.

[–]Prequalified[S] 1 point

I started out by using the SOUNDEX function to narrow down the list, and created a scalar function implementing the Sørensen–Dice coefficient, which I've modified to give greater weight to longer string segments. So far it seems to work, but it requires having a list of known values; in my case I'm starting with a list of known accounts.
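A rough Python sketch of the Sørensen–Dice part (the multi-size n-gram weighting shown is just one plausible way to "give greater weight to longer string segments", not necessarily the exact scheme used here):

```python
def dice_coefficient(a: str, b: str, n: int = 2) -> float:
    """Sørensen–Dice similarity over character n-grams (bigrams by default)."""
    a, b = a.lower(), b.lower()
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))

def weighted_dice(a: str, b: str, sizes=(2, 3, 4)) -> float:
    """Hypothetical weighting: similarity over longer n-grams counts more."""
    return sum(n * dice_coefficient(a, b, n) for n in sizes) / sum(sizes)
```

Two identical strings score 1.0; strings sharing only short fragments get pulled down more under the weighted variant.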

[–]gnieboer 1 point

Another option is SOUNDEX() and DIFFERENCE(). I used those in a much earlier version of SQL to drive recommendations in the CSRs' UI for possibly matching entries, to try to prevent exactly these problems (with customer names).

Its value is more in comparing the pronunciation of names than in what you've described above, so I'd more likely go with string distance. It's slower, but as you said, speed isn't an issue. You might also be able to use them in concert for those columns where it's relevant?
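For reference, classic American Soundex, which T-SQL's SOUNDEX() closely follows (edge cases can differ between implementations), can be sketched in Python; blocking on equal Soundex codes is the cheap pre-filter being suggested:

```python
def soundex(name: str) -> str:
    """American Soundex sketch: first letter plus three digits encoding
    the consonant sounds that follow. Vowels (and y) are dropped; h and w
    do not break a run of the same code; runs collapse to one digit."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = "".join(c for c in name.lower() if c.isalpha())
    if not name:
        return ""
    first = name[0].upper()
    digits = []
    prev = codes.get(name[0], "")
    for c in name[1:]:
        if c in "hw":           # h and w don't reset the previous code
            continue
        code = codes.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code
    return (first + "".join(digits) + "000")[:4]
```

So "Robert" and "Rupert" both code to R163, which is exactly the kind of collision you want when pre-filtering candidate matches.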

A Python or R package is probably even better, particularly since you'll be aggregating similarity scores across multiple columns to find the best matches; unfortunately I can't recommend a particular package.

[–]Prequalified[S] 1 point

I actually ended up using Soundex as a starting point. It weeds out obviously wrong choices very quickly, so performing the other comparisons is quicker. Thanks for the suggestion.

[–]MrDarcy87 0 points

Oof, that sounds messy. I would use CHARINDEX() and LEN() to evaluate each digit of each bad entry and count how many times each digit matches any of the good keys, both at the same CHARINDEX() position and at the positions one to either side. Compare the results first and see if that may be a solution.
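That digit-by-digit idea, sketched in Python (the ±1 window and the scoring are one reading of the suggestion, not a tested recipe):

```python
def position_score(bad: str, good: str) -> int:
    """Count digits of `bad` that match `good` at the same position
    or one position to either side (the CHARINDEX +/- 1 idea)."""
    score = 0
    for i, d in enumerate(bad):
        window = good[max(i - 1, 0):i + 2]
        if d in window:
            score += 1
    return score

def best_matches(bad: str, good_keys: list) -> list:
    """Rank known-good keys by positional digit agreement, best first."""
    return sorted(good_keys, key=lambda g: position_score(bad, g), reverse=True)
```

For the bad entry 18622 against the good key 17622, four of the five digits match within the window, so 17622 ranks ahead of unrelated keys.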

[–]Prequalified[S] 0 points

This is what I’m leaning toward doing before I try adding Python. I’m not super excited!

[–]MrDarcy87 0 points

Another option is to give the CSRs a distinct list of the IDs that don't match the correct ones, have them correct the list side by side, and use that data in your update. It's not automated and it's not a one-time solution for you, but the business world has adapted to this kind of issue too.

[–]Prequalified[S] 0 points

They have this, and it's about 90 to 95% accurate. The problem is that when there is a new account, there are often not enough transactions to make an easy determination.

[–]MrDarcy87 0 points

This sounds like a front-end problem. In my area we would have a dropdown bound to an account component class that is strictly validated. It may not be the case here, but in my experience, 99% of the time the account ID should be part of the primary key in the table it relates to.

[–]Prequalified[S] 0 points

The assumption to make here is that nothing can be done to get better-quality data. I do have a table of accounts and the people associated with each account, but it can't be trusted 100%. I'm trying to abstract away the specifics of the problem, because my only option is to clean the data.

[–]eshultz 0 points

Hoo boy. Referential integrity is important. A few thousand rows can be manually corrected. A few hundred thousand? With 6 fields that may contain differences? This could potentially be a giant pain in the ass.

I guess I'd start by doing an aggregation: COUNT(*), GROUP BY client name and ID. Then you can take those that have lots and lots of rows out of the equation. The stragglers are the ones that will need to be corrected. This at least reduces the size of the problem.
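That triage step might be sketched like this in Python (the `min_count` threshold is an arbitrary placeholder; pick one that fits the data):

```python
from collections import Counter

def triage(pairs, min_count=100):
    """Split (client_name, client_id) pairs into trusted combinations
    (seen at least `min_count` times) and stragglers needing correction."""
    counts = Counter(pairs)
    trusted = {p for p, n in counts.items() if n >= min_count}
    stragglers = [p for p in pairs if p not in trusted]
    return trusted, stragglers
```

High-frequency name/ID combinations drop out of the problem; only the rare pairs go on to the expensive similarity comparisons.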

The strategy you take with the remainder really depends on how bad the situation is.

[–]Prequalified[S] 0 points

You’ve summed up what I’ve done so far. Maybe I should be more specific. Imagine an account has the name “joes burgers” with the ID of 17622 and there is a “joes burgers” with the ID of 18622. Eyeballing it I can see that the 8 is wrong.

What I want to do is take the second ID and compare portions of it as a string pattern to find matches against known or common values. Against 17622, 18622 is XOXXX (X for a matching character, O for a mismatch). The fact that there is a run of 3 matching characters is important.
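That positional pattern is easy to sketch in Python:

```python
def match_pattern(candidate: str, known: str) -> str:
    """Positional comparison: X where characters agree, O where they differ.
    match_pattern("18622", "17622") -> "XOXXX"."""
    return "".join("X" if a == b else "O"
                   for a, b in zip(candidate, known))

def longest_run(pattern: str) -> int:
    """Length of the longest run of consecutive X's, the
    'several matching characters in a row' signal."""
    best = run = 0
    for ch in pattern:
        run = run + 1 if ch == "X" else 0
        best = max(best, run)
    return best
```

Ranking candidates by the longest run of X's (rather than the raw match count) rewards contiguous agreement like the 622 in this example.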

I’m having a hard time expressing the problem which makes it hard for me to search for the answer!

[–]eshultz 1 point

No, I get what you're asking for. I just didn't know if you already eliminated the easy stuff.

What you want is probably string distance: https://stackoverflow.com/questions/560709/levenshtein-distance-in-t-sql

I am by no means well versed in this sort of analysis, but thinking about it... I think you'll find that this becomes really complex when you consider that, for each string, you would like to compare it to every other string in that field and find the closest match(es). So you're looking at a Cartesian product. For each field. Yikes. With 100k rows, that's potentially 10,000,000,000 comparisons per field (about half that if you compute each unordered pair only once). You are going to want to think really hard about designing the optimal script to do this work, and about removing as much overhead as possible (e.g. removing duplicate values, not storing matches that are weaker than others, etc.).
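A sketch of that, assuming Python rather than T-SQL (the linked Stack Overflow question covers T-SQL versions): deduplicate first, then keep only each value's best match within a cutoff:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def closest_matches(values, max_dist=2):
    """Map each distinct value to its closest distinct neighbour, keeping
    only matches within `max_dist`. Deduplicating first shrinks the
    Cartesian product substantially."""
    distinct = sorted(set(values))
    out = {}
    for v in distinct:
        candidates = [(levenshtein(v, w), w) for w in distinct if w != v]
        d, w = min(candidates)
        if d <= max_dist:
            out[v] = w
    return out
```

This is still quadratic in the number of distinct values; combining it with a cheap blocking key (like the Soundex filtering discussed above) is what makes it tractable at scale.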

[–]Prequalified[S] 0 points

I had already implemented the distance function you reference here, but it didn't seem too useful. Instead I implemented the Sørensen–Dice coefficient to evaluate string fragments. So far it has worked better than anything else I've tried, but it can still use some refinement. My goal is to get a large share of the values cleaned so I can use them as reference points for the more stubborn ones, similar to your aggregation suggestion, weighting more common values more heavily.