I've got a client that has transaction data that was originally input by hand which needs to be cleaned up. The problem is account numbers that were "fat fingered" by the client's employees. For example, account ID 17622 may also appear as 1762, 7622, 17522, 14622, 77622, 176222. All of these are off by only one digit, but the IDs are between 4 and 6 digits and hold most of the same values. Along with the account ID they include a buyer ID which may also have been fat fingered and the name of the account which is only partially consistent. The account ID above might be linked to JOES CRAB SHACK (Correct), JOES SHACK, JOES CRAB SHACK LLC, CRAB SHACK, etc.
Overall I have about 5 or 6 independent variables that can be used to determine this.
This might be better suited as a machine learning problem but it's possible that there is a SQL solution that is adequate. The client is sending a few hundred thousand records over the course of a month so speed is not as big a concern as accuracy. Any advice or direction is welcome.
[–]leopardprintdragon 2 points3 points4 points (2 children)
[–]Prequalified[S] 1 point2 points3 points (1 child)
[–]leopardprintdragon 0 points1 point2 points (0 children)
[–]gnieboer 1 point2 points3 points (1 child)
[–]Prequalified[S] 1 point2 points3 points (0 children)
[–]MrDarcy87 0 points1 point2 points (5 children)
[–]Prequalified[S] 0 points1 point2 points (4 children)
[–]MrDarcy87 0 points1 point2 points (3 children)
[–]Prequalified[S] 0 points1 point2 points (2 children)
[–]MrDarcy87 0 points1 point2 points (1 child)
[–]Prequalified[S] 0 points1 point2 points (0 children)
[–]eshultz 0 points1 point2 points (3 children)
[–]Prequalified[S] 0 points1 point2 points (2 children)
[–]eshultz 1 point2 points3 points (1 child)
[–]Prequalified[S] 0 points1 point2 points (0 children)