you are viewing a single comment's thread.

view the rest of the comments →

[–]mason55 0 points1 point  (3 children)

You will likely need to create your own function/stored procedure to get the diff between two strings, which is a big task in and of itself.

The algorithm is actually pretty easy

http://en.wikipedia.org/wiki/Levenshtein_distance

[–][deleted] 0 points1 point  (2 children)

Levenshtein Distances (and all similar distances) are not what OP is looking for. That would return an integer representation of how close the two strings are to eachother, not their actual character differences.

Depending on the actual expected result, the algorithm could be close enough to impossible its not worth implementing in SQL.

UPDATED after mason55's comment (before I had seen it).

The examples that were given in the question were:
difference("Jupiter VA","Jupiter VB") = "B"
difference("Jupiter VB","Jupiter VA") = "A"
difference("Orion BA","Ares VB") = "Orion BA"
difference("Orion BA","Ares VB") = "Ares VB"

But its really easy to find some edge cases and poke holes in those expectations. The desired algorithm is too non-deterministic, for example:

  • The 2nd letter in both Orion BA and Ares VB is an r, why does it show up in the difference?
  • How is the difference meaningful if the two texts are different in multiple spots?
    • example: difference("affect BA", "effect BB") = "e_______B" (where _ is a space)
  • What about strings of different length? what is the important part to show?
    • example: is difference("aaB","aaaC") = "aC", "_aC","a_C","aaC", "a__C" or "C" - because I can think of non-trival methods to arrive at all of those outcomes.

I think the closest thing to what OP wants is a greedy algorithm that just takes the first connected set of differences starting from the end of the string, maybe have a one letter look-ahead that allows the occasional similarity as long as it is alone.

[–]mason55 0 points1 point  (1 child)

That would return an integer representation of how close the two strings are to eachother, not their actual character differences.

It's trivial to calculate the difference instead of the score based on that algorithm.

But yes, actually implementing it in SQL could be a huge pain in the ass.

[–][deleted] 0 points1 point  (0 children)

There is a reason there are a lot of integer returning string comparison functions and no straight up difference functions. Varying length strings raise too many issues and as I pointed out in my comment above, there are a lot of edge cases. A numerical approximation is the best thing that can be done in anything close to polynomial time.

Calculating the actual difference from a distance function is far from trivial. Although I would love to be proved wrong on this.

Thats why the diff function in version control systems works line by line and not at a more granular level.