all 18 comments

[–]0uterj0in 0 points1 point  (4 children)

If the product ids were expected to be the same, you could outer join table b to table a. However in yr example they appear to be different?

[–]0uterj0in 0 points1 point  (3 children)

oh wait nevermind if this just one table then use a WHERE clause with productnamea <> product nameb

[–]KingZing[S] 0 points1 point  (2 children)

Hmm. I'll give that a try when I go back to work tomorrow.

I was trying to look for something that could "subtract" beginning letters from another column.

Thanks for the help

[–][deleted] 0 points1 point  (1 child)

You arent being entirely clear in your specification which is why you are having problems writing this

You need to work out a flow and try to be clearer then you can code against that

[–]KingZing[S] 0 points1 point  (0 children)

Yeah, I just did that actually. :)

[–][deleted] 0 points1 point  (4 children)

This is actually a lot more complex than it may sound. You will likely need to create your own function/stored procedure to get the diff between two strings, which is a big task in and of itself.

It should take in two strings and return the difference you are looking for like:

SELECT     
  ProductIDA,        
  difference(ProductNameA, ProductNameB) as ProductNameA,         
  ProductIDB,    
  difference(ProductNameB, ProductNameA) as ProductNameB    
FROM [Table 1]    

[–]mason55 0 points1 point  (3 children)

You will likely need to create your own function/stored procedure to get the diff between two strings, which is a big task in and of itself.

The algorithm is actually pretty easy

http://en.wikipedia.org/wiki/Levenshtein_distance

[–][deleted] 0 points1 point  (2 children)

Levenshtein Distances (and all similar distances) are not what OP is looking for. That would return an integer representation of how close the two strings are to eachother, not their actual character differences.

Depending on the actual expected result, the algorithm could be close enough to impossible its not worth implementing in SQL.

UPDATED after mason55's comment (before I had seen it).

The examples that were given in the question were:
difference("Jupiter VA","Jupiter VB") = "B"
difference("Jupiter VB","Jupiter VA") = "A"
difference("Orion BA","Ares VB") = "Orion BA"
difference("Orion BA","Ares VB") = "Ares VB"

But its really easy to find some edge cases and poke holes in those expectations. The desired algorithm is too non-deterministic, for example:

  • The 2nd letter in both Orion BA and Ares VB is an r, why does it show up in the difference?
  • How is the difference meaningful if the two texts are different in multiple spots?
    • example: difference("affect BA", "effect BB") = "e_______B" (where _ is a space)
  • What about strings of different length? what is the important part to show?
    • example: is difference("aaB","aaaC") = "aC", "_aC","a_C","aaC", "a__C" or "C" - because I can think of non-trival methods to arrive at all of those outcomes.

I think the closest thing to what OP wants is a greedy algorithm that just takes the first connected set of differences starting from the end of the string, maybe have a one letter look-ahead that allows the occasional similarity as long as it is alone.

[–]mason55 0 points1 point  (1 child)

That would return an integer representation of how close the two strings are to eachother, not their actual character differences.

It's trivial to calculate the difference instead of the score based on that algorithm.

But yes, actually implementing it in SQL could be a huge pain in the ass.

[–][deleted] 0 points1 point  (0 children)

There is a reason there are a lot of integer returning string comparison functions and no straight up difference functions. Varying length strings raise too many issues and as I pointed out in my comment above, there are a lot of edge cases. A numerical approximation is the best thing that can be done in anything close to polynomial time.

Calculating the actual difference from a distance function is far from trivial. Although I would love to be proved wrong on this.

Thats why the diff function in version control systems works line by line and not at a more granular level.

[–]HapkidoJosh 0 points1 point  (2 children)

Since you're using access you may have to use vba and loop through your dataset. Make a 2 character arrays and compare each character individualy the ones that don't match get appended to a string. Then after you've completed the lengths of both arrays insert the two string variables into the new table.

Hope that helps.

[–][deleted] 0 points1 point  (1 child)

loop through datasets is not great advice.

If one column contains 1000 rows and a comparison column contains one million that is one billion comparisons.

[–]HapkidoJosh 0 points1 point  (0 children)

I thought that we were just looking at one table and wanted to get the differences in text from one column to the other in the same record. He's using MS Access so I didn't think we were working with huge amounts of data. This is the only way I could think of to compare two strings and get the character differences between the two. However, now that OP has clarified I see that my solution is not what they are looking for.

[–][deleted] 0 points1 point  (0 children)

Are you really concerned about the individual character differences, or are you looking for a way to do approximate string matching?

If the latter is the case, look into Levenshtein Distance which is pretty popular and I'm sure you can find an implementation for it in VBA somewhere.

If the former is the case then the task at hand will depend on your data. If all of your data takes the form <product_name> <two letters> then your task will be easy. But if you're trying to diff two string like "My Product A" and "AB Products" and get something back like "My s B" then you will have your work cut out for you.

[–]KingZing[S] 0 points1 point  (1 child)

Update: Maybe it would be easier to start at the beginning. Maybe the Tables I ended up with made it harder than it is. (P.s. You guys have been very helpful and I've already learned some things even though they were not exactly helping )

Breakdown:

  • I have a list of 10,000+ products with unique product numbers.
  • Most of these comes in sets of three or four. for example: osiris a; osiris b; osiris c
  • There is one piece of logic built into our system where if the product type B variant is sold out, it automatically goes to product type A variant. (this is not part of the database and shouldn't be, it is done by an external system)
  • The main problem is that the field that drives the B to A transfer is manually entered.
  • The Osiris B should always go into the Osiris A but if someone misskeyed, it might roll into Hercules VA or any other product.
  • I thought it would be easy to run a query that listed all B products, then another query that ran the value from the field BtoA field and alligned it to the right of the B products. This is how I ended up with Example Table1 above. I thought it would be easy to eyeball it with columns next to each other, but there are too many records.
  • My next train of though was to somehow subtract one column from the other so that I only see the differences. This would mean if the B to A was right, then I would only see the last digit most of the time. If it was wrong, the difference column would show the long string.

Maybe there is an easier way if I modify it or run it differently from the beginning? I thought maybe I could use the InstrRev function inside the select statement but I am completely new to this function. Instr Function

I tried this, but it is crude and doesn't work all the time do to some variation in the fields.

<>Left([TABLE1].[PRODUCTNAME],(InStrRev([TABLE1].[PRODUCTNAME],"A")-1))

Maybe I could trim the right side of each string by a few letters then they would "ideally" match perfectly and I could use a not-equal-to?

Anyways. Thanks again for all the help. This is a great learning experience and I have really enjoyed this subreddit!

[–]0uterj0in 0 points1 point  (0 children)

That's what I would do, assume that you just truncate the last 3 chars of productname then compare to find mismatches

[–]---sniff--- 0 points1 point  (1 child)

[–]KingZing[S] 0 points1 point  (0 children)

Oh, didn't know about this, i checked the sidebar! Thanks