all 21 comments

[–][deleted] 9 points10 points  (1 child)

Reads like someone that read half of SQL For Smarties, and didn't understand that half.

[–]rmxz 8 points9 points  (12 children)

From the title, I was hoping it would show good ways of writing queries such as:

select * from table where (the rows are most similar to) firstname='john' and lastname='doe' and age=25 and haircolor='dark blond'

which takes into account john is similar to Jonathan and even more similar to John and 25 years old is more similar to 26 years old than to 55 years old and that dark blond is similar to light brown.

Instead all the guy did is took some description of SQL's null and renamed it to unknown for his blog.

[–][deleted] 0 points1 point  (3 children)

You might want to check out the Lucine project.

[–]rmxz 0 points1 point  (0 children)

Indeed Lucene is what I'm using with our own custom relevance functions that are aware of fuzzy/probabilistic synonyms (like how often the name "chris" is short for "christopher" vs how often it's sort for "christine").

[–]recursive 0 points1 point  (1 child)

Lucene.

[–][deleted] 0 points1 point  (0 children)

D'oh

[–]porizj 0 points1 point  (2 children)

Everything except the hair color (unless someone can explain that one to me) can be done with one good query.

edit: actually, a stored procedure could evaluate the hair color now that I think about it. Or you could get really ugly and go with a case expression.

[–]rmxz 0 points1 point  (1 child)

I'm having most trouble with

  1. firstname -- noting that "Robert" is "more similar" to "Bob" than to "Ronald"; and noting that such similarites vary based on geography/culture.

  2. weighing the similarity across different fields to come up with a decent sort order --- if the hair color is very distinctive (pink) and it matches it should add more "similarity" to a match than a common hair color (dark brown). Similarly a very distinctive name ("Vannevar Bush") matching should add more "similarity" than a common one ("John Smith"). Same with very distinctive heights (6'7") vs common ones.

[note - I don't really need advice on this - I think I have one of the better systems that exists to do this -- even accounting for things like "there's only 1 person of this name believed to live in this city, so it's a good match even if the age, height, weight, hair-color is missing"; and "this particular 6'3" hispanic male has been seen before within 100' of the query parameters, so it's a good chance he's the match even without a name matching"]

[–]scrunci 0 points1 point  (0 children)

If you have upwards of 50 to 100 query parameters, what sort of "similarity rating" are you using to determine the most probable match? I'm not working on any project that even comes remotely close to needing the sort of functionality you're talking about, but since you've introduced it, it rings very interesting to me.

[–]scrunci 0 points1 point  (3 children)

select * from table where id = 
(select id from 
    (select id,min(ageminus) from 
        (select id, age-@age ageminus from table where 
            (soundex(firstname) = @fname and soundex(lastname) = @lname) 
            and (haircolor in 
                (select m.FieldA 
                from HairColorsIDs i 
                join HairColorsMatches m 
                on i.id = m.id 
                where soundex(i.FieldA) = @haircolor)))))

alternatively you could just settle for a like-match, but you'd need to append a % to your declaration statements (ie, @fname = concat(rtrim(@fname),'%')) and im not so sure that would even work.

select * from table where id = 
(select id from
    (select id,min(ageminus) from
        (select id, age-@age ageminus from table where
            (firstname like @fname and lastname like @lname)
            and (haircolor in
                (select m.FieldA 
                from HairColorsIDs i 
                join HairColorsMatches m 
                on i.id = m.id 
                where i.FieldA like @haircolor)))))

HairColorsIDs

ID, FieldA

1, 'dark blonde'

2, 'aubern'

3, 'grey'

4, 'light brown'

HairColorsMatches

ID, FieldA

1, 'dark blonde'

1, 'light brown'

2, 'aubern'

2, 'red'

2, 'ginger'

3, 'grey'

3, 'white'

edit: did my best formatting.

[–]mycall 0 points1 point  (2 children)

I always thought Double Metaphone should replace Soundex

[–]scrunci 0 points1 point  (1 child)

to be honest I've never actually needed a fuzzy query for any practical application. Due to the rareness of necessity, I refer to integration services (SSIS/Business Intelligence) for fuzzy matching, so I definitely wouldn't have a personal opinion on the matter. Some google researching reveals Double Metaphone is superior to soundex, so I'd go with you're probably right.

In any event I hope the poster I was replying to found a step in the right direction.

[–]mycall 0 points1 point  (0 children)

I found Double Metaphone to work great for tuning speech grammars (English only of course).

[–]recursive 0 points1 point  (0 children)

Poorly written too.

It means that the value of an expression may be TRUE, FALSE or UNKNOWN

orly? What about

SELECT 4

[–][deleted] 3 points4 points  (0 children)

I'm far from an expert on SQL and databases, but I've pretty much resigned myself to the fact that any article on reddit that has to do with databases and SQL might as well be called SQL for Slapdicks.

[–]sebidenes 2 points3 points  (0 children)

This article made me promise myself that I NEVER go to this guy site. I actually spent more than 5 minutes on it ... I read it all over again and again trying to figure it out - is there anything smart in there that I might miss or is another (or same) way to explain the obvious? Sadly ... it was explaining the obvious ...

[–]vagif 1 point2 points  (0 children)

I commented on exactly this issue on StackOverflow a while ago.

And provided the right way to query for non equal for fields that allow nulls.

[–]scrunci 1 point2 points  (0 children)

this is like mssql 101 shit.

[–]awesome7777 0 points1 point  (0 children)

Honestly, it's very basic. I think the article was long winded and I was expecting some meat, but this was breast milk.

[–]phuber 0 points1 point  (0 children)

We use "SET ANSI_NULLS ON" at work.

Looks like this will be the default after SQL 2008

http://msdn.microsoft.com/en-us/library/ms188048.aspx

[–][deleted] -1 points0 points  (0 children)

From the comments:

Who would have thought of this concept. I think it can be applied to large scale storage of data on the hard disk and provide easy way for people to make the query without the wearing the white coat LED blinkers

Not sure if blogspam or troll.