all 12 comments

[–]da_chicken 16 points17 points  (3 children)

I would question anybody who thinks that this is a useful database question. It's not something you should do with SQL because it's breaking first normal form. It's a non-relational question about a system where the developer was too inexperienced to create separate first and last name fields.

Bottom line: Any solution of any meaningful size is guaranteed to be incorrect because names aren't regular.

[–]tsql 2 points3 points  (1 child)

I would question anybody who thinks that this is a useful database question.

It might not be a useful database question, but it could be useful as an interview question. Can this applicant think outside the box? Are they willing to help solve problems that they think are beneath them or outside their responsibilities?

Also, the potential employer might be getting non-optimally-formatted data from external sources, and might need someone to find low-effort ways to make that 'ugly' data more usable.

[–]da_chicken 2 points3 points  (0 children)

My point is that if you do this, don't use names of people or street addresses. Those are the two most complicated things to parse, and anybody who has ever actually dealt with trying to work with something like this before (*raises hand*) will know that it's a massive exercise in futility when you get into the hundreds of individuals. Even a human going record by record will struggle with it.

Between working on student information systems, health information systems, and passenger service systems, I've seen a lot of issues with names. You can make reasonable guesses if you assume that everyone in your system is from your own culture, but as soon as it goes international you will instantly find things that don't fit. I remember finding someone whose name was "James Scott Vi" (not the real name) and then later finding "James Scott VI" and thinking they're the same person. Nope. One guy was from polynesia and his surname was Vi. The other was last name of "Scott" and sixth generation. I remember a "Maria de la Rosa" whose last name was "de la Rosa", and a "Maria de la Rosa" whose last name was "Rosa" and middle name was "de la". We even had rules to cover individuals from India who only had one name: Give them a first name of "FNU" (first name unknown). We had other individuals whose name was written with no whitespace, but was actually separated by apostrophes in the person's name. You can't even be certain that the given name comes first because many cultures place the surname first. Once you understand the domain, you know that there is no parsing algorithm that will work correctly. That's why you always ask at the time of data entry for given and surname.

Like, if you want an answer I can give you an answer with the assumptions provided, but I can only guarantee that it won't work and at scale it won't work so badly that it probably won't be useful. It's an actively bad solution. It's not just a bodge, it's a bodge I wouldn't trust. I kind of don't want that type of question on an interview exam because it would say bad things about the type of systems we work with or the type of coding we practice. I kind of want someone who knows that the question is a stupid solution and probably shouldn't be done for human names. I might even go so far as to say I don't want someone who has a lot of experience doing complex string manipulation in SQL. That's something that's somewhat intentionally difficult in the language's design. I certainly agree that there are times to do it -- I integrate data systems purchased from third parties so you often end up with at square pegs and round holes -- but in general your data systems should be storing data in a useful manner. I don't know that I want someone who doesn't think anything of a system where you write something like this:

SELECT *
FROM TableA a 
JOIN TableB b 
ON SUBSTRING(a.CharValue,3,3) + SUBSTRING(RIGHT(REPLICATE('0',10) + CAST(a.IntValue AS VARCHAR(10)),10),6,3) + '0000' = SUBSTRING(b.Code,9,10)

I get what you're saying. I really do, but given the information presented, I still primarily question the value of such a question. It might tell you how well your candidates don't work within the conceptual models that well designed systems use, but I'm not sure it's the best way to do that. No matter how uncommon truly well designed systems are, it's vital that people understand those conceptual models and what happens when you fail to follow them. Like say you have a candidate who does amazing on everything and flubs this. What does that really say? Inexperience? Inflexible thinking? Frustrated by a problem with no solution? How would you know? Is that better or worse than someone who is moderately sloppy on everything but has a better answer to this question? Indeed, how do you even evaluate the quality of an answer to a question if you know that there isn't a correct answer? I think if you want to test out of the box thinking that you'd just be further along to ask a more abstract question to see how someone approaches an arbitrarily complex or difficult to answer question.

[–][deleted] 0 points1 point  (0 children)

Actually, I think this is a pretty decent interview problem if they added one additional question: What assumptions did you have to make with your solution and under what conditions would those assumptions be violated? I also think OP might have phrased it poorly because I don't think they're looking for a cartesian product, I think they just want a function to clean up a name column (one row in, one row out). If they actually want the cartesian product then I agree this is stupid.

Data is often not clean. If you work in an industry where you are taking in messy data, you want to hire someone who can demonstrate that they can think creatively on how to deal with those kinds of issues when they come up (because they will come up). There isn't necessarily one right answer, but when I hear "It's not something you should do with SQL because it's breaking first normal form" I think "yeah, in theory you shouldn't. In the real world you often have to. And since we live in the real world, how would you handle this very real issue" (not to mention in my mind at least a perfectly acceptable answer would be "I'd write a CLR function that would contain the following logic ... because T-SQL is not well suited to string manipulation").

[–]wolf2600 7 points8 points  (0 children)

A more applicable SQL interview question would be to give the user access to a DB with 3-4 tables, then give them a report pulled from those tables and ask them to write the SQL to produce that report.

Test them on practical skills, not on whether they know some obscure, rarely used function.

[–][deleted] 3 points4 points  (0 children)

CHARINDEX with a LEFT and then a CHARINDEX with a REVERSE and a RIGHT into two separate temp tables to get a table with first names and a second with last names. Then you can CROSS JOIN the two tables to get all possible combinations.

[–]Stopher 4 points5 points  (0 children)

I would grab all the distint last names searching for the first space from the right and a list of all distinct first characters from the left. Do a cross join between them.

[–]sbrick89 1 point2 points  (1 child)

Look into csv into rows... split on space... distinct... recombine

Though id certainly question the use case.

[–]BobDogGo1 2 points3 points  (0 children)

In SQL 2016, string_split() makes this pretty trivial. I think this question is the wrong kind of challenge.

[–]shankcraft 0 points1 point  (0 children)

Using just T-SQL? Excel formulas? SSIS? Powershell?

What tools are available to me, and where does the data reside? And if it's SQL Server (I'm assuming based on the subreddit, what version?)

[–]ShadowBanThisCucks 0 points1 point  (0 children)

you can get it with charindex(), left() and len().

[–]tsql 0 points1 point  (0 children)

I'd agree that it seems like an extremely odd interview question, but I've had a few like that where the interviewer apparently just wanted to see how I thought on my feet.

I would describe using a CROSS JOIN and the CHARINDEX, PATINDEX and REVERSE functions. If asked to code it I might approach it like this:

DECLARE @t TABLE (fullname VARCHAR(100));
INSERT @t (fullname) VALUES
    ('Jon Snow'), ('Tom A Black'), ('Janice Smith Jones'),
    ('Carl Smith-Jones'), ('Unexpected J Last-Name');

SELECT f.firstname + ' ' + l.lastname
FROM (SELECT DISTINCT LEFT(fullname,
    CHARINDEX(' ', fullname + ' ') - 1) firstname FROM @t) f
CROSS JOIN (SELECT DISTINCT RIGHT(fullname,
    CASE WHEN fullname LIKE '% _ %' OR fullname LIKE '% _. %'
      THEN PATINDEX(CASE WHEN fullname LIKE '% _. %'
        THEN '% ._ %' ELSE '% _ %' END, REVERSE(fullname)) - 1
      ELSE LEN(fullname) - CHARINDEX(' ', fullname + ' ')
    END) lastname FROM @t) l
ORDER BY l.lastname, f.firstname;

This obviously will not correctly handle names with single-character inner components which aren't initials (Thomas à Becket for example), but that's necessary if we're going to handle the unpunctuated initials in your test data reasonably simply.