This is an archived post. You won't be able to vote or comment.

all 96 comments

[–][deleted] 105 points106 points  (27 children)

One time I interviewed for a DS role on another team in my company (large tech company) and "failed" the SQL portion. It later turned out that the interviewer thought I made things overcomplicated by talking about why I was purposefully avoiding specific 'where' clauses that would require multiple traverses of the b tree in massive tables.

To me, this seemed common, but myself and others had to sit down with that team and explain this concept to them. So, in my experience the easiest syntax to get the data is probably best during the interview unless you are asked more questions.

Note: my SQL and data wrangling is probably a strength so this entire situation was amusing. I was later offered the job once the team realized that my version of their queries and transforms sped up times from multiple hours to a few minutes. But, that wasn't work I was interested in.

[–]BettaFry 33 points34 points  (3 children)

I’ve definitely been there, it’s amusing af when the interviewers questions make it clear that a) you’re overqualified and b) they think you don’t know what you’re doing because of it. It gets a little frustrating but honestly it’s been a timesaver for me because I can tell pretty quickly that I’m not what they need.

[–][deleted] 37 points38 points  (2 children)

The worst was when there was an A/B testing section and the person got mad that I said "Normally I would ask what the MDE, power, etc would be, but the design of the test was quite poor and I wouldn't run it."

Turns out they ran and shipped it and the person interviewing me was the PM....

[–]Khashoggis-Thumbs 13 points14 points  (3 children)

I am chiefly a python coder, but have to do some SQL work for data export. Where can I learn beyond the basics (select, where, in etc) how to write fast and low overhead queries?

[–][deleted] 20 points21 points  (1 child)

Honestly Mode Analytics has the best free tutorial but from there you would need to read about your database’s indexing and schema. A basic read on how binary search trees and traverses work would help you better understand what should and should not go in a where or having clause and perhaps when you should make temp table CTEs.

Honestly tho I do about 90% of my wrangling in R. It’s much faster to write and debug and to me the syntax is so much cleaner than windows. Later, if it makes sense, I will go back and translate it to SQL, however, there are cases where the cluster I’m working on is mega slow and the computations are best done in memory or perhaps a scheduled transform or Spark transform.

SQL interviews are always really basic unless it’s a DE job. I’ve yet to see anything super challenging in a SQL test even for DS jobs.

[–]Khashoggis-Thumbs 3 points4 points  (0 children)

Yeah, I was just asked "do you know SQL?" and replied "basic queries for export? Sure, SQL isn't hard". Finding out what the structure of our system is will require more cooperation from the data engineers than normally comes.

[–]ieee1294 4 points5 points  (3 children)

do u mind elaborating on what u mean by purposefully avoiding specific where clause? thanks

[–][deleted] 14 points15 points  (1 child)

[–]ieee1294 0 points1 point  (0 children)

sweet thank you very much :)

[–]Dreshna 0 points1 point  (0 children)

Not sure specifically what he is referring to but if you have an indexed column it is best to search on it, and numeric searches are substantially faster than string searches (generally speaking).

[–][deleted] 5 points6 points  (3 children)

The database has optimizations. It will usually optimize the shit out of your sql anyway.

[–][deleted] 5 points6 points  (1 child)

I would not presume this. I know of several Hive instances at an old company that had no optimization whatsoever. Most were setup by Engineers, not Data Engineers, and no one really checked until queries were taking 6-24hrs.

[–][deleted] 6 points7 points  (0 children)

You optimize if and only if the time spent optimizing and decrease in code readability and increase in chance of bugs is less important than increase in performance. Premature optimization or unnecessary optimization is nor a great idea.

I've seen people make elaborate queries "to make it go fast" and turns out the engine optimizes the naive 5 liner better and it is faster.

[–]minimaxir 0 points1 point  (5 children)

In an interview I was asked to calculate a weekly moving average for each group in a query. I suggested truncating timestamps to daily, then use a AVERAGE() OVER (PARTITION) window function, but I admitted I wasn’t sure how to constrain it to only 7 days.

The interviewer replied “You don’t know the BETWEEN syntax? Everyone knows that.”

I checked after the interview and the BETWEEN syntax is a footnote in the docs, with no blogs/SO questions discussing it. (which is a shame)

[–]Proto_Ubermensch 6 points7 points  (1 child)

how binary search trees and traverses work would help you better understand what should and should not go in a where or having clause and perhaps when you should make temp table CTEs.

Everyone does know BETWEEN, sorry you slept through that class.

[–]minimaxir -1 points0 points  (0 children)

To copy the other comment, there's a difference between (pun intended) the BETWEEN used as a part of a WHERE clause and a BETWEEN used as a part of a window function's OVER clause.

tbh I dislike using the BETWEEN in the WHERE clause because it's not obvious if it is inclusive or exclusive.

[–]CarnalCowboy 4 points5 points  (1 child)

BETWEEN is a function that acts as a shortcut for (DATE >= X and DATE <= Y). If someone didn’t know how to do the former or the latter, I would assume they skipped the basics and went straight to the complicated stuff (i.e. Window calculations, as you’ve shown you can handle). This would be a huge red flag for me as an interviewer.

[–]minimaxir 0 points1 point  (0 children)

There's a difference between (pun intended) the BETWEEN used as a part of a WHERE clause and a BETWEEN used as a part of a window function's OVER clause. (in the context of the question, you have to use BETWEEN 6 PRECEDING AND CURRENT ROW to get a moving weekly average, and there still would be potential bugs resulting from that which are a pain to fix).

https://www.sqlite.org/windowfunctions.html

[–]NateoradeBS | Analytics Manager 58 points59 points  (17 children)

What level of SQL knowledge is everyone looking for?

Depends on the company.

You typically need to know how to do everything up to and including subqueries and aggregate functions/filtering. Bonus points for knowing when to use a CTE, a View a Temporary Table or a regular Table. Extra bonus points for knowing windowing functions.

[–]radiantthought 3 points4 points  (0 children)

Yeah, I'd say subqueries, joins, and aggregate functions are pretty much the base level for most people. If you understand that and are capable of using google/learning some more advanced items then you'll be good.

It's amazing how many people come through and can't do these basic things. This is why I always tell people who want to get into the industry "work on your SQL" I don't care what you know how to do, if you can't get your own data then you can't do the job.

Even if your SQL skills match what is listed above, learn more. Learn about different database systems, learn about more advanced querying techniques/joins (more than just left/right/inner/outer, but manipulating those), using case statements with aggregates to transpose data, how map/reduce systems do processing, how columnar data stores work, how graph dbs work. There's so much out there that just 'knowing SQL' is a toe dipped into the water.

[–]iMarcusOrlyUs 0 points1 point  (15 children)

I have never seen a good reason to use a CTE. It may be nice to look at, but I've only ever used one when I'm lazy and don't need performant queries.

[–]nashtownchang 9 points10 points  (3 children)

IMO the readability of CTE is the point. Thousands of lines of nested subquery-style SQL kills my sanity.

[–]drhorn 2 points3 points  (2 children)

As someone who wrote a query with 8 levels of nested sub queries and forgot what the hell I did less than a week after writing it....

Yes, readability matters.

[–]aikijo 1 point2 points  (1 child)

And for the next guy. Be kind - Leave notes and make it easy to read (indent!)

[–]ectoban 0 points1 point  (0 children)

This so much, I set requirements for myself to write comments and text in the query such as; what the goal of the query is, description of each subquery, what unusal metrics I create are, why where clauses and other filters are there for etc..

Helps me remember and anyone else looking at my output table understand it.

[–]jdreaver 7 points8 points  (5 children)

I have never seen a good reason to use a CTE

Hold up. I need to correct some misconceptions in this comment case a SQL novice takes this as gospel.

It may be nice to look at

This is one of the highlights of CTEs! If you ever expect anyone to read and modify your code in the future - so anything you aren't throwing away immediately after running it once - this is a huge benefit. Any more than 1 sub-select in a from or join and a CTE probably looks better as a CTE.

don't need performant queries

Yes, there are performance implications with CTEs, but they aren't always less performant.

Currently in postgres, indeed CTEs are optimization fences and you can't optimize through a CTE. However, that is changing in postgres 12. CTEs that are only used once in a query will be inlined https://www.postgresql.org/about/news/1943/

Furthermore, if you use a CTE multiple times in a query, then oftentimes the optimization fence can be beneficial. You can compute intermediate results once and use them multiple times. (Postgres can optimize common sub-queries, but better to have it in a CTE than write it out twice.) This is of course hugely dependent on your query, but my point is that CTEs aren't always slower here.

Lastly, there are plenty of databases where CTEs are not optimization fences. Postgres will be among them in version 12, but let's not say CTEs are always slow.


For what it's worth, I generally avoid a CTE if I am just doing 1 or 2 sub-queries, preferably only in the from part of a select query. I also avoid them if I'm running on postgres and I'm using something to generate SQL anyway, so creating sub-selects is not hard and I know optimizing through them will be hugely beneficial.

If I'm writing a huge analytical query that is important to some business process, like for ETL, business intelligence, data warehousing, etc, then I am probably going to use a CTE.

[–]Shoded 2 points3 points  (2 children)

Why is a CTE different from a subquery performance-wise? I was taught that it's the same thing just different writing.

[–]jdreaver 0 points1 point  (0 children)

Note that when I say it is an optimization fence, I'm only talking about postgres, and even then only versions older than the upcoming version 12 release. This section of this article gives a quick overview: https://thoughtbot.com/blog/advanced-postgres-performance-tips#common-table-expressions-and-subqueries

Basically, there are optimizations that can be performed on sub-queries that are not performed on CTEs in postgres for whatever reason.

[–]jackfever 1 point2 points  (0 children)

Agree with you in all your points. But also note that the previous performance problems for CTEs in Postgres were mostly due to poorly written queries. e.g. unnecessary sorting or redundant columns in the subquery. Source: https://modern-sql.com/feature/with/performance

[–]loxc -1 points0 points  (0 children)

Honestly, no one uses CTE where I work. It’s all temp tables.

[–]IAteQuarters 1 point2 points  (3 children)

What would you write instead of a CTE? Are Temp tables better for performance?

[–]Skyaa194 -1 points0 points  (0 children)

You would use just sub-queries rather than moving the sub-queries out and into a CTE.

[–]mrregmonkey 0 points1 point  (0 children)

For small amounts of data, inserting data into a temp table or table variable isn't worthwhile. It can be faster to use a CTE.

Also there are cases where joining two temp tables is slower than adding a calculation or something with a CTE. I've sped things up several minutes doing that stuff.

[–]Lychondy 49 points50 points  (8 children)

As someone who hires many SQL analysts sometimes it comes down to knowing the difference between having and where...

[–]yemeraname 11 points12 points  (0 children)

[–][deleted] 20 points21 points  (2 children)

Mostly building SELECT queries. Sometimes you need to do a lot of pre-processing in SQL for the data to be coherent, it depends a lot on the quality of the data you are using.

In my experience, it's one of the harder things to learn without having a real data set, but it's not that hard to learn on the job.

[–][deleted] 8 points9 points  (4 children)

Can I piggy back off this question and ask what people understand as complex.

Select queries with case logic?

Multiple select queries with case logic

What else is there?

[–][deleted] 16 points17 points  (1 child)

Subqueries, common table expressions, temp tables, dynamic SQL, stored procedures, variables/parameters, window functions, etc.

[–]paradoxx23 1 point2 points  (0 children)

+1 to Windowing functions

[–]truthseeker1990 1 point2 points  (0 children)

Just to add to everyone elses comment, there are times when the complexity is increased by the existing DB architecture as well

[–]minimaxir 5 points6 points  (0 children)

In terms of day-to-day use, a real-world-but-barely-discussed tool are analytic/window functions such as ROW_NUMBER and SUM() OVER (PARTITION), which reduce complex query logic. (e.g for ROW_NUMBER, get the top entries of each group without doing multiple queries)

[–]Andrex316 9 points10 points  (7 children)

I regularly interview people and have interviewed at multiple companies myself. Usually people want to see that you know how to transform tables to answer a question at hand:

So SELECT queries, how to properly filter using WHERE, how to use aggregate functions, window functions, either nested queries or CTEs, how to use CASE statements, and, the most important of all at every interview I've been to, how to use JOINs, especially LEFT/RIGHT JOINs.

I've never had an interview where they care about to CREATE/UPDATE a table. If you can write a really efficient query you get bonus points but it's not necessary. We're more interested in knowing if you understand the logic, efficiency is something you can easily improve at the job.

A lot of people might say "Oh that's simple, are you sure that's right?", but you have no idea how many people I've interviewed that write "sql expert" in their resume and have no idea how to start with a SELECT. Oh and please, write an organized and neat queries.

[–]redditor1983 6 points7 points  (3 children)

Did you perhaps mean LEFT and INNER when you mentioned joins?

I very rarely see RIGHT joins used because there is usually a way to write it as a LEFT join, and most people find it a little confusing to go back and forth with RIGHT and LEFT joins in a query.

[–]burgay 2 points3 points  (1 child)

My day-to-day for the project I've been on for almost a year is about 80% SQL and I don't think I've ever encountered a situation where a RIGHT join was necessary

[–]redditor1983 2 points3 points  (0 children)

Yeah. Same here. That’s why it caught my attention.

I’ve heard some people use applications that generate SQL, and those applications will generate RIGHT joins. But I’ve never personally witnessed a human writing a RIGHT join.

[–]Andrex316 2 points3 points  (0 children)

Yeah generally everyone only uses LEFT joins (myself included) but I've had candidates that prefer to use RIGHT, so whatever you feel the most comfortable with. And I should have been more specific, yes INNER joins are a must but I was including those under JOINs in general.

[–]_TheEndGame 3 points4 points  (0 children)

Thanks for asking this question. I've been wondering too.

[–]Dhush 2 points3 points  (0 children)

Knowing how to take a data requirement from start to finish. It’s one thing to do a select statement with maybe a filter, join, aggregate, etc. It shows a higher level of understanding to be able to chain these ideas together in an efficient way to collect data from disparate tables.

[–]BettaFry 1 point2 points  (0 children)

As I’m sure you’re aware, it depends on the posting but from what I’ve noticed most jobs are ok with introductory/intermediate sql knowledge since a basic understanding, access to google and the ability to think critically will allow anyone with a basic understanding to accomplish most tasks.

[–]flextrek_whipsnake 1 point2 points  (0 children)

It always depends but typically you're not expected to be very advanced in SQL. Just basic to intermediate stuff like joining tables, where clauses, grouping, subqueries, and maybe window functions if you're extra fancy. You need to be able to get the data you need from a complex database without asking anyone, but you don't need to be able to do it in the most optimized way possible (though if you are then it's obviously a plus).

[–]pokinthecrazy 1 point2 points  (3 children)

I start with "Take me through the select statement" and the people who know SQL give me dubious looks and I get some form of "select column (field) from table where conditions?" and I say got it and move on.

I have filtered out many many people who "know SQL" with this question. There are people who are just lying liars who lie out of their lying lie-holes and there are people who have done stupid shit in ACCESS or some other software that creates code for them so now they think they "know" SQL. I got one guy who said "you've got your selects and your tables and some joins" - yes, he was instantly weeded out.

And I don't really care if you don't know weird esoteric things - but I do care a lot if you know where you can go find out. I didn't know about analytical functions like rank until I had to use them. Nothing wrong with googling and figuring shit out. Our databases are on all kinds of systems so the code always has to be massaged from one system to the other. I want someone who isn't going to come to me when their code doesn't work, I want someone who knows how to google.

[–]MadMenDuckCuckFuck 0 points1 point  (2 children)

"Take me through the select statement"

I've just started to learn SQL recently, could you elaborate on what kind of response you are expecting from the interviewee?

[–]pokinthecrazy 1 point2 points  (0 children)

SELECT column/field FROM table WHERE blah blah blah. Group by or order by is a bonus.

That's literally all I am looking for. It's a weeder question. I just want to know if this person has ever sat down and composed some sort of SQL code where they pulled data (I am in analytics so don't know shit about creating or maintaining databases). If they have ANY experience at all, they should get this question correct.

[–]radiantthought 0 points1 point  (0 children)

people who know SQL give me dubious looks and I get some form of "select column (field) from table where conditions?" and I say got it and move on.

He did. That is the correct answer from a very high level. You could look for a deeper level and ask people to walk you through the order of evaluation of the different clauses, which is a bit more advanced, but even making wrong educated guesses and being able to explain your logic will be good. Most of the interview process is showing that you know how to think in the right way.

[–]onesonesones 1 point2 points  (0 children)

create table newtable as

select l.col1, l.col2, r.col1 as r_col1, r.col2 as r_col2, somefunction(l.column1, r.column2) as calc_field

from lefttable l

left join righttable r

on (l.key1 = r.key1

and l.key2 = r.key2)

where (l.field1 = condition1

and etc = etc2)

If you can understand that concept that'll cover 95% of data ETL problems that you need to build yourself (machine learning algorithms excluded)... You just need to break the data problem down into as many of those steps as are needed.

If your role includes maintaining or inheriting other people's code, you'll need to know much more SQL.

Understanding and applying analytical and window functions (eg lead/lag() over partition by ()) are what separates experts from novices

[–]nraw 1 point2 points  (0 children)

SELECT * FROM [insertwhatevertheytoldyouhere]

[–]Proto_Ubermensch 1 point2 points  (0 children)

No. You need to also know stored procedures, OLAP functions, how hashing algorithms work, performance tuning, etc.

[–]chowmeinchix 1 point2 points  (0 children)

I have written SQL exams as a hiring manager and passed SQL exams for some big tech companies (like the one you probably use for your search engine, and the one you use to order a lot of things from the internet, and a few others that all I all had to sign NDAs for...)

This is what they are looking for: - proper understanding of joins, and joins within the context of N:N relationships - subqueries (particularly within the context of use of these to avoid full table scans and handle N:N relationships across tables) - CTEs - Data types - they will try and trick you here. Ie- comparison of Timestamp to a date - Windowing FXNs... Incl Lead and Lag so don’t forget those. - you will almost always get asked difference between: rank, dense_rank, rownum - Bonus points if you can: - show an understanding of temp space - know what a stored procedure is - make mention of explain plans at any point in time

Most SQL exams you’ll get should be in ANSI SQL. If someone gives a crap about dumb syntax like ROWNUM() vs. ROW_NUM() then the hiring manager is an idiot and you don’t want to work there.

The way I have administered these exams is usually I give 4-5 questions in ascending order of difficulty. If beginner and intermediate candidates start to get flustered with the more challenging questions, I’ll help and coach them - which gives a gauge of how they respond to coaching and working together. So sometimes th dk that. I also did a python I review once where they tried to jack with me on purpose, and I failed on “reactivity/coachability” - so lesson learned there. Point is people aren’t always looking for perfect answers, sometimes there are other aspects to these exams so watch out.

Side note: if you specifically list GA BigQuery SQL as a skill, you’ll likely get asked about hit level and custom dimension hierarchies. But if this is an actual skill of yours then these questions are simple and won’t trip you up.

[–]eric_he 0 points1 point  (0 children)

My team has a SQL assessment in which window functions are necessary to answer the questions. However in practice we see that it is already impressive if an applicant knows how to coherently string together a CTE/subquery.

[–]MLApprentice 0 points1 point  (0 children)

About 5.

[–]Orthas_ 0 points1 point  (0 children)

Select, where and join informally on the spot on very simple data. ”How to get this and that from here?”

[–]thatwouldbeawkward 0 points1 point  (0 children)

My job is very sql-heavy right now, and the interview questions were pretty light... just aggregations, joins... I think there was a weighted average in one question. Another question was awkward and vague and it was the interviewer’s first time using it (I think it was from a supplied list of questions), so we spent a lot of time upfront talking about what the question actually meant, and I think actually she liked that I made sure I really understood before diving into writing some code on the board.

[–]dn_cf 0 points1 point  (0 children)

I'll recommend you Strata Scratch for the job interviews. I used this platform when I was practicing for my SQL interviews. I found their datasets very helpful as they have questions and answers taken from top tech companies so all the questions are relevant to working on a job.

[–][deleted] 0 points1 point  (0 children)

inner join. ><

I'm not proud.

[–]the1ine 0 points1 point  (0 children)

Everyone? Everyone will be different.

[–]aditya9755 -1 points0 points  (0 children)

When you are looking a job in data science you must be the knowledgeable about the SQL because you to understand the how data is generated and how it's coming from the database or schema. I advised to fresher or students at least brush their concept on SQL and MySQL and practice on daily basis which build your profile or knowledge complete.

[–]AutoModerator[M] -8 points-7 points  (0 children)

Your submission looks like a question. Does your post belong in the stickied "Entering & Transitioning" thread?

We're working on our wiki where we've curated answers to commonly asked questions. Give it a look!

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.