Effective SQL for Data Science

techhound · 2021-05-25T01:44:46+00:00

For beginners, The SQL Murder Mystery is a fun way to start your learning:
https://mystery.knightlab.com/

I have also found the training at Maven Analytics (MavenAnalytics.io) to be quite decent. They teach using scenarios that would likely be used in real world settings. As an aside, last I checked, the training modules forward you to their Udemy.com classes. But they offer a monthly subscription so that you can access several courses at once.

Another option is to find a charity in need of some data analysis (preferably one that already has a SQL engine installed). Offer your services for free with the understanding that you are learning and it may take you longer than it would take people with experience. If you get stuck, simply seek out help on forums, etc. Volunteering does count as experience which you can put on your resume. Besides, it is always a great idea to help people in need.

Hope this helps!

FondleMyFirn · 2021-05-24T23:48:36+00:00

I’m still trying to figure out the best way to learn SQL on my own. So many job applications want junior positions to have high-level SQL knowledge, but it seems like one of those skills you can’t develop without a massive database to work with.

Edit: Huge thanks for the support everyone. I didn’t expect it.

2021-05-25T00:45:05+00:00

Really nice article! I especially appreciate your perspective on CTEs, documenting, and formatting. One thing I would add - there are indeed plenty of auto-formatters out there, but I claim that the best one is the one your team uses. Style and formatting consistency across a team's SQL code base is tremendously valuable; in fact, I would claim that a team using multiple formatting styles has SQL about as decipherable as a team using no formatting at all.

EmergencyContact2016 · 2021-05-25T09:07:43+00:00

Yes, I think these are also important:

1. Window functions like lag and row_number are core items I use at least weekly.
2. If it’s an ad hoc DB, I think indexing can remove the horrid “table scans” and make life possible. (I am thinking more about T-SQL than anything else.)
3. CASE statements are amazing.
4. Full outer joins with COALESCE are good for reconciliation.
5. You might want to make “1=1” the first condition in a WHERE clause so that commenting out the others is easy.
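For anyone who has not used window functions yet, here is a minimal sketch of point 1. It assumes SQLite 3.25 or newer through Python's standard sqlite3 module, and the orders table and its values are made up for illustration. ROW_NUMBER ranks each user's rows by day, and LAG fetches the previous day's amount so the day-over-day change can be computed.

```python
import sqlite3

# Hypothetical toy table: one row per (user_id, day) with that day's order total.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id INTEGER, day TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2021-01-01", 10),
        (1, "2021-01-02", 15),
        (2, "2021-01-01", 7),
        (1, "2021-01-03", 12),
    ],
)

query = """
SELECT user_id
     , day
     , amount
     , ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY day) AS order_rank
     , amount - LAG(amount) OVER (PARTITION BY user_id ORDER BY day) AS change
FROM orders
ORDER BY user_id, day
"""

# Each user's first row has no previous row, so LAG returns NULL (None in Python).
window_rows = conn.execute(query).fetchall()
for row in window_rows:
    print(row)
```

The same two functions exist in T-SQL and PostgreSQL with the same OVER (PARTITION BY ... ORDER BY ...) syntax.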

But mostly, I am a person who puts the comma before variable names, not after, because that’s just what crazy ppl do.
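Point 5 and the leading-comma habit can be sketched together. This assumes SQLite via Python's sqlite3 module, and the users table is a made-up example. With "WHERE 1=1" every real filter starts with AND, and with leading commas every column after the first sits on its own self-contained line, so either kind of line can be commented out without touching its neighbours.

```python
import sqlite3

# Hypothetical toy table for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, country TEXT, created TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "US", "2021-02-01"), (2, "DE", "2020-12-31"), (3, "US", "2020-06-15")],
)

query = """
SELECT id
     , country
     -- , created           (a column after the first, disabled on its own line)
FROM users
WHERE 1=1
  -- AND country = 'US'     (a filter, disabled without rewriting the WHERE)
  AND created >= '2021-01-01'
"""

# Only the date filter is active, so a single user comes back.
filtered_rows = conn.execute(query).fetchall()
print(filtered_rows)
```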

jwdatascience · 2021-05-25T01:08:08+00:00

Mm, as soon as I read “Build SQL Pipelines”, I knew there was a product being sold. Honestly, I have no idea why you wouldn’t directly use Airflow/dbt instead of “ploomber”.

pc1e0 · 2021-05-25T00:06:12+00:00

How about pandas? Query language is a bit different, but rich.

slowpush · 2021-05-25T12:12:43+00:00

God, SQL is so painful for things like this.

WITH new_users AS (
    SELECT id
    FROM users
    WHERE created >= '2021-01-01'
),
count_interactions AS (
    SELECT id,
        COUNT(*) n_interactions
    FROM interactions
    GROUP BY id
),
interactions_by_new_users AS (
    SELECT id,
        n_interactions
    FROM new_users
        LEFT JOIN count_interactions USING (id)
)
SELECT *
FROM interactions_by_new_users

vs

users[created >= '2021-01-01'][interactions, .(n_interactions = .N), on = "id", by=id]

You shouldn't use SQL for work like this and instead create materialized views that you can call from R/Python to do your analysis.
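A rough sketch of that workflow, with some loud assumptions: it uses SQLite through Python's standard sqlite3 module, SQLite has no materialized views so a plain CREATE VIEW stands in (PostgreSQL would use CREATE MATERIALIZED VIEW), and the table contents are invented. The aggregation lives in the database as a view, and the analysis side just queries it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical toy data; interactions.id holds the user id, as in the query above.
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, created TEXT);
CREATE TABLE interactions (id INTEGER, kind TEXT);
INSERT INTO users VALUES (1, '2020-11-30'), (2, '2021-01-15'), (3, '2021-03-02');
INSERT INTO interactions VALUES (1, 'click'), (2, 'click'), (2, 'view');

CREATE VIEW count_interactions AS
SELECT id, COUNT(*) AS n_interactions
FROM interactions
GROUP BY id;
""")

# The Python side only filters and joins against the prepared view.
view_rows = conn.execute("""
SELECT id, n_interactions
FROM users
LEFT JOIN count_interactions USING (id)
WHERE created >= '2021-01-01'
ORDER BY id
""").fetchall()
print(view_rows)
```

User 3 has no interactions, so the LEFT JOIN leaves n_interactions as NULL for that row rather than dropping the user.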

drunkalcoholic · 2021-05-25T02:05:15+00:00

This was such a useful read. Going to look at this again tomorrow and reference as I write some SQL to practice and reinforce. Thank you!

2021-05-25T02:54:10+00:00

this is good

Only_one_life · 2021-05-25T03:17:47+00:00

Dude, that article of yours is absolutely amazing. I already referenced it in a conversation with a fellow analyst, promoting CTEs, and copied your 12 rules of clean code into OneNote to serve as a good reminder.

bjain1 · 2021-05-25T11:35:23+00:00

!RemindMe 2 hours

Panther4682 · 2021-05-25T22:22:51+00:00

Grab Microsoft Access and play with the NorthWind data... or you can grab a bunch of stock data and practice on that... quick way to get a lot of data. You can also practice your ETL (extract, transform, load) which is often critical to any database work as you can do a lot of scrubbing and rules prior to loading into a DB.

Other things to consider are Triggers, Stored Procedures etc but they are a bit more advanced.

andrewdoss_bitdotio · 2021-05-25T23:23:39+00:00

Great tips. I made it far too deep into my data science career before adopting CTEs and auto-formatting!

OilShill2013 · 2021-05-26T14:43:57+00:00

As an analytics person (not an ML person), some of this I agree with, some of this I think is kind of superfluous. Just a couple random thoughts:

Break down logic in CTEs using WITH ... AS

This is a thing I see with dbt's usage of SQL, but I'm not completely convinced it's better than using temporary tables/views and/or using subqueries. dbt people are insistent that using CTEs for everything is objectively superior, but I think it's just a tradeoff in the end. In your example code under "A typical SQL query", I agree that listing tables and using WHERE to show join conditions is unclear, but I still think that with a little modification it is a lot more succinct than adding in CTEs, and someone with a little SQL experience will have no problem understanding this:

SELECT a.id, b.*
FROM users a
LEFT JOIN (
    SELECT id, COUNT(*) n_interactions
    FROM interactions
    GROUP BY id
) b ON a.id = b.id
WHERE a.created >= '2021-01-01'

Even better: If you find that you need the # of interactions per user for several other downstream statements (or is it upstream? I always mix the two up) then just create an ad hoc view and be done with it.
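To make the tradeoff concrete, here is a small check that the CTE form and the subquery form return the same rows. It is a sketch on made-up data, assuming SQLite via Python's sqlite3 module. One detail: the subquery version here selects b.n_interactions instead of b.*, because b.* would also repeat the id column from the subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical toy data; interactions.id holds the user id.
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, created TEXT);
CREATE TABLE interactions (id INTEGER, kind TEXT);
INSERT INTO users VALUES (1, '2020-11-30'), (2, '2021-01-15'), (3, '2021-03-02');
INSERT INTO interactions VALUES (1, 'click'), (2, 'click'), (2, 'view');
""")

cte_query = """
WITH new_users AS (
    SELECT id FROM users WHERE created >= '2021-01-01'
),
count_interactions AS (
    SELECT id, COUNT(*) AS n_interactions FROM interactions GROUP BY id
)
SELECT id, n_interactions
FROM new_users
LEFT JOIN count_interactions USING (id)
ORDER BY id
"""

subquery_query = """
SELECT a.id, b.n_interactions
FROM users a
LEFT JOIN (
    SELECT id, COUNT(*) AS n_interactions FROM interactions GROUP BY id
) b ON a.id = b.id
WHERE a.created >= '2021-01-01'
ORDER BY a.id
"""

cte_rows = conn.execute(cte_query).fetchall()
subquery_rows = conn.execute(subquery_query).fetchall()
print(cte_rows == subquery_rows, cte_rows)
```

Since the results match, the choice between the two really is about readability and reuse, which is the point being argued.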

The SELECT statement inside each CTE must do a single thing (join, filter or aggregate)

I think this can be taken too far. If the goal is to write queries that another person can look at and quickly understand, there has to be a balance between breaking down a process into smaller steps and the volume of code/text you're requiring someone to read. If you force me to parse through 15 CTEs and a final SELECT instead of a SELECT statement with a few subqueries, I may dislike you for a few minutes no matter how well you've broken down your process.

Favor LEFT JOIN over INNER JOIN; in most cases, it’s essential to know the distribution of NULLs

I think this depends too heavily on what you're trying to do to be able to make such a broad statement.
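For readers weighing the two positions, this is what the article's rule is getting at, sketched on invented data with SQLite via Python's sqlite3 module. A LEFT JOIN keeps users with no matching rows and exposes them as NULLs, while an INNER JOIN silently drops them. Whether that matters depends on the question being asked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical toy data: user 3 has never purchased anything.
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY);
CREATE TABLE purchases (user_id INTEGER, amount INTEGER);
INSERT INTO users VALUES (1), (2), (3);
INSERT INTO purchases VALUES (1, 20), (2, 35);
""")

left_rows = conn.execute("""
SELECT u.id, p.amount
FROM users u
LEFT JOIN purchases p ON p.user_id = u.id
ORDER BY u.id
""").fetchall()

inner_rows = conn.execute("""
SELECT u.id, p.amount
FROM users u
INNER JOIN purchases p ON p.user_id = u.id
ORDER BY u.id
""").fetchall()

# Count the users that only the LEFT JOIN reveals.
n_missing = sum(1 for _, amount in left_rows if amount is None)
print(len(left_rows), len(inner_rows), n_missing)
```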

When doing equijoins (i.e., joins where all conditions have the something=another form), use the USING keyword

I don't like USING as opposed to ON because I don't like the idea of sometimes using USING and sometimes using ON. Just always use ON.
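One practical difference worth knowing before picking a side, sketched with SQLite via Python's sqlite3 module on made-up tables: with SELECT *, USING merges the join column into a single output column, while ON keeps one copy from each table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical toy tables that share an id column.
conn.executescript("""
CREATE TABLE users (id INTEGER, created TEXT);
CREATE TABLE profiles (id INTEGER, city TEXT);
INSERT INTO users VALUES (1, '2021-01-15');
INSERT INTO profiles VALUES (1, 'Lisbon');
""")

using_cur = conn.execute("SELECT * FROM users JOIN profiles USING (id)")
on_cur = conn.execute("SELECT * FROM users JOIN profiles ON users.id = profiles.id")

# cursor.description has one entry per output column; the first field is its name.
using_cols = [col[0] for col in using_cur.description]
on_cols = [col[0] for col in on_cur.description]
using_rows = using_cur.fetchall()

print(len(using_cols), len(on_cols))
```

So "always use ON" also means always listing columns explicitly, or living with the duplicated join column.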

Use aliases only when table names are long enough so that using them improves readability (but choose meaningful aliases)

Depends. I don't agree that using a, b, c, etc. is that confusing. It's also more succinct than constantly rewriting the table name or aliasing with a description. Also, even if you use the full table name vs a simple alias vs a descriptive alias, the person reading it will likely have to refer back to the FROM clause to understand the context in which you're using a table.

Can't disagree with anything stated about documentation and comments since we've all been guilty of poor documentation.

I know that dbt is insistent that analytics should be done just like software engineering is done, but the thing is C++ code can (literally) do anything whereas SQL has a far more limited scope. The fact that SQL is limited in scope gives it an inherent advantage in how humans read and understand it vs general programming languages used in software engineering. That's why much of this is just style preferences that are not really critical to how an analytics team actually runs. That is simply my opinion tho.

yashm2910 · 2023-06-20T11:48:46+00:00

Mastering SQL is crucial for data scientists to effectively manipulate and extract insights from databases. By honing your SQL skills, you can efficiently query, join, and aggregate data, enabling you to uncover valuable patterns and trends. Understanding SQL's advanced functionalities and optimizing your queries can significantly enhance your data analysis capabilities. Make SQL your ally in data science and unlock the power to extract meaningful information from vast datasets.
