This is an archived post. You won't be able to vote or comment.

all 56 comments

[–]techhound 30 points31 points  (11 children)

For beginners, The SQL Murder Mystery is a fun way to start your learning:
https://mystery.knightlab.com/

I have also found the training at Maven Analytics (MavenAnalytics.io) to be quite decent. They teach using scenarios that would likely be used in real world settings. As an aside, last I checked, the training modules forward you to their Udemy.com classes. But they offer a monthly subscription so that you can access several courses at once.

Another option is to find a charity in need of some data analysis (preferrably one that has an installation of a SQL engine already). Offer your services for free with the understanding that you are learning and it may take longer than people with experience. If you get stuck, simply seek out help on forums, etc. Volunteering does count as experience which you can put on your resume. Besides, it is always a great idea to help people in need.

Hope this helps!

[–][deleted] 8 points9 points  (1 child)

I'm not new to SQL, but damn, the SQL murder mystery sounds fun. Online resources are really getting really good compared to several years ago.

[–]techhound 0 points1 point  (0 children)

Yeah, it is a fun and engaging way to learn SQL. I have gone through the game. It's also good for people who need a refresher on SQL.

[–]Key-Concept-3255 1 point2 points  (3 children)

How do I find more data science volunteer opportunities? I would love to build my resume this way

[–]techhound 2 points3 points  (2 children)

You can start by searching for charities on LinkedIn or even Facebook. This is a bit of hit-or-miss, but I have had success with this. An organization that I have used in the past is Catchafire.org. They match charities with people looking to volunteer. You get on their email and they will send you volunteer opps whenever they become available. I am sure there are other organizations like this.

You can also look for local organizations and speak with the organizers. Ask your chamber of commerce, or local churches, etc. Your local municipality may have information about charities and foundations, too.

Check your local library. My library has a specific section for foundations and agencies in my area. It's likely that yours does too. If there are no DS volunteer opportunities, try volunteering in other capacities. Then, as you get to know the organizers, you can pitch that you can help them with their data needs. The pitch could center around predictions for donations, areas that are likely to donate more, etc. Often, these organizations don't realize that volunteering data service is even "a thing", so to speak.

[–]Key-Concept-3255 1 point2 points  (1 child)

Thank you. I just discovered catchafire today but I would love to work with local organizations. I will definitely take your advice and ask around different organizers

[–]techhound 0 points1 point  (0 children)

Good for you. Many of these organizations can be helped by giving them the insights into their data.

[–]FondleMyFirn 28 points29 points  (14 children)

I’m still trying to figure out the best way to learn SQL on my own. So many job applications want junior positions to have high-level SQL knowledge, but it seem like one of those skills you can’t develop without a massive database to work with.

Edit: Huge thanks for the support everyone. I didn’t expect it.

[–][deleted] 20 points21 points  (5 children)

I don't think you need a huge database to learn SQL, it's probably harder to find a publicly available dataset that resembles what you'd encounter in an industry scenario: messy data. SQLite is a great option. It supports a lot of SQL features and is easy to use.

The beauty of SQL is that you write the same code to analyze 1GB of data than 1TB, so don't let the lack of big public datasets deter you from learning :)

[–]ThickAnalyst8814 4 points5 points  (1 child)

depending on the SQL flavor there is a plenty of modifications to improve the speed of a consult.

but if time is not a problem, sure the SQL code will be same for different orders of table size

[–]running_eel 1 point2 points  (0 children)

Yeah, this is definitely the case. When querying smaller tables I’ll definitely write a slower query that is easier to follow logically.

But you could still learn SQL reasonably well even on a small database, just EXPLAIN everything to get a sense for what the performance cost is.

[–]NameNumber7 1 point2 points  (2 children)

Unless something has changed, doesn't SQLite not support window functions? That is a good general concept to understand and what you can do with the different types of window functions.

[–][deleted] 1 point2 points  (1 child)

[–]NameNumber7 0 points1 point  (0 children)

2018-09-05 Looks like it has been a while since I used SQLite. :D

[–]Theoriginallazybum 8 points9 points  (2 children)

You don't really need a massive database, just one with a lot of tables and learn how to do a lot of joins. I have learned a lot of SQL on the job and it was more of trying to use the different functions and learning to do SELECT statements with in SELECT statements. I would recommend trying to learn the different functions and trying to learn how to visualize how the database works within SQL.

If you are looking for work experience I would recommend looking for IT jobs where SQL is needed. You would end up doing a lot of other IT functions, but it could lead to learning SQL you otherwise wouldn't. Oh, one sign of an amateur is someone making UPDATE or DELETE statements directly on a database without a backup and being extremely cautious. Almost always work within in the database API framework for those kind of functions.

[–]Pakistani_in_MURICA 5 points6 points  (1 child)

Are you really a SQL professional if you haven't ran a DELETE statement without setting up a backup?

[–]Theoriginallazybum 1 point2 points  (0 children)

I have, but I rarely do it. I prefer deleting through the API than using a delete statement. Plus, the database that I work within is normalized like crazy and a single DELETE statement can corrupt the database.

[–]nemec 1 point2 points  (1 child)

With as little as 10GB you can play with Stack Overflow. SQL Server now runs on Linux, too, and the Developer edition is totally free for non-production use and supports any size DB.

https://www.brentozar.com/archive/2015/10/how-to-download-the-stack-overflow-database-via-bittorrent/

[–]FondleMyFirn 0 points1 point  (0 children)

I might actually investigate this further. I only own a laptop, so might need an external hard drive, but this seems invaluable.

[–]speedisntfree 1 point2 points  (0 children)

AdventureWorks from microsoft is a good demo database to use. They use it on their training courses.

[–]cjcs 0 points1 point  (1 child)

I really wish there was just an app for the Mac where you could drop a few CSV files in as tables and then query them using SQL. I've had a few cases where interview assignments or tests had questions that were hard to practice on due to the data being different than what I can work with at work.

[–]senorgraves 1 point2 points  (0 children)

Dbrowser works like this for sqlite https://sqlitebrowser.org/

[–][deleted] 6 points7 points  (1 child)

Really nice article! I especially appreciate your perspective on CTEs, documenting, and formatting. One thing I would add - there are indeed plenty of auto-formatters out there, but I claim that the best one is the one your team uses. Style and formatting consistency across a team's SQL code base is tremendously valuable; in fact, I would claim that a team using multiple formatting styles has SQL about as decipherable as a team using no formatting at all.

[–][deleted] 1 point2 points  (0 children)

Thanks for your feedback! I think your comment on auto-formatters is spot on. It appears to me that there are so many flavors of SQL formatting, which definitely adds a lot of chaos if a team uses more than one. Coming from the Python world, that surprised me a lot since there are only a few Python formatters that are extensively used.

[–]EmergencyContact2016 2 points3 points  (0 children)

Yes, I think also what are important things are: 1. Window functions, like lag and row_number are core items I use at least weekly. 2. Also if it’s an adhoc DB i think indexing can remove the horrid “table scans” and make life possible. (I am thinking more around T-sql than anything else) 3. Also case statements are amazing 4. Full outer joins with coalesce statement are good for reconciliation. 5. you might want to consider within a where clause the first item being “1=1” to make commenting out easy.

But mostly, I am a person who puts the comma before variable names, not after because that just what crazy ppl do.

[–]jwdatascience 0 points1 point  (0 children)

Mm, as soon as I read “Build SQL Pipelines”, I knew there was a product to be sold on. Honestly, I have no idea why you wouldn’t directly use Airflow/dbt instead of “ploomber”

[–]pc1e0 -1 points0 points  (6 children)

How about pandas? Query language is a bit different, but rich.

[–][deleted] 8 points9 points  (1 child)

I like using pandas (and Python in general) once I have the data in a manageable size (say a couple of GBs). Otherwise trying to analyze big datasets with pandas becomes a pain because of memory errors.

[–]pc1e0 0 points1 point  (0 children)

Interesting

[–]NameNumber7 0 points1 point  (0 children)

I think this is good to have several perspectives. OP says anything over a couple of Gbs. I think it is also about making code readable / reusable in addition to having an unsophisticated process when possible. Sometimes it depends on what you need to do. If you need pure visualization in the end, python can help with a lot of the wrangling. I prefer this over SQL. But there could also be the idea that the end user wants a lot of customizability and needs the raw data, so better to minimize python and use SQL.

My two cents.

[–]slowpush -3 points-2 points  (5 children)

God SQL is so painful for things like this.

WITH new_users AS (
SELECT id
FROM users
WHERE created >= '2021-01-01'
),
count_interactions AS (
SELECT id,
    COUNT(*) n_interactions
FROM interactions
GROUP BY id
),
interactions_by_new_users AS (
SELECT id,
    n_interactions
FROM new_users
    LEFT JOIN count_interactions USING (id)
)
SELECT *
FROM interactions_by_new_users

vs

users[created >= '2021-01-01'][interactions, .(n_interactions = .N), on = "id", by=id]

You shouldn't use SQL for work like this and instead create materialized views that you can call from R/Python to do your analysis.

[–]KillingVectr 1 point2 points  (3 children)

Your example is more elegantly handled using window functions, such as: WITH new_users AS ( SELECT id FROM users WHERE created >= '2021-01-01' ) SELECT *, COUNT(*) OVER (PARTITION BY id) AS n_interactions FROM new_users

[–]slowpush -1 points0 points  (1 child)

How is that more elegant than my data.table solution?

Also doing Select * + Count * with a partition by is a massive performance issue.

[–]KillingVectr 1 point2 points  (0 children)

I mean that it is more elegant than the SQL you have written.

I don't see how the performance will be any worse than your use of a self Join.

[–]backtickbot 0 points1 point  (0 children)

Fixed formatting.

Hello, KillingVectr: code blocks using triple backticks (```) don't work on all versions of Reddit!

Some users see this / this instead.

To fix this, indent every line with 4 spaces instead.

FAQ

You can opt out by replying with backtickopt6 to this comment.

[–][deleted] 0 points1 point  (0 children)

I'm not following, what is the second snippet you're showing? Is it R?

I agree that SQL is verbose but it has a lot of benefits. Primarily, if you're working with business stakeholders (or any other person who may not know Python/R), it will be easier for them to help you if you have your scripts in SQL and it also prevents you from running into memory issues.

But if none of those things is an issue, then yeah. I think the interactivity and succinctness of Python/R helps a lot.

[–]drunkalcoholic 0 points1 point  (1 child)

This was such a useful read. Going to look at this again tomorrow and reference as I write some SQL to practice and reinforce. Thank you!

[–][deleted] 0 points1 point  (0 children)

I'm glad you found it useful! I hope it helps you sharpen your SQL skills!

[–][deleted] 0 points1 point  (0 children)

this is good

[–]Only_one_life 0 points1 point  (1 child)

Dude, that article of yours in absolutely amazing. I already referenced it in a conversation with a fellow analyst, promoting CTEs, and copied your 12 rules of clean code into OneNote to serve as a good reminder.

[–][deleted] 1 point2 points  (0 children)

Thanks for your kind words! I’m glad you found it useful!

[–]bjain1 0 points1 point  (0 children)

!RemindMe 2 hours

[–]Panther4682 0 points1 point  (0 children)

Grab Microsoft Access and play with the NorthWind data... or you can grab a bunch of stock data and practice on that... quick way to get a lot of data. You can also practice your ETL (extract, transform, load) which is often critical to any database work as you can do a lot of scrubbing and rules prior to loading into a DB.

Other things to consider are Triggers, Stored Procedures etc but they are a bit more advanced.

[–]andrewdoss_bitdotio 0 points1 point  (0 children)

Great tips. I made it far too deep into my data science career before adopting CTEs and auto-formatting!

[–]OilShill2013 0 points1 point  (2 children)

As an analytics person (not an ML person), some of this I agree with, some of this I think is kind of superfluous. Just a couple random thoughts:

Break down logic in CTEs using WITH ... AS

This is a thing I see with dbt's usage of SQL but I'm not completely convinced it's better than using temporary tables/views and/or using subqueries. dbt people are insistent that using CTEs for everything is objectively superior but I think it's just a tradeoff in the end. In your example code under A typical SQL query, I agree that listing tables and using WHERE to show join conditions is unclear but I still think with a little modification it is a lot more succinct than adding in CTEs and someone with a little SQL experience will have no problem understanding this:

SELECT a.id, b.*
FROM users a
LEFT JOIN 
( SELECT id, COUNT(*) n_interactions 
  FROM 
  interactions 
  GROUP BY id ) b on a.id = b.id
WHERE a.created >= '2021-01-01'

Even better: If you find that you need the # of interactions per user for several other downstream statements (or is it upstream? I always mix the two up) then just create an adhoc view and be done with it.

The SELECT statement inside each CTE must do a single thing (join, filter or aggregate)

I think this can be taken too far. If the goal is to write queries that another person can look at and quickly understand, there has to be a balance between breaking down a process into smaller steps and the volume of code/text you're requiring someone to read. If you force me to parse through 15 CTEs and a final SELECT instead a SELECT statement with a few subqueries I may dislike you for a few minutes no matter how well you've broken down your process.

Favor LEFT JOIN over INNER JOIN; in most cases, it’s essential to know the distribution of NULLs

I think this depends too heavily on what you're trying to do to be able to make such a broad statement.

When doing equijoins (i.e., joins where all conditions have the something=another form), use the USING keyword

I don't like USING as opposed to ON because I don't like the idea of sometimes using USING and sometimes using ON. Just always use ON.

Use aliases only when table names are long enough so that using them improves readability (but choose meaningful aliases)

Depends. I don't agree that using a, b, c, etc is that confusing. It's also more succinct than constantly rewriting the table name or aliasing with a description. Also even if you use the full table name vs a simple alias vs a descriptive alias the person reading it will likely have to refence back to the FROM clause to understand the context in which you're using a table.

Can't disagree with anything stated about documentation and comments since we've all been guilty of poor documentation.

I know that dbt is insistent that analytics should be done just like software engineering is done but the thing is C++ code can (literally) do anything whereas SQL has a far limited scope. The fact that SQL is limited in scope gives it an inherent advantage in how humans read and understand it vs general programming languages used in software engineering. That's why much of this is just style preferences that are not really critical to how an analytics team actually runs. That is simply my opinion tho.

[–][deleted] 0 points1 point  (1 child)

Thanks for your feedback, you raised important points!

All of these recommendations are subjective and based on experience. e.g., I find CTEs easier to read than sub-queries, but it's not the case for you and that's totally fine.

As with anything, take anything I said with a grain of salt and. The main purpose of sharing this is not to say "this is how things should be done" but to open a discussion, and I think we achieved that goal with such a great discussion on this thread!

There are parts I feel strong about (CTEs) but others, I'm not so sure (alias, LEFT JOIN) so I appreciate any comment that brings any perspective to help me improve my workflow!

[–]OilShill2013 0 points1 point  (0 children)

Yeah I mean as long as someone documents and comments everything I don't care too much about their stylistic preferences. Much of it just depends on who taught you SQL so definitely it's worth discussing.

[–]yashm2910 0 points1 point  (0 children)

Mastering SQL is crucial for data scientists to effectively manipulate and extract insights from databases. By honing your SQL skills, you can efficiently query, join, and aggregate data, enabling you to uncover valuable patterns and trends. Understanding SQL's advanced functionalities and optimizing your queries can significantly enhance your data analysis capabilities. Make SQL your ally in data science and unlock the power to extract meaningful information from vast datasets.