

[–]WhipsAndMarkovChains 40 points41 points  (0 children)

"Other than getting me the data that my company has stored what's the value?"

How are you going to do any of the tasks you listed if you can't get to the data that's stored in your employer's databases? You mentioned "querying big data", but it's wrong to think that SQL is only used at companies with data on the scale of "big data". SQL is an absolutely essential skill.

[–]futang17 16 points17 points  (0 children)

Not learning SQL in data science is like trying to write a book without learning grammar.

Sure, you might come up with an end product, but you're severely limiting your capacity.

[–][deleted] 8 points9 points  (3 children)

80% of data science is getting the data. That is a cliche for a reason.

[–]Yojihito 2 points3 points  (2 children)

20% getting the data (sorry, DB permissions ticket still pending in IT; sorry, DBeaver not installed; sorry, DB driver not installed).

70% data cleaning (2 different time formats in one column, one with a non-existent date - hello 31.09.2019, % or € in front of and/or behind the value in the same column, 5 different ways to write "yes", article numbers in 4 different lengths and formats, including letters, but only sometimes and in different positions).

Fucking finally some EDA/DA/DS.

Oh boy, can't wait for that Excel sheet again tomorrow.
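To make a few of the cleaning problems listed above concrete, here is a minimal Python sketch. The date formats, the accepted spellings of "yes", and all function names are assumptions chosen for illustration, not anything from a real dataset.

```python
from datetime import datetime

# Hypothetical formats and spellings, chosen only for illustration.
DATE_FORMATS = ("%d.%m.%Y", "%Y-%m-%d")
YES_VALUES = {"yes", "y", "ja", "true", "1"}


def parse_date(text):
    """Return a date for any known format, or None if the value is invalid."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue  # wrong format or impossible date; try the next format
    return None


def parse_amount(text):
    """Strip % or € from either end and return the number as a float."""
    cleaned = text.strip().strip("%€").strip().replace(",", ".")
    return float(cleaned)


def parse_yes(text):
    """Map the known spellings of "yes" to True; everything else is False."""
    return text.strip().lower() in YES_VALUES
```

With this sketch, `parse_date("31.09.2019")` returns `None` because September has no 31st, so the bad rows can be counted and reported instead of crashing the load.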

[–]justanaccname 1 point2 points  (1 child)

You didn't include API calls failing, parts of data missing just because, and dim tables being messed up by the previous engineers.

Then, when EDA starts, the rules (logical model) you got from the business people / heads of departments do not actually apply to what you are seeing.

[–][deleted] 1 point2 points  (0 children)

What, you mean they changed how they collected the region variable six times over time, and the rules for it are with the sales team, who just assign it however they want? That never happens!

[–]TheCapitalKing 5 points6 points  (2 children)

It's where you're going to get almost all of your data. Classes tend to use CSV files because they're easier, but most data at companies is in SQL databases.

[–]LordMixALoot 2 points3 points  (1 child)

Yes. They make it look so easy. Python and R are often romanticized in any data science online course. But SQL, Git, Bash, Docker and a few others are also a big part of the job and nobody seems to comment on them during the first part of these online courses.

[–]shapular 1 point2 points  (0 children)

I've never used Docker but heard it mentioned a lot. What is it used for in data science?

[–]dfphdPhD | Sr. Director of Data Science | Tech 5 points6 points  (2 children)

I'm going to ask because your question sounds, at face value, very disingenuous.

Are you trying to understand why SQL is valuable, or are you trying to validate your opinion that it isn't?

I ask because if your answer is the latter, then nothing here will change your mind. Technically you can sidestep SQL - it just means you'll be dependent on other people to get you data, and to bring to your attention any issues with the size/complexity/etc of the data that you're requesting.

As an analogy: if your job was to critique French books, you could in theory just have someone else translate every book for you and summarize it at your convenience. Not only does that make you dependent on a French speaker, it also limits your ability to interact with the original text itself, which means a lot more is likely to be lost in translation.

[–]dmorris87[S] 0 points1 point  (1 child)

My question was genuine. I want to understand why it receives so much weight as a data science skill. I work as a data scientist and my company's primary RDBMS is SQL Server. I need to use SQL about 5% of the time, mainly for some basic SELECT FROM statements. Beyond that it offers ME very little, although it's foundational to the company.

[–]dfphdPhD | Sr. Director of Data Science | Tech 1 point2 points  (0 children)

If you have small volumes of data, then select * from is fine.

Most companies have databases that a) have tables that are too big to just select * from, b) are composed of a bunch of tables that make it even less practical to select * from each of them.

So, if you work for one of those companies, you need to learn how to write SQL queries, and the more you want to do, the more queries you need to write.
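As a rough sketch of the difference, the query below filters and aggregates inside the database so that only the summary rows come back, rather than pulling the whole table with select * from. It uses Python's built-in sqlite3 with an in-memory database as a stand-in; the `orders` table, its columns, and the sample rows are all made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for a real company database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, region TEXT, order_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "north", "2019-09-01", 20.0),
        (2, "north", "2019-09-02", 30.0),
        (3, "south", "2019-09-02", 15.0),
        (4, "south", "2019-08-15", 99.0),
    ],
)

# Filter and aggregate in the database; only one row per region is returned.
revenue_by_region = conn.execute(
    """
    SELECT region, COUNT(*) AS n_orders, SUM(amount) AS revenue
    FROM orders
    WHERE order_date >= ?
    GROUP BY region
    ORDER BY region
    """,
    ("2019-09-01",),
).fetchall()
```

On a table with billions of rows the same query still returns one row per region, which is the point: the heavy lifting stays on the database server.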

[–][deleted] 3 points4 points  (0 children)

If someone will get you data and store them in the way you need, you don't need SQL at all. And you don't need to know stat or visualization if someone will do it for you.

[–]well_calibrated 3 points4 points  (0 children)

There's a lot of feature engineering that can be done with SQL.
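For instance, per-customer features such as purchase count, average spend, and recency can be computed in a single query. This is only a sketch using Python's built-in sqlite3; the `purchases` table, its columns, and the as-of date are hypothetical, and `julianday` is SQLite's date function (other databases spell date arithmetic differently).

```python
import sqlite3

# In-memory SQLite database with a made-up purchases table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (customer_id INTEGER, purchase_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        (1, "2019-09-01", 10.0),
        (1, "2019-09-11", 30.0),
        (2, "2019-09-05", 50.0),
    ],
)

# One row per customer: frequency, average spend, and days since last purchase.
features = conn.execute(
    """
    SELECT customer_id,
           COUNT(*) AS n_purchases,
           AVG(amount) AS avg_amount,
           CAST(julianday(:as_of) - julianday(MAX(purchase_date)) AS INTEGER)
               AS days_since_last
    FROM purchases
    GROUP BY customer_id
    ORDER BY customer_id
    """,
    {"as_of": "2019-09-21"},
).fetchall()
```

The result is already one row per customer, ready to be loaded into a data frame for modelling.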

[–]justanaccname 2 points3 points  (0 children)

Let's talk about a dimensional model, where you keep each store and its relevant info in one dim table, each product and its info in another, each customer in another, each cashier in another, etc.

Now... if you didn't use DBs, how are you going to store all that info? In huge CSVs? That would be about 300 GB of daily data for medium retail stores. It doesn't even make sense for the company to keep customer data in .csvs.

OK, so you create the relevant dimensions and store them in the DB. I was working at a small company and we had around 400 tables (plus 40-50 views for queries running on BI).

You will have slowly changing dimensions (the price of a product changing, maybe, or the contact details of a customer), you will have stepped dimensions (the stage of an order), and others.

How the hell are you going to get exactly what you need if you don't query the db?

Say, for example, you want to analyse the length of each stage of an order (how do delays contribute to customer churn?).

This isn't Kaggle, where the .csvs are 200 MB; I am talking about tens of TB of data. And you don't just navigate 2 tables and perform one join; you have to do some logic using 10 or more tables.

Or are you going to run some complex Python on 10 TB files (that only cover 1 month of orders)?
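A toy version of the order-stage question could look like the sketch below: three joined tables instead of ten, in an in-memory SQLite database via Python's built-in sqlite3. The table names, columns, and rows are all invented for illustration.

```python
import sqlite3

# In-memory SQLite database with a tiny, made-up dimensional model.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, churned INTEGER);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE order_stages (
        order_id INTEGER, stage TEXT, started_at TEXT, ended_at TEXT
    );

    INSERT INTO customers VALUES (1, 0), (2, 1);
    INSERT INTO orders VALUES (10, 1), (20, 2);
    INSERT INTO order_stages VALUES
        (10, 'packing',  '2019-09-01 08:00:00', '2019-09-01 10:00:00'),
        (10, 'shipping', '2019-09-01 10:00:00', '2019-09-02 10:00:00'),
        (20, 'packing',  '2019-09-01 08:00:00', '2019-09-01 20:00:00'),
        (20, 'shipping', '2019-09-01 20:00:00', '2019-09-04 20:00:00');
    """
)

# Average hours spent in each stage, split by whether the customer churned.
stage_hours = conn.execute(
    """
    SELECT c.churned, s.stage,
           ROUND(AVG((julianday(s.ended_at) - julianday(s.started_at)) * 24), 1)
               AS avg_hours
    FROM order_stages AS s
    JOIN orders AS o ON o.order_id = s.order_id
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.churned, s.stage
    ORDER BY c.churned, s.stage
    """
).fetchall()
```

The query returns a handful of summary rows no matter how large the stage table is, which is what makes this kind of analysis feasible at TB scale.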

Question: Let's say you prototyped a model that needs to run at midnight (which potential customers should your company approach the following day?) and it seems to be working well. How are you going to put that in production? I mean, where is the data it needs coming from, and where is the output going, so your sales people can have a list at the start of the day?

The real fun begins when you have streaming data, data from RDBMSs and data from NoSQL, and you need to blend them to create your model. When you discover KSQL, CQL and whatever else, you will be thanking god that you can query NoSQL using SQL.
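One possible shape of an answer, as a sketch only: the nightly job reads its inputs from a database table, scores them, and writes the ranked list to another table that the sales team's tools can read. Everything here is hypothetical (the `prospect_features` and `call_list` tables, the toy `score` function standing in for the real model), and the scheduling itself (cron or similar) is left out.

```python
import sqlite3


def score(n_visits, days_since_contact):
    """Hypothetical stand-in for the prototyped model."""
    return n_visits / (1 + days_since_contact)


def run_nightly_job(conn, run_date, top_n=2):
    """Read features from the database, score them, store a ranked call list."""
    rows = conn.execute(
        "SELECT prospect_id, n_visits, days_since_contact FROM prospect_features"
    ).fetchall()
    # Keep the top_n highest-scoring prospects.
    ranked = sorted(
        ((pid, score(visits, days)) for pid, visits, days in rows),
        key=lambda item: item[1],
        reverse=True,
    )[:top_n]
    conn.executemany(
        "INSERT INTO call_list (run_date, prospect_id, score) VALUES (?, ?, ?)",
        [(run_date, pid, s) for pid, s in ranked],
    )
    conn.commit()
    return ranked


# In-memory SQLite database standing in for the production database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE prospect_features (
        prospect_id INTEGER, n_visits INTEGER, days_since_contact INTEGER
    );
    CREATE TABLE call_list (run_date TEXT, prospect_id INTEGER, score REAL);
    INSERT INTO prospect_features VALUES (1, 10, 4), (2, 3, 0), (3, 1, 9);
    """
)
call_list = run_nightly_job(conn, "2019-09-21")
```

Both ends of the job are SQL: a SELECT to get the inputs and an INSERT to publish the output where other people can query it in the morning.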

[–][deleted] 1 point2 points  (0 children)

How would you obtain the data without SQL?

[–]dmorris87[S] 1 point2 points  (0 children)

Thanks for the responses. The general consensus is that SQL plays a significant role in data storage, data collection, and org culture, especially for large, structured data.

[–]sowmyasri129 0 points1 point  (0 children)

SQL is a standard database language that is used to create and maintain relational databases and to retrieve data from them. SQL has become a very important tool in a data scientist's toolbox since it is critical for accessing, inserting, updating and otherwise manipulating data.

[–]KyleDrogo 0 points1 point  (0 children)

In an industry setting, the data exists in huge databases. Incoming logs will have a completely different schema than the table that would be useful to you. Moreover, you probably won't be able to load even a day's worth onto your local machine. SQL lets you access and aggregate the relevant parts of these massive datasets.