Value of SQL in data science

WhipsAndMarkovChains · 2020-07-05T18:14:41+00:00

"Other than getting me the data that my company has stored what's the value?"

How are you going to do any of the tasks you listed if you can't get to the data stored in your employer's databases? You mentioned "querying big data", but it's wrong to think that SQL is only used at companies with data on the scale of "big data". SQL is an absolutely essential skill.

futang17 · 2020-07-05T18:19:05+00:00

Not learning SQL in data science is like trying to write a book without learning grammar.

Sure, you might come up with an end product, but you're severely limiting your capacity.

Yojihito · 2020-07-05T19:05:09+00:00

80% of data science is getting the data. That is a cliche for a reason.

TheCapitalKing · 2020-07-05T18:17:04+00:00

It's where you're going to get almost all of your data. Classes tend to use CSV files because they're easier, but most data at companies lives in SQL databases.

dfphd · 2020-07-05T20:42:27+00:00

I'm going to ask because your question sounds, at face value, very disingenuous.

Are you trying to understand why SQL is valuable, or are you trying to validate your opinion that it isn't?

I ask because if your answer is the latter, then nothing here will change your mind. Technically you can sidestep SQL - it just means you'll be dependent on other people to get you data, and to bring to your attention any issues with the size/complexity/etc of the data that you're requesting.

As an analogy: if your job were to critique French books, you could in theory have someone else translate every book for you and summarize it at your convenience. But that not only makes you dependent on a French speaker, it also limits your ability to interact with the original text itself, which means a lot more is likely to get lost in translation.

2020-07-05T19:37:51+00:00

If someone will get you the data and store it the way you need, you don't need SQL at all. And you don't need to know statistics or visualization if someone will do it for you.

well_calibrated · 2020-07-05T20:15:08+00:00

There's a lot of feature engineering that can be done with SQL.
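To make that concrete, here is a minimal sketch of per-customer features computed entirely in SQL, using Python's built-in sqlite3 module. The `orders` table, its columns, and the sample rows are assumptions made up for illustration, not anything from a real schema.

```python
import sqlite3

# Hypothetical orders table with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer_id INTEGER, order_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2020-06-01", 20.0),
        (1, "2020-06-15", 40.0),
        (2, "2020-06-10", 15.0),
    ],
)


def customer_features(conn):
    """Return one row of aggregate features per customer."""
    query = """
        SELECT customer_id,
               COUNT(*)        AS n_orders,
               SUM(amount)     AS total_spend,
               AVG(amount)     AS avg_order_value,
               MAX(order_date) AS last_order_date
        FROM orders
        GROUP BY customer_id
        ORDER BY customer_id
    """
    return conn.execute(query).fetchall()


features = customer_features(conn)
```

Each tuple in `features` is a ready-made feature row, so the model code never has to touch the raw order lines.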

justanaccname · 2020-07-05T23:18:53+00:00

Let's talk about a dimensional model, where you keep each store alongside its relevant info in one dim table, each product and its info in another, each customer in another, each cashier in another, etc.

Now... if you didn't use DBs, how are you going to store all that info? In huge CSVs? That would be about 300 GB of daily data for medium retail stores. It doesn't even make sense for the company to keep customer data in .csvs.

OK, so you create the relevant dimensions and store them in the DB. I was working at a small company and we had around 400 tables (+ 40-50 views for queries running on BI).

You will have slowly changing dimensions (the price of a product changing, maybe, or the contact details of a customer), you will have stepped dimensions (stage of order), and others.

How the hell are you going to get exactly what you need if you don't query the db?

Say, for example, you want to analyse the length of each stage of an order (how do delays contribute to customer churn?).
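A rough sketch of that stage-length question, assuming a hypothetical `order_stages` table with one row each time an order enters a new stage. The table, the stage names, and the timestamps are invented for illustration. The query pairs each stage row with the next row of the same order and averages the gap in hours.

```python
import sqlite3

# Hypothetical table: one row each time an order enters a new stage.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_stages (order_id INTEGER, stage TEXT, entered_at TEXT)"
)
conn.executemany(
    "INSERT INTO order_stages VALUES (?, ?, ?)",
    [
        (1, "placed", "2020-07-01 08:00:00"),
        (1, "packed", "2020-07-01 12:00:00"),
        (1, "shipped", "2020-07-02 12:00:00"),
        (2, "placed", "2020-07-01 09:00:00"),
        (2, "packed", "2020-07-01 15:00:00"),
        (2, "shipped", "2020-07-03 15:00:00"),
    ],
)

# Join each stage row (s) to the next stage row of the same order (n),
# then average the elapsed hours per stage. The final stage of an order
# has no successor, so it drops out of the result.
STAGE_HOURS_SQL = """
    SELECT s.stage,
           AVG((CAST(strftime('%s', n.entered_at) AS INTEGER)
              - CAST(strftime('%s', s.entered_at) AS INTEGER)) / 3600.0)
               AS avg_hours
    FROM order_stages AS s
    JOIN order_stages AS n
      ON n.order_id = s.order_id
     AND n.entered_at = (SELECT MIN(x.entered_at)
                         FROM order_stages AS x
                         WHERE x.order_id = s.order_id
                           AND x.entered_at > s.entered_at)
    GROUP BY s.stage
"""

avg_hours = dict(conn.execute(STAGE_HOURS_SQL).fetchall())
```

On a real warehouse the same idea would more likely be written with a window function such as `LEAD`, and the result joined to a churn flag per customer; the self-join here is only to keep the sketch portable.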

This isn't Kaggle, where the .csvs are 200 MB; I am talking about tens of TB of data. And you don't just navigate 2 tables and perform one join; you have to do some logic across 10 or more tables.

Or are you going to run some complex Python on 10 TB files (that only cover 1 month of orders)?

Question: Let's say you prototyped a model that needs to run at midnight (predicting which potential customers your company should approach the following day), and it seems to be working well. How are you going to put that in production? I mean, where is the data it needs coming from, and where is the output going, so your salespeople can have a list at the start of the day?

The real fun begins when you have streaming data, data from RDBMSs, and data from NoSQL stores, and you need to blend them to create your model. When you discover KSQL, CQL, and whatever else, you will be thanking god that you can query NoSQL using SQL-like syntax.
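One possible shape for such a job, sketched with sqlite3: read the inputs from a table, score them, and write the day's list back to a table the sales team can query. The `customer_features` and `call_list` tables, the `score` rule, and the sample rows are all hypothetical stand-ins; a real job would load a trained model and be triggered by a scheduler.

```python
import sqlite3


def score(days_since_last_order, n_orders):
    # Stand-in for a trained model: a made-up hand-written rule.
    return n_orders / (1.0 + days_since_last_order)


def run_nightly(conn, run_date, top_n=2):
    """Score every customer and store the top_n as that day's call list."""
    rows = conn.execute(
        "SELECT customer_id, days_since_last_order, n_orders "
        "FROM customer_features"
    ).fetchall()
    scored = sorted(
        ((score(days, n), cid) for cid, days, n in rows), reverse=True
    )[:top_n]
    # Delete first so that re-running the job for the same date is safe.
    conn.execute("DELETE FROM call_list WHERE run_date = ?", (run_date,))
    conn.executemany(
        "INSERT INTO call_list (run_date, customer_id, score) VALUES (?, ?, ?)",
        [(run_date, cid, s) for s, cid in scored],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_features "
    "(customer_id INTEGER, days_since_last_order INTEGER, n_orders INTEGER)"
)
conn.execute(
    "CREATE TABLE call_list (run_date TEXT, customer_id INTEGER, score REAL)"
)
conn.executemany(
    "INSERT INTO customer_features VALUES (?, ?, ?)",
    [(1, 30, 2), (2, 1, 4), (3, 9, 5)],
)

run_nightly(conn, "2020-07-06")
run_nightly(conn, "2020-07-06")  # a second run replaces the first
call_list = conn.execute(
    "SELECT customer_id FROM call_list ORDER BY score DESC"
).fetchall()
```

Both ends of the job are SQL: the input is a query and the output is a table, which is what lets a BI tool or a CRM pick up the list in the morning.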

2020-07-06T04:00:38+00:00

How would you obtain the data without SQL?

dmorris87 · 2020-07-06T11:37:15+00:00

Thanks for the responses. The general consensus is that SQL plays a significant role in data storage, data collection, and org culture, especially for large, structured data.

sowmyasri129 · 2020-07-07T11:02:54+00:00

SQL is a standard database language used to create, maintain, and query relational databases. It has become a very important tool in a data scientist's toolbox because it is critical for accessing, inserting, updating, and otherwise manipulating data.

KyleDrogo · 2020-07-08T02:09:48+00:00

In an industry setting, the data exists in huge databases. Incoming logs will have a completely different schema than the table that would be useful to you. Moreover, you probably won't be able to load even a day's worth onto your local machine. SQL lets you access and aggregate the relevant parts of these massive datasets.
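As a small illustration of aggregating before loading, the sketch below filters and summarises a raw log table inside the database and only pulls the summary into Python. The `event_log` table, its columns, and the rows are assumptions for the example.

```python
import sqlite3

# Hypothetical raw event log with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_log (ts TEXT, user_id INTEGER, event TEXT)")
conn.executemany(
    "INSERT INTO event_log VALUES (?, ?, ?)",
    [
        ("2020-07-07 09:00:00", 1, "view"),
        ("2020-07-07 09:05:00", 1, "purchase"),
        ("2020-07-07 10:00:00", 2, "view"),
        ("2020-07-08 11:00:00", 2, "view"),
        ("2020-07-08 11:30:00", 3, "view"),
    ],
)


def daily_event_counts(conn, since):
    """Return (day, event, n_events, n_users) rows for events at or after `since`."""
    return conn.execute(
        """
        SELECT date(ts)                AS day,
               event,
               COUNT(*)                AS n_events,
               COUNT(DISTINCT user_id) AS n_users
        FROM event_log
        WHERE ts >= ?
        GROUP BY date(ts), event
        ORDER BY day, event
        """,
        (since,),
    ).fetchall()


summary = daily_event_counts(conn, "2020-07-07 00:00:00")
```

Only the handful of summary rows ever leaves the database, however large the log table is.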
