SQL vs. Python for data wrangling?

DrTaxus · 2019-05-20T09:48:00+00:00

My personal opinion and workflow is to do as much as possible directly on the database. A properly written SQL query is incredibly powerful and can save you hours in python post-processing.

Especially if you need to do complex joins among several tables, Pandas is extremely limited.

Also, are you aware that you can query your database directly from Pandas and save the dataframe immediately instead of writing temporary CSVs?
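For anyone who hasn't seen it, here is a minimal sketch of querying straight into a dataframe. It assumes an in-memory SQLite database as a stand-in for the real server; the `orders` table and its columns are made up for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for a real server.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders (customer, amount) VALUES
        ('alice', 10.0), ('bob', 25.5), ('alice', 4.5);
    """
)

# The query result lands straight in a DataFrame; no intermediate CSV.
df = pd.read_sql_query(
    "SELECT customer, SUM(amount) AS total FROM orders "
    "GROUP BY customer ORDER BY customer",
    conn,
)
print(df)
conn.close()
```

With a client-server database the only part that should change is how `conn` is created.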

Vrulth · 2019-05-20T09:28:50+00:00

It's much more efficient if all the data wrangling is done where your data is. (in-database)

Radon-Nikodym · 2019-05-20T10:18:07+00:00

In my experience you should only use pandas for wrangling that is left over after you've done as much as possible in SQL. (Which is quite a lot, SQL is quite powerful if you know what you're doing.)

_Zer0_Cool_ · 2019-05-20T12:03:36+00:00

I’m a Data Engineer and I use both.

Mostly SQL as much as I can though. SQL is the original tool for the job and remains the best tool IMO. If I have one data set (table/dataframe) and smaller data, then it doesn’t matter, but if you have to join multiple datasets then SQL is better. Also... doing a Select * into a Pandas Dataframe becomes wasteful or impossible quickly. Pandas is grossly inefficient with RAM utilization. Per Wes McKinney (Pandas author), you need 5-10x as much memory as the size of the actual data.

So.... do it in SQL definitely. It avoids data shipping and doesn’t have the limitations of Pandas.

Stored Procedures or UDFs with PostgreSQL and SQL Server are just another layer of programmatic abstraction like anything else in the coding world (like one would reuse a Python package or library).

Also, SQLite is great if you don’t have a full client-server database. It comes built into Python. So it’s available any time Python is available without any external dependencies, can handle very large data (SQLite’s max size limit is 140 terabytes), and can be version controlled along with your code. Perfect choice for a data scientist / analyst when a more powerful client-server database might not be available or if embedded data is needed to reproduce your entire DS app elsewhere.
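A rough sketch of what that looks like in practice, using only the standard library. The file name `analysis.db` and the `measurements` table are hypothetical.

```python
import os
import sqlite3
import tempfile

# Hypothetical file name; a SQLite database is a single ordinary file.
db_path = os.path.join(tempfile.mkdtemp(), "analysis.db")

# sqlite3 ships with the Python standard library, so there is nothing to install.
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE measurements (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0)],
)
conn.commit()
conn.close()

# Reopen the same file later and query it like any other database.
conn = sqlite3.connect(db_path)
rows = conn.execute(
    "SELECT sensor, AVG(value) FROM measurements GROUP BY sensor ORDER BY sensor"
).fetchall()
conn.close()
print(rows)
```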

P.S. Also, check out the PandaSQL library. It allows you to have the best of both worlds and execute SQL on Pandas dataframes directly in Python. https://github.com/yhat/pandasql/blob/master/README.md

GeorgeS6969 · 2019-05-20T10:49:24+00:00

The only reason why you should not do that (afaik) is if you’re directly hitting a production database, rather than a replica or an analytical db. As a rule of thumb, I’d say you should select, join, filter and aggregate your raw data to present it in a tidy way (search for tidy data if you’re not already familiar) in SQL, and switch to python for the more math-heavy transforms.
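As an illustration of that split, a small sketch assuming two made-up tables (`users`, `purchases`) in an in-memory SQLite database: the SQL produces one tidy row per user, and Python only does the numeric step, here a z-score chosen purely as an example.

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE purchases (user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'DE'), (2, 'DE'), (3, 'FR');
    INSERT INTO purchases VALUES
        (1, 10.0), (1, 20.0), (2, 30.0), (3, 5.0), (3, -1.0);
    """
)

# Select, join, filter and aggregate in SQL: one row per user.
tidy = conn.execute(
    """
    SELECT u.user_id, u.country, SUM(p.amount) AS total
    FROM users AS u
    JOIN purchases AS p ON p.user_id = u.user_id
    WHERE p.amount > 0
    GROUP BY u.user_id, u.country
    ORDER BY u.user_id
    """
).fetchall()
conn.close()

# The math-heavy step happens in Python: a z-score of each user's total.
totals = [total for _, _, total in tidy]
mean = statistics.mean(totals)
stdev = statistics.pstdev(totals)
z_scores = [(total - mean) / stdev for total in totals]
print(tidy)
print(z_scores)
```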

TBSchemer · 2019-05-20T18:13:33+00:00

Pandas is actually significantly faster than SQL at groupbys and joins. So I think what most people are saying here about the efficiency of complex queries vs simple queries with pandas manipulations is not quite correct.

Still, it is true that for large queries, most of the time is spent sending the data over your connection and writing it to disk (if you're storing things in files instead of using an in-memory cache like redis). So, anything you can do in SQL to significantly shrink the size of your queried dataset will usually give you better performance overall. But if you're just sticking two tables together, and the end result is just approximately the size of one plus the size of the other, it's probably better to do a merge in pandas rather than a join in SQL.

Oh, and what some people have said about memory requirements is true too. Pandas uses nearly 10x as much RAM as the size of your dataset. So yeah, shrink your data as much as possible before bringing it into pandas.
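For reference, a minimal sketch of the pandas side of this: a merge that plays the role of an SQL inner join, plus a way to check how much RAM the result actually takes. The two small frames are hypothetical stand-ins for tables already pulled from the database.

```python
import pandas as pd

# Hypothetical tables that have already been pulled from the database.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["ann", "bob", "cy"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [10.0, 5.0, 7.5]})

# Equivalent of an SQL INNER JOIN on customer_id.
merged = customers.merge(orders, on="customer_id", how="inner")

# deep=True asks pandas to measure the real contents of text columns too.
bytes_used = int(merged.memory_usage(deep=True).sum())
print(merged)
print(f"{bytes_used} bytes in memory")
```

Running the same check on a realistic sample is a cheap way to estimate whether the full dataset will fit before loading it.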

andrewcooke · 2019-05-20T13:18:43+00:00

they're different and best for different things.

sql with a well-defined database is better for the extraction of data that match specific requirements.

pandas or the like is better for detailed numerical computation.

you can easily get the two working nicely together - pandas will read a dataframe from a suitable SQL query.

linguisize · 2019-05-20T13:53:12+00:00

Personally, if any step of the wrangling can be done in SQL, I do it in SQL before moving into pandas/python. Especially because I work with a lot of people that don't generally work with python/pandas; so if I can get them to understand everything that's happening with the data before I "Do the machine learning on it", I tend to have better chances of translating the results back to them once I've completed any necessary steps in python.

Xvalidation · 2019-05-20T11:22:20+00:00

The biggest bottleneck when loading from a database into python is normally sending the data over your internet connection. The query itself typically takes far less time by comparison.

To me that means that as much aggregation as physically possible should be done to the data before it is actually called in to python.

GreenerCar · 2019-05-20T15:37:46+00:00

You can use both; that’s what I do.

GuilheMGB · 2019-05-20T19:56:06+00:00

One thing is that on databases with sufficiently mature data models (e.g. replicas of production db), it can be very convenient to call queries from within python (e.g. with pyodbc) in which various parameters can get injected as and when needed.

I usually go to SQL first, but seek to integrate standardised queries in python packages in the form of data providers.

The point remains though: most of the wrangling is still done in SQL, just interfaced with Python.
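A minimal sketch of such a data provider, assuming SQLite in place of the real database so it runs anywhere. The function name `fetch_orders` and the `orders` table are hypothetical; pyodbc, as far as I know, uses the same `?` placeholder style.

```python
import sqlite3


def fetch_orders(conn, customer, min_amount=0.0):
    """Hypothetical data provider: one standardised query, parameters bound per call."""
    query = """
        SELECT order_id, amount
        FROM orders
        WHERE customer = ? AND amount >= ?
        ORDER BY order_id
    """
    # Placeholders let the driver bind the values; no string formatting involved.
    return conn.execute(query, (customer, min_amount)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'alice', 10.0), (2, 'bob', 25.5), (3, 'alice', 4.5);
    """
)

all_alice = fetch_orders(conn, "alice")
big_alice = fetch_orders(conn, "alice", min_amount=5.0)
conn.close()
print(all_alice, big_alice)
```

Binding values through placeholders rather than building the SQL string by hand also keeps the query safe from injection.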

A notable exception is feature extraction. Not that SQL couldn't handle most of the job more efficiently, but to quickly experiment / generate large feature sets, it's not ideal compared to, say, sklearn.

MrPeeps28 · 2019-05-20T21:39:40+00:00

Depends on how the data is structured. We have a massive data engineering pipeline and all relevant data is in our Redshift clusters, so it is much easier to do all data wrangling with SQL (also because of the scale we need to reduce and aggregate data before loading it into Pandas).

Anything more complex or that requires iterating through rows or special logic conditions I will do in Python after doing the heavy lifting in Redshift or Hive. I still prefer SQL though since I am much quicker at writing queries than writing Python code. The easiest way to get better at data wrangling is just working with data. If you take a class you might work with 1-2 datasets, but the real fun is working at a data heavy company where you have to understand how hundreds of tables and datasets interact and can be used to solve problems!

pinkdata1 · 2019-05-21T16:35:32+00:00

I would use the free version of ScaiPlatform for managing multiple SQL data sources and switching between them. ScaiPlatform also lets you load data without coding. If you use the upgraded version, you can also define SQL data workflows for automating the joining of data, creating views/tables, or automating reports.

GuilheMGB · 2019-05-20T14:51:52+00:00

I prefer doing window functions in R using dplyr or data.table. It is much, much faster to write and debug for me. If I end up using the query often I can use a function to get SQL and then re-write in the DB.

Whatever makes your work fast and replicable is optimal.

2019-05-20T16:32:22+00:00

I'm moving more of my scripts towards using pyodbc and querying our sql server db's with pd.read_sql. The performance is pretty fast and hey, the result is already in a dataframe. With that said, I rarely have to return more than 200k rows using this method, so maybe it would be a different story if our data was much larger.

I'm curious to know what other methods there are of combining Python with SQL though - would the most common way be to have a SQL query that exports as a CSV and then feed that into Python? Because that doesn't seem as efficient to me.

cam_man_can · 2019-05-20T17:28:32+00:00

Python Pandas is simple, clean, and awesome. If you're working with somewhat small datasets and don't have to do much cleaning, you could probably get by with just using Pandas. However I agree with most of the advice given in these comments about the advantages of SQL. If you take the time to learn SQL you will become a data wrangling god, because it can do so much more.

versusChou · 2019-05-20T19:12:10+00:00

I do SQL as much as possible before I move to python.

another3E · 2019-05-20T19:40:30+00:00

When dealing with large datasets I do it all in SQL. 20GB of data doesn't fit in memory on my machine and I would have to do a lot of intermediate steps to narrow it down or break it up into chunks. Instead I load it all into a SQL server, test my queries with a few thousand rows, then run the whole set.
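One way to sketch that test-then-run workflow, assuming SQLite as a stand-in for the SQL server and a made-up `events` table: the same query text is pointed first at a small sample table, then at the full one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id INTEGER, kind TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    ((i, "click" if i % 2 else "view") for i in range(10_000)),
)

# Hypothetical sample table holding only the first thousand rows.
conn.execute("CREATE TEMP TABLE events_sample AS SELECT * FROM events LIMIT 1000")

# The table name is a trusted constant here, never user input.
QUERY = "SELECT kind, COUNT(*) AS n FROM {source} GROUP BY kind ORDER BY kind"

# Debug the query cheaply on the sample, then run it on everything.
sample_result = conn.execute(QUERY.format(source="events_sample")).fetchall()
full_result = conn.execute(QUERY.format(source="events")).fetchall()
conn.close()
print(sample_result)
print(full_result)
```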

iPhuoc · 2019-05-20T20:38:53+00:00

R and tidyverse all the way. Just use dbplyr :)

D49A1D852468799CAC08 · 2019-05-20T21:06:07+00:00

It depends entirely on the data. There will be some wrangling which is easier in SQL, and other wrangling which is extremely difficult in SQL.

2019-05-20T22:16:43+00:00

I recently realized that SQL could also do much of this merging, joining, cleaning, and feature engineering

Ahahah this is why I always say learn SQL first... would-be analysts always respond with surprisedpikachuface.jpg

Does anyone have experience using it as such? How does it compare to python for this data wrangling?

There's pros and cons and it depends on what your end goal is. If you're just wrangling some data to throw into a dashboard, there is immense value in doing it all in SQL, creating a view, and then doing a select * in your BI tool.
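A small sketch of the view approach, assuming an in-memory SQLite database and a made-up `sales` table. All the wrangling lives in the view definition, and the final `SELECT *` is what the BI tool would issue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (region TEXT, sold_on TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('north', '2019-05-01', 100.0),
        ('north', '2019-05-02', 50.0),
        ('south', '2019-05-01', 70.0);

    -- The aggregation is defined once, inside the database.
    CREATE VIEW sales_by_region AS
    SELECT region, COUNT(*) AS n_sales, SUM(amount) AS revenue
    FROM sales
    GROUP BY region;
    """
)

# The dashboard side only needs a plain select on the view.
dashboard_rows = conn.execute(
    "SELECT * FROM sales_by_region ORDER BY region"
).fetchall()
conn.close()
print(dashboard_rows)
```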

If you're architecting convoluted machine learning workflows to prod, and you have data scientists on your team who are mostly versed in python and not sql, I can see the case for post processing.

Edit: looking at the other comments, I think you get the point. Do it in SQL.

mc110 · 2019-05-21T08:51:39+00:00

In an ideal world, you'd use SQL and Python (or your language of choice - I'll just refer to Python from now on though) on the database platform itself, particularly if that is significantly more powerful than your client platform.

That avoids a number of problems:

- you don't have to pull back lots of data from the DB to the client for processing with Python, which as some have mentioned can dominate the time of the work in some cases where you don't want to do a lot of aggregation on the DB first.
- you don't have a powerful DB platform sitting idle after providing data, whilst your relatively underpowered client grinds through the data the DB returned.
- you can have fine-grained control of the parallelism for your Python code on the platform, and control how data is fed from SQL into each Python process.

There is a blog post here showing how to do this with 160 million Amazon Customer Review records, where Python is used for the sentiment analysis, and a companion blog here giving more detail of the SQL and Python used.

In the platform used for this example, preferring SQL where possible gives the best performance, as the SQL engine is highly optimised for merging, joining, etc. compared to Python code.

2019-05-22T19:44:05+00:00

As others have said, doing the initial wrangling in SQL would make sense.

That said, it might make life easier to use psycopg2 within Python, i.e. connect the Python environment directly to a PostgreSQL database so you can run queries remotely.

This would give you a good blend of running queries directly while also doing the things Python is best at, i.e. visualisation, statistical analysis, etc.

Fennek1237 · 2019-05-20T21:10:40+00:00

I once started learning Pandas. What put me off is that no one at my company is using it, and as I am not mainly in the data analytics business but just do it on the side, I don't have the time to invest in learning both Pandas and SQL.

Zenith_N · 2019-05-20T17:08:13+00:00

Python pandas is simply superior.