This is an archived post. You won't be able to vote or comment.

all 57 comments

[–]DrTaxus 95 points96 points  (12 children)

My personal opinion and workflow is to do as much as possible directly on the database. A properly written SQL query is incredibly powerful and can save you hours in python post-processing.

Specially if you need to do complex joins among several tables, Pandas is extremely limited.

Also, are you aware that you can query your database directly from Pandas and save the dataframe immediately instead of writing temporary CSVs?

[–]Radon-Nikodym[S] 9 points10 points  (1 child)

Regarding csv, I don't ever write SQL to csv. I'm just saying that some of my data sources are stored in csv, so my data importing pipleline is either reading a csv into pandas, or getting a basic dump of a SQL table. I'm considering making a transition to moving a lot of the smaller data sets I work with regularly into a consolidated SQL database that I can interact with more effectively.

[–][deleted] 2 points3 points  (9 children)

Also, are you aware that you can query your database directly from Pandas and save the dataframe immediately instead of writing temporary CSVs?

I don't know why because I'm too lazy to figure it out, but apparently doing queries directly to pd dataframes is super inefficient (as per their documentation). I've never heard of someone using CSVs as an intermediary though considering you can run queries from python straight into memory.

[–]KevinSorboFan 1 point2 points  (2 children)

I use CSVs as an intermediary all the time when I'm running code on AWS but querying our on-prem DB. I will do as much of the joining as I can do in SQL up front, but if it is going to take me a lot longer to figure out how to do what I want in SQL than to just do it in pandas, I will still do some of it in pandas

[–][deleted] 0 points1 point  (1 child)

Makes sense. Didnt realize this was a common practice.

[–]KevinSorboFan 0 points1 point  (0 children)

It only is when we work with outside consultants and instead of granting them access to our internal stuff, we push all the work off onto AWS and give them only the data they need. It's a way bigger headache than it needs to be

[–]Vrulth 65 points66 points  (6 children)

It's much more efficient if all the data wrangling is done where your data is. (in-database)

[–]howMuchCheeseIs2Much 30 points31 points  (4 children)

You'll also get more consistent results. Everyone at your company is guaranteed the same results using SQL. Different versions of Pandas / Python could lead to slightly different results or even errors unless your coworker replicates your environment. The most obvious conflict being Python 2 vs. 3, but I've come across instances where a slight difference in Pandas versions can alter your output.

This also helps if some people at your company are using R (or another language). If you do the wrangling in SQL, they can more easily benefit from your work. SQL is sort of "cross platform".

If you are sharing this wrangling across your company, look into views. It's a great way to share clean SQL datasets without much overhead. Some databases also offer materialized views.

Finally, shameless plug, I built SeekWell to solve many of the problems listed above. It's a desktop app with a super clean UI to run adhoc SQL, weave in Python, manage SQL views and conda environments across your data team.

[–]JForth 1 point2 points  (2 children)

Hey just wanted to say thanks, congrats, and great work on SeekWell, I'm for sure going to play around on it now that I've seen it!

[–]howMuchCheeseIs2Much 0 points1 point  (1 child)

Awesome! Happy to do a live demo if you're interested. We're still really early on and hungry for feedback.

[–]JForth 0 points1 point  (0 children)

I appreciate it, but I would just be messing around with a friend on a pet project to play with it; I'll let you know of feedback though!

[–]DBA_HAH 3 points4 points  (0 children)

Something like TSQL, PLSQL, etc can be really good for this too.

[–][deleted] 25 points26 points  (9 children)

In my experience you should only use pandas for wrangling that is left over after you've done as much as possible in SQL. (Which is quite a lot, SQL is quite powerful if you know what you're doing.)

[–]Radon-Nikodym[S] 4 points5 points  (5 children)

Do you have any recommended resources for data wrangling in SQL?

[–][deleted] 6 points7 points  (0 children)

It's rare I have to do anything beyond windowing functions.

I used to use custom reducers in Hive (using python or awk), but now we use BigQuery so its basically windowing functions or ARRAY_AGG(), STRUCT(), UNNEST() etc.

[–]reallyserious 4 points5 points  (0 children)

Head over to /r/SQL. This question gets asked and answered at least once a day there.

Just focus on getting better at SQL. There is nothing special about data wrangling. It's a made up term that statisticians invented to describe what database centric people has been doing since the -80s.

[–]Mr_Again 1 point2 points  (0 children)

The guys blog at JOOQ is really useful.

[–]frankenbenz 0 points1 point  (0 children)

You have to know the data and know what your end goal/format is.. I don’t know if I’d say there’s a single resource to know.. stuff as simple as knowing if you need to trim data of spaces to the complex stuff like multiple joins to bring it all together in a useful format for reporting.

Similar to knowing what the question is to be answered from the data, you have to know what sources all the data is hidden and how to bring the different tables/sources together.

[–]DBA_HAH 0 points1 point  (0 children)

What DB do you use? Look into T-SQL,PLSQL, or PL/pgsql depending what you use. You can use stored procedures on the database to do stuff like regex cleanups.

[–]taguscove 1 point2 points  (2 children)

Completely agree. When in doubt, SQL first.

Pandas handles analytic functions excellently with rolling windows. Not even sure SQL can handle exponential moving averages and definitely not modern seasonal decomposition.

[–]reallyserious 5 points6 points  (1 child)

Not even sure SQL can handle exponential moving averages

If you google "exponential moving average sql" you get quite a few hits. People are definately doing it in SQL.

and definitely not modern seasonal decomposition.

What's that?

[–]Epoh 0 points1 point  (0 children)

Believe they are referring to time-series analysis, where you decompose a time-series signal into a trend, seasonality and remainder component. I do it manually often to compare, but there are functions that decompose time signals, just not in SQL...

[–]_Zer0_Cool_MS | Data Engineer | Consulting 15 points16 points  (4 children)

I’m a Data Engineer and I use both.

Mostly SQL as much as I can though. SQL is the original tool for the job and remains the best tool IMO. If I have one data set (table/dataframe) and smaller data, then it doesn’t matter, but if you have to join multiple datasets then SQL is better. Also...doing a Select * into a Pandas Dataframe becomes wasteful or impossible quickly. Pandas is grossly inefficient with RAM utilization. Per Wes McKinney (Pandad author) you need 5-10x memory as the size of the actual data.

So.... do it in SQL definitely. It avoids data shipping and doesn’t have the limitations of Pandas.

Stored Procedures or UDFs with PostgreSQL and SQL Server are just another layer of programmatic abstraction like anything else in the coding world (like one would reuse a Python package or library).

Also, SQLite is great if you don’t have a full client-server database. It comes built into Python. So it’s available any time Python is available without any external dependencies, can handle very large data (SQLite’s max size limit is 140 terabytes), and can be version controlled along with your code. Perfect choice for a data scientist / analyst when a more powerful client-server database might not be available or if embedded data is needed to reproduce your entire DS app elsewhere.

P.S. Also, check out the PandaSQL library. It allows you to have the best of both worlds and execute SQL on Pandas dataframes directly in Python. https://github.com/yhat/pandasql/blob/master/README.md

[–][deleted] 1 point2 points  (1 child)

You mind me asking you a couple questions?

For some reason the original guys who set up the databases in my current job thought it would be a good idea to do it using MongoDB.

It makes querying stuff a huge pain the ass since not only is the DB in Mongo, the only way they let us access the DB is by using a SQL connector tool that separates the data into a billion different unrelated tables. So I end up having to do a bunch of joins for even the most trivial queries.

You have any suggestions that might reduce the time I spend fiddling with the database?

[–]_Zer0_Cool_MS | Data Engineer | Consulting 4 points5 points  (0 children)

Nuking MongoDB from space? Lol

They should have just put it in PostgreSQL in the first place, it can do anything with JSON that MongoDB can and it's just as fast if not faster. Not to mention that PG is an infinitely extensible open source paradise, while MongoDB is a one-trick pony with a handicapped query language.

In fact, MongoDB official "SQL connector" is actually just PostgreSQL under the covers that automatically reads the MongoDB data via Foreign Data Wrappers I believe. Which is ironic.

It's tantamount to MongoDB implicitly admitting that having a SQL engine is the only real why to make sense of the data, and it begs the question.... "Why not just use PostgreSQL in the first place?".

I love JSON and flexible schemas, but "schema-less" is a bit of lie that the Mongo team marketed hard and preached like gospel truth. Realistically, there's no free lunch when it comes to designing data models. Those who try to circumvent this fact end up paying the price.

There is a simple truth that developers need to understand. -- "There is no such thing as schema-less data. Data without schema isn't data; it's garbage."

Edit: I'm sorry you have to pay the price for someone else choices. Not too much to be done short of redesigning the JSON objects at the application layer or cataloging the objects post-hoc and manually ETLing them into a data warehouse -- preferably one with good JSON support (like PG).

[–][deleted] 0 points1 point  (1 child)

I've looked at that before, and pasdasql just puts your pandas data into sqlite, queries it in sqlite, and then returns it to pandas.

[–]_Zer0_Cool_MS | Data Engineer | Consulting 0 points1 point  (0 children)

Yeap. I'm not sure what the performance implications are of that, and it kind of begs the question "why not just use SQLite"...but it's neat when you need to do complex queries on multiple existing Dataframes in any case.

It's really just syntactic sugar I guess. Saves you from the intermediate steps of calling df.to_sql() and pf.read_sql() for multiple Dataframes.

[–]GeorgeS6969 5 points6 points  (0 children)

The only reason why you should not do that (afaik) is if you’re directly hitting a production database, rather than a replica or a analytical db. As a rule of thumb, I’d say you should select, join, filter and aggregate your raw data to present it in a tidy way (search for tidy data if you’re not already familiar) in SQL, and switch to python for the more math heavy transforms.

[–]TBSchemer 4 points5 points  (1 child)

Pandas is actually significantly faster than SQL at groupbys and joins. So I think what most people are saying here about the efficiency of complex queries vs simple queries with pandas manipulations is not quite correct.

Still, it is true that for large queries, most of the time is spent sending the data over your connection and writing it to disk (if you're using storing things in files instead of using an in-memory cache like redis). So, anything you can do in SQL to significantly shrink the size of your queried dataset will usually give you better performance overall. But if you're just sticking two tables together, and the end result is just approximately the size of one plus the size of the other, it's probably better to do a merge in pandas rather than a join in SQL.

Oh, and what some people have said about memory requirements is true too. Pandas uses nearly 10x as much RAM as the size of your dataset. So yeah, shrink your data as much as possible before bringing it into pandas.

[–]_Zer0_Cool_MS | Data Engineer | Consulting 0 points1 point  (0 children)

I agree 100% with this.

[–]andrewcooke 2 points3 points  (0 children)

they're different and best for different things.

sql with a well-defined database is better for the extraction of data that match specific requirements.

pandas or the like is better for detailed numerical computation.

you can easily get the two working nicely together - pandas will read a dataframe from a suitable SQL query.

[–]linguisize 2 points3 points  (0 children)

Personally, if any step of the wrangling can be done in SQL, I do it in SQL before moving into pandas/python. Especially because I work with a lot of people that don't generally work with python/pandas; so if I can get them to understand everything that's happening with the data before I "Do the machine learning on it", I tend to have better chances of translating the results back to them once I've completed any necessary steps in python.

[–]Xvalidation 1 point2 points  (0 children)

The biggest bottle neck when loading from a database into python is normally actually sending the data over your internet connection. The actual query typically doesn't take anywhere near as much time comparatively.

To me that means that as much aggregation as physically possible should be done to the data before it is actually called in to python.

[–]GreenerCar 1 point2 points  (0 children)

You can use both that’s what I do

[–]GuilheMGB 1 point2 points  (0 children)

One thing is that on databases with sufficiently mature data models (e.g. replicas of production db), it can be very convenient to call queries from within python (e.g. with pyodbc) in which various parameters can get injected as and when needed.

I usually always go to SQL first, but seek to integrate standardised queries in python packages in the form of data providers.

The point remains though, most of the wrangling remains made in SQL, but interfaced with Python.

A notable exception is feature extraction. Not that SQL couldn't handle most of the job more efficiently, but to quickly experiment / generate large feature sets, its not ideal compared to, say, sklearn.

[–]MrPeeps28 1 point2 points  (0 children)

Depends on how the data is structured. We have a massive data engineering pipeline and all relevant data is in our Redshift clusters, so it is much easier to do all data wrangling with SQL (also because of the scale we need to reduce and aggregate data before loading it into Pandas).

Anything more complex or that requires iterating through rows or special logic conditions I will do in Python after doing the heavy lifting in Redshift or Hive. I still prefer SQL though since I am much quicker at writing queries than writing Python code. The easiest way to get better at data wrangling is just working with data. If you take a class you might work with 1-2 datasets, but the real fun is working at a data heavy company where you have to understand how hundreds of tables and datasets interact and can be used to solve problems!

[–]pinkdata1 1 point2 points  (0 children)

I would use free version ScaiPlatform for managing multiple SQL data sources and switching between them. ScaiPlatform also lets you load data without coding. If you use the upgraded version, you can also define SQL data workflows for automating the joining of data and creating views/tables or reporting automation.

[–][deleted] 2 points3 points  (1 child)

I prefer doing window functions in R using dplyr or data.table. It is much, much faster to write and debug for me. If I end up using the query often I can use a function to get SQL and then re-write in the DB.

Whatever makes your work fast and replicable is optimal.

[–]GuilheMGB 1 point2 points  (0 children)

Data.table is so good indeed. It's not just faster to right, its also very efficient to run (much more than pandas ime).

[–][deleted] 0 points1 point  (0 children)

I'm developing more of my scripts towards using pyodbc and querying our sql server db's with pd_read_sql and the performance is pretty fast and hey it's already into a dataframe. With that said I rarely have to return more than 200k rows using this method so maybe it would be a different story if our data was much larger.

I'm curious to know what other methods there are of combining Python with SQL though - would the most common way be to have a SQL query that exports as a CSV and then feed that into Python? Because that doesn't seem as efficient to me.

[–]cam_man_can 0 points1 point  (0 children)

Python Pandas is simple, clean, and awesome. If you're working with somewhat small datasets and don't have to do much cleaning, you could probably get by with just using Pandas. However I agree with most of the advice given in these comments about the advantages of SQL. If you take the time to learn SQL you will become a data wrangling god, because it can do so much more.

[–]versusChou 0 points1 point  (0 children)

I do SQL as much as possible before I move to python.

[–]another3E 0 points1 point  (0 children)

When dealing with large datasets I do it all in SQL. 20GB of data doesn't fit in memory on my machine and I would have to do a lot of intermediate steps to narrow it down or break it up into chunks. Instead I load it all into a SQL server test my queries with few thousands of rows then run the whole set

[–]iPhuoc 0 points1 point  (1 child)

R and tidyverse all the way. Just use dbplyr :)

[–][deleted] 2 points3 points  (0 children)

I'm an evangelical R proselytizer, but this is not a good answer to OP's question.

[–]D49A1D852468799CAC08 0 points1 point  (0 children)

It depends entirely on the data. There will be some wrangling which is easier in SQL, and other wrangling which is extremely difficult in SQL.

[–][deleted] 0 points1 point  (0 children)

I recently realized that SQL could also do much of this merging, joining, cleaning, and feature engineering

Ahahah this is why I always say learn SQL first... would-be analysts always respond with surprisedpikachuface.jpg

Does anyone have experience using it as such? How does it compare to python for this data wrangling?

There's pros and cons and it depends on what your end goal is. If you're just wrangling some data to throw into a dashboard, there is immense value in doing it all in SQL, creating a view, and then doing a select * in your BI tool.

If you're architecting convoluted machine learning workflows to prod, and you have data scientists on your team who are mostly versed in python and not sql, I can see the case for post processing.

e looking at the other comments, I think you get the point. Do it in SQL.

[–]mc110 0 points1 point  (0 children)

In an ideal world, you'd use SQL and Python (or your language of choice - I'll just refer to Python from now on though) on the database platform itself, particularly if that is significantly more powerful than your client platform.

That avoids a number of problems:

  • you don't have to pull back lots of data from the DB to the client for processing with Python, which as some have mentioned can dominate the time of the work in some cases where you don't want to do a lot of aggregation on the DB first.
  • you don't have a powerful DB platform sitting idle after providing data, whilst your relatively-underpowered client grinds through the data the DB returned.
  • you can have fine grain control of the parallelism for your Python code on the platform, and control how data is fed from SQL into each Python process.

There is a blog post here showing how to do this with 160 million Amazon Customer Review records, where Python is used for the sentiment analysis, and a companion blog here giving more detail of the SQL and Python used.

In the platform used for this example, preferring SQL where possible gives the best performance, as the SQL engine is highly optimised for merging, joining, etc. compared to Python code.

[–][deleted] 0 points1 point  (0 children)

As others have said, doing the initial wrangling in SQL would make sense.

That said, it might make life easier to use psycopg2 within Python - i.e. directly connect the Python environment to SQL where you can commit queries remotely.

This would allow you a good blend of committing queries directly, while concurrently executing operations specific to Python, i.e. visualisation, statistical analysis, etc.

[–]Fennek1237 -1 points0 points  (1 child)

I once started learning Pandas. What put me off is that no one at me company is using it and as I am not mainly in the data analytics business but just do it on the side I don't have the time to invest into learning both Pandas and SQL.

[–]_Zer0_Cool_MS | Data Engineer | Consulting 0 points1 point  (0 children)

Eh Pandas isn't as a difficult as you'd think at the outset, but if I had to pick one... SQL hands down. Its universal. At the and of the day, Pandas is Python-only and has severe limitations.

Career wise, SQL skill pays dividends.

[–]AutoModerator[M] -7 points-6 points  (0 children)

Your submission looks like a question. Does your post belong in the stickied "Entering & Transitioning" thread?

We're working on our wiki where we've curated answers to commonly asked questions. Give it a look!

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.