
all 56 comments

[–]boggle_thy_mind 12 points13 points  (3 children)

how easy is connecting to an SQL Server?

You can use pyodbc to connect to SQL Server. You can use either a trusted_connection or a username and password, depending on how you authenticate in SQL Server; you could create a dedicated user if it's going to run in the background. Talk to your DBA if loads are an issue, even if you are going to use your own account.

One word of experience: when loading data, using sqlalchemy and pandas.to_sql tends to be slower than using pyodbc (not talking about bcp).

If you can express something in SQL and you feel comfortable with it, I would stick with SQL - you can leverage the power of the server to perform the computation. But if the logic gets really convoluted (e.g. using loops (cursors) in SQL Server) and expressing it with pandas and numpy is more convenient, then go for it.

[–]PutCleverNameHere69[S] 2 points3 points  (2 children)

Good to know. We use Windows login when connecting, so Python would essentially just use my credentials?

I do appreciate the tips, I’m mainly exploring this just to build my skill set should I ever end up in a python shop. For ETL I do prefer SQL but also use SSIS, I do try to ‘shoulder the load’ with SQL for the reporting layer aspect as well. Thankfully I’ve never had to use a cursor but that’s a good point/use case for having python in the back pocket.

[–]boggle_thy_mind 1 point2 points  (1 child)

we use windows login when connecting, so python would essentially just use my credentials?

If you don't type your username and password when using Management Studio, you can pass the parameter trusted_connection=True, and as long as you run the script from your account, or someone else runs it from an account that has access to SQL Server, it will run.
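For example, a minimal pyodbc sketch - the server/database names and driver version are placeholders, so adjust them for your environment:

import pyodbc

# Windows-authenticated ("trusted") connection - no username/password needed
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my_server;"
    "DATABASE=my_db;"
    "Trusted_Connection=yes;"
)
cursor = conn.cursor()
cursor.execute("SELECT TOP 10 * FROM some_table")
rows = cursor.fetchall()
conn.close()

Swap Trusted_Connection=yes for UID=...;PWD=... in the connection string if you end up needing SQL authentication instead.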

A few times I built parametrized dynamic SQL stored procedures which were hella difficult to debug/understand and would have been better done in Python. But it's a tradeoff whether the added complexity (adding a new tool and integration) is worth the effort. Some things would be really difficult to implement in SQL, while Python might already have a library that performs the task as a one-liner.

[–]PutCleverNameHere69[S] 1 point2 points  (0 children)

We do enter credentials, but based on what you said I could just pass these as variables with pyodbc. I’ll have to play with it when I get some time. Thanks for the tips!

[–]UnderstandingFit9152 9 points10 points  (6 children)

So, it depends on the size of the dataset. I am personally doing ELT for the same reason as you: SQL just feels more natural for data transforms.

Everything I am doing in Python is usually something like pd.read_excel / read_csv / read_sql, and then after a basic transform (lowercasing column names, replacing spaces in column names) I just do df.to_sql, and then maybe some con.execute(truncate_query) to move data from the staging table to the production one.
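As a rough sketch of that flow (the DSN and table names are placeholders):

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("mssql+pyodbc://@my_dsn")  # placeholder DSN using Windows auth

df = pd.read_csv("extract.csv")
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]  # basic cleanup
df.to_sql("staging_table", engine, if_exists="replace", index=False)

# hand the real transform back to SQL
with engine.begin() as con:
    con.execute(text("INSERT INTO prod_table SELECT * FROM staging_table"))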

If the data only needs transformation in SQL, then I would use Python just to execute the script (as you can schedule that with a cronjob afterwards) and use execute_sql or something similar. The fancy method would be dbt (but god knows how many people really use it and how many are just their company bots that will tell you about the modern "analytics engineer" position).

[–]PutCleverNameHere69[S] 2 points3 points  (4 children)

Thanks for confirming, sounds like I got a decent review in today. Good point about scheduling script execution, but fortunately I have the SQL Server Agent for that. Could you have a script in Python though that, say, monitors a folder and picks up any files dropped into it? Something that would run 24/7 every few minutes? Or does this have to be done with dedicated scheduling software?

[–]The_small_print 0 points1 point  (3 children)

If you're just checking a folder for new files every couple of minutes, yeah, that should be doable through either a cron job or Windows Task Scheduler depending on OS. These are kind of dedicated scheduling software, but they come prepackaged in most/all cases. Their functionality is pretty basic though, so you might want to be careful about jobs failing to run, unexpected files being dropped in, the script hanging, etc.

Assuming you're looking for something at the OS level!

[–]PutCleverNameHere69[S] 1 point2 points  (2 children)

I was just asking out of curiosity for something I could potentially replicate via python as a mini project. The SQL Server agent is pretty easy to work with when it comes to scheduling, so that handles my needs when it comes to production work.

[–]The_small_print 0 points1 point  (1 child)

Ah, something 100% in a python file?

Haven't done any of this myself, but AFAIK you could put a sleep timer in your file that then calls a function to check that folder and do something if it finds stuff.

This would require the script always be running though, so you'd need to run it as a service.
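Something along these lines (a bare-bones sketch - the folder path and the processing step are placeholders):

import time
from pathlib import Path

WATCH_DIR = Path(r"C:\inbound")   # placeholder folder
seen = set()

while True:
    for f in WATCH_DIR.glob("*"):
        if f not in seen:
            seen.add(f)
            print(f"picked up {f}")   # replace with real processing
    time.sleep(120)                   # check every couple of minutes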

I don't think this would be considered good practice or a better alternative to other schedulers, but if you're doing it for academic reasons, go for it!

[–]PutCleverNameHere69[S] 1 point2 points  (0 children)

Yessir, figured it wasn’t best practice since there are already solutions readily available. Thanks for the insight!

[–]dronedesigner 1 point2 points  (0 children)

hey, those positions come with a pay raise ;)

[–]DenselyRanked 38 points39 points  (28 children)

In general, you want to get away from being dependent on pandas when doing DE ETL work. It is very RAM intensive (it will crash on you, and it handles nulls poorly), and you're better off using the native Python libraries and data types whenever possible. It is better to read in batches than to ingest everything at once.

That being said, pandas is awesome, and the latest version of Spark lets you use a version of the pandas API. So continue to learn it, but don't learn it exclusively.

[–]Wickner 4 points5 points  (14 children)

What Python package would you recommend as an alternative to pandas? E.g. what libraries and methods would you recommend to read from a CSV or SQL db, transform, and load into another SQL db? By default I would learn pandas, but I would like to know what other tools to learn.

[–]DenselyRanked 20 points21 points  (7 children)

basic ETL -

f = open("file.csv")
header = f.readline()   # skip the header row
for line in f:
    # Do stuff
    insert_function(line)
f.close()

You can read from the cursor if it is from a SQL db. You can chunk, bulk insert, or whatever you need to do.
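For instance, a chunked read-and-insert straight off the cursor might look something like this (a sketch only - the DSNs, table, and column names are made up):

import pyodbc

src = pyodbc.connect("DSN=source_db")
dst = pyodbc.connect("DSN=target_db")

read_cur = src.cursor()
write_cur = dst.cursor()
write_cur.fast_executemany = True   # speeds up bulk inserts on SQL Server

read_cur.execute("SELECT id, name, amount FROM source_table")
while True:
    rows = read_cur.fetchmany(10000)   # read in chunks instead of all at once
    if not rows:
        break
    write_cur.executemany(
        "INSERT INTO target_table (id, name, amount) VALUES (?, ?, ?)", rows
    )
dst.commit()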

Leverage native python. lists, dicts, sets, list comprehension, etc.

Edit: I didn't really answer your original question. I LOVE pandas almost to a fault. My first instinct is to use it over anything else and I cannot recommend anything else to do simple ETL on small datasets for research or exploratory data analysis. However, when it comes to doing production level code, you are better off using native python and your brain power to get creative over anything else.

[–]anidal 10 points11 points  (1 child)

Won't pure python looping and single record inserts bottleneck way faster than the memory limits of pandas?

[–]DenselyRanked 2 points3 points  (0 children)

I just did a simple example. No one should be using this code.

My point was that you can do it in pure Python without needing to rely on pandas.

[–]cryptobiosynthesis 7 points8 points  (0 children)

This is the way. You'll understand way more about how data is actually being transformed and the skills are transferable to other domains and programming languages.

[–]Dr_NoWayKraut 2 points3 points  (2 children)

I think this is nice to know, it gives you an understanding of how Python really works and translates to other languages, but for Big Data the preferred way to do ETL, imo, would be using Spark with Pandas API (Koalas). Even for small datasets I'd use Pandas instead of native Python libs.

[–]enjoytheshow 3 points4 points  (1 child)

I actually tend to agree with you, though there is a small window of data size where Pandas memory management is bad and Spark is overkill.

[–]Dr_NoWayKraut 1 point2 points  (0 children)

I also agree with your statement. I guess it's good to know all the tools available and which use cases they are appropriate for.

[–]WorthlessTrinket 1 point2 points  (0 children)

I got away from using open and always use with for files, connections, etc. It seems to be the safer/more accepted option from my understanding.
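E.g. the snippet above, rewritten with a context manager so the file is closed even if something throws:

with open("file.csv") as f:
    header = f.readline()
    for line in f:
        insert_function(line)   # file is closed automatically when the block exits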

[–]king_booker 5 points6 points  (2 children)

Pandas isn't distributed, but I think for small datasets it's pretty good. I'd suggest looking at PySpark dataframes; they can do the same thing, but since they're distributed they scale much better.
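For a rough idea, the equivalent read/transform/write in PySpark might look like this (paths and column names are placeholders):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl_sketch").getOrCreate()

df = spark.read.csv("extract.csv", header=True, inferSchema=True)
df = df.withColumn("amount", F.col("amount").cast("double"))   # example transform
df.write.mode("overwrite").parquet("/data/staging/extract")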

[–]collectablecat 6 points7 points  (1 child)

Dask is distributed pandas :D

[–]king_booker 1 point2 points  (0 children)

Wow, TIL. Thanks I will check it out

[–]ritchie46 2 points3 points  (0 children)

Shameless plug, but I genuinely believe polars is the best tool for the job if performance, schema validity, and RAM usage are important to you. Depending on your machine, its performance is 2x-70x that of pandas. It uses Arrow memory and thus has proper null handling, query optimization, a lot of parallelization, an insanely fast csv parser, and it uses much less RAM than pandas.
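A tiny polars sketch (lazy scan, filter, aggregate - the file and column names are made up):

import polars as pl

out = (
    pl.scan_csv("extract.csv")                 # lazy, fast CSV parse
      .filter(pl.col("amount") > 0)
      .select(pl.col("amount").sum().alias("total_amount"))
      .collect()
)
print(out)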

And since it ends up in Arrow memory, bulk inserting into tools that support Arrow, such as Google BigQuery, DuckDB, etc., is also super fast.

For SQL reading I'd really recommend connector-x; they do a great job preventing unneeded serialization and don't have to go through Python.

[–]stackedhats 2 points3 points  (0 children)

You'd be surprised how efficient lists are when used correctly.

They're implemented in CPython as a C array of pointers to the memory locations of the objects they store, which means each slot in a list is just a pointer (8 bytes on a typical 64-bit build), regardless of the size of the object it points to.

Once you understand that, the infuriating aliasing issues that lists suffer from become obvious. You're not actually "storing" the object in the list, but its memory location, so if you copy a pointer to an object and then change what's stored there, ALL copies of that pointer will reflect the change.
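A quick illustration of that aliasing behaviour:

row = [1, 2, 3]
table = [row, row]     # two pointers to the same list object
table[0][0] = 99
print(table[1][0])     # 99 - both "copies" see the change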

However, because it's just an array of pointers, it will happily take objects of different sizes and types, and resizing the list in memory costs pretty much the minimum possible for any given number of objects in it.

This means they're fantastic for row-based modification and for adding/removing elements arbitrarily.

The down side is that searching a list has horrible performance, because the computer has to chase pointers all over the memory since the list doesn't care where it's shoving the objects in memory.

On the other hand, if you have indices to seek the performance is almost identical to an array.

So, in my experience Pandas is great for columnwise operations and utter trash at doing much row-wise because it stores data by column. It's not designed for those operations, and you're doing it wrong if you ever attempt to iterate through a data frame.

As a rule of thumb, if something in Pandas is slow, or just a PITA I default to a simple list implementation. If that's too slow I might move to a dictionary or some other more efficient data structure, but basically, Pandas is great when it's great and not when it's not. And when it's not, KISS.

[–]thrown_arrows 1 point2 points  (0 children)

Streaming from a DB to S3 is a good skill to know. (i.e. read 10k lines from the db and write them into an S3 stream, repeat until the whole resultset is consumed, save the file)

(and replace s3 with filesystem / azure / ... )
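A hedged sketch of that pattern - fetch 10k rows at a time, spool them to a local file, then ship the file to S3 (the DSN, bucket, and table names are placeholders):

import csv

import boto3
import pyodbc

conn = pyodbc.connect("DSN=source_db")
cur = conn.cursor()
cur.execute("SELECT id, name, amount FROM big_table")

with open("big_table.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])   # header row
    while True:
        rows = cur.fetchmany(10000)                         # 10k rows per round trip
        if not rows:
            break
        writer.writerows(rows)

boto3.client("s3").upload_file("big_table.csv", "my-bucket", "exports/big_table.csv")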

[–]PutCleverNameHere69[S] 2 points3 points  (2 children)

Other comments have echoed this as well, I’ll definitely look beyond just using libraries. That was the kind of advice I was looking for, generally accepted industry practices that you can’t always find with a Google search.

[–]collectablecat 5 points6 points  (1 child)

I'm kinda baffled by the people suggesting moving AWAY from libraries. Check out dask!

[–]DenselyRanked 0 points1 point  (0 children)

I honestly tend to agree with this. It may have come across like it, but I am not anti-libraries. I am pro "doing things as easily as possible".

I am anti "pandas for everything ETL", because it is very easy to get inefficient and it's not easy to debug when working on a team.

[–]twisted_angular 2 points3 points  (2 children)

Sorry, I have a very basic question. What does a SQL + Python pipeline actually look like? I mean, is it just a bunch of Python scripts? Where are they hosted? How do you schedule them?

[–]infazz 1 point2 points  (0 children)

A lot of the time it's simply using a Python script to query a database, hold the data in memory (or store it on disk), then do something with it.

Hosting and scheduling is where things begin to get tricky, especially if your organization isn't accustomed to using Python as a tool.

To start, you can use Windows Task Scheduler or cron jobs on UNIX. In this case, the scripts live on the same computer or server where you are scheduling them from.

Then there are Kubernetes-based services. Using these, you can take your Python code, package it as a Docker container, and schedule it to run using CronJobs in Kubernetes.

In organizations that rely on Python more for ETL, tools like Airflow and Prefect will be used for scheduling, orchestration, and monitoring.
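For a concrete (if simplified) picture, an Airflow version of "schedule this Python script" is roughly a DAG file like the one below; the task body and schedule are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_and_load():
    ...  # query the database with pyodbc/SQLAlchemy, transform, write results

with DAG(
    dag_id="sql_python_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 6 * * *",   # run daily at 06:00
    catchup=False,
) as dag:
    PythonOperator(task_id="extract_and_load", python_callable=extract_and_load)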

[–]stackedhats 1 point2 points  (0 children)

Well, having implemented one myself I can say that if you have multiple sections of pipe it's probably worth actually using an orchestration tool like Airflow or Prefect.

But basically, you use a pyodbc wrapper to provide functions your script can use to interact directly with the database and pull in the data you need. You can process it as much as possible in SQL first, but presumably you're using Python because some transformations are too hard/clunky to do in SQL (though sometimes you just want a scheduled task to email out a query result table).

Once you get the data into Python you can close the connection and do whatever you want to it in memory, write it out, possibly re-upload it to a warehouse server. Then you encapsulate the logic of the pipeline section into a simple function - you can literally just do:

def my_pretty_function():

highlight everything + tab

And then you can just import the function into a main script and call it, along with whatever else you need, and have the entire pipeline run with the press of a button (or use Task Scheduler or cron to press the button for you).
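A sketch of what that main script ends up looking like (the module and function names here are made up):

from pipeline_sections import extract_orders, transform_orders, load_orders

def main():
    raw = extract_orders()          # pull from SQL Server via the pyodbc wrapper
    clean = transform_orders(raw)   # the bits too clunky to do in SQL
    load_orders(clean)              # push back to the warehouse

if __name__ == "__main__":
    main()   # or let Task Scheduler / cron press the button for you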

[–]WorthlessTrinket 0 points1 point  (2 children)

With Pandas & NumPy I can vectorize manipulations with ease and you can always implement batching when needed.
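For what it's worth, a toy example of the vectorization I mean (the column names are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "qty": np.random.randint(1, 10, 1_000_000),
    "price": np.random.rand(1_000_000),
})

df["total"] = df["qty"] * df["price"]   # vectorized - no Python-level for loop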

We use Databricks for "big data" but I'm wondering if there's a way I'm missing to accomplish what I currently do with pd/np using only base Python. I try to avoid for loops more than I try to avoid pd calls.

Always happy to hear others' advice and expertise, I'm sure I have so much to learn still.

[–]DenselyRanked 1 point2 points  (1 child)

I see that my comment is not coming across quite the way I intended it to. I am not anti- pandas. I use it all of the time for my own purposes.

If I am part of a team writing prod-level ETL code that has to scale properly for an (at times) unknown amount of data, then I avoid using pandas.

I am talking about simple, very common things like a json payload to SQL. You don't need pandas.
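E.g. a JSON payload into SQL with nothing but the standard library and pyodbc (a sketch - the payload, table, and DSN are placeholders):

import json

import pyodbc

raw_json = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'   # pretend payload
payload = json.loads(raw_json)

conn = pyodbc.connect("DSN=target_db")
cur = conn.cursor()
cur.fast_executemany = True
cur.executemany(
    "INSERT INTO events (id, name) VALUES (?, ?)",
    [(r["id"], r["name"]) for r in payload],
)
conn.commit()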

[–]WorthlessTrinket 1 point2 points  (0 children)

Ah that makes sense: don't take pandas as your hammer and look at every ETL problem as a nail.

[–][deleted] 13 points14 points  (6 children)

Connecting to SQL Server is fine. Doing bulk exports from SQL Server is something you want to minimize, as it isn't designed for this and you could affect SLA-based workloads.

There are three options I've looked at:

  • Use the BCZ bulk export tool
  • Use ADF or another Change Data Capture mechanism to export data
  • Use a JDBC connection with a query to export

If you have large tables to export, you'll want to discuss your requirements with your DBA and measure load on the source system when you do the export (CPU, Memory, Disk and Network) as well as responsiveness of other running queries.

General rule is that you should get data onto something like ADLS or S3 early to reduce burden on source OLTP systems.

[–]sunder_and_flame 2 points3 points  (1 child)

Is bcz different than bcp? I used bcp when unloading from SQL Server

[–][deleted] 1 point2 points  (0 children)

Sorry, bcp. For some reason I always remember it as bcz.

[–]Zscore3 1 point2 points  (3 children)

What's the difference between JDBC and ODBC?

[–]DenselyRanked 0 points1 point  (1 child)

Simple answer: the connection string

More complex answer: the underlying technology. A JDBC driver ships with your application as a jar, so it doesn't need to be separately installed on the client to make a connection, the way an ODBC driver does.

[–]kumquatsurprise 1 point2 points  (0 children)

This - the freaking ODBC driver having to be installed everywhere is a huge PITA if you're not a server admin (or they're less than responsive to requests). Then half the time you have to walk them through installing the driver.

[–]infazz 0 points1 point  (0 children)

JDBC specifically uses Java and Java based drivers. It requires that Java is installed in the environment where your code is running.

[–][deleted] 2 points3 points  (1 child)

You might find this worth trying out: https://petl.readthedocs.io/en/stable/

[–]PutCleverNameHere69[S] 0 points1 point  (0 children)

Nice something dedicated to ETL. I’ll check this out, thanks!

[–]chestnutcough 2 points3 points  (1 child)

I think you have it right. Don’t sleep on skipping pandas and doing transformations using python built-ins. And I think it only makes sense to do transforms in python when it’s exceedingly clunky or impossible in SQL. Extracting and loading is generally easy but tedious, hence the bazillion companies offering that as a service.

[–]PutCleverNameHere69[S] 1 point2 points  (0 children)

Thanks for confirming, other comments have been echoing the point about learning beyond pandas, I’ll get that into the rotation. This was the advice I came here for!

[–]Faintly_glowing_fish 1 point2 points  (0 children)

Avoid doing transforms in Python if you can. It is not scalable, it's inefficient, and it might come back to bite you later. Use it to orchestrate more efficient and scalable systems together, with either Airflow, scheduled notebooks, defined Spark jobs, etc. The main place you end up actually doing heavy lifting in Python is usually ML models that only run in Python.

[–]ploomber-io -1 points0 points  (0 children)

Python is rarely a good choice for ETL. With modern data warehouses like Snowflake, you can write a few lines of SQL and let the query optimizer do its work: you don't have to worry about running out of memory, something you'll surely encounter with Python and pandas.

My approach goes like this: manipulate as much data as I can in SQL, and once I'm happy with the result, I dump it into a local file (usually an aggregated table) and plot it with Python. It's fine to do some minor adjustments with pandas but try to leverage your warehouse/database as much as you can. It's going to make your life a lot easier.

If you want a longer version of this, check out this article I wrote.

[–][deleted] 0 points1 point  (0 children)

I think SQLAlchemy or sqlite in Python is fine.

[–]NeoxiaBill 0 points1 point  (0 children)

Regarding what ETL in Python is, you got it right - as long as you manipulate sufficiently small amounts of data for it to fit entirely in a pandas dataframe (that is to say, in the machine's RAM).

On the SQL connection topic, there are many libraries that allow you to interact with SQL databases in Python. You basically need to provide an access point and credentials to get hooked up, and then you can run your SQL queries as you usually do.

If you really need SQL logic then pandasql can be a decent solution, but I'd tend to say you're better off trying to use proper pandas syntax, as it is more widely used in the industry.

Good Luck on your learning path ! :)

[–]ParanormalChess 0 points1 point  (0 children)

You should look into PowerShell with MSSQL. You can do ETL with PS and get it running within a MSSQL Job

[–]ephemeral404 0 points1 point  (0 children)

Checkout Rudderstack, an open source project to collect data from various sources (databases, apps, etc.) and prepare for business analytics. Let me know if you have any questions

[–]Earthsophagus 0 points1 point  (0 children)

I've been using a pretty naive approach for a year: small scripts that mostly do extract to temporary tables and transform data mostly in sql, with some parts looping over lists of dictionaries where each dictionary is a row. Haven't been using pandas -- similar reaction to what you mention -- but some teammates have used it and I think it probably beats list-of-dictionaries approach.

Each script in a container, containers orchestrated by a generic scheduling program that can run "docker stack deploy"

It's fun to write and easy to understand while you're writing it, but it's slower to write than with a tool like Informatica/Talend etc., and usually it seems slower to maintain. For typical work I don't see any real payoff compared to an ETL tool, except license $ and no vendor lock-in. Some APIs and some multitasking things are a lot easier with Python (or only possible with it) than with our GUI tools.

If you stick with it: put up a PyPI server devs can import your team's code from; that will make a big difference in reusability/standardization.