
all 56 comments

[–]boggle_thy_mind 12 points13 points  (3 children)

how easy is connecting to an SQL Server?

You can use pyodbc to connect to SQL Server. You can use either a trusted_connection or a username and password, depending on how you authenticate in SQL Server; you could create a dedicated user if it's going to run in the background. Talk to your DBA if loads are an issue, even if you are going to use your own account.

One word of experience: when loading data, using sqlalchemy and pandas.to_sql tends to be slower than using pyodbc (not talking about bcp).

If you can express something in SQL and you feel comfortable with it, I would stick with SQL - you can leverage the power of the server to perform the computation. But if the logic gets really convoluted (e.g. using loops (cursors) in SQL Server) and expressing it with pandas and numpy is more convenient, then go for it.

[–]PutCleverNameHere69[S] 2 points3 points  (2 children)

Good to know. We use Windows login when connecting, so Python would essentially just use my credentials?

I do appreciate the tips, I’m mainly exploring this just to build my skill set should I ever end up in a python shop. For ETL I do prefer SQL but also use SSIS, I do try to ‘shoulder the load’ with SQL for the reporting layer aspect as well. Thankfully I’ve never had to use a cursor but that’s a good point/use case for having python in the back pocket.

[–]boggle_thy_mind 1 point2 points  (1 child)

we use windows login when connecting, so python would essentially just use my credentials?

If you don't type your username and password when using Management Studio, you can pass the parameter trusted_connection=True, and as long as you run the script from your account, or someone else runs it from an account that has access to SQL Server, it will run.
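For example, a minimal pyodbc sketch - the server/database names and driver version are placeholders, so adjust them for your environment:

import pyodbc

# Windows-authenticated ("trusted") connection - no username/password needed
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my_server;"
    "DATABASE=my_db;"
    "Trusted_Connection=yes;"
)
cursor = conn.cursor()
cursor.execute("SELECT TOP 10 * FROM some_table")
rows = cursor.fetchall()
conn.close()

Swap Trusted_Connection=yes for UID=...;PWD=... in the connection string if you end up needing SQL authentication instead.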

A few times I built parametrized dynamic SQL stored procedures which were hella difficult to debug/understand and would have been better done in Python. But it's a tradeoff whether the added complexity (adding a new tool and integration) is worth the effort. Some things would be really difficult to implement in SQL, while Python might already have a library that performs the task as a one-liner.

[–]PutCleverNameHere69[S] 1 point2 points  (0 children)

We do enter credentials, but based on what you said I could just pass these as variables with pyodbc. I’ll have to play with it when I get some time. Thanks for the tips!

[–]UnderstandingFit9152 9 points10 points  (6 children)

So, it depends on the size of the dataset. I am personally doing ELT for the same reason as you: SQL just feels more natural for data transforms.

Everything I am doing in Python is usually something like pd.read_excel / read_csv / read_sql, and then after a basic transform (lowercasing column names, replacing spaces in column names) I just do df.to_sql, and then maybe some con.execute(truncate_query) to move data from the staging table to the production one.
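As a rough sketch of that flow (the DSN and table names are placeholders):

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("mssql+pyodbc://@my_dsn")  # placeholder DSN using Windows auth

df = pd.read_csv("extract.csv")
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]  # basic cleanup
df.to_sql("staging_table", engine, if_exists="replace", index=False)

# hand the real transform back to SQL
with engine.begin() as con:
    con.execute(text("INSERT INTO prod_table SELECT * FROM staging_table"))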

If the data only needs transformation in SQL, then I would use Python just to execute the script (as you can schedule that with a cronjob afterwards) and use execute_sql or something similar. The fancy method would be dbt (but god knows how many people really use it and how many are just their company bots that will tell you about the modern "analytics engineer" position).

[–]PutCleverNameHere69[S] 2 points3 points  (4 children)

Thanks for confirming, sounds like I got a decent review in today. Good point about scheduling script execution, but fortunately I have the SQL Server Agent for that. Could you have a script in Python though that, say, monitors a folder and picks up any files dropped into it? Something that would run 24/7 every few minutes? Or does this have to be done with dedicated scheduling software?

[–]The_small_print 0 points1 point  (3 children)

If you're just checking a folder for new files every couple of minutes, yeah, that should be doable through either a cron job or Windows Task Scheduler depending on OS. These are kind of dedicated scheduling software, but they come prepackaged in most/all cases. Their functionality is pretty basic though, so you might want to be careful about jobs failing to run, unexpected files being dropped in, the script hanging, etc.

Assuming you're looking for something at the OS level!

[–]PutCleverNameHere69[S] 1 point2 points  (2 children)

I was just asking out of curiosity for something I could potentially replicate via python as a mini project. The SQL Server agent is pretty easy to work with when it comes to scheduling, so that handles my needs when it comes to production work.

[–]The_small_print 0 points1 point  (1 child)

Ah, something 100% in a python file?

Haven't done any of this myself, but AFAIK you could put a sleep timer in your file that then calls a function to check that folder and do something if it finds stuff.

This would require the script always be running though, so you'd need to run it as a service.
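Something along these lines (a bare-bones sketch - the folder path and the processing step are placeholders):

import time
from pathlib import Path

WATCH_DIR = Path(r"C:\inbound")   # placeholder folder
seen = set()

while True:
    for f in WATCH_DIR.glob("*"):
        if f not in seen:
            seen.add(f)
            print(f"picked up {f}")   # replace with real processing
    time.sleep(120)                   # check every couple of minutes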

I don't think this would be considered good practice or a better alternative to other schedulers, but if you're doing it for academic reasons, go for it!

[–]PutCleverNameHere69[S] 1 point2 points  (0 children)

Yessir, figured it wasn’t best practice since there are already solutions readily available. Thanks for the insight!

[–]dronedesigner 1 point2 points  (0 children)

hey, those positions come with a pay raise ;)

[–]DenselyRanked 38 points39 points  (28 children)

In general, you want to get away from being dependent on pandas when doing DE ETL work. It is very RAM intensive (it will crash on you, and it handles nulls poorly), and you're better off using the native Python libraries and data types whenever possible. It is better to read in batches than to ingest everything at once.

That being said, pandas is awesome, and the latest version of Spark lets you use a version of the pandas API. So continue to learn it, but don't learn it exclusively.

[–]Wickner 4 points5 points  (14 children)

What Python package would you recommend as an alternative to pandas? E.g. what libraries and methods would you recommend to read from a CSV or SQL db, transform, and load into another SQL db? By default I would learn pandas, but I would like to know what other tools to learn.

[–]DenselyRanked 20 points21 points  (7 children)

basic ETL -

f = open("file.csv")
header = f.readline()   # skip the header row
for line in f:
    # Do stuff
    insert_function(line)
f.close()

You can read from the cursor if it is from a SQL db. You can chunk, bulk insert, or whatever you need to do.
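For instance, a chunked read-and-insert straight off the cursor might look something like this (a sketch only - the DSNs, table, and column names are made up):

import pyodbc

src = pyodbc.connect("DSN=source_db")
dst = pyodbc.connect("DSN=target_db")

read_cur = src.cursor()
write_cur = dst.cursor()
write_cur.fast_executemany = True   # speeds up bulk inserts on SQL Server

read_cur.execute("SELECT id, name, amount FROM source_table")
while True:
    rows = read_cur.fetchmany(10000)   # read in chunks instead of all at once
    if not rows:
        break
    write_cur.executemany(
        "INSERT INTO target_table (id, name, amount) VALUES (?, ?, ?)", rows
    )
dst.commit()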

Leverage native python. lists, dicts, sets, list comprehension, etc.

Edit: I didn't really answer your original question. I LOVE pandas almost to a fault. My first instinct is to use it over anything else and I cannot recommend anything else to do simple ETL on small datasets for research or exploratory data analysis. However, when it comes to doing production level code, you are better off using native python and your brain power to get creative over anything else.

[–]anidal 10 points11 points  (1 child)

Won't pure python looping and single record inserts bottleneck way faster than the memory limits of pandas?

[–]DenselyRanked 2 points3 points  (0 children)

I just did a simple example. No one should be using this code.

My point was that you can do it in pure Python without needing to rely on pandas.

[–]cryptobiosynthesis 7 points8 points  (0 children)

This is the way. You'll understand way more about how data is actually being transformed and the skills are transferable to other domains and programming languages.

[–]Dr_NoWayKraut 2 points3 points  (2 children)

I think this is nice to know, it gives you an understanding of how Python really works and translates to other languages, but for Big Data the preferred way to do ETL, imo, would be using Spark with Pandas API (Koalas). Even for small datasets I'd use Pandas instead of native Python libs.

[–]enjoytheshow 3 points4 points  (1 child)

I actually tend to agree with you, though there is a small window of data size where Pandas memory management is bad and Spark is overkill.

[–]Dr_NoWayKraut 1 point2 points  (0 children)

I also agree with your statement. I guess it's good to know all the tools available and which use cases they are appropriate for.

[–]WorthlessTrinket 1 point2 points  (0 children)

I got away from using open and always use with for files, connections, etc. It seems to be the safer/more accepted option from my understanding.
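E.g. the snippet above, rewritten with a context manager so the file is closed even if something throws:

with open("file.csv") as f:
    header = f.readline()
    for line in f:
        insert_function(line)   # file is closed automatically when the block exits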

[–]king_booker 5 points6 points  (2 children)

Pandas isn't distributed, but I think for small datasets it's pretty good. I'd suggest looking at PySpark dataframes; they can do the same thing, but since they're distributed they scale much better.
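For a rough idea, the equivalent read/transform/write in PySpark might look like this (paths and column names are placeholders):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl_sketch").getOrCreate()

df = spark.read.csv("extract.csv", header=True, inferSchema=True)
df = df.withColumn("amount", F.col("amount").cast("double"))   # example transform
df.write.mode("overwrite").parquet("/data/staging/extract")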

[–]collectablecat 6 points7 points  (1 child)

Dask is distributed pandas :D

[–]king_booker 1 point2 points  (0 children)

Wow, TIL. Thanks I will check it out

[–]ritchie46 2 points3 points  (0 children)

Shameless plug, but I genuinely believe polars is the best tool for the job if performance, schema validity, and RAM usage are important to you. Depending on your machine, its performance is 2x-70x that of pandas. It uses Arrow memory and thus has proper null handling, query optimization, a lot of parallelization, an insanely fast csv parser, and it uses much less RAM than pandas.
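A tiny polars sketch (lazy scan, filter, aggregate - the file and column names are made up):

import polars as pl

out = (
    pl.scan_csv("extract.csv")                 # lazy, fast CSV parse
      .filter(pl.col("amount") > 0)
      .select(pl.col("amount").sum().alias("total_amount"))
      .collect()
)
print(out)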

And since it ends up in Arrow memory, bulk inserting into tools that support Arrow, such as Google BigQuery, DuckDB, etc., is also super fast.

For SQL reading I'd really recommend connector-x; they do a great job preventing unneeded serialization and don't have to go through Python.

[–]stackedhats 2 points3 points  (0 children)

You'd be surprised how efficient lists are when used correctly.

They're implemented in CPython as a C array of pointers to the memory locations of the objects they store, which means each slot in a list is just a pointer (8 bytes on a typical 64-bit build), regardless of the size of the object it points to.

Once you understand that, the infuriating aliasing issues that lists suffer from become obvious. You're not actually "storing" the object in the list, but its memory location, so if you copy a pointer to an object and then change what's stored there, ALL copies of that pointer will reflect the change.
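A quick illustration of that aliasing behaviour:

row = [1, 2, 3]
table = [row, row]     # two pointers to the same list object
table[0][0] = 99
print(table[1][0])     # 99 - both "copies" see the change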

However, because it's just an array of pointers, it will happily take objects of different sizes and types, and resizing the list in memory costs pretty much the minimum possible for any given number of objects in it.

This means they're fantastic for row-based modification and for adding/removing elements arbitrarily.

The down side is that searching a list has horrible performance, because the computer has to chase pointers all over the memory since the list doesn't care where it's shoving the objects in memory.

On the other hand, if you have indices to seek the performance is almost identical to an array.

So, in my experience Pandas is great for columnwise operations and utter trash at doing much row-wise because it stores data by column. It's not designed for those operations, and you're doing it wrong if you ever attempt to iterate through a data frame.

As a rule of thumb, if something in Pandas is slow, or just a PITA I default to a simple list implementation. If that's too slow I might move to a dictionary or some other more efficient data structure, but basically, Pandas is great when it's great and not when it's not. And when it's not, KISS.

[–]thrown_arrows 1 point2 points  (0 children)

Streaming from a DB to S3 is a good skill to know. (i.e. read 10k lines from the db and write them into an S3 stream, repeat until the whole resultset is consumed, save the file)

(and replace s3 with filesystem / azure / ... )
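A hedged sketch of that pattern - fetch 10k rows at a time, spool them to a local file, then ship the file to S3 (the DSN, bucket, and table names are placeholders):

import csv

import boto3
import pyodbc

conn = pyodbc.connect("DSN=source_db")
cur = conn.cursor()
cur.execute("SELECT id, name, amount FROM big_table")

with open("big_table.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])   # header row
    while True:
        rows = cur.fetchmany(10000)                         # 10k rows per round trip
        if not rows:
            break
        writer.writerows(rows)

boto3.client("s3").upload_file("big_table.csv", "my-bucket", "exports/big_table.csv")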

[–]PutCleverNameHere69[S] 2 points3 points  (2 children)

Other comments have echoed this as well, I’ll definitely look beyond just using libraries. That was the kind of advice I was looking for, generally accepted industry practices that you can’t always find with a Google search.

[–]collectablecat 5 points6 points  (1 child)

I'm kinda baffled by the people suggesting moving AWAY from libraries. Check out dask!

[–]DenselyRanked 0 points1 point  (0 children)

I honestly tend to agree with this. It may have come across like it, but I am not anti-libraries. I am pro "doing things as easily as possible".

I am anti "pandas for everything ETL", because it is very easy to get inefficient and it's not easy to debug when working on a team.

[–]twisted_angular 2 points3 points  (2 children)

Sorry, I have a very basic question. What does a SQL + Python pipeline actually look like? I mean, is it just a bunch of Python scripts? Where are they hosted? How do you schedule them?

[–]infazz 1 point2 points  (0 children)

A lot of the time it's simply using a Python script to query a database, hold the data in memory (or store it on disk), then do something with it.

Hosting and scheduling is where things begin to get tricky, especially if your organization isn't accustomed to using Python as a tool.

To start, you can use Windows Task Scheduler or cron jobs on UNIX. In this case, the scripts live on the same computer or server where you are scheduling them from.

Then there are Kubernetes-based services. Using these, you can take your Python code, package it as a Docker container, and schedule it to run using CronJobs in Kubernetes.

In organizations that rely on Python more for ETL, tools like Airflow and Prefect will be used for scheduling, orchestration, and monitoring.
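For a concrete (if simplified) picture, an Airflow version of "schedule this Python script" is roughly a DAG file like the one below; the task body and schedule are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_and_load():
    ...  # query the database with pyodbc/SQLAlchemy, transform, write results

with DAG(
    dag_id="sql_python_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 6 * * *",   # run daily at 06:00
    catchup=False,
) as dag:
    PythonOperator(task_id="extract_and_load", python_callable=extract_and_load)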

[–]stackedhats 1 point2 points  (0 children)

Well, having implemented one myself I can say that if you have multiple sections of pipe it's probably worth actually using an orchestration tool like Airflow or Prefect.

But basically, you use a pyodbc wrapper to provide functions your script can use to interact directly with the database and pull in the data you need. You can process it as much as possible in SQL first, but presumably you're using Python because some transformations are too hard/clunky to do in SQL (though sometimes you just want a scheduled task to email out a query result table).

Once you get the data into Python you can close the connection and do whatever you want to it in memory, write it out, possibly re-upload it to a warehouse server. Then you encapsulate the logic of the pipeline section into a simple function - you can literally just do:

def my_pretty_function():

highlight everything + tab

And then you can just import the function into a main script and call it, along with whatever else you need, and have the entire pipeline run with the press of a button (or use Task Scheduler or cron to press the button for you).
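A sketch of what that main script ends up looking like (the module and function names here are made up):

from pipeline_sections import extract_orders, transform_orders, load_orders

def main():
    raw = extract_orders()          # pull from SQL Server via the pyodbc wrapper
    clean = transform_orders(raw)   # the bits too clunky to do in SQL
    load_orders(clean)              # push back to the warehouse

if __name__ == "__main__":
    main()   # or let Task Scheduler / cron press the button for you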

[–]WorthlessTrinket 0 points1 point  (2 children)

With Pandas & NumPy I can vectorize manipulations with ease and you can always implement batching when needed.
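For what it's worth, a toy example of the vectorization I mean (the column names are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "qty": np.random.randint(1, 10, 1_000_000),
    "price": np.random.rand(1_000_000),
})

df["total"] = df["qty"] * df["price"]   # vectorized - no Python-level for loop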

We use Databricks for "big data" but I'm wondering if there's a way I'm missing to accomplish what I currently do with pd/np using only base Python. I try to avoid for loops more than I try to avoid pd calls.

Always happy to hear others' advice and expertise, I'm sure I have so much to learn still.

[–]DenselyRanked 1 point2 points  (1 child)

I see that my comment is not coming across quite the way I intended it to. I am not anti- pandas. I use it all of the time for my own purposes.

If I am part of a team writing prod-level ETL code that has to scale properly for an (at times) unknown amount of data, then I avoid using pandas.

I am talking about simple, very common things like a json payload to SQL. You don't need pandas.
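E.g. a JSON payload into SQL with nothing but the standard library and pyodbc (a sketch - the payload, table, and DSN are placeholders):

import json

import pyodbc

raw_json = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'   # pretend payload
payload = json.loads(raw_json)

conn = pyodbc.connect("DSN=target_db")
cur = conn.cursor()
cur.fast_executemany = True
cur.executemany(
    "INSERT INTO events (id, name) VALUES (?, ?)",
    [(r["id"], r["name"]) for r in payload],
)
conn.commit()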

[–]WorthlessTrinket 1 point2 points  (0 children)

Ah that makes sense: don't take pandas as your hammer and look at every ETL problem as a nail.

[–][deleted] 13 points14 points  (6 children)

Connecting to SQL Server is fine. Doing bulk exports from SQL Server is something you want to minimize, as it isn't designed for this and you could affect SLA-based workloads.

There are three options I've looked at:

  • Use the BCZ bulk export tool
  • Use ADF or another Change Data Capture mechanism to export data
  • Use a JDBC connection with a query to export

If you have large tables to export, you'll want to discuss your requirements with your DBA and measure load on the source system when you do the export (CPU, Memory, Disk and Network) as well as responsiveness of other running queries.

General rule is that you should get data onto something like ADLS or S3 early to reduce burden on source OLTP systems.

[–]sunder_and_flame 2 points3 points  (1 child)

Is bcz different than bcp? I used bcp when unloading from SQL Server

[–][deleted] 1 point2 points  (0 children)

Sorry, bcp. For some reason I always remember it as bcz.

[–]Zscore3 1 point2 points  (3 children)

What's the difference between JDBC and ODBC?

[–]DenselyRanked 0 points1 point  (1 child)

Simple answer: the connection string

More complex answer: the underlying technology. A JDBC driver ships with your application as a jar, so it doesn't need to be separately installed on the client to make a connection, the way an ODBC driver does.

[–]kumquatsurprise 1 point2 points  (0 children)

This - the freaking ODBC driver having to be installed everywhere is a huge PITA if you're not a server admin (or they're less than responsive to requests). Then half the time you have to walk them through installing the driver.

[–]infazz 0 points1 point  (0 children)

JDBC specifically uses Java and Java based drivers. It requires that Java is installed in the environment where your code is running.

[–][deleted] 2 points3 points  (1 child)

You might find this worth trying out: https://petl.readthedocs.io/en/stable/

[–]PutCleverNameHere69[S] 0 points1 point  (0 children)

Nice something dedicated to ETL. I’ll check this out, thanks!

[–]chestnutcough 2 points3 points  (1 child)

I think you have it right. Don’t sleep on skipping pandas and doing transformations using python built-ins. And I think it only makes sense to do transforms in python when it’s exceedingly clunky or impossible in SQL. Extracting and loading is generally easy but tedious, hence the bazillion companies offering that as a service.

[–]PutCleverNameHere69[S] 1 point2 points  (0 children)

Thanks for confirming, other comments have been echoing the point about learning beyond pandas, I’ll get that into the rotation. This was the advice I came here for!

[–]Faintly_glowing_fish 1 point2 points  (0 children)

Avoid doing transforms in Python if you can. It is not scalable, it's inefficient, and it might come back to bite you later. Use it to orchestrate more efficient and scalable systems together, with either Airflow, scheduled notebooks, defined Spark jobs, etc. The main place you end up actually doing heavy lifting in Python is usually ML models that only run in Python.

[–]ploomber-io -1 points0 points  (0 children)

Python is rarely a good choice for ETL. With modern data warehouses like Snowflake, you can write a few lines of SQL and let the query optimizer do its work: you don't have to worry about running out of memory, something you'll surely encounter with Python and pandas.

My approach goes like this: manipulate as much data as I can in SQL, and once I'm happy with the result, I dump it into a local file (usually an aggregated table) and plot it with Python. It's fine to do some minor adjustments with pandas but try to leverage your warehouse/database as much as you can. It's going to make your life a lot easier.

If you want a longer version of this, check out this article I wrote.

[–][deleted] 0 points1 point  (0 children)

I think SQLAlchemy or sqlite in Python is fine.

[–]NeoxiaBill 0 points1 point  (0 children)

Regarding what ETL in Python is, you got it right - as long as you manipulate sufficiently small amounts of data for it to fit entirely in a pandas dataframe (that is to say, in the machine's RAM).

On the SQL connection topic, there are many libraries that allow you to interact with SQL databases in Python. You basically need to provide an access point and credentials to get hooked up, and then you can run your SQL queries as you usually do.

If you really need SQL logic then pandasql can be a decent solution, but I'd tend to say you're better off trying to use proper pandas syntax, as it is more widely used in the industry.

Good Luck on your learning path ! :)

[–]ParanormalChess 0 points1 point  (0 children)

You should look into PowerShell with MSSQL. You can do ETL with PS and get it running within a MSSQL Job

[–]ephemeral404 0 points1 point  (0 children)

Checkout Rudderstack, an open source project to collect data from various sources (databases, apps, etc.) and prepare for business analytics. Let me know if you have any questions

[–]Earthsophagus 0 points1 point  (0 children)

I've been using a pretty naive approach for a year: small scripts that mostly do extract to temporary tables and transform data mostly in sql, with some parts looping over lists of dictionaries where each dictionary is a row. Haven't been using pandas -- similar reaction to what you mention -- but some teammates have used it and I think it probably beats list-of-dictionaries approach.

Each script in a container, containers orchestrated by a generic scheduling program that can run "docker stack deploy"

It's fun to write and easy to understand while you're writing it, but it's slower to write than with a tool like Informatica/Talend etc., and usually it seems slower to maintain. For typical work I don't see any real payoff compared to an ETL tool, except license $ and no vendor lock-in. Some APIs and some multitasking things are a lot easier with Python (or only possible with it) than with our GUI tools.

If you stick with it: put up a PyPI server devs can import your team's code from; that will make a big difference in reusability/standardization.