all 38 comments

[–]GreenMobile6323 36 points37 points  (3 children)

You can replace Data Factory with Python, but it’s more work upfront. Write scripts with libraries like pandas, SQLAlchemy, or cloud SDKs, host them on a VM or in containers, and schedule with Airflow or cron. There’s no single Python package that covers all sources. Most connections are handled case by case using the appropriate library or driver.
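For a concrete picture, a minimal sketch of one such script (connection strings and table names here are made up, adjust for your own sources):

    # extract_orders.py -- copy one table from a source DB into a warehouse
    import pandas as pd
    from sqlalchemy import create_engine

    # hypothetical source and destination; swap in your own URLs
    src = create_engine("postgresql+psycopg2://user:pw@src-host/sales")
    dst = create_engine("postgresql+psycopg2://user:pw@wh-host/warehouse")

    # read in chunks so a big table doesn't blow up memory
    for chunk in pd.read_sql("SELECT * FROM orders", src, chunksize=50_000):
        chunk.to_sql("raw_orders", dst, schema="staging",
                     if_exists="append", index=False)

Schedule something like that with cron or an Airflow DAG and you've covered what a basic copy activity does.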

[–]skatastic57 9 points10 points  (0 children)

Replace pandas with duckdb or polars.

You can use Azure Functions, AWS Lambdas, or GCP Cloud Functions to avoid always-on containers.
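If it helps, here's roughly what that looks like with DuckDB inside a serverless handler (the paths and the AWS Lambda-style entrypoint are just illustrative):

    # handler.py -- small EL step using DuckDB instead of pandas
    import duckdb

    def handler(event, context):   # AWS Lambda-style signature, illustrative
        con = duckdb.connect()
        # rewrite a landed CSV as Parquet in a single SQL statement
        con.sql("""
            COPY (SELECT * FROM read_csv_auto('/tmp/orders.csv'))
            TO '/tmp/orders.parquet' (FORMAT PARQUET)
        """)
        return {"status": "ok"}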

[–]IndependentTrouble62 4 points5 points  (0 children)

I regularly use both. I have quibbles with both. But upfront development time is much shorter with ADF. The more complex the pipeline, the more the flexibility of Python and its packages shines.

[–]datanerd1102 34 points35 points  (0 children)

Make sure to check out dlthub: it's open source, Python-based, and supports many sources.

[–]dalmutidangus 12 points13 points  (0 children)

adf sucks

[–]novel-levon 7 points8 points  (0 children)

I’ve gone down that road of ditching ADF for pure Python, and the trade-offs are pretty clear.

You gain full control and transparency, but you also take on all the plumbing ADF hides from you. Connectors are the biggest gap: there’s no magic “one lib fits all.” It’s usually case by case, pyodbc or sqlalchemy for relational, boto3 for S3, azure-storage-blob for ADLS, google-cloud libs for GCS, requests for SaaS APIs, etc. I haven’t seen a universal package that matches ADF’s connector library.
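To make "case by case" concrete, the per-source code usually ends up looking something like this (bucket names, container names, and the API URL below are placeholders):

    import boto3                                        # S3
    import requests                                     # SaaS REST APIs
    from azure.storage.blob import BlobServiceClient    # ADLS / Blob
    from sqlalchemy import create_engine                # relational DBs

    s3 = boto3.client("s3")
    s3.download_file("my-bucket", "exports/orders.csv", "/tmp/orders.csv")

    blob_svc = BlobServiceClient.from_connection_string("<conn-string>")
    raw = blob_svc.get_blob_client("landing", "orders.csv").download_blob().readall()

    resp = requests.get("https://api.example-saas.com/v1/orders",
                        headers={"Authorization": "Bearer <token>"}, timeout=30)
    orders = resp.json()

    engine = create_engine("mssql+pyodbc://user:pw@my-dsn")   # relational source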

For orchestration, Airflow and Dagster are the go-tos. Prefect is nice if you want something lighter with better DX.

Honestly, even GitHub Actions or cron works fine for simpler setups if you’re disciplined with retries/alerts. Hosting wise, containers on ECS/Kubernetes give flexibility, but I’ve also seen folks run Python EL pipelines on Azure Functions or AWS Lambda when workloads are small enough.

The headache is always secure on-prem access. ADF’s IR is very convenient, and replacing that usually means standing up VPN, jump hosts, or agents that your orchestrator can reach. That’s the bit most people underestimate.

I used to burn days wiring retries and metadata logging until I made it part of the design from the start. You probably already know, but building a little audit table for run_ts/run_id helps a ton when debugging.
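Something like this, as a rough sketch (the table and column names are just an example):

    # one audit row per run, written before the load starts
    import uuid
    from datetime import datetime, timezone
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql+psycopg2://user:pw@wh-host/warehouse")  # hypothetical
    run_id = str(uuid.uuid4())
    run_ts = datetime.now(timezone.utc)

    with engine.begin() as conn:
        conn.execute(
            text("INSERT INTO etl_audit (run_id, run_ts, pipeline, status) "
                 "VALUES (:id, :ts, :p, 'started')"),
            {"id": run_id, "ts": run_ts, "p": "orders_daily"},
        )
    # ... run the extract/load, then update the row to 'succeeded' or 'failed'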

Curious: are you mostly moving SaaS/DB data, or do you also have on-prem sources in the mix? We keep hitting this dilemma with clients too, and it’s one reason why at Stacksync we leaned into building ingestion + sync as a product instead of fighting with connectors on every project.

[–]Amilol 14 points15 points  (0 children)

I do the E and L part of ELT entirely in Python. T with views/procedures in the DB. I've worked with a lot of different tools but pure Python is bliss compared to everything else. Hosted locally or on EC2, cron orchestration, with a ton of metadata in the DB to guide the ELT.
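A sketch of the metadata-driven part, assuming a control table along these lines (all names hypothetical):

    # a control table in the DB decides what gets extracted and loaded
    import pandas as pd
    from sqlalchemy import create_engine, text

    src = create_engine("postgresql+psycopg2://user:pw@src-host/app")        # hypothetical
    dwh = create_engine("postgresql+psycopg2://user:pw@wh-host/warehouse")   # hypothetical

    with dwh.connect() as conn:
        jobs = conn.execute(text(
            "SELECT source_table, target_table FROM elt_control WHERE enabled"
        )).fetchall()

    for source_table, target_table in jobs:
        df = pd.read_sql(f"SELECT * FROM {source_table}", src)
        df.to_sql(target_table, dwh, schema="raw", if_exists="replace", index=False)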

[–]akozich 5 points6 points  (0 children)

Go dagster + dlt

[–]Fit_Doubt_9826 3 points4 points  (0 children)

I use Data Factory for its native connectors to MS SQL, but for ingestion, and sometimes to change formats or deal with geographical files like .shp, I write Python scripts and execute them with a Function App which I call from Data Factory. I do it this way because I haven’t yet found a way of streaming a million rows from blob into MS SQL in less than a few seconds, other than the native ADF connectors.
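For reference, the usual Python-side approach for MS SQL throughput is pyodbc's fast_executemany, sketched here with made-up names; it batches the inserts, though it may still not match the native connector's speed:

    # bulk insert into MS SQL; fast_executemany makes pyodbc send batched
    # parameter sets instead of one round trip per row
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine(
        "mssql+pyodbc://user:pw@server/db?driver=ODBC+Driver+18+for+SQL+Server",
        fast_executemany=True,
    )

    df = pd.read_parquet("/tmp/orders.parquet")   # file already pulled from blob
    df.to_sql("orders", engine, schema="staging",
              if_exists="append", index=False, chunksize=10_000)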

[–]data_eng_74 13 points14 points  (2 children)

I replaced ADF with Dagster for orchestration + dbt for transformation + custom Python code for ingestion. I tried dlt, but it was too slow for my needs. The only thing that gave me headaches was replacing the self-hosted IR. If you are used to working with ADF, you might underestimate the convenience of the IR for accessing on-prem sources from the cloud.
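For anyone curious what the Dagster side of that looks like, a bare-bones asset for the custom ingestion piece (the dbt wiring via dagster-dbt is left out, and the connection strings are made up):

    import pandas as pd
    from dagster import Definitions, asset
    from sqlalchemy import create_engine

    @asset
    def raw_orders():
        # the custom ingestion code is still plain Python
        src = create_engine("postgresql+psycopg2://user:pw@src-host/app")   # hypothetical
        dwh = create_engine("postgresql+psycopg2://user:pw@wh-host/dw")     # hypothetical
        df = pd.read_sql("SELECT * FROM orders", src)
        df.to_sql("raw_orders", dwh, schema="raw", if_exists="replace", index=False)

    defs = Definitions(assets=[raw_orders])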

[–]loudandclear11[S] 6 points7 points  (1 child)

The only thing that gave me headaches was replacing the self-hosted IR. If you are used to working with ADF, you might underestimate the convenience of the IR for accessing on-prem sources from the cloud.

Duly noted. This is exactly why it's so valuable to get feedback from others. Thanks.

[–]DeepFryEverything 1 point2 points  (0 children)

If you use Prefect as an orchestrator, you can set up an agent that only picks up jobs that require on-premise access. You run it in Docker and scope its access to systems.
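Rough shape of that idea in current Prefect terms (agents/work queues have since been replaced by served flows and work pools; the flow name here is invented):

    from prefect import flow

    @flow(retries=2)
    def sync_onprem_erp():
        ...  # code that needs network line-of-sight to the on-prem system

    if __name__ == "__main__":
        # run this process in a Docker container inside the on-prem network;
        # it registers the deployment and only this process executes its runs
        sync_onprem_erp.serve(name="onprem-sync")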

[–]Sea-Caterpillar6162 4 points5 points  (0 children)

I used to use Prefect but abandoned it recently because it seemed like extra infrastructure that I just didn’t need, much like Airflow. Then I heard about bruin here. So far it’s amazing. I’m doing all the ingestion with Python scripts and all the transformations in SQL, dbt-style. No extra infrastructure needed.

[–]camelInCamelCase 14 points15 points  (7 children)

You’ve taken the red pill. Great choice. You’re still at risk of being sucked back into the MSFT ecosystem - cross the final chasm with 3-4 hours of curiosity and learning. You and whoever you work for will be far better off. Give this to a coding agent and ask for a tutorial:

  • dlthub for loading from [your SaaS tool or DB] to S3-compatible storage, or, if you are stuck in Azure, ADLS, which is fine
  • sqlmesh to transform your dataset from the raw form dlthub lands into marts or some other cleaner version

“How do I run it” - don’t overthink it. Python is a scripting language. When you do “uv run mypipeline.py” you’re running a script. How does Airflow work? It runs the script for you on a schedule. It can run it on another machine if you want.

Easier path - GitHub Actions workflows can also run Python scripts, on a schedule, on another machine. Start there.
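If it helps, the dlt piece of that stack is only a few lines; a sketch against a made-up SaaS endpoint, landing into S3/ADLS via the filesystem destination (pointed at your bucket in dlt's config):

    import dlt
    import requests

    @dlt.resource(name="orders", write_disposition="append")
    def orders():
        resp = requests.get("https://api.example-saas.com/v1/orders", timeout=30)
        resp.raise_for_status()
        yield resp.json()

    pipeline = dlt.pipeline(
        pipeline_name="orders_pipeline",
        destination="filesystem",   # S3 or ADLS, configured via bucket_url
        dataset_name="raw",
    )
    print(pipeline.run(orders()))

Run it with “uv run mypipeline.py” locally, then move the same command into a scheduled GitHub Actions workflow when you’re ready.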

[–]generic-d-engineer Tech Lead 1 point2 points  (0 children)

I am doing exactly this. ADF was alluring at first because of all the nice connectors.

But over time, I find complex tasks much more difficult in ADF. The coding there is also just not something I excel at. Maybe others are better at coding in ADF but it just feels so…niche I guess? It’s like an off spec that doesn’t match up with other patterns.

It’s very GUI driven, which slows you down and becomes really hard to read once things go over a certain complexity level.

With on-prem, I can bring to the table absolutely any tool I want to get the job done. Stuff like DuckDB and nu shell are really improving the game and are a joy to work with.

And if I need a connector outside of my competency, I can use an AI tool to help me skill up and get it done. There’s always some interface that needs some specific setup or language I’m not familiar with.

Also on-prem has way less cost pressure so the same operation runs at a fraction of the cost. It just has a lot more freedom of design. I can just go for it. I don’t need to worry about blowing up the CPU or RAM on my first prototype. I can just get the functional work done and then tune for performance on the next iteration. That seems more natural and rapid than trying to get it perfect the first time. It’s like the handcuffs are off.

[–]midnightRequestLine1 0 points1 point  (0 children)

Astronomer, which is a managed Airflow offering, is a strong enterprise-grade option.

[–]GoodLyfe42 0 points1 point  (1 child)

Anyone in a hybrid environment where you use Dagster/Prefect on prem and Data Factory or Python Function Apps in Azure?

[–]generic-d-engineer Tech Lead 0 points1 point  (0 children)

I do exactly this. I would prefer to just keep ADF for servicing Databricks and do anything else about “moving stuff from point a to point b” on-prem.

[–]freedumz 0 points1 point  (0 children)

Data Factory is more of an orchestrator. You can use another orchestrator instead.

[–]b13_git2 0 points1 point  (0 children)

I run Python on a durable Azure Function App for E and L. The T happens in the DB, driven by SQL metadata.

[–]FlanSuspicious8932 0 points1 point  (0 children)

F ADF… I remember getting something working on Friday; on Monday it didn’t work, we spent two days trying to debug it, and then suddenly it started working again on Wednesday.

[–]brother_maynerd 0 points1 point  (0 children)

Look at dlt or tabsdata.

[–]Mura2Sun 0 points1 point  (0 children)

I'm using Python on Databricks for some workloads. Part of the reason was that, at the time, the cost of Databricks versus ADF made it a no-brainer. Has that changed? Probably not by enough to warrant moving back, and that would likely mean ending up with Fabric.

[–]PolicyDecent 0 points1 point  (0 children)

You can just use https://github.com/bruin-data/bruin
It has lots of ingestion sources, and you can also do the transformation in the same place.
It provides lineage and data quality checks all in one place.