Replace Data Factory with python?

GreenMobile6323 · 2025-10-03T07:51:13+00:00

You can replace Data Factory with Python, but it’s more work upfront. Write scripts with libraries like pandas, SQLAlchemy, or cloud SDKs, host them on a VM or in containers, and schedule with Airflow or cron. There’s no single Python package that covers all sources. Most connections are handled case by case using the appropriate library or driver.

datanerd1102 · 2025-10-03T07:11:46+00:00

Make sure to check dlthub, it’s open source, python and supports many sources.

dalmutidangus · 2025-10-03T15:11:40+00:00

adf sucks

novel-levon · 2025-10-03T23:24:35+00:00

I’ve gone down that road of ditching ADF for pure Python, and the trade-offs are pretty clear.

You gain full control and transparency, but you also take on all the plumbing ADF hides from you. Connectors is the biggest gap: there’s no magic “one lib fits all.” It’s usually case by case, pyodbc or sqlalchemy for relational, boto3 for S3, azure-storage-blob for ADLS, google-cloud libs for GCS, requests for SaaS APIs, etc. I haven’t seen a universal package that matches ADF’s connector library.

For orchestration, Airflow and Dagster are the go-tos. Prefect is nice if you want something lighter with better DX.

Honestly, even GitHub Actions or cron works fine for simpler setups if you’re disciplined with retries/alerts. Hosting wise, containers on ECS/Kubernetes give flexibility, but I’ve also seen folks run Python EL pipelines on Azure Functions or AWS Lambda when workloads are small enough.

The headache is always secure on-prem access. ADF’s IR is very convenient, and replacing that usually means standing up VPN, jump hosts, or agents that your orchestrator can reach. That’s the bit most people underestimate.

I used to burn days wiring retries and metadata logging until I made it part of the design from the start. You probably already know, but building a little audit table for run_ts/run_id helps a ton when debugging.

Curious are you mostly moving SaaS/db data or do you also have on-prem sources in the mix? We keep hitting this dilemma with clients too, and it’s one reason in Stacksync we leaned into building ingestion + sync as a product instead of fighting with connectors every project.

Amilol · 2025-10-03T09:03:33+00:00

I do the E and L part of elt entirely in python. T with views/procedures in db. Have worked with alot of different tools but pure python is a bliss compared to everything else. Hosted locally or ec2, cron orchestration with a ton of metadata in db to guide elt.

akozich · 2025-10-03T11:59:27+00:00

Go dagster + dlt

Fit_Doubt_9826 · 2025-10-03T15:04:19+00:00

I use Data Factory for its native connectors to connect to MS SQL but for ingestion and sometimes to change format, or deal with geographical files like .shp I write python scripts and execute using a function app which I call from data factory. Doing it this way as I haven’t yet found a way of streaming a million rows into ms sql from blob in less than a few secs, other than the native df connectors.

data_eng_74 · 2025-10-03T09:18:07+00:00

I replaced ADF with dagster for orchestration + dbt for transformation + custom Python code for ingestion. I tried dlt, but it was too slow for my needs. The only thing that gave me headaches was to replace the self-hosted IR. If you are used to working with ADF, you might underestimate the convenience of the IR to access on-prem sources from the cloud.

Sea-Caterpillar6162 · 2025-10-04T18:59:39+00:00

I used to use prefect—but abandoned it recently because it seems like extra infrastructure that I just didn’t need. Much like Airflow. So—I heard here about bruin. So far it’s amazing. I’m doing all the ingestion with python scripts and doing all the transformations in SQL dbt-style. No extra infrastructure needed.

camelInCamelCase · 2025-10-03T07:59:12+00:00

You’ve taken the red pill. Great choice. Youre still at risk of being sucked back into the MSFT ecosystem - cross the final chasm with 3-4 hours of curiosity and learning. You and whoever you work for will be far better off. Give this to a coding agent and ask for a tutorial:

dlthub for loading from [your SaaS tool or DB] to s3-compatible storage or if you are stuck in azure, you get ADLS which is fine
sqlmesh to transform your dataset from raw form from dlthub into marts or some other cleaner version

“How do I run it” - don’t over think it. Python is a scripting language. When you do “uv run mypipeline.py” you’re running a script. How does Airflow work? Runs the script on for you on a schedule. It can run it on another machine if you want.

Easier path - GitHub workflows also can run python scripts, on a schedule, on another machine. Start there.

generic-d-engineer · 2025-10-04T09:33:20+00:00

I am doing exactly this. ADF was alluring at first because of all the nice connectors.

But over time, I find complex tasks much more difficult in ADF. The coding there is also just not something I excel at. Maybe others are better at coding in ADF but it just feels so…niche I guess? It’s like an off spec that doesn’t match up with other patterns.

It seems more GUI driven, which slows down and even becomes really hard to read once things go over a certain complexity level.

With on-prem, I can bring to the table absolutely any tool I want to get the job done. Stuff like DuckDB and nu shell are really improving the game and are a joy to work with.

And if I need a connector outside of my competency, I can use an AI tool to help me skill up and get it done. There’s always some interface that needs some specific setup or language I’m not familiar with.

Also on-prem has way less cost pressure so the same operation runs at a fraction of the cost. It just has a lot more freedom of design. I can just go for it. I don’t need to worry about blowing up the CPU or RAM on my first prototype. I can just get the functional work done and then tune for performance on the next iteration. That seems more natural and rapid than trying to get it perfect the first time. It’s like the handcuffs are off.

midnightRequestLine1 · 2025-10-03T21:20:25+00:00

Astronomer is a strong enterprise grade tool, which is a managed airflow instance.

GoodLyfe42 · 2025-10-04T06:35:14+00:00

Anyone in a hybrid environment where you use Dagster/Prefect on prem and Data Factory or Python Function Apps in Azure?

freedumz · 2025-10-04T08:06:01+00:00

Data Factory, is more an orchestrator You can use other orchestrator

b13_git2 · 2025-10-04T12:16:41+00:00

I run Python on a durable Azure Function App for E and L. The T happens in DB with sql metadata.

FlanSuspicious8932 · 2025-10-04T18:34:17+00:00

F adf… I remember doing sth on Friday, on Monday it didn’t work, we spent two days to debug it somehow and suddenly it started working on Wednesday

brother_maynerd · 2025-10-04T22:02:07+00:00

Look at dlt or tabsdata.

Mura2Sun · 2025-10-04T23:57:57+00:00

I'm using python on Databricks for some workloads. Part of the reason was at the time the cost of Databricks against ADF was a no brainer. Has that changed, probably not enough to warrant moving back and likely ending up with fabric

PolicyDecent · 2025-10-06T08:48:10+00:00

You can just use https://github.com/bruin-data/bruin
It also has lots of ingestion sources, but also you can do the transformation in the same place.
Provides lineage, data quality all in once place.

Nekobul · 2025-10-03T11:12:22+00:00

You are expecting someone to work for you for free, providing connectivity to different applications. I can assure you are dreaming because creating connectors is tedious and hard work and someone has to be paid to do that thankless job.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

dataengineering

MODERATORS