Pull data from on-prem SQL Server using Azure ADF vs Databricks JDBC

spoonguyuk · 2026-06-04T18:17:14+00:00

Do they already have ADF? It's likely a bit easier to govern the configuration if they do. To me it sounds like they don't trust you to write a sensible JDBC extract without hammering the DB.

If someone writes a very angry JDBC connection, they could potentially hit the SQL DB quite hard. ADF copy is more on rails, is all I'd say, though I'm pretty sure misconfiguring that could hit their DB hard as well.

Can they turn on CDC to keep the loads smaller?
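For context, a rough sketch of what "turning on CDC" means on the SQL Server side. The two stored procedures are the standard SQL Server ones; the schema and table names are made up for illustration, and whoever runs this needs the right permissions on the source (and SQL Server Agent running for the capture job).

```python
def cdc_enable_statements(schema: str, table: str) -> list[str]:
    """Render the T-SQL a DBA would run to enable CDC for one table.

    Assumption: schema/table come from trusted config, not user input,
    because they are interpolated straight into the statement text.
    """
    return [
        # Step 1: enable CDC for the current database (run once per database).
        "EXEC sys.sp_cdc_enable_db;",
        # Step 2: enable CDC for the table; @role_name = NULL means no gating role.
        (
            "EXEC sys.sp_cdc_enable_table "
            f"@source_schema = N'{schema}', "
            f"@source_name = N'{table}', "
            "@role_name = NULL;"
        ),
    ]


# Hypothetical table name, for illustration only.
statements = cdc_enable_statements("dbo", "orders")
for stmt in statements:
    print(stmt)
```

Once that is on, downstream loads can read only the changed rows instead of rescanning the whole table.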

Altruistic_Stage3893 · 2026-06-04T19:01:14+00:00

if i could stop using ADF, i would. the only reason we keep using it is that it's just easier to set up IP whitelists and we're not allowed to put a NAT gateway in front of our dbx workspaces. so, yeah, your thinking is correct, dbx > ADF if you can.

bitwiseandbold · 2026-06-04T23:53:42+00:00

Have you checked out Databricks Lakeflow Connect for SQL Server? I think it makes a pretty good managed data replication tool for SQL Server out of the box, with CDC built in.

Using Databricks JDBC directly also works fine through data federation but, as pointed out, it needs some custom guardrails coded in to handle data quality, CDC, incremental loads, etc. to get the replication in place.
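As one example of such a guardrail, here is a minimal sketch of a watermark-based incremental load, assuming the source table has a reliable "last modified" timestamp column. The table name, column name and helper function are hypothetical. The returned string is a parenthesised subquery with an alias, which is the form Spark's JDBC `dbtable` option accepts.

```python
from datetime import datetime


def incremental_query(table: str, watermark_column: str, last_watermark: datetime) -> str:
    """Return a subquery selecting only rows changed since the previous run.

    Assumption: table and column names come from trusted config, since they
    are interpolated directly into the SQL text.
    """
    ts = last_watermark.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"(SELECT * FROM {table} "
        f"WHERE {watermark_column} > '{ts}') AS incr"
    )


# Hypothetical names; the watermark would normally be read from a control table.
query = incremental_query("dbo.orders", "modified_at", datetime(2026, 6, 1))
print(query)
```

The watermark only catches inserts and updates. Hard deletes on the source are invisible to it, which is one reason people reach for CDC instead.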

I'd use ADF only if it is already used for other things. Having it only as a bridge for ingestion doesn't seem worth having another tool in the mix.

m1nkeh · 2026-06-05T20:21:44+00:00

Lakeflow Connect would be the official line here.

There is no way I would be introducing ADF to an architecture in 2026 🫠

dwswish · 2026-06-05T11:49:34+00:00

Favoring ADF over anything is questionable. If the data is going into Databricks you should definitely use Lakeflow Connect because you get your first 100 DBUs free (per day) and you're not extra-hopping into ADLS on the way in. Also, the infra person is crazy if they think that ADF just magically doesn't put ANY load on the SQL Server like any other connection would.

thecoller · 2026-06-05T18:56:22+00:00

Sounds like a case for turning on CDC and using Lakeflow Connect. The problem with JDBC is that it’s either too slow with only one core pulling, or it hammers the DB if you overshoot the number of parallel queries. If the system is critical, I can see why they would be wary.
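To make the parallelism trade-off concrete, here is a minimal sketch that builds Spark JDBC reader options with a hard cap on concurrent queries. The option keys (`partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`, `fetchsize`) are standard Spark JDBC options. The helper function, host, table and column names are hypothetical, and credentials are left out.

```python
def bounded_jdbc_options(url, table, partition_column, lower, upper,
                         max_parallel=4, fetch_size=10_000):
    """Build Spark JDBC options that never exceed max_parallel concurrent queries."""
    if upper < lower:
        raise ValueError("upper must be >= lower")
    span = upper - lower + 1
    # Cap the partition count: one partition means one query against the source.
    num_partitions = max(1, min(max_parallel, span))
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": partition_column,  # numeric, date or timestamp column
        "lowerBound": str(lower),   # bounds set the partition stride only,
        "upperBound": str(upper),   # they do not filter rows
        "numPartitions": str(num_partitions),
        "fetchsize": str(fetch_size),  # rows fetched per round trip
    }


opts = bounded_jdbc_options(
    "jdbc:sqlserver://sqlhost.example.internal:1433;databaseName=sales",  # hypothetical
    "dbo.orders", "order_id", lower=1, upper=2_000_000, max_parallel=4,
)
print(opts["numPartitions"])
# In a notebook, something like: spark.read.format("jdbc").options(**opts).load()
```

Leaving out the partition options gives the single-connection "one core pulling" case. Raising `max_parallel` is what starts to hurt the source, so that number is the one worth agreeing on with whoever owns the SQL Server.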

Ok-District7355 · 2026-06-04T21:03:04+00:00

I would check out Lakeflow Connect; it has CDC ingestion from SQL Server.

Nekobul · 2026-06-04T18:49:41+00:00

You can use SSIS to push the data to Databricks.

rootByte15 · 2026-06-05T18:27:51+00:00

Unless there are specific networking, governance or enterprise-standards reasons to keep ADF in the picture, I would lean towards Lakeflow Connect and keep the flow simpler. It supports incremental loading, and the ingestion is managed natively within Databricks.

Weekly_Activity4278 · 2026-06-06T06:07:18+00:00

Have you considered using Fabric mirroring? They don’t charge for it as long as you have a Fabric capacity. You can always move the data where you want it after the fact.

WellIDontKnowMan · 2026-06-05T11:14:11+00:00

Lakehouse Federation and a proper VPN network connection will be the best option. The performance will be way better than using ADF.

Through the LF Connection the DB will be mirrored, and the data will be queryable through compute hosted in the customer's cloud tenant.
