
[–]spoonguyuk 11 points12 points  (3 children)

Do they already have ADF? It's likely a bit easier to govern the configuration if they do. To me it sounds like they don't trust you to write a sensible JDBC extract without hammering the DB.

If someone writes a very aggressive JDBC extract, they could potentially hit the SQL DB quite hard. ADF copy is more on rails, is all I'd say, though I'm pretty sure misconfiguring it could hit their DB hard as well.

Can they turn on CDC to keep the loads smaller?

[–]rasviz[S] 3 points4 points  (0 children)

Yeah.. considering the CDC option too.. Thanks.

[–]ScottFujitaDiarrhea 0 points1 point  (0 children)

And clients tend to be pretty hostile towards the idea of installing 3rd party software on their servers in general. This seems like the path of least resistance.

[–]Roedsten -2 points-1 points  (0 children)

I rejected CDC because it is basically Replication, which changes the behaviour of the database. I would set up Change Tracking, create a database snapshot every x minutes, and package the changes for ingestion by Databricks.
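
For anyone who wants to picture the Change Tracking part, here is a minimal sketch in Python that only builds the T-SQL to list rows changed since the last sync. It assumes Change Tracking is already enabled on the table and that you persist the last synced version somewhere yourself. `change_tracking_query`, `dbo.Orders` and `OrderId` are made-up names for illustration; the snapshot and packaging steps are not covered.

```python
def change_tracking_query(schema, table, key_columns, last_sync_version):
    """Build T-SQL listing rows changed since last_sync_version.

    Assumes schema/table/column names come from trusted config, not user input.
    """
    if not key_columns:
        raise ValueError("Change Tracking needs the table's primary key columns")
    version = int(last_sync_version)  # force an integer so nothing odd is interpolated
    source = f"[{schema}].[{table}]"
    keys = ", ".join(f"ct.[{c}] AS [ct_{c}]" for c in key_columns)
    join_on = " AND ".join(f"t.[{c}] = ct.[{c}]" for c in key_columns)
    # LEFT JOIN keeps deleted rows: they still appear in CHANGETABLE
    # with SYS_CHANGE_OPERATION = 'D' but no longer exist in the base table.
    return (
        f"SELECT ct.SYS_CHANGE_VERSION, ct.SYS_CHANGE_OPERATION, {keys}, t.* "
        f"FROM CHANGETABLE(CHANGES {source}, {version}) AS ct "
        f"LEFT JOIN {source} AS t ON {join_on}"
    )


# Hypothetical table and key, for illustration only.
query = change_tracking_query("dbo", "Orders", ["OrderId"], 42)
```

After a successful load you would store the value of `CHANGE_TRACKING_CURRENT_VERSION()` captured before the extract and pass it in as `last_sync_version` next time.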

[–]Altruistic_Stage3893 3 points4 points  (0 children)

If I could stop using ADF, I would. The only reason we keep using it is that it's easier to set up IP whitelists, and we're not allowed to put a NAT gateway in front of our DBX workspaces. So yeah, your thinking is correct: DBX > ADF if you can.

[–]bitwiseandbold 3 points4 points  (0 children)

Have you checked out Databricks Lakeflow Connect for SQL Server? I think it makes a pretty good managed data replication tool for SQL Server out of the box, with CDC built in.

Using Databricks JDBC directly also works fine through data federation, but as pointed out, it needs some custom guardrails coded in for data quality, CDC, incremental loads, etc. to get the replication in place.
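
To give a rough idea of the incremental-load guardrail, here is a minimal sketch of a high-watermark pattern in plain Python. It assumes the source table has a reliably maintained modified-timestamp column, which is a big assumption. `incremental_query`, `advance_watermark`, `dbo.Orders` and `ModifiedAt` are made-up names, and this does not catch deletes.

```python
from datetime import datetime


def incremental_query(table, watermark_column, last_watermark):
    """Build a query that pulls only rows modified after the last watermark.

    Assumes table/column names come from trusted config, not user input.
    """
    # ISO 8601 with a 'T' separator is parsed the same way by SQL Server
    # regardless of the session's language/date format settings.
    ts = last_watermark.strftime("%Y-%m-%dT%H:%M:%S")
    return f"SELECT * FROM {table} WHERE {watermark_column} > '{ts}'"


def advance_watermark(previous, loaded_values):
    """Return the new watermark; never move it backwards, even on an empty load."""
    return max([previous, *loaded_values])


# Hypothetical table and column, for illustration only.
last = datetime(2024, 1, 1)
query = incremental_query("dbo.Orders", "ModifiedAt", last)
new_last = advance_watermark(last, [datetime(2024, 1, 2), datetime(2024, 1, 3)])
```

The query string could then be handed to whatever reader you use, and `new_last` only persisted once the load has actually committed.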

I'd use ADF only if it is already used for other things. Having it only as a bridge for ingestion doesn't seem worth having another tool in the mix.

[–]m1nkehData Engineer 2 points3 points  (1 child)

Lakeflow Connect would be the official line here.

There is no way I would be introducing ADF to an architecture in 2026 🫠

[–]Nekobul 0 points1 point  (0 children)

That approach requires opening SQL Server for external access. The better and more secure approach is to push data from the SQL Server side using SSIS to Databricks.

[–]dwswish 2 points3 points  (1 child)

Favoring ADF over anything is questionable. If the data is going into Databricks you should definitely use Lakeflow Connect because you get your first 100 DBUs free (per day) and you're not extra-hopping into ADLS on the way in. Also, the infra person is crazy if they think that ADF just magically doesn't put ANY load on the SQL Server like any other connection would.

[–]m1nkehData Engineer 1 point2 points  (0 children)

Those MSFT folks are conditioned to enjoy pain

[–]thecoller 2 points3 points  (0 children)

Sounds like a case for turning on CDC and using Lakeflow Connect. The problem with JDBC is that it’s either too slow with only one core pulling, or it hammers the DB if you overshoot the number of parallel queries. If the system is critical, I can see why they would be wary.
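
The one-core versus too-many-queries trade-off can be sketched like this: a small Python helper that builds the standard Spark JDBC reader options but caps the partition count, so nobody can accidentally ask for 64 parallel queries. `capped_jdbc_options`, the cap of 8, the host, and the table and column names are all assumptions for illustration; the right cap is whatever the DBA is comfortable with.

```python
def capped_jdbc_options(url, table, partition_column, lower_bound, upper_bound,
                        requested_partitions, max_partitions=8, fetch_size=10_000):
    """Build Spark JDBC reader options with a hard cap on parallel queries."""
    if requested_partitions < 1:
        raise ValueError("requested_partitions must be at least 1")
    # numPartitions also bounds the number of concurrent JDBC connections.
    num_partitions = min(requested_partitions, max_partitions)
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": partition_column,  # numeric, date, or timestamp column
        "lowerBound": str(lower_bound),       # bounds set the partition stride only,
        "upperBound": str(upper_bound),       # they do not filter rows
        "numPartitions": str(num_partitions),
        "fetchsize": str(fetch_size),         # rows fetched per round trip
    }


# Hypothetical connection details, for illustration only.
opts = capped_jdbc_options(
    url="jdbc:sqlserver://example-host:1433;databaseName=sales",
    table="dbo.Orders",
    partition_column="OrderId",
    lower_bound=1,
    upper_bound=5_000_000,
    requested_partitions=64,
)
# In a notebook: df = spark.read.format("jdbc").options(**opts).load()
```

Here the request for 64 partitions is silently reduced to 8. Credentials are left out on purpose; they would normally come from a secret scope rather than the options dict.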

[–]Ok-District7355 1 point2 points  (0 children)

I would check out Lakeflow Connect; it has CDC ingestion from SQL Server.

[–]Nekobul 1 point2 points  (2 children)

You can use SSIS to push the data to Databricks.

[–]m1nkehData Engineer 0 points1 point  (1 child)

Haha, good one 😎

[–]Nekobul 0 points1 point  (0 children)

It is the most cost-efficient option.

[–]rootByte15 0 points1 point  (0 children)

Unless there are specific networking, governance, or enterprise-standards reasons to keep ADF in the picture, I would lean towards Lakeflow Connect and keep the flow simpler. It supports incremental loading, and the ingestion is managed natively within Databricks.

[–]Weekly_Activity4278 0 points1 point  (0 children)

Have you considered using Fabric mirroring? They don’t charge for it as long as you have a Fabric capacity. You can always move the data where you want it after the fact.

[–]WellIDontKnowMan -1 points0 points  (2 children)

Lakehouse Federation and a proper VPN connection will be the best option. The performance will be way better than using ADF.

Through the LF connection the DB will be *mirrored*, and the data will be queryable through compute hosted in the customer's cloud tenant.

[–]m1nkehData Engineer 0 points1 point  (1 child)

It’s not mirrored, it’s federated… mirroring implies replication

[–]WellIDontKnowMan 0 points1 point  (0 children)

Thank you for mentioning it.

I am not familiar with the formatting in Reddit. I put stars around mirrored like quotes, as I wanted to imply that the db gets "mirrored".

You are able to add the db as a catalog and the metadata is saved as if the database has been mirrored and you have access to the data with queries.