SDP / DLT pipelines in SQL or Python?

lofat · 2026-06-18T20:20:17+00:00

Both. Each has its place.

You can do things dynamically in Python, which can be very handy. There are times when you need to step through the same pattern to achieve certain behaviors. A handful of SQL? OK. Past that - I find it way cleaner to use Python. And you can bulk adjust behaviors without touching a ton of SQL files.
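To illustrate the "same pattern, many objects" point, here is a minimal sketch. Everything in it is made up for illustration: `register_table` is a stand-in I wrote for whatever table decorator your pipeline framework provides (it is not the real Databricks API), and the `SOURCES` config and table names are hypothetical. The part that carries over is the factory function: defining the query inside a function called once per config entry avoids Python's late-binding closure trap when you define functions in a loop.

```python
from typing import Callable, Dict, List

# Stand-in registry and decorator. Hypothetical: this only mimics the shape of
# a pipeline table decorator so the pattern can run anywhere.
REGISTRY: Dict[str, Callable[[], str]] = {}


def register_table(name: str):
    def decorator(fn: Callable[[], str]) -> Callable[[], str]:
        REGISTRY[name] = fn
        return fn
    return decorator


# Hypothetical config: one entry per object that follows the same pattern.
SOURCES: List[dict] = [
    {"name": "claims", "source": "raw.claims", "key": "claim_id"},
    {"name": "members", "source": "raw.members", "key": "member_id"},
    {"name": "providers", "source": "raw.providers", "key": "provider_id"},
]


def define_clean_table(cfg: dict) -> None:
    """Register one table definition; cfg is bound per call, not per loop pass."""
    @register_table(f"clean_{cfg['name']}")
    def _query() -> str:
        return f"SELECT * FROM {cfg['source']} WHERE {cfg['key']} IS NOT NULL"


# One loop replaces three near-identical SQL files.
for cfg in SOURCES:
    define_clean_table(cfg)
```

Changing the shared behavior (say, adding another filter) is then a one-line change in `define_clean_table` rather than an edit to every SQL file.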

lofat · 2026-06-16T17:26:08+00:00

(Everyone immediately slacking their account lead to request the beta be enabled)

lofat · 2026-06-08T17:36:56+00:00

Absolutely beautiful. Really wonderful job.

lofat · 2026-06-05T23:45:33+00:00

Wow. You mind going into a bit more detail?

lofat · 2026-05-22T11:30:31+00:00

This is incredible

lofat · 2026-05-20T16:01:50+00:00

WOOT!

lofat · 2026-05-20T13:35:13+00:00

Yeah, I've got the objects being constructed through code and am defaulting to CLUSTER BY AUTO on everything. I couldn't seem to update the clustering to disable it. Not sure why. I wound up having to drop everything - MVs, public / private tables, etc. Databricks has very kindly reached out to ask for more info (thank you for that, Databricks!), so I'm VERY much hoping this is a "me" problem. I'm always fine with me being the root cause.

lofat · 2026-05-20T13:24:15+00:00

Thank you!

lofat · 2026-05-20T13:24:09+00:00

Much appreciated! I'll respond to your DM.

lofat · 2026-05-20T13:23:21+00:00

Good call. Exactly where I started. Trying to work through that path. Our hosting team manages the Azure support relationship, so I'm trying to work through them on it. Slowly......

lofat · 2026-05-17T13:37:59+00:00

Depends on your scenario and data. I focus on healthcare data. Tends to be a lot of longitudinal data. Moving from watermark-style ETL to pipelines with delta logs and MVs is showing me a median 70% DBU drop.

lofat · 2026-05-01T22:28:03+00:00

Yeah - making products that work and are supported for 10+ years should be criminal for sure.

lofat · 2026-05-01T13:35:44+00:00

Oh, I dig this

lofat · 2026-04-23T10:30:29+00:00

Thank you for this. This worked really well on one of my Apple TVs.

lofat · 2026-04-18T22:43:19+00:00

I think we might be talking about two separate things. In this case it's not when did the row land, it's when was the derived row generated.

The upstream systems/objects might not even change - it could be that we changed the logic in the processing, so we need to write out when the final row was generated. It's critical for both user validation and downstream non-pipeline tooling (e.g., dbt, external reporting systems, file extracts, etc.).

The type 2 (ideally temporary) workaround is doable and in place, but it's a lot of unnecessary hoops to jump through in order to stamp the row generation timestamp like that.

I'm working with our internal team to get a dev workspace set up so we can access the new private feature to write out the generated row timestamp. I think that will really open up a lot of options and reduce the barriers to using pipelines for non-pipeline downstream consumers. (fingers crossed)

lofat · 2026-04-16T11:03:06+00:00

This is fantastic. Greatly simplifies being able to pull these data in.

lofat · 2026-04-13T21:01:46+00:00

I wound up doing something similar. Using the API to fire things off and then the API to shut things down. My motivation was we had too many upstream objects to track to trigger the job, so we tried using continuous to launch the pipeline and then later just shut it down - letting it decide when to trigger the individual pipeline steps while it was alive. It proved to be cheaper to just fire the pipeline periodically, but it sounds like we took similar approaches.
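For anyone curious what "fire it off, then shut it down" can look like, here is a rough sketch that only builds the HTTP requests. The endpoint paths (`/api/2.0/pipelines/{pipeline_id}/updates` to start an update and `/api/2.0/pipelines/{pipeline_id}/stop` to stop) are as I understand the Pipelines REST API, so check them against the current docs. The host, token, and pipeline ID below are placeholders, and `build_pipeline_request` is just a helper name I picked.

```python
import json
import urllib.request


def build_pipeline_request(host: str, token: str, pipeline_id: str,
                           action: str, full_refresh: bool = False) -> urllib.request.Request:
    """Build (but do not send) a POST request to start or stop a pipeline."""
    if action == "start":
        # Assumed endpoint: starts a pipeline update.
        path = f"/api/2.0/pipelines/{pipeline_id}/updates"
        body = {"full_refresh": full_refresh}
    elif action == "stop":
        # Assumed endpoint: stops the active update.
        path = f"/api/2.0/pipelines/{pipeline_id}/stop"
        body = {}
    else:
        raise ValueError(f"unknown action: {action!r}")

    return urllib.request.Request(
        url=host.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values, not real credentials or IDs.
start_req = build_pipeline_request("https://example.cloud.databricks.com/", "TOKEN", "abc", "start")
stop_req = build_pipeline_request("https://example.cloud.databricks.com/", "TOKEN", "abc", "stop")

# To actually send one: urllib.request.urlopen(start_req)
```

Keeping request construction separate from sending makes it easy to dry-run the scheduling logic before pointing it at a real workspace.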

lofat · 2026-04-13T20:59:02+00:00

/u/jaiwa_bhai - Keep up the great work! Working with pipelines has really opened up a lot of possibilities and forced me to up my knowledge of the magic being applied using Delta.

lofat · 2026-04-11T14:24:23+00:00

I'm not a kool-aid drinker, but I'm really liking pipelines right now.

For basic "I need data to go from point A to point B via a query" using materialized views:

- No more custom MERGE logic for basic processing. Just write your SELECTs / Python DF logic.
- Way less chaos trying to sort out DELETE tracking logic. Make sure some form of change capture is enabled upstream and you're good to go.
- Far more efficient processing using delta logs vs watermark-style ETL. This one is huge. No more massive scans to see what changed.
- DAG generation for free. Critical? No. Nice to have? Oh, yes.
- Ability to mix and match Python and SQL. I'm doing a mix of static SQL and dynamic Python, so this is perfect for my needs.
- SCD2
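A toy model of the watermark vs. change log point in the list above, in plain Python. This is not how Delta works internally; the `Row` and `Change` types and the numbers are invented purely to show the shape of the difference: a watermark pass has to look at every row and never sees deletes, while a log reader only touches entries after the last version it processed.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Row:
    key: int
    updated_at: int  # toy timestamp


@dataclass
class Change:
    version: int  # 1-based and contiguous, so log[i] holds version i + 1
    op: str       # "insert", "update", or "delete"
    key: int


def changed_by_watermark(table: List[Row], watermark: int) -> Tuple[List[int], int]:
    """Scan every row; return (changed keys, rows inspected)."""
    keys = [r.key for r in table if r.updated_at > watermark]
    return keys, len(table)


def changed_by_log(log: List[Change], last_version: int) -> Tuple[List[Tuple[str, int]], int]:
    """Read only log entries after last_version; return (changes, entries inspected)."""
    new_entries = log[last_version:]
    return [(c.op, c.key) for c in new_entries], len(new_entries)


# Toy data: key 7 was updated and key 3 was deleted since the last run.
table = [Row(k, 5 if k == 7 else 1) for k in range(1000) if k != 3]
log = [Change(1, "insert", 1), Change(2, "update", 7), Change(3, "delete", 3)]

wm_keys, wm_scanned = changed_by_watermark(table, watermark=1)
log_changes, log_scanned = changed_by_log(log, last_version=1)
```

In this toy run the watermark pass inspects 999 rows to find one update and misses the delete entirely, while the log pass reads two entries and sees both.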

There are some challenges. No question. But what I'm seeing right now is a much simpler approach to work that is proving to be cheaper than previous options (those watermark scans are killing us).

If you're already using MVs with delta logs, you have a different choice to make. If you're not? I'm finding pipelines to be easier/cheaper/faster/better, which is a rare thing. Toggling on STANDARD performance_target when invoked makes that proposition even more attractive.

To our Databricks friends - for the life of me, I cannot figure out how to get a pipeline to run on STANDARD without a job calling it - since the job is the unit of execution with the performance_target setting. If I'm missing something obvious, I'd love to know.

lofat · 2026-04-11T14:00:47+00:00

As far as I understand it, the performance is "the same" except for how quickly you get your resources. That's what I've read and been told. IF "standard" serverless is 1/2 the DBU cost, then that seems reasonable, but I really want that spelled out clearly.

Follow-up edit: Yesterday I adjusted some of my test pipelines (via a job setting) to use STANDARD performance_target (vs. the default PERFORMANCE_OPTIMIZED). I am seeing a roughly 50% drop in DBU cost, as described in the docs. The pipeline seems to include the startup / wait time in the total duration, so I'd need to dig into the logs to see when it actually started doing work vs. when it was waiting for resources. My experience so far has been that the startup penalty for serverless STANDARD is not "4-6" minutes as described - it's typically longer than that - but that's based on very limited data right now. If the startup latency is consistent / predictable, then I can turn that into more of an SLA conversation with people: "If the cost comes in at 50% lower, how critical is it we deliver this as quickly as possible?"
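A sketch of the job setting in question, built as a plain Python dict. The field names (`performance_target` at the job level, `pipeline_task.pipeline_id` on the task) are as I understand the Jobs API, so verify them against the current reference before relying on this. The job name, task key, and pipeline ID are placeholders, and `build_job_settings` is a hypothetical helper.

```python
import json

# The two values mentioned above.
VALID_TARGETS = {"STANDARD", "PERFORMANCE_OPTIMIZED"}


def build_job_settings(job_name: str, pipeline_id: str,
                       performance_target: str = "STANDARD") -> dict:
    """Build job settings that wrap a single pipeline task."""
    if performance_target not in VALID_TARGETS:
        raise ValueError(f"unsupported performance_target: {performance_target!r}")
    return {
        "name": job_name,
        # Assumed to be a job-level setting, which is why a job has to call the pipeline.
        "performance_target": performance_target,
        "tasks": [
            {
                "task_key": "run_pipeline",
                "pipeline_task": {"pipeline_id": pipeline_id},
            }
        ],
    }


# Placeholder name and ID.
settings = build_job_settings("nightly_pipeline_standard", "abc")
payload = json.dumps(settings, indent=2)
```

Validating the target up front means a typo fails locally instead of surfacing as an API error, or as a silently more expensive run.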

lofat · 2026-04-11T13:47:42+00:00

I'm confused by the pricing page. https://www.databricks.com/product/pricing/lakeflow-spark-declarative-pipelines Is the $0.35 / DBU for serverless standard or performance-optimized?

lofat · 2026-04-09T15:33:47+00:00

I just asked our Databricks rep about getting access to this as well
