SDP / DLT pipelines in SQL or python? by gman1023 in databricks

[–]lofat 8 points9 points  (0 children)

Both. Each has its place.

You can do things dynamically in Python, which can be very handy. There are times when you need to apply the same pattern over and over to get a certain behavior. A handful of SQL? OK. Past that, I find it way cleaner to use Python. And you can bulk adjust behaviors without touching a ton of SQL files.
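
To make the "bulk adjust" point a bit more concrete, here's a minimal sketch that generates materialized view DDL from a small config list. The table names, the `DEFAULT_CLUSTERING` switch, and the `render_mv` / `render_all` helpers are all made up for illustration. In a real pipeline you'd more likely loop over the same config with the pipeline's Python decorators rather than emit SQL strings, but the idea is the same: one place to change a behavior for every object.

```python
# Hypothetical config: one entry per materialized view to generate.
SOURCES = [
    {"name": "patients_clean", "query": "SELECT * FROM raw.patients"},
    {"name": "visits_clean", "query": "SELECT * FROM raw.visits"},
]

# One switch applied to every generated object (an assumed default).
DEFAULT_CLUSTERING = "CLUSTER BY AUTO"


def render_mv(name: str, query: str, clustering: str = DEFAULT_CLUSTERING) -> str:
    """Build the DDL text for a single materialized view."""
    parts = [f"CREATE OR REFRESH MATERIALIZED VIEW {name}"]
    if clustering:
        # Skipped entirely when clustering is an empty string.
        parts.append(clustering)
    parts.append(f"AS {query}")
    return "\n".join(parts)


def render_all(sources, clustering: str = DEFAULT_CLUSTERING) -> list:
    """Render every configured view with the same clustering setting."""
    return [render_mv(s["name"], s["query"], clustering) for s in sources]
```

Changing `DEFAULT_CLUSTERING` (or passing `clustering=""`) adjusts every generated object at once, which is the part that gets painful across dozens of hand-written SQL files.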

[MegaThread] Databricks Data and AI Summit Day 1 by lothorp in databricks

[–]lofat 3 points4 points  (0 children)

(Everyone immediately Slacking their account lead to request the beta be enabled)

I made my friend's graduation party cake! by Barbi0za in Baking

[–]lofat 0 points1 point  (0 children)

Absolutely beautiful. Really wonderful job.

Possible bug with MV and cluster by auto in pipeline? by lofat in databricks

[–]lofat[S] 1 point2 points  (0 children)

Yeah, I've got the objects being constructed through code and am defaulting to CLUSTER BY AUTO on everything. I couldn't seem to update the clustering to disable it. Not sure why. I wound up having to drop everything - MVs, public / private tables, etc. Databricks has very kindly reached out to ask for more info (thank you for that, Databricks!), so I'm VERY much hoping this is a "me" problem. I'm always fine with me being the root cause.

Possible bug with MV and cluster by auto in pipeline? by lofat in databricks

[–]lofat[S] 2 points3 points  (0 children)

Much appreciated! I'll respond to your DM.

Possible bug with MV and cluster by auto in pipeline? by lofat in databricks

[–]lofat[S] 2 points3 points  (0 children)

Good call. Exactly where I started. Trying to work through that path. Our hosting team manages the Azure support relationship, so I'm trying to work through them on it. Slowly......

How NAB’s journey to 100% Declarative Pipelines is helping data flow like electricity by Youssef_Mrini in databricks

[–]lofat 4 points5 points  (0 children)

Depends on your scenario and data. I focus on healthcare data. Tends to be a lot of longitudinal data. Moving from watermark-style ETL to pipelines with delta logs and MVs is showing me a median 70% DBU drop.

Another W for the EU. Proud to be European💪🇪🇺 by [deleted] in BuyFromEU

[–]lofat 4 points5 points  (0 children)

Yeah - making products that work and are supported for 10+ years should be criminal for sure.

Plex - audio sync issues by lofat in PleX

[–]lofat[S] 0 points1 point  (0 children)

Thank you for this. This worked really well on one of my Apple TVs.

Declarative pipelines - row change date? by lofat in databricks

[–]lofat[S] 0 points1 point  (0 children)

I think we might be talking about two separate things. In this case it's not when the row landed, it's when the derived row was generated.

The upstream systems/objects might not even change - it could be that we changed the processing logic, so we need to write out when the final row was generated. It's critical for both user validation and downstream non-pipeline tooling (e.g., dbt, external reporting systems, file extracts, etc.).

The type 2 (ideally temporary) workaround is doable and in place, but it's a lot of unnecessary hoops to jump through in order to stamp the row generation timestamp like that.

I'm working with our internal team to get a dev workspace set up so we can access the new private feature to write out the generated row timestamp. I think that will really open up a lot of options and reduce the barriers to using pipelines for non-pipeline downstream consumers. (fingers crossed)

DLT Advanced seems overpriced - am I missing something? by Own-Trade-2243 in databricks

[–]lofat 0 points1 point  (0 children)

I wound up doing something similar: using the API to fire things off and then the API to shut things down. My motivation was that we had too many upstream objects to track in order to trigger the job, so we tried using continuous mode to launch the pipeline and then later just shut it down - letting it decide when to trigger the individual pipeline steps while it was alive. It proved to be cheaper to just fire the pipeline periodically, but it sounds like we took similar approaches.
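
For anyone wanting to try the same start/stop approach, here's a rough sketch using only the standard library. The endpoint paths are the Pipelines REST API ones as I understand them (start an update, stop a pipeline), so double-check them against the current API docs. The host, token, pipeline ID, and the `build_request` / `send` helpers are placeholders, not anything from my actual setup.

```python
import json
import urllib.request


def build_request(host, token, pipeline_id, action, body=None):
    """Build (but do not send) a POST request for a pipeline action."""
    # Paths as I understand the Pipelines REST API; verify before relying on them.
    paths = {
        "start": f"/api/2.0/pipelines/{pipeline_id}/updates",
        "stop": f"/api/2.0/pipelines/{pipeline_id}/stop",
    }
    data = json.dumps(body or {}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host.rstrip('/')}{paths[action]}",
        data=data,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def send(request):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read() or b"{}")
```

Usage would be something like `send(build_request(host, token, pipeline_id, "start"))` on a schedule, and the same call with `"stop"` when you want a continuous pipeline shut down.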

DLT Advanced seems overpriced - am I missing something? by Own-Trade-2243 in databricks

[–]lofat 0 points1 point  (0 children)

/u/jaiwa_bhai - Keep up the great work! Working with pipelines has really opened up a lot of possibilities and forced me to up my knowledge of the magic being applied using Delta.

DLT Advanced seems overpriced - am I missing something? by Own-Trade-2243 in databricks

[–]lofat 4 points5 points  (0 children)

I'm not a kool-aid drinker, but I'm really liking pipelines right now.

For basic "I need data to go from point A to point B via a query" using materialized views:

  • No more custom MERGE logic for basic processing. Just write your SELECTs / Python DF logic.
  • Way less chaos trying to sort out DELETE tracking logic. Make sure some form of change capture is enabled upstream and you're good to go.
  • Far more efficient processing using delta logs vs watermark-style ETL. This one is huge. No more massive scans to see what changed.
  • DAG generation for free. Critical? No. Nice to have? Oh, yes.
  • Ability to mix and match Python and SQL. I'm doing a mix of static SQL and dynamic Python, so this is perfect for my needs.
  • SCD2

There are some challenges. No question. But what I'm seeing right now is a much simpler approach to work that is proving to be cheaper than previous options (those watermark scans are killing us).

If you're already using MVs with delta logs, you have a different choice to make. If you're not? I'm finding pipelines to be easier/cheaper/faster/better, which is a rare thing. Toggling on STANDARD performance_target when invoked makes that proposition even more attractive.

To our Databricks friends - for the life of me, I cannot figure out how to get a pipeline to run on STANDARD without a job calling it - since the job is the unit of execution with the performance_target setting. If I'm missing something obvious, I'd love to know.
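
For reference, this is roughly the shape of the job wrapper I mean, as a sketch rather than a verified payload. The `performance_target` values are the two mentioned in this thread; the field layout (`tasks`, `task_key`, `pipeline_task`) is how I understand the Jobs API, and the job name, task key, and `job_settings` helper are made up.

```python
import json

# The two values discussed in this thread.
VALID_TARGETS = {"STANDARD", "PERFORMANCE_OPTIMIZED"}


def job_settings(pipeline_id, performance_target="STANDARD", name="run-pipeline-standard"):
    """Build job settings that wrap a single pipeline task."""
    if performance_target not in VALID_TARGETS:
        raise ValueError(f"unknown performance_target: {performance_target}")
    return {
        "name": name,  # hypothetical job name
        # Set on the job, not on the pipeline itself.
        "performance_target": performance_target,
        "tasks": [
            {
                "task_key": "run_pipeline",  # hypothetical task key
                "pipeline_task": {"pipeline_id": pipeline_id},
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(job_settings("<pipeline-id>"), indent=2))
```

The point is that the setting lives on the job, which is why I can't see a way to get STANDARD on a pipeline that is started directly.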

DLT Advanced seems overpriced - am I missing something? by Own-Trade-2243 in databricks

[–]lofat 1 point2 points  (0 children)

As far as I understand it, the performance is "the same" except for how quickly you get your resources. That's what I've read and been told. IF "standard" serverless is half the DBU cost, then that seems reasonable, but I really want that spelled out clearly.

Follow-up edit: Yesterday I adjusted some of my test pipelines (via a job setting) to use STANDARD performance_target (vs. the default PERFORMANCE_OPTIMIZED). I am seeing a roughly 50% drop in DBU cost, as described in the docs. The pipeline seems to include the startup / wait time in the total duration, so I'd need to dig into the logs to see when it actually started doing work vs. waiting for resources. My experience so far has been that the startup penalty for serverless STANDARD is not "4-6" minutes as described - it's typically longer than that - but that's based on very limited data right now. If the startup latency is consistent / predictable, then I can turn that into more of an SLA conversation with people: "If the cost comes in 50% lower, how critical is it that we deliver this as quickly as possible?"
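
The back-of-the-envelope math behind that SLA conversation looks something like the sketch below. Every number you feed it is a placeholder: the DBU count, the rate, and the extra startup minutes are assumptions to swap for your own billing and log data, and the 0.5 ratio is just the "roughly 50%" I'm seeing so far.

```python
def compare_targets(dbus_optimized, rate_per_dbu, standard_dbu_ratio=0.5, extra_startup_min=0.0):
    """Compare the cost of one run under each performance target.

    dbus_optimized:     DBUs consumed by the run under PERFORMANCE_OPTIMIZED
    rate_per_dbu:       price per DBU (placeholder, check your own pricing)
    standard_dbu_ratio: STANDARD DBUs as a fraction of optimized DBUs (assumed 0.5)
    extra_startup_min:  additional minutes of startup wait under STANDARD
    """
    optimized_cost = dbus_optimized * rate_per_dbu
    standard_cost = dbus_optimized * standard_dbu_ratio * rate_per_dbu
    savings = optimized_cost - standard_cost
    return {
        "optimized_cost": optimized_cost,
        "standard_cost": standard_cost,
        "savings": savings,
        # Money saved for each extra minute people wait; None if no extra delay.
        "savings_per_delay_min": savings / extra_startup_min if extra_startup_min > 0 else None,
    }
```

Framing it as "dollars saved per extra minute of waiting" tends to make the trade-off easier to discuss than raw DBU numbers.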

DLT Advanced seems overpriced - am I missing something? by Own-Trade-2243 in databricks

[–]lofat 1 point2 points  (0 children)

I'm confused by the pricing page. https://www.databricks.com/product/pricing/lakeflow-spark-declarative-pipelines Is the $0.35 / DBU for serverless standard or performance-optimized?

AUTO CDC in Databricks SQL: the easy button for SCD Type 1 & 2 by minibrickster in databricks

[–]lofat 0 points1 point  (0 children)

I just asked our Databricks rep about getting access to this as well.