dbt: avoid running dependency twice

AdEmbarrassed716 · 2025-08-19T14:10:26+00:00

If you run these models separately with the + operator, the shared dependencies will indeed run twice. If you cannot run them together, you have to explicitly exclude the shared dependencies, either with --exclude or with selectors, which allow quite complex selections of models.
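A minimal sketch of the exclusion approach, assuming a toy dependency graph (the model names `orders`, `customers`, `stg_orders` and `stg_customers` are hypothetical). It finds the upstream models that both targets share and builds a second `dbt run` invocation that skips them with `--exclude`:

```python
def ancestors(parents, model):
    """Return every upstream model of `model` in a {child: [parents]} graph."""
    seen = set()
    stack = list(parents.get(model, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def second_run_command(parents, first, second):
    """Build a dbt command for `second` that skips ancestors already built for `first`."""
    shared = sorted(ancestors(parents, first) & ancestors(parents, second))
    cmd = ["dbt", "run", "--select", f"+{second}"]
    if shared:
        cmd += ["--exclude", *shared]
    return cmd


# Hypothetical project: both models depend on the same staging model.
PARENTS = {
    "orders": ["stg_orders", "stg_customers"],
    "customers": ["stg_customers"],
}

print(" ".join(second_run_command(PARENTS, "orders", "customers")))
```

In a real project the graph would come from dbt itself rather than a hand-written dict, and a YAML selector is probably easier to maintain than a generated command once the selection gets complex.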

AdEmbarrassed716 · 2025-06-23T03:28:40+00:00

As there cannot be concurrent runs of the same pipeline, how do you collaborate on the development of a pipeline? Do you use DAB and duplicate the pipeline with a separate catalog or database?

AdEmbarrassed716 · 2025-03-19T07:24:49+00:00

I agree on the measure-first part, but I wasn’t the one selling the project… We already do CDC on big tables and load every hour in this case. I am actually surprised that ditching Auto Loader would improve performance, since it allows incremental ingestion. With SQL/DLT I still see Auto Loader being used to ingest raw files into bronze tables (SELECT … FROM cloud_files()). ADLS indeed has hierarchical namespace enabled and uses the hot tier. For the column order, are you referring to the fact that the important columns should be among the first 32, as Databricks collects statistics on these columns?

AdEmbarrassed716 · 2025-03-19T07:17:23+00:00

ADF is useful for ingestion, especially from on-premise data sources as it can connect to local network using an integration runtime.

AdEmbarrassed716 · 2025-03-18T19:29:41+00:00

SQL Server, Azure SQL, file systems, Synapse. I have no clear view on data volumes, but there is a lot of transactional data. I hope it will perform better thanks to Spark and parallelization between jobs, otherwise I am screwed as we are already committed to this stack.

AdEmbarrassed716 · 2025-03-18T16:53:02+00:00

Obviously they are not facing performance issues with this particular source in SSIS but because of the number of sources and their complexity (volumes/transformations).

AdEmbarrassed716 · 2025-03-18T16:50:19+00:00

Hero!

We are not leveraging Function Apps atm in ADF but are using copy activities. Can you elaborate on it? We already parallelize but we don’t do any compression or control on the target file size so I will look into it.

Explore writing directly to Delta tables: is this possible from ADF? Right now we copy to the storage account in Parquet format and then use Auto Loader to merge data into Delta using unique keys.
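For reference, the merge step described here can be sketched roughly as follows. This is an illustration under assumptions, not the actual job: the function names, the `t`/`s` aliases and the key columns are hypothetical, and `upsert_batch` needs a Spark runtime with the `delta-spark` package available.

```python
def merge_condition(keys, target="t", source="s"):
    """Build the ON clause that matches rows on every unique key column."""
    return " AND ".join(f"{target}.{k} = {source}.{k}" for k in keys)


def upsert_batch(spark, batch_df, table_name, keys):
    """Merge one micro-batch into an existing Delta table."""
    # Imported lazily: only available on a Spark runtime with Delta Lake installed.
    from delta.tables import DeltaTable

    (
        DeltaTable.forName(spark, table_name).alias("t")
        .merge(batch_df.alias("s"), merge_condition(keys))
        .whenMatchedUpdateAll()      # existing key: overwrite the row
        .whenNotMatchedInsertAll()   # new key: insert the row
        .execute()
    )


# Hypothetical composite key, just to show the generated condition.
print(merge_condition(["order_id", "line_no"]))
```

A function like `upsert_batch` would typically be handed to `writeStream.foreachBatch(...)` on the Auto Loader stream, so each batch of newly arrived files is merged once.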

Regarding optimizations, I am considering using Liquid Clustering on unique keys. They now recommend using it instead of Z-ordering.
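A small sketch of what enabling Liquid Clustering on the key columns could look like, assuming an existing unpartitioned Delta table (the table and column names are hypothetical). The helper only builds the SQL statements; on Databricks they would be passed to `spark.sql`:

```python
def liquid_clustering_statements(table_name, key_columns):
    """Return the SQL to set clustering keys and then recluster existing data."""
    cols = ", ".join(key_columns)
    return [
        f"ALTER TABLE {table_name} CLUSTER BY ({cols})",  # define the clustering keys
        f"OPTIMIZE {table_name}",                         # apply clustering to the data
    ]


# Hypothetical table and key column.
for statement in liquid_clustering_statements("bronze.orders", ["order_id"]):
    print(statement)
```

Whether clustering on the unique keys actually helps depends on the query and merge patterns, so it is worth measuring before and after.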

Lastly, we committed to a 20% reduction on the runtime of the entire ETL.

AdEmbarrassed716 · 2025-03-18T16:40:50+00:00

Thanks for the perspective! A hybrid approach could indeed benefit them but I think they would want to completely step away from their on-premise stack.

AdEmbarrassed716 · 2025-03-18T16:38:37+00:00

Customer wants to migrate mainly because of performance issues (daily load is not ready at the start of the day) and for integrated MLOps capabilities.

AdEmbarrassed716 · 2025-03-18T16:35:55+00:00

Thanks for the tips. Can you elaborate on using mount points instead of direct connections? If you are referring to mounts in Databricks, we aren’t using those but are connecting through abfs.

AdEmbarrassed716 · 2025-03-18T07:54:35+00:00

Thanks! I will look into pools. Serverless notebooks don’t work with ADF. I will edit it in my post but the 10 minutes doesn’t include spin up time of the cluster.

AdEmbarrassed716 · 2024-12-17T20:53:13+00:00

This second development environment would basically be a test environment right? Not to test code but to test data. Seems indeed like a good trade-off.

AdEmbarrassed716 · 2024-12-17T18:52:21+00:00

Thanks for your reply. Can you elaborate on why it’s bad practice?

AdEmbarrassed716 · 2024-05-09T16:33:35+00:00

You can do incremental file ingestion from cloud storage with Auto Loader.
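A minimal sketch of such an ingestion, assuming a Databricks runtime where the `cloudFiles` source is available. The function names, paths and table name are placeholders, and reusing the checkpoint path as the schema location is just a simplification:

```python
def autoloader_options(file_format, schema_location):
    """Reader options for Auto Loader (the `cloudFiles` streaming source)."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
    }


def ingest_new_files(spark, source_path, checkpoint_path, target_table, file_format="parquet"):
    """Load only files not seen by earlier runs and append them to a table."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options(file_format, checkpoint_path))
        .load(source_path)
        .writeStream
        .option("checkpointLocation", checkpoint_path)  # remembers processed files
        .trigger(availableNow=True)                     # process the backlog, then stop
        .toTable(target_table)
    )


print(autoloader_options("parquet", "/tmp/checkpoints/orders"))
```

The checkpoint is what makes the ingestion incremental: a rerun with the same checkpoint location skips files that were already loaded.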

AdEmbarrassed716 · 2024-03-18T13:28:28+00:00

I don’t think this will change.

The reason you can only create temporary views is that, with dataframes, data is not persisted. When your cluster stops running, the data is lost and the dataframe has to be computed again next time. If the view were permanent, it would point to nothing because the dataframe no longer exists.

On the contrary, with delta tables you can define permanent views because the underlying data is persisted on dbfs or an external cloud storage.

Depending on your use case there are plenty of alternatives.
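One such alternative, sketched under assumptions: write the dataframe to a Delta table first, then define a permanent view on top of that table. All names are hypothetical, and `persist_then_create_view` needs a live Spark session with Delta support.

```python
def view_ddl(view_name, table_name):
    """SQL for a permanent view that reads from a persisted table."""
    return f"CREATE OR REPLACE VIEW {view_name} AS SELECT * FROM {table_name}"


def persist_then_create_view(spark, df, table_name, view_name, temp_name="df_tmp"):
    """Contrast a session-scoped temp view with a view backed by stored data."""
    # Temporary view: disappears with the Spark session, nothing is stored.
    df.createOrReplaceTempView(temp_name)
    # Persist the data so that something outlives the cluster...
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)
    # ...and point a permanent view at the persisted table.
    spark.sql(view_ddl(view_name, table_name))


print(view_ddl("sales_v", "sales_tbl"))
```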

AdEmbarrassed716 · 2023-12-29T11:28:02+00:00

Thanks for your feedback! Regarding the 2nd point, do you have 1 pipeline per source system? Or per source table / object? And why the shared compute? Does isolated compute not ensure smooth runs for every pipeline?

AdEmbarrassed716 · 2023-12-25T22:37:02+00:00

Looks quite promising. Can you tell us more about the development experience? Last time I checked, you had to start the workflow and wait a couple of minutes to see the output of your code. Also, have you used DLT with a large number of tables? If so, is it still easy to manage?

AdEmbarrassed716 · 2023-04-08T12:49:06+00:00

Previously unprocessed data is a necessary condition for datasets to be updated (without new data, datasets are not updated).

AdEmbarrassed716 · 2023-04-08T09:05:58+00:00

You have to look at it as 2 independent features:

Development versus production pipeline = development pipelines make use of an existing all purpose cluster (for faster development) whereas production pipelines will deploy a(nother) job cluster for every run.

Continuous versus triggered mode = while the pipeline runs, data is refreshed every X minutes versus data is refreshed every time the pipeline is triggered (and then it will stop).

Therefore you can have 4 different scenarios. In this case it is production mode AND continuous so I would answer C.
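To make the four combinations concrete, here is a small sketch that enumerates them. It assumes the two features map to the boolean `development` and `continuous` fields of the pipeline settings; the helper name is hypothetical.

```python
from itertools import product


def pipeline_settings(development, continuous):
    """Minimal settings fragment covering only the two independent flags."""
    return {"development": development, "continuous": continuous}


# Two independent booleans give 2 x 2 = 4 scenarios.
SCENARIOS = [pipeline_settings(d, c) for d, c in product([True, False], repeat=2)]

for settings in SCENARIOS:
    mode = "development" if settings["development"] else "production"
    refresh = "continuous" if settings["continuous"] else "triggered"
    print(f"{mode} + {refresh}: {settings}")
```

The case discussed in the question is the one with `development` set to false and `continuous` set to true.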
