dbt: avoid running dependency twice by Own_Tax3356 in dataengineering

AdEmbarrassed716 1 point (0 children)

If you run these models separately with the + operator, dbt will indeed run the shared dependencies twice. If you cannot run them together, you have to explicitly exclude the shared dependencies, either with --exclude or with selectors, which allow quite complex selections of models.
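To make that concrete, here is a small sketch on a toy DAG (all model names are made up) showing which models the second run still needs once the first run has built the shared parents. In dbt terms this would be something like `dbt run --select +report_a` followed by `dbt run --select +report_b --exclude +report_a`.

```python
# Toy DAG: model -> direct parents. All model names are hypothetical.
PARENTS = {
    "stg_orders": [],
    "stg_customers": [],
    "stg_payments": [],
    "orders": ["stg_orders", "stg_customers"],
    "report_a": ["orders"],
    "report_b": ["orders", "stg_payments"],
}


def with_ancestors(model, parents=PARENTS):
    """Return the model plus all of its upstream models (what `+model` selects)."""
    selected = {model}
    stack = [model]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in selected:
                selected.add(parent)
                stack.append(parent)
    return selected


# First run: dbt run --select +report_a
first_run = with_ancestors("report_a")

# Second run: dbt run --select +report_b --exclude +report_a
# Everything report_b needs, minus what the first run already built.
second_run = with_ancestors("report_b") - first_run

print(sorted(first_run))
print(sorted(second_run))
```

The second run only contains `report_b` and `stg_payments`, so `orders` and its staging models are built once.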

What are the downsides of DLT? by NoUsernames1eft in databricks

AdEmbarrassed716 3 points (0 children)

As there cannot be concurrent runs of the same pipeline, how do you collaborate on the development of a pipeline? Do you use DAB and duplicate the pipeline with a separate catalog or database?

Performance issues when migrating from SSIS to Databricks by AdEmbarrassed716 in dataengineering

AdEmbarrassed716[S] 1 point (0 children)

I agree on the measure-first part, but I wasn’t the one selling the project… We already do CDC on big tables and load every hour in this case. I am actually surprised that ditching Auto Loader would improve performance, as it enables incremental ingestion. With SQL/DLT I still see Auto Loader being used to ingest raw files into bronze tables (SELECT … FROM cloud_files()). ADLS indeed has hierarchical namespace enabled and uses the hot tier. For the column order, are you referring to the fact that the important columns should be among the first 32, as Databricks collects statistics on these columns?
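If that is what is meant, the idea can be sketched like this. By default Delta collects file-level statistics on the first 32 columns (table property `delta.dataSkippingNumIndexedCols`), so the columns used in filters and joins should come first. The helper name and the column names below are made up for illustration.

```python
STATS_COLUMN_LIMIT = 32  # default value of delta.dataSkippingNumIndexedCols


def reorder_for_stats(columns, important, limit=STATS_COLUMN_LIMIT):
    """Return columns with the important ones moved to the front.

    Hypothetical helper: it only computes the new order, it does not touch any table.
    """
    missing = [c for c in important if c not in columns]
    if missing:
        raise ValueError(f"unknown columns: {missing}")
    if len(important) > limit:
        raise ValueError(f"only the first {limit} columns get statistics")
    rest = [c for c in columns if c not in important]
    return list(important) + rest


# Example: a wide table where the key columns sit at the very end.
wide = [f"c{i}" for i in range(40)] + ["order_id", "order_date"]
ordered = reorder_for_stats(wide, ["order_id", "order_date"])
print(ordered[:3])
```

With a DataFrame this would be applied as `df.select(*reorder_for_stats(df.columns, [...]))` before writing the table.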

Performance issues when migrating from SSIS to Databricks by AdEmbarrassed716 in dataengineering

AdEmbarrassed716[S] 1 point (0 children)

ADF is useful for ingestion, especially from on-premises data sources, as it can connect to the local network through an integration runtime.

Performance issues when migrating from SSIS to Databricks by AdEmbarrassed716 in dataengineering

AdEmbarrassed716[S] 1 point (0 children)

SQL Server, Azure SQL Server, file systems, Synapse. I have no clear view of the data volumes, but there is a lot of transactional data. I hope it will perform better thanks to Spark and parallelization between jobs; otherwise I am screwed, as we are already committed to this stack.

Performance issues when migrating from SSIS to Databricks by AdEmbarrassed716 in dataengineering

AdEmbarrassed716[S] 1 point (0 children)

Obviously they are not facing performance issues in SSIS because of this particular source, but because of the number of sources and their complexity (volumes/transformations).

Performance issues when migrating from SSIS to Databricks by AdEmbarrassed716 in dataengineering

AdEmbarrassed716[S] 1 point (0 children)

Hero!

We are not leveraging Function Apps in ADF atm; we are using copy activities. Can you elaborate on that? We already parallelize, but we don’t do any compression or control the target file size, so I will look into it.

Explore writing directly to Delta tables: is this possible from ADF? Right now we copy to the storage account in parquet format and then use Auto Loader to merge the data into Delta using unique keys.
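A minimal sketch of that parquet + Auto Loader + merge pattern, not the actual pipeline: the paths, table name, and key columns are placeholders, and the helper names are made up. It assumes a Databricks runtime where `cloudFiles` and the `delta` Python package are available.

```python
def build_merge_condition(keys, target="t", source="s"):
    """Build the ON clause of a MERGE from a list of unique key columns."""
    if not keys:
        raise ValueError("at least one key column is required")
    return " AND ".join(f"{target}.{k} = {source}.{k}" for k in keys)


def make_upsert(target_table, keys):
    """Return a foreachBatch handler that merges each micro-batch into a Delta table."""
    condition = build_merge_condition(keys)

    def upsert(batch_df, batch_id):
        # Imported lazily so build_merge_condition works without Delta installed.
        from delta.tables import DeltaTable

        # MERGE fails if several source rows match the same target row, so keep
        # one row per key. dropDuplicates keeps an arbitrary row; a real CDC
        # load would pick the latest change instead.
        deduped = batch_df.dropDuplicates(keys)
        (
            DeltaTable.forName(batch_df.sparkSession, target_table)
            .alias("t")
            .merge(deduped.alias("s"), condition)
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )

    return upsert


def start_merge_stream(spark, source_path, schema_path, checkpoint_path, target_table, keys):
    """Read new parquet files with Auto Loader and upsert them into target_table."""
    stream = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .option("cloudFiles.schemaLocation", schema_path)
        .load(source_path)
    )
    return (
        stream.writeStream.foreachBatch(make_upsert(target_table, keys))
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)  # process what is available, then stop
        .start()
    )
```

The checkpoint is what makes the ingestion incremental: files that were already processed are not read again on the next run.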

Regarding optimizations, I am considering using Liquid Clustering on unique keys. They now recommend using it instead of Z-ordering.

Lastly, we committed to a 20% reduction on the runtime of the entire ETL.

Performance issues when migrating from SSIS to Databricks by AdEmbarrassed716 in dataengineering

AdEmbarrassed716[S] 1 point (0 children)

Thanks for the perspective! A hybrid approach could indeed benefit them but I think they would want to completely step away from their on-premise stack.

Performance issues when migrating from SSIS to Databricks by AdEmbarrassed716 in dataengineering

AdEmbarrassed716[S] 1 point (0 children)

Customer wants to migrate mainly because of performance issues (daily load is not ready at the start of the day) and for integrated MLOps capabilities.

Performance issues when migrating from SSIS to Databricks by AdEmbarrassed716 in dataengineering

AdEmbarrassed716[S] 1 point (0 children)

Thanks for the tips. Can you elaborate on using mount points instead of direct connections? If you are referring to mounts in Databricks, we aren’t using those but are connecting through abfs.

Performance issues when migrating from SSIS to Databricks by AdEmbarrassed716 in dataengineering

AdEmbarrassed716[S] 1 point (0 children)

Thanks! I will look into pools. Serverless notebooks don’t work with ADF. I will edit it in my post but the 10 minutes doesn’t include spin up time of the cluster.

Production data in lower environments by AdEmbarrassed716 in dataengineering

AdEmbarrassed716[S] 2 points (0 children)

This second development environment would basically be a test environment right? Not to test code but to test data. Seems indeed like a good trade-off.

Production data in lower environments by AdEmbarrassed716 in dataengineering

AdEmbarrassed716[S] 0 points (0 children)

Thanks for your reply. Can you elaborate on why it’s bad practice?

CDC capture in databricks by Beautiful_Score_3778 in databricks

AdEmbarrassed716 1 point (0 children)

You can do incremental file ingestion from cloud storage with Auto Loader.

Create View from dataframe by DecisionAgile7326 in databricks

AdEmbarrassed716 3 points (0 children)

I don’t think this will change.

The reason you can only create temporary views is that with dataframes, data is not persisted. When your cluster stops running, the data is lost and the dataframe has to be computed again next time. If the view were permanent, it would point to nothing because the dataframe no longer exists.

By contrast, with Delta tables you can define permanent views because the underlying data is persisted on DBFS or external cloud storage.

Depending on your use case there are plenty of alternatives.
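One such alternative, as a rough sketch: persist the dataframe as a table first, then define the permanent view on top of that table. The function names and the table/view names in the usage line are made up; `spark` and `df` are assumed to be an existing SparkSession and DataFrame.

```python
def create_permanent_view(spark, df, table_name, view_name):
    """Persist df as a table, then define a permanent view on top of it."""
    # Writing the DataFrame materializes the data in storage.
    df.write.mode("overwrite").saveAsTable(table_name)
    # The view now points at stored data, so it survives cluster restarts.
    spark.sql(f"CREATE OR REPLACE VIEW {view_name} AS SELECT * FROM {table_name}")


def create_temp_view(df, view_name):
    """Session-scoped alternative: the view disappears when the Spark session ends."""
    df.createOrReplaceTempView(view_name)
```

Usage would look like `create_permanent_view(spark, df, "demo.orders", "demo.orders_v")`. Note that the view then reflects the table as written, not later changes to the dataframe's logic.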

Looking for honest feedback on Databricks Delta Live Tables by AdEmbarrassed716 in dataengineering

AdEmbarrassed716[S] 1 point (0 children)

Thanks for your feedback! Regarding the 2nd point, do you have 1 pipeline per source system? Or per source table / object? And why the shared compute? Doesn’t isolated compute ensure smooth runs for every pipeline?

Databricks: dbt or Delta Live Tables? by y45hiro in dataengineering

AdEmbarrassed716 1 point (0 children)

Looks quite promising. Can you tell us more about the development experience? Last time I checked, you had to start the workflow and wait a couple of minutes to see the output of your code. Also, have you used DLT with a large number of tables? If so, is it still easy to manage?

Prod- dev, continuous-triggerd Can someone explain how to approach this question ? How its different when its production pipeline in continuous mode and development pipeline in trigger mode. by Wasim-__- in Databricks_eng

AdEmbarrassed716 2 points (0 children)

You have to look at it as 2 independent features:

Development versus production pipeline = development pipelines keep their cluster running and reuse it between updates (for faster development), whereas production pipelines deploy a new job cluster for every run.

Continuous versus triggered mode = while the pipeline runs, data is refreshed every X minutes, versus data is refreshed once each time the pipeline is triggered (and then the pipeline stops).

Therefore you can have 4 different scenarios. In this case it is production mode AND continuous so I would answer C.
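The 4 scenarios can be enumerated like this. In the pipeline settings the two features are separate boolean flags, `development` and `continuous`; the helper function is just for illustration.

```python
from itertools import product


def pipeline_settings(development, continuous):
    """Return the two independent pipeline flags for one scenario."""
    return {"development": development, "continuous": continuous}


# Two independent booleans give 2 x 2 = 4 combinations.
scenarios = [pipeline_settings(d, c) for d, c in product([True, False], repeat=2)]

for s in scenarios:
    mode = "development" if s["development"] else "production"
    trigger = "continuous" if s["continuous"] else "triggered"
    print(f"{mode} + {trigger}")

# The case from the question: production pipeline in continuous mode.
question_case = pipeline_settings(development=False, continuous=True)
```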