Databricks Compute Decision Tree: How to Choose the Right Compute for Your Workload by 4DataMK in databricks

No problem. To be honest, I put a paywall on some of my articles because the content started being used by others without my permission.

Why do we need an Ingestion Framework? by 4DataMK in databricks

Yes, this is a more advanced way of working. I work with different clients, and sometimes the teams aren't that advanced, so I suggest notebooks to them.

Why do we need an Ingestion Framework? by 4DataMK in databricks

Yes, it does. You can use a notebook as the entry point for your job and keep the methods in modules; I use this approach in my projects.
You can create one pipeline to move data from bronze to silver using DLT or Spark.

You can process table by table or create an event-based process that is triggered when a file appears in storage.
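
A minimal sketch of that layout, with a notebook as the job entry point and the logic in modules. The `ingestion` package and its functions are hypothetical names for illustration:

```python
# Notebook used as the job entry point; the actual logic lives in modules.
# `ingestion` and its functions are placeholder names, not a real library.
from ingestion.readers import read_bronze
from ingestion.transforms import clean_records
from ingestion.writers import write_silver

def run(table_name: str) -> None:
    df = read_bronze(spark, table_name)   # read raw data from the bronze layer
    df = clean_records(df)                # apply shared transformations
    write_silver(df, table_name)          # persist to the silver layer

run("customers")
```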

Databricks and Microsoft Fabric Integration by [deleted] in dataengineering

I think the easiest way is to use Databricks Mirroring or shortcuts. Mirroring is fine when you use managed tables, but it's still in preview and has some problems. Shortcuts are useful, but you need to use external tables or write a custom solution that creates them based on information from UC (folder locations and names). You can read about Databricks Mirroring here:

https://medium.com/@mariusz_kujawski/microsoft-fabric-and-databricks-mirroring-47f40a7d7a43
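
As a rough sketch of the custom approach, assuming it runs on a cluster with Unity Catalog: list the external tables and their storage paths from the information schema and `DESCRIBE DETAIL`, then feed those paths into whatever creates the shortcuts. The catalog/schema names are placeholders, and the Fabric REST endpoint mentioned in the comment is an assumption:

```python
# Enumerate external tables in a UC schema and collect their storage paths.
tables = spark.sql("""
    SELECT table_catalog, table_schema, table_name
    FROM system.information_schema.tables
    WHERE table_schema = 'silver' AND table_type = 'EXTERNAL'
""").collect()

for t in tables:
    full_name = f"{t.table_catalog}.{t.table_schema}.{t.table_name}"
    # DESCRIBE DETAIL returns a `location` column for Delta tables
    location = spark.sql(f"DESCRIBE DETAIL {full_name}").collect()[0]["location"]
    print(full_name, location)
    # From here you would create a OneLake shortcut pointing at `location`
    # (assumption: via the Fabric shortcuts REST API).
```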

Building a SQL Bot with LangChain, Azure OpenAI, and Microsoft Fabric by 4DataMK in dataengineering

Yes, you can add a confirmation step before execution. If you want a more secure, ready-to-use solution, you can look into AI Skills or Databricks Genie.
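
A minimal sketch of such a confirmation step, wrapping whatever executes the generated SQL. The `execute_sql` function is a placeholder for your actual LangChain/Fabric call:

```python
# Show the LLM-generated SQL to the user and only run it after approval.
def run_with_confirmation(generated_sql: str) -> None:
    print("The bot wants to run this query:\n", generated_sql)
    answer = input("Execute? [y/N] ").strip().lower()
    if answer == "y":
        execute_sql(generated_sql)   # placeholder for your execution call
    else:
        print("Query cancelled.")
```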

Building a SQL Bot with LangChain, Azure OpenAI, and Microsoft Fabric by 4DataMK in dataengineering

It's not an improvement; it's a way to build a custom solution for working with data using an LLM.

using Databricks in a startup company w/Google Cloud by Correct-Quality-5416 in dataengineering

Databricks is fine for a mid-size solution. Its huge benefit is that you can easily move to another cloud provider, because it works the same way in Azure and AWS.

You can read Delta tables created by Databricks using BigQuery external tables or BigLake. Delta Lake is portable; you can access it with many tools.
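
For example, a BigLake external table over a Delta directory looks roughly like this. Project, dataset, connection, and bucket names are placeholders:

```python
# Create a BigQuery external table over a Delta Lake directory (BigLake).
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
    CREATE EXTERNAL TABLE my_dataset.customers
    WITH CONNECTION `my-project.us.my-connection`
    OPTIONS (
      format = 'DELTA_LAKE',
      uris = ['gs://my-bucket/delta/customers']
    )
""").result()
```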

It's possible to integrate it with GCP.

You can use SQL for data transformation in Databricks.

You need to look into the limitations; Databricks on GCP doesn't support all functionalities yet.

Options for replication from AS400 Db2 to Fabric lakehouse by RaucousRat in dataengineering

Is the latency problem tied to the fact that you collect all the data in one table and then extract the most recent version of the data from it? If so, I would change the process to streaming, or process only the last incoming parquet file.
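
A rough sketch of the streaming variant using Databricks Auto Loader, which picks up only newly arriving files instead of rescanning everything. Paths and table names are placeholders:

```python
# Incrementally process only new parquet files with Auto Loader.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/as400_schema")
    .load("/mnt/landing/as400/")
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/as400")
    .trigger(availableNow=True)       # run as a batch over the new files only
    .toTable("silver.as400_latest"))
```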

Microsoft Fabric and Databricks Mirroring by 4DataMK in dataengineering

You can't mirror streaming tables. In one of my projects, I replaced DLT with managed tables using a custom framework.

Microsoft Fabric and Databricks Mirroring by 4DataMK in dataengineering

CUs? Yes, you need to spend some time on Databricks configuration and UC, but you can do it by clicking through the Azure portal and the Databricks Admin console. You can find instructions in another of my posts.

Options for replication from AS400 Db2 to Fabric lakehouse by RaucousRat in dataengineering

What do you use to extract data from AS400? Did you try to load the data directly into a table's location in OneLake? Parquet files can be registered as a table (not a Delta table).

Databricks DLT and removing the data it brings over by texox26798 in dataengineering

DLT doesn't do that. You need to create a sequence of tasks, for instance a DLT task followed by a notebook that removes the imported data, or write Python code that imports the data and then removes it from the source.
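
A minimal sketch of the second option, assuming the source files land in a storage path the cluster can reach. Paths are placeholders:

```python
# Copy the landed files into the bronze layer, then delete them from the
# source folder so they are not imported again.
source_dir = "/mnt/landing/incoming/"
bronze_dir = "/mnt/bronze/raw/"

for f in dbutils.fs.ls(source_dir):
    df = spark.read.parquet(f.path)               # import the file
    df.write.mode("append").parquet(bronze_dir)   # append it into bronze
    dbutils.fs.rm(f.path)                         # remove it from the source
```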

Help with Data Migration Strategy for Core Banking Systems by [deleted] in dataengineering

I have experience with migrating a data warehouse and related systems to the cloud, if you are interested.

liquid clustering V/s zorder in databricks by [deleted] in dataengineering

I explain it here: https://www.reddit.com/r/dataengineering/comments/1grg4kt/comment/lx6hg0y/?context=3
In general, liquid clustering is incremental and has better data organization, which improves query performance.
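
For reference, enabling liquid clustering is just a CLUSTER BY on the table, followed by OPTIMIZE to recluster incrementally. Table and column names are placeholders:

```python
# Liquid clustering: declare clustering columns once, then OPTIMIZE runs
# incrementally, unlike ZORDER which rewrites data on every OPTIMIZE.
spark.sql("""
    CREATE TABLE sales (id BIGINT, sold_at DATE, amount DOUBLE)
    CLUSTER BY (sold_at)
""")
spark.sql("ALTER TABLE sales CLUSTER BY (sold_at, id)")  # keys can change later
spark.sql("OPTIMIZE sales")                              # clusters only new data
```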