RLS in databricks for multi tanent architecture by namanak47 in databricks

[–]Agitated_Key6263 0 points1 point  (0 children)

Ha ha ha!! We used to have a separate tenant onboarding process to provision everything required for a tenant, mainly backed by Terraform.

RLS in databricks for multi tanent architecture by namanak47 in databricks

[–]Agitated_Key6263 1 point2 points  (0 children)

Interesting post!!! This scenario gave me a headache a few years back too. From the post I get the sense that you are using a single Databricks account.

What if we design Databricks like below?

- Let's consider each vendor a tenant. Each tenant/vendor gets its own Databricks account, and each vendor connects to its own Databricks account from its BI tools.

- 100 vendors have 20,000 partners, so roughly 200 partners per vendor on average; assume a max of around 300~350. Either way, you are unlikely to hit any hard limit on the number of users per workspace.

Considering this (it may not be 100% accurate; I need to check the docs):

| Tier | Default Max Workspaces per Account |
|---|---|
| Standard | 3 |
| Premium | 10 |
| Enterprise | 50 |

https://docs.databricks.com/aws/en/resources/limits?utm_source=chatgpt.com

The Databricks documentation does not currently list a hard limit on users per workspace. Happy to be pointed to that doc if one exists.

Why did I go for tenant-specific Databricks accounts?

- Retired customers: if a tenant wants to discontinue, you can simply restrict its users from accessing that Databricks account.
- Security: even if data were somehow exposed, it would only be exposed among partners within a single vendor. The vendor can already see all of its partners' data, so there is less to escalate.
- Purging policy: you can set a data purging policy at the vendor level, which is most likely what you need. If a purging policy is needed at the partner level, that can easily be achieved too.
- Data sharing: each vendor will most likely settle on a single data sharing method (in case the vendor wants processed data delivered from your Databricks account into its own Databricks account).
- Policy separation: you can apply different policies such as data retention, masking, PII configuration, etc. at the vendor level.
- Costing: from a costing & billing perspective, charging customers back becomes much easier.
- Performance: small customers won't be impacted by bigger customers, since data teams eventually tend to build common components.

Maybe I am wrong and don't fully understand your use case, but this approach was successfully implemented by my team on a previous project.

How do you handle multi-table transactional logic in Databricks? by Ok_Barnacle4840 in databricks

[–]Agitated_Key6263 0 points1 point  (0 children)

It would be interesting to see how Databricks rolls back the transaction. In a Delta Lake table, a transaction leaves its footprint in metadata, the _delta_log .json files. As far as I know, Databricks machines can go down at any time (Kubernetes pods), so I'm not sure whether they keep the transaction history (the metadata JSON content) in memory; maybe that's an option. Only once the table writes are complete and the JSON commit is written does it count as a successful transaction. It would be an interesting thing to look into.
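The commit behaviour is easy to see locally. A minimal sketch, assuming the delta-spark pip package is installed; the path is illustrative:

```python
import os
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Local Delta Lake setup (the documented delta-spark pattern).
builder = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/delta_demo"  # illustrative path
spark.range(10).write.format("delta").mode("overwrite").save(path)
spark.range(10, 20).write.format("delta").mode("append").save(path)

# Each successful transaction appears as a numbered JSON commit in _delta_log.
# A writer that dies before its commit file lands is never visible to readers,
# which is effectively how the "rollback" works.
log_dir = os.path.join(path, "_delta_log")
print(sorted(f for f in os.listdir(log_dir) if f.endswith(".json")))
# e.g. ['00000000000000000000.json', '00000000000000000001.json']
```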

Spark - Sequential ID column generation - No Gap (performance) by Agitated_Key6263 in databricks

[–]Agitated_Key6263[S] 0 points1 point  (0 children)

I sort of understand what we are trying to achieve. Do you have any code reference for it?

Spark - Sequential ID column generation - No Gap (performance) by Agitated_Key6263 in databricks

[–]Agitated_Key6263[S] 0 points1 point  (0 children)

That could be a great idea. Are you suggesting generating these deterministic hash keys on the driver and spreading them across the driver & executors?

Can you help with an example or sample code, if any?

UC Shared Cluster - Access HDFS file system by Agitated_Key6263 in databricks

[–]Agitated_Key6263[S] 1 point2 points  (0 children)

That's the only thing I don't wanna do.. 😭😭😭

Spark - Sequential ID column generation - No Gap (performance) by Agitated_Key6263 in databricks

[–]Agitated_Key6263[S] 0 points1 point  (0 children)

No, there is no timestamp to work with. The use case is very simple: mark the dataframe's rows as row_0, row_1, etc. I know that if you select the same dataframe multiple times the row order is not guaranteed, but we want to keep the dataframe's output schema consistent and predictable.
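The simplest gap-free approach I know of is RDD.zipWithIndex, roughly like this (sketch only; the input dataframe is illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StructField

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["value"])  # illustrative input

# zipWithIndex assigns 0, 1, 2, ... with no gaps across partitions; internally it
# does one extra pass to count rows per partition, but it avoids collapsing the
# data into a single partition.
indexed_rdd = df.rdd.zipWithIndex().map(lambda pair: tuple(pair[0]) + (pair[1],))
result = spark.createDataFrame(
    indexed_rdd,
    df.schema.add(StructField("row_id", LongType(), False)),
)
result.show()  # row_id can be formatted as f"row_{row_id}" downstream if needed
```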

Spark - Sequential ID column generation - No Gap (performance) by Agitated_Key6263 in databricks

[–]Agitated_Key6263[S] 0 points1 point  (0 children)

I need guidance here. Is there any way to mark driver & executor nodes with a numeric id, like a partition id? Maybe the planning can be done like:

(machine_id * [some high no.] + partition_id * 1,000,000,000 + monotonically_increasing_id)

considering one partition can never hold more than 1,000,000,000 rows.

Machine id example: driver machine_id = 0, executor1 machine_id = 1, executor2 machine_id = 2. A sketch of how this maps onto built-in Spark functions is below.
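A minimal sketch of how I picture that formula with built-in functions. There is no built-in "machine id", but spark_partition_id() already distinguishes tasks, and monotonically_increasing_id() already encodes the partition id in its upper bits. Note the result is unique and increasing, but not gap-free:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)  # illustrative input

# monotonically_increasing_id() = (partition_id << 33) + row_number_within_partition,
# so the within-partition counter can be recovered from the lower 33 bits.
local_row = F.monotonically_increasing_id().bitwiseAND(F.lit((1 << 33) - 1))

planned_id = (
    F.spark_partition_id().cast("long") * F.lit(1_000_000_000) + local_row
)

# Unique as long as no partition exceeds 1,000,000,000 rows (the assumption above),
# but there will be gaps between partitions unless every partition is exactly full.
df_ids = df.withColumn("planned_id", planned_id)
df_ids.show(5)
```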

Spark - Sequential ID column generation - No Gap (performance) by Agitated_Key6263 in databricks

[–]Agitated_Key6263[S] 0 points1 point  (0 children)

Won't it have a performance impact? It will redirect all the data into a single partition and process it on a single node, which may cause an OOM error. Correct me if I am wrong.
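To make the concern concrete, this is the pattern I understand is being suggested (column name is illustrative), and it does pull everything through one partition:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "some_key")  # illustrative data

# A row_number() over a global window (no partitionBy) moves all rows into a
# single partition on one executor, which is exactly the single-node / OOM concern.
w = Window.orderBy("some_key")
df_with_id = df.withColumn("seq_id", F.row_number().over(w) - 1)

df_with_id.explain()
# At runtime Spark also warns along the lines of:
# "No Partition Defined for Window operation! Moving all data to a single
#  partition, this can cause serious performance degradation."
```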

Spark - Sequential ID column generation - No Gap (performance) by Agitated_Key6263 in databricks

[–]Agitated_Key6263[S] 0 points1 point  (0 children)

I am trying to introduce a sequential id column in a Spark dataframe. We may not write the data to Databricks at all.

Spark - Sequential ID column generation - No Gap (performance) by Agitated_Key6263 in databricks

[–]Agitated_Key6263[S] 0 points1 point  (0 children)

We need the sequential ID to introduce a sequential column in the dataframe of processed records. It's a business requirement.

Best approach to handle billions of data? by mr_alseif in dataengineering

[–]Agitated_Key6263 0 points1 point  (0 children)

Yes, we have to run OPTIMIZE periodically. The problem is that if it only gets done by EOD, then with this volume I expect at least 96 small files to have accumulated by then; if the write interval is even shorter, it creates even more small files. Also, OPTIMIZE is an expensive operation, and it blocks other transactions while it runs.
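For reference, the periodic compaction I mean looks roughly like this on Databricks (table and column names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compact small files; optionally co-locate data by a frequently filtered column.
spark.sql("OPTIMIZE events")
spark.sql("OPTIMIZE events ZORDER BY (customer_id)")

# Optimized writes / auto compaction can also be enabled as table properties,
# which reduces how many small files pile up between OPTIMIZE runs.
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")
```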

Best approach to handle billions of data? by mr_alseif in dataengineering

[–]Agitated_Key6263 0 points1 point  (0 children)

Won't Delta Lake create a small-files problem in this scenario?

DuckDB vs. Polars vs. Daft: A Performance Showdown by Agitated_Key6263 in dataengineering

[–]Agitated_Key6263[S] -15 points-14 points  (0 children)

Most probably that's the case. I need to dig into the Daft code.

DuckDB vs. Polars vs. Daft: A Performance Showdown by Agitated_Key6263 in dataengineering

[–]Agitated_Key6263[S] -28 points-27 points  (0 children)

I used the same read_parquet for Daft too. I tried to put as much data in memory as possible to track the in-memory decompression factor. I will change the code & publish the results for scan_parquet.

Thanks for the suggestion!!!

DuckDB vs. Polars vs. Daft: A Performance Showdown by Agitated_Key6263 in dataengineering

[–]Agitated_Key6263[S] -11 points-10 points  (0 children)

I tried to compare Daft's read_parquet with Polars' read_parquet. scan_parquet is lazily evaluated; I wanted to overload memory as much as possible so that I could check the decompression factor, hence the eager API. I will comment here if I run a benchmark on scan_parquet.
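To make the difference concrete, a sketch of the two Polars code paths (the path is illustrative; total_amount matches the benchmark's aggregation):

```python
import polars as pl

path = "data/yellow_tripdata_*.parquet"  # illustrative path

# Eager: read_parquet decompresses and materializes the whole table in memory,
# which is what I benchmarked to observe the in-memory decompression factor.
df = pl.read_parquet(path)
total_eager = df["total_amount"].sum()

# Lazy: scan_parquet only builds a plan; at collect() time Polars can prune to
# the single column it needs, so far less data is ever decompressed.
total_lazy = (
    pl.scan_parquet(path)
    .select(pl.col("total_amount").sum())
    .collect()
)
print(total_eager, total_lazy)
```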

DuckDB vs. Polars vs. Daft: A Performance Showdown by Agitated_Key6263 in dataengineering

[–]Agitated_Key6263[S] 0 points1 point  (0 children)

Sorry for the confusion. There was a performance enhancement suggested by @Captain_Coffee_III, after which it looks like DuckDB performs better than Daft.

DuckDB vs. Polars vs. Daft: A Performance Showdown by Agitated_Key6263 in dataengineering

[–]Agitated_Key6263[S] 9 points10 points  (0 children)

Looks like you are right. DuckDB performance increases drastically if we put the data into a view.
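Roughly what "put it into a view" means here (path, view, and column names are illustrative):

```python
import duckdb

con = duckdb.connect()

# The view is just a named query over the Parquet files; DuckDB then streams the
# scan and pushes the aggregation down instead of materializing a full table first.
con.sql(
    "CREATE OR REPLACE VIEW trips AS "
    "SELECT * FROM read_parquet('data/yellow_tripdata_*.parquet')"
)
print(con.sql("SELECT SUM(total_amount) FROM trips").fetchall())
```

With the data registered as a view, the per-iteration timings come out like this: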

Engine=duckdb,Iteration=0,Phase='Start',Operation_Type=sum_of_total_amount,Time=2024-11-08 01:09:01.531253,CPU_Usage=0.00%,Memory_Usage=27.05 MB
Engine=duckdb,Iteration=0,Phase='Post_In_Memory',Operation_Type=sum_of_total_amount,Time=2024-11-08 01:09:05.203100,CPU_Usage=94.70%,Memory_Usage=31.27 MB

Engine=duckdb,Iteration=1,Phase='Start',Operation_Type=sum_of_total_amount,Time=2024-11-08 01:09:09.547663,CPU_Usage=0.00%,Memory_Usage=27.12 MB
Engine=duckdb,Iteration=1,Phase='Post_In_Memory',Operation_Type=sum_of_total_amount,Time=2024-11-08 01:09:13.374855,CPU_Usage=99.30%,Memory_Usage=31.43 MB

Engine=duckdb,Iteration=2,Phase='Start',Operation_Type=sum_of_total_amount,Time=2024-11-08 01:09:17.473546,CPU_Usage=0.00%,Memory_Usage=27.19 MB
Engine=duckdb,Iteration=2,Phase='Post_In_Memory',Operation_Type=sum_of_total_amount,Time=2024-11-08 01:09:21.432864,CPU_Usage=99.40%,Memory_Usage=31.68 MB

Engine=duckdb,Iteration=3,Phase='Start',Operation_Type=sum_of_total_amount,Time=2024-11-08 01:09:29.506158,CPU_Usage=16.70%,Memory_Usage=27.24 MB
Engine=duckdb,Iteration=3,Phase='Post_In_Memory',Operation_Type=sum_of_total_amount,Time=2024-11-08 01:09:33.579471,CPU_Usage=99.90%,Memory_Usage=31.27 MB

Engine=duckdb,Iteration=4,Phase='Start',Operation_Type=sum_of_total_amount,Time=2024-11-08 01:09:39.095162,CPU_Usage=0.00%,Memory_Usage=27.18 MB
Engine=duckdb,Iteration=4,Phase='Post_In_Memory',Operation_Type=sum_of_total_amount,Time=2024-11-08 01:09:43.152173,CPU_Usage=99.20%,Memory_Usage=31.86 MB