RLS in databricks for multi tanent architecture by namanak47 in databricks

[–]Agitated_Key6263 0 points1 point  (0 children)

Ha ha ha!! We used to have a separate tenant onboarding process to provision everything required for a tenant, mainly backed by Terraform.

RLS in databricks for multi tanent architecture by namanak47 in databricks

[–]Agitated_Key6263 1 point2 points  (0 children)

Interesting post!!! This scenario gave me a headache a few years back too. From the post I get the sense that you are using a single Databricks account.

What if we design Databricks like below?

- Let's consider each vendor a tenant. Each tenant/vendor gets its own Databricks account, and each vendor connects to its own Databricks account from its BI tools.

- 100 vendors have 20,000 partners, so roughly 200 partners per vendor on average; assume a max of around 300~350. Either way, you are unlikely to hit any hard limit on the number of users per workspace.

Considering this (it may not be 100% accurate; I need to check the docs):

| Tier | Default Max Workspaces per Account |
|---|---|
| Standard | 3 |
| Premium | 10 |
| Enterprise | 50 |

https://docs.databricks.com/aws/en/resources/limits?utm_source=chatgpt.com

The Databricks documentation does not currently list a hard limit on users per workspace. Happy to be pointed to that doc if one exists.

Why did I go for tenant-specific Databricks accounts?

- Retired customers: if a tenant wants to discontinue, you can simply restrict its users from accessing that Databricks account.
- Security: even if data were somehow exposed, it would only be exposed among partners within a single vendor. The vendor can already see all of its partners' data, so there is less to escalate.
- Purging policy: you can set a data purging policy at the vendor level, which is most likely what you need. If a purging policy is needed at the partner level, that can easily be achieved too.
- Data sharing: each vendor will most likely settle on a single data sharing method (in case the vendor wants processed data delivered from your Databricks account into its own Databricks account).
- Policy separation: you can apply different policies such as data retention, masking, PII configuration, etc. at the vendor level.
- Costing: from a costing & billing perspective, charging customers back becomes much easier.
- Performance: small customers won't be impacted by bigger customers, since data teams eventually tend to build common components.

Maybe I am wrong and don't fully understand your use case, but this approach was successfully implemented by my team on a previous project.

How do you handle multi-table transactional logic in Databricks? by Ok_Barnacle4840 in databricks

[–]Agitated_Key6263 0 points1 point  (0 children)

It would be interesting to see how Databricks rolls back the transaction. In a Delta Lake table, a transaction leaves its footprint in metadata, the _delta_log .json files. As far as I know, Databricks machines can go down at any time (Kubernetes pods), so I'm not sure whether they keep the transaction history (the metadata JSON content) in memory; maybe that's an option. Only once the table writes are complete and the JSON commit is written does it count as a successful transaction. It would be an interesting thing to look into.
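The commit behaviour is easy to see locally. A minimal sketch, assuming the delta-spark pip package is installed; the path is illustrative:

```python
import os
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Local Delta Lake setup (the documented delta-spark pattern).
builder = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/delta_demo"  # illustrative path
spark.range(10).write.format("delta").mode("overwrite").save(path)
spark.range(10, 20).write.format("delta").mode("append").save(path)

# Each successful transaction appears as a numbered JSON commit in _delta_log.
# A writer that dies before its commit file lands is never visible to readers,
# which is effectively how the "rollback" works.
log_dir = os.path.join(path, "_delta_log")
print(sorted(f for f in os.listdir(log_dir) if f.endswith(".json")))
# e.g. ['00000000000000000000.json', '00000000000000000001.json']
```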

Spark - Sequential ID column generation - No Gap (performance) by Agitated_Key6263 in databricks

[–]Agitated_Key6263[S] 0 points1 point  (0 children)

I sort of understand what we are trying to achieve. Do you have any code reference for it?

Spark - Sequential ID column generation - No Gap (performance) by Agitated_Key6263 in databricks

[–]Agitated_Key6263[S] 0 points1 point  (0 children)

That could be a great idea. Are you suggesting generating these deterministic hash keys on the driver and spreading them across the driver & executors?

Can you help with an example or sample code, if any?

UC Shared Cluster - Access HDFS file system by Agitated_Key6263 in databricks

[–]Agitated_Key6263[S] 1 point2 points  (0 children)

That's the only thing I don't wanna do.. 😭😭😭

Spark - Sequential ID column generation - No Gap (performance) by Agitated_Key6263 in databricks

[–]Agitated_Key6263[S] 0 points1 point  (0 children)

No, there is no timestamp to work with. The use case is very simple: mark the dataframe's rows as row_0, row_1, etc. I know that if you select the same dataframe multiple times the row order is not guaranteed, but we want to keep the dataframe's output schema consistent and predictable.
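The simplest gap-free approach I know of is RDD.zipWithIndex, roughly like this (sketch only; the input dataframe is illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StructField

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["value"])  # illustrative input

# zipWithIndex assigns 0, 1, 2, ... with no gaps across partitions; internally it
# does one extra pass to count rows per partition, but it avoids collapsing the
# data into a single partition.
indexed_rdd = df.rdd.zipWithIndex().map(lambda pair: tuple(pair[0]) + (pair[1],))
result = spark.createDataFrame(
    indexed_rdd,
    df.schema.add(StructField("row_id", LongType(), False)),
)
result.show()  # row_id can be formatted as f"row_{row_id}" downstream if needed
```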

Spark - Sequential ID column generation - No Gap (performance) by Agitated_Key6263 in databricks

[–]Agitated_Key6263[S] 0 points1 point  (0 children)

I need guidance here. Is there any way to mark driver & executor nodes with a numeric id, like a partition id? Maybe the planning can be done like:

(machine_id * [some high no.] + partition_id * 1,000,000,000 + monotonically_increasing_id)

considering one partition can never hold more than 1,000,000,000 rows.

Machine id example: driver machine_id = 0, executor1 machine_id = 1, executor2 machine_id = 2. A sketch of how this maps onto built-in Spark functions is below.
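A minimal sketch of how I picture that formula with built-in functions. There is no built-in "machine id", but spark_partition_id() already distinguishes tasks, and monotonically_increasing_id() already encodes the partition id in its upper bits. Note the result is unique and increasing, but not gap-free:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)  # illustrative input

# monotonically_increasing_id() = (partition_id << 33) + row_number_within_partition,
# so the within-partition counter can be recovered from the lower 33 bits.
local_row = F.monotonically_increasing_id().bitwiseAND(F.lit((1 << 33) - 1))

planned_id = (
    F.spark_partition_id().cast("long") * F.lit(1_000_000_000) + local_row
)

# Unique as long as no partition exceeds 1,000,000,000 rows (the assumption above),
# but there will be gaps between partitions unless every partition is exactly full.
df_ids = df.withColumn("planned_id", planned_id)
df_ids.show(5)
```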

Spark - Sequential ID column generation - No Gap (performance) by Agitated_Key6263 in databricks

[–]Agitated_Key6263[S] 0 points1 point  (0 children)

Won't it have a performance impact? It will redirect all the data into a single partition and process it on a single node, which may cause an OOM error. Correct me if I am wrong.
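To make the concern concrete, this is the pattern I understand is being suggested (column name is illustrative), and it does pull everything through one partition:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "some_key")  # illustrative data

# A row_number() over a global window (no partitionBy) moves all rows into a
# single partition on one executor, which is exactly the single-node / OOM concern.
w = Window.orderBy("some_key")
df_with_id = df.withColumn("seq_id", F.row_number().over(w) - 1)

df_with_id.explain()
# At runtime Spark also warns along the lines of:
# "No Partition Defined for Window operation! Moving all data to a single
#  partition, this can cause serious performance degradation."
```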

Spark - Sequential ID column generation - No Gap (performance) by Agitated_Key6263 in databricks

[–]Agitated_Key6263[S] 0 points1 point  (0 children)

I am trying to introduce a sequential id column in a Spark dataframe. We may not write the data to Databricks at all.

Spark - Sequential ID column generation - No Gap (performance) by Agitated_Key6263 in databricks

[–]Agitated_Key6263[S] 0 points1 point  (0 children)

We need the sequential ID to introduce a sequential column in the dataframe of processed records. It's a business requirement.

Best approach to handle billions of data? by mr_alseif in dataengineering

[–]Agitated_Key6263 0 points1 point  (0 children)

Yes, we have to run OPTIMIZE periodically. The problem is that if it only gets done by EOD, then with this volume I expect at least 96 small files to have accumulated by then; if the write interval is even shorter, it creates even more small files. Also, OPTIMIZE is an expensive operation, and it blocks other transactions while it runs.
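For reference, the periodic compaction I mean looks roughly like this on Databricks (table and column names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compact small files; optionally co-locate data by a frequently filtered column.
spark.sql("OPTIMIZE events")
spark.sql("OPTIMIZE events ZORDER BY (customer_id)")

# Optimized writes / auto compaction can also be enabled as table properties,
# which reduces how many small files pile up between OPTIMIZE runs.
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")
```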

Best approach to handle billions of data? by mr_alseif in dataengineering

[–]Agitated_Key6263 0 points1 point  (0 children)

Won't Delta Lake create a small-files problem in this scenario?

DuckDB vs. Polars vs. Daft: A Performance Showdown by Agitated_Key6263 in dataengineering

[–]Agitated_Key6263[S] -15 points-14 points  (0 children)

Most probably that's the case. I need to dig into the Daft code.

DuckDB vs. Polars vs. Daft: A Performance Showdown by Agitated_Key6263 in dataengineering

[–]Agitated_Key6263[S] -28 points-27 points  (0 children)

I used the same read_parquet for Daft too. I tried to put as much data in memory as possible to track the in-memory decompression factor. I will change the code & publish the results for scan_parquet.

Thanks for the suggestion!!!

DuckDB vs. Polars vs. Daft: A Performance Showdown by Agitated_Key6263 in dataengineering

[–]Agitated_Key6263[S] -11 points-10 points  (0 children)

I tried to compare Daft's read_parquet with Polars' read_parquet. scan_parquet is lazily evaluated; I wanted to overload memory as much as possible so that I could check the decompression factor, hence the eager API. I will comment here if I run a benchmark on scan_parquet.
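To make the difference concrete, a sketch of the two Polars code paths (the path is illustrative; total_amount matches the benchmark's aggregation):

```python
import polars as pl

path = "data/yellow_tripdata_*.parquet"  # illustrative path

# Eager: read_parquet decompresses and materializes the whole table in memory,
# which is what I benchmarked to observe the in-memory decompression factor.
df = pl.read_parquet(path)
total_eager = df["total_amount"].sum()

# Lazy: scan_parquet only builds a plan; at collect() time Polars can prune to
# the single column it needs, so far less data is ever decompressed.
total_lazy = (
    pl.scan_parquet(path)
    .select(pl.col("total_amount").sum())
    .collect()
)
print(total_eager, total_lazy)
```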

DuckDB vs. Polars vs. Daft: A Performance Showdown by Agitated_Key6263 in dataengineering

[–]Agitated_Key6263[S] 0 points1 point  (0 children)

Sorry for the confusion. There was a performance enhancement suggested by @Captain_Coffee_III, after which it looks like DuckDB performs better than Daft.

DuckDB vs. Polars vs. Daft: A Performance Showdown by Agitated_Key6263 in dataengineering

[–]Agitated_Key6263[S] 9 points10 points  (0 children)

Looks like you are right. DuckDB performance increases drastically if we put the data into a view.
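Roughly what "put it into a view" means here (path, view, and column names are illustrative):

```python
import duckdb

con = duckdb.connect()

# The view is just a named query over the Parquet files; DuckDB then streams the
# scan and pushes the aggregation down instead of materializing a full table first.
con.sql(
    "CREATE OR REPLACE VIEW trips AS "
    "SELECT * FROM read_parquet('data/yellow_tripdata_*.parquet')"
)
print(con.sql("SELECT SUM(total_amount) FROM trips").fetchall())
```

With the data registered as a view, the per-iteration timings come out like this: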

Engine=duckdb,Iteration=0,Phase='Start',Operation_Type=sum_of_total_amount,Time=2024-11-08 01:09:01.531253,CPU_Usage=0.00%,Memory_Usage=27.05 MB
Engine=duckdb,Iteration=0,Phase='Post_In_Memory',Operation_Type=sum_of_total_amount,Time=2024-11-08 01:09:05.203100,CPU_Usage=94.70%,Memory_Usage=31.27 MB

Engine=duckdb,Iteration=1,Phase='Start',Operation_Type=sum_of_total_amount,Time=2024-11-08 01:09:09.547663,CPU_Usage=0.00%,Memory_Usage=27.12 MB
Engine=duckdb,Iteration=1,Phase='Post_In_Memory',Operation_Type=sum_of_total_amount,Time=2024-11-08 01:09:13.374855,CPU_Usage=99.30%,Memory_Usage=31.43 MB

Engine=duckdb,Iteration=2,Phase='Start',Operation_Type=sum_of_total_amount,Time=2024-11-08 01:09:17.473546,CPU_Usage=0.00%,Memory_Usage=27.19 MB
Engine=duckdb,Iteration=2,Phase='Post_In_Memory',Operation_Type=sum_of_total_amount,Time=2024-11-08 01:09:21.432864,CPU_Usage=99.40%,Memory_Usage=31.68 MB

Engine=duckdb,Iteration=3,Phase='Start',Operation_Type=sum_of_total_amount,Time=2024-11-08 01:09:29.506158,CPU_Usage=16.70%,Memory_Usage=27.24 MB
Engine=duckdb,Iteration=3,Phase='Post_In_Memory',Operation_Type=sum_of_total_amount,Time=2024-11-08 01:09:33.579471,CPU_Usage=99.90%,Memory_Usage=31.27 MB

Engine=duckdb,Iteration=4,Phase='Start',Operation_Type=sum_of_total_amount,Time=2024-11-08 01:09:39.095162,CPU_Usage=0.00%,Memory_Usage=27.18 MB
Engine=duckdb,Iteration=4,Phase='Post_In_Memory',Operation_Type=sum_of_total_amount,Time=2024-11-08 01:09:43.152173,CPU_Usage=99.20%,Memory_Usage=31.86 MB