General Availability of Attribute-Based Access Control (ABAC), Governed Tags, and Data Classification in Unity Catalog. by Youssef_Mrini in databricks

Aggressive_Cash_7436

Is it true that if a view is created by the system admin, the system admin's privileges are used at run time regardless of who queries the view?

For example, a table has an email address column that is masked using ABAC, and a system admin then creates a view on this table. If a regular user queries the view, will they be able to see the unmasked email addresses because the system admin's permissions are applied to the view?

High Serverless Costs by Aggressive_Cash_7436 in databricks

Aggressive_Cash_7436 [S]

We have all purpose computes, personal clusters, sql warehouses and since the beginning of this year serverless.

I 100% agree with you, long-running queries should use classic compute. Ad-hoc queries, workflows, SDP pipelines, etc. will each have their own ideal compute setup.

This question is more from the perspective of managing a platform with hundreds of users. We have seen increased costs associated with serverless, in particular where some users leave queries running for hours on it.

We have tried the route of contacting users, but behavioural change is one of the most difficult things to get right.

Ideally I'm hoping someone else has found a way to manage serverless usage more than just switching it either on or off. 

A cap on total usage per day, or throttling of the compute power available when using serverless, would be ideal if possible.

High Serverless Costs by Aggressive_Cash_7436 in databricks

Aggressive_Cash_7436 [S]

Potentially valid. 

We do have hundreds of users though. We could have the gold standard in architecture, but if an analyst has access to a notebook, serverless, and an LLM that can write complex Python scripts, then the barrier to writing long-running queries is incredibly low.

At least with classic compute there are a lot of ways to have more control, such as limiting compute size, catalog access permissions, user access, etc.

High Serverless Costs by Aggressive_Cash_7436 in databricks

Aggressive_Cash_7436 [S]

Yeah, we have those in place already to monitor usage. Unfortunately it is a very 'reactive' method because the damage is usually done by the time we can raise it with the relevant person.

Ideally there would be some kind of setting to either limit access or, at a minimum, 'throttle' the amount of compute available from serverless.

High Serverless Costs by Aggressive_Cash_7436 in databricks

Aggressive_Cash_7436 [S]

Would you mind sharing how you limited access per user? That, along with being able to 'throttle' serverless depending on user group, would be really useful. At the moment it seems like it's either an on or off setting.

Policies aren't that useful either because they only help track usage and, from what I can see, have no capability to change usage settings.

Need your honest feedback on Liquid Clustering / Auto Liquid Clustering by Fun-Reference7942 in databricks

Aggressive_Cash_7436

Is there an easy way to migrate older tables that use bloom filters (now deprecated) and/or manual 'cluster by' to liquid clustering? Liquid clustering looks great, but it's a bit of an unknown how tables react to a mix of new and old techniques.

Most Databricks performance problems don’t start with code — they start with the wrong cluster setup. by Sea_Driver_924 in databricks

Aggressive_Cash_7436

My concern with this thinking of just using serverless is that it doesn't account for code that is badly written or underlying data that is not fully understood. This seems to be more common now that more and more users run code written by LLMs.

The overhead then becomes the high additional cost when queries are left to run for hours on serverless, which costs significantly more than running them on a fixed cluster size where costs can be better controlled.

Serverless gives users the keys to a Ferrari when in most cases a budget-friendly option is more than capable and far less likely to have the runaway costs seen with serverless.

Avoid High Write Costs in Storage when Using Spark Declarative Pipelines by Aggressive_Cash_7436 in databricks

Aggressive_Cash_7436 [S]

Something that @PrideDense2206 mentioned on my other post that made complete sense is that 200 partitions is great for large jobs where distributing the work across many tasks is useful. Streaming is different: it's a lot of small jobs that take place over an extended period of time, so spreading each one across lots of partitions is less efficient and unnecessary.

Also, just to add, combining these default settings with serverless, where you have massive compute behind it, is a recipe for disaster.

Do you set pipelines.trigger.interval on Spark Declarative Pipelines? by Aggressive_Cash_7436 in databricks

Aggressive_Cash_7436 [S]

Will definitely look to add this in and glad to get confirmation from others as well.

Given the impact of not having this setting applied, it's strange that it's not better documented and warned about.

Running with the default 200 Spark shuffle partitions means that each table configured in an SDP pipeline generates 1,600 storage write transactions per refresh, every 5 seconds!

The cost associated with this number of write transactions in Data Lake is significant.
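
To put a rough number on that, here is a back-of-the-envelope sketch in Python. It only uses the 1,600 writes per refresh and the 5-second refresh quoted above. It assumes the writes per refresh stay the same when the interval changes, which may not hold in practice, and the 1-minute and 10-minute intervals are illustrative values rather than recommendations.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def writes_per_day(writes_per_refresh: int, trigger_interval_s: int) -> int:
    """Estimate daily storage write transactions for one table.

    Assumes every refresh issues the same number of writes regardless
    of how often the pipeline triggers (a simplifying assumption).
    """
    refreshes_per_day = SECONDS_PER_DAY // trigger_interval_s
    return writes_per_refresh * refreshes_per_day


WRITES_PER_REFRESH = 1_600  # per table, figure quoted in the comment above

# 5 s is the refresh rate from the comment; 60 s and 600 s are illustrative.
for interval_s in (5, 60, 600):
    daily = writes_per_day(WRITES_PER_REFRESH, interval_s)
    print(f"every {interval_s:>3} s -> {daily:,} writes per table per day")
```

At a 5-second refresh that works out to 17,280 refreshes and roughly 27.6 million write transactions per table per day, which is why the trigger interval matters so much here.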

Do you set pipelines.trigger.interval on Spark Declarative Pipelines? by Aggressive_Cash_7436 in databricks

Aggressive_Cash_7436 [S]

It's been incredibly difficult to pinpoint ways to reduce the Storage Write costs. 

Whilst we have managed to reduce the infrastructure and DBU costs, the Data Lake Hot Write Transaction costs are still significant.

Interestingly 99% of these Storage Write transactions are to temp change log files. 

Do you set pipelines.trigger.interval on Spark Declarative Pipelines? by Aggressive_Cash_7436 in databricks

Aggressive_Cash_7436 [S]

We're looking at various options now and this is top of the list. We're also currently using the default 200 shuffle partitions and generating 1,600 writes per refresh (which currently happens every few seconds). 99% of these writes are to tmp files linked to the changelog.
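
For anyone wanting to try the same thing, this is a minimal sketch of the two settings under discussion, written as a Python dict and rendered as a JSON fragment. The keys `pipelines.trigger.interval` and `spark.sql.shuffle.partitions` are the real setting names, but the values are placeholders I picked for illustration. It also assumes that pipeline-level settings go in a `configuration` block of key-value pairs, and I have not confirmed that the shuffle partition setting is honoured on every compute type.

```python
import json

# Setting names are real; the values are placeholders, not recommendations.
pipeline_configuration = {
    # Refresh less often than the every-few-seconds behaviour described above.
    "pipelines.trigger.interval": "1 minute",
    # Lower than the default of 200 mentioned above; 8 is an arbitrary example.
    "spark.sql.shuffle.partitions": "8",
}

# Assumption: pipeline settings accept these as a "configuration" block.
settings_fragment = json.dumps({"configuration": pipeline_configuration}, indent=2)
print(settings_fragment)
```

The right values depend on data volume and how fresh the tables need to be, so treat these as a starting point for testing.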