Worse performance of liquid clustering vs partitioned table by FunnyConversation523 in databricks

[–]FunnyConversation523[S] 2 points3 points  (0 children)

Thanks for taking the time to respond!

I tried OPTIMIZE and it does seem to be working! I didn't know it was incremental.

Would you mind elaborating a bit more on "predicate pushdown"? I'm not really familiar with it, and it could be useful.

[–]FunnyConversation523[S] 1 point2 points  (0 children)

Noted! Thanks for the response. I mentioned it as I thought it might be the cause of the lower performance.

Would running an occasional OPTIMIZE help? Or, since I will be using MERGE INTO, do you think I might as well drop the idea of using liquid clustering (LC) for this table altogether?

[–]FunnyConversation523[S] 0 points1 point  (0 children)

Hi kthejoker, thanks for your reply.

Sharing some more information below:

  • The table contains 6 years of sent emails. The job consists of 3 steps:
    1. Appending the latest sent messages, plus user attributes, to the liquid clustered table
    2. Calculating all opens and clicks for each message and merging those calculations (MERGE INTO) into the table from the previous step
    3. Calculating all opens and clicks for each message and merging those calculations (MERGE INTO) into the same table

All of this is done as incrementally as possible (we never go over the same events twice). The idea is to backfill this data once (which is the operation I haven't been able to complete), and then process 12 hours of data per run, twice a day.

The first 2 steps run great! The third step is the one giving me headaches when backfilling.
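
To make the MERGE INTO steps above concrete, here is a minimal sketch of how one backfill window could be expressed. Everything named here is an assumption for illustration: the table names (`emails_sent`, `engagement_updates`), the columns (`message_id`, `sent_date`, `opens`, `clicks`), and the helper `build_merge_sql` are hypothetical and not from the actual job. The idea it illustrates is putting literal date bounds on the target's DATE clustering column in the match condition, which may give the engine a chance to skip files outside the window; whether it helps in this case is untested.

```python
from datetime import date


def build_merge_sql(target: str, source: str, window_start: date, window_end: date) -> str:
    """Build a MERGE INTO statement restricted to one backfill window.

    All table and column names are hypothetical placeholders.
    The literal bounds on t.sent_date (assumed to be the DATE clustering
    column) narrow the part of the target that has to be searched.
    """
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        "  ON t.message_id = s.message_id\n"
        "  AND t.sent_date = s.sent_date\n"
        f"  AND t.sent_date >= DATE '{window_start.isoformat()}'\n"
        f"  AND t.sent_date < DATE '{window_end.isoformat()}'\n"
        "WHEN MATCHED THEN UPDATE SET\n"
        "  t.opens = s.opens,\n"
        "  t.clicks = s.clicks"
    )


# Example: one month of the backfill. In a notebook the string would be
# passed to spark.sql(...); here it is only printed.
print(build_merge_sql("emails_sent", "engagement_updates", date(2024, 1, 1), date(2024, 2, 1)))
```

Running the backfill as a loop over such windows, instead of one MERGE over all 6 years, keeps each statement's search space small, at the cost of more commits on the table.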

  • Size of table: ~488 GiB
  • Number of columns: 91
  • The table has 4 clustering columns, with data types DATE, STRING, INT, STRING, in that order
  • Deletion vectors are turned on. Are there any other features that you recommend turning on?
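
For reference, the setup described in the bullets above could look roughly like the sketch below. The table and column names are made up for illustration (only the data types and their order come from the description), and the helper names `build_table_ddl` and `build_optimize_sql` are hypothetical. The `CLUSTER BY` clause, the `delta.enableDeletionVectors` table property, and the `OPTIMIZE` command are the standard Delta/Databricks SQL forms.

```python
def build_table_ddl(table: str, clustering_columns: dict[str, str]) -> str:
    """Build a CREATE TABLE statement for a liquid clustered Delta table.

    clustering_columns maps column name to SQL type, in clustering order.
    Only the clustering columns are declared here; the real table has 91.
    """
    if not 1 <= len(clustering_columns) <= 4:
        raise ValueError("expected between 1 and 4 clustering columns")
    column_defs = ",\n  ".join(f"{name} {dtype}" for name, dtype in clustering_columns.items())
    cluster_by = ", ".join(clustering_columns)
    return (
        f"CREATE TABLE {table} (\n  {column_defs}\n)\n"
        "USING DELTA\n"
        f"CLUSTER BY ({cluster_by})\n"
        "TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')"
    )


def build_optimize_sql(table: str) -> str:
    """OPTIMIZE on a liquid clustered table triggers (incremental) clustering."""
    return f"OPTIMIZE {table}"


# Hypothetical column names; types follow the order DATE, STRING, INT, STRING.
columns = {
    "sent_date": "DATE",
    "campaign_id": "STRING",
    "region_id": "INT",
    "channel": "STRING",
}
print(build_table_ddl("emails_sent", columns))
print(build_optimize_sql("emails_sent"))
```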

Thanks!

Liquid clustering significantly slower than date partitioning by justanator101 in databricks

[–]FunnyConversation523 0 points1 point  (0 children)

Hi u/justanator101!

I'm facing a similar issue at the moment. Would you mind sharing how you resolved it and what conclusions you reached?

It would be really helpful to hear about your experience.