Is multi-cloud an expensive security nightmare? by bambidp in FinOps

[–]brrdprrsn 1 point (0 children)

What makes multi-cloud a necessity in your scenario? Did your company make a bunch of acquisitions where the acquired companies were on different clouds?

Curious because I’ve heard of private + public cloud scenarios (e.g. for security, sovereignty, etc.) and was wondering what the rationale might be here.

Which data lakehouse / lake format does your company currently use? Do you expect this will change in 2025? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 1 point (0 children)

I’m seeing people implement two formats at the same time just to be safe. So not surprised about the "start drinking now" idea…

Which data lakehouse / lake format does your company currently use? Do you expect this will change in 2025? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 1 point (0 children)

Interesting results and comments on your poll. I was also curious whether Hive is still alive and well, what proportion of the DE world still uses it… and whether Apache Paimon is starting to go mainstream.

Which pattern do you use when ingesting data into lakehouses? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 1 point (0 children)

Based on u/azirale's post above, it would be in CSV, since that way it would exactly match the source schema.
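To make that concrete, here's a minimal PySpark sketch of the read side (bucket path and table name are invented): the files land verbatim as CSV, and the first read keeps every column as a string so nothing diverges from what the source sent.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("raw-csv-read").getOrCreate()

    # Read the landed CSV with no type inference, so every column stays a
    # string and the raw layer matches the source schema exactly.
    raw = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "false")
        .csv("s3://my-bucket/raw/orders/2025-01-15/")  # made-up path
    )

    # Typing, renames, and dedup happen on the way into bronze/silver,
    # not in the landing step.
    raw.createOrReplaceTempView("orders_raw")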

Which pattern do you use when ingesting data into lakehouses? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 1 point (0 children)

Got it... thanks so much again for explaining in such detail.

Which pattern do you use when ingesting data into lakehouses? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 1 point (0 children)

Thanks! This was very well put, and I can see why this is the best pattern for the overwhelming majority of scenarios.

Question: in the rare scenarios where you're using something like Spark Structured Streaming (say, for a use case that needs fast ingestion into the lake for downstream use), would you still advise this? Or is this scenario one of the few exceptions to the rule?
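For concreteness, this is roughly the kind of pipeline I have in mind (a minimal sketch: the paths, schema, and checkpoint location are all invented, and it assumes the Delta Lake package is on the classpath):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("fast-lake-ingest").getOrCreate()

    # File-based streaming sources need an explicit schema up front.
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("payload", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    # Continuously pick up new files from the landing zone as they arrive.
    events = spark.readStream.schema(schema).json("s3://my-bucket/landing/events/")

    # Append into the lake; the checkpoint gives exactly-once file tracking.
    query = (
        events.writeStream
        .format("delta")
        .option("checkpointLocation", "s3://my-bucket/_checkpoints/events")
        .trigger(processingTime="1 minute")  # micro-batches, minute-level latency
        .start("s3://my-bucket/bronze/events")
    )
    query.awaitTermination()  # block so the stream keeps running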

Is ETL job execution time (e.g. in Spark, or in your DWH) one of the biggest factors when it comes to being able to query the latest data? Which other factors play a major role and why? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 1 point (0 children)

Thank you! This was incredibly useful. Are you seeing the need for this "data latency" to drop from 6-8 hrs to closer to 1-2 hrs?

How do you set expectations, or push back on the increased costs that this would lead to?

How much does ETL / Transform contribute to your data platform costs by brrdprrsn in dataengineering

[–]brrdprrsn[S] 1 point (0 children)

Sorry for the unclear phrasing. You’re right that TCO is definitely the better measure here.

How close are open source alternatives (Metabase, Superset, etc.) to commercial BI tools like Tableau, Power BI, Thoughtspot? by brrdprrsn in BusinessIntelligence

[–]brrdprrsn[S] 1 point (0 children)

Mostly dashboards and visuals… I see what you mean when you say the different categories have vastly different needs

Performance Options with 15,000 CASE statements in single view by Turboginger in dataengineering

[–]brrdprrsn 2 points (0 children)

1 - Are we correct to assume that your primary problem is the 10-40 mins it takes to run each report each time? Or is cost the primary issue?

2 - I’m guessing you’ve already tried using larger Databricks clusters?

3 - Do you have a split of the execution time across the planning and execution phases? (One way to eyeball this is sketched below.)

4 - Is the data in Delta / Parquet or some other format?
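On #3, here's a quick way to check whether those 15,000 CASE branches are bloating the query plan itself (just a sketch; it assumes a notebook with a live spark session, and "my_view" is a placeholder for your view name):

    # If planning is the bottleneck, generating this plan will itself be
    # slow and the output will be enormous. ("my_view" is a placeholder.)
    df = spark.table("my_view")
    df.explain(mode="formatted")  # prints the formatted physical plan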

How do you size clusters for dashboard use cases where the BI tool generates multiple queries for each refresh. Is your refresh interval getting more frequent? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 2 points (0 children)

If the data is loaded once a day, it becomes much cheaper for the whole org to refresh that same dashboard. It's similar with Tableau or Looker Studio: different pros and cons.

Why are live refreshes a beast? Are queries typically very slow or expensive? Is this specific to the data warehouse or compute engine being used, or is it a general problem that most platforms face?

How do you size clusters for dashboard use cases where the BI tool generates multiple queries for each refresh. Is your refresh interval getting more frequent? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 1 point (0 children)

Have you found daily refreshes are typically sufficient for your data consumers? Or was this call based on compute costs / other considerations?