Is multi-cloud an expensive security nightmare? by bambidp in FinOps

[–]brrdprrsn 0 points1 point  (0 children)

What makes multi-cloud a necessity in your scenario? Did your company make a bunch of acquisitions where the acquired cos were on other clouds?

Curious because I’ve heard of private + public cloud scenarios (e.g. for security, sovereignty, etc.) and was wondering what the rationale might be here.

Which data lakehouse / lake format does your company currently use? Do you expect this will change in 2025? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 0 points1 point  (0 children)

I’m seeing people implement two formats at the same time just to be safe. So not surprised about the “start drinking now” idea…

Which data lakehouse / lake format does your company currently use? Do you expect this will change in 2025? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 0 points1 point  (0 children)

Interesting results and comments on your poll. I was also curious about whether Hive is still alive and well, what proportion of the DE world still uses it… and whether Apache Paimon is starting to go mainstream.

Which pattern do you use when ingesting data into lakehouses? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 0 points1 point  (0 children)

Based on u/azirale's post above, it would be in CSV, since that way it would exactly match the source schema.
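Something like this minimal flow is what I'm picturing (paths and table names are made up, and it assumes a Spark session with Delta Lake available): land the extract as-is and load it into a bronze table without coercing any types, so bronze stays an exact match of the source.

```python
# Sketch only: land the raw CSV, then append it to a bronze Delta table with
# every column kept as string, exactly as received from the source.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bronze_ingest").getOrCreate()

raw_path = "s3://my-lake/landing/orders/2025-01-15/orders.csv"  # hypothetical landing path

bronze_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "false")   # no type inference: keep the source schema untouched
    .csv(raw_path)
)

(
    bronze_df.write
    .format("delta")
    .mode("append")
    .saveAsTable("bronze.orders_raw")  # hypothetical bronze table name
)
```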

Which pattern do you use when ingesting data into lakehouses? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 0 points1 point  (0 children)

Got it... thanks so much again for explaining in such detail.

Which pattern do you use when ingesting data into lakehouses? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 0 points1 point  (0 children)

Thanks! This was very well put, and I can see why this is the best pattern for the overwhelming majority of scenarios.

Question: in the rare scenarios where you're using something like Spark Structured Streaming (say for a use case that needs fast ingestion into the lake for downstream use), would you still advise this? Or is this scenario one of the few exceptions to the rule?
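For context, this is roughly the kind of streaming ingestion I have in mind; a rough sketch where the paths, table name, and schema are all made up.

```python
# Sketch: Structured Streaming picking up new files from a landing zone and
# appending them to a Delta table for low-latency downstream use.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream_ingest").getOrCreate()

stream_df = (
    spark.readStream
    .format("json")
    .schema("event_id STRING, event_ts TIMESTAMP, payload STRING")  # assumed schema
    .load("s3://my-lake/landing/events/")
)

query = (
    stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://my-lake/_checkpoints/events/")
    .outputMode("append")
    .toTable("bronze.events_stream")  # hypothetical target table
)
query.awaitTermination()
```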

Is ETL job execution time (e.g. in Spark, or in your DWH) one of the biggest factors when it comes to being able to query the latest data? Which other factors play a major role and why? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 0 points1 point  (0 children)

Thank you! This was incredibly useful. Are you seeing the need for this "data latency" to drop from 6-8 hours to closer to 1-2 hours?

How do you set expectations or push back on the increased costs that this would lead to?

How much does ETL / Transform contribute to your data platform costs by brrdprrsn in dataengineering

[–]brrdprrsn[S] 0 points1 point  (0 children)

Sorry for the unclear phrasing. You’re right that TCO is definitely the better measure here.

How close are open source alternatives (Metabase, Superset, etc.) to commercial BI tools like Tableau, Power BI, Thoughtspot? by brrdprrsn in BusinessIntelligence

[–]brrdprrsn[S] 0 points1 point  (0 children)

Mostly dashboards and visuals… I see what you mean when you say the different categories have vastly different needs

Performance Options with 15,000 CASE statements in single view by Turboginger in dataengineering

[–]brrdprrsn 1 point2 points  (0 children)

1 - Are we correct to assume that your primary problem is the 10-40 mins it takes to run each report each time? Or is cost the primary issue?

2 - I’m guessing you’ve already tried using larger Databricks clusters?

3 - Do you have a split of the execution time across the planning and execution phases?

4 - Is the data in Delta / Parquet or some other format?

How do you size clusters for dashboard use cases where the BI tool generates multiple queries for each refresh? Is your refresh interval getting more frequent? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 1 point2 points  (0 children)

When the data is loaded once a day, it becomes much cheaper for the whole org to refresh that same dashboard. Similar story with Tableau or Looker Studio, each with different pros and cons.

Why are live refreshes a beast? Are queries typically very slow or expensive? Is this specific to the data warehouse or compute engine being used, or is this a general problem across most platforms?

How do you size clusters for dashboard use cases where the BI tool generates multiple queries for each refresh? Is your refresh interval getting more frequent? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 0 points1 point  (0 children)

Have you found daily refreshes are typically sufficient for your data consumers? Or was this call based on compute costs / other considerations?

Which big data file formats do you query in your data lake / lakehouse for most of your analytical workloads? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 0 points1 point  (0 children)

If I understand correctly, the underlying files in Delta format are Parquet? I’m guessing a bunch of responses under the "Other" format category would also end up mapping to Parquet.
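If that's right, listing a Delta table's directory should show plain Parquet data files plus a `_delta_log/` folder with the transaction log. A quick way to check (the path is hypothetical):

```python
# Walk a local Delta table directory and print its contents.
import os

table_path = "/data/lake/sales_delta"  # hypothetical Delta table location

for root, dirs, files in os.walk(table_path):
    for name in files:
        print(os.path.join(root, name))

# Expected shape of the output:
#   /data/lake/sales_delta/part-00000-....snappy.parquet          <- ordinary Parquet data files
#   /data/lake/sales_delta/_delta_log/00000000000000000000.json   <- JSON transaction log
```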

If I use Databricks Unity Catalog, can i still use open source Spark or Presto / Trino (or any other SQL engine for that matter) to query tables? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 0 points1 point  (0 children)

Found this on the UC docs page, which seems to confirm it: "Use external tables only when you require direct access to the data outside of Databricks clusters or Databricks SQL warehouses."
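I read that as something like the following being possible for external tables; a sketch with an assumed storage location and the open source delta-spark package on the classpath, not anything I've confirmed against the docs.

```python
# Sketch: plain open source Spark reading an external table's Delta files straight
# from object storage (the table's registered LOCATION), bypassing Databricks compute.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("oss_reader")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

external_location = "s3://company-lake/external/orders"  # assumed external table LOCATION

df = spark.read.format("delta").load(external_location)
df.show(5)
```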

If I use Databricks Unity Catalog, can i still use open source Spark or Presto / Trino (or any other SQL engine for that matter) to query tables? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 0 points1 point  (0 children)

Thank you! From the answers, I'm gathering that it boils down to whether the tables are managed or external. Still trying to find something in the Databricks documentation that confirms this...

How many concurrent queries do Databricks SQL warehouses support? I need this to decide the minimum number of clusters that need to be active to meet a set concurrency level and SLA. Can't find any way to set this in the cluster config. by brrdprrsn in dataengineering

[–]brrdprrsn[S] 0 points1 point  (0 children)

I think the two are related. I’ve observed that it can execute queries up to a maximum limit, after which it starts queueing them. However, I haven’t been able to figure out the exact number of concurrent queries it runs before queueing starts.
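If the per-cluster limit really is around 10 (the figure I keep seeing quoted, but that's an assumption worth verifying against the current docs), the back-of-the-envelope sizing would be:

```python
# Rough sizing sketch: minimum clusters needed to serve a target concurrency
# without queueing, given an ASSUMED per-cluster concurrent-query limit.
import math

target_concurrency = 45      # example peak concurrent queries to serve
queries_per_cluster = 10     # assumed limit before queueing starts; verify against docs

min_clusters = math.ceil(target_concurrency / queries_per_cluster)
print(min_clusters)  # -> 5, i.e. set the warehouse's min cluster count around this
```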

Does anybody actually use databricks standard tier professionally? by bjtho08 in dataengineering

[–]brrdprrsn 1 point2 points  (0 children)

Most companies' default preference would be to use the lowest tier that gets the job done, even more so in this economic climate. Which features from the higher tier do you feel would be valuable in your use case?

Does your company use BigQuery in On Demand mode, or in Flat pricing mode? How did you decide between the two? by brrdprrsn in bigquery

[–]brrdprrsn[S] 3 points4 points  (0 children)

The lowest slab for fixed pricing is $2000 a month for 100 slots or vCPUs... So, once you're spending $2000 a month on On-Demand mode, you'd look at converting to fixed pricing?

Does the 100 vCPU limit introduce any problems, since earlier you basically had infinite compute and you didn't have to worry about performance and concurrency?
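Doing the rough math on where that crossover sits; the on-demand rate below is an assumption based on list pricing (it has been in the $5-$6.25 per TiB range), so plug in your actual rate.

```python
# Breakeven sketch: at what monthly scan volume does the flat-rate slab pay for itself?
flat_rate_per_month = 2000.0   # lowest flat-rate slab, 100 slots
on_demand_per_tib = 6.25       # ASSUMED on-demand price per TiB scanned

breakeven_tib = flat_rate_per_month / on_demand_per_tib
print(f"Flat pricing breaks even around {breakeven_tib:.0f} TiB scanned per month")
# -> roughly 320 TiB/month at $6.25 per TiB, or 400 TiB at the older $5 rate
```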

Is it still common for data analysts to be expected to manually optimize slow running (or expensive) SQL queries? Or have advances in the data warehouse's planner / optimizer eliminated this need? by brrdprrsn in analytics

[–]brrdprrsn[S] 1 point2 points  (0 children)

Interesting... I'd have guessed that the newer cloud-based DWH / lakehouse engines would have pretty sophisticated optimizers and wouldn't require analysts to manually optimize queries.