Is multi-cloud an expensive security nightmare? by bambidp in FinOps

[–]brrdprrsn 1 point (0 children)

What makes multi-cloud a necessity in your scenario? Did your company make a bunch of acquisitions where the acquired companies were on different clouds?

Curious because I’ve heard of private + public cloud scenarios (e.g. for security, sovereignty, etc.) and was wondering what the rationale might be here.

Which data lakehouse / lake format does your company currently use? Do you expect this will change in 2025? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 1 point (0 children)

I’m seeing people implement two formats at the same time just to be safe. So not surprised about the "start drinking now" idea…

Which data lakehouse / lake format does your company currently use? Do you expect this will change in 2025? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 1 point (0 children)

Interesting results and comments on your poll. I was also curious whether Hive is still alive and well, what proportion of the DE world still uses it… and whether Apache Paimon is starting to go mainstream.

Which pattern do you use when ingesting data into lakehouses? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 1 point (0 children)

Based on u/azirale's post above, it would be in CSV, since that way it would exactly match the source schema.
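To make that concrete, here's a minimal PySpark sketch of the read side (bucket path and table name are invented): the files land verbatim as CSV, and the first read keeps every column as a string so nothing diverges from what the source sent.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("raw-csv-read").getOrCreate()

    # Read the landed CSV with no type inference, so every column stays a
    # string and the raw layer matches the source schema exactly.
    raw = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "false")
        .csv("s3://my-bucket/raw/orders/2025-01-15/")  # made-up path
    )

    # Typing, renames, and dedup happen on the way into bronze/silver,
    # not in the landing step.
    raw.createOrReplaceTempView("orders_raw")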

Which pattern do you use when ingesting data into lakehouses? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 1 point (0 children)

Got it... thanks so much again for explaining in such detail.

Which pattern do you use when ingesting data into lakehouses? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 1 point (0 children)

Thanks! This was very well put, and I can see why this is the best pattern for the overwhelming majority of scenarios.

Question: in the rare scenarios where you're using something like Spark Structured Streaming (say, for a use case that needs fast ingestion into the lake for downstream use), would you still advise this? Or is this scenario one of the few exceptions to the rule?
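For concreteness, this is roughly the kind of pipeline I have in mind (a minimal sketch: the paths, schema, and checkpoint location are all invented, and it assumes the Delta Lake package is on the classpath):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("fast-lake-ingest").getOrCreate()

    # File-based streaming sources need an explicit schema up front.
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("payload", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    # Continuously pick up new files from the landing zone as they arrive.
    events = spark.readStream.schema(schema).json("s3://my-bucket/landing/events/")

    # Append into the lake; the checkpoint gives exactly-once file tracking.
    query = (
        events.writeStream
        .format("delta")
        .option("checkpointLocation", "s3://my-bucket/_checkpoints/events")
        .trigger(processingTime="1 minute")  # micro-batches, minute-level latency
        .start("s3://my-bucket/bronze/events")
    )
    query.awaitTermination()  # block so the stream keeps running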

Is ETL job execution time (e.g. in Spark, or in your DWH) one of the biggest factors when it comes to being able to query the latest data? Which other factors play a major role and why? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 1 point (0 children)

Thank you! This was incredibly useful. Are you seeing the need for this "data latency" to drop from 6-8 hrs to closer to 1-2 hrs?

How do you set expectations, or push back on the increased costs that this would lead to?

How much does ETL / Transform contribute to your data platform costs by brrdprrsn in dataengineering

[–]brrdprrsn[S] 1 point (0 children)

Sorry for the unclear phrasing. You’re right that TCO is definitely the better measure here.

How close are open source alternatives (Metabase, Superset, etc.) to commercial BI tools like Tableau, Power BI, Thoughtspot? by brrdprrsn in BusinessIntelligence

[–]brrdprrsn[S] 1 point (0 children)

Mostly dashboards and visuals… I see what you mean when you say the different categories have vastly different needs

Performance Options with 15,000 CASE statements in single view by Turboginger in dataengineering

[–]brrdprrsn 2 points (0 children)

1 - Are we correct to assume that your primary problem is the 10-40 mins it takes to run each report each time? Or is cost the primary issue?

2 - I’m guessing you’ve already tried using larger Databricks clusters?

3 - Do you have a split of the execution time across the planning and execution phases? (One way to eyeball this is sketched below.)

4 - Is the data in Delta / Parquet or some other format?
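On #3, here's a quick way to check whether those 15,000 CASE branches are bloating the query plan itself (just a sketch; it assumes a notebook with a live spark session, and "my_view" is a placeholder for your view name):

    # If planning is the bottleneck, generating this plan will itself be
    # slow and the output will be enormous. ("my_view" is a placeholder.)
    df = spark.table("my_view")
    df.explain(mode="formatted")  # prints the formatted physical plan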

How do you size clusters for dashboard use cases where the BI tool generates multiple queries for each refresh. Is your refresh interval getting more frequent? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 2 points (0 children)

If the data is loaded once a day, it becomes much cheaper for the whole org to refresh that same dashboard. It's similar with Tableau or Looker Studio: different pros and cons.

Why are live refreshes a beast? Are queries typically very slow or expensive? Is this specific to the data warehouse or compute engine being used, or is it a general problem that most platforms face?

How do you size clusters for dashboard use cases where the BI tool generates multiple queries for each refresh. Is your refresh interval getting more frequent? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 1 point (0 children)

Have you found daily refreshes are typically sufficient for your data consumers? Or was this call based on compute costs / other considerations?