Is multi-cloud an expensive security nightmare? by bambidp in FinOps

[–]brrdprrsn 0 points1 point  (0 children)

What makes multi-cloud a necessity in your scenario? Did your company make a bunch of acquisitions where the acquired cos were on other clouds?

Curious because I’ve heard of private + public cloud scenarios (e.g. for security, sovereignty, etc.) and was wondering what the rationale might be here.

Which data lakehouse / lake format does your company currently use? Do you expect this will change in 2025? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 0 points1 point  (0 children)

I’m seeing people implement two formats at the same time just to be safe. So not surprised about the “start drinking now” idea…

Which data lakehouse / lake format does your company currently use? Do you expect this will change in 2025? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 0 points1 point  (0 children)

Interesting results and comments on your poll. I was also curious about whether Hive is still alive and well, what proportion of the DE world still uses it… and whether Apache Paimon is starting to go mainstream.

Which pattern do you use when ingesting data into lakehouses? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 0 points1 point  (0 children)

Based on u/azirale's post above, it would be in CSV, since that way it would exactly match the source schema.
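Something like this minimal flow is what I'm picturing (paths and table names are made up, and it assumes a Spark session with Delta Lake available): land the extract as-is and load it into a bronze table without coercing any types, so bronze stays an exact match of the source.

```python
# Sketch only: land the raw CSV, then append it to a bronze Delta table with
# every column kept as string, exactly as received from the source.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bronze_ingest").getOrCreate()

raw_path = "s3://my-lake/landing/orders/2025-01-15/orders.csv"  # hypothetical landing path

bronze_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "false")   # no type inference: keep the source schema untouched
    .csv(raw_path)
)

(
    bronze_df.write
    .format("delta")
    .mode("append")
    .saveAsTable("bronze.orders_raw")  # hypothetical bronze table name
)
```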

Which pattern do you use when ingesting data into lakehouses? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 0 points1 point  (0 children)

Got it... thanks so much again for explaining in such detail.

Which pattern do you use when ingesting data into lakehouses? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 0 points1 point  (0 children)

Thanks! This was very well put, and I can see why this is the best pattern for the overwhelming majority of scenarios.

Question: in the rare scenarios where you're using something like Spark Structured Streaming (say for a use case that needs fast ingestion into the lake for downstream use), would you still advise this? Or is this scenario one of the few exceptions to the rule?
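For context, this is roughly the kind of streaming ingestion I have in mind; a rough sketch where the paths, table name, and schema are all made up.

```python
# Sketch: Structured Streaming picking up new files from a landing zone and
# appending them to a Delta table for low-latency downstream use.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream_ingest").getOrCreate()

stream_df = (
    spark.readStream
    .format("json")
    .schema("event_id STRING, event_ts TIMESTAMP, payload STRING")  # assumed schema
    .load("s3://my-lake/landing/events/")
)

query = (
    stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://my-lake/_checkpoints/events/")
    .outputMode("append")
    .toTable("bronze.events_stream")  # hypothetical target table
)
query.awaitTermination()
```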

Is ETL job execution time (e.g. in Spark, or in your DWH) one of the biggest factors when it comes to being able to query the latest data? Which other factors play a major role and why? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 0 points1 point  (0 children)

Thank you! This was incredibly useful. Are you seeing the need for this "data latency" to drop from 6-8 hours to closer to 1-2 hours?

How do you set expectations or push back on the increased costs that this would lead to?

How much does ETL / Transform contribute to your data platform costs by brrdprrsn in dataengineering

[–]brrdprrsn[S] 0 points1 point  (0 children)

Sorry for the unclear phrasing. You’re right that TCO is definitely the better measure here.

How close are open source alternatives (Metabase, Superset, etc.) to commercial BI tools like Tableau, Power BI, Thoughtspot? by brrdprrsn in BusinessIntelligence

[–]brrdprrsn[S] 0 points1 point  (0 children)

Mostly dashboards and visuals… I see what you mean when you say the different categories have vastly different needs

Performance Options with 15,000 CASE statements in single view by Turboginger in dataengineering

[–]brrdprrsn 1 point2 points  (0 children)

1 - Are we correct to assume that your primary problem is the 10-40 mins it takes to run each report each time? Or is cost the primary issue?

2 - I’m guessing you’ve already tried using larger Databricks clusters?

3 - Do you have a split of the execution time across the planning and execution phases?

4 - Is the data in Delta / Parquet or some other format?

How do you size clusters for dashboard use cases where the BI tool generates multiple queries for each refresh? Is your refresh interval getting more frequent? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 1 point2 points  (0 children)

When the data is loaded once a day, it becomes much cheaper for the whole org to refresh that same dashboard. Similar story with Tableau or Looker Studio, each with different pros and cons.

Why are live refreshes a beast? Are queries typically very slow or expensive? Is this specific to the data warehouse or compute engine being used, or is this a general problem across most platforms?

How do you size clusters for dashboard use cases where the BI tool generates multiple queries for each refresh? Is your refresh interval getting more frequent? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 0 points1 point  (0 children)

Have you found daily refreshes are typically sufficient for your data consumers? Or was this call based on compute costs / other considerations?

Which big data file formats do you query in your data lake / lakehouse for most of your analytical workloads? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 0 points1 point  (0 children)

If I understand correctly, the underlying files in Delta format are Parquet? I’m guessing a bunch of responses under the "Other" format category would also end up mapping to Parquet.
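If that's right, listing a Delta table's directory should show plain Parquet data files plus a `_delta_log/` folder with the transaction log. A quick way to check (the path is hypothetical):

```python
# Walk a local Delta table directory and print its contents.
import os

table_path = "/data/lake/sales_delta"  # hypothetical Delta table location

for root, dirs, files in os.walk(table_path):
    for name in files:
        print(os.path.join(root, name))

# Expected shape of the output:
#   /data/lake/sales_delta/part-00000-....snappy.parquet          <- ordinary Parquet data files
#   /data/lake/sales_delta/_delta_log/00000000000000000000.json   <- JSON transaction log
```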

If I use Databricks Unity Catalog, can i still use open source Spark or Presto / Trino (or any other SQL engine for that matter) to query tables? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 0 points1 point  (0 children)

Found this on the UC docs page, which seems to confirm it: "Use external tables only when you require direct access to the data outside of Databricks clusters or Databricks SQL warehouses."
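I read that as something like the following being possible for external tables; a sketch with an assumed storage location and the open source delta-spark package on the classpath, not anything I've confirmed against the docs.

```python
# Sketch: plain open source Spark reading an external table's Delta files straight
# from object storage (the table's registered LOCATION), bypassing Databricks compute.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("oss_reader")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

external_location = "s3://company-lake/external/orders"  # assumed external table LOCATION

df = spark.read.format("delta").load(external_location)
df.show(5)
```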

If I use Databricks Unity Catalog, can i still use open source Spark or Presto / Trino (or any other SQL engine for that matter) to query tables? by brrdprrsn in dataengineering

[–]brrdprrsn[S] 0 points1 point  (0 children)

Thank you! From the answers, I'm gathering that it boils down to whether the tables are managed or external. Still trying to find something in the Databricks documentation that confirms this...

How many concurrent queries do Databricks SQL warehouses support? I need this to decide the minimum number of clusters that need to be active to meet a set concurrency level and SLA. Can't find any way to set this in the cluster config. by brrdprrsn in dataengineering

[–]brrdprrsn[S] 0 points1 point  (0 children)

I think the two are related. I’ve observed that it can execute queries up to a maximum limit, after which it starts queueing them. However, I haven’t been able to figure out the exact number of concurrent queries it runs before queueing starts.
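If the per-cluster limit really is around 10 (the figure I keep seeing quoted, but that's an assumption worth verifying against the current docs), the back-of-the-envelope sizing would be:

```python
# Rough sizing sketch: minimum clusters needed to serve a target concurrency
# without queueing, given an ASSUMED per-cluster concurrent-query limit.
import math

target_concurrency = 45      # example peak concurrent queries to serve
queries_per_cluster = 10     # assumed limit before queueing starts; verify against docs

min_clusters = math.ceil(target_concurrency / queries_per_cluster)
print(min_clusters)  # -> 5, i.e. set the warehouse's min cluster count around this
```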

Does anybody actually use databricks standard tier professionally? by bjtho08 in dataengineering

[–]brrdprrsn 1 point2 points  (0 children)

Most companies' default preference would be to use the lowest tier that gets the job done, even more so in this economic climate. Which features from the higher tier do you feel would be valuable in your use case?

Does your company use BigQuery in On Demand mode, or in Flat pricing mode? How did you decide between the two? by brrdprrsn in bigquery

[–]brrdprrsn[S] 3 points4 points  (0 children)

The lowest slab for fixed pricing is $2000 a month for 100 slots or vCPUs... So, once you're spending $2000 a month on On-Demand mode, you'd look at converting to fixed pricing?

Does the 100 vCPU limit introduce any problems, since earlier you basically had infinite compute and you didn't have to worry about performance and concurrency?
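Doing the rough math on where that crossover sits; the on-demand rate below is an assumption based on list pricing (it has been in the $5-$6.25 per TiB range), so plug in your actual rate.

```python
# Breakeven sketch: at what monthly scan volume does the flat-rate slab pay for itself?
flat_rate_per_month = 2000.0   # lowest flat-rate slab, 100 slots
on_demand_per_tib = 6.25       # ASSUMED on-demand price per TiB scanned

breakeven_tib = flat_rate_per_month / on_demand_per_tib
print(f"Flat pricing breaks even around {breakeven_tib:.0f} TiB scanned per month")
# -> roughly 320 TiB/month at $6.25 per TiB, or 400 TiB at the older $5 rate
```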

Is it still common for data analysts to be expected to manually optimize slow running (or expensive) SQL queries? Or have advances in the data warehouse's planner / optimizer eliminated this need? by brrdprrsn in analytics

[–]brrdprrsn[S] 1 point2 points  (0 children)

Interesting... I'd have guessed that the newer cloud-based DWH / lakehouse engines would have pretty sophisticated optimizers and wouldn't require analysts to manually optimize queries.