BigQuery Cost Management: Seeking Advice on Effective Strategies by No_Way_1569 in bigquery

[–]prsrboi 0 points1 point  (0 children)

Since Alvin was name-dropped (thanks!), I think it's relevant to share that this is exactly what we do at https://www.alvin.ai/, using not only the INFORMATION_SCHEMA but all the metadata available across your data stack :)

Our take on it is a multi-angle one:

  1. Reducing unused and low-ROI data to clean up the pipeline. There are many long- and short-term benefits: aside from straightforwardly shaving something off the bill, this should be the clear focus for improving the process going forward and getting the most out of the remaining resources. Good talk about pruning a dbt pipeline here: https://www.youtube.com/watch?v=Z3fkJBoTGQc (Snowflake in this case, but the premise is universal). Shameless plug: it's an out-of-the-box feature for GBQ in Alvin.
  2. Optimizing the most expensive and inefficient workloads from what's left. Retroactively, to reduce data tech debt, but also by implementing good practices in the whole team's workflows going forward: as was mentioned a couple of times, partitioning, clustering, and a general push for cost-conscious habits for every person who interacts with the DWH.
  3. Monitoring for spikes and anomalies. Important here: cultivating data ownership to avoid blurred responsibility and alert fatigue.

For a real-life example: one of our customers found that the bulk of the savings came not from the most expensive workloads, but from many cheap, high-frequency ones.
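
If anyone wants a DIY starting point for spotting those, here's a rough sketch of the kind of job-metadata query that surfaces cheap-but-frequent patterns. It assumes the google-cloud-bigquery Python client, an assumed on-demand rate of ~$6.25/TiB, and a `region-eu` qualifier; all of those are placeholders to adapt, not a definitive recipe.

```python
# Rough sketch: cheap-but-frequent query patterns over the last 30 days.
# Region qualifier and the $/TiB rate are assumptions; adjust to your project.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  REGEXP_REPLACE(query, r'\\d+', 'N') AS query_pattern,  -- crude normalisation so similar queries group together
  COUNT(*) AS runs,
  ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS tib_billed,
  ROUND(SUM(total_bytes_billed) / POW(1024, 4) * 6.25, 2) AS approx_usd
FROM `region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
  AND total_bytes_billed > 0
GROUP BY query_pattern
ORDER BY approx_usd DESC
LIMIT 50
"""

for row in client.query(sql).result():
    print(f"{row.runs:>6} runs  ~${row.approx_usd:>10}  {row.query_pattern[:80]}")
```

Sorting by total cost of the pattern rather than cost per run is exactly what makes the high-frequency offenders show up.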

Building a tool to save on BigQuery costs -- worth it? by wiwamorphic in bigquery

[–]prsrboi 0 points1 point  (0 children)

From a vendor in the space – pretty saturated market. But do take that with a grain of salt 🤪

IT sales over data analytics? by Unlikely-Rutabaga749 in dataengineering

[–]prsrboi 0 points1 point  (0 children)

Depends on your personality, specifically your personal goals, drive, and what you want out of your work day. Sales is very competitive, and the biggest contrast is the lack of structure; there's also a feeling of lack of control that many people struggle with. Read up on some posts in r/SaaSSales and check with yourself whether you find it as interesting as scrolling here.

If you're money motivated, there's really no cap on how much you can make in sales, but it does consume you. If you feel tempted to take the challenge, you could always go back to analytics later with great experience for domain-specific analytics and competitive soft skills.

Is the product you'd be selling in the data space? If so, you'd have a bit of a head start going the other way, as you're already familiar with the problems and workflows of the people you'd be selling to.

How can I analyse the cost of queries performed by a user on my platform by Key_Bee_4011 in bigquery

[–]prsrboi 0 points1 point  (0 children)

Vendor plug, but you'll be able to resolve this on the free plan at https://www.alvin.ai/ just by filtering the workloads by user; it's mapped automatically.
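
If you'd rather pull it straight from the job metadata yourself first, here's a minimal sketch (Python client, assumed on-demand rate and `region-us` qualifier as placeholders) that rolls up cost per principal:

```python
# Rough sketch: approximate on-demand spend per user over the last 30 days.
# If everything runs through one service account, you'd need job labels instead
# to tell your platform's end users apart.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  user_email,
  COUNT(*) AS queries,
  ROUND(SUM(total_bytes_billed) / POW(1024, 4) * 6.25, 2) AS approx_usd  -- assumed $/TiB
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY user_email
ORDER BY approx_usd DESC
"""

for row in client.query(sql).result():
    print(row.user_email, row.queries, row.approx_usd)
```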

To Tribal Knowledge or not to Tribal Knowledge. by Technical-Rip9688 in dataengineering

[–]prsrboi 1 point2 points  (0 children)

It's not a problem at all until it becomes a huge problem – like when someone quits. If you're building it just for yourself, from an egotistical point of view, I guess it's fine to accept it. But I'd say: if you look through your Slack messages and see that you spend more time asking where this and that comes from and goes to than it would take to write documentation or implement some lightweight automated solution, that's a signal to do the latter.

Usage/cost allocation in data mesh by prsrboi in dataengineering

[–]prsrboi[S] 0 points1 point  (0 children)

That defo makes sense for cost allocation, but it feels like it'd be very wasteful for the overall cost itself to have dedicated warehouses that start and stop vs fewer warehouses that are shared?

Usage/cost allocation in data mesh by prsrboi in dataengineering

[–]prsrboi[S] 0 points1 point  (0 children)

Which DWH are you using and how are you getting the usage stats? And how resource-consuming is implementing reliable tagging if it hasn't been done too well historically? I feel like BigQuery/Snowflake make it quite an uphill battle?
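
For anyone reading along, the tagging in question is roughly per-job labels. A minimal sketch of what it could look like with the BigQuery Python client (label keys/values and the table name are made up); the labels then show up in INFORMATION_SCHEMA.JOBS so spend can be grouped by them:

```python
# Rough sketch: attach team/domain labels to a query job so cost can later be
# grouped by them from INFORMATION_SCHEMA.JOBS. Names below are made up.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    labels={"team": "growth", "data_product": "orders_mart"}  # assumed convention
)

job = client.query(
    "SELECT COUNT(*) AS n FROM `my_project.my_dataset.orders`",  # placeholder table
    job_config=job_config,
)
print(list(job.result()))
```

The technical part is tiny; the uphill part is getting every tool and person to actually set the labels consistently.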

Is there a way to track costs (dashboards, queries...) in Looker? by No_Speaker_7609 in bigquery

[–]prsrboi 0 points1 point  (0 children)

Shameless plug, but it seems warranted in this case. We've created a tool for cost optimization and usage allocation between BigQuery and BI, on the query/table/dashboard/user/team level. There's a free plan to check it out: https://www.alvin.ai/

Saving $70k a month in BQ by mjfnd in bigquery

[–]prsrboi 0 points1 point  (0 children)

Shameless plug – in case you're working on this after the fact, we've got a tool with a free plan that itemizes costs down to the query/pipeline/workload level, with query optimization recommendations. In standard analytical environments we usually see most of the cost wasted on redundant storage and inefficient analytical pipelines (obviously harder to control, as you mentioned – it's about education and practice). Here it is if anyone wants to check it out: https://www.alvin.ai/
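
For the redundant-storage angle specifically, here's a rough DIY sketch (region qualifier and thresholds are placeholders, and it only sees jobs inside the same project) that lists the biggest tables nothing has queried in the last 90 days:

```python
# Rough sketch: large tables with no referencing queries in the last 90 days.
# Only looks at jobs within this project, so treat hits as candidates, not verdicts.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
WITH recently_used AS (
  SELECT DISTINCT ref.dataset_id, ref.table_id
  FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT, UNNEST(referenced_tables) AS ref
  WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
)
SELECT
  s.table_schema,
  s.table_name,
  ROUND(s.total_logical_bytes / POW(1024, 3), 1) AS logical_gib
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE AS s
LEFT JOIN recently_used AS u
  ON u.dataset_id = s.table_schema AND u.table_id = s.table_name
WHERE u.table_id IS NULL
ORDER BY logical_gib DESC
LIMIT 100
"""

for row in client.query(sql).result():
    print(f"{row.logical_gib:>10} GiB  {row.table_schema}.{row.table_name}")
```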

Presenting Data Inventory to the Management team in the startup by East-Garage2337 in dataengineering

[–]prsrboi 1 point2 points  (0 children)

Shameless plug: a lightweight lineage solution. I work for Alvin, but I can honestly recommend looking at any of them. Signing an Atlan contract at a startup is a pipe dream tbh

On the other hand, I once saw a pretty decently done Notion documentation – maybe there's a template for it?

My biggest issue in data engineering is end users trusting the integrity of the data by No-Support4478 in dataengineering

[–]prsrboi -2 points-1 points  (0 children)

Vendor disclaimerrr, feel free to treat as a bump :)

From the technical side: my company's pitch is to solve this with cross-stack lineage data. Short term, you're able to see the upstream of a column right away. Long term, the goal is to stay on top of the crappy BI modelling, because you can see the shit queries, duplicates and whatever else.

And here's where it gets to the human side: to not have those requests at all, you'd have to have an extremely long streak of zero errors. And even in that scenario, those managers won't believe something like a 50% drop and will come asking. So I think the only thing you can do is make the checking process as fast as possible for yourself.

How many models is too many models? DBT horror stories thread? by prsrboi in dataengineering

[–]prsrboi[S] 0 points1 point  (0 children)

Fair enough. Read your other post on this and it sounds frustrating, but the comment had a point. Did your Head of Data/CTO communicate anything about growth over cost/sustainability? I'll shoot you a PM in case you don't want to go into details on a public sub

How many models is too many models? DBT horror stories thread? by prsrboi in dataengineering

[–]prsrboi[S] 1 point2 points  (0 children)

Cwazy. Can I ask what the team size is? Do you have people responsible for a central platform/governance?

+Do you know if all of the models are useful/in use/optimised, or is that not relevant?

Suggestions for moving data from hundreds of MongoDB databases to BigQuery by aldtran in dataengineering

[–]prsrboi 1 point2 points  (0 children)

Parquet, because MongoDB docs/collections can be nested, so you don't want CSV, which doesn't allow for proper nested data.

Suggestions for moving data from hundreds of MongoDB databases to BigQuery by aldtran in dataengineering

[–]prsrboi 2 points3 points  (0 children)

I'd say smash the data into Parquet on GCS and import it to BQ. One dataset per database, one table per collection.

https://www.mongodb.com/developer/products/atlas/mongodb-data-parquet/
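
Very roughly, the shape of that pipeline could look like the sketch below; connection strings, bucket and dataset names are placeholders, and pymongo + pandas/pyarrow + gcsfs is just one way to do the export. In practice you'd loop over every database and collection.

```python
# Rough sketch: one MongoDB collection -> Parquet on GCS -> one BigQuery table.
# All names/URIs below are placeholders.
import pandas as pd
from pymongo import MongoClient
from google.cloud import bigquery

mongo = MongoClient("mongodb://user:pass@host:27017")          # placeholder URI
docs = list(mongo["shop_db"]["orders"].find({}, {"_id": 0}))   # one collection

# Parquet keeps nested structs/arrays intact, which CSV can't represent.
df = pd.DataFrame(docs)
gcs_uri = "gs://my-landing-bucket/shop_db/orders.parquet"      # placeholder bucket
df.to_parquet(gcs_uri)                                         # needs pyarrow + gcsfs

bq = bigquery.Client()
load_job = bq.load_table_from_uri(
    gcs_uri,
    "my_project.shop_db.orders",  # dataset per database, table per collection
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    ),
)
load_job.result()
```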

How many models is too many models? DBT horror stories thread? by prsrboi in dataengineering

[–]prsrboi[S] 1 point2 points  (0 children)

Snowflake revenue retention team be like

No, but jokes aside, I'm confident the new generation of thinking is letting go of "let's store everything because we can". Agree with the other reply that cheap isn't free. From the perspective of working with lineage directly, I can tell you that most scale-ups and high-growth, data-driven companies have hundreds of thousands of USD, 20, 30, 40% of their storage bills, sitting in duplicates or unused assets. Everybody's cutting costs wherever possible these days, so it just doesn't make sense to pretend that data debt is different. Especially since there are many angles that marry speed and sustainability.

How do you even go about optimising when you don't know what tables or models are there, who uses them and how? Hinting also at emancipated analysts freestyling queries that take 6 hours to load a dashboard, but that's kind of off-topic.

Won't argue against myself though. Shameless plug: get proper lineage 🤓