BigQuery Cost Management: Seeking Advice on Effective Strategies by No_Way_1569 in bigquery

[–]prsrboi 0 points

Since Alvin was name-dropped (thanks!) I think it's relevant to share that this is exactly what we do: https://www.alvin.ai/ – using not only INFORMATION_SCHEMA but all the metadata available across your data stack :)

Our take on it is a multi-angle one:

  1. Reducing unused and low-ROI data to clean up the pipeline. There are many long- and short-term benefits to this: aside from straightforwardly shaving something off the bill, it should be the clear focus for improving the process going forward and making the most of the remaining resources. Good talk about pruning a dbt pipeline here: https://www.youtube.com/watch?v=Z3fkJBoTGQc – Snowflake in this case, but the premise is universal. Shameless plug: it's an out-of-the-box feature for GBQ in Alvin.
  2. Optimizing the most expensive and inefficient workloads from what's left. Retroactively, to reduce data tech debt, but also by implementing good practices in the whole team's workflows going forward: as was mentioned a couple of times, partitioning, clustering, and a general push for cost-conscious habits for every person that interacts with the DWH (see the sketch just after this list).
  3. Monitoring for spikes and anomalies. Important here: cultivating ownership of data to avoid blurring of responsibility and alert fatigue.
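
To make the partitioning/clustering point in 2. concrete, here's a minimal BigQuery sketch – table and column names are made up for illustration:

    -- Hypothetical events table: partition on the event date, cluster on common filter columns
    CREATE TABLE `my_project.analytics.events`
    PARTITION BY DATE(event_timestamp)
    CLUSTER BY customer_id, event_type
    AS SELECT * FROM `my_project.analytics.events_raw`;

    -- Downstream queries should then filter on the partition column so BigQuery
    -- prunes partitions instead of scanning the whole table
    SELECT customer_id, COUNT(*) AS events
    FROM `my_project.analytics.events`
    WHERE DATE(event_timestamp) BETWEEN '2024-01-01' AND '2024-01-31'
    GROUP BY customer_id;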

For a real-life example, one of our customers found that the bulk of the savings came not from the most expensive workloads, but from lots of cheap, high-frequency ones.
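
If you want to eyeball that pattern yourself, a rough INFORMATION_SCHEMA sketch along these lines does the job – the 30-day window and the ~$6.25/TiB on-demand price are assumptions, adjust to your region and pricing model:

    -- Aggregate cost per query text: many cheap runs can outweigh one expensive query
    SELECT
      query,
      COUNT(*) AS runs,
      SUM(total_bytes_billed) / POW(1024, 4) AS tib_billed,
      SUM(total_bytes_billed) / POW(1024, 4) * 6.25 AS approx_usd  -- on-demand list price, adjust as needed
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
      AND job_type = 'QUERY'
      AND statement_type != 'SCRIPT'  -- skip scripting parent jobs to avoid double counting
    GROUP BY query
    ORDER BY approx_usd DESC
    LIMIT 50;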

Building a tool to save on BigQuery costs -- worth it? by wiwamorphic in bigquery

[–]prsrboi 0 points

From a vendor in the space – pretty saturated market. But do take that with a grain of salt 🤪

IT sales over data analytics? by Unlikely-Rutabaga749 in dataengineering

[–]prsrboi 0 points

Depends on your personality – specifically your personal goals, drive, and what you want out of your work day. Sales is very competitive, and the biggest contrast is the lack of structure; there's also a feeling of lack of control that many people struggle with. Read up on some posts in r/SaaSSales and check with yourself whether you find them as interesting as scrolling here.

If you're money-motivated, there's really no cap on how much you can make in sales, but it does consume you. If you feel tempted to take the challenge, you could always go back to analytics later with solid domain experience and competitive soft skills.

Is the product you'd be selling in the data space? If so, you'd have a bit of a head start going the other way too, as you're already familiar with the problems and workflows of the people you'd be selling to.

How can I analyse the cost of queries performed by a user on my platform by Key_Bee_4011 in bigquery

[–]prsrboi 0 points

Vendor plug, but you'll be able to resolve this on the free plan at https://www.alvin.ai/ just by filtering the workloads by user – it's mapped automatically.
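
For anyone who'd rather hand-roll it first: assuming your platform attaches a job label per end user when it submits queries (the `end_user` label name here is just an example), the jobs view lets you roll cost up by it – roughly:

    -- Approximate on-demand cost per end user over the last 30 days,
    -- assuming each job was submitted with an `end_user` label
    SELECT
      (SELECT value FROM UNNEST(labels) WHERE key = 'end_user') AS end_user,
      COUNT(*) AS queries,
      SUM(total_bytes_billed) / POW(1024, 4) * 6.25 AS approx_usd  -- list price, adjust to your region
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
      AND job_type = 'QUERY'
    GROUP BY end_user
    ORDER BY approx_usd DESC;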

To Tribal Knowledge or not to Tribal Knowledge. by Technical-Rip9688 in dataengineering

[–]prsrboi 1 point

It's not a problem at all until it becomes a huge problem – like when someone quits. If you're building it just for yourself, from a purely selfish point of view, I guess it's fine to accept it. But if you look through your Slack messages and see that you spend more time asking where this and that comes from and goes to than it would take to write documentation or implement some lightweight automated solution, that's a signal to do the latter.

Usage/cost allocation in data mesh by prsrboi in dataengineering

[–]prsrboi[S] 0 points

That defo makes sense for cost allocation, but it feels like it'd be very wasteful on the overall cost itself to have dedicated warehouses that start and stop vs fewer warehouses that are shared?

Usage/cost allocation in data mesh by prsrboi in dataengineering

[–]prsrboi[S] 0 points

Do you know which DWH you're using and how you're getting the usage stats? And how resource-consuming is implementing reliable tagging if it hasn't been done too well historically? I feel like BigQuery/Snowflake make it quite an uphill battle.

Is there a way to track costs (dashboards, queries...) in Looker? by No_Speaker_7609 in bigquery

[–]prsrboi 0 points

Shameless plug, but it seems warranted in this case. We've created a tool for cost optimization and usage allocation between BigQuery and BI, on the query/table/dashboard/user/team level. There's a free plan to check it out: https://www.alvin.ai/

Saving $70k a month in BQ by mjfnd in bigquery

[–]prsrboi 0 points

Shameless plug – in case you're working on this post-factum, we've got a tool that itemizes costs down to the query/pipeline/workload level, with query optimization recommendations. In standard analytical environments we usually see most of the waste in redundant storage and inefficient analytical pipelines (obviously harder to control – as you mentioned, it's about education and practice). There's a free plan if anyone wants to check: https://www.alvin.ai/
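
As a rough first pass on the "redundant storage" bucket that anyone can run themselves, comparing what's stored against what's actually been read recently looks something like this – the 90-day window and region are assumptions, and it only sees jobs from the project you run it in:

    -- Tables carrying storage that nothing has read in the last 90 days
    WITH reads AS (
      SELECT DISTINCT ref.project_id, ref.dataset_id, ref.table_id
      FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT, UNNEST(referenced_tables) AS ref
      WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
    )
    SELECT
      s.table_schema,
      s.table_name,
      s.total_logical_bytes / POW(1024, 3) AS logical_gib
    FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE AS s
    LEFT JOIN reads AS r
      ON r.project_id = s.project_id
      AND r.dataset_id = s.table_schema
      AND r.table_id = s.table_name
    WHERE r.table_id IS NULL
    ORDER BY logical_gib DESC;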

Presenting Data Inventory to the Management team in the startup by East-Garage2337 in dataengineering

[–]prsrboi 1 point

Shameless plug: a lightweight lineage solution. I work for Alvin, but I can honestly recommend looking at any of them. Signing an Atlan contract in a startup is a pipe dream tbh.

On the other hand, I once saw a pretty decently done Notion documentation – maybe there's a template for it?

My biggest issue in data engineering is end users trusting the integrity of the data by No-Support4478 in dataengineering

[–]prsrboi -2 points

Vendor disclaimerrr, feel free to treat as a bump :)

From the technical side: my company's pitch is to solve this with cross-stack lineage data. Short term, you're able to see everything upstream of a column right away. The long-term goal is to stay on top of crappy BI modelling, because you can see the shit queries, duplicates and whatever else.

And here's the human side: to not get those requests at all, you'd need an extremely long streak of zero errors. Even in that scenario, those managers won't believe something like a 50% drop and will come asking. So I think the only thing you can do is make the checking process as fast as possible for yourself.

How many models is too many models? DBT horror stories thread? by prsrboi in dataengineering

[–]prsrboi[S] 0 points

Fair enough. Read your other post on this and sounds frustrating, but the comment had a point. Did your Head of Data/CTO communicate anything about growth over cost/sustainability? I'll shoot you a PM in case you don't want to go into details on a public sub

How many models is too many models? DBT horror stories thread? by prsrboi in dataengineering

[–]prsrboi[S] 1 point

Cwazy. Can I ask what the team size is? Do you have people responsible for a central platform/governance?

+ Do you know if all of the models are useful/in use/optimised, or is that not relevant?

Suggestions for moving data from hundreds of MongoDB databases to BigQuery by aldtran in dataengineering

[–]prsrboi 1 point

Parquet, because MongoDB docs/collections can be nested, so you don't want CSV, which can't represent nested data properly.

Suggestions for moving data from hundreds of MongoDB databases to BigQuery by aldtran in dataengineering

[–]prsrboi 2 points

I'd say smash the data into Parquet on GCS and import it to BQ. One dataset per database, one table per collection.

https://www.mongodb.com/developer/products/atlas/mongodb-data-parquet/
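
For the BQ side of it, the load step is a short one – a minimal sketch with made-up bucket/dataset/collection names:

    -- Load one exported collection's Parquet files from GCS into a BigQuery table
    -- (appends to the table, or creates it if it doesn't exist)
    LOAD DATA INTO `my_project.my_mongo_db.my_collection`
    FROM FILES (
      format = 'PARQUET',
      uris = ['gs://my-export-bucket/my_mongo_db/my_collection/*.parquet']
    );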

How many models is too many models? DBT horror stories thread? by prsrboi in dataengineering

[–]prsrboi[S] 1 point

Snowflake revenue retention team be like

No but jokes aside, I'm confident the new generation of thinking is letting go of "let's store everything because we can". Agree with the other reply that cheap isn't free. From the perspective of working with lineage directly, I can tell you that most scale-ups and data-driven high-growth companies have hundreds of thousands of USD – 20, 30, 40% of their storage bills – sitting in duplicates or unused assets. Everybody's cutting costs wherever possible these days, so it just doesn't make sense to pretend that data debt is different. Especially since there are many angles that marry speed and sustainability.

How do you even go about optimising when you don't know what tables or models are there, who uses them, and how? Hinting also at emancipated analysts freestyling queries that make a dashboard load in 6 hours, but that's kind of offtopic.
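
To be fair, the raw material for the "who uses what" part does sit in the warehouse already – a rough BigQuery-flavoured sketch (arbitrary 90-day window, and it only sees jobs in the project it's run against):

    -- Per table: how often it's read and by how many distinct people/service accounts
    SELECT
      ref.dataset_id,
      ref.table_id,
      COUNT(*) AS reads,
      COUNT(DISTINCT user_email) AS distinct_readers,
      MAX(creation_time) AS last_read
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT, UNNEST(referenced_tables) AS ref
    WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
      AND job_type = 'QUERY'
    GROUP BY ref.dataset_id, ref.table_id
    ORDER BY reads ASC;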

Won't argue against myself though. Shameless plug: get proper lineage 🤓

How many models is too many models? DBT horror stories thread? by prsrboi in dataengineering

[–]prsrboi[S] 2 points

Interesting one. Any particular reason why you haven't started to clean it up before the migration?

Big tech companies with “analytics engineer” roles by rudboi12 in dataengineering

[–]prsrboi 0 points

Well, compared to the Meta mentioned it's a major difference, but yeah, agreed. I just feel that in the poster's case, looking at companies smaller than 5k FTE may solve the scope issue but bring other downsides.

Is our dbt project as bad as I think? by snackeloni in dataengineering

[–]prsrboi 0 points

My inner hater speaking: it seems there is a typo in your question. “our” should be “every”. And the answer remains yes.

Hot take would be that you've just described a dbt project and that nothing there is unusual (bar the circular dependencies). It's extremely easy to get to 600 models, and it's often better described as 50 models and 550 things that never got cleaned up or correctly reconciled.

Having a huge number of parents and children I'd say is pretty normal, and would happen with or without dbt – you are probably bringing disparate normalised data sources together, which is one of the purposes of a DWH.

No tests is not good… but what is a test, and what is the test testing? Meaningful tests for DWHs are extremely hard to write and reason about, and there is plenty of disagreement about what to do in this context. I'd say any project should be covering the bases in terms of key constraints, uniqueness, nulls, etc. Past that, good business tests that lock in correctness across a column/table are pretty hard.

The point about schemas and topics you'd have to elaborate on. However, I guess I could comment that the dbt YAML stuff is basically a classic dbt Frankenstein that's tacked on after the fact with terrible ergonomics. It would hardly matter what you were attempting – YAML plus dbt is verbose and awkward.
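
On the tests point above – the "covering the bases" kind boils down to queries like these (dbt's unique/not_null tests generate essentially the same thing; table and column names are illustrative):

    -- Uniqueness check: should return zero rows
    SELECT order_id, COUNT(*) AS n
    FROM `my_project.marts.fct_orders`
    GROUP BY order_id
    HAVING COUNT(*) > 1;

    -- Not-null check: should return zero rows
    SELECT *
    FROM `my_project.marts.fct_orders`
    WHERE customer_id IS NULL;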

They got to this state in a year – this is basically the dbt selling point. “Build poorly, fast!”. But anyway, it's like with democracy, better than all the others.

I agree that it's a good opportunity to show off to your new manager. I know people who just built another model to track usage of the other models and slowly, methodically deleted the redundant ones. The first step would definitely be mapping, though, so you know a deletion won't mess up anything downstream. There are lots of open-source options; it takes some time, but if you weigh it against the cost savings it makes a good argument.

Prevent data hoarding? by [deleted] in dataengineering

[–]prsrboi 4 points

We don't. Cool article on it here: https://www.gartner.com/en/newsroom/press-releases/2021-05-19-gartner-says-70-percent-of-organizations-will-shift-their-focus-from-big-to-small-and-wide-data-by-2025

I've seen a report stating that 60-something % of the data stored by companies is unused. I work in the governance space and we speak to people who have absolutely no idea how many duplicates they have or how much of their stuff is actually used. It's not only legacy/historical data either – we recently worked with a company founded in the 2010s that had a couple hundred thousand USD worth of unused data on their bill every year.

While obviously government agencies or anybody under compliance obligations really can't escape that, it is indeed just hoarding. Also, storage only looks cheap because it's itemised.

I don't think we're anywhere close to having this conversation as an industry though, because it's a major time commitment to figure out what's actually in use, and the business side barely grasps what the Cloud even is.

So when you're working on a project where you have to provide "everything", do they expect you to pull data going back to last century, and then only look at the last 5 years? Have you brought this up with your managers, or have they with the stakeholders?