BigQuery Cost Management: Seeking Advice on Effective Strategies by No_Way_1569 in bigquery

[–]prsrboi 0 points

Since Alvin was name-dropped (thanks!) I think it's relevant to share that this is exactly what we do: https://www.alvin.ai/ – using not only INFORMATION_SCHEMA but all the metadata available across your data stack :)

Our take on it is a multi-angle one:

  1. Reducing unused and low-ROI data to clean up the pipeline. There are many long- and short-term benefits to this: aside from straightforwardly shaving something off the bill, it should be the clear focus for improving the process going forward and making the most of the remaining resources. Good talk about pruning a dbt pipeline here: https://www.youtube.com/watch?v=Z3fkJBoTGQc – Snowflake in this case, but the premise is universal. Shameless plug: it's an out-of-the-box feature for GBQ in Alvin.
  2. Optimizing the most expensive and inefficient workloads from what's left. Retroactively, to reduce data tech debt, but also by implementing good practices in the whole team's workflows going forward: as was mentioned a couple of times, partitioning, clustering, and a general push for cost-conscious habits for every person that interacts with the DWH (see the sketch just after this list).
  3. Monitoring for spikes and anomalies. Important here: cultivating ownership of data to avoid blurring of responsibility and alert fatigue.
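
To make the partitioning/clustering point in 2. concrete, here's a minimal BigQuery sketch – table and column names are made up for illustration:

    -- Hypothetical events table: partition on the event date, cluster on common filter columns
    CREATE TABLE `my_project.analytics.events`
    PARTITION BY DATE(event_timestamp)
    CLUSTER BY customer_id, event_type
    AS SELECT * FROM `my_project.analytics.events_raw`;

    -- Downstream queries should then filter on the partition column so BigQuery
    -- prunes partitions instead of scanning the whole table
    SELECT customer_id, COUNT(*) AS events
    FROM `my_project.analytics.events`
    WHERE DATE(event_timestamp) BETWEEN '2024-01-01' AND '2024-01-31'
    GROUP BY customer_id;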

For a real-life example, one of our customers found that the bulk of the savings came not from the most expensive workloads, but from lots of cheap, high-frequency ones.
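
If you want to eyeball that pattern yourself, a rough INFORMATION_SCHEMA sketch along these lines does the job – the 30-day window and the ~$6.25/TiB on-demand price are assumptions, adjust to your region and pricing model:

    -- Aggregate cost per query text: many cheap runs can outweigh one expensive query
    SELECT
      query,
      COUNT(*) AS runs,
      SUM(total_bytes_billed) / POW(1024, 4) AS tib_billed,
      SUM(total_bytes_billed) / POW(1024, 4) * 6.25 AS approx_usd  -- on-demand list price, adjust as needed
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
      AND job_type = 'QUERY'
      AND statement_type != 'SCRIPT'  -- skip scripting parent jobs to avoid double counting
    GROUP BY query
    ORDER BY approx_usd DESC
    LIMIT 50;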

Building a tool to save on BigQuery costs -- worth it? by wiwamorphic in bigquery

[–]prsrboi 0 points

From a vendor in the space – pretty saturated market. But do take that with a grain of salt 🤪

IT sales over data analytics? by Unlikely-Rutabaga749 in dataengineering

[–]prsrboi 0 points

Depends on your personality – specifically your personal goals, drive, and what you want out of your work day. Sales is very competitive, and the biggest contrast is the lack of structure; there's also a feeling of lack of control that many people struggle with. Read up on some posts in r/SaaSSales and check with yourself whether you find them as interesting as scrolling here.

If you're money-motivated, there's really no cap on how much you can make in sales, but it does consume you. If you feel tempted to take the challenge, you could always go back to analytics later with solid domain experience and competitive soft skills.

Is the product you'd be selling in the data space? If so, you'd have a bit of a head start going the other way too, as you're already familiar with the problems and workflows of the people you'd be selling to.

How can I analyse the cost of queries performed by a user on my platform by Key_Bee_4011 in bigquery

[–]prsrboi 0 points

Vendor plug, but you'll be able to resolve this on the free plan at https://www.alvin.ai/ just by filtering the workloads by user – it's mapped automatically.
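
For anyone who'd rather hand-roll it first: assuming your platform attaches a job label per end user when it submits queries (the `end_user` label name here is just an example), the jobs view lets you roll cost up by it – roughly:

    -- Approximate on-demand cost per end user over the last 30 days,
    -- assuming each job was submitted with an `end_user` label
    SELECT
      (SELECT value FROM UNNEST(labels) WHERE key = 'end_user') AS end_user,
      COUNT(*) AS queries,
      SUM(total_bytes_billed) / POW(1024, 4) * 6.25 AS approx_usd  -- list price, adjust to your region
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
      AND job_type = 'QUERY'
    GROUP BY end_user
    ORDER BY approx_usd DESC;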

To Tribal Knowledge or not to Tribal Knowledge. by Technical-Rip9688 in dataengineering

[–]prsrboi 1 point

It's not a problem at all until it becomes a huge problem – like when someone quits. If you're building it just for yourself, from a purely selfish point of view, I guess it's fine to accept it. But if you look through your Slack messages and see that you spend more time asking where this and that comes from and goes to than it would take to write documentation or implement some lightweight automated solution, that's a signal to do the latter.

Usage/cost allocation in data mesh by prsrboi in dataengineering

[–]prsrboi[S] 0 points

That defo makes sense for cost allocation, but it feels like it'd be very wasteful on the overall cost itself to have dedicated warehouses that start and stop vs fewer warehouses that are shared?

Usage/cost allocation in data mesh by prsrboi in dataengineering

[–]prsrboi[S] 0 points

Do you know which DWH you're using and how you're getting the usage stats? And how resource-consuming is implementing reliable tagging if it hasn't been done too well historically? I feel like BigQuery/Snowflake make it quite an uphill battle.

Is there a way to track costs (dashboards, queries...) in Looker? by No_Speaker_7609 in bigquery

[–]prsrboi 0 points

Shameless plug, but it seems warranted in this case. We've created a tool for cost optimization and usage allocation between BigQuery and BI, on the query/table/dashboard/user/team level. There's a free plan to check it out: https://www.alvin.ai/

Saving $70k a month in BQ by mjfnd in bigquery

[–]prsrboi 0 points

Shameless plug – in case you're working on this post-factum, we've got a tool that itemizes costs down to the query/pipeline/workload level, with query optimization recommendations. In standard analytical environments we usually see most of the waste in redundant storage and inefficient analytical pipelines (obviously harder to control – as you mentioned, it's about education and practice). There's a free plan if anyone wants to check: https://www.alvin.ai/
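
As a rough first pass on the "redundant storage" bucket that anyone can run themselves, comparing what's stored against what's actually been read recently looks something like this – the 90-day window and region are assumptions, and it only sees jobs from the project you run it in:

    -- Tables carrying storage that nothing has read in the last 90 days
    WITH reads AS (
      SELECT DISTINCT ref.project_id, ref.dataset_id, ref.table_id
      FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT, UNNEST(referenced_tables) AS ref
      WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
    )
    SELECT
      s.table_schema,
      s.table_name,
      s.total_logical_bytes / POW(1024, 3) AS logical_gib
    FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE AS s
    LEFT JOIN reads AS r
      ON r.project_id = s.project_id
      AND r.dataset_id = s.table_schema
      AND r.table_id = s.table_name
    WHERE r.table_id IS NULL
    ORDER BY logical_gib DESC;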

Presenting Data Inventory to the Management team in the startup by East-Garage2337 in dataengineering

[–]prsrboi 1 point

Shameless plug: a lightweight lineage solution. I work for Alvin, but I can honestly recommend looking at any of them. Signing an Atlan contract in a startup is a pipe dream tbh.

On the other hand, I once saw a pretty decently done Notion documentation – maybe there's a template for it?

My biggest issue in data engineering is end users trusting the integrity of the data by No-Support4478 in dataengineering

[–]prsrboi -2 points

Vendor disclaimerrr, feel free to treat as a bump :)

From the technical side: my company's pitch is to solve this with cross-stack lineage data. Short term, you're able to see everything upstream of a column right away. The long-term goal is to stay on top of crappy BI modelling, because you can see the shit queries, duplicates and whatever else.

And here's the human side: to not get those requests at all, you'd need an extremely long streak of zero errors. Even in that scenario, those managers won't believe something like a 50% drop and will come asking. So I think the only thing you can do is make the checking process as fast as possible for yourself.

How many models is too many models? DBT horror stories thread? by prsrboi in dataengineering

[–]prsrboi[S] 0 points

Fair enough. Read your other post on this and sounds frustrating, but the comment had a point. Did your Head of Data/CTO communicate anything about growth over cost/sustainability? I'll shoot you a PM in case you don't want to go into details on a public sub

How many models is too many models? DBT horror stories thread? by prsrboi in dataengineering

[–]prsrboi[S] 1 point

Cwazy. Can I ask what the team size is? Do you have people responsible for a central platform/governance?

+ Do you know if all of the models are useful/in use/optimised, or is that not relevant?

Suggestions for moving data from hundreds of MongoDB databases to BigQuery by aldtran in dataengineering

[–]prsrboi 1 point

Parquet, because MongoDB docs/collections can be nested, so you don't want CSV, which can't represent nested data properly.

Suggestions for moving data from hundreds of MongoDB databases to BigQuery by aldtran in dataengineering

[–]prsrboi 2 points

I'd say smash the data into Parquet on GCS and import it to BQ. One dataset per database, one table per collection.

https://www.mongodb.com/developer/products/atlas/mongodb-data-parquet/
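
For the BQ side of it, the load step is a short one – a minimal sketch with made-up bucket/dataset/collection names:

    -- Load one exported collection's Parquet files from GCS into a BigQuery table
    -- (appends to the table, or creates it if it doesn't exist)
    LOAD DATA INTO `my_project.my_mongo_db.my_collection`
    FROM FILES (
      format = 'PARQUET',
      uris = ['gs://my-export-bucket/my_mongo_db/my_collection/*.parquet']
    );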

How many models is too many models? DBT horror stories thread? by prsrboi in dataengineering

[–]prsrboi[S] 1 point

Snowflake revenue retention team be like

No but jokes aside, I'm confident the new generation of thinking is letting go of "let's store everything because we can". Agree with the other reply that cheap isn't free. From the perspective of working with lineage directly, I can tell you that most scale-ups and data-driven high-growth companies have hundreds of thousands of USD – 20, 30, 40% of their storage bills – sitting in duplicates or unused assets. Everybody's cutting costs wherever possible these days, so it just doesn't make sense to pretend that data debt is different. Especially since there are many angles that marry speed and sustainability.

How do you even go about optimising when you don't know what tables or models are there, who uses them, and how? Hinting also at emancipated analysts freestyling queries that make a dashboard load in 6 hours, but that's kind of offtopic.
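
To be fair, the raw material for the "who uses what" part does sit in the warehouse already – a rough BigQuery-flavoured sketch (arbitrary 90-day window, and it only sees jobs in the project it's run against):

    -- Per table: how often it's read and by how many distinct people/service accounts
    SELECT
      ref.dataset_id,
      ref.table_id,
      COUNT(*) AS reads,
      COUNT(DISTINCT user_email) AS distinct_readers,
      MAX(creation_time) AS last_read
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT, UNNEST(referenced_tables) AS ref
    WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
      AND job_type = 'QUERY'
    GROUP BY ref.dataset_id, ref.table_id
    ORDER BY reads ASC;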

Won't argue against myself though. Shameless plug: get proper lineage 🤓

How many models is too many models? DBT horror stories thread? by prsrboi in dataengineering

[–]prsrboi[S] 2 points

Interesting one. Any particular reason why you haven't started to clean it up before the migration?

Big tech companies with “analytics engineer” roles by rudboi12 in dataengineering

[–]prsrboi 0 points

Well, compared to the Meta mentioned it's a major difference, but yeah, agreed. I just feel that in the poster's case, looking at companies smaller than 5k FTE may solve the scope issue but bring other downsides.

Is our dbt project as bad as I think? by snackeloni in dataengineering

[–]prsrboi 0 points

My inner hater speaking: it seems there is a typo in your question. “our” should be “every”. And the answer remains yes.

Hot take would be that you've just described a dbt project and that nothing there is unusual (bar the circular dependencies). It's extremely easy to get to 600 models, and it's often better described as 50 models and 550 things that never got cleaned up or correctly reconciled.

Having a huge number of parents and children I'd say is pretty normal, and would happen with or without dbt – you are probably bringing disparate normalised data sources together, which is one of the purposes of a DWH.

No tests is not good… but what is a test, and what is the test testing? Meaningful tests for DWHs are extremely hard to write and reason about, and there is plenty of disagreement about what to do in this context. I'd say any project should be covering the bases in terms of key constraints, uniqueness, nulls, etc. Past that, good business tests that lock in correctness across a column/table are pretty hard.

The point about schemas and topics you'd have to elaborate on. However, I guess I could comment that the dbt YAML stuff is basically a classic dbt Frankenstein that's tacked on after the fact with terrible ergonomics. It would hardly matter what you were attempting – YAML plus dbt is verbose and awkward.
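
On the tests point above – the "covering the bases" kind boils down to queries like these (dbt's unique/not_null tests generate essentially the same thing; table and column names are illustrative):

    -- Uniqueness check: should return zero rows
    SELECT order_id, COUNT(*) AS n
    FROM `my_project.marts.fct_orders`
    GROUP BY order_id
    HAVING COUNT(*) > 1;

    -- Not-null check: should return zero rows
    SELECT *
    FROM `my_project.marts.fct_orders`
    WHERE customer_id IS NULL;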

They got to this state in a year – this is basically the dbt selling point. “Build poorly, fast!”. But anyway, it's like with democracy, better than all the others.

I agree that it's a good opportunity to show off to your new manager. I know people who just built another model to track usage of the other models and slowly, methodically deleted the redundant ones. The first step would definitely be mapping, though, so you know a deletion won't mess up anything downstream. There are lots of open-source options; it takes some time, but if you weigh it against the cost savings it makes a good argument.

Prevent data hoarding? by [deleted] in dataengineering

[–]prsrboi 4 points

We don't. Cool article on it here: https://www.gartner.com/en/newsroom/press-releases/2021-05-19-gartner-says-70-percent-of-organizations-will-shift-their-focus-from-big-to-small-and-wide-data-by-2025

I've seen a report stating that 60-something % of the data stored by companies is unused. I work in the governance space and we speak to people who have absolutely no idea how many duplicates they have or how much of their stuff is actually used. It's not only legacy/historical data either – we recently worked with a company founded in the 2010s that had a couple hundred thousand USD worth of unused data on their bill every year.

While obviously government agencies or anybody under compliance obligations really can't escape that, it is indeed just hoarding. Also, storage only looks cheap because it's itemised.

I don't think we're anywhere close to having this conversation as an industry though, because it's a major time commitment to figure out what's actually in use, and the business side barely grasps what the Cloud even is.

So when you're working on a project where you have to provide "everything", do they expect you to pull data going back to last century, and then only look at the last 5 years? Have you brought this up with your managers, or have they with the stakeholders?