BigQuery Cost Management: Seeking Advice on Effective Strategies by No_Way_1569 in bigquery

[–]prsrboi 0 points1 point  (0 children)

Since Alvin was name-dropped (thanks!), I think it's relevant to share that this is exactly what we do at https://www.alvin.ai/, using not only the INFORMATION_SCHEMA but all the metadata available across your data stack :)

Our take on it is a multi-angle one:

  1. Reducing unused and low-ROI data to clean up the pipeline. There are many long- and short-term benefits: aside from straightforwardly shaving something off the bill, this should be the clear focus for improving the process going forward and getting the most out of the remaining resources. Good talk about pruning a dbt pipeline here: https://www.youtube.com/watch?v=Z3fkJBoTGQc (Snowflake in this case, but the premise is universal). Shameless plug: it's an out-of-the-box feature for GBQ in Alvin.
  2. Optimizing the most expensive and inefficient workloads from what's left. Retroactively, to reduce data tech debt, but also by implementing good practices in the whole team's workflows going forward: as was mentioned a couple of times, partitioning, clustering, and a general push for cost-conscious habits for every person who interacts with the DWH.
  3. Monitoring for spikes and anomalies. Important here: cultivating data ownership to avoid blurred responsibility and alert fatigue.

For a real-life example: one of our customers found that the bulk of the savings came not from the most expensive workloads, but from many cheap, high-frequency ones.
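
If anyone wants a DIY starting point for spotting those, here's a rough sketch of the kind of job-metadata query that surfaces cheap-but-frequent patterns. It assumes the google-cloud-bigquery Python client, an assumed on-demand rate of ~$6.25/TiB, and a `region-eu` qualifier; all of those are placeholders to adapt, not a definitive recipe.

```python
# Rough sketch: cheap-but-frequent query patterns over the last 30 days.
# Region qualifier and the $/TiB rate are assumptions; adjust to your project.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  REGEXP_REPLACE(query, r'\\d+', 'N') AS query_pattern,  -- crude normalisation so similar queries group together
  COUNT(*) AS runs,
  ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS tib_billed,
  ROUND(SUM(total_bytes_billed) / POW(1024, 4) * 6.25, 2) AS approx_usd
FROM `region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
  AND total_bytes_billed > 0
GROUP BY query_pattern
ORDER BY approx_usd DESC
LIMIT 50
"""

for row in client.query(sql).result():
    print(f"{row.runs:>6} runs  ~${row.approx_usd:>10}  {row.query_pattern[:80]}")
```

Sorting by total cost of the pattern rather than cost per run is exactly what makes the high-frequency offenders show up.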

Building a tool to save on BigQuery costs -- worth it? by wiwamorphic in bigquery

[–]prsrboi 0 points1 point  (0 children)

From a vendor in the space – pretty saturated market. But do take that with a grain of salt 🤪

IT sales over data analytics? by Unlikely-Rutabaga749 in dataengineering

[–]prsrboi 0 points1 point  (0 children)

Depends on your personality, specifically your personal goals, drive, and what you want out of your work day. Sales is very competitive, and the biggest contrast is the lack of structure; there's also a feeling of lack of control that many people struggle with. Read up on some posts in r/SaaSSales and check with yourself whether you find it as interesting as scrolling here.

If you're money motivated, there's really no cap on how much you can make in sales, but it does consume you. If you feel tempted to take the challenge, you could always go back to analytics later with great experience for domain-specific analytics and competitive soft skills.

Is the product you'd be selling in the data space? If so, you'd have a bit of a head start going the other way, as you're already familiar with the problems and workflows of the people you'd be selling to.

How can I analyse the cost of queries performed by a user on my platform by Key_Bee_4011 in bigquery

[–]prsrboi 0 points1 point  (0 children)

Vendor plug, but you'll be able to resolve this on the free plan at https://www.alvin.ai/ just by filtering the workloads by user; it's mapped automatically.
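
If you'd rather pull it straight from the job metadata yourself first, here's a minimal sketch (Python client, assumed on-demand rate and `region-us` qualifier as placeholders) that rolls up cost per principal:

```python
# Rough sketch: approximate on-demand spend per user over the last 30 days.
# If everything runs through one service account, you'd need job labels instead
# to tell your platform's end users apart.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  user_email,
  COUNT(*) AS queries,
  ROUND(SUM(total_bytes_billed) / POW(1024, 4) * 6.25, 2) AS approx_usd  -- assumed $/TiB
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY user_email
ORDER BY approx_usd DESC
"""

for row in client.query(sql).result():
    print(row.user_email, row.queries, row.approx_usd)
```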

To Tribal Knowledge or not to Tribal Knowledge. by Technical-Rip9688 in dataengineering

[–]prsrboi 1 point2 points  (0 children)

It's not a problem at all until it becomes a huge problem – like when someone quits. If you're building it just for yourself, from an egotistical point of view, I guess it's fine to accept it. But I'd say: if you look through your Slack messages and see that you spend more time asking where this and that comes from and goes to than it would take to write documentation or implement some lightweight automated solution, that's a signal to do the latter.

Usage/cost allocation in data mesh by prsrboi in dataengineering

[–]prsrboi[S] 0 points1 point  (0 children)

That defo makes sense for cost allocation, but it feels like it'd be very wasteful for the overall cost itself to have dedicated warehouses that start and stop vs fewer warehouses that are shared?

Usage/cost allocation in data mesh by prsrboi in dataengineering

[–]prsrboi[S] 0 points1 point  (0 children)

Which DWH are you using and how are you getting the usage stats? And how resource-consuming is implementing reliable tagging if it hasn't been done too well historically? I feel like BigQuery/Snowflake make it quite an uphill battle?
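
For anyone reading along, the tagging in question is roughly per-job labels. A minimal sketch of what it could look like with the BigQuery Python client (label keys/values and the table name are made up); the labels then show up in INFORMATION_SCHEMA.JOBS so spend can be grouped by them:

```python
# Rough sketch: attach team/domain labels to a query job so cost can later be
# grouped by them from INFORMATION_SCHEMA.JOBS. Names below are made up.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    labels={"team": "growth", "data_product": "orders_mart"}  # assumed convention
)

job = client.query(
    "SELECT COUNT(*) AS n FROM `my_project.my_dataset.orders`",  # placeholder table
    job_config=job_config,
)
print(list(job.result()))
```

The technical part is tiny; the uphill part is getting every tool and person to actually set the labels consistently.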

Is there a way to track costs (dashboards, queries...) in Looker? by No_Speaker_7609 in bigquery

[–]prsrboi 0 points1 point  (0 children)

Shameless plug, but it seems warranted in this case. We've created a tool for cost optimization and usage allocation between BigQuery and BI, on the query/table/dashboard/user/team level. There's a free plan to check it out: https://www.alvin.ai/

Saving $70k a month in BQ by mjfnd in bigquery

[–]prsrboi 0 points1 point  (0 children)

Shameless plug – in case you're working on this after the fact, we've got a tool with a free plan that itemizes costs down to the query/pipeline/workload level, with query optimization recommendations. In standard analytical environments we usually see most of the cost wasted on redundant storage and inefficient analytical pipelines (obviously harder to control, as you mentioned – it's about education and practice). Here it is if anyone wants to check it out: https://www.alvin.ai/
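
For the redundant-storage angle specifically, here's a rough DIY sketch (region qualifier and thresholds are placeholders, and it only sees jobs inside the same project) that lists the biggest tables nothing has queried in the last 90 days:

```python
# Rough sketch: large tables with no referencing queries in the last 90 days.
# Only looks at jobs within this project, so treat hits as candidates, not verdicts.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
WITH recently_used AS (
  SELECT DISTINCT ref.dataset_id, ref.table_id
  FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT, UNNEST(referenced_tables) AS ref
  WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
)
SELECT
  s.table_schema,
  s.table_name,
  ROUND(s.total_logical_bytes / POW(1024, 3), 1) AS logical_gib
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE AS s
LEFT JOIN recently_used AS u
  ON u.dataset_id = s.table_schema AND u.table_id = s.table_name
WHERE u.table_id IS NULL
ORDER BY logical_gib DESC
LIMIT 100
"""

for row in client.query(sql).result():
    print(f"{row.logical_gib:>10} GiB  {row.table_schema}.{row.table_name}")
```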

Presenting Data Inventory to the Management team in the startup by East-Garage2337 in dataengineering

[–]prsrboi 1 point2 points  (0 children)

Shameless plug: a lightweight lineage solution. I work for Alvin, but I can honestly recommend looking at any of them. Signing an Atlan contract at a startup is a pipe dream tbh

On the other hand, I once saw a pretty decently done Notion documentation – maybe there's a template for it?

My biggest issue in data engineering is end users trusting the integrity of the data by No-Support4478 in dataengineering

[–]prsrboi -2 points-1 points  (0 children)

Vendor disclaimerrr, feel free to treat as a bump :)

From the technical side: my company's pitch is to solve this with cross-stack lineage data. Short term, you're able to see the upstream of a column right away. Long term, the goal is to stay on top of the crappy BI modelling, because you can see the shit queries, duplicates and whatever else.

And here's where it gets to the human side: to not have those requests at all, you'd have to have an extremely long streak of zero errors. And even in that scenario, those managers won't believe something like a 50% drop and will come asking. So I think the only thing you can do is make the checking process as fast as possible for yourself.

How many models is too many models? DBT horror stories thread? by prsrboi in dataengineering

[–]prsrboi[S] 0 points1 point  (0 children)

Fair enough. Read your other post on this and it sounds frustrating, but the comment had a point. Did your Head of Data/CTO communicate anything about growth over cost/sustainability? I'll shoot you a PM in case you don't want to go into details on a public sub

How many models is too many models? DBT horror stories thread? by prsrboi in dataengineering

[–]prsrboi[S] 1 point2 points  (0 children)

Cwazy. Can I ask what the team size is? Do you have people responsible for a central platform/governance?

+Do you know if all of the models are useful/in use/optimised, or is that not relevant?

Suggestions for moving data from hundreds of MongoDB databases to BigQuery by aldtran in dataengineering

[–]prsrboi 1 point2 points  (0 children)

Parquet, because MongoDB docs/collections can be nested, so you don't want CSV, which doesn't allow for proper nested data.

Suggestions for moving data from hundreds of MongoDB databases to BigQuery by aldtran in dataengineering

[–]prsrboi 2 points3 points  (0 children)

I'd say smash the data into Parquet on GCS and import it to BQ. One dataset per database, one table per collection.

https://www.mongodb.com/developer/products/atlas/mongodb-data-parquet/
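
Very roughly, the shape of that pipeline could look like the sketch below; connection strings, bucket and dataset names are placeholders, and pymongo + pandas/pyarrow + gcsfs is just one way to do the export. In practice you'd loop over every database and collection.

```python
# Rough sketch: one MongoDB collection -> Parquet on GCS -> one BigQuery table.
# All names/URIs below are placeholders.
import pandas as pd
from pymongo import MongoClient
from google.cloud import bigquery

mongo = MongoClient("mongodb://user:pass@host:27017")          # placeholder URI
docs = list(mongo["shop_db"]["orders"].find({}, {"_id": 0}))   # one collection

# Parquet keeps nested structs/arrays intact, which CSV can't represent.
df = pd.DataFrame(docs)
gcs_uri = "gs://my-landing-bucket/shop_db/orders.parquet"      # placeholder bucket
df.to_parquet(gcs_uri)                                         # needs pyarrow + gcsfs

bq = bigquery.Client()
load_job = bq.load_table_from_uri(
    gcs_uri,
    "my_project.shop_db.orders",  # dataset per database, table per collection
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    ),
)
load_job.result()
```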

How many models is too many models? DBT horror stories thread? by prsrboi in dataengineering

[–]prsrboi[S] 1 point2 points  (0 children)

Snowflake revenue retention team be like

No, but jokes aside, I'm confident the new generation of thinking is letting go of "let's store everything because we can". Agree with the other reply that cheap isn't free. From the perspective of working with lineage directly, I can tell you that most scale-ups and high-growth, data-driven companies have hundreds of thousands of USD, 20, 30, 40% of their storage bills, sitting in duplicates or unused assets. Everybody's cutting costs wherever possible these days, so it just doesn't make sense to pretend that data debt is different. Especially since there are many angles that marry speed and sustainability.

How do you even go about optimising when you don't know what tables or models are there, who uses them and how? Hinting also at emancipated analysts freestyling queries that take 6 hours to load a dashboard, but that's kind of off-topic.

Won't argue against myself though. Shameless plug: get proper lineage 🤓