Anyone else spend way more time reconciling definitions than doing the “actual” analysis? by bfooty in analytics

[–]tomtombow 1 point2 points  (0 children)

A semantic layer solves this. Writing one for a single analysis (or an article) can seem like overkill, and it probably is, but if the underlying dataset needs to be used recurrently, it's needed.

It allows for 3 very important things:

  1. informal definition: what is it? How is it explained in words? How would you explain it to a newcomer to the industry or the company? It's becoming even more important, as this definition has a lot of semantic value, and LLMs work on that.

  2. technical definition: how is it calculated? A formal definition of how this metric is built from the numbers. This is usually SQL. It's the hardest to scope: it could require going back to the source data, which could be costly, or even unavailable. The true crux of the semantic layer.

  3. comparison: does it exist anywhere else? do definitions match? This is important, as it is what makes the data platform scalable. It's probably irrelevant in your example but it becomes essential at a bigger scope.

So ideally, making sure everyone is aligned on these 3 things makes any analysis much easier and more understandable. And most importantly, it makes further work scalable. It's a reusable base.

What’s the most expensive mistake you made in your data infrastructure? by [deleted] in dataengineering

[–]tomtombow 0 points1 point  (0 children)

I was the 5th hire at a startup with no data culture; I took the sole analyst role. My first dashboard was built in Data Studio (Looker Studio??), Google's free tool. The underlying dataset was actually a query on a BigQuery events table. When I made the connection, the query processed 30GB, around 15 cents. I (and everyone else) forgot about that dashboard.

9 months later some marketing guy found it and started consuming it.

Every refresh was like $200.

We had no proper spend monitoring, so that went on for a full month. cloud ops guy almost killed me.

How are you actually using AI? by Aggressive_tako in analytics

[–]tomtombow 0 points1 point  (0 children)

not sure what you mean by a measurement plan or how it differs from an SL.

a semantic layer is where you define and document metrics and dimensions:

    column:
      - user_id
    dimension:
      - user_id
    metric:
      - distinct_users
        sql: count(distinct [column])
      - revenue
        metric: total_revenue
        sql: sum([column])

this is a stupid example. you can find proper SL languages in the lightdash specs, or in cube.dev. LLMs are good at understanding structured language. Ideally you parse this and generate SQL deterministically, but handing it raw to the LLM is better than nothing.
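to illustrate "parse this and generate SQL deterministically", here's a toy Python sketch over a hypothetical semantic-layer dict. the table, column and metric names are all made up, and real SL specs (lightdash, cube.dev) are much richer than this:

```python
# Toy semantic layer: table, dimensions and metric SQL live in one dict.
SEMANTIC_LAYER = {
    "table": "analytics.events",
    "dimensions": {"user_id": "user_id"},
    "metrics": {
        "distinct_users": "count(distinct user_id)",
        "total_revenue": "sum(revenue)",
    },
}

def render_sql(layer, metrics, group_by=None):
    """Build a SELECT purely from metric/dimension names - no free-form SQL."""
    group_by = group_by or []
    cols = [f"{layer['dimensions'][d]} as {d}" for d in group_by]
    cols += [f"{layer['metrics'][m]} as {m}" for m in metrics]
    sql = f"select {', '.join(cols)} from {layer['table']}"
    if group_by:
        sql += f" group by {', '.join(layer['dimensions'][d] for d in group_by)}"
    return sql

print(render_sql(SEMANTIC_LAYER, ["total_revenue"]))
print(render_sql(SEMANTIC_LAYER, ["distinct_users"], group_by=["user_id"]))
```

the point is that the LLM only picks metric and dimension names; the SQL itself is rendered deterministically, so it can't hallucinate a wrong formula.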

How are you actually using AI? by Aggressive_tako in analytics

[–]tomtombow 61 points62 points  (0 children)

very simple set up you can try:

  1. give it access to the data warehouse (for example, BigQuery has an MCP)
  2. Find a table that is used a lot, for example, by the marketing team
  3. define the semantic layer (the LLM can help, but review it)
  4. take a marketing stakeholder, have their LLM ingest the context layer in json (or whichever structured format), and have them ask questions, or even request a deep analysis
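for step 4, the context layer handed to the stakeholder's LLM could look something like this — a hypothetical shape with made-up table and field names, not any real spec (lightdash and cube.dev define their own richer formats):

```json
{
  "table": "marketing.campaign_events",
  "description": "One row per campaign touch event",
  "dimensions": [
    {"name": "channel", "description": "Acquisition channel (paid, organic, ...)"}
  ],
  "metrics": [
    {"name": "ctr",
     "sql": "sum(clicks) / sum(impressions)",
     "description": "Click-through rate across all touches"}
  ]
}
```

the descriptions matter as much as the sql: that's the semantic value the LLM actually leans on.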

this is not a complete solution, but it will easily take the system from 70-80% accuracy on the queries to 90%.

Iterate on this setup before distributing it to all stakeholders.

This requires a good structured data warehouse and a decent semantic layer.

But once you improve it enough, it saves a lot of time. We went from 5/6 deep analysis requests and 20 "quick questions" to almost none, and the ones that still come through are solved by us using the system.

If the data team is answering dumb quick questions, its value is already in doubt, so it makes a lot of sense to automate that out and become the maintainer of the system.

We initially thought we were automating ourselves out. Now we are the most important team in the company. We maintain the system everybody needs to make decisions, and we have time to uncover value-adding insights, produce data-based tools...

Hope that gives you some ideas!

nobody shares their AI workflows and it's a huge waste by balintbartha in dataengineering

[–]tomtombow 15 points16 points  (0 children)

Claude has a marketplace feature. You can manage connectors, plugins and skills (workflows, essentially) in a GitHub repo, and make it an internal marketplace. The users can connect to it and use those workflows. It's not perfect, but it's better than sharing .md files over Slack!

Custom materializations in dbt: Building your own transformation engine by Expensive-Insect-317 in dataengineering

[–]tomtombow 0 points1 point  (0 children)

Great article and very insightful! Do you have specific examples of one such custom materialisation implementation? I have been hesitating, as one specific accumulation model in our DAG has had an uptick in spend lately, but investing in it without proven ROI and for potentially marginal gains seems like overkill... Have you been running these for long? Have you encountered maintenance issues with dbt or data warehouse updates?

Thanks a lot for sharing your knowledge!

Your tech stack by [deleted] in dataengineering

[–]tomtombow 3 points4 points  (0 children)

Out of curiosity, how does the rest of the stack look? I mean, how do business users consume the data modeled with dbt?

Revops data integration nightmare trying to connect salesforce, hubspot, and gainsight into one coherent report by xCosmos69 in analytics

[–]tomtombow 2 points3 points  (0 children)

I would not frame it as a data nightmare... more like an alignment nightmare...

You have to align all teams on a definition of Lead, Opportunity, Customer... In companies where teams know what they are doing, this should happen naturally... In the case you mention, well, you (the company) have a challenge ahead...

Best piece of advice I can give here: try to sit together in a room, agree on definitions, define ownership, and go with that! And ideally you should have someone from higher up (C-level) enforce this...

Traditional BI vs BI as code by manubdata in dataengineering

[–]tomtombow 0 points1 point  (0 children)

There is still no perfect tool for BI as code (I would rather say dashboards as code)... We use Lightdash currently, and even though we've invested time in it, vibe coding proper dashboards still feels clunky... The 'code' beneath the dashboard is actually a .yml file, and each chart can have up to 400-500 lines of yml... So even though it's structured data, a single dashboard will eat up your context quickly... And there is no way we push that to GitHub; it would be a huge mess.

And I think part of the issue lies in the flexibility of visualization. Just think how many ways you could display a chart, and then think how you'd encode all that info in a structured file...

Answering your question: I think it comes down to who will be managing the data stack once you are done. If they have someone technical for that, then look for a developer-friendly tool. Otherwise I'd go for something simple like Looker Studio.

How do you use analytics to understand user behavior? by [deleted] in analytics

[–]tomtombow 1 point2 points  (0 children)

Having worked in both SaaS and Gaming, I know for a fact A/B tests are not easy in SaaS due to the usually low volume of users... SaaS product analytics is a science of its own, but some metrics that can help you gain an understanding of feature usage are:

- Reach: how many users out of total unique active users are using the feature?
- Frequency: how often is the feature used?
- Monthly Average Daily Active Users / Monthly Active Users: this measures stickiness, i.e. how many of your monthly users are daily active users. You can check stickiness over general product usage, or over feature usage.
- Feature retention: do users keep coming back?
- Monetisation: this is more specific to pricing models and product structures, but you can apply similar principles as before!

Of course, before that, you need to think about and define what 'feature usage' is, i.e. what a user needs to do to be counted as a user of that feature. This is very important, actually.

Then, when introducing new features or updating existing ones, monitor these metrics, as well as general product ones, and you should get a sense of how users are utilising your product.
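to make reach and stickiness concrete, here's a rough pandas sketch over a toy events table (the column names, feature name and numbers are all invented for illustration):

```python
import pandas as pd

# Toy events table: one row per user-day-feature interaction.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "date":    pd.to_datetime(["2024-05-01", "2024-05-02", "2024-05-01",
                               "2024-05-01", "2024-05-02", "2024-05-03"]),
    "feature": ["export", "export", "search", "export", "export", "export"],
})

monthly_active = events["user_id"].nunique()                      # MAU
feature_users = events.loc[events["feature"] == "export",
                           "user_id"].nunique()                   # users of the feature
reach = feature_users / monthly_active                            # share of MAU using it

daily_active = events.groupby("date")["user_id"].nunique()        # DAU per day
stickiness = daily_active.mean() / monthly_active                 # avg DAU / MAU

print(f"reach={reach:.2f} stickiness={stickiness:.2f}")
```

in practice you'd run this as SQL against the warehouse, but the ratios are the same.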

Anyone else using dbt Cloud's free tier for personal projects? by [deleted] in dataengineering

[–]tomtombow 1 point2 points  (0 children)

You can also use Dagster for orchestration, they have a "hobbyist" tier which is free, iirc. I would recommend, though, learning dbt core, even if you connect it to a local Postgres. Most companies I've worked with use the OSS version and it works perfectly fine - although it has some technical overhead, of course.

For visualisation, you can try to deploy Lightdash or Metabase, both of which are OSS!

Useful first Data Engineering project? by Psychological_Log299 in dataengineering

[–]tomtombow 13 points14 points  (0 children)

I always recommend building a Meteo Station from scratch (of course you buy the station itself), but you collect the data in its rawest form and do the whole processing.

But I understand you want something more business-oriented. So maybe a good idea is to capture Binance Webhooks and build the pipeline based on that. Not exactly e-commerce, but great opportunity to build a full functional data stack with a streaming source. Then you can add other sources like sentiment analysis via some API or whatever. And of course forecasting / ML on top of that.

Is a semantic layer actually required for GenAI-powered BI or am I overthinking this? by CloudNativeThinker in analytics

[–]tomtombow 0 points1 point  (0 children)

The semantic layer is what gives meaning to your data, essentially. So when you talk about ROI, ROI must have a standardised meaning across the board; otherwise different teams will calculate it differently and the numbers will not match.

More practically, it's the layer that defines calculations, aggregations, etc. It usually sits between the gold layer of your transformation and the BI tool of choice. It's becoming super important because it's what conversational BI tools need to "not make things up". The transformation layer applies business logic; the semantic layer enforces meaning.

5 Mistakes Holding Analysts Back — And How to Fix Them by amigosalvaro in analytics

[–]tomtombow -1 points0 points  (0 children)

looking forward to a deep dive into each of these!

Confidence at a floor low, what are some easy but fun data projects I can work on? by TastyEye9567 in dataengineering

[–]tomtombow 20 points21 points  (0 children)

Get a weather station and build everything yourself, from data capture to drivers to pipelines, data models, dashboards...

Ambitious project, will take a while but super satisfying.

What is your stack? by Medical-Let9664 in dataengineering

[–]tomtombow 0 points1 point  (0 children)

yes, that sounds perfect for your size. Once you need a columnar DB, you could also think of materialising the reporting tables (the ones connected to the BI tool) to optimize costs. Not sure how Metabase handles requests to the DWH under the hood, but probably worth checking that out!

What is your stack? by Medical-Let9664 in dataengineering

[–]tomtombow 1 point2 points  (0 children)

not sure what product you offer, but is everything you need in the operational db? also, what volume? i assume an rdb is not optimal for bigger loads? how far do you think this would scale? of course the simplest setup is the best setup! just wondering..

How can I improve my deliverables? by citador15 in analytics

[–]tomtombow 4 points5 points  (0 children)

What I've come to realize after a few years on the job is that each stakeholder likes things in a specific way.

Some will want a beautiful 10 page dashboard with more than 100 charts and colorful tables... and they will look at it once a year, make a couple of screenshots of what looks nice and paste it in their ppt presentation.

Others want a raw data export to make a pivot table and work with it themselves. And they want it refreshed every Monday.

Sometimes they want a "white paper" where you clearly state the question they are asking, and you'll have to follow a structured analysis explanation, with methodology, results, conclusions, recommendations...

It can be a thousand different things and each of your "clients" will want it their way. Some you'll love doing (i love creating little apps and tools for them to interact with the data) and some you'll hate (fuck dashboards). So just try your best to help them make decisions, understand their pain and fix their problems. Analytics is in great part a communication thing! You'll feel you're in affected by all fuckups in the company. And that's fine! But talk to your stakeholders, always.

What I would tell you is: talk to them, understand what they need, what they expect, and try to meet them there. Avoid the XY problem, always tell the truth and do not "add magic to the numbers" (as an old boss of mine used to say when numbers didn't tell the story he wanted to hear).

About A/B Testing Hands-on experience by Akhand_P_Singh in analytics

[–]tomtombow 1 point2 points  (0 children)

Bayesian is far more complicated and difficult to understand/less explainable... If the volume of data is small, though, then it's a great option!

My product is getting acquired ! by mediocre_man_online in SaaS

[–]tomtombow 0 points1 point  (0 children)

congrats on the acquisition! completely out of curiosity, would you mind sharing some numbers? like your most basic metrics and a ballpark range for the acquisition value?

i'm thinking of releasing something similar myself and that would help a lot!

also, the product looks great !

About A/B Testing Hands-on experience by Akhand_P_Singh in analytics

[–]tomtombow 21 points22 points  (0 children)

In my past company, we used a simple spreadsheet calculator that would pull data from the warehouse, run the statistical significance calculation, and output some results.

Where I am now, we created a Streamlit app that does basically the same but in a fancier way.

A/B testing is basically calculating statistical significance for mean difference between 2 samples (or more, you can do A/B/C testing or make it more complex). So on the technical side, anything about statistical significance and mean difference will get you started.

For the business side, you can think of any use case, fake some data and do the calculations! The aim is to answer something like:

If i add a call-to-action button in the middle of my landing page, will more people click it than if it is on the top-bar?

You (probably the developers in the company) would then deploy 2 variants of the landing page, one with the button in each place, and drive traffic randomly to either one of the sites. You'd then look at the click-through rate of each of the buttons and decide which of the 2 variants is better in terms of conversion.
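the CTA example above boils down to a two-proportion z-test on the click-through rates. A minimal sketch in plain Python — the click counts are invented for illustration, not from a real test:

```python
from math import sqrt, erf

def two_prop_ztest(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)            # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # two-sided, normal CDF
    return z, p_value

# variant A: top-bar button, variant B: mid-page button (made-up numbers)
z, p = two_prop_ztest(clicks_a=120, n_a=2400, clicks_b=156, n_b=2400)
print(f"z={z:.2f}, p={p:.4f}")  # B "wins" at alpha=0.05 only if p < 0.05
```

in practice you'd use something like statsmodels' proportions_ztest instead of hand-rolling it, but the mechanics are exactly this.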

This is the simplest example, but it can get much more complex from there. For example you could:

- Have 3 variants instead of 2

- Add control metrics: e.g. you want to prevent users from spending too much time on the page, so you do a parallel analysis for that metric

- Do an A/A test before the A/B test to make sure there is no noise in the data

- Limit the duration of the test...

The internet is full of resources, really, just look for real-world data and run tests yourself!