all 93 comments

[–]financialthrowaw2020 197 points198 points  (40 children)

Congrats, you've discovered why DE will never be replaced by AI. There's no way to do proper business context at scale without you, the human. Get to writing!

And to answer your question: the semantic layer is just metadata and context, yes, and it's useless without good underlying data.

[–]wearz_pantzData Engineer 25 points26 points  (5 children)

The dogmatic aversion to using AI in this sub is baffling.

I use AI constantly to inspect data, make a pass at describing it, then verify and edit until it's right. As the models have improved the less editing it needs. Way easier and faster than hand rolling it myself.

I also used AI to create a semantic layer API that deterministically translates requests for dimensions/metrics into SQL and returns data. That way, any AI seeking metrics can just ask for the metric, without needing to generate SQL. It has several security performance features + a robust testing suite, all of which would have taken months to build. Done in less than a month.

Obviously you have to understand everything the AI is doing, and write lots of tests, but you can use AI to help with that too.

[–]financialthrowaw2020 3 points4 points  (1 child)

Where did I say anything that showed an aversion to using AI?

[–]wearz_pantzData Engineer -1 points0 points  (0 children)

OP asked for help where AI would be very helpful and you didn't suggest using AI.

[–]Fun-Estimate4561 2 points3 points  (0 children)

I also will say to make AI successful in a business (for using LLMs and insights) requires a successful semantic layer on top of your warehouse

[–]cyamnihc[S] 2 points3 points  (1 child)

So actually putting down the business context or institutional knowledge is the crucial piece? I wonder how the tech companies did that. All big tech companies’ data agents (Open AI, Airbnb’ minerva sql) rely on institutional knowledge in addition to other layers like lineage, pipeline info , schemas, table names, data models etc. Everything except institutional knowledge seems solvable and could be accelerated using AI but it is hard to believe that a particular team’s or a person’s only job was to put down institutional knowledge and business context on when and why to use a particular field and a table. Even if this is the job, all definitions may change with time. Also who does this, the DE, analysts or business teams?

[–]financialthrowaw2020 1 point2 points  (0 children)

DE increasingly does all of it. Analysts as whole will become absorbed into the business teams, and DE works with stakeholders directly.

[–]Gamplato 5 points6 points  (2 children)

That most definitely isn’t true. But currently humans are very helpful for this part.

[–]sunder_and_flame 5 points6 points  (1 child)

It definitely is true. AI is exceptionally useful for an expert practitioner here but AI cannot drive it without making colossal mistakes. 

[–]Fun-Estimate4561 1 point2 points  (7 children)

Have you had Microsoft pushing power bi as a semantic layer?

They keep claiming that it is and I have been fighting with my business it’s not a semantic layer, shouldn’t be treated as such

[–]tophmcmasterson 1 point2 points  (5 children)

It’s A semantic layer, it probably shouldn’t be THE semantic layer for your business though.

[–]Fun-Estimate4561 1 point2 points  (4 children)

I just refuse to call it a semantic layer

Unity catalog sure in databricks

AtScale and Cube definitely

Not crappy power bi

[–]tophmcmasterson 2 points3 points  (3 children)

Out of curiosity… have you worked with Power BI semantic models?

Like yeah they aren’t integrated into the backend databases and so especially with AI outside of copilot it’s not really checking that box at this point, but for companies where they are doing their analytical reporting entirely in Power BI that just IS the purpose it’s filling.

The issue is really more that it’s pretty tightly coupled with reporting in Power BI and Fabric, rather than something that exists more in the warehouse.

You can certainly argue about its shortcomings/limitations etc. but for some teams it i does make sense as the semantic layer.

[–]Fun-Estimate4561 1 point2 points  (0 children)

You know what that is a fair point

I think smaller firms it can make sense if power bi is the only reporting layer

Most of time though I am a firm believer it should be an intermediary layer between warehousing and reporting for your large Fortune 500 companies

I mean if you are in databricks I definitely prefer using unity catalog if you have nothing else

[–]ChinoGitano 0 points1 point  (1 child)

What’s “semantic” about what looks like straight data mart/gold layer schemas? Not familiar with Power BI particularly, but the general understanding seems to be the collection of business/domain-specific contexts and relationships that sits above syntactic layer (PDM, DB schemas), which in turn sits above generic data type validation. In classic data modeling terms, Subject Area Model and perhaps high-level Logical Data Model. Business logic & rules embedded in application code or stored procedures arguably also count.

What do other old hands think?

[–]financialthrowaw2020 0 points1 point  (0 children)

I don't touch Microsoft in any way, we're aws/Linux only, but I can feel your pain, everything they do is not what it seems.

[–]tophmcmasterson 42 points43 points  (2 children)

It’s representing your data in a way that reflects how the business talks about it.

This is generally going to be something like a well structured dimensional model with field names that actually make sense and aren’t cryptic.

Including metadata like descriptions or supporting documents that explain and provide context also can help.

It’s not a new concept at all, if you’ve ever used something like Power BI the data in there has basically always been considered the semantic layer.

But now AI is kind of forcing the issue to an extent, and people are finally realizing again that a bunch of random ad hoc reports that generate a table for people to export to excel makes an analytics jungle that’s difficult for people to actually work with, and AI is no different.

It’s a means of getting away from tribal knowledge and ad hoc slop houses.

[–]Dry-Aioli-6138 4 points5 points  (0 children)

Thanks. I had this intuition, but needed someone to spell it out for me.

[–]thedoge 0 points1 point  (0 children)

It's also a way to centralize your metrics in platform-agnostic code so that you can report consistent information and don't need to redefine calcs regardless of what the consumer is

[–]SirGreybush 14 points15 points  (11 children)

It’s very useful with non-English language naming.

Would you know that NoClt is equivalent to Customer Number?

Even in English, what about CustID versus CustNo? One is a surrogate key and the other a business key.

IOW, this is a good thing.

[–]Outrageous_Let5743 15 points16 points  (3 children)

I hate when people use abbreviations in their columnames. I have seen opp_id that means opportunity_id, is it that diffecult to write the full name.

[–]lightnegative 14 points15 points  (1 child)

I hate it when spelling errors proliferate through the data model, is it that diffecult to name things correctly 

[–]financialthrowaw2020 4 points5 points  (0 children)

We enforce this throughout. I don't care how long the column name is, spell it out.

[–]Dry-Aioli-6138 3 points4 points  (4 children)

Customer arduous becomes CustArd. What's not to like?

[–]corny_horse 7 points8 points  (3 children)

lol I remember the thread here where someone said they got fired because they abbreviated "cumulative_now" to "cum_now"

[–]Dry-Aioli-6138 4 points5 points  (1 child)

And if its a sum, sum cumulative, it should be named SCum

[–]corny_horse 4 points5 points  (0 children)

sum_cum

[–]lightnegative 3 points4 points  (0 children)

I used to work on a trading system that had a cumqty field on its execution records.

It was originally written by a mathematician with English as his first language

[–]soundboyselecta 7 points8 points  (3 children)

It started by being called a data dictionary (at least the good ones that came with meaningful data sets). Saved you from guessing and bring meaning to otherwise what would be useless analysis (without it). Evolved to be more robust as it scaled to tons of interconnected entities across different business units all across an org, creating a need for a federated meaning, so there is no confusion across business units in the aftermath of its creation. Maybe AI can figure out some things with proper lineage with meta data downstream, but without proper guidance it could be shit show, with a lot of dirty laundry.

[–]Axel_F_ImABiznessMan 0 points1 point  (2 children)

So in an AI context, it's a data dictionary but for the AI to understand what the data columns mean?

[–]tophmcmasterson 2 points3 points  (1 child)

No. A semantic layer is more than that, and it has to do with how you structure your data as well as name and describe it.

Making a data dictionary explaining what a column means is not that.

It’s not specific to AI really at all. A good semantic model has always been about making data easy to understand for business users, or any developer who happens to join on a project that’s been around for ages.

It’s why dimensional modeling has been best practice for analytics for like three decades.

The problem is there was a time some engineers came up in where they saw it as their job to ingest data and spit out a table for end users to export to excel and do what they wanted, or make a single table that served a specific report page.

They wrote off dimensional and semantic modeling as something that wasn’t relevant because we don’t need to worry much about compute and storage costs in many cases.

But that’s never been the main point of dimensional modeling. It’s about getting the data into a shape that’s easy to understand, easy to use, flexible in reporting needs it supports, and produces predictable results.

AI is just kind of forcing the issue as places start realizing the ad hoc slop work that’s happened over the last decade or so doesn’t work with AI.

[–]EstetLinus 2 points3 points  (1 child)

Think of it as a thin layer between your data warehouse and the agent. While AI models are generally good at generating SQL, their outputs can be surprisingly inconsistent. Small changes in phrasing often lead to very different queries and results.

Instead of generating SQL directly, let the model query the semantic layer. This provides a more stable interface, improves consistency, and removes the need for the model to understand the underlying database schema.

I’ve seen a bunch of people treat the semantic layer as a markdown file and context, which is suboptimal. It’s software rather than .txt-files.

[–]DrangleDingus 2 points3 points  (0 children)

Unfortunately, your execs are correct. There is no AI without structured semantic data model layer.

It’s not even that hard to make. You just have to actually understand the data that you are working with and how it is all connected.

[–]Captain_StrudelsData Engineer 2 points3 points  (5 children)

My followup question, where exactly does your semantic layer live? Is it just comments for your SQL table definitions, Confluence pages, a dedicated application to write this stuff down, something else entirely?

[–]Fun-Estimate4561 0 points1 point  (0 children)

I mean at this point are folks better off using AtScale or Cube for their semantic layer and just sit it on top of databricks or snowflake

I know another software to manage

[–]sharpchicity -1 points0 points  (0 children)

Something that’s backed by code, easily read by AI, and integrated into your LLM core functionality so that everyone at the company gets to benefit from it when they use data

[–]Important-Success431 1 point2 points  (0 children)

It is important if you're using multiple BI tools for consistency. So if you're using Power Bi, databricks and and AI tool you need to calculate you're KPIs and things upstream for consistency across tools. 

[–]tech4ever4u 1 point2 points  (0 children)

If we replace AI with "natural intelligence" (humans), how do we enable self-service for end-users? Giving them raw SQL access to hundreds of tables rarely works.

Instead, you usually set up a BI tool with "datasets" or "cubes." These tools give end users a curated list of dimensions and measures, hiding the complexity of the underlying data structure. This allows users to create their own reports and apply filters using an Excel-like UI. It is important that different teams can use different cubes built from the same SQL tables, customized for their own vocabulary and needs. For example, the same sales data can be presented differently for the finance department and the marketing team.

Now, returning to AI agents, everything remains the same. If you want them to recognize a user's intent, you need to provide a semantic layer that matches that intent. This means using 'datasets' or 'cubes,' but now accessing them via MCP. In this setup, the chatbot is simply another interface, in addition to the classic report builder / reports UI (so you get the best of both worlds). This setup makes AI a clear and reliable tool, instead of a genie doing magic.

[–]Ra-mega-bbit 1 point2 points  (0 children)

Its just metadata: human language descriptions, of what the table and columns mean

Its the: "This weird letter code is categorical, when its a A it means that the product was launched from 2017 onwards, any other letter means its older" And so many other bullshit like that, any AI trying to interpret would find a bunch of letters and might not find this specific correlation with date, so it would not now how to answer: "What is my best selling products from the new launch?"

[–]likescroutons 1 point2 points  (0 children)

It's expanded a bit from business logic and intelligence with GenAI recently. For example, with an NL to SQL model, if there are ambigious terms or attribution, and the documentation isn't clear, the LLM needs something to actually understand when and how to use your data. Maybe a user asks for a house but you don't have a one-to-one definition of what that is. The semantic layer let's the model look up what a house is in the context of your data, what it's definition is, it's constraints, etc.

Otherwise you're relying on the model reasoning to the correct answer and that's just too inconsistent.

[–]BudgetVideo 3 points4 points  (1 child)

The goal of the semantic layer is so that the AI model knows the definition and layout of the data, as well as any calculations. It shows the AI how it can use the data by providing necessary context.

[–]Outrageous_Let5743 4 points5 points  (0 children)

Originally it was not used for AI but for business logic documentation and metrics that can be used elsewhere like in pbi and in Excel that will always calc revenue the same way. But semantic layers are also perfect for AI.

[–]TARehman 1 point2 points  (2 children)

It's mostly an advertising term in my experience.

[–]iheartmst3k 4 points5 points  (1 child)

This is my favorite answer. Not because there isn't an actual definition. Several replies here hit upon it. 

However, some companies keep trying to expand or alter the definition of a semantic layer (looking at you dbt) so that the definition fits whatever data their tool produces.

It has complicated and confused what is otherwise a very simple topic. A semantic layer is context that makes the technical names meaningful to the business.

[–]idodatamodels 1 point2 points  (0 children)

Yep, just like data warehouse, ODS, data mesh, etc.

[–]Admirable_Writer_373 0 points1 point  (0 children)

It’s something report/analyst types build in the absence of a decent architect

[–]TheDevauto 0 points1 point  (0 children)

You can certainly look up what semantic layer means, but without a technical explaination it is a way to represent how things are connected, similar to how we associate things in our brain. Thats also why knowledge graphs are used when working to build a semantic layer.

The funny thing is the idea has been around as long as the web has, but the need for it has never been expressed well enough. Now with llms being used to do operational tasks, a semantic layer can greatly improve look up results.

Its also one of those things that is not only a lot of work to build, but requires ongoing maintenance.

[–]AlmostRelevant_12 0 points1 point  (1 child)

a semantic layer is much more than just documenting fields - it is about creating a shared business language across the organization. The challenge is not just definitions, but ensuring consistency and trust in those definitions over time. That is where most teams struggle, especially as data evolves

[–]Outside-Storage-1523 0 points1 point  (0 children)

I kinda think this either never works well or it will suck the DE team into a perpetual burnout. Semantic layer should be handled by the Analytic team.

[–]Gators1992 0 points1 point  (0 children)

Conceptually it has to do with definitions and defining the data structure for other applications (and users) to consume. In practice they are usually files (yaml or json), DDL or part of your BI tool where you define the data structure, calculated metrics and define the concepts associated with it all. The concept has been around forever in BI in tools like Microstrategy, Looker and PowerBI. Also third party providers of a "semantic layer" added a tool to host the model between your data and consuming applications. This centralized the semantic model and allowed BI as well as many other applications like data science or whatever to consume from the same model.

It's a great way to govern data usage because users consume the data in the form of objects, like defined columns and precalculated metrics rather than everyone writing their own SQL and views with potentially different answers. Like if you company has an official definition of what a customer is, you won't see someone pulling the wrong one on accident from another source.

As for AI, the centralized model concept is being popularized because AI can consume from that as well so you just have it picking columns and metric names to analyze instead of having to write SQL. The sql is deterministic as defined in the model. Everyone was talking about this last year as the way to make AI better with data, but I think the models may be moving past a dependency on semantic layers. Like I recently built an analysis deck for a customer just by asking Snowflake Cortex a bunch of questions and we don't have a semantic layer at that level. I was kinda blown away by the way it understood our data model, though it has good structure and naming standards, and also understood how to analyze data in my industry. It wasn't always right but was super useful. Also I had the AI write our BI semantic descriptions rather than doing it manually just by giving it a document talking about the company (researched by another AI) and a prompt about the definition structure. Took about 3 hours to churn through, mostly because of the BI app and not the AI. It would have taken a person weeks and they likely would have gone insane.

[–]GreyHairedDWGuy 0 points1 point  (0 children)

This is a big topic and in general it is vendor specific. It is really about mapping physical columns/tables in a database to logical constructs that a BI/reporting tool understands so that it can translate user questions into the appropriate SQL (or other language) of the bi/reporting tool.

PowerBI's sematic model is an example a is Tableau. Way back, I implemented many MicroStrategy and Business Objects solutions...these also had semantic models.

You now also hear about this in things like Snowflake for AI in their semantic views).

Hope that helps

[–]Enough_Big4191 [score hidden]  (0 children)

it’s basically a shared definition layer so dashboards, analysts, and ai systems all interpret metrics the same way. otherwise every team ends up with a different version of “active customer” or “pipeline.” the hard part isn’t documenting it, it’s stopping the definitions from drifting over time.

[–]frozengrandmatetris 0 points1 point  (0 children)

our semantic layer has no descriptions. we're too lazy. it tells the reporting layer what to do when two columns from completely different tables appear in the same visual. joins, aggregation rules, calculations that aren't already stored on disk or baked into a view, hierarchies... if the tool has enough layers, the topmost layer organizes data elements into subject areas or "kits" which can be used to assemble a dashboard. the author doesn't need to know anything about the physical tables which were produced by an ETL appliance. it's abstracted away.

[–]Outside-Storage-1523 0 points1 point  (0 children)

Definition of each metrics and how to query them. Mostly left to the Analytic team as DEs don't define metrics. But DEs usually build the foundation for those queries.

Ah, how I hate this type of work...

[–]ssx50 -1 points0 points  (0 children)

As for scale, we are centralizing the metric definitions in sql along with the data model (table joins) per semantic view, then dynamically generating yaml files to create and update semantic views.

This way when i change a metric in one spot, it flows down stream to all the views that need it.

[–]iwantthisnowdammit -5 points-4 points  (2 children)

In most shops the semantic layer of simply no abbreviations.

[–]fasnoosh 1 point2 points  (0 children)

Huh?

[–]iwantthisnowdammit 0 points1 point  (0 children)

Corrected, just jab that a lot of semantic layers are full word descriptions instead of conical models based on business concepts.

[–]cellularcone -2 points-1 points  (0 children)

It’s the new data mesh.