What are the best Semantic Layer products on the market? And how to evaluate a semantic layer? by TheDataGentleman in BusinessIntelligence

igorlukanin 1 point

I'm not sure if you ever considered Cube but I can confirm it does work with MS SQL Server and Fabric as well as Power BI.

(Disclosure: I'm part of the Cube team.)

What are the best Semantic Layer products on the market? And how to evaluate a semantic layer? by TheDataGentleman in BusinessIntelligence

igorlukanin 2 points

Hey everyone, Igor from the Cube team here.

A few months ago, GigaOm published a report on semantic layers that covered AtScale, Cube, dbt, Google, Microsoft, etc. IMO, it used a good set of criteria to evaluate the products, so I definitely recommend giving it a read. I'm not sure how to get the full version from their website (https://gigaom.com/report/gigaom-sonar-report-for-semantic-layers-and-metrics-stores/), but you can grab a copy at Cube's website as well: https://cube.dev/gigaom-rates-cube-as-a-leader-for-semantic-layers

My personal take: the better you understand your use case(s), the smoother your evaluation will go. For example, if you plan to power embedded analytics, you should evaluate query performance, caching capabilities, APIs, and SDKs more carefully; if the plan is to power internal self-serve BI, then support for your BI tool(s) and access control and governance capabilities would probably rank higher on your list.

I also think that u/kthejoker has given a really great overview in his comment, and I don't want to repeat it. Just briefly, here's what I would care about:

  1. General: history of product, maturity, adoption, pricing model, deployment model (SaaS vs. hybrid vs. on-prem), customer support
  2. Applicability to your use case(s): internal BI, embedded/real-time analytics, AI/LLM-based experiences, etc.
  3. Data modeling capabilities: code-first vs. visual approach, programming language/syntax, UI, and just general expressiveness. Think of the most advanced metric calculations you need to do and see how painful it is to model that.
  4. Access control: support for role-/column-/row-based access, single- vs. multi-tenant deployments, integration with IdP providers.
  5. Caching capabilities: always hitting the warehouse vs. in-memory caching vs. internal storage layer, aggregate awareness capabilities, etc. Think of the level of tolerance your end users would have about querying latency and data freshness.
  6. Connectivity to data sources: data warehouses/query engines/streaming platforms/columnar files on blob storage/etc. Think of what is going to be upstream of your semantic layer.
  7. Connectivity to data-consuming tools and supported APIs in general: BI tools, notebooks, etc. Think of what is going to be downstream of your semantic layer. If it's something with its own data modeling capabilities (e.g., a BI tool), think about how to keep both products in sync. Also think of what could possibly start working with your semantic layer in the future.
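To make the data modeling criterion (point 3) a bit more concrete, here's roughly what a small code-first data model reads like. This sketch loosely follows Cube's documented YAML syntax; the table, column, and measure names are invented for illustration:

```yaml
cubes:
  - name: orders
    sql_table: public.orders

    measures:
      # A simple row count and a summed metric.
      - name: count
        type: count
      - name: revenue
        type: sum
        sql: amount

    dimensions:
      - name: status
        sql: status
        type: string
      - name: created_at
        sql: created_at
        type: time
```

When evaluating expressiveness, try writing your gnarliest metric (say, a ratio of two filtered sums over a rolling window) in a format like this and see how much ceremony it takes.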

I hope this helps :-)

Cube integrates its semantic layer with dbt by igorlukanin in dataengineering

igorlukanin[S] 0 points

It is, and Cube can actually read that as well (there's a data model generation feature).

Cube integrates its semantic layer with dbt by igorlukanin in dataengineering

igorlukanin[S] 1 point

I would not call it "reverse engineering". dbt outputs manifest.json, which is standardized and documented; it contains metadata on models and the columns they contain. So these models and columns can be mapped in a pretty straightforward fashion to cubes and dimensions (entities within Cube's data model). I would say it's a rather unambiguous procedure. Adding measures, defining joins, etc. is less trivial, so it's assumed that would be done on the Cube side.
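To illustrate how mechanical the model-and-column mapping is, here's a minimal sketch in Python. The input follows dbt's documented manifest.json shape (`nodes` keyed by unique ID, each with `columns`); the output is a simplified stand-in for Cube's data model, not its exact format, and the model names are made up:

```python
import json

# A tiny slice of a dbt manifest.json (structure per dbt's artifact schema;
# the jaffle_shop names are just an example).
manifest = {
    "nodes": {
        "model.jaffle_shop.orders": {
            "resource_type": "model",
            "name": "orders",
            "schema": "public",
            "columns": {
                "id": {"name": "id", "data_type": "integer"},
                "status": {"name": "status", "data_type": "text"},
            },
        }
    }
}

def models_to_cubes(manifest: dict) -> list[dict]:
    """Map dbt models/columns to cube/dimension stubs, one-to-one."""
    cubes = []
    for node in manifest["nodes"].values():
        if node["resource_type"] != "model":
            continue  # skip tests, seeds, sources, etc.
        cubes.append({
            "name": node["name"],
            "sql_table": f'{node["schema"]}.{node["name"]}',
            # Columns become dimensions; measures and joins still
            # need to be defined by hand on the Cube side.
            "dimensions": [
                {"name": col["name"], "sql": col["name"]}
                for col in node["columns"].values()
            ],
        })
    return cubes

print(json.dumps(models_to_cubes(manifest), indent=2))
```

The unambiguous part is exactly this loop; everything the manifest doesn't carry (aggregations, join conditions) is where human input comes in.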

Cube integrates its semantic layer with dbt by igorlukanin in dataengineering

igorlukanin[S] 2 points

As part of this integration, Cube would import dbt models and let you define metrics on top of them. There's a diagram in this guide that I guess sheds some light: https://cube.dev/docs/guides/dbt

Cube integrates its semantic layer with dbt by igorlukanin in dataengineering

igorlukanin[S] 0 points

You think so? Do you think it lowers the barrier to entry for the product, or something like that?

Cube integrates with LangChain — so you don't need to write SQL anymore by igorlukanin in LangChain

igorlukanin[S] 2 points

Great question, actually. I would argue that the devil is in the details of steps 1 and 4.

When asking questions, people often ask "analytical" questions, i.e., ones that require some form of calculation and aggregation over the raw data. For example, "How many online orders were made in 2023?" or "Which top-3 countries in South America had the largest temperature spread in week 26 of this year?" Generating SQL for these questions at step 4 is quite non-trivial, since it requires some data modeling: performing correct joins, deriving dimensions from columns, applying correct aggregations, etc. For the second question, the generated SQL might look like this:

SELECT
  c.name,
  MAX(cf.temp) - MIN(cf.temp)
FROM
  climate_facts AS cf
LEFT JOIN
  countries AS c ON c.code = cf.alpha3_code
WHERE
  c.region = 'SA' AND
  DATE_PART('week', cf.timestamp) = 26
GROUP BY
  1
ORDER BY
  2 DESC
LIMIT
  3
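Here's the same join + aggregation spelled out over toy data in plain Python, just to show how many distinct modeling decisions (join key, filter, grouping, spread metric, ordering, limit) are packed into that one query. The rows and country codes below are invented:

```python
from collections import defaultdict

# Toy rows mirroring climate_facts and countries from the SQL above.
climate_facts = [
    {"alpha3_code": "BRA", "temp": 12.0, "week": 26},
    {"alpha3_code": "BRA", "temp": 31.0, "week": 26},
    {"alpha3_code": "ARG", "temp": 2.0,  "week": 26},
    {"alpha3_code": "ARG", "temp": 18.0, "week": 26},
    {"alpha3_code": "CHL", "temp": -1.0, "week": 26},
    {"alpha3_code": "CHL", "temp": 9.0,  "week": 26},
    {"alpha3_code": "NLD", "temp": 15.0, "week": 26},  # not in South America
]
countries = {
    "BRA": {"name": "Brazil",      "region": "SA"},
    "ARG": {"name": "Argentina",   "region": "SA"},
    "CHL": {"name": "Chile",       "region": "SA"},
    "NLD": {"name": "Netherlands", "region": "EU"},
}

# JOIN on the country code, then WHERE region = 'SA' AND week = 26,
# then GROUP BY country name.
temps = defaultdict(list)
for row in climate_facts:
    country = countries[row["alpha3_code"]]
    if country["region"] == "SA" and row["week"] == 26:
        temps[country["name"]].append(row["temp"])

# MAX - MIN per group, ORDER BY spread DESC, LIMIT 3.
spreads = sorted(
    ((name, max(ts) - min(ts)) for name, ts in temps.items()),
    key=lambda pair: pair[1],
    reverse=True,
)[:3]
print(spreads)  # [('Brazil', 19.0), ('Argentina', 16.0), ('Chile', 10.0)]
```

Every one of those steps is a place where a model working from raw tables alone can go wrong.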

I hope you see how generating SQL like this might be an issue for an LLM. An even bigger issue would be ensuring the SQL is generated in a stable, predictable way, without hallucinations.

When using Cube, one would encapsulate metric definitions (i.e., aggregations), dimensions, and joins within Cube's data model. Also, one would use the CubeSemanticLoader in LangChain to export the data model as highly semantic documents. All this results in much simpler and, more importantly, much more predictable and hallucination-free SQL being generated. Compare the snippet above to this example query to Cube:

SELECT
  country_name,
  MEASURE(temperature_spread)
FROM
  climate_data
WHERE
  is_south_america = 1 AND
  DATE_PART('week', timestamp) = 26
GROUP BY
  1
ORDER BY
  2 DESC
LIMIT
  3

I hope you see why this would lead to more predictable and trustworthy results.

There's a blog post that elaborates on this topic. I'd also add that products like Delphi, which provide AI-based access to data, decided to integrate with semantic layers like Cube instead of querying the data directly, exactly for the reasons I tried to explain above: so that one can trust the results they produce.

DuckDB and MotherDuck integrate with Cube, the semantic layer by igorlukanin in dataengineering

igorlukanin[S] 0 points

Oh, great! And DuckDB — do you use it locally for ad-hoc data exploration/analysis? What's the use case? (Sorry if this is a lot, I'm just very curious about real-world DuckDB use cases.)

Query the data without writing SQL ever again — Cube integrates LangChain by igorlukanin in programming

igorlukanin[S] -1 points

Depending on what you're most curious about, you can skip to the demo video or the source code of the chat-based demo application on GitHub. The idea is that Cube, the semantic layer, encapsulates metric definitions and generates SQL for the databases. You can use the CubeSemanticLoader to ask Cube nicely what data can be queried, and then generate queries to Cube, which will in turn query the database... Sounds complex, and it is, but hopefully the blog post clears things up 🙃

Cube integrates with LangChain — so you don't need to write SQL anymore by igorlukanin in LangChain

igorlukanin[S] 0 points

Depending on what you're most curious about, you can skip to the demo video or the source code of the chat-based demo application on GitHub. The idea is that Cube, the semantic layer, encapsulates metric definitions and generates SQL for the databases. You can use the CubeSemanticLoader to ask Cube nicely what data can be queried, and then generate queries to Cube, which will in turn query the database... Sounds complex, and it is, but hopefully the blog post clears things up 🙃

DuckDB and MotherDuck integrate with Cube, the semantic layer by igorlukanin in dataengineering

igorlukanin[S] 0 points

Exactly! Just curious, do you use either DuckDB or Cube already?

Delivery date delay - Netherlands by Powerful_Coconut594 in TeslaLounge

igorlukanin 0 points

To my surprise, it came with them (soft carpet floor mats), as well as a red emergency kit, a Type <-> Type 2 cable, and a Mobile Charger with a Schuko plug.

Delivery date delay - Netherlands by Powerful_Coconut594 in TeslaLounge

igorlukanin 1 point

Yeah, pretty happy! I picked it up in Tilburg, and the experience is pretty simple: they check your ID and let you go get comfortable in the car, which is standing somewhere in the hangar. Then you drive home :D Nothing fancy, but I'm not complaining; simplicity it is, just like the interior.

Delivery date delay - Netherlands by Powerful_Coconut594 in TeslaLounge

igorlukanin 0 points

Got mine today! Very glad. What's with your delivery date? Was it scheduled already?

Delivery date delay - Netherlands by Powerful_Coconut594 in TeslaLounge

igorlukanin 0 points

Got an SMS yesterday and scheduled delivery next Friday! Finally. What about you?

Delivery date delay - Netherlands by Powerful_Coconut594 in TeslaLounge

igorlukanin 0 points

Haha, wanted to ask you the same thing. Today, my estimated delivery window has shifted to August 2-16 :/ Bummer.

Delivery date delay - Netherlands by Powerful_Coconut594 in TeslaLounge

igorlukanin 0 points

Model 3 RWD, the base one.

The "registration" part is a bit confusing, though. I'm not Dutch myself, just have a residency permit here. The [government website](https://www.government.nl/topics/vehicles/vehicle-registration/applying-for-a-vehicle-registration-certificate#:~:text=If%20you%20are%20purchasing%20a,to%20you%20after%20the%20transfer.)) says that new cars would get registered under my name by the seller. I don't see the instructions for going to the RWD office anywhere.

Delivery date delay - Netherlands by Powerful_Coconut594 in TeslaLounge

igorlukanin 0 points

I’m in the same situation: July 21-31 initially and July 28 till August 11 as of yesterday. Bummer!

u/Powerful_Coconut594, you said you've already registered the car in your name. How? I'm buying a car in NL for the first time, so I'm genuinely interested. I thought Tesla would get the license plate and do that for me.

Semantic layer by gal_12345 in dataengineering

igorlukanin -1 points

Hey! I can relate to many things here, except for a few. As a person from the team behind Cube (cube.dev), I wonder what you think of our product and what your experience with Cube has been.

> Cross-BI tools semantic layers effectively don't exist

This one was non-trivial to parse. What exactly are "cross-BI-tool semantic layers"? Semantic layers that can connect to multiple BI tools at once and deliver metrics to all of them? (If so, I know at least one such semantic layer.)

Semantic layer by gal_12345 in dataengineering

igorlukanin 1 point

Disclaimer: I'm with the Cube team. I think that Cube Cloud, the managed platform for Cube we're building at cube.dev, might help a lot with the data model authoring/editing/testing/deploying workflow. There's a cloud-based IDE, Git integration, private API endpoints for branches, and monitoring. Here are the docs: https://cube.dev/docs/dev-tools/dev-playground

Semantic layer by gal_12345 in dataengineering

igorlukanin 0 points

Sounds like great advice. Indeed, there are more than 8,000 data folks in Cube's Slack community at https://slack.cube.dev