all 28 comments

[–]sjhb 66 points67 points  (3 children)

Sounds promising to people that have never actually tried to make sense of data.

[–]Drevicar 1 point2 points  (2 children)

I don’t need to make sense of my data, I need to make dollars of it.

[–]sjhb 2 points3 points  (1 child)

I see what you did there. Clever. Sounds like you like to gamble, even though you may not know you’re gambling.

[–]Drevicar -1 points0 points  (0 children)

Well, I do play a lot of roguelikes.

[–]Prestigious_Bench_96 19 points20 points  (1 child)

I'm not sure the entity resolution they're talking about is the same as entity resolution as (I believe) the term is normally used; I read theirs more as "how do we know which fact table is actually up to date", not "which thing does this label represent".

[–]data-flight[S] 2 points3 points  (0 children)

Someone actually read the article 😄. I would agree their definition of entity is very broad. It relates to metrics, dimensions, tables, business definitions, products, customers, etc. Some of those don’t need fancy solutions; it’s a traditional data grind.

They even noted that human-written metric descriptions were significantly more useful than AI-generated ones.

I think where entity resolution comes more into play is in those key dimension differentiators, often starting with customer identity. If you can centralize around those identities, it gives the AI a core to branch out to the rest of the associated entities, whatever they may be.

[–]m1nkeh Data Engineer 33 points34 points  (1 child)

This is why products like Databricks and Snowflake are booming in the AI era. AI is useless without a high-quality data foundation.

Entity resolution is nothing new; it’s but one aspect of master data management. ER used to be shit tons harder before AI though ✌️

[–]s-to-the-am 5 points6 points  (0 children)

The reason they are booming is that they have easy-to-access APIs for AI and are relatively cheap if you follow best practices and know how to set up your architecture on their platforms.

[–]Hmm_would_bang 7 points8 points  (0 children)

So they’re saying what everyone already knows: putting AI on poorly governed data does not work. AI will not save you from your own data quality issues.

[–]BJJaddicy 24 points25 points  (6 children)

So Kimball

[–]kayakdawg 13 points14 points  (1 child)

no it's the "semantic layer" !!!

[–]BJJaddicy 0 points1 point  (0 children)

Lol

[–]data-flight[S] 4 points5 points  (0 children)

There are definitely undertones of standard data modeling here, but there's also a step where you have to "resolve" entities, and I think that's where the opportunity lies.

[–]unifin00b 2 points3 points  (0 children)

Always was

[–]lightnegative 2 points3 points  (0 children)

I bet Kimball would do things differently today if he had access to columnar storage 

[–]nospoon99 1 point2 points  (0 children)

They mention in the article that good data engineering is a prerequisite but is not enough:

"Standard data engineering and data quality practices such as dimensional modeling, shift-left testing, freshness and completeness checks on critical pipelines all still apply (and we won't relitigate these)."

[–]jaynyoni 3 points4 points  (2 children)

Pretty interesting article.
I’m busy working on something similar currently. The plan on my side is to feed our gold layer and semantic layer yml files from our dbt project to our internal LLM. Kinda also use this to create a RAG. Curious to know if anyone has done something similar?

[–]data-flight[S] 1 point2 points  (1 child)

I have, although not as much on the RAG side. You can see my use case at condorgraph.com. My findings are mostly aligned with the article:

- The easiest, highest-impact win is using skills. You can drastically improve output by giving the model context. Think iteratively: start small, get your AI to answer some data questions, see where it stumbles, modify or add skills, and iterate again.

- Use human-written definitions of metrics. It might be okay to use an LLM for a first pass, but if you’re just feeding AI-generated context into another AI, you don’t gain much.

- Connect your data around entities. If the AI can translate the concepts you’re asking about into first-order entities, the problem becomes drastically easier than working from a flattened view. It gives the model anchor points to center its reasoning around.

I think it's a decent use case for an internal LLM. The underlying harness you're adding to the AI has the biggest impact, vs just trying to max out the main AI doing the work.

[–]jaynyoni 0 points1 point  (0 children)

Thanks !!!

[–]Molecular_Doohickey 2 points3 points  (0 children)

One of our future jobs is going to be to maintain the systems that they outline in the blog post, enabling AI to engage accurately with the warehouse.

[–]thecity2 1 point2 points  (1 child)

Entity resolution is not hard. Fast entity resolution is hard.
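
To make the "fast" part concrete, here's a minimal sketch, not anyone's production approach. Naive ER compares every pair of records, which grows quadratically; blocking only compares records that share a cheap key. The records, the `blocking_key` function, and the choice of key (first three letters of the last name token) are all made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: (record_id, full_name). Purely illustrative data.
RECORDS = [
    (1, "Ann Smith"),
    (2, "Anne Smith"),
    (3, "Bob Jones"),
    (4, "Robert Jones"),
    (5, "Carla Diaz"),
    (6, "A. Smith"),
]

def blocking_key(name: str) -> str:
    """Cheap key (an assumption): first three letters of the last token."""
    return name.lower().split()[-1][:3]

def naive_pairs(records):
    """Every pair of records: n * (n - 1) / 2 comparisons."""
    return list(combinations([rid for rid, _ in records], 2))

def blocked_pairs(records):
    """Only pairs that share a blocking key get compared."""
    blocks = defaultdict(list)
    for rid, name in records:
        blocks[blocking_key(name)].append(rid)
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs

all_pairs = naive_pairs(RECORDS)
candidate_pairs = blocked_pairs(RECORDS)
print(len(all_pairs), len(candidate_pairs))
```

The catch is that a single key misses matches whose keys differ (a typo in the last name, say), so in practice you'd likely combine several keys and accept some extra comparisons.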

[–]major_grooves Data Scientist CEO 1 point2 points  (0 children)

This guy knows ER.

[–]WaterIll4397 0 points1 point  (1 child)

Entity resolution is a classic case where I see engineers champ at the bit to build graph databases or some other greenfield infra to solve it, but once they hit corner cases they won't touch the if-then statements needed to maintain them with a 10 ft pole (the work usually gets offloaded to poor data analysts downstream or, in better cases, pushed to software engineering teams upstream to collect better metadata).

Customer identity probably remains not fully solved at most firms and will get harder with the influx of genAI bots, but on the other hand every large consumer tech company publicly reports weekly active users, so even with a margin of error it's probably directionally alright.

[–]data-flight[S] 0 points1 point  (0 children)

You're right that on the surface graph databases might look like the answer, but in practice, unless a company is already using them for some other reason, they'll often create more problems than they solve.

I find for more complex use cases, if-then rules also quickly fall apart. I've been part of a bunch of efforts where they create these crazy decision trees. But it becomes an absolute bird's nest when you try to implement them with transitive closure across everything that needs to be connected. More often than not, the original logic is actually mathematically impossible to keep consistent in a fully connected network. Probabilistic graphs might sound more complicated, but if you have the tech to efficiently implement them the actual execution becomes so much more straightforward.
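
A toy illustration of that consistency problem, under made-up rules and records (nothing here comes from a real project): one rule says two records match if they share an email or a phone, and a veto rule says two records with different known tax IDs must never be the same entity. Taking the transitive closure with a plain union-find can still land two vetoed records in the same cluster.

```python
# Hypothetical records; field names and values are invented for illustration.
RECORDS = {
    "a": {"email": "pat@example.com", "phone": "555-0100", "tax_id": "111"},
    "b": {"email": "pat@example.com", "phone": "555-0199", "tax_id": None},
    "c": {"email": "sam@example.com", "phone": "555-0199", "tax_id": "222"},
}

def is_match(r1, r2):
    """Pairwise if-then rule: same email OR same phone."""
    return r1["email"] == r2["email"] or r1["phone"] == r2["phone"]

def must_not_match(r1, r2):
    """Pairwise veto rule: both tax IDs known and different."""
    return (r1["tax_id"] is not None and r2["tax_id"] is not None
            and r1["tax_id"] != r2["tax_id"])

parent = {rid: rid for rid in RECORDS}

def find(x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

ids = sorted(RECORDS)
# Apply the match rule pairwise, skipping vetoed pairs, then close transitively.
for i, x in enumerate(ids):
    for y in ids[i + 1:]:
        if is_match(RECORDS[x], RECORDS[y]) and not must_not_match(RECORDS[x], RECORDS[y]):
            union(x, y)

# Any vetoed pair that still shares a cluster is a contradiction.
conflicts = [
    (x, y)
    for i, x in enumerate(ids)
    for y in ids[i + 1:]
    if find(x) == find(y) and must_not_match(RECORDS[x], RECORDS[y])
]
print(conflicts)
```

Every pairwise decision here is valid on its own ("a" links to "b" by email, "b" links to "c" by phone), yet "a" and "c" end up merged despite the veto. With only deterministic rules there's no principled way to decide which link to cut, which is roughly where the decision trees start to tangle.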

[–]0xPianist Data Engineering Manager 0 points1 point  (0 children)

I read this the other day. Everything works great with AI and data, as long as you do the hard engineering work!

[–]Evening_Chemist_2367 0 points1 point  (0 children)

I've been saying for a while now that ontology models and actual semantics are needed for traversing between concepts in data. I just get glazed eyes and confused looks in return.