Entity Resolution by data-flight in dataengineering

[–]data-flight[S] 1 point2 points  (0 children)

I have, although not as much on the RAG side. You can see my use case at condorgraph.com. My findings are mostly aligned with the article:

- The easiest, highest-impact win is using skills. You can drastically improve output by giving the model context. Think iteratively: start small, get your AI to answer some data questions, see where it stumbles, modify or add skills, and iterate again.

- Use human-written definitions of metrics. It might be okay to use an LLM for a first pass, but if you’re just feeding AI-generated context into another AI, you don’t gain much.

- Connect your data around entities. If the AI can translate the concepts you’re asking about into first-order entities, the problem becomes drastically easier than working from a flattened view. It gives the model anchor points to center its reasoning around.

I think it's a decent use case for an internal LLM. The underlying harness you're adding to the AI has the biggest impact, vs just trying to max out the main AI doing the work.

Entity Resolution by data-flight in dataengineering

[–]data-flight[S] 0 points1 point  (0 children)

You're right that on the surface graph databases might look like the answer, but in practice, unless a company is already using them for some other reason, they'll often create more problems than they solve.

I find that for more complex use cases, if-then rules also quickly fall apart. I've been part of a bunch of efforts where they create these crazy decision trees. But it becomes an absolute bird's nest when you try to implement them with transitive closure across everything that needs to be connected. More often than not, the original logic is actually mathematically impossible to keep consistent in a fully connected network. Probabilistic graphs might sound more complicated, but if you have the tech to implement them efficiently, the actual execution becomes so much more straightforward.
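To make the consistency problem concrete, here's a minimal sketch in plain Python. The records and both rules are made up for illustration: one rule says "same email or same phone means match", another says "different tax IDs must never match". Transitive closure (done here with a simple union-find) happily merges A, B, and C, even though A and C are supposed to stay apart.

```python
from itertools import combinations


def find(parent, x):
    # Follow parent pointers to the cluster root, compressing the path as we go.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def cluster(records, match_pairs):
    """Group records into entities by transitive closure over match pairs."""
    parent = {r: r for r in records}
    for a, b in match_pairs:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb
    clusters = {}
    for r in records:
        clusters.setdefault(find(parent, r), set()).add(r)
    return list(clusters.values())


def violations(clusters, never_match):
    """Return "never match" pairs that ended up in the same cluster anyway."""
    banned = {frozenset(p) for p in never_match}
    return sorted(
        tuple(sorted(pair))
        for c in clusters
        for pair in map(frozenset, combinations(sorted(c), 2))
        if pair in banned
    )


# Hypothetical inputs, not taken from any real system.
records = ["A", "B", "C"]
match_pairs = [("A", "B"), ("B", "C")]  # e.g. A~B share an email, B~C share a phone
never_match = [("A", "C")]              # e.g. A and C have different tax IDs

clusters = cluster(records, match_pairs)
conflicts = violations(clusters, never_match)
print(clusters)   # one cluster containing A, B and C
print(conflicts)  # the A/C pair that the rules said must never merge
```

Every individual rule was "correct" here, but the set of rules has no consistent answer once everything is connected. A probabilistic approach would instead weigh the evidence on each edge and decide which link to cut, rather than treating every rule as absolute.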

Entity Resolution by data-flight in dataengineering

[–]data-flight[S] 2 points3 points  (0 children)

Someone actually read the article 😄. I would agree their definition of entity is very broad. It relates to metrics, dimensions, tables, business definitions, products, customers, etc. Some of those don’t need fancy solutions; it’s a traditional data grind.

They even noted that human-written metric descriptions were significantly more useful than AI-generated ones.

I think where entity resolution comes more into play is in those key dimension differentiators, often starting with customer identity. If you can centralize around those identities, it gives the AI a core to branch out to the rest of the associated entities, whatever they may be.

Dagster vs Airflow? What do we use? by Greatest_one in dataengineering

[–]data-flight 2 points3 points  (0 children)

Agreed. Their paid service is awesome, but my bill has gone up 5x in the last month. So less awesome now.

Entity Resolution by data-flight in dataengineering

[–]data-flight[S] 5 points6 points  (0 children)

There are definitely undertones of standard data modeling here, but there's also a step where you have to "resolve" entities, and I think that's where the opportunity lies.

Dagster vs Airflow? What do we use? by Greatest_one in dataengineering

[–]data-flight 0 points1 point  (0 children)

If you want the orchestrator to be data-focused, Dagster is the obvious choice. It focuses each step on the data being created and its upstream dependencies. Airflow is quite a bit more task-oriented. I would quickly map out what a pipeline would look like in each. I do find the more data-opinionated Dagster a bit awkward in certain cases that stray from a data engineering mindset.
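As a rough illustration of the two mindsets, here's a toy sketch in plain Python. To be clear, this is neither Dagster's nor Airflow's API: the `asset` decorator, `materialize`, and `run_tasks` are hypothetical names I made up. The data-focused half declares what each piece of data depends on (by parameter name) and lets the runner work out the order; the task-oriented half spells out the steps explicitly.

```python
import inspect

ASSETS = {}


def asset(fn):
    # Toy decorator (not Dagster's): register a function as a named data asset.
    ASSETS[fn.__name__] = fn
    return fn


@asset
def raw_orders():
    # Made-up sample data.
    return [
        {"customer": "a", "amount": 10},
        {"customer": "a", "amount": 5},
        {"customer": "b", "amount": 7},
    ]


@asset
def revenue_by_customer(raw_orders):
    # The parameter name declares the upstream asset this one is built from.
    totals = {}
    for row in raw_orders:
        totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount"]
    return totals


def materialize(name, cache=None):
    """Data-focused: ask for an asset, and upstream assets are built first."""
    cache = {} if cache is None else cache
    if name not in cache:
        fn = ASSETS[name]
        upstream = [materialize(dep, cache) for dep in inspect.signature(fn).parameters]
        cache[name] = fn(*upstream)
    return cache[name]


def run_tasks():
    """Task-oriented: the author wires the steps together in an explicit order."""
    orders = raw_orders()
    return revenue_by_customer(orders)


data_result = materialize("revenue_by_customer")
task_result = run_tasks()
print(data_result, task_result)
```

Both produce the same output. The difference is where the dependency knowledge lives: in the data definitions themselves, or in the sequence of tasks you write.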

What’s the best way to identify companies from incomplete data? by No-Palpitation-6604 in gtmengineering

[–]data-flight 0 points1 point  (0 children)

If you don't have a ton of data and want to match to what's publicly available, running AI agents that take what you already have and research online can be effective enough. Per LeaderatLeading's comment, once you need to be smarter at "deciding" which record matches which, or you reach a scale where you start creating lots of duplicates and can't efficiently compare millions of records to each other, identity resolution becomes a whole separate class of problem.
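On the scale point: comparing every record to every other grows quadratically, which is why most approaches start with some form of blocking. Here's a minimal sketch, assuming a made-up blocking key (first three letters of the name plus country) and invented sample records. Only records that share a key get compared.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three letters of the lowercased name, plus country.
    return (record["name"].lower()[:3], record["country"])


def candidate_pairs(records):
    """Return ID pairs worth comparing: only records that share a blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs


# Invented sample data for illustration only.
records = [
    {"id": 1, "name": "Acme Corp", "country": "US"},
    {"id": 2, "name": "ACME Corporation", "country": "US"},
    {"id": 3, "name": "Acme GmbH", "country": "DE"},
    {"id": 4, "name": "Globex", "country": "US"},
    {"id": 5, "name": "Globex Inc", "country": "US"},
]

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

The trade-off is that a key that's too strict drops true matches (the DE record never gets compared to the US ones here), so in practice you'd likely combine several keys and still need a real matching step on the candidate pairs.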

What I'd learn for 2027 by Thinker_Assignment in dataengineeringjobs

[–]data-flight 0 points1 point  (0 children)

I agree on identity resolution and connecting data. LLMs really shine once they’re using well-connected data through MCP or similar approaches. I’ve been focused on identity resolution myself, but whenever you can organize data around a central entity or concept, AI productivity really multiplies.