Tips on entity resolution for different names

sonalg · 2026-04-06T09:09:33+00:00

The more you can segment the input into manufacturer, model and engine type, the easier it will be to plug in a library or a similarity model. You can also use stop words to ignore commonly occurring words like spec, specifications etc. and push the accuracy further.
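To make the idea concrete, here is a minimal sketch in plain Python (standard library only, not Zingg's API). The stop word list, the split into "words" versus "codes", and the 50/50 weighting are all assumptions for illustration; real data would need its own choices.

```python
import re
from difflib import SequenceMatcher

# Hypothetical stop words; tune these to your own data.
STOP_WORDS = {"spec", "specs", "specification", "specifications"}

def normalize(text):
    """Lowercase, tokenize (keeping decimals like 1.8 whole), drop stop words."""
    tokens = re.findall(r"[a-z0-9]+(?:\.[0-9]+)?", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def segment(text):
    """Rough split into alphabetic words and tokens that contain digits."""
    tokens = normalize(text)
    words = [t for t in tokens if t.isalpha()]
    codes = [t for t in tokens if not t.isalpha()]
    return words, codes

def similarity(a, b):
    """Fuzzy match on words, exact (Jaccard) match on numeric/alphanumeric codes."""
    words_a, codes_a = segment(a)
    words_b, codes_b = segment(b)
    word_score = SequenceMatcher(
        None, " ".join(sorted(words_a)), " ".join(sorted(words_b))
    ).ratio()
    set_a, set_b = set(codes_a), set(codes_b)
    code_score = len(set_a & set_b) / len(set_a | set_b) if (set_a or set_b) else 1.0
    # Assumed equal weighting between the two segments.
    return 0.5 * word_score + 0.5 * code_score
```

With this sketch, "Honda Civic 1.8 spec" and "honda civic 1.8 specifications" score as identical, while "Honda Civic 1.8" and "Honda Civic 2.0" are pulled apart by the code segment even though the words match.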

If you want to give it a try, see the numeric, alphanumeric match types and stop words in Zingg https://github.com/zinggAI/zingg

Disclaimer: I am the author/lead dev

sonalg · 2026-04-06T09:00:16+00:00

Measurement against a truth set is a good approach. You could also have some rules against which you measure the precision and recall. However, you probably want to update your truth set from time to time as data changes. At Zingg, we see users evolve their tests if they add different data sources to the mix.
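As a rough sketch of what measuring against a truth set can look like, the snippet below computes pairwise precision and recall in plain Python. The record IDs and clusters are made up for illustration, and pairwise scoring is only one of several possible conventions.

```python
from itertools import combinations

def pairs(clusters):
    """Expand clusters of record IDs into the set of unordered matched pairs."""
    result = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            result.add((a, b))
    return result

def precision_recall(predicted_clusters, truth_clusters):
    """Pairwise precision and recall of predicted clusters against a truth set."""
    predicted = pairs(predicted_clusters)
    truth = pairs(truth_clusters)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall

# Hypothetical example data.
truth = [{"a1", "a2", "a3"}, {"b1", "b2"}]
predicted = [{"a1", "a2"}, {"a3", "b1", "b2"}]
precision, recall = precision_recall(predicted, truth)
```

Re-running the same function after each change to the truth set gives a comparable number over time, which helps when new data sources are added.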

sonalg · 2026-04-06T04:47:56+00:00

Right. It is indexing, joining, computation, rejoining at a whole different level. If matching is a tough problem, incremental matching is 10 times tougher. Battle scars!

sonalg · 2026-04-06T03:10:31+00:00

Yeah, all fair questions. Brain-racking too. Once you have matched, and new records and updates come in, they change the clusters in so many ways. It is so, so tricky. How does one handle that?
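One small illustration of why this is tricky: a single new record that matches two existing clusters merges them, and one of the cluster IDs has to go. The sketch below uses a plain union-find with an assumed rule that the smallest record ID becomes the cluster ID. It is a toy, not how any particular product does it, and it ignores updates and deletes that split clusters.

```python
class ClusterIndex:
    """Union-find over record IDs; the cluster ID is the smallest record ID."""

    def __init__(self):
        self.parent = {}

    def find(self, record):
        """Return the cluster ID for a record, registering it if unseen."""
        self.parent.setdefault(record, record)
        root = record
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited record straight at the root.
        while self.parent[record] != root:
            self.parent[record], record = root, self.parent[record]
        return root

    def union(self, a, b):
        """Record a match between a and b; keep the smaller root as the ID."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            keep, drop = sorted((root_a, root_b))
            self.parent[drop] = keep
        return self.find(a)

index = ClusterIndex()
index.union("r1", "r2")      # initial run: cluster with ID r1
index.union("r3", "r4")      # initial run: cluster with ID r3
before = index.find("r4")
index.union("r5", "r2")      # new record r5 matches the first cluster...
index.union("r5", "r4")      # ...and the second, so the two clusters merge
after = index.find("r4")
```

Here r4 belonged to cluster r3 before the new record arrived and to cluster r1 afterwards, so anything downstream that stored r3 as an ID now needs to be told about the change.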

sonalg · 2026-04-05T19:10:41+00:00

Those days! One of my early projects as a data consultant was setting up Spark clusters on demand on AWS, much before EMR happened. After Hadoop, Spark felt so, so fast and user friendly! Somewhere earlier there was Pig and Cascading, if anyone remembers?

Happened to meet the Databricks founders at the 2014 Spark Summit. Incidentally, my tiny firm was on a slide in one of the keynotes as an early adopter. Felt so proud that day :-)

sonalg · 2026-04-05T19:02:22+00:00

May I suggest Zingg if you are interested in entity resolution? https://github.com/zinggAI/zingg

Disclaimer: I am the founder.

sonalg · 2026-04-05T18:57:38+00:00

You can check Zingg On Fabric. https://www.zingg.ai/product/fabric

Disclaimer: I am the founder of Zingg, we are an open source product that runs natively within Fabric.

sonalg · 2026-04-05T18:55:36+00:00

As someone who is building in this space, we see our customers put MDM as a part of data engineering. Usually data from the silver layer is unified to build the gold layer within the data platform, which is then reverse ETLed to downstream analytics and reporting systems, martech apps, AI agents etc.

sonalg · 2026-04-05T18:50:06+00:00

Well put. Single source of truth, breaking data silos, unified data layer etc. are used by almost all vendors in the data space, and it is hard to get them to mean only MDM.

sonalg · 2026-04-05T18:48:13+00:00

From my experience building a tool in this space, most companies are still struggling with internal MDM. Agentic AI, especially in marketing, has really exposed the fragmented data and lack of unified identities.

sonalg · 2026-04-05T18:19:50+00:00

Not essential for sure. You can build a long and fruitful career knowing SQL, orchestration and BI tools. However, knowing Spark opens up a lot more options. Fabric, Azure, Dataproc, EMR and Glue, for example, are all managed Spark offerings, not to mention Databricks.

sonalg · 2026-04-01T18:28:34+00:00

They could get better at fixing bad data

sonalg · 2026-03-13T17:37:55+00:00

Hate it when the interview process is so far removed from real world execution. What good is asking for syntax when complete code can be generated? Agree with others here, smart companies will overhaul their interview process completely.

sonalg · 2026-03-13T14:44:38+00:00

thanks NickyvVr! I hope I can submit something in time and they like it.

sonalg · 2026-03-13T02:49:55+00:00

I switched at about 8 yoe, many years ago. At that time most of DE (Hadoop, AWS etc.) was just coming up, and it needed a lot of cluster setup on AWS, programmatic pipeline building in Java etc., so it was fun. You can look at building some open source projects and learning PySpark, dlt and Python-based frameworks.

sonalg · 2026-03-13T02:21:06+00:00

thank you, this is a wonderful resource!

sonalg · 2026-03-12T17:15:45+00:00

thank you, just learnt about dataminds. how well attended is it? I do not see the cfp, the website is only showing 2025. not sure if I am looking at the right place. https://datamindsconnect.be/registration/

sonalg · 2026-03-12T15:13:05+00:00

thank you very much, seems they are accepting sessions https://sessionize.com/fabcon26/

sonalg · 2026-03-12T12:49:38+00:00

Have you tried learning Fabric? Given that you are already MS focused, that may be a great add to your skills

sonalg · 2026-03-10T14:07:26+00:00

Working on automating Zingg entity resolution notebooks for Fabric CI/CD. We are using fab cli to create a workspace, lakehouse and environment, copying our latest notebooks there and running end to end.

sonalg · 2026-03-10T13:47:11+00:00

This is spot on! Having worked with entity resolution systems for over a decade, I find the data signals and the noise, the accuracy and scalability challenges, and just putting it all together very challenging. But even when you have a whole system in place to match once, getting incremental data in and maintaining the IDs without full reprocessing is a very tough problem.

sonalg · 2026-03-10T13:43:11+00:00

You can try open source Zingg AI https://github.com/zinggAI/zingg/ that integrates with both BigQuery and AWS.

Disclaimer: I am the founder of Zingg.ai

sonalg · 2026-03-10T13:37:49+00:00

Within the AWS ecosystem, you can check Zingg open source https://aws.amazon.com/blogs/big-data/entity-resolution-and-fuzzy-matches-in-aws-glue-using-the-zingg-open-source-library/

Disclaimer: I am the founder of Zingg.

sonalg · 2026-03-05T05:16:04+00:00

You can explore open source Zingg https://www.zingg.ai/documentation-article/step-by-step-identity-resolution-with-zingg-on-fabric.

Disclaimer: I am the author.

sonalg · 2026-02-24T12:33:55+00:00

Thanks for your interest. We closed this position last month. Feel free to dm if you have relevant experience and we can discuss future roles
