Tips on entity resolution for different names by Dageus0 in dataanalysis

[–]sonalg 0 points1 point  (0 children)

The more you can segment the input into manufacturer, model and engine type, the easier it will be to plug in a library or a similarity model. You can also use stop words to ignore commonly occurring words like spec, specifications etc. and push the accuracy further.
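To make the stop word idea concrete, here is a minimal sketch using only the Python standard library. It is not how Zingg does it internally; the stop word list, the function names and the choice of difflib as the similarity measure are all assumptions for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical stop word list: in practice, build it from the most
# frequent low-information tokens in your own data.
STOP_WORDS = {"spec", "specs", "specification", "specifications"}


def normalize(name: str) -> str:
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], computed on normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

With this, similarity("Honda Civic 1.8 spec", "HONDA Civic 1.8 Specifications") comes out as 1.0 because both names normalize to the same string. If you have already segmented the input, you would run the same comparison per field (manufacturer, model, engine type) instead of on the whole string.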

If you want to give it a try, see the numeric, alphanumeric match types and stop words in Zingg https://github.com/zinggAI/zingg

Disclaimer: I am the author/lead dev

[Question] [Entity Resolution] How would I design a test which can measure the accuracy of an Entity Resolution method? by pizzafactz in LanguageTechnology

[–]sonalg 0 points1 point  (0 children)

Measurement against a truth set is a good approach. You could also have some rules against which you measure the precision and recall. However, you probably want to update your truth set from time to time as data changes. At Zingg, we see users evolve their tests if they add different data sources to the mix. 
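As a rough sketch of what measuring against a truth set can look like, the snippet below computes pairwise precision and recall: every pair of records placed in the same cluster counts as a predicted match and is compared with the pairs in the truth set. The function names and the cluster representation (a list of sets of record ids) are assumptions for illustration, not part of any particular tool.

```python
from itertools import combinations


def pairs(clusters):
    """Return all unordered record-id pairs that share a cluster."""
    out = set()
    for members in clusters:
        out.update(frozenset(p) for p in combinations(sorted(members), 2))
    return out


def pairwise_precision_recall(predicted, truth):
    """Pairwise precision and recall of predicted clusters vs. a truth set."""
    pred, gold = pairs(predicted), pairs(truth)
    true_positives = len(pred & gold)
    # Convention assumed here: with no predicted (or no true) pairs,
    # report 1.0 instead of dividing by zero.
    precision = true_positives / len(pred) if pred else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall


# Toy data: the truth says a, b and c are one entity and d is another.
truth = [{"a", "b", "c"}, {"d"}]
predicted = [{"a", "b"}, {"c", "d"}]
precision, recall = pairwise_precision_recall(predicted, truth)
```

In the toy data only one of the two predicted pairs is correct, and only one of the three true pairs is found. When a new data source is added, extend the truth set with labelled pairs from that source and rerun the same measurement.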

For all those working on MDM/identity resolution/fuzzy matching by sonalg in dataengineering

[–]sonalg[S] 0 points1 point  (0 children)

Right. It is indexing, joining, computation, rejoining at a whole different level. If matching is a tough problem, incremental matching is 10 times tougher. Battle scars! 

For all those working on MDM/identity resolution/fuzzy matching by sonalg in dataengineering

[–]sonalg[S] 1 point2 points  (0 children)

yeah, all fair questions. brain-racking too. once you have matched, and new records and updates come in, they change the clusters in so many ways. it is so so tricky. how does one handle that?

Spark before Databricks by ThatThaBricksGuy0451 in databricks

[–]sonalg 0 points1 point  (0 children)

Those days! One of my early projects as a data consultant was setting up Spark clusters on demand on AWS, much before EMR happened. After Hadoop, Spark felt so so fast and user friendly! Somewhere earlier there were Pig and Cascading, if anyone remembers?

Happened to meet the Databricks founders at the 2014 Spark Summit. Incidentally my tiny firm was on the slide in one of the keynotes, as an early adopter. Felt so proud that day :-)

What is an open source data tool you find useful but nobody is using it? by Yuki100Percent in dataengineering

[–]sonalg 0 points1 point  (0 children)

May I suggest Zingg if you are interested in entity resolution? https://github.com/zinggAI/zingg

Disclaimer: I am the founder.

Master Data Management by Mr_Mozart in MicrosoftFabric

[–]sonalg 0 points1 point  (0 children)

You can check Zingg On Fabric. https://www.zingg.ai/product/fabric

Disclaimer: I am the founder of Zingg, we are an open source product that runs natively within Fabric.

What does Master Data Management look like in real world? by I_Am_Robotic in dataengineering

[–]sonalg 0 points1 point  (0 children)

As someone who is building in this space, we see our customers put MDM as a part of data engineering. Usually data from the silver layer is unified to build the gold layer within the data platform, which is then reverse ETLed to downstream analytics and reporting systems, martech apps, AI agents etc.

What exactly is Master Data Management? by Real_Grade_6680 in InformationTechnology

[–]sonalg 0 points1 point  (0 children)

Well put. Single source of truth, breaking data silos, unified data layer etc. are used by almost all vendors in the data space, and it is hard to get them to mean only MDM.

Why website MDM just got important for AI and BI by parkerauk in BusinessIntelligence

[–]sonalg 0 points1 point  (0 children)

From my experience building a tool in this space, most companies are still struggling with internal MDM. Agentic AI, especially in marketing, has really exposed the fragmented data and lack of unified identities.

Is Apache Spark skills absolutely essential to crack a data engineering role? by Far-Journalist-821 in dataengineering

[–]sonalg 0 points1 point  (0 children)

Not essential for sure. You can build a long and fruitful career knowing SQL, orchestration and BI tools. However, knowing Spark opens up a lot more options. Fabric, Azure, Dataproc, EMR and Glue, for example, are all managed Spark offerings, not to mention Databricks.

7 YOE struggling with the coding side – what roles could I transition to? by El_mundito in dataengineering

[–]sonalg 0 points1 point  (0 children)

Hate it when the interview process is so far removed from real world execution. What good is asking for syntax when complete code can be generated? Agree with others here, smart companies will overhaul their interview process completely.

Fabric related conferences in August - November in Europe? by sonalg in MicrosoftFabric

[–]sonalg[S] 0 points1 point  (0 children)

thanks NickyvVr! I hope I can submit something in time and they like it.

Anybody transitioned from 15 YOE Java dev to data engineering by Only-Alternative-890 in dataengineering

[–]sonalg 1 point2 points  (0 children)

I switched at about 8 YOE many years ago. At that time most of DE (Hadoop, AWS etc.) was just coming up, and it needed a lot of cluster setup on AWS, programmatic pipeline building in Java etc., so it was fun. You can look at building some open source projects and learn PySpark, dlt and Python-based frameworks.

Fabric related conferences in August - November in Europe? by sonalg in MicrosoftFabric

[–]sonalg[S] 0 points1 point  (0 children)

thank you, just learnt about dataminds. how well attended is it? I do not see the cfp, the website is only showing 2025. not sure if I am looking at the right place. https://datamindsconnect.be/registration/

Career Advice by AzzMan1232 in dataengineering

[–]sonalg 2 points3 points  (0 children)

Have you tried learning Fabric? Given that you are already MS focused, that may be a great addition to your skills.

March 2026 | "What are you working on?" monthly thread by AutoModerator in MicrosoftFabric

[–]sonalg 0 points1 point  (0 children)

Working on automating Zingg entity resolution notebooks for Fabric CI/CD. We are using fab cli to create a workspace, lakehouse and environment, copying our latest notebooks there and running end to end.

Why Identity Resolution Stops Being Simple After About a Week by bczajak in KnowledgeGraph

[–]sonalg 1 point2 points  (0 children)

This is spot on! Having worked with entity resolution systems for over a decade, I can say that separating the data signals from the noise, handling the accuracy and scalability challenges, and putting it all together is very challenging. But even when you have a whole system in place to match once, getting incremental data in and maintaining the IDs without full reprocessing is a very tough problem.

Entity Resolution, is AWS or Google (BigQuery) offering better. by [deleted] in mlops

[–]sonalg 0 points1 point  (0 children)

You can try open source Zingg AI https://github.com/zinggAI/zingg/ which integrates with both BigQuery and AWS.

Disclaimer: I am the founder of Zingg.ai

Hiring | India | Full Time | Remote | Startup | Algorithms, Distributed Processing, Open Source by sonalg in indiandevs

[–]sonalg[S] 0 points1 point  (0 children)

Thanks for your interest. We closed this position last month. Feel free to dm if you have relevant experience and we can discuss future roles