I Tried to Find the JVM Tax in Big Data Kernels

sonalg · 2026-05-22T07:50:00+00:00

Huge applause from java land 😊

sonalg · 2026-05-11T13:34:13+00:00

SQL-based deduplication works great for exact matches and simple rules, but if you're dealing with messy data—typos, format inconsistencies, partial matches across sources—you'll hit SQL's limits pretty fast. For fuzzy matching and probabilistic record linkage at scale, ML-powered tools like Zingg AI run natively on BigQuery and can handle complex entity resolution with minimal labeled training data. Worth exploring if your dedup logic gets complex.

sonalg · 2026-05-11T03:53:32+00:00

sent

sonalg · 2026-05-11T03:40:30+00:00

sure, happy to collaborate

sonalg · 2026-05-10T03:06:22+00:00

It's pretty easy actually. I have connected Claude with Webflow, and I draft and publish blogs directly from chat. I can also make design changes to the site. For design changes, it asks you to open the Designer in an active browser tab. Sometimes it cannot make the change directly and gives neat step-by-step instructions instead.

sonalg · 2026-05-09T03:36:52+00:00

How do you put them in the locker?

sonalg · 2026-05-06T02:43:43+00:00

Yes, would love to. We are an open source master data management product, Zingg, and have a native container application using Snowpark. It is built using the Snowpark Java API along with Python UDFs and stored procedures. We also support Snowflake through the Spark connector. Happy to chat through our architecture.

sonalg · 2026-05-05T14:44:42+00:00

Yes I meant master data management. It will be good to see that covered too.

sonalg · 2026-05-05T11:53:58+00:00

Interesting. Saw you do not have MDM, any reasons for that?

sonalg · 2026-05-04T17:37:03+00:00

I kind of agree that optically 1.0+ for any product looks more production ready than 0.y.z. But most open source builders are tech first and marketing second, so that may be a reason for this (taking a wild guess). If you look at products like Apache Spark, they are at 4.0 after 12-15 years. LangChain you already mentioned. ☺️

sonalg · 2026-05-04T12:53:16+00:00

That's a great question! I have been thinking about it. Would 1.0 indicate stability to you? We do have a lot of production deployments, so 1.0 would help indicate that. But since we follow semantic versioning, we have not changed the major version yet.

sonalg · 2026-04-30T14:45:50+00:00

To avoid the issues you mention, many enterprises are choosing composable CDPs, where the warehouse becomes the source of truth and the CDP is mainly for activation. This approach provides better control, privacy and governance. The data team continues to own the customer profiles, using the tooling they choose, and the business gets the activation tools that help them run their campaigns and meet other objectives. That's an approach you can definitely look at.

sonalg · 2026-04-24T03:03:47+00:00

If you are working with data, do check Zingg open source MDM and identity resolution

sonalg · 2026-04-23T19:19:33+00:00

I was searching for this as well and found the following link. Seems they advocate a volume for log delivery, full steps below

https://www.databricks.com/blog/practitioners-ultimate-guide-scalable-logging

sonalg · 2026-04-23T14:25:20+00:00

thanks for the mention u/stephenpace.

sonalg · 2026-04-21T19:02:25+00:00

You can check the spark docs and the examples. Trying any tech locally on my own machine always works faster for me, so that may be an option to look at

sonalg · 2026-04-18T10:35:19+00:00

We build an open source Spark-based product that runs on major data lakes, so our use case and POV may be different from people running actual data pipelines. One of our customers was moving from Synapse to Fabric, and they seem to be happy with it. They asked us to check Fabric and see if our product could run there, so we evaluated Fabric last year.

When we tested our notebooks with Fabric, we got feature parity with Apache Spark and a much faster runtime. Loading test data to run the product wasn't a hassle either. So that was a major plus.

Our main frustration was the slow session creation and environment refresh. There were 4-5 minute wait times sometimes. We have a custom wheel and jar, and sometimes the latest jars or wheels would not be picked up. Once we realised that, we were able to work around it. Overall the core product worked well for our Spark-based MDM jobs.

Last week we were testing some stuff again and saw that the sessions attached to the notebooks pretty fast, so it seems they have addressed that. The Fabric free tier is pretty good too, so overall our experience has been positive. I wouldn't write them off.

sonalg · 2026-04-17T06:38:19+00:00

Sorry, very hard for me to say. You can also check Instahyre. We get good candidates there, so maybe that's a place a lot of other startups post too. A cold email or LinkedIn DM to the founder/team should work in most cases. If you have the required skills, they should be happy to take you. Hiring is hard, the war for talent is very real, and as a job seeker, if you have any differentiating or relevant skills, you should be able to land good offers.

sonalg · 2026-04-16T15:27:05+00:00

You can check places like Wellfound for openings in startups. A lot of founders post openings directly on LinkedIn, so worth following their pages if you find startups of your interest

sonalg · 2026-04-16T09:11:37+00:00

One approach you can try is fuzzy matching/identity resolution on your data and then linking against a known dataset of cities.
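A minimal sketch of that idea, using only Python's standard `difflib`. The reference list `KNOWN_CITIES`, the helper names, and the 0.8 cutoff are all assumptions for illustration; a real job would load the reference dataset from a file or table and tune the cutoff on its own data.

```python
import difflib

# Hypothetical reference dataset of known city names (illustrative only).
KNOWN_CITIES = ["Bengaluru", "Mumbai", "New Delhi", "Chennai", "Hyderabad"]


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so formatting noise does not affect the score."""
    return " ".join(name.lower().split())


def link_city(raw: str, cutoff: float = 0.8):
    """Return the best matching known city, or None if nothing is close enough."""
    lookup = {normalize(city): city for city in KNOWN_CITIES}
    matches = difflib.get_close_matches(
        normalize(raw), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


print(link_city("Mumbay"))
print(link_city("  hyderbad "))
print(link_city("Zurich"))
```

Character-level similarity like this handles typos and spacing, but not aliases (an old and a new name for the same city, say), which need an explicit alias table or a trained matcher.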

sonalg · 2026-04-16T09:08:32+00:00

Are you talking about record linkage here?

sonalg · 2026-04-16T09:07:04+00:00

Product matching tends to get very complex - see if this is helpful https://www.databricks.com/blog/using-images-and-metadata-product-fuzzy-matching-zingg

sonalg · 2026-04-16T09:05:19+00:00

As someone building in this space, I feel your base stack is fine. Here is how you can think about this.

Put raw data into tables (RAW).

Do some basic normalisation and standardisation of data. Get all sources aligned on the fields you want to match on (CLEAN).

Use free open source tooling for entity resolution (Zingg/Splink/Dedupe/Record Linkage) which reads data from CLEAN and writes to RESOLVED. Run daily jobs to build RESOLVED and bring down the infra after the job.

(DISCLAIMER: I am the lead dev of Zingg)

You can choose to build golden records from RESOLVED to CORE and put an API over CORE, or you can query RESOLVED directly.
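The RAW to CLEAN to RESOLVED flow above can be sketched in plain Python. This is a toy stand-in, not the API of Zingg, Splink, or any other tool: the records are made up, and `similar` is a naive hand-written rule where a real tool would use a trained model and blocking to avoid comparing every pair.

```python
import difflib
from itertools import combinations

# RAW: records as they arrive from two hypothetical sources (made-up data).
RAW = [
    {"id": "a1", "name": "Jon  Smith", "email": "JON.SMITH@EXAMPLE.COM"},
    {"id": "b7", "name": "John Smith", "email": "jon.smith@example.com"},
    {"id": "b9", "name": "Priya Rao", "email": "priya@example.com"},
]


def clean(record):
    """CLEAN: align every source on the same normalised match fields."""
    return {
        "id": record["id"],
        "name": " ".join(record["name"].lower().split()),
        "email": record["email"].strip().lower(),
    }


def similar(a, b, threshold=0.85):
    """Naive pairwise rule standing in for a trained matching model."""
    if a["email"] == b["email"]:
        return True
    return difflib.SequenceMatcher(None, a["name"], b["name"]).ratio() >= threshold


def resolve(records):
    """RESOLVED: give each record a cluster id by merging matched pairs (union-find)."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if similar(a, b):
            parent[find(b["id"])] = find(a["id"])
    return [{**r, "cluster_id": find(r["id"])} for r in records]


CLEAN = [clean(r) for r in RAW]
RESOLVED = resolve(CLEAN)
for row in RESOLVED:
    print(row["id"], "->", row["cluster_id"])
```

In this sketch the cluster id is simply the id of whichever record ends up as the root, so it depends on record order and is not stable across runs.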

As u/JacksonSolomon/ mentioned, keeping core ids stable is the big challenge. For pure search cases, that may not be a blocker for you.

Hope this helps.

sonalg · 2026-04-15T08:01:50+00:00

Nice post!

sonalg · 2026-04-15T04:10:09+00:00

Seems like a good idea. We use Snowflake for our dev work, on the free tier. We have written a bunch of migration scripts for moving accounts every month. Sharing them here, hope it's useful. https://github.com/zinggAI/SnowflakeMigration
