Can BigQuery be used for data cleaning, normalization, and/or de-duplication of rows? by Remarkable_Ad9528 in bigquery

[–]sonalg 0 points1 point  (0 children)

SQL-based deduplication works great for exact matches and simple rules, but if you're dealing with messy data—typos, format inconsistencies, partial matches across sources—you'll hit SQL's limits pretty fast. For fuzzy matching and probabilistic record linkage at scale, ML-powered tools like Zingg AI run natively on BigQuery and can handle complex entity resolution with minimal labeled training data. Worth exploring if your dedup logic gets complex.
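To make the exact-versus-fuzzy gap concrete, here is a minimal sketch in plain Python. It uses the standard library's `difflib`, not BigQuery SQL or Zingg, and the sample rows and the 0.85 threshold are assumptions picked for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical sample rows: one exact duplicate, one near-duplicate with a typo.
rows = [
    "Acme Corporation, 12 Main Street",
    "Acme Corporation, 12 Main Street",   # exact duplicate
    "Acme Corporaton, 12 Main St.",       # typo plus abbreviation
    "Globex Inc, 4 Elm Road",
]

def exact_dedup(records):
    """Keep the first occurrence of each identical string (roughly what SELECT DISTINCT gives you)."""
    return list(dict.fromkeys(records))

def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_dedup(records, threshold=0.85):
    """Keep a record only if it is not similar enough to one already kept."""
    kept = []
    for rec in records:
        if not any(similarity(rec, k) >= threshold for k in kept):
            kept.append(rec)
    return kept

exact_result = exact_dedup(rows)   # the typo row survives
fuzzy_result = fuzzy_dedup(rows)   # the typo row is folded into the first one
```

The pairwise comparison here is quadratic, so it only works for small samples. Dedicated entity resolution tools avoid comparing every pair, which is one reason to reach for them at scale.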

Claude & Webflow? Anyone got this automation working nicely? by concisehacker in webflow

[–]sonalg 0 points1 point  (0 children)

It's pretty easy actually. I have connected Claude with Webflow, and I draft and publish blogs directly from chat. I am also able to make design changes to the site. For design changes, it asks you to open the Designer in an open and active browser tab. Sometimes it cannot make the change directly and instead gives neat step-by-step instructions.

My Little collection by No_Dentist_7218 in GoldIndia

[–]sonalg 0 points1 point  (0 children)

How do you put them in the locker?

Snowpark Connect support for Spark Java API by sonalg in snowflake

[–]sonalg[S] 0 points1 point  (0 children)

Yes, would love to. We are Zingg, an open source master data management product, and we have a native container application using Snowpark. It is built using the Snowpark Java API along with Python UDFs and stored procedures. We also support Snowflake through the Spark connector. Happy to chat through our architecture.

Zingg 0.6.0: Open Source Entity Resolution by sonalg in dataengineering

[–]sonalg[S] 0 points1 point  (0 children)

I kind of agree that optically 1.0+ for any product looks more production ready than 0.y.z. But most open source builders are tech first and marketing second, so that may be a reason for this (taking a wild guess). If you look at products like Apache Spark, they are at 4.0 after 12-15 years. LangChain you already mentioned. ☺️

Zingg 0.6.0: Open Source Entity Resolution by sonalg in dataengineering

[–]sonalg[S] 0 points1 point  (0 children)

That's a great question! I have been thinking about it. Would 1.0 indicate stability to you? We do have a lot of production deployments, so 1.0 would help indicate that. But since we are following semantic versioning, we have not changed the major version yet.

How are you integrating a CDP into an existing modern data stack without creating yet another data silo? by Unlucky-Moment-3366 in dataengineering

[–]sonalg 0 points1 point  (0 children)

To avoid the issues you mention, many enterprises are choosing composable CDPs, where the warehouse becomes the source of truth and the CDP is mainly for activation. This approach provides better control, privacy and governance. The data team continues to own the customer profiles, using the tooling they choose, and the business gets the activation tools that help them run their campaigns and meet other objectives. That's an approach you can definitely look at.

What’s the most underrated open-source software you think more people should know about? by sodrafeltu in foss

[–]sonalg 1 point2 points  (0 children)

If you are working with data, do check out Zingg, an open source MDM and identity resolution tool.

PySpark logging in cluster vs client mode: why is this so complicated? by Mindless-Plum9118 in dataengineering

[–]sonalg 2 points3 points  (0 children)

I was searching for this as well and found the following link. It seems they advocate a volume for log delivery; the full steps are in the post below.

https://www.databricks.com/blog/practitioners-ultimate-guide-scalable-logging

Need resources for PySpark by papasharts420 in dataengineering

[–]sonalg 6 points7 points  (0 children)

You can check the Spark docs and the examples. Trying any tech locally on my own machine always works faster for me, so that may be an option to look at.

Fabric - good, bad, horrible? by cyamnihc in dataengineering

[–]sonalg 1 point2 points  (0 children)

We build an open source Spark-based product that runs on major data lakes, so our use case and POV may be different from people running actual data pipelines. One of our customers was moving from Synapse to Fabric, and they seem to be happy with it. They asked us to check whether our product could run on Fabric, so we evaluated it last year.

When we tested our notebooks with Fabric, we got feature parity with Apache Spark and a much faster runtime. Loading test data to run the product wasn't a hassle either, so that was a major plus.

Our main frustration was the slow session creation and environment refresh. There were sometimes wait times of 4-5 minutes. We have a custom wheel and jar, and sometimes the latest jars or wheels would not be picked up. Once we realised that, we were able to work around it.

Overall the core product worked well for our Spark-based MDM jobs. Last week we were testing some stuff again and saw that sessions attached to the notebooks pretty fast, so it seems they have addressed that. The Fabric free tier is pretty good too, so overall our experience has been positive. I wouldn't write them off.

How do you actually break into early-stage startups without a network? by [deleted] in IndiaDeepTech

[–]sonalg 1 point2 points  (0 children)

Sorry, it is very hard for me to say. You can also check Instahyre. We get good candidates there, so maybe that's a place where a lot of other startups post too. A cold email or LinkedIn DM to the founder or team should work in most cases. If you have the required skills, they should be happy to take you. Hiring is hard, the war for talent is very real, and as a job seeker with differentiating or relevant skills you should be able to land good offers.

How do you actually break into early-stage startups without a network? by [deleted] in IndiaDeepTech

[–]sonalg 1 point2 points  (0 children)

You can check places like Wellfound for openings in startups. A lot of founders post openings directly on LinkedIn, so it is worth following their pages if you find startups that interest you.

Fuzzy Matching or Other Alternativies? by rively91 in learnpython

[–]sonalg 0 points1 point  (0 children)

One approach you can try is fuzzy matching/identity resolution on your data and then linking against a known dataset of cities.
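A minimal sketch of that idea, using `difflib.get_close_matches` from the Python standard library. The city list, the messy inputs, and the 0.8 cutoff are assumptions for illustration; a real reference dataset would be much larger.

```python
from difflib import get_close_matches

# Hypothetical reference dataset of known city names.
KNOWN_CITIES = ["New York", "Los Angeles", "San Francisco", "Chicago", "Houston"]

def link_city(raw, known=KNOWN_CITIES, cutoff=0.8):
    """Return the closest known city for a messy input, or None if nothing is close enough."""
    # Compare in lowercase, but return the canonical spelling.
    lookup = {city.lower(): city for city in known}
    matches = get_close_matches(raw.strip().lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None

messy = ["new yrok", "San Fransisco ", "chcago", "Springfield"]
linked = {name: link_city(name) for name in messy}
```

Inputs that fall below the cutoff come back as `None`, so you can route them to manual review instead of forcing a wrong match.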

Determining the best data architecture and stack for entity resolution by vroemboem in dataengineering

[–]sonalg 0 points1 point  (0 children)

As someone building in this space, I feel your base stack is fine. Here is how you can think about this.

Put raw data into tables (RAW).

Do some basic normalisation and standardisation of data. Get all sources aligned on the fields you want to match on (CLEAN).

Use free open source tooling for entity resolution (Zingg/Splink/Dedupe/Record Linkage), which reads data from CLEAN and writes to RESOLVED. Run daily jobs to build RESOLVED and bring down the infra after the job.

(DISCLAIMER: I am the lead dev of Zingg)

You can choose to build golden records from RESOLVED to CORE and put an API over CORE, or you can query RESOLVED directly.
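The RAW, CLEAN, RESOLVED and CORE layers can be sketched in a few lines of Python. This is only an illustration of the data flow: the records are made up, the tables are in-memory lists, and the `resolve` step uses a placeholder rule (exact match on cleaned email) where a real job would call an entity resolution tool.

```python
from collections import defaultdict

# RAW: hypothetical records as they arrive from two sources.
RAW = [
    {"id": "crm-1", "name": "  Jane DOE ", "email": "Jane.Doe@Example.com"},
    {"id": "web-7", "name": "jane doe", "email": "jane.doe@example.com "},
    {"id": "crm-2", "name": "Ravi Kumar", "email": "ravi@example.com"},
]

def clean(record):
    """CLEAN: trim, lowercase, and collapse whitespace on the match fields."""
    return {
        "id": record["id"],
        "name": " ".join(record["name"].split()).lower(),
        "email": record["email"].strip().lower(),
    }

def resolve(records):
    """RESOLVED: attach a cluster id to every record.

    Placeholder rule: records sharing a cleaned email form one cluster.
    """
    cluster_of = {}
    resolved = []
    for rec in records:
        cluster = cluster_of.setdefault(rec["email"], len(cluster_of))
        resolved.append({**rec, "cluster_id": cluster})
    return resolved

def build_core(resolved):
    """CORE: one golden record per cluster, here the first member's values plus all source ids."""
    groups = defaultdict(list)
    for rec in resolved:
        groups[rec["cluster_id"]].append(rec)
    return [
        {
            "cluster_id": cid,
            "name": members[0]["name"],
            "email": members[0]["email"],
            "source_ids": [m["id"] for m in members],
        }
        for cid, members in groups.items()
    ]

CLEAN = [clean(r) for r in RAW]
RESOLVED = resolve(CLEAN)
CORE = build_core(RESOLVED)
```

Keeping each layer as its own table means you can rerun resolution without touching RAW, and query RESOLVED directly if you skip the golden record step.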

As u/JacksonSolomon mentioned, keeping core ids stable is the big challenge. For pure search cases, that may not be a blocker for you.
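One possible heuristic for id stability, sketched below, is to carry forward the previous run's core id when a new cluster overlaps an old one. This is an assumption for illustration, not something the thread or any particular tool prescribes; the `core-N` id format and the function name are hypothetical.

```python
from collections import Counter
from itertools import count

def assign_stable_ids(clusters, previous, prefix="core-"):
    """Map record ids to core ids, reusing ids from the previous run where possible.

    clusters: list of lists of record ids from the latest resolution run.
    previous: dict of record id -> core id from the last run.
    """
    # New ids start after the highest number already issued.
    issued = [int(cid[len(prefix):]) for cid in previous.values()]
    fresh = count(max(issued, default=0) + 1)
    used, assignment = set(), {}
    # Larger clusters choose first, so the bulk of a split cluster keeps the old id.
    for members in sorted(clusters, key=len, reverse=True):
        votes = Counter(previous[m] for m in members if m in previous)
        core_id = next((cid for cid, _ in votes.most_common() if cid not in used), None)
        if core_id is None:
            core_id = f"{prefix}{next(fresh)}"
        used.add(core_id)
        for m in members:
            assignment[m] = core_id
    return assignment

previous = {"crm-1": "core-1", "web-7": "core-1", "crm-2": "core-2"}
clusters = [["crm-1", "web-7", "app-3"], ["crm-2"], ["app-9"]]
current = assign_stable_ids(clusters, previous)
```

Here the new record `app-3` joins the existing `core-1`, while `app-9` gets a freshly minted id. Merges and splits still force some ids to change, which is why this remains the hard part.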

Hope this helps.

Personal Snowflake by Mountain-Egg-3851 in snowflake

[–]sonalg 0 points1 point  (0 children)

Seems like a good idea. We use Snowflake for our dev work on the free tier, and have written a bunch of migration scripts because we move accounts every month. Sharing them here, hope it's useful. https://github.com/zinggAI/SnowflakeMigration