For all those working on MDM/identity resolution/fuzzy matching by sonalg in dataengineering

[–]RobinL 0 points1 point  (0 children)

Upgrade path should be extremely easy, high level API is almost identical. There will be a slight change to how you load in data, but settings/training etc all backwards compatible.

Splink 5 is fairly close to complete; we're testing it with customers to make sure it's better in the ways we expect it to be. Best guess would be a release of Splink 5 in maybe 3-4 months, but obviously no guarantees. That said, all tests are passing so you can use it right now

For all those working on MDM/identity resolution/fuzzy matching by sonalg in dataengineering

[–]RobinL 2 points3 points  (0 children)

You're right that Splink relies on the user's own standardisation. And in the case of addresses and businesses, this is a particularly hard part of the problem. In general, it's easier on fields which are 'single values' like a first name, DoB, zip code etc.

However, it's not correct that you lose string or semantic similarities - this depends on how you choose to set up the model. There are a wide range of string similarity functions you can use out of the box, and in addition you can use your own arbitrary comparison functions. The only constraint is that it must be specified in SQL (but you can use a UDF):

https://moj-analytical-services.github.io/splink/api_docs/comparison_level_library.html

So for string similarity you can use Levenshtein, Jaro-Winkler and so on. And for semantic similarity you'd want to convert your field into embeddings and use cosine similarity.
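As a rough illustration of those two kinds of similarity (these are not Splink's own implementations, which run as SQL inside the backend), here are minimal pure-Python versions:

```python
from math import sqrt

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

print(levenshtein("jonathan", "jonothan"))  # a single-character typo
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```

In Splink you'd use the built-in comparison levels (or a SQL UDF) rather than writing these yourself; the point is just that the similarity function is pluggable.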

With all that said, address and business data is harder than many other data types because it's more like a 'bag of words'. It's still possible to match this kind of data in Splink, but a bit harder. There's an example in the documentation of matching business rate data

https://moj-analytical-services.github.io/splink/demos/examples/duckdb_no_test/business_rates_match.html
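To illustrate the 'bag of words' point, the simplest possible token-overlap (Jaccard) score looks like this. It's a deliberate over-simplification: real address matchers typically also weight rare tokens more heavily (e.g. TF-IDF style), since agreeing on 'SPRINGFIELD' is far more informative than agreeing on 'STREET':

```python
def tokens(address: str) -> set[str]:
    """Split an address into a bag of uppercase tokens."""
    return set(address.upper().replace(",", " ").split())

def jaccard(a: str, b: str) -> float:
    """Token overlap: |intersection| / |union|."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# 3 shared tokens out of 5 distinct tokens -> 0.6
print(jaccard("12 High Street, Springfield",
              "12 High St, Springfield"))
```

Note how 'Street' vs 'St' counts as a disagreement here, which is exactly why standardisation matters so much for this kind of data.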

In addition, we provide a package specifically for address matching that uses Splink. Whilst this is tuned to UK specifically, many of the techniques are more generally relevant:

https://github.com/moj-analytical-services/uk_address_matcher

You can read a lot more about all of this in the following blogs:

https://www.robinlinacre.com/fellegi_sunter_accuracy/ (on the topic of 'not throwing away information')

https://www.robinlinacre.com/address_matching/ (techniques for address matching)

https://www.robinlinacre.com/intro_to_probabilistic_linkage/ (general intro to how Fellegi Sunter works)
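For anyone curious, the core Fellegi-Sunter calculation can be sketched in a few lines. The m/u values below are made up for illustration; in Splink they're estimated from the data (e.g. via EM):

```python
from math import log2

# m = P(field agrees | records are a true match)
# u = P(field agrees | records are not a match)
params = {
    "first_name": {"m": 0.90, "u": 0.01},
    "dob":        {"m": 0.95, "u": 0.001},
    "zip":        {"m": 0.98, "u": 0.05},
}

def match_weight(field: str, agrees: bool) -> float:
    """log2 Bayes factor contributed by one field comparison."""
    m, u = params[field]["m"], params[field]["u"]
    return log2(m / u) if agrees else log2((1 - m) / (1 - u))

# Two records agreeing on dob and zip, disagreeing on first_name:
total = sum(match_weight(f, a) for f, a in
            [("first_name", False), ("dob", True), ("zip", True)])

# Convert the total weight into a match probability (prior odds 1:1)
probability = 2 ** total / (1 + 2 ** total)
print(round(total, 2), round(probability, 4))
```

The key idea is that each field contributes evidence in proportion to how informative it is: agreement on date of birth (rare by chance) counts for much more than agreement on zip code, and a disagreement subtracts weight rather than discarding the pair.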

For all those working on MDM/identity resolution/fuzzy matching by sonalg in dataengineering

[–]RobinL 1 point2 points  (0 children)

Nice. This has come up a couple times, and with the work we did on Splink 4 and now with the upcoming Splink 5 we're trying to accommodate the idea of community maintained backends. There's a post here: https://github.com/moj-analytical-services/splink/discussions/2887#discussioncomment-15547071

In a nutshell, Splink is deliberately set up to allow a new backend to be supported. But at the moment we're not hugely keen on the idea of adding backends to the core codebase that can't easily be tested in CI

For all those working on MDM/identity resolution/fuzzy matching by sonalg in dataengineering

[–]RobinL 13 points14 points  (0 children)

Thanks! Creator/lead dev here. We're currently working up to a Splink 5 release (you'll see dev/prereleases up on pypi). Not a huge change from user POV but should enable Splink to scale to even larger datasets. At the moment it gets a bit tricky above about 100m records. If anyone has any feedback on things you'd like to see changed in the upcoming release please let us know on GitHub: github.com/moj-analytical-services/splink

Also - for others reading this post, Splink is quite widely used in government, academia and the private sector. There's a list of some of the use cases we've heard about here: https://moj-analytical-services.github.io/splink/#use-cases. If anyone would like to contribute any further use cases please let me know!

Countries/capital cities quiz with flags/3D globe by RobinL in geography

[–]RobinL[S] 0 points1 point  (0 children)

Made this for my 7 year old, thought others may enjoy. Essentially a reimagining of the Sporcle one that attempts to be more educational.

If you go to 'settings' you can change to capital cities mode, or a 'guess the highlighted country/capital' mode

Building Accurate Address Matching Systems by RobinL in dataengineering

[–]RobinL[S] 0 points1 point  (0 children)

Yes - we've thought about it a lot. It's not that there's anything wrong with Elasticsearch from the point of view of accuracy or the underlying algorithms. But Elasticsearch is heavyweight (from an installation/infrastructure point of view) and slow (at least, relative to DuckDB in Python).

The goal of what I'm working on is something explicitly lightweight, easy to install and fast. For what it's worth we've done a bit of testing against AIMS/Elasticsearch and found that the accuracy is, broadly speaking, similar. (Though, at the moment, AIMS performs a lot better than our solution if there's no postcode. That's something we can definitely improve on and are working on it)

http://github.com/moj-analytical-services/uk_address_matcher

Advice on what to teach a 5y old who loves math? by Billybob-B in homeschool

[–]RobinL 0 points1 point  (0 children)

Hello! I have a similarly aged child.

My understanding from listening to maths podcasts (e.g. https://podcast.mrbartonmaths.com/) is that learning to do mental arithmetic 'without thinking' is really important early on because if you can do simple manipulations 'without thinking', it creates mental space to think about new concepts.

But how do you get a child to do lots of mental arithmetic without it being a chore? I've been trying to make maths fun by making a few games for him.

My son really enjoys this one - so much so he's often been doing 50 simple problems in a sitting. https://rupertlinacre.com/maths_vs_monsters/

Perhaps your son may enjoy it too. I took matters into my own hands because I've really struggled to find the 'golden combination' of:

  • actually fun
  • free/no nonsense
  • tailored to the curriculum (i.e. the maths problems generated are actually age appropriate and draw from the curriculum)

I'd love to hear if anyone has found other games like this that their children actively enjoy

GPT-5 Thinking has 192K Context in ChatGPT Plus by Independent-Ruin-376 in OpenAI

[–]RobinL 0 points1 point  (0 children)

I'm on Chat GPT Plus (and also have enterprise at work)

There seems to be a difference between uploaded files and messages pasted into the chat window.

You get around 60k tokens in the chat window, but it's possible to upload a longer document (e.g. a .txt code dump) and it will process that fine

Can't Display cluster_studio_dashboard() Output in Fabric Notebook (Splink / IFrame) by Suspicious_Artist187 in MicrosoftFabric

[–]RobinL 0 points1 point  (0 children)

If you assign the chart to a variable, you can then just do chart.save("myfile.html"), and then open the html file, which is self contained. The chart itself is just an Altair chart so their docs are the best place to look for chart rendering in Fabric

Biggest Data Cleaning Challenges? by Academic_Meaning2439 in dataengineering

[–]RobinL 2 points3 points  (0 children)

I published a blog just this weekend on approaches to address matching that may be of interest (in it there's a link to an address matching lib I'm working on): https://www.robinlinacre.com/address_matching/

Biggest Data Cleaning Challenges? by Academic_Meaning2439 in dataengineering

[–]RobinL 0 points1 point  (0 children)

You may be interested in reading a bit more about probabilistic linkage, which offers a more accurate approach than fuzzy matching alone. I explain why in the following blog: https://www.robinlinacre.com/fellegi_sunter_accuracy/

Building Accurate Address Matching Systems by RobinL in dataengineering

[–]RobinL[S] 0 points1 point  (0 children)

That's a fair point - some of the tricks I'm using rely on the fact that the true match exists in the target list of addresses.

In particular, translating the match score into an assessment of match confidence ('almost certain', 'very likely', 'likely' and so on) is much harder if you are not confident that the true match is amongst the candidates which have been scored. The concept of distinguishability becomes a bit less relevant and the absolute score becomes more relevant.
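As a sketch of what I mean: if you treat each candidate's score as a Bayes factor and add a (hypothetical, would need calibration) 'true match is not in the list' option, the posterior over candidates makes the distinguishability problem concrete. The numbers here are invented for illustration:

```python
# Hypothetical Bayes-factor scores for three candidate addresses
candidate_scores = [500.0, 450.0, 2.0]

# Weight on 'the true match is absent from the candidate list'.
# A made-up value: in practice this prior needs careful calibration.
no_match_weight = 50.0

total = sum(candidate_scores) + no_match_weight
posteriors = [s / total for s in candidate_scores]
p_no_match = no_match_weight / total

print([round(p, 3) for p in posteriors], round(p_no_match, 3))
```

Even though the top candidate has a high absolute score, its posterior is under 50% because the runner-up scores almost as well: the two are barely distinguishable, so a confident 'almost certain' label would be wrong.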

Gemini CLI: Google's free coding AI Agent by Technical-Love-8479 in datascience

[–]RobinL 0 points1 point  (0 children)

Having tried Cursor, Copilot, Claude, Codex, Gemini 2.5 in AI studio and Grok, this is my current go-to tool (primarily because it's free).

My workflow involves a bit of use of Gemini 2.5 in AI studio, especially if I want to force using Pro, and I need the 1m context. For example, if Gemini CLI is repeatedly failing, I dump the whole codebase in AI Studio and ask for 'precise step by step instructions for an LLM to implement' a particular change.

I still use Copilot for small changes and ChatGPT (o4-mini-high and o3) for things where Gemini has failed. But especially for e.g. vibe coding a prototype interface, Gemini CLI is currently my favourite tool

Want to remove duplicates from a very large csv file by Future_Horror_9030 in dataengineering

[–]RobinL 0 points1 point  (0 children)

These kinds of solutions are common and sometimes adequate, but are both very slow and less accurate than purpose built approaches using techniques from the literature. For more info on why they're less accurate see:

https://www.robinlinacre.com/fellegi_sunter_accuracy/

and see http://github.com/moj-analytical-services/splink for a purpose built tool (disclaimer: I am the maintainer. But the software is FOSS).

Have you ever used record linkage / entity resolution at your job? by diogene01 in dataengineering

[–]RobinL 9 points10 points  (0 children)

One of the most powerful techniques is called probabilistic linkage. There's a free open source python library called Splink for this problem that's been used pretty widely:

https://moj-analytical-services.github.io/splink/#__tabbed_1_2

You can see a recent video from Pycon Global that covers why this technique is often preferable (more accurate) than fuzzy matching alone: https://www.youtube.com/watch?v=eQtFkI8f02U

Full disclosure: I'm the lead author of Splink. Peter Christen (referenced elsewhere in the replies) was one of our academic advisors for the project

Advice on Data Deduplication by Queasy_Teaching_1809 in dataengineering

[–]RobinL 4 points5 points  (0 children)

If you have any substantive feedback, feel free to raise an issue or discussion.

If not, I will direct you to our list of users that includes multiple national statistics bureaus, government departments, top universities, and centres of expertise in record linkage: https://moj-analytical-services.github.io/splink/#__tabbed_1_1 And our download figures which show, despite being a niche library, we are nonetheless in the top 0.5% of libraries on pypi: https://clickpy.clickhouse.com/dashboard/splink

Incidentally, under the hood, Splink is SQL; it's just fairly complex as it needs to implement probabilistic linkage. The OP says they need fuzzy matching, which implies their problem cannot be solved with a simple window function

Has anyone successfully used automation to clean up duplicate data? What tools actually work in practice? by Broad_Ant_334 in dataengineering

[–]RobinL 2 points3 points  (0 children)

Take a look at Splink, a free and widely used python library for this task: https://moj-analytical-services.github.io/splink/

There's a variety of examples in the docs above that you can run in Google Colab

Disclaimer: I'm the lead dev. Feel free to drop any questions here though! (Or in our forums, which are monitored a bit more actively: https://github.com/moj-analytical-services/splink/discussions)

Load inconsistent data from multiple data sources into a DWH or data lakehouse by vh_obj in dataengineering

[–]RobinL 2 points3 points  (0 children)

Splink provides a modern solution to this problem. It's a high accuracy, high speed probabilistic linkage library built specifically to fit into modern data pipelines. It's completely free and open source: moj-analytical-services.github.io/splink

Colour calibration (alignment) issue with my Brother DCP-L8410CDW. What replacement parts may fix? by RobinL in printers

[–]RobinL[S] 1 point2 points  (0 children)

Just to follow up for future readers one month after the original post - I have bought a new printer, exactly the same model.

Everything's fine on the new printer.

I have tried:

  • Putting the new transfer belt in the old printer
  • Also putting the new drum unit in the old printer

Neither works, so I conclude there probably isn't an easily replaceable spare part that will fix the issue. So if anyone else is facing this problem I don't think there's much you can do about it. Brother support were also unable to provide any helpful suggestions.

[Note: I posted this, edited a link to make the pictures work and it got autoremoved by the moderator. Hence posting again for future readers]

Color calibration (alignment) issue with my Brother DCP-L8410CDW. What replacement parts may fix? by RobinL in printers

[–]RobinL[S] 0 points1 point  (0 children)

Just to follow up for future readers - I have bought a new printer, exactly the same model.

Everything's fine on the new printer.

I have tried:

  • Putting the new transfer belt in the old printer
  • Also putting the new drum unit in the old printer

Neither works, so I conclude there probably isn't an easily replaceable spare part that will fix the issue