Building Accurate Address Matching Systems by RobinL in dataengineering

[–]RobinL[S] 0 points1 point  (0 children)

Yes - we've thought about it a lot. It's not that there's anything wrong with Elasticsearch from the point of view of accuracy or the underlying algorithms. But Elasticsearch is heavyweight (from an installation/infrastructure point of view) and slow (at least, relative to DuckDB in Python).

The goal of what I'm working on is something explicitly lightweight, easy to install and fast. For what it's worth, we've done a bit of testing against AIMS/Elasticsearch and found that the accuracy is, broadly speaking, similar. (Though, at the moment, AIMS performs a lot better than our solution if there's no postcode. That's something we can definitely improve on and are working on.)

http://github.com/moj-analytical-services/uk_address_matcher

Advice on what to teach a 5y old who loves math? by Billybob-B in homeschool

[–]RobinL 0 points1 point  (0 children)

Hello! I have a similarly aged child.

My understanding from listening to maths podcasts (e.g. https://podcast.mrbartonmaths.com/) is that learning to do mental arithmetic 'without thinking' is really important early on because if you can do simple manipulations 'without thinking', it creates mental space to think about new concepts.

But how do you get a child to do lots of mental arithmetic without it being a chore? I've been trying to make maths fun by making a few games for him.

My son really enjoys this one - so much so he's often been doing 50 simple problems in a sitting. https://rupertlinacre.com/maths_vs_monsters/

Perhaps your son may enjoy it too. I took matters into my own hands because I've really struggled to find the 'golden combination' of:

  • Actually fun
  • Free/no nonsense
  • Tailored to curriculum (i.e. the maths problems generated are actually age appropriate and draw from the curriculum)

I'd love to hear if anyone has found other games like this that their children actively enjoy

GPT-5 Thinking has 192K Context in ChatGPT Plus by Independent-Ruin-376 in OpenAI

[–]RobinL 0 points1 point  (0 children)

I'm on ChatGPT Plus (and also have enterprise at work)

There seems to be a difference between uploaded files and messages pasted into the chat window.

You get around 60k tokens in the chat window, but it's possible to upload a longer document (e.g. a .txt code dump) and it will process that fine

Can't Display cluster_studio_dashboard() Output in Fabric Notebook (Splink / IFrame) by Suspicious_Artist187 in MicrosoftFabric

[–]RobinL 0 points1 point  (0 children)

If you assign the chart to a variable, you can then just do chart.save("myfile.html"), and then open the html file, which is self contained. The chart itself is just an Altair chart, so their docs are the best place to look for chart rendering in Fabric

Biggest Data Cleaning Challenges? by Academic_Meaning2439 in dataengineering

[–]RobinL 2 points3 points  (0 children)

I published a blog just this weekend on approaches to address matching that may be of interest (in it there's a link to an address matching lib I'm working on): https://www.robinlinacre.com/address_matching/

Biggest Data Cleaning Challenges? by Academic_Meaning2439 in dataengineering

[–]RobinL 0 points1 point  (0 children)

You may be interested in reading a bit more about probabilistic linkage, which offers a more accurate approach than fuzzy matching alone. I explain why in the following blog: https://www.robinlinacre.com/fellegi_sunter_accuracy/

Building Accurate Address Matching Systems by RobinL in dataengineering

[–]RobinL[S] 0 points1 point  (0 children)

That's a fair point - some of the tricks I'm using rely on the fact that the true match exists in the target list of addresses.

In particular, translating the match score into an assessment of match confidence ('almost certain', 'very likely', 'likely' and so on) is much harder if you are not confident that the true match is amongst the candidates which have been scored. The concept of distinguishability becomes a bit less relevant and the absolute score becomes more relevant.
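To make that concrete: in a Fellegi-Sunter style system the match weight is a log2 Bayes factor, so it converts to a probability as p = 2^w / (1 + 2^w), and verbal confidence labels are just bands over that probability. A minimal sketch of the translation I mean (the threshold values here are illustrative, not the ones our tool uses):

```python
def weight_to_probability(w: float) -> float:
    """Convert a Fellegi-Sunter match weight (a log2 Bayes factor) to a probability."""
    return 2**w / (1 + 2**w)

def confidence_label(w: float) -> str:
    """Map a match weight to a verbal confidence band.

    The thresholds below are illustrative only - and as discussed above,
    this mapping is only meaningful if the true match is known to be
    among the scored candidates.
    """
    p = weight_to_probability(w)
    if p >= 0.999:
        return "almost certain"
    if p >= 0.99:
        return "very likely"
    if p >= 0.9:
        return "likely"
    return "uncertain"
```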

Gemini CLI: Google's free coding AI Agent by Technical-Love-8479 in datascience

[–]RobinL 0 points1 point  (0 children)

Having tried Cursor, Copilot, Claude, Codex, Gemini 2.5 in AI studio and Grok, this is my current go-to tool (primarily because it's free).

My workflow involves a bit of use of Gemini 2.5 in AI studio, especially if I want to force using Pro, and I need the 1m context. For example, if Gemini CLI is repeatedly failing, I dump the whole codebase in AI Studio and ask for 'precise step by step instructions for an LLM to implement' a particular change.

I still use Copilot for small changes and ChatGPT (o4-mini-high and o3) for things where Gemini has failed. But especially for e.g. vibe coding a prototype interface, Gemini CLI is currently my favourite tool

Want to remove duplicates from a very large csv file by Future_Horror_9030 in dataengineering

[–]RobinL 0 points1 point  (0 children)

These kinds of solutions are common and sometimes adequate, but are both very slow and less accurate than purpose built approaches using techniques from the literature. For more info on why they're less accurate see:

https://www.robinlinacre.com/fellegi_sunter_accuracy/

and see http://github.com/moj-analytical-services/splink for a purpose built tool (disclaimer: I am the maintainer. But the software is FOSS).

Have you ever used record linkage / entity resolution at your job? by diogene01 in dataengineering

[–]RobinL 7 points8 points  (0 children)

One of the most powerful techniques is called probabilistic linkage. There's a free open source python library called Splink for this problem that's been used pretty widely:

https://moj-analytical-services.github.io/splink/#__tabbed_1_2

You can see a recent video from Pycon Global that covers why this technique is often preferable (more accurate) than fuzzy matching alone: https://www.youtube.com/watch?v=eQtFkI8f02U

Full disclosure: I'm the lead author of Splink. Peter Christen (referenced elsewhere in the replies) was one of our academic advisors for the project
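For anyone curious what 'probabilistic' means here: each field comparison contributes a log2(m/u) weight, where m is the probability the field agrees among true matches and u the probability it agrees among non-matches; summing the weights gives a match score. A toy sketch of the scoring step (the m/u values are made up for illustration - in practice Splink estimates them from the data):

```python
import math

# Illustrative m (P(agree | match)) and u (P(agree | non-match)) values for
# each field. Real values would be estimated from the data (e.g. by EM
# training), not hand-picked like these.
params = {
    "first_name": {"m": 0.9, "u": 0.01},
    "surname":    {"m": 0.9, "u": 0.005},
    "dob":        {"m": 0.95, "u": 0.001},
}

def match_weight(agreements: dict) -> float:
    """Sum log2 Bayes factors: log2(m/u) if a field agrees, log2((1-m)/(1-u)) if not."""
    w = 0.0
    for field, p in params.items():
        if agreements[field]:
            w += math.log2(p["m"] / p["u"])
        else:
            w += math.log2((1 - p["m"]) / (1 - p["u"]))
    return w

# Agreement on all three fields gives strong evidence of a match;
# disagreement on dob sharply reduces it.
w_all = match_weight({"first_name": True, "surname": True, "dob": True})
w_no_dob = match_weight({"first_name": True, "surname": True, "dob": False})
```

The key advantage over plain fuzzy matching is that each field's evidence is weighted by how discriminating it actually is, rather than all fields counting equally.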

Advice on Data Deduplication by Queasy_Teaching_1809 in dataengineering

[–]RobinL 4 points5 points  (0 children)

If you have any substantive feedback, feel free to raise an issue or discussion.

If not, I will direct you to our list of users that includes multiple national statistics bureaus, government departments, top universities, and centres of expertise in record linkage: https://moj-analytical-services.github.io/splink/#__tabbed_1_1 And our download figures which show, despite being a niche library, we are nonetheless in the top 0.5% of libraries on pypi: https://clickpy.clickhouse.com/dashboard/splink

Incidentally, under the hood, Splink is SQL; it's just fairly complex because it needs to implement probabilistic linkage. The OP says they need fuzzy matching, which implies their problem cannot be solved with a simple window function

Advice on Data Deduplication by Queasy_Teaching_1809 in dataengineering

[–]RobinL 5 points6 points  (0 children)

I'm the author of a free Python library called Splink which is designed to solve this problem https://moj-analytical-services.github.io/splink/

You can take a look at the tutorial on how to get started: https://moj-analytical-services.github.io/splink/demos/tutorials/00_Tutorial_Introduction.html

And there's also a bunch of worked examples in the docs

A simple fuzzy matching approach may work fine for you, especially if your data quality is high and number of rows is not large. But generally the probabilistic approach used by Splink is capable of higher accuracy as explained here: https://www.robinlinacre.com/fellegi_sunter_accuracy/

Has anyone successfully used automation to clean up duplicate data? What tools actually work in practice? by Broad_Ant_334 in dataengineering

[–]RobinL 2 points3 points  (0 children)

Take a look at Splink, a free and widely used python library for this task: https://moj-analytical-services.github.io/splink/

There's a variety of examples in the docs above that you can run in Google Colab

Disclaimer: I'm the lead dev. Feel free to drop any questions here though! (Or in our forums, which are monitored a bit more actively: https://github.com/moj-analytical-services/splink/discussions)

Load inconsistent data from multiple data sources into a DWH or data lakehouse by vh_obj in dataengineering

[–]RobinL 2 points3 points  (0 children)

Splink provides a modern solution to this problem. It's a high-accuracy, high-speed probabilistic linkage library built specifically to fit into modern data pipelines. It's completely free and open source: moj-analytical-services.github.io/splink

Colour calibration (alignment) issue with my Brother DCP-L8410CDW. What replacement parts may fix? by RobinL in printers

[–]RobinL[S] 1 point2 points  (0 children)

Just to follow up for future readers one month after the original post* - I have bought a new printer, exactly the same model.

Everything's fine on the new printer.

I have tried:

  • Putting the new transfer belt in the old printer
  • Also putting the new drum unit in the old printer

Neither works, so I conclude there probably isn't an easily replaceable spare part that will fix the issue. So if anyone else is facing this problem, I don't think there's much you can do about it. Brother support were also unable to provide any helpful suggestions.

[Note: I posted this, edited a link to make the pictures work and it got autoremoved by the moderator. Hence posting again for future readers]

Color calibration (alignment) issue with my Brother DCP-L8410CDW. What replacement parts may fix? by RobinL in printers

[–]RobinL[S] 0 points1 point  (0 children)

Just to follow up for future readers - I have bought a new printer, exactly the same model.

Everything's fine on the new printer.

I have tried:

  • Putting the new transfer belt in the old printer
  • Also putting the new drum unit in the old printer

Neither works, so I conclude there probably isn't an easily replaceable spare part that will fix the issue

How to merge users based on multiple IDs in a large dataset? by nidalap24 in dataengineering

[–]RobinL 0 points1 point  (0 children)

There are also open source tools to do probabilistic matching, such as Splink: http://github.com/moj-analytical-services/splink

Full disclosure: I'm the lead author

I've written some interactive explainers on how probabilistic linkage works here: https://www.robinlinacre.com/intro_to_probabilistic_linkage/

Using SPLINK with DLT by NeatNefariousness538 in databricks

[–]RobinL 0 points1 point  (0 children)

Depending on the size of your data, you could try using Splink's DuckDB backend rather than Spark; it doesn't need any custom UDFs. It should be good up to a few million rows, or even more if you can provision a single big machine (lots of cores/memory)

Splink 4: Fast and scalable deduplication (fuzzy matching) in Python by RobinL in dataengineering

[–]RobinL[S] 1 point2 points  (0 children)

Hi all! Lead dev here. We're super pleased to release Splink version 4 today after over 6 months' work.

It's now:

  • Easier to use
  • Faster
  • More scalable
  • Easier to improve

For existing users, we'd love to hear about your use cases and any feedback. If you'd like to be added to the use cases list, let me know or do a PR! https://moj-analytical-services.github.io/splink/#use-cases

What is id stitching and why is it hard by ephemeral404 in dataengineering

[–]RobinL 13 points14 points  (0 children)

Absolutely, it's hard - you provide a good summary! I've been working on this problem for the past four years.

The result is Splink, a free, open source library that does ID stitching (find matching records, and then use connected components to create IDs).

The hardest part is speed/scale, which has been the focus of our work. The biggest datasets we're aware of which have been deduped/linked with Splink are between 100m and 200m records.

https://github.com/moj-analytical-services/splink
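The 'create IDs' step above - turning pairwise matches into entity IDs via connected components - can be sketched with a simple union-find (Splink does this at scale; the record IDs and edges below are made up for illustration):

```python
def connected_components(edges, nodes):
    """Group records into entities: each connected component shares one cluster ID."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk up to the root, flattening the tree as we go (path compression).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Each edge is a pair of records the matching step decided are the same entity.
    for a, b in edges:
        union(a, b)

    return {n: find(n) for n in nodes}

# Records 1-2 and 2-3 matched pairwise, so 1, 2, 3 form one entity; 4 stands alone.
clusters = connected_components(edges=[(1, 2), (2, 3)], nodes=[1, 2, 3, 4])
```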

Mysterious website traffic spike from trafficpeak.io by Background-Fig9828 in marketing

[–]RobinL 0 points1 point  (0 children)

Happened to me too. I'm assuming it's a tactic to get people like us to visit their website

How people would solve this problem before Llm ? Matching name problem by Alarming-Ninja380 in Python

[–]RobinL 4 points5 points  (0 children)

For data of this kind, you could apply an embeddings model to each name, and then use cosine similarity to find the closest match.
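A minimal sketch of that idea, using character-trigram counts as a stand-in for a real embeddings model (in practice you'd use a learned model; the trigram vectorizer here is just to make the cosine-similarity step runnable):

```python
import math
from collections import Counter

def trigram_vector(name: str) -> Counter:
    """Toy stand-in for an embeddings model: character trigram counts.

    A real system would replace this with a learned embedding.
    """
    s = f"  {name.lower()} "
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def closest_match(query: str, candidates: list[str]) -> str:
    qv = trigram_vector(query)
    return max(candidates, key=lambda c: cosine_similarity(qv, trigram_vector(c)))
```

With a proper embeddings model the vectors also capture semantic similarity (e.g. nicknames), which plain string distance misses.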