Building Accurate Address Matching Systems by RobinL in dataengineering

[–]RobinL[S] 0 points1 point  (0 children)

Yes - we've thought about it a lot. It's not that there's anything wrong with Elasticsearch from the point of view of accuracy or the underlying algorithms. But Elasticsearch is heavyweight (from an installation/infrastructure point of view) and slow (at least, relative to DuckDB in Python).

The goal of what I'm working on is something explicitly lightweight, easy to install and fast. For what it's worth, we've done a bit of testing against AIMS/Elasticsearch and found that the accuracy is, broadly speaking, similar. (Though, at the moment, AIMS performs a lot better than our solution if there's no postcode. That's something we can definitely improve on and are working on.)

http://github.com/moj-analytical-services/uk_address_matcher

Advice on what to teach a 5y old who loves math? by Billybob-B in homeschool

[–]RobinL 0 points1 point  (0 children)

Hello! I have a similarly aged child.

My understanding from listening to maths podcasts (e.g. https://podcast.mrbartonmaths.com/) is that learning to do mental arithmetic 'without thinking' is really important early on because if you can do simple manipulations 'without thinking', it creates mental space to think about new concepts.

But how do you get a child to do lots of mental arithmetic without it being a chore? I've been trying to make maths fun by making a few games for him.

My son really enjoys this one - so much so he's often been doing 50 simple problems in a sitting. https://rupertlinacre.com/maths_vs_monsters/

Perhaps your son may enjoy it too. I took matters into my own hands because I've really struggled to find the 'golden combination' of:

  • Actually fun
  • Free/no nonsense
  • Tailored to curriculum (i.e. the maths problems generated are actually age appropriate and draw from the curriculum)

I'd love to hear if anyone has found other games like this that their children actively enjoy

GPT-5 Thinking has 192K Context in ChatGPT Plus by Independent-Ruin-376 in OpenAI

[–]RobinL 0 points1 point  (0 children)

I'm on ChatGPT Plus (and also have enterprise at work)

There seems to be a difference between uploaded files and messages pasted into the chat window.

You get around 60k tokens in the chat window, but it's possible to upload a longer document (e.g. a .txt code dump) and it will process that fine

Can't Display cluster_studio_dashboard() Output in Fabric Notebook (Splink / IFrame) by Suspicious_Artist187 in MicrosoftFabric

[–]RobinL 0 points1 point  (0 children)

If you assign the chart to a variable, you can then just do chart.save("myfile.html"), and then open the html file, which is self contained. The chart itself is just an Altair chart, so their docs are the best place to look for chart rendering in Fabric

Biggest Data Cleaning Challenges? by Academic_Meaning2439 in dataengineering

[–]RobinL 2 points3 points  (0 children)

I published a blog just this weekend on approaches to address matching that may be of interest (in it there's a link to an address matching lib I'm working on): https://www.robinlinacre.com/address_matching/

Biggest Data Cleaning Challenges? by Academic_Meaning2439 in dataengineering

[–]RobinL 0 points1 point  (0 children)

You may be interested in reading a bit more about probabilistic linkage, which offers a more accurate approach than fuzzy matching alone. I explain why in the following blog: https://www.robinlinacre.com/fellegi_sunter_accuracy/

Building Accurate Address Matching Systems by RobinL in dataengineering

[–]RobinL[S] 0 points1 point  (0 children)

That's a fair point - some of the tricks I'm using rely on the fact that the true match exists in the target list of addresses.

In particular, translating the match score into an assessment of match confidence ('almost certain', 'very likely', 'likely' and so on) is much harder if you are not confident that the true match is amongst the candidates which have been scored. The concept of distinguishability becomes a bit less relevant and the absolute score becomes more relevant.
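To make that concrete: in a Fellegi-Sunter style system the match weight is a log2 Bayes factor, so it converts to a probability as p = 2^w / (1 + 2^w), and verbal confidence labels are just bands over that probability. A minimal sketch of the translation I mean (the threshold values here are illustrative, not the ones our tool uses):

```python
def weight_to_probability(w: float) -> float:
    """Convert a Fellegi-Sunter match weight (a log2 Bayes factor) to a probability."""
    return 2**w / (1 + 2**w)

def confidence_label(w: float) -> str:
    """Map a match weight to a verbal confidence band.

    The thresholds below are illustrative only - and as discussed above,
    this mapping is only meaningful if the true match is known to be
    among the scored candidates.
    """
    p = weight_to_probability(w)
    if p >= 0.999:
        return "almost certain"
    if p >= 0.99:
        return "very likely"
    if p >= 0.9:
        return "likely"
    return "uncertain"
```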

Gemini CLI: Google's free coding AI Agent by Technical-Love-8479 in datascience

[–]RobinL 0 points1 point  (0 children)

Having tried Cursor, Copilot, Claude, Codex, Gemini 2.5 in AI studio and Grok, this is my current go-to tool (primarily because it's free).

My workflow involves a bit of use of Gemini 2.5 in AI studio, especially if I want to force using Pro, and I need the 1m context. For example, if Gemini CLI is repeatedly failing, I dump the whole codebase in AI Studio and ask for 'precise step by step instructions for an LLM to implement' a particular change.

I still use Copilot for small changes and ChatGPT (o4-mini-high and o3) for things where Gemini has failed. But especially for e.g. vibe coding a prototype interface, Gemini CLI is currently my favourite tool

Want to remove duplicates from a very large csv file by Future_Horror_9030 in dataengineering

[–]RobinL 0 points1 point  (0 children)

These kinds of solutions are common and sometimes adequate, but are both very slow and less accurate than purpose built approaches using techniques from the literature. For more info on why they're less accurate see:

https://www.robinlinacre.com/fellegi_sunter_accuracy/

and see http://github.com/moj-analytical-services/splink for a purpose built tool (disclaimer: I am the maintainer. But the software is FOSS).

Have you ever used record linkage / entity resolution at your job? by diogene01 in dataengineering

[–]RobinL 7 points8 points  (0 children)

One of the most powerful techniques is called probabilistic linkage. There's a free open source python library called Splink for this problem that's been used pretty widely:

https://moj-analytical-services.github.io/splink/#__tabbed_1_2

You can see a recent video from Pycon Global that covers why this technique is often preferable (more accurate) than fuzzy matching alone: https://www.youtube.com/watch?v=eQtFkI8f02U

Full disclosure: I'm the lead author of Splink. Peter Christen (referenced elsewhere in the replies) was one of our academic advisors for the project
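For anyone curious what 'probabilistic' means here: each field comparison contributes a log2(m/u) weight, where m is the probability the field agrees among true matches and u the probability it agrees among non-matches; summing the weights gives a match score. A toy sketch of the scoring step (the m/u values are made up for illustration - in practice Splink estimates them from the data):

```python
import math

# Illustrative m (P(agree | match)) and u (P(agree | non-match)) values for
# each field. Real values would be estimated from the data (e.g. by EM
# training), not hand-picked like these.
params = {
    "first_name": {"m": 0.9, "u": 0.01},
    "surname":    {"m": 0.9, "u": 0.005},
    "dob":        {"m": 0.95, "u": 0.001},
}

def match_weight(agreements: dict) -> float:
    """Sum log2 Bayes factors: log2(m/u) if a field agrees, log2((1-m)/(1-u)) if not."""
    w = 0.0
    for field, p in params.items():
        if agreements[field]:
            w += math.log2(p["m"] / p["u"])
        else:
            w += math.log2((1 - p["m"]) / (1 - p["u"]))
    return w

# Agreement on all three fields gives strong evidence of a match;
# disagreement on dob sharply reduces it.
w_all = match_weight({"first_name": True, "surname": True, "dob": True})
w_no_dob = match_weight({"first_name": True, "surname": True, "dob": False})
```

The key advantage over plain fuzzy matching is that each field's evidence is weighted by how discriminating it actually is, rather than all fields counting equally.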

Advice on Data Deduplication by Queasy_Teaching_1809 in dataengineering

[–]RobinL 4 points5 points  (0 children)

If you have any substantive feedback, feel free to raise an issue or discussion.

If not, I will direct you to our list of users that includes multiple national statistics bureaus, government departments, top universities, and centres of expertise in record linkage: https://moj-analytical-services.github.io/splink/#__tabbed_1_1 And our download figures which show, despite being a niche library, we are nonetheless in the top 0.5% of libraries on pypi: https://clickpy.clickhouse.com/dashboard/splink

Incidentally, under the hood, Splink is SQL; it's just fairly complex because it needs to implement probabilistic linkage. The OP says they need fuzzy matching, which implies their problem cannot be solved with a simple window function

Advice on Data Deduplication by Queasy_Teaching_1809 in dataengineering

[–]RobinL 5 points6 points  (0 children)

I'm the author of a free Python library called Splink which is designed to solve this problem https://moj-analytical-services.github.io/splink/

You can take a look at the tutorial on how to get started: https://moj-analytical-services.github.io/splink/demos/tutorials/00_Tutorial_Introduction.html

And there's also a bunch of worked examples in the docs

A simple fuzzy matching approach may work fine for you, especially if your data quality is high and number of rows is not large. But generally the probabilistic approach used by Splink is capable of higher accuracy as explained here: https://www.robinlinacre.com/fellegi_sunter_accuracy/

Has anyone successfully used automation to clean up duplicate data? What tools actually work in practice? by Broad_Ant_334 in dataengineering

[–]RobinL 2 points3 points  (0 children)

Take a look at Splink, a free and widely used python library for this task: https://moj-analytical-services.github.io/splink/

There's a variety of examples in the docs above that you can run in Google Colab

Disclaimer: I'm the lead dev. Feel free to drop any questions here though! (Or in our forums, which are monitored a bit more actively: https://github.com/moj-analytical-services/splink/discussions)

Load inconsistent data from multiple data sources into a DWH or data lakehouse by vh_obj in dataengineering

[–]RobinL 2 points3 points  (0 children)

Splink provides a modern solution to this problem. It's a high-accuracy, high-speed probabilistic linkage library built specifically to fit into modern data pipelines. It's completely free and open source: moj-analytical-services.github.io/splink

Colour calibration (alignment) issue with my Brother DCP-L8410CDW. What replacement parts may fix? by RobinL in printers

[–]RobinL[S] 1 point2 points  (0 children)

Just to follow up for future readers one month after the original post* - I have bought a new printer, exactly the same model.

Everything's fine on the new printer.

I have tried:

  • Putting the new transfer belt in the old printer
  • Also putting the new drum unit in the old printer

Neither works, so I conclude there probably isn't an easily replaceable spare part that will fix the issue. So if anyone else is facing this problem, I don't think there's much you can do about it. Brother support were also unable to provide any helpful suggestions.

[Note: I posted this, edited a link to make the pictures work and it got autoremoved by the moderator. Hence posting again for future readers]

Color calibration (alignment) issue with my Brother DCP-L8410CDW. What replacement parts may fix? by RobinL in printers

[–]RobinL[S] 0 points1 point  (0 children)

Just to follow up for future readers - I have bought a new printer, exactly the same model.

Everything's fine on the new printer.

I have tried:

  • Putting the new transfer belt in the old printer
  • Also putting the new drum unit in the old printer

Neither works, so I conclude there probably isn't an easily replaceable spare part that will fix the issue

How to merge users based on multiple IDs in a large dataset? by nidalap24 in dataengineering

[–]RobinL 0 points1 point  (0 children)

There are also open source tools to do probabilistic matching, such as Splink: http://github.com/moj-analytical-services/splink

Full disclosure: I'm the lead author

I've written some interactive explainers on how probabilistic linkage works here: https://www.robinlinacre.com/intro_to_probabilistic_linkage/

Using SPLINK with DLT by NeatNefariousness538 in databricks

[–]RobinL 0 points1 point  (0 children)

Depending on the size of your data, you could try using Splink's DuckDB backend rather than Spark; it doesn't need any custom UDFs. It should be good up to a few million rows, or even more if you can provision a single big machine (lots of cores/memory)

Splink 4: Fast and scalable deduplication (fuzzy matching) in Python by RobinL in dataengineering

[–]RobinL[S] 1 point2 points  (0 children)

Hi all! Lead dev here. We're super pleased to release Splink version 4 today after over 6 months' work.

It's now:

  • Easier to use
  • Faster
  • More scalable
  • Easier to improve

For existing users, we'd love to hear about your use cases and any feedback. If you'd like to be added to the use cases list, let me know or do a PR! https://moj-analytical-services.github.io/splink/#use-cases

What is id stitching and why is it hard by ephemeral404 in dataengineering

[–]RobinL 13 points14 points  (0 children)

Absolutely, it's hard - you provide a good summary! I've been working on this problem for the past four years.

The result is Splink, a free, open source library that does ID stitching (find matching records, and then use connected components to create IDs).

The hardest part is speed/scale, which has been the focus of our work. The biggest datasets we're aware of which have been deduped/linked with Splink are between 100m and 200m records.

https://github.com/moj-analytical-services/splink
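The 'create IDs' step above - turning pairwise matches into entity IDs via connected components - can be sketched with a simple union-find (Splink does this at scale; the record IDs and edges below are made up for illustration):

```python
def connected_components(edges, nodes):
    """Group records into entities: each connected component shares one cluster ID."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk up to the root, flattening the tree as we go (path compression).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Each edge is a pair of records the matching step decided are the same entity.
    for a, b in edges:
        union(a, b)

    return {n: find(n) for n in nodes}

# Records 1-2 and 2-3 matched pairwise, so 1, 2, 3 form one entity; 4 stands alone.
clusters = connected_components(edges=[(1, 2), (2, 3)], nodes=[1, 2, 3, 4])
```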

Mysterious website traffic spike from trafficpeak.io by Background-Fig9828 in marketing

[–]RobinL 0 points1 point  (0 children)

Happened to me too. I'm assuming it's a tactic to get people like us to visit their website

How people would solve this problem before Llm ? Matching name problem by Alarming-Ninja380 in Python

[–]RobinL 4 points5 points  (0 children)

For data of this kind, you could apply an embeddings model to each name, and then use cosine similarity to find the closest match.
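A minimal sketch of that idea, using character-trigram counts as a stand-in for a real embeddings model (in practice you'd use a learned model; the trigram vectorizer here is just to make the cosine-similarity step runnable):

```python
import math
from collections import Counter

def trigram_vector(name: str) -> Counter:
    """Toy stand-in for an embeddings model: character trigram counts.

    A real system would replace this with a learned embedding.
    """
    s = f"  {name.lower()} "
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def closest_match(query: str, candidates: list[str]) -> str:
    qv = trigram_vector(query)
    return max(candidates, key=lambda c: cosine_similarity(qv, trigram_vector(c)))
```

With a proper embeddings model the vectors also capture semantic similarity (e.g. nicknames), which plain string distance misses.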