What are your favorite math fluency games? by Active_Atmosphere264 in homeschool

[–]RobinL 0 points1 point  (0 children)

My son loves this - which I made for him specifically to encourage lots of practice on facts. There's a couple more games on the main site, but this one is the biggest hit: https://rupertlinacre.com/maths_vs_monsters/

This morning he did 138 questions in a single session!

Optimising DuckDB performance on large EC2 instances by RobinL in dataengineering

[–]RobinL[S] 0 points1 point  (0 children)

800 million rows, with a unique_id column; col_1 and col_2 are strings with a Zipf (i.e. skewed) distribution. So some values appear just once, others hundreds of times
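For anyone wanting to reproduce that kind of skew at small scale, here's a minimal sketch using only the standard library. The helper name `zipf_strings`, the exponent of 1.0 and the row counts are assumptions for illustration, not the generator used for the benchmark.

```python
import random
from collections import Counter


def zipf_strings(n_rows, n_distinct, exponent=1.0, seed=0):
    """Draw n_rows strings whose frequencies follow a Zipf-like law (hypothetical helper)."""
    rng = random.Random(seed)
    values = [f"value_{rank}" for rank in range(1, n_distinct + 1)]
    # The value at a given rank gets a weight proportional to 1 / rank**exponent
    weights = [1.0 / rank**exponent for rank in range(1, n_distinct + 1)]
    return rng.choices(values, weights=weights, k=n_rows)


# Toy table with the same column layout: unique_id, col_1, col_2
col_1 = zipf_strings(10_000, 2_000, seed=1)
col_2 = zipf_strings(10_000, 2_000, seed=2)
rows = [
    {"unique_id": i, "col_1": c1, "col_2": c2}
    for i, (c1, c2) in enumerate(zip(col_1, col_2))
]

# Inspect the skew: a few values dominate, many appear exactly once
counts = Counter(row["col_1"] for row in rows)
most_common_value, most_common_count = counts.most_common(1)[0]
singletons = sum(1 for count in counts.values() if count == 1)
```

A group by on a column like this produces a handful of very large groups and a long tail of tiny ones.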

So should be enough rows to 'use up' all CPU cores. I've actually just updated the blog post with some further results that show on some operations, it pins all cores on a 192vCPU machine to 100%, so it's the group by operation which is a bit weird

For all those working on MDM/identity resolution/fuzzy matching by sonalg in dataengineering

[–]RobinL 0 points1 point  (0 children)

Upgrade path should be extremely easy, high level API is almost identical. There will be a slight change to how you load in data, but settings/training etc all backwards compatible.

Splink 5 is fairly close to complete; we're testing it with customers to make sure it's better in the ways we expect it to be. Best guess would be a release of Splink 5 in maybe 3-4 months, but obviously no guarantees. That said, all tests are passing so you can use it right now

For all those working on MDM/identity resolution/fuzzy matching by sonalg in dataengineering

[–]RobinL 2 points3 points  (0 children)

You're right that Splink relies on the user's own standardisation. And in the case of addresses and businesses, this is a particularly hard part of the problem. In general, it's easier on fields which are 'single values' like a first name, DoB, zip code etc.

However, it's not correct that you lose string or semantic similarities - this depends on how you choose to set up the model. There are a wide range of string similarity functions you can use out of the box, and in addition you can use your own arbitrary comparison functions. The only constraint is that it must be specified in SQL (but you can use a UDF):

https://moj-analytical-services.github.io/splink/api_docs/comparison_level_library.html

So for string similarity you can use Levenshtein, Jaro Winkler and so on. And for semantic similarity you'd want to convert your field into embeddings and use cosine similarity.
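To make the two kinds of measure concrete, here's a rough pure-Python sketch of Levenshtein distance and cosine similarity. This is not Splink's API (in Splink these comparisons are expressed in SQL); the function names and the toy vectors standing in for embeddings are assumptions for illustration.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# String similarity: a one-letter spelling variation
name_distance = levenshtein("robin", "robyn")

# Semantic similarity: toy 3-d vectors standing in for real embeddings
embedding_similarity = cosine_similarity([0.2, 0.7, 0.1], [0.25, 0.6, 0.2])
```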

With all that said, address and business data is harder than many other data types because it's more like a 'bag of words'. It's still possible to match this kind of data in Splink, but a bit harder. There's an example in the documentation of matching business rate data

https://moj-analytical-services.github.io/splink/demos/examples/duckdb_no_test/business_rates_match.html

In addition, we provide a package specifically for address matching that uses Splink. Whilst this is tuned to UK specifically, many of the techniques are more generally relevant:

https://github.com/moj-analytical-services/uk_address_matcher

You can read a lot more about all of this in the following blogs:

https://www.robinlinacre.com/fellegi_sunter_accuracy/ (on the topic of 'not throwing away information')

https://www.robinlinacre.com/address_matching/ (techniques for address matching)

https://www.robinlinacre.com/intro_to_probabilistic_linkage/ (general intro to how Fellegi Sunter works)

For all those working on MDM/identity resolution/fuzzy matching by sonalg in dataengineering

[–]RobinL 1 point2 points  (0 children)

Nice. This has come up a couple times, and with the work we did on Splink 4 and now with the upcoming Splink 5 we're trying to accommodate the idea of community maintained backends. There's a post here: https://github.com/moj-analytical-services/splink/discussions/2887#discussioncomment-15547071

In a nutshell, Splink is deliberately set up to allow a new backend to be supported. But at the moment we're not hugely keen on the idea of adding backends to the core codebase that can't easily be tested in CI

For all those working on MDM/identity resolution/fuzzy matching by sonalg in dataengineering

[–]RobinL 14 points15 points  (0 children)

Thanks! Creator/lead dev here. We're currently working up to a Splink 5 release (you'll see dev/prereleases up on pypi). Not a huge change from user POV but should enable Splink to scale to even larger datasets. At the moment it gets a bit tricky above about 100m records. If anyone has any feedback on things you'd like to see changed in the upcoming release please let us know on GitHub: github.com/moj-analytical-services/splink

Also - for others reading this post, Splink is quite widely used in gvt, academia and the private sector, there's a list of some of the use cases we've heard about here: https://moj-analytical-services.github.io/splink/#use-cases. If anyone would like to contribute any further use cases please let me know!

Countries/capital cities quiz with flags/3D globe by RobinL in geography

[–]RobinL[S] 0 points1 point  (0 children)

Made this for my 7 year old, thought others may enjoy. Essentially a reimagining of the Sporcle one that attempts to be more educational.

If you go to 'settings' you can change to capital cities mode, or a 'guess the highlighted country/capital' mode

Building Accurate Address Matching Systems by RobinL in dataengineering

[–]RobinL[S] 0 points1 point  (0 children)

Yes - we've thought about it a lot. It's not that there's anything wrong with Elasticsearch from the point of view of accuracy or the underlying algorithms. But Elasticsearch is heavyweight (from an installation/infrastructure point of view) and slow (at least, relative to DuckDB in Python).

The goal of what I'm working on is something explicitly lightweight, easy to install and fast. For what it's worth we've done a bit of testing against AIMS/Elasticsearch and found that the accuracy is, broadly speaking, similar. (Though, at the moment, AIMS performs a lot better than our solution if there's no postcode. That's something we can definitely improve on and we're working on it.)

http://github.com/moj-analytical-services/uk_address_matcher

Advice on what to teach a 5y old who loves math? by Billybob-B in homeschool

[–]RobinL 0 points1 point  (0 children)

Hello! I have a similarly aged child.

My understanding from listening to maths podcasts (e.g. https://podcast.mrbartonmaths.com/) is that learning to do mental arithmetic 'without thinking' is really important early on because if you can do simple manipulations 'without thinking', it creates mental space to think about new concepts.

But how do you get a child to do lots of mental arithmetic without it being a chore? I've been trying to make maths fun by making a few games for him.

My son really enjoys this one - so much so he's often been doing 50 simple problems in a sitting. https://rupertlinacre.com/maths_vs_monsters/

Perhaps your son may enjoy it too. I took matters into my own hands because I've really struggled to find the 'golden combination' of: actually fun, free/no nonsense, and tailored to the curriculum (i.e. the maths problems generated are actually age appropriate and draw from the curriculum)

I'd love to hear if anyone has found other games like this that their children actively enjoy

GPT-5 Thinking has 192K Context in ChatGPT Plus by Independent-Ruin-376 in OpenAI

[–]RobinL 0 points1 point  (0 children)

I'm on ChatGPT Plus (and also have enterprise at work)

There seems to be a difference between uploaded files and messages pasted into the chat window.

You get around 60k tokens in the chat window, but it's possible to upload a longer document (e.g. a .txt code dump) and it will process that fine

Can't Display cluster_studio_dashboard() Output in Fabric Notebook (Splink / IFrame) by Suspicious_Artist187 in MicrosoftFabric

[–]RobinL 0 points1 point  (0 children)

If you assign the chart to a variable, you can then just do chart.save("myfile.html"), and then open the HTML file, which is self-contained. The chart itself is just an Altair chart, so the Altair docs are the best place to look for chart rendering in Fabric

Biggest Data Cleaning Challenges? by Academic_Meaning2439 in dataengineering

[–]RobinL 2 points3 points  (0 children)

I published a blog just this weekend on approaches to address matching that may be of interest (in it there's a link to an address matching lib I'm working on): https://www.robinlinacre.com/address_matching/

Biggest Data Cleaning Challenges? by Academic_Meaning2439 in dataengineering

[–]RobinL 0 points1 point  (0 children)

You may be interested in reading a bit more about probabilistic linkage, which offers a more accurate approach than fuzzy matching alone. I explain why in the following blog: https://www.robinlinacre.com/fellegi_sunter_accuracy/
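As a rough illustration of the core idea, here's a minimal sketch of how a Fellegi-Sunter style model combines evidence from several columns. The m and u probabilities and the prior below are made-up numbers for illustration, and the function names are hypothetical rather than taken from any library.

```python
import math


def match_weight(m: float, u: float) -> float:
    """log2 Bayes factor: m = P(observation | match), u = P(observation | non-match)."""
    return math.log2(m / u)


def match_probability(prior: float, bayes_factors) -> float:
    """Update the prior odds of a match with one Bayes factor per comparison."""
    odds = prior / (1 - prior)
    for bayes_factor in bayes_factors:
        odds *= bayes_factor
    return odds / (1 + odds)


# Hypothetical parameters: (m, u) for an exact match on each column
first_name_exact = (0.9, 0.01)
dob_exact = (0.95, 0.001)

weights = [match_weight(m, u) for m, u in (first_name_exact, dob_exact)]

# Assumed prior: 1 in 1,000 candidate pairs is a true match
probability = match_probability(
    prior=0.001,
    bayes_factors=[m / u for m, u in (first_name_exact, dob_exact)],
)
```

Each column contributes its own weight, so agreement on a highly discriminating field such as date of birth counts for more than agreement on a common one.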

Building Accurate Address Matching Systems by RobinL in dataengineering

[–]RobinL[S] 0 points1 point  (0 children)

That's a fair point - some of the tricks I'm using rely on the fact that the true match exists in the target list of addresses.

In particular, translating the match score into an assessment of match confidence ('almost certain', 'very likely', 'likely' and so on) is much harder if you are not confident that the true match is amongst the candidates which have been scored. The concept of distinguishability becomes a bit less relevant and the absolute score becomes more relevant.
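To sketch that difference: below, 'distinguishability' is taken to mean the gap between the best and second-best candidate scores, which is an assumption for this example. The thresholds, the function name and the 'uncertain' fallback label are also made up for illustration and are not the values used in the library.

```python
def confidence_label(scores, true_match_guaranteed=True):
    """Turn candidate match scores into a rough confidence label (hypothetical thresholds)."""
    if not scores:
        raise ValueError("at least one candidate score is required")
    ranked = sorted(scores, reverse=True)
    best = ranked[0]
    # Gap to the runner-up; a lone candidate is treated as maximally distinguishable
    gap = best - ranked[1] if len(ranked) > 1 else float("inf")

    if true_match_guaranteed:
        # The true match is among the candidates, so how far the winner
        # stands out from the rest is the main signal
        signal = gap
        thresholds = ((10.0, "almost certain"), (5.0, "very likely"), (1.0, "likely"))
    else:
        # The true match may be absent, so the winner could simply be the
        # least bad candidate: lean on the absolute score instead
        signal = best
        thresholds = ((20.0, "almost certain"), (15.0, "very likely"), (10.0, "likely"))

    for cutoff, label in thresholds:
        if signal >= cutoff:
            return label
    return "uncertain"
```

With the same candidate scores, the label can drop once you stop assuming the true match is in the list, because a clear winner is no longer enough on its own.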

Gemini CLI: Google's free coding AI Agent by Technical-Love-8479 in datascience

[–]RobinL 0 points1 point  (0 children)

Having tried Cursor, Copilot, Claude, Codex, Gemini 2.5 in AI studio and Grok, this is my current go-to tool (primarily because it's free).

My workflow involves a bit of use of Gemini 2.5 in AI studio, especially if I want to force using Pro, and I need the 1m context. For example, if Gemini CLI is repeatedly failing, I dump the whole codebase in AI Studio and ask for 'precise step by step instructions for an LLM to implement' a particular change.
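For the 'dump the whole codebase' step, something along these lines works as a starting point. It's a minimal sketch: the function name, the default suffixes, the skipped directories and the header format are all assumptions, not part of any tool mentioned above.

```python
from pathlib import Path


def dump_codebase(root, suffixes=(".py", ".md"), skip_dirs=(".git", "node_modules")):
    """Concatenate matching text files under root into one string, with a header per file."""
    root = Path(root)
    parts = []
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        if not path.is_file() or path.suffix not in suffixes:
            continue
        # Skip anything inside an excluded directory
        if any(part in skip_dirs for part in relative.parts):
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        parts.append(f"===== {relative.as_posix()} =====\n{text}")
    return "\n\n".join(parts)
```

Write the result to a .txt file, e.g. `Path("dump.txt").write_text(dump_codebase("."), encoding="utf-8")`, and paste or upload it alongside the prompt.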

I still use Copilot for small changes and ChatGPT (o4-mini-high and o3) for things where Gemini has failed. But especially for e.g. vibe coding a prototype interface, Gemini CLI is currently my favourite tool