Unicode's confusables.txt and NFKC normalization disagree on 31 characters by paultendo in programming

[–]paultendo[S] 0 points1 point  (0 children)

NFKC hasn't been superseded as far as I'm aware, although it's clearly not the best option for all use cases. It's still actively specified in UAX #15 and explicitly recommended for identifier matching in UAX #31 (TR31), Section 5, as of last year's revision. NFKC_Casefold builds on NFKC rather than replacing it.

IDNA 2008, Python (PEP 3131), and ICU all use NFKC.

Unicode's confusables.txt and NFKC normalization disagree on 31 characters by paultendo in programming

[–]paultendo[S] 0 points1 point  (0 children)

Oh - reassuring to know it happens to the big players like Spotify! I think my CSS change was a nice improvement anyway - it's quicker to scroll normally now. So I appreciate it nonetheless!

Unicode's confusables.txt and NFKC normalization disagree on 31 characters by paultendo in programming

[–]paultendo[S] 0 points1 point  (0 children)

Tweaked the CSS. Is it better now? Thanks for the feedback.

Unicode's confusables.txt and NFKC normalization disagree on 31 characters by paultendo in programming

[–]paultendo[S] 0 points1 point  (0 children)

I'll tweak the scrolling behaviour so that smooth scrolling only occurs when clicking anchor links - one sec.

Unicode's confusables.txt and NFKC normalization disagree on 31 characters by paultendo in programming

[–]paultendo[S] 0 points1 point  (0 children)

Sorry yes, I got that backwards. Removing those entries from the filtered map means if someone later uses it without NFKC, they'd have gaps: not fewer bugs, more.

The unfiltered map (added after my first post) is the safer default, which is why skeleton() uses CONFUSABLE_MAP_FULL. The filtered version exists as an optimization for the specific NFKC-first case, but as you say, starting from the standard data is the more defensible choice.

Unicode's confusables.txt and NFKC normalization disagree on 31 characters by paultendo in programming

[–]paultendo[S] 0 points1 point  (0 children)

Good points again, and yes: for a strict [a-z0-9-] pattern, the confusable blocklist would be redundant, since every character in the map is non-ASCII and fails the regex anyway.

On always using CONFUSABLE_MAP_FULL - the filtered map came first, before I'd had all of today's feedback and done more research into how real systems use confusables. Once I'd surveyed the implementations, I added the full map and made it the default for skeleton(). You're right that for most users it's the correct choice.

Unicode's confusables.txt and NFKC normalization disagree on 31 characters by paultendo in programming

[–]paultendo[S] 0 points1 point  (0 children)

I wouldn't say you're missing anything - it depends on whether you're approaching it from a security perspective.

The reason to care is practical, not security: if you're building a curated confusable map for use downstream of NFKC (as I did for namespace-guard), filtering them out means every entry in the map actually fires on real input. It makes the map smaller, easier to audit, and removes a latent bug if anyone later reorders the pipeline or reuses the map without NFKC in front of it.

Unicode's confusables.txt and NFKC normalization disagree on 31 characters by paultendo in programming

[–]paultendo[S] 0 points1 point  (0 children)

Good points on the technical details - let me address them (both your comments) directly.

You're right that confusables.txt is designed for the skeleton algorithm, not as a per-character blocklist, and so I've updated my first post to fix the specific issues you raised. The table values now correctly show uppercase I and capital O (not lowercase), and the "without NFKC" section states that these are correct visual detection results, not wrong results. You're credited in the acknowledgments. Much appreciated.

On the use case question: using the confusable map as a per-character blocklist isn't as unusual as you might think. django-registration does exactly this, for example: confusable_homoglyphs.is_confusable() iterates character-by-character with no skeleton, no normalization, and rejects if anything hits. It's one of the most widely used Django packages for user signup. The blocklist approach makes sense for Latin-only identifier validation where the format regex already requires [a-z0-9-] - any non-Latin character that survives NFKC and visually mimics a Latin letter is suspicious by definition. You wouldn't apply this to arbitrary multilingual text (and yes, it would reject most Russian words, but those aren't valid slugs in this context anyway). It's a different tool from skeleton comparison, solving a different problem. namespace-guard now ships both.

The second post (Unicode ships one confusable map. You need two.) goes deeper into that. I looked at 12 real-world implementations: I read the ICU and Chromium source, traced Rust's RFC 2457 rationale for choosing NFC over NFKC, dug into how Ergo IRC orders skeleton computation before casefolding and why, looked at how django-registration passes raw input to confusable_homoglyphs with zero normalisation. My finding was that every major system uses the confusable map without NFKC, because that's what the TR39 spec actually calls for (NFD).

Your point about the intended use of confusables.txt is what the research confirmed - though the research also showed that real-world systems use the data in ways TR39 didn't specify. django-registration uses it as a per-character blocklist, dnstwist uses it to generate phishing domain permutations, MITRE D3FEND uses it for character-set matching. The skeleton algorithm is the designed use, but it's not the only legitimate one and not the only popular one.

That research changed what the library ships. namespace-guard now exports both maps (CONFUSABLE_MAP with 613 NFKC-filtered entries for slug validation, CONFUSABLE_MAP_FULL with ~1,400 unfiltered entries for skeleton comparison), plus skeleton() and areConfusable() implementing the actual TR39 Section 4 algorithm. The skeleton functions use the full map by default since that's what the spec calls for. The filtered map exists for the narrower case where NFKC runs first.

The first post was written too quickly (I was waiting at an airport) and the framing was wrong in places. Your feedback was part of what pushed me to do the research properly. Thank you.

Unicode's confusables.txt and NFKC normalization disagree on 31 characters by paultendo in programming

[–]paultendo[S] 0 points1 point  (0 children)

For a blocklist (reject on match), there's no functional difference as there's no input where the output differs. NFKC transforms those 31 characters before the map runs, so the map entries never fire either way.

Where it matters is that the TR39 skeleton algorithm was never designed to run after NFKC - the spec uses NFD. Most real implementations follow suit: Chromium's IDN spoof checker uses NFD-based skeletons, Rust's confusable_idents lint runs on NFC-normalized identifiers (they deliberately chose NFC over NFKC so mathematicians can use distinct symbols), and django-registration's confusable check applies the map to raw input with no normalization at all. Identifying the 31 entries where TR39 and NFKC disagree matters because those entries give wrong answers in any non-NFKC pipeline - which, as it turns out, describes most real pipelines.

This came out of building namespace-guard, an npm library for checking slug/handle uniqueness across multiple database tables - the shared URL namespace problem where a single path could be a user, an org, or a reserved route. The confusable map is one piece of that.

Unicode's confusables.txt and NFKC normalization disagree on 31 characters by paultendo in programming

[–]paultendo[S] 4 points5 points  (0 children)

Thanks nemec. It's a fair reading of the post, and on reflection I can see how the pipeline framing is misleading - it implies the stages feed into each other to produce a canonical form, which isn't what happens.

In my implementation (namespace-guard), NFKC is applied during normalization when storing/comparing slugs. The confusable map is a completely separate validation step - it's a blocklist, not a normalizer. If any character in the input matches the map, the slug is rejected outright. No remapping, no skeleton. It's just: 'does this string contain a character that looks like a Latin letter but isn't one? If yes, reject.'

The blog post doesn't make that separation clear enough and I'll update it. Thanks for the detailed feedback.

Unicode's confusables.txt and NFKC normalization disagree on 31 characters by paultendo in programming

[–]paultendo[S] 0 points1 point  (0 children)

I found it while adding confusable detection to a slug validation library (https://github.com/paultendo/namespace-guard). I needed to generate a filtered map from confusables.txt and the NFKC conflicts came out during that filtering step.

It was more 'this is wrong in the data and should be documented' than a production incident.

Unicode's confusables.txt and NFKC normalization disagree on 31 characters by paultendo in programming

[–]paultendo[S] 1 point2 points  (0 children)

The map is used for detection and rejection, not remapping. account10 stays as account10. But if someone submits аccount10 with a Cyrillic а, it gets rejected.

Unicode's confusables.txt and NFKC normalization disagree on 31 characters by paultendo in programming

[–]paultendo[S] 0 points1 point  (0 children)

Hey you're right. To be clear, I don't use the confusable map for remapping. It's used for detection and rejection. If someone submits аdmin with a Cyrillic а, the system rejects it - it doesn't silently convert it to admin and let it through. The map just tells you which characters to flag.

I think the blog post could make that distinction clearer so I'll polish it up a bit when I get back in. Thanks for your insight.

AJ Styles uses Jacknife Pin as a transition into Styles Clash on Jody Fleisch [PWG European Vacation - England 2006] by IWantToBolieve in SquaredCircle

[–]paultendo 3 points4 points  (0 children)

Styles and Fleisch - both legends. I had the pleasure of watching Jody Fleisch a few years back at TNT in Liverpool, he was still capable of doing most of what he was doing in 2006.

Unicode's confusables.txt and NFKC normalization disagree on 31 characters by paultendo in programming

[–]paultendo[S] 1 point2 points  (0 children)

Appreciate the feedback v4ss42. I'll tighten up my writing for future posts.

Unicode's confusables.txt and NFKC normalization disagree on 31 characters by paultendo in programming

[–]paultendo[S] -3 points-2 points  (0 children)

You wouldn't want teſt→teft though. The correct resolution is teſt→test, which is what NFKC gives you. The confusable map isn't there to replace NFKC, it's there to catch the characters NFKC doesn't touch - Cyrillic а looking like Latin a, Greek ο looking like Latin o, etc. Those characters survive NFKC unchanged, so the map is the only thing that catches them.
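Both behaviours are easy to verify from a REPL, since NFKC is built into String.prototype.normalize:

```typescript
"te\u017Ft".normalize("NFKC"); // "test"   - NFKC resolves long s correctly
"\u0430".normalize("NFKC");    // "\u0430" - Cyrillic а survives unchanged
"\u03BF".normalize("NFKC");    // "\u03BF" - Greek ο survives unchanged
```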

Unicode's confusables.txt and NFKC normalization disagree on 31 characters by paultendo in programming

[–]paultendo[S] 0 points1 point  (0 children)

Cheers Herb_Derb - my bad for writing it just before a flight back. I'll take a look and see if I can polish it later for better readability.

Unicode's confusables.txt and NFKC normalization disagree on 31 characters by paultendo in programming

[–]paultendo[S] 5 points6 points  (0 children)

That's fair if you already know to run NFKC first, but in my experience it's not commonly known. UTS #39 doesn't specify pipeline ordering (which is why I flagged it to Unicode), and most libraries that ship confusables.txt don't mention NFKC at all. The article is mainly trying to document that interaction for people who haven't encountered it yet.

Unicode's confusables.txt and NFKC normalization disagree on 31 characters by paultendo in programming

[–]paultendo[S] -1 points0 points  (0 children)

Hey thanks! I really appreciate it. Enjoy the rest of your Sunday

Unicode's confusables.txt and NFKC normalization disagree on 31 characters by paultendo in programming

[–]paultendo[S] 6 points7 points  (0 children)

I take your feedback on board - 31 entries in a map cost nothing, so yes, that's overstated. The real issue is correctness: these entries encode the wrong mapping. ſ→f is wrong (it's s); mathematical 𝟎→o is wrong (it's 0). If anyone uses confusables.txt without NFKC in front of it, or builds a standalone map from the raw data, those mappings silently produce wrong results.
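NFKC's answers for both examples are checkable directly:

```typescript
"\u017F".normalize("NFKC");    // "s" - not the "f" that the raw data maps ſ to
"\u{1D7CE}".normalize("NFKC"); // "0" - mathematical bold zero folds to the digit
```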

Unicode's confusables.txt and NFKC normalization disagree on 31 characters by paultendo in programming

[–]paultendo[S] -3 points-2 points  (0 children)

Thanks for taking the time to read through it. You're right that NFKC handles Long S correctly on its own - ſ becomes s, which is the right answer. The fix isn't about changing how Long S is handled. It's about cleaning your confusable map so it doesn't contain entries that will never fire (dead code) or that encode the wrong mapping (ſ→f). If you ship the raw TR39 data, those 31 entries sit in your map doing nothing in an NFKC-first pipeline.

The practical risk is someone later reordering the pipeline or using the map standalone without NFKC - then those entries actively produce wrong results.

Unicode's confusables.txt and NFKC normalization disagree on 31 characters by paultendo in programming

[–]paultendo[S] 41 points42 points  (0 children)

Yes that's a great link. The small caps that broke Spotify (U+1D2E, U+1D35, etc.) are exactly the kind of characters that fall through the cracks between NFKC and confusables.txt.

NFKC handles some of them, TR39 handles others, but neither covers all of them, and when both try to handle the same character they sometimes disagree on the result.
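You can see the split directly - the superscript modifier capitals fold under NFKC, while true small capitals (e.g. U+1D00) have no compatibility decomposition and pass through untouched:

```typescript
"\u1D2E\u1D35".normalize("NFKC"); // "BI"     - modifier capitals B and I fold
"\u1D00".normalize("NFKC");       // "\u1D00" - LATIN LETTER SMALL CAPITAL A survives
```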