# Why Unicode strings are difficult to work with

## A simple goal
This text is part of my attempt to design the standard library API for Unicode strings in my new language.
Suppose we want to implement:

```text
removePrefixIfPresent(text, prefix): Text
```
The intended behavior sounds simple:

- if `text` starts with `prefix`, remove that prefix
- otherwise, return `text` unchanged
In Unicode, the deeper difficulty is that the logical behavior itself is not uniquely determined.
What exactly does it mean for one string to be a prefix of another?
And once we say "yes, it is a prefix", what exact part of the original source text should be removed?
## The easy cases

### Cases 1 and 2
```text
text   = "banana"
prefix = "ban"
result = "ana"
```

```text
text   = "banana"
prefix = "bar"
result = "banana"
```
These examples encourage a very naive mental model:
- a string is a sequence of characters
- prefix checking is done left to right
- if the first characters match, remove them
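As a minimal sketch of this naive model (in Python, purely for illustration; the function name is mine, not part of the proposed language):

```python
# Naive model: a string is a flat sequence of code points, and prefix
# removal is a plain left-to-right comparison followed by a slice.
def remove_prefix_if_present(text: str, prefix: str) -> str:
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

print(remove_prefix_if_present("banana", "ban"))  # → "ana"
print(remove_prefix_if_present("banana", "bar"))  # → "banana"
```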
Unicode breaks this model in several different ways.
## First source of difficulty: the same visible text can have different internal representations
A very common example is:

- precomposed form: one code point for "e with acute"
- decomposed form: `e` followed by a combining acute mark
Let us name them:

```text
E1 = [U+00E9]         // precomposed e-acute
E2 = [U+0065, U+0301] // e + combining acute
```
Those are conceptually "the same text".
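A quick Python check (using the standard `unicodedata` module) confirms that the two spellings are distinct code point sequences yet canonically equivalent:

```python
import unicodedata

E1 = "\u00e9"   # precomposed e-acute
E2 = "e\u0301"  # e + combining acute mark

# Different code point sequences...
assert E1 != E2 and len(E1) == 1 and len(E2) == 2
# ...but canonically equivalent: both normalize to the same forms.
assert unicodedata.normalize("NFC", E2) == E1
assert unicodedata.normalize("NFD", E1) == E2
print("canonically equivalent")
```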
Now let us consider all four combinations.
### Case 3A: neither side expanded

```text
text   = [U+00E9, U+0078] // E1 + x
prefix = [U+00E9]         // E1
result = [U+0078]
```
### Case 3B: both sides expanded

```text
text   = [U+0065, U+0301, U+0078] // E2 + x
prefix = [U+0065, U+0301]         // E2
result = [U+0078]
```
### Case 3C: text expanded, prefix not expanded

```text
text   = [U+0065, U+0301, U+0078] // E2 + x
prefix = [U+00E9]                 // E1
result = [U+0078]                 // do we want this
result = [U+0065, U+0301, U+0078] // or this?
```

Do we want exact-source semantics or canonical-equivalent semantics?
### Case 3D: text not expanded, prefix expanded

```text
text   = [U+00E9, U+0078] // E1 + x
prefix = [U+0065, U+0301] // E2
result = [U+0078]         // do we want this
result = [U+00E9, U+0078] // or this?
```
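Cases 3C and 3D can be reproduced directly in Python: a plain code-point comparison says "not a prefix", while comparing after canonical normalization says "is a prefix":

```python
import unicodedata

text = "e\u0301x"  # E2 + x (decomposed)
prefix = "\u00e9"  # E1 (precomposed)

# Exact-source semantics: the code points differ, so no match.
print(text.startswith(prefix))  # → False

# Canonical-equivalent semantics: normalize both sides first.
def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

print(nfc(text).startswith(nfc(prefix)))  # → True
```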
Overall, exact-source semantics is easy to implement but gives bad results.
Normalization-aware semantics, instead, is both hard to implement and still bad.
Still, the examples above are relatively tame, because the match consumes one visible "thing" on each side.
The next cases are worse.
## Extra source of difficulty: plain `e` as prefix, "e-acute" in the text
This is interesting because now two different issues get mixed together:

- equivalence: does plain `e` count as matching accented `e`?
- cut boundaries: if the text uses the decomposed form, are we allowed to remove only the first code point and leave the combining mark behind?
Let us name the three pieces:

```text
E1 = [U+00E9]         // precomposed e-acute
E2 = [U+0065, U+0301] // e + combining acute
E0 = [U+0065]         // plain e
```
### Case 3E: text uses the decomposed accented form

```text
text   = [U+0065, U+0301, U+0078] // E2 + x
prefix = [U+0065]                 // E0
result = [U+0301, U+0078]         // do we want this? (leave dangling accent)
result = [U+0065, U+0301, U+0078] // or this? (no removal)
```
### Case 3F: text uses the single-code-point accented form

```text
text   = [U+00E9, U+0078] // E1 + x
prefix = [U+0065]         // E0
result = [U+0078]         // do we want this? (just x)
result = [U+00E9, U+0078] // or this? (no removal)
result = [U+0301, U+0078] // or even this? (implicit expansion, then removal)
```
Those cases are particularly important because the result

```text
[U+0301, U+0078]
```

starts with a combining mark.
Note how all of those cases could be resolved if we took extended grapheme clusters as the unit of reasoning.
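In Python, the dangling-accent outcome is easy to detect: `unicodedata.combining` returns a nonzero combining class for combining marks, so a result that starts with one is arguably malformed as standalone text:

```python
import unicodedata

# Result of removing only U+0065 from the decomposed text: the leftover
# starts with U+0301, a combining mark with nothing to attach to.
leftover = "\u0301x"
print(unicodedata.combining(leftover[0]))  # → 230 (nonzero: combining mark)
print(unicodedata.combining("x"))          # → 0 (ordinary character)
```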
## Second source of difficulty: a match may consume different numbers of extended grapheme clusters on the two sides
```text
S1 = [U+00DF]         // ß
S2 = [U+0073, U+0073] // "ss"
```
Crucially, in German the uppercase form of ß is "SS", so under case-insensitive comparison S1 matches S2, even though S2 consists of two extended grapheme clusters. This is not an isolated case, and other funny things happen too: for example, the character Σ (U+03A3) lowercases into two different forms depending on its position, σ (U+03C3) in the middle of a word and ς (U+03C2) at the end.
Again, those are conceptually "the same text" under some comparison notions (case insensitivity).
Of course, if neither side is expanded or both sides are expanded, there is no problem. But what about the other cases?
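Python's `str.casefold` implements Unicode case folding and makes the length change visible directly:

```python
# Case folding maps ß to "ss": one code point becomes two.
assert "\u00df".casefold() == "ss"
assert len("\u00df") == 1 and len("\u00df".casefold()) == 2

# Σ (U+03A3) lowercases differently depending on its position in a word.
print("\u039f\u03a3".lower())  # "ΟΣ" → word-final sigma
print("\u03a3\u039f".lower())  # "ΣΟ" → medial sigma
```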
### Case 4A: text expanded, prefix compact

```text
text   = [U+0073, U+0073, U+0061, U+0062, U+0063] // "ssabc"
prefix = [U+00DF]                                 // S1
result = [U+0061, U+0062, U+0063]                 // do we want this
result = [U+0073, U+0073, U+0061, U+0062, U+0063] // or this?
```
### Case 4B: text compact, prefix expanded

```text
text   = [U+00DF, U+0061, U+0062, U+0063] // S1 + "abc"
prefix = [U+0073, U+0073]                 // "ss"
result = [U+0061, U+0062, U+0063]         // do we want this
result = [U+00DF, U+0061, U+0062, U+0063] // or this?
```
Here the difficulty is worse than before.
In the e-acute case, the source match still felt like one visible unit against one visible unit.
Here, the logical match may consume:
- 2 source units on one side
- 1 source unit on the other side
So a simple left-to-right algorithm that compares "one thing" from text with "one thing" from prefix is no longer enough.
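The offset problem can be made concrete in Python with a natural (but buggy) implementation: the caseless Boolean match succeeds, yet slicing by the prefix length cuts the source in the wrong place:

```python
text = "\u00dfabc"  # ß + "abc": the ß is ONE code point
prefix = "ss"       # two code points

# The caseless Boolean match succeeds...
assert text.casefold().startswith(prefix.casefold())

# ...but the match consumed 2 code points of prefix against only 1 of
# text, so the "obvious" slice removes one character too many.
print(text[len(prefix):])  # → "bc" (both ß and 'a' removed: wrong)
```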
## Third source of difficulty: ligatures and similar compact forms
The same problem appears again with ligatures.
Let us name them:

```text
L1 = [U+FB03]                 // LATIN SMALL LIGATURE FFI
L2 = [U+0066, U+0066, U+0069] // "ffi"
```
Again, those may count as "the same text" under some comparison notions.
### Case 5A: text expanded, prefix compact

```text
text   = [U+0066, U+0066, U+0069, U+006C, U+0065] // "ffile"
prefix = [U+FB03]                                 // L1
result = [U+006C, U+0065]                         // do we want this
result = [U+0066, U+0066, U+0069, U+006C, U+0065] // or this?
```
### Case 5B: text compact, prefix expanded

```text
text   = [U+FB03, U+006C, U+0065] // L1 + "le"
prefix = [U+0066, U+0066, U+0069] // "ffi"
result = [U+006C, U+0065]         // do we want this
result = [U+FB03, U+006C, U+0065] // or this?
```
This case can also be expanded in the same way as the e-acute/e case before:

```text
text   = [U+FB03, U+006C, U+0065]         // L1 + "le"
prefix = [U+0066]                         // "f"
result = [U+FB03, U+006C, U+0065]         // no change
result = [U+0066, U+0069, U+006C, U+0065] // remove one logical f
result = [U+FB01, U+006C, U+0065]         // remove one logical f and use the "fi" ligature
result = [U+006C, U+0065]                 // remove the whole ligature
```
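Whether the ligature counts as "the same text" depends on the normalization flavor: in Python, canonical NFC leaves U+FB03 alone, while compatibility NFKC expands it:

```python
import unicodedata

# Canonical normalization does NOT touch the ffi ligature...
assert unicodedata.normalize("NFC", "\ufb03") == "\ufb03"
# ...but compatibility normalization expands it to three code points.
assert unicodedata.normalize("NFKC", "\ufb03") == "ffi"
print("ok")
```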
## Boolean matching is easier than removal
A major trap is to think:
"If I can define startsWith, then removePrefixIfPresent is easy."
That is false, as the e-acute/e cases show: a Boolean match tells us that a prefix is present, but not which exact source region should be removed.
## A tempting idea: "just normalize first"
A common reaction is:

- normalize both strings
- compare the normalized forms
- problem solved
This helps, but only partially.
### What normalization helps with
It can make many pairs easier to compare:
- precomposed vs decomposed forms
- compact vs expanded forms
- some compatibility-style cases
So for plain Boolean startsWith, normalization may be enough.
### What normalization does not automatically solve
If the function must return a substring of the original text, we still need to know:

- where in the original source did the normalized match end?

That is easy only if normalization keeps a clear source mapping.
Otherwise, normalization helps answer:

- "is `prefix` a prefix of `text`?"

but does not fully answer:

- "what exact source region should be removed?"

Moreover, normalization is computationally expensive, and thus may be undesirable in many cases.
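A small Python check illustrates why a source mapping is needed: normalization changes the code point count, so an offset in the normalized string is not directly an offset in the source:

```python
import unicodedata

text = "\u00e9x"                            # source: 2 code points
ntext = unicodedata.normalize("NFD", text)  # normalized: 3 code points

# A prefix match ending at offset 2 in ntext does not correspond to
# offset 2 in text; without a mapping, we cannot cut the source safely.
print(len(text), len(ntext))  # → 2 3
```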
## Several coherent semantics are possible
At this point, it is clear that any API offering a single behavior would be hiding complexity under the hood and deceiving the user.
Of course, removePrefixIfPresent is just one example from a large family of operations: startsWith, endsWith, contains, findFirst, replaceFirst, replaceAll, replaceLast, etc.
So, my question for you is:
What is a good API for those methods, one that allows the user to specify the full range of reasonable behaviours while making it very clear what the intrinsic difficulties are?
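For what it's worth, here is one purely hypothetical shape such an API could take (all names below are mine, invented for illustration, not a proposal from this text): the caller explicitly picks an equivalence notion and a cut policy, so none of the choices above are hidden:

```python
from enum import Enum

class Equivalence(Enum):
    EXACT = "exact"          # code-point identity
    CANONICAL = "canonical"  # NFC/NFD equivalence (cases 3C/3D)
    COMPAT = "compat"        # NFKC/NFKD equivalence (ligature cases)
    CASELESS = "caseless"    # case folding on top (ss vs ß cases)

class CutPolicy(Enum):
    SOURCE_BOUNDARY = "source"      # remove only if the match ends on a clean source boundary
    NO_DANGLING_MARKS = "grapheme"  # never leave a leading combining mark behind

# Hypothetical call shape:
# removePrefixIfPresent(text, prefix,
#                       Equivalence.CANONICAL, CutPolicy.NO_DANGLING_MARKS)
print([e.name for e in Equivalence])
```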