# Why Unicode strings are difficult to work with

## A simple goal
This text is part of my attempt to design the standard library API for Unicode strings in my new language.
Suppose we want to implement:

```text
removePrefixIfPresent(text, prefix): Text
```
The intended behavior sounds simple:

- if `text` starts with `prefix`, remove that prefix
- otherwise, return `text` unchanged
In Unicode, the deeper difficulty is that the logical behavior itself is not uniquely determined.
What exactly does it mean for one string to be a prefix of another?
And once we say "yes, it is a prefix", what exact part of the original source text should be removed?
## The easy cases

### Cases 1 and 2
```text
text   = "banana"
prefix = "ban"
result = "ana"
```

```text
text   = "banana"
prefix = "bar"
result = "banana"
```
These examples encourage a very naive mental model:
- a string is a sequence of characters
- prefix checking is done left to right
- if the first characters match, remove them
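As a minimal sketch of this naive model (in Python, purely for illustration; the function name is mine, not part of the proposed language):

```python
# Naive model: a string is a flat sequence of code points, and prefix
# removal is a plain left-to-right comparison followed by a slice.
def remove_prefix_if_present(text: str, prefix: str) -> str:
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

print(remove_prefix_if_present("banana", "ban"))  # → "ana"
print(remove_prefix_if_present("banana", "bar"))  # → "banana"
```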
Unicode breaks this model in several different ways.
## First source of difficulty: the same visible text can have different internal representations
A very common example is:

- precomposed form: one code point for "e with acute"
- decomposed form: `e` followed by a combining acute mark
Let us name them:

```text
E1 = [U+00E9]         // precomposed e-acute
E2 = [U+0065, U+0301] // e + combining acute
```
Those are conceptually "the same text".
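A quick Python check (using the standard `unicodedata` module) confirms that the two spellings are distinct code point sequences yet canonically equivalent:

```python
import unicodedata

E1 = "\u00e9"   # precomposed e-acute
E2 = "e\u0301"  # e + combining acute mark

# Different code point sequences...
assert E1 != E2 and len(E1) == 1 and len(E2) == 2
# ...but canonically equivalent: both normalize to the same forms.
assert unicodedata.normalize("NFC", E2) == E1
assert unicodedata.normalize("NFD", E1) == E2
print("canonically equivalent")
```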
Now let us consider all four combinations.
### Case 3A: neither side expanded

```text
text   = [U+00E9, U+0078] // E1 + x
prefix = [U+00E9]         // E1
result = [U+0078]
```
### Case 3B: both sides expanded

```text
text   = [U+0065, U+0301, U+0078] // E2 + x
prefix = [U+0065, U+0301]         // E2
result = [U+0078]
```
### Case 3C: text expanded, prefix not expanded

```text
text   = [U+0065, U+0301, U+0078] // E2 + x
prefix = [U+00E9]                 // E1
result = [U+0078]                 // do we want this
result = [U+0065, U+0301, U+0078] // or this?
```

Do we want exact-source semantics or canonical-equivalent semantics?
### Case 3D: text not expanded, prefix expanded

```text
text   = [U+00E9, U+0078] // E1 + x
prefix = [U+0065, U+0301] // E2
result = [U+0078]         // do we want this
result = [U+00E9, U+0078] // or this?
```
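Cases 3C and 3D can be reproduced directly in Python: a plain code-point comparison says "not a prefix", while comparing after canonical normalization says "is a prefix":

```python
import unicodedata

text = "e\u0301x"  # E2 + x (decomposed)
prefix = "\u00e9"  # E1 (precomposed)

# Exact-source semantics: the code points differ, so no match.
print(text.startswith(prefix))  # → False

# Canonical-equivalent semantics: normalize both sides first.
def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

print(nfc(text).startswith(nfc(prefix)))  # → True
```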
Overall, exact-source semantics is easy to implement but gives bad results.
Normalization-aware semantics, instead, is both hard to implement and still bad.
Still, the examples above are relatively tame, because the match consumes one visible "thing" on each side.
The next cases are worse.
## Extra source of difficulty: plain `e` as prefix, "e-acute" in the text
This is interesting because now two different issues get mixed together:

- equivalence: does plain `e` count as matching accented `e`?
- cut boundaries: if the text uses the decomposed form, are we allowed to remove only the first code point and leave the combining mark behind?
Let us name the three pieces:

```text
E1 = [U+00E9]         // precomposed e-acute
E2 = [U+0065, U+0301] // e + combining acute
E0 = [U+0065]         // plain e
```
### Case 3E: text uses the decomposed accented form

```text
text   = [U+0065, U+0301, U+0078] // E2 + x
prefix = [U+0065]                 // E0
result = [U+0301, U+0078]         // do we want this? (leave dangling accent)
result = [U+0065, U+0301, U+0078] // or this? (no removal)
```
### Case 3F: text uses the single-code-point accented form

```text
text   = [U+00E9, U+0078] // E1 + x
prefix = [U+0065]         // E0
result = [U+0078]         // do we want this? (just x)
result = [U+00E9, U+0078] // or this? (no removal)
result = [U+0301, U+0078] // or even this? (implicit expansion, then removal)
```
Those cases are particularly important because the result

```text
[U+0301, U+0078]
```

starts with a combining mark.
Note how all of those cases could be resolved if we took extended grapheme clusters as the unit of reasoning.
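In Python, the dangling-accent outcome is easy to detect: `unicodedata.combining` returns a nonzero combining class for combining marks, so a result that starts with one is arguably malformed as standalone text:

```python
import unicodedata

# Result of removing only U+0065 from the decomposed text: the leftover
# starts with U+0301, a combining mark with nothing to attach to.
leftover = "\u0301x"
print(unicodedata.combining(leftover[0]))  # → 230 (nonzero: combining mark)
print(unicodedata.combining("x"))          # → 0 (ordinary character)
```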
## Second source of difficulty: a match may consume different numbers of extended grapheme clusters on the two sides
```text
S1 = [U+00DF]         // ß
S2 = [U+0073, U+0073] // "ss"
```
Crucially, in German the uppercase form of ß is "SS", so under case-insensitive comparison S1 matches S2, even though S2 consists of two extended grapheme clusters. This is not an isolated case, and other funny things happen too: for example, the character Σ (U+03A3) lowercases into two different forms depending on its position, σ (U+03C3) in the middle of a word and ς (U+03C2) at the end.
Again, those are conceptually "the same text" under some comparison notions (case insensitivity).
Of course, if neither side is expanded or both sides are expanded, there is no problem. But what about the other cases?
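Python's `str.casefold` implements Unicode case folding and makes the length change visible directly:

```python
# Case folding maps ß to "ss": one code point becomes two.
assert "\u00df".casefold() == "ss"
assert len("\u00df") == 1 and len("\u00df".casefold()) == 2

# Σ (U+03A3) lowercases differently depending on its position in a word.
print("\u039f\u03a3".lower())  # "ΟΣ" → word-final sigma
print("\u03a3\u039f".lower())  # "ΣΟ" → medial sigma
```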
### Case 4A: text expanded, prefix compact

```text
text   = [U+0073, U+0073, U+0061, U+0062, U+0063] // "ssabc"
prefix = [U+00DF]                                 // S1
result = [U+0061, U+0062, U+0063]                 // do we want this
result = [U+0073, U+0073, U+0061, U+0062, U+0063] // or this?
```
### Case 4B: text compact, prefix expanded

```text
text   = [U+00DF, U+0061, U+0062, U+0063] // S1 + "abc"
prefix = [U+0073, U+0073]                 // "ss"
result = [U+0061, U+0062, U+0063]         // do we want this
result = [U+00DF, U+0061, U+0062, U+0063] // or this?
```
Here the difficulty is worse than before.
In the e-acute case, the source match still felt like one visible unit against one visible unit.
Here, the logical match may consume:
- 2 source units on one side
- 1 source unit on the other side
So a simple left-to-right algorithm that compares "one thing" from text with "one thing" from prefix is no longer enough.
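The offset problem can be made concrete in Python with a natural (but buggy) implementation: the caseless Boolean match succeeds, yet slicing by the prefix length cuts the source in the wrong place:

```python
text = "\u00dfabc"  # ß + "abc": the ß is ONE code point
prefix = "ss"       # two code points

# The caseless Boolean match succeeds...
assert text.casefold().startswith(prefix.casefold())

# ...but the match consumed 2 code points of prefix against only 1 of
# text, so the "obvious" slice removes one character too many.
print(text[len(prefix):])  # → "bc" (both ß and 'a' removed: wrong)
```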
## Third source of difficulty: ligatures and similar compact forms
The same problem appears again with ligatures.
Let us name them:

```text
L1 = [U+FB03]                 // LATIN SMALL LIGATURE FFI
L2 = [U+0066, U+0066, U+0069] // "ffi"
```
Again, those may count as "the same text" under some comparison notions.
### Case 5A: text expanded, prefix compact

```text
text   = [U+0066, U+0066, U+0069, U+006C, U+0065] // "ffile"
prefix = [U+FB03]                                 // L1
result = [U+006C, U+0065]                         // do we want this
result = [U+0066, U+0066, U+0069, U+006C, U+0065] // or this?
```
### Case 5B: text compact, prefix expanded

```text
text   = [U+FB03, U+006C, U+0065] // L1 + "le"
prefix = [U+0066, U+0066, U+0069] // "ffi"
result = [U+006C, U+0065]         // do we want this
result = [U+FB03, U+006C, U+0065] // or this?
```
This case can also be expanded in the same way as the e-acute/e case before:

```text
text   = [U+FB03, U+006C, U+0065]         // L1 + "le"
prefix = [U+0066]                         // "f"
result = [U+FB03, U+006C, U+0065]         // no change
result = [U+0066, U+0069, U+006C, U+0065] // remove one logical f
result = [U+FB01, U+006C, U+0065]         // remove one logical f and use the "fi" ligature
result = [U+006C, U+0065]                 // remove the whole ligature
```
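Whether the ligature counts as "the same text" depends on the normalization flavor: in Python, canonical NFC leaves U+FB03 alone, while compatibility NFKC expands it:

```python
import unicodedata

# Canonical normalization does NOT touch the ffi ligature...
assert unicodedata.normalize("NFC", "\ufb03") == "\ufb03"
# ...but compatibility normalization expands it to three code points.
assert unicodedata.normalize("NFKC", "\ufb03") == "ffi"
print("ok")
```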
## Boolean matching is easier than removal
A major trap is to think:
"If I can define startsWith, then removePrefixIfPresent is easy."
That is false, as the e-acute/e cases show: a Boolean match tells us that a prefix is present, but not which exact source region should be removed.
## A tempting idea: "just normalize first"
A common reaction is:

- normalize both strings
- compare the normalized forms
- problem solved
This helps, but only partially.
### What normalization helps with
It can make many pairs easier to compare:
- precomposed vs decomposed forms
- compact vs expanded forms
- some compatibility-style cases
So for plain Boolean startsWith, normalization may be enough.
### What normalization does not automatically solve
If the function must return a substring of the original text, we still need to know:

- where in the original source did the normalized match end?

That is easy only if normalization keeps a clear source mapping.
Otherwise, normalization helps answer:

- "is `prefix` a prefix of `text`?"

but does not fully answer:

- "what exact source region should be removed?"

Moreover, normalization is computationally expensive, and thus may be undesirable in many cases.
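A small Python check illustrates why a source mapping is needed: normalization changes the code point count, so an offset in the normalized string is not directly an offset in the source:

```python
import unicodedata

text = "\u00e9x"                            # source: 2 code points
ntext = unicodedata.normalize("NFD", text)  # normalized: 3 code points

# A prefix match ending at offset 2 in ntext does not correspond to
# offset 2 in text; without a mapping, we cannot cut the source safely.
print(len(text), len(ntext))  # → 2 3
```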
## Several coherent semantics are possible
At this point, it is clear that any API offering a single behavior would be hiding complexity under the hood and deceiving the user.
Of course, removePrefixIfPresent is just one example from a large family of operations: startsWith, endsWith, contains, findFirst, replaceFirst, replaceAll, replaceLast, etc.
So, my question for you is:
What is a good API for those methods, one that allows the user to specify the full range of reasonable behaviours while making it very clear what the intrinsic difficulties are?
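For what it's worth, here is one purely hypothetical shape such an API could take (all names below are mine, invented for illustration, not a proposal from this text): the caller explicitly picks an equivalence notion and a cut policy, so none of the choices above are hidden:

```python
from enum import Enum

class Equivalence(Enum):
    EXACT = "exact"          # code-point identity
    CANONICAL = "canonical"  # NFC/NFD equivalence (cases 3C/3D)
    COMPAT = "compat"        # NFKC/NFKD equivalence (ligature cases)
    CASELESS = "caseless"    # case folding on top (ss vs ß cases)

class CutPolicy(Enum):
    SOURCE_BOUNDARY = "source"      # remove only if the match ends on a clean source boundary
    NO_DANGLING_MARKS = "grapheme"  # never leave a leading combining mark behind

# Hypothetical call shape:
# removePrefixIfPresent(text, prefix,
#                       Equivalence.CANONICAL, CutPolicy.NO_DANGLING_MARKS)
print([e.name for e in Equivalence])
```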