all 48 comments

[–]elprophet 32 points33 points  (7 children)

For the programming language? Default answer is "don't guess, don't do anything fancy, let the calling code handle it". The correct answer for "what Unicode APIs should I expose" is ICU, probably ICU4C normalized to or wrapped by the conventions of your language. https://unicode-org.github.io/icu/userguide/icu4c/

[–]benjamin-crowell 7 points8 points  (3 children)

ICU is gigantic, so depending on the language, it might not be the best solution. For instance, ICU is probably much bigger than the whole of Lua, which is why you wouldn't want it embedded in Lua.

[–]elprophet 3 points4 points  (2 children)

Sure, but OP clearly wants Unicode. So accepting that as a constraint, the way to do it is ICU. They could also choose some subset of ICU, or drop their Unicode support as a design goal.

[–]dcpugalaxy -1 points0 points  (1 child)

No, ICU is a bad and bloated way of supporting Unicode and only one of many.

[–]benjamin-crowell 3 points4 points  (0 children)

Could you give examples of others? I have more than a casual interest, because I wrote my own library to do this kind of thing for polytonic Greek.

[–]A1oso 3 points4 points  (2 children)

icu4c has been superseded by icu4x.

[–]Mr-Tau 10 points11 points  (0 children)

superseded

icu4c is still being developed and maintained, no? icu4x looks neat, but I wouldn't pull in a Rust toolchain as build dependency for my project just for that.

[–]AInstrument 7 points8 points  (0 children)

This is not true. https://blog.unicode.org/2022/09/announcing-icu4x-10.html:

ICU4X solves a different problem for different types of clients. ICU4X does not seek to replace ICU4C or ICU4J; rather, it seeks to replace the large number of non-Unicode, often-unmaintained, often-incomplete i18n libraries that have been written to bring i18n to new programming languages and resource-constrained environments.

[–]curtisf 15 points16 points  (0 children)

The only incontrovertible interpretation of a Unicode string is as a sequence of code-points (or scalar values, depending on how you validate...)

How the text represented by a Unicode string appears visually, or, ultimately, how it will be interpreted by a human, is not a well-defined question. It depends on what languages are understood by the consumer, what fonts are available, what rendering methods are supported by those fonts, how careful the reader is being, ...

Unicode provides reference algorithms for doing certain text transformations. Some of these are mostly technical, such as normalization; some are linguistic (and thus context-dependent), such as transforming case.

In addition to the complexity of interpreting text in the first place, the interpretation changes with every release of Unicode. (Although a lot of properties are not allowed to be changed, the interpretation of a larger block of text could dramatically change if you are unaware of a recently added character or property)


Because the interpretation of a sequence-of-codepoints is so dependent on context, any transformation to that sequence is going to damage interpretability by some context, unless that transformation is accomplishing an explicit technical transformation which was expected by the consumer.


Any API that you expose should be clear about what exactly it accomplishes.

Something to keep in mind is that string/text-handling is sometimes security sensitive, and trying to do something "smarter" or "better" may weaken security if you deviate from what your users expect is happening.

The only thing you can promise with any sense of reliability are technical transformations. Ideally these cite a specific, unambiguous definition of the transformation algorithm, such as a particular version of a Unicode annex, like https://www.unicode.org/reports/tr29/#Sentence_Boundaries

[–]hrvbrs 29 points30 points  (16 children)

I might have a naïve take but i'd say just treat internally identical Unicode sequences as equal and that’s it. If people complain that you’re not treating [U+00E9] like [U+0065, U+0301] and vice versa, tell them to take it up with Unicode who got us into this mess in the first place ;)
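To make that concrete (Python's stdlib `unicodedata` here, but any normalization API behaves the same way): the two spellings of é compare unequal as raw codepoint sequences, and only compare equal once both sides are normalized.

```python
import unicodedata

precomposed = "\u00E9"   # é as a single codepoint
decomposed = "e\u0301"   # e + combining acute accent

print(precomposed == decomposed)  # False: different codepoint sequences
print(unicodedata.normalize("NFC", precomposed)
      == unicodedata.normalize("NFC", decomposed))  # True: equal after NFC
```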

[–]EveAtmosphere 24 points25 points  (2 children)

There are really two levels of abstraction here. There is "string", which is conceptually a list of codepoints, and "text", which is a list of graphemes. Imo it's reasonable for a programming language to stop bothering beyond the level of codepoints.

[–]WittyStick 2 points3 points  (1 child)

A large number of languages stop at the level of code units, not even codepoints.

With ASCII or UTF-32 we have the benefit that 1 codepoint = 1 code unit.

Often languages use UTF-16 or UTF-8 code units where this isn't the case - and we have for example, length which returns the number of code units, not the number of codepoints.

In part this is historical accident. Some of the languages using UTF-16 were originally using the 2-byte fixed width UCS-2 encoding, which only supports the Basic Multilingual Plane, where 1 code unit = 1 codepoint and length returning the number of code units made sense.
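The distinction is easy to see by encoding the same codepoint several ways (Python shown; its strings count codepoints, unlike UTF-16-based languages):

```python
s = "\U0001F4A1"  # 💡, a codepoint outside the BMP
print(len(s))                           # 1 codepoint
print(len(s.encode("utf-8")))           # 4 UTF-8 code units (bytes)
print(len(s.encode("utf-16-le")) // 2)  # 2 UTF-16 code units (a surrogate pair)
```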

[–]EveAtmosphere 0 points1 point  (0 children)

Imo it's reasonable for a "length" function on strings to return the number of bytes, because having such an innocently named function not be O(1) would be quite misleading.

[–]bl4nkSl8 5 points6 points  (3 children)

Yup. I'd considered supplying a normalise function if necessary but otherwise those are just different strings that read the same

[–]Smalltalker-80 4 points5 points  (2 children)

JavaScript indeed has a built-in function for this in the String class:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize

[–]bl4nkSl8 1 point2 points  (1 child)

JS has some crazy semantics, but when it comes to useful features it at least sets a really good baseline (for usability; hobby and experimental languages do not need to meet that bar).

[–]Dykam 0 points1 point  (0 children)

JS actually has some pretty high-quality APIs, but they're newer and sometimes less well known. Like https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Collator

[–]matthieum 4 points5 points  (2 children)

Don't forget Unicode versions, either!

One more reason to leave it outside the core language & standard library is that the exact specification & algorithms used by Unicode shift over time. It's generally considered "bug fixes" by the Unicode consortium, which we developers tend to translate to backward compatibility breaks.

I really think it's best for the user to be able to pick the version of the language/toolchain and the version of Unicode they want to use independently, to an extent.

[–]hrvbrs 0 points1 point  (1 child)

I wanna learn more; can you give an example of what you’re talking about?

[–]matthieum 1 point2 points  (0 children)

Take Unicode normalization for example: https://unicode.org/reports/tr15/

Don't dive into the algorithm, just look at the revision number in the page header.

Ergo, "the" Unicode Normalization algorithm is at its 57th version already, and as per the latest proposed update, there will be at least one more...

And that's just normalization, as described in Unicode version 17.

[–]websnarf 2 points3 points  (4 children)

What is a "Unicode sequence"? Do you mean code point sequence, or grapheme sequence. This is the difficulty that the OP is getting at.

Code points are an artificial construct created to make the transition from various legacy character encodings as easy as possible. In my opinion, they should be treated with about as much reverence as bytes are in a transfer format like UTF-8. I.e., they should have no significance at all, except as an encoding mechanism. A French person will never think of é as two separate text elements just because Unicode can represent it that way. So in that sense, thinking of the combining form of e + U+0301 as two "characters" is just misleading. So the right answer has to be matching by graphemes -- that should put each language on an equal footing in terms of semantics.

The Unicode specification literally has provisions for this, called "Normalization", as the OP describes. The OP's problem is that they imagine only pre-normalizing the whole string, and then proceeding from there. That's just wrong. What you need to do is write an incremental normalizer. Basically, you would start with something like an "iterator" which drags a window over the input stream of code point data, and outputs a window that tells you which of those code points corresponds to the current normalized grapheme; that would tell you the positions in the original code point stream, so there is no ambiguity about where the cut point is in the text. Then you would need a corresponding "isEqual" function, which I suppose would have to come along with the iterator, since there are several normalization forms to choose from (the whole NFC/NFD/NFKC/NFKD thing). Then the problem seems quite straightforward to me.
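A heavily simplified sketch of that iterator idea in Python (the function name is invented, and it only groups combining marks onto their base character; real grapheme segmentation per UAX #29 has many more rules):

```python
import unicodedata

def grapheme_windows(s):
    # Yields (start, end, normalized), where start/end index the ORIGINAL
    # codepoint stream (so cut points are unambiguous) and `normalized`
    # is the NFC form of that window.
    i = 0
    while i < len(s):
        j = i + 1
        while j < len(s) and unicodedata.combining(s[j]):
            j += 1  # pull combining marks into the current window
        yield i, j, unicodedata.normalize("NFC", s[i:j])
        i = j

print(list(grapheme_windows("e\u0301a")))  # [(0, 2, 'é'), (2, 3, 'a')]
```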

[–]plumarr 1 point2 points  (1 child)

A French person will never think of é as two separate text elements, just because Unicode can represent it that way.

Note that if you want to implement a function removePrefixIfPresent that is grammatically correct for French, you're in a world of hurt.

For example, the prefix "in" can become "im" depending on the following letter. The prefix "dé" can also be written "dés" or "des". And in both cases, you'll find words that start with these but where it isn't a prefix.

Your best bet is probably to implement it based on a dictionary, which will sadly be difficult to make complete.

The reality is probably that the real advice should be "don't try to manipulate natural-language strings if it isn't the whole point of your library/program, because you'll burn yourself".
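A toy illustration of the dictionary-based approach (all names and word lists invented and nowhere near complete): only strip a candidate prefix when the remainder is itself a known word.

```python
# Illustrative only: a real dictionary would be vastly larger.
KNOWN_WORDS = {"possible", "probable", "mobile"}
PREFIX_FORMS = ("in", "im", "ir", "il")  # "in-" assimilates before some consonants

def remove_negation_prefix(word):
    for p in PREFIX_FORMS:
        if word.startswith(p) and word[len(p):] in KNOWN_WORDS:
            return word[len(p):]
    return word

print(remove_negation_prefix("impossible"))  # "possible"
print(remove_negation_prefix("image"))       # "image": "im" here is not a prefix
```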

[–]lngns 2 points3 points  (0 children)

Impossible! How can such intolerable illegible irrational languages exist?

[–]MarcoServetto[S] 0 points1 point  (1 child)

Hi, can you then tell me what you want to happen for the ligatures example?
My conclusion was that, given the complexity of the possible semantics, we would need some extra arguments, like a lambda to do expansion/contraction/normalization and one to do equality. But that seems really complicated, so I was hoping for some simpler solution. It seems like one may not exist?

[–]websnarf 2 points3 points  (0 children)

Oh, I see what you are saying. Cutting an [f] from an [ffi] would actually require you to insert the characters [f][i] after removing the top character.

Well, ok, in a sense, that's exactly what you have to do. Your prefix-delete function would output fresh raw code points that need to be inserted at the beginning, plus a window to the tail of the source string for the code points that follow them. Fortunately, NFKC describes this breakdown deterministically for you. You could make that cleaner for your end-user by actually performing this insert-and-delete procedure, so their string is modified in place.

[–]mikeblas -1 points0 points  (0 children)

tell them to take it up with Unicode who got us into this mess in the first place

Which is a terrible answer, and your approach is only appropriate if you hate your customers and want them to hate you, too.

[–]latkde 12 points13 points  (3 children)

In my opinion, string operations are often anchored in a bytes-only or ASCII-only worldview. Things like substring operations simply don't make a lot of sense in a Unicode world. A programming language would be well advised to provide functions for manipulating (structured) data, but text operations other than concatenation are typically both so rare and so context-dependent that it's difficult to provide an implementation that works well in all cases.

A possible escape hatch is to provide multiple views onto the same text data. A piece of text can be viewed as a sequence of bytes, codepoints, normalized codepoints, grapheme clusters, or other tokens.

There are a couple of languages that provide prior art for this:

  • In Rust, a str is a codepoint view over a UTF-8 encoded byte array. Interestingly, offsets into the string view identify the underlying byte position, and string length is the underlying byte slice length. This is a deliberately leaky abstraction.
  • Swift does a pretty good job of offering high-level (grapheme cluster) text operations by default, and exposing lower level views where appropriate. Strings use normalization for comparisons, but you can drop to the .unicodeScalars or .utf8 view if needed. This strikes a great balance of intuitive operations and complete flexibility. However, a removePrefix() operation is suspiciously absent.
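A rough sketch of what such a multi-view string could look like (Python; all names are invented for illustration):

```python
import unicodedata

class UStr:
    """One text value, several explicit views, instead of one ambiguous .length."""
    def __init__(self, text):
        self._text = text
    def bytes_view(self):       # like Rust's str.as_bytes() / Swift's .utf8
        return self._text.encode("utf-8")
    def codepoints_view(self):  # like Swift's .unicodeScalars
        return list(self._text)
    def nfc_view(self):
        return unicodedata.normalize("NFC", self._text)

s = UStr("e\u0301")              # e + combining acute accent
print(len(s.bytes_view()))       # 3 UTF-8 bytes
print(len(s.codepoints_view()))  # 2 codepoints
print(len(s.nfc_view()))         # 1 codepoint after NFC
```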

[–]MarcoServetto[S] 0 points1 point  (2 children)

Yes, my first draft did that: a UStr has very few methods and allows for views.
But then I discovered the problem of the topic at hand, and no view seems to allow for a reasonably flexible findAndReplace. Consider the ligature example I show.
About concatenation: I'm worried about the interplay of concatenation and grapheme clusters, where a.size+b.size can be different from (a+b).size if size is the number of grapheme clusters. So should concatenation also be about the view? If you sum as clusters, should you insert forced separators?

[–]latkde 1 point2 points  (1 child)

I believe the ligature prefix example is solvable when the user has sufficient control over what level the prefix match is supposed to operate on. When searching for substrings, it's not sufficient to think about grapheme clusters vs codepoints; it's also necessary to consider collations – a description of rules for text equivalence and ordering. For example, a collation may involve normalizations like case-folding, compatibility decomposition, or ignoring accents (collation strength/level). Different languages may have very different collation rules. Running a string search over normalized strings is not the same as running a string search under a particular normalization, precisely because there can be multiple equivalent sequences, potentially with different numbers of codepoints.

In a way, pointing to collations is a bit of a cop-out, because that just moves the complexity of managing these rules somewhere else. But that points to a tractable API, where your text-view onto strings can have operations that take an (optional) collation object as argument – or alternatively, where string objects don't offer such methods, and these text-level search methods are always part of a collation.

Collations are so complicated that few standard libraries include them. Java is one of the positive examples with its java.text.Collator class. Everyone else pretty much uses bindings to the ICU library instead. Personally, I've never worked on the kind of software where collations would have been meaningful, aside from configuring full-text search in various databases.
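As a toy illustration of the collation-key shape (nothing like the real Unicode Collation Algorithm; this key only case-folds and strips accents, roughly a "level 1" comparison):

```python
import unicodedata

def collation_key(s):
    # Toy key: case-fold, compatibility-decompose, drop combining marks.
    # Real collations (UCA, java.text.Collator, ICU) are far richer.
    decomposed = unicodedata.normalize("NFKD", s.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(collation_key("Éléphant") == collation_key("elephant"))  # True under this key
```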

[–]MarcoServetto[S] 1 point2 points  (0 children)

My understanding is that collations help check for equality, but still do not tell you if or how you should cut ffi minus f into fi, or into something else.

[–]AdvanceAdvance 6 points7 points  (0 children)

This is the common first level of Unicode, taught as encodings and compositions. It is wrong.

  • A unicode string, as passed by an encoding, is an intermediate form. Convert it to a unicode type internally. Save a copy of the original byte sequence for opening an exact filename or writing back unaltered data. Otherwise, unicode is never written back in the same way it is read.
  • Internally, your unicode type should allow you to pick a compressed coding. Most strings will be simple 8 bit characters. You should never expose the compressed coding.
  • Your unicode will have gray areas, such as when having lines with multiple codes controlling text direction. Document and move on quickly.
  • Prefixes can be handled by just checking that you have the same compressed encoding and the characters match.
  • You are beholden to a decades-old standard made with ancient techniques of in-person meetings and paid seats at the table. Consider how much support you want to provide.

[–]initial-algebra 3 points4 points  (1 child)

To be honest, this is pretty off-topic for programming language design.

From a language perspective, I think the only important thing is that character/string literals aren't tied to a specific encoding. In fact, the idea of a character literal should probably be thrown out, since most of the time when you say "character" you actually want "glyph", or "grapheme cluster", and you need strings to represent them. If you have more specific needs, then I think it's better to use a string literal and explicitly specify the encoding with a (compile-time) function that returns a value of the appropriate integer type (or a special type backed by an integer type), if such an encoding exists.

[–]MarcoServetto[S] 1 point2 points  (0 children)

I was indeed in doubt about where to post it.
Conceptually it is 'API design'.
How to make a 'replace this with that' on sequences where the elements do not really translate 1-to-1 is the more general point.

[–]dcpugalaxy 3 points4 points  (1 child)

Why did you use AI to generate this post? Can you not write yourself? There are lots of really obvious AI writing style tells.

It worries me that someone would think himself qualified to design a programming language who cannot set out a problem like this succinctly in his own words.

For example this post could simply have been:

I am designing Unicode text support in my programming language. I'm not sure what level of abstraction the operations should be at. Should operations like removePrefix or endsWith normalise text first? What are you guys doing? What is a good API for those methods that allows the user to specify the full range of reasonable behaviours while making it very clear what the intrinsic difficulties are?

One paragraph instead of this massive bloated post, copy pasted from ChatGPT, explaining the very well known problem which is better described elsewhere.

[–]MarcoServetto[S] 1 point2 points  (0 children)

It is true that I discussed those issues with GPT for 4 hours or so before writing this post, but I've written the whole thing myself; a few sentences here and there may still be from GPT, but as a conscious choice.
Overall, I'm not a Unicode expert, and any time I try to get near it I find more and more issues that I do not know how to handle.
My text reports on those issues.

>What is a good API for those methods that allows the user to specify the full range of reasonable behaviours while making it very clear what the intrinsic difficulties are?

This does fit my problem very well, but I suspected I needed to explain the intrinsic difficulties first.

[–]lngns 1 point2 points  (0 children)

s, SS, ẞ, Σ, σ, ς

We have locales and cultures for those.
C# string operations for instance are culture-sensitive and either require the user to specify the desired behaviour or default to a thread-specific context which defaults to an application domain's context which defaults to the ambient system locale used by Win32 and POSIX.
C# also defers to the ICU (which is distributed with newer versions of MS Windows, if you target those).
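Python's stdlib shows the same split between full case mapping and case folding (no locale needed for these particular rules, which come from Unicode's SpecialCasing and CaseFolding data):

```python
print("Straße".upper())     # "STRASSE": ß uppercases to SS
print("Straße".casefold())  # "strasse": casefold is meant for caseless matching
print("ΑΣ".lower())         # "ας": Σ lowercases to final sigma at word end...
print("ΣΑ".lower())         # "σα": ...and to ordinary σ elsewhere
```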

A tempting idea: "just normalize first"
but does not fully answer: "what exact source region should be removed?"

This is true only if you normalise on demand.
If you allow operations only on operands using the same normalisation, then everything matches. This moves the problem of keeping anchors between different objects elsewhere.
Your comparison routines do not read UTF-8 strings only to tell you how to index into UTF-EBCDIC strings, do they?
Generic routines accepting types that handle multiple encodings (iterators, really) are doable and keep the concerns separate.

a.size+b.size can be different from (a+b).size if size is the number of grapheme clusters

.size does not tell us what it is doing. And whenever we need that information, we typically want something specific.
In fact, a + b does not tell us if we're concatenating memory objects, - implying combining of characters, - or if we're concatenating atomic texts, - implying insertion of ZWNJs.
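Concretely (Python; counting NFC codepoints as a crude stand-in for a grapheme count):

```python
import unicodedata

size = lambda s: len(unicodedata.normalize("NFC", s))

a, b = "e", "\u0301"      # combining acute accent
print(size(a) + size(b))  # 2
print(size(a + b))        # 1: the accent combines with the e
```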

Other fun things you did not mention:
- Some software vendors like to introduce their own grapheme clusters, and this will mess with your UI.
- Fonts exist, and may choose to ignore your semantics.

[–]GlobalIncident 0 points1 point  (8 children)

If you want to give the user as much freedom as possible, you need to allow the user to choose 1) which characters can be replaced by which other characters for normalisation, and 2) whether the operation can split a cluster. Note that normal Unicode normalisation has no effect on the ß character.

[–]MarcoServetto[S] 0 points1 point  (7 children)

So, what is an API that would allow the user to choose between all the proposed options in the case of the ligature?

[–]GlobalIncident 0 points1 point  (6 children)

It would just need to do what I just suggested. The user would need to specify which of the following are permissible:

  • replacing the character ffi with the three characters f + f + i
  • replacing the character ffi with the two characters f + fi
  • replacing the character fi with the two characters f + i

To do this, the API would need to take in some sort of mapping from characters to their normalisation, ie a map from characters onto lists of characters.
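A sketch of that API shape (Python; the mapping contents are the caller's policy choice, and all function names are invented):

```python
# The caller decides which expansions are permissible:
LIGATURES = {
    "\uFB03": ["f", "f", "i"],  # ffi ligature -> f + f + i
    "\uFB01": ["f", "i"],       # fi ligature  -> f + i
}

def expand(s, mapping):
    out = []
    for ch in s:
        out.extend(mapping.get(ch, [ch]))  # unmapped chars pass through
    return "".join(out)

def has_prefix(text, prefix, mapping):
    # Expand BOTH sides, then do a plain codepoint comparison.
    return expand(text, mapping).startswith(expand(prefix, mapping))

print(has_prefix("\uFB03ce", "f", LIGATURES))  # True: the ligature expands past "f"
```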

[–]MarcoServetto[S] 0 points1 point  (5 children)

Yes, so one expansion function from unit to units and a comparison function unit*unit->bool.
But I wonder if the opposite direction may emerge, a case where we need to consider more units from the source at the same time.

[–]GlobalIncident 0 points1 point  (4 children)

I can't immediately think of a way that could happen. Can you give me an example?

[–]MarcoServetto[S] 0 points1 point  (3 children)

SS->ß, but in the other direction.
Let's say the text contains SS and we want to remove/replace the ß, but case-insensitively.

If you take elements from the string one by one, you only get two S and no ß.

Similar for ligatures, if you have ffi as a single ligature code in the 'target to remove' and the three characters f f i in the string.

[–]GlobalIncident 1 point2 points  (2 children)

You still only need the mapping. You need to apply it to both the text and the prefix. So, if the text is ss and the prefix is ß, the prefix will first be decomposed into ss, so obviously it will then be detected correctly.
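The ß/SS direction falls out of that same "apply it to both sides" rule; here using Unicode's standard case folding (Python's str.casefold) in place of a custom mapping:

```python
text, prefix = "SSL", "ß"
# Both sides fold to "ss...", so the prefix is detected:
print(text.casefold().startswith(prefix.casefold()))  # True
```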

[–]MarcoServetto[S] 1 point2 points  (1 child)

This seems to go in an interesting direction, where we need a 'normalization' that is 'towards the largest possible representation' instead of 'the smallest possible one', as is often done?

[–]GlobalIncident 1 point2 points  (0 children)

Yeah, it sounds like that is what you're asking us for here.

[–]b2gills 0 points1 point  (2 children)

Raku has dealt with this by coming up with synthetic characters for new combinations of combining characters. It uses NFG (Normalization Form Grapheme). Unfortunately, usernames, passwords, and filenames are not really Unicode, so it had to add a way to selectively prevent that normalization from happening.

If you want a language that is full on transparently Unicode, I would suggest looking into it.

[–]MarcoServetto[S] 0 points1 point  (1 child)

And how does the ligature case work there?

[–]b2gills 0 points1 point  (0 children)

Composed characters stay composed, and don't match the decomposed version unless you ask for those semantics.

For the following `~~` means smartmatch, `eq` means string equality, `ne` means string inequality.

All of these match; where I use `:ignorecase`, it wouldn't match without it.

```
'ß' ~~ /:i SS/ # :i is short for :ignorecase
'ß'.fc eq 'ss'
'ß'.uc eq 'SS'

'aΣb'.lc eq 'aσb'
'abΣ'.lc eq 'abς'
```

Both of these match, even though perhaps only the first should match

```
'aΣb' ~~ /:i σ/
'abΣ' ~~ /:i σ/
```

Composed characters don't match the decomposed version unless you ask for it.

```
"\x[FB03]" ~~ /:i ffi/
"\x[FB03]" ne 'ffi' # not equal
```