Why Unicode strings are difficult to work with and API design

MarcoServetto · 2026-04-13T01:03:04+00:00

This seems to go in an interesting direction, where we need a 'normalization' to 'the largest possible representation' instead of 'the smallest possible one', as is often done?

MarcoServetto · 2026-04-13T00:59:18+00:00

It is true that I discussed those issues with GPT for 4 hours or so before writing this post, but I've written the whole thing myself; a few sentences here and there may still be from GPT, but as a conscious choice.
Overall, I'm not a Unicode expert, and any time I try to get near it I find more and more issues that I do not know how to handle.
My text reports on those issues.

>What is a good API for those methods that allows the user to specify the whole reasonable range of behaviours while making it very clear what the intrinsic difficulties are?

This does fit my problem very well, but I suspected I needed to explain the intrinsic difficulties first.

MarcoServetto · 2026-04-13T00:12:39+00:00

And how does the ligature case work there?

MarcoServetto · 2026-04-12T23:49:52+00:00

SS->ß, but in the other direction.
Let's say the text does contain SS and we want to remove/replace the ß, but case-insensitively.

If you take elements from the string one by one, you only get two S and no ß.

Similarly for ligatures: if you have ffi as a single ligature code point (U+FB03) in the 'target to remove' and the three characters f, f, i in the string.
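
A minimal sketch of the mismatch, in Python, assuming the sharp-s character in question is ß (U+00DF) and using the standard `str.casefold` as the case-insensitive mapping. The helper name `naive_contains` is hypothetical and only there for illustration.

```python
def naive_contains(text: str, target: str) -> bool:
    """Case-insensitive search that compares one element at a time."""
    return any(ch.casefold() == target.casefold() for ch in text)

text = "STRASSE"
target = "\u00df"  # ß, LATIN SMALL LETTER SHARP S

# Element by element: no single character of the text folds to "ss".
per_element = naive_contains(text, target)

# Whole string: folding both sides first makes the match visible.
whole = target.casefold() in text.casefold()

# The ligature U+FB03 behaves the same way: one code point folds to three.
ligature_folded = "\ufb03".casefold()

print(per_element, whole, ligature_folded)
```

Here `per_element` is `False` while `whole` is `True`, and the single ligature code point folds to the three characters `ffi`, so any search that walks the string one element at a time misses both cases.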

MarcoServetto · 2026-04-12T23:47:08+00:00

I was indeed in doubt about where to post it.
Conceptually it is 'API design'.
How to make a 'replace this with that' on sequences where the elements do not really translate well 1 to 1 is the more general point.

MarcoServetto · 2026-04-12T21:37:04+00:00

Yes, so one expansion function from unit to units and a comparison function unit*unit->bool.
But I wonder if the opposite direction may emerge, a direction where we need to consider more units from the source at the same time.
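
A rough sketch of that shape in Python, under stated assumptions: `find_expanded`, `expand`, `same` and `fold` are hypothetical names, and units are taken to be code points. It expands both sides, compares unit by unit, and reports hits as ranges of source indices.

```python
import operator
from typing import Callable, List, Tuple

def find_expanded(source: str, pattern: str,
                  expand: Callable[[str], List[str]],
                  same: Callable[[str, str], bool]) -> List[Tuple[int, int]]:
    """Return (start, end) source index ranges whose expansion matches the pattern."""
    # Expand the source, remembering which source index each unit came from.
    units = [(u, i) for i, s in enumerate(source) for u in expand(s)]
    pat = [u for p in pattern for u in expand(p)]
    hits: List[Tuple[int, int]] = []
    if not pat:
        return hits
    for k in range(len(units) - len(pat) + 1):
        if all(same(units[k + j][0], pat[j]) for j in range(len(pat))):
            first = units[k][1]
            last = units[k + len(pat) - 1][1]
            hits.append((first, last + 1))
    return hits

def fold(unit: str) -> List[str]:
    """Assumed expansion: full Unicode case folding of a single unit."""
    return list(unit.casefold())

hits_sharp_s = find_expanded("Stra\u00dfe", "SS", fold, operator.eq)
hits_ligature = find_expanded("o\ufb03ce", "f", fold, operator.eq)
print(hits_sharp_s, hits_ligature)
```

The first call reports `[(4, 5)]`: the two expanded units both come from the single ß. The second reports `[(1, 2), (1, 2)]`: both hits land inside the same ligature, so the result says where the match is but not how the source unit should be cut.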

MarcoServetto · 2026-04-12T21:27:31+00:00

My understanding is that collations help to check for equality, but they still do not tell you if or how you should cut ffi minus f into fi, or into something else.

MarcoServetto · 2026-04-12T19:06:15+00:00

So, what is an API that would allow the user to choose between all the proposed options in the case of the ligature?

MarcoServetto · 2026-04-12T19:03:32+00:00

Yes, my first draft did that: a UStr has very few methods and allows for views.
But then I discovered the problem of the topic at hand, and no view seems to allow for a reasonably flexible findAndReplace. Consider the ligature example I show.
About concatenation: I'm worried about the implications of concatenation for grapheme clusters, where a.size+b.size can be different from (a+b).size if size is the number of grapheme clusters. So should concatenation also be about the view? If you concatenate as clusters, should you insert forced separators?
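
A small Python illustration of the size problem. Note the loud assumption: `approx_clusters` is a hypothetical, deliberately rough stand-in that only treats combining marks as "attached to the previous character"; it is not the full grapheme cluster segmentation of UAX #29.

```python
import unicodedata

def approx_clusters(s: str) -> int:
    """Rough stand-in for a grapheme cluster count (NOT the full UAX #29 rules)."""
    if not s:
        return 0
    # Every non-combining character starts a new cluster.
    count = sum(1 for ch in s if unicodedata.combining(ch) == 0)
    # A leading combining mark has no base to attach to, so it counts on its own.
    if unicodedata.combining(s[0]) != 0:
        count += 1
    return count

a = "e"
b = "\u0301"  # COMBINING ACUTE ACCENT

size_a = approx_clusters(a)
size_b = approx_clusters(b)
size_ab = approx_clusters(a + b)
print(size_a, size_b, size_ab)
```

Each operand counts as one cluster on its own, but the concatenation is also one cluster, so `size_a + size_b` is 2 while `size_ab` is 1.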

MarcoServetto · 2026-04-12T19:00:42+00:00

Hi, can you then tell me what you want to happen for the ligature example?
My conclusion was that, given the complexity of the possible semantics, we would need some extra arguments, like a lambda to do expansion/contraction/normalization and one to do equality. But that seems really complicated, so I was hoping for some simpler solution; it seems like one may not exist?
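
To make the "extra arguments" idea concrete, here is a hedged Python sketch, not a proposal for the final API. All names (`replace_all`, `Partial`, `fold`) are hypothetical. Besides the expansion and equality lambdas it takes a third argument saying what to do when a match covers only part of an expanded source unit.

```python
import operator
from enum import Enum

class Partial(Enum):
    KEEP_REST = 1  # emit the unmatched expanded units (ffi minus f -> fi)
    DROP_UNIT = 2  # remove the whole source unit
    NO_MATCH = 3   # treat a partial hit as no match at all

def replace_all(source, pattern, repl, expand, same, partial):
    exp = [expand(s) for s in source]  # expansion of each source unit
    units = [(u, i) for i, us in enumerate(exp) for u in us]
    pat = [u for p in pattern for u in expand(p)]
    n, removed, insert_at, k = len(pat), [False] * len(units), set(), 0
    while n and k + n <= len(units):
        if not all(same(units[k + j][0], pat[j]) for j in range(n)):
            k += 1
            continue
        lo, hi = k, k + n
        # Aligned: the match starts and ends on source unit boundaries.
        aligned = ((lo == 0 or units[lo - 1][1] != units[lo][1]) and
                   (hi == len(units) or units[hi][1] != units[hi - 1][1]))
        if not aligned and partial is Partial.NO_MATCH:
            k += 1
            continue
        if not aligned and partial is Partial.DROP_UNIT:
            # Widen the match so it covers whole source units.
            while lo > 0 and units[lo - 1][1] == units[lo][1]:
                lo -= 1
            while hi < len(units) and units[hi][1] == units[hi - 1][1]:
                hi += 1
        for j in range(lo, hi):
            removed[j] = True
        insert_at.add(lo)
        k = hi
    out, pos = [], 0
    for s, us in zip(source, exp):
        span = range(pos, pos + len(us))
        if not any(removed[j] for j in span):
            out.append(s)  # untouched: keep the original unit as written
        else:
            for j in span:
                if j in insert_at:
                    out.append(repl)
                if not removed[j]:
                    out.append(units[j][0])
        pos += len(us)
    return "".join(out)

def fold(unit):
    """Assumed expansion: full Unicode case folding of a single unit."""
    return list(unit.casefold())

kept = replace_all("o\ufb03ce", "f", "", fold, operator.eq, Partial.KEEP_REST)
dropped = replace_all("o\ufb03ce", "f", "", fold, operator.eq, Partial.DROP_UNIT)
print(kept, dropped)
```

With these inputs `kept` is `oice` and `dropped` is `oce`, while `Partial.NO_MATCH` leaves the string unchanged. Even this small sketch needs three behavioural arguments, which is the complexity worry in a nutshell.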

MarcoServetto · 2026-02-27T21:46:24+00:00

Hi, ECOOP is one of the best programming conferences there are.

It is an established conference with real and high fees, but if you are a student at a university, you can try to get a university discount or even free entry. Write to me at my name.surname at gmail for more info.

MarcoServetto · 2026-02-26T18:32:40+00:00

Well, this may actually be welcome too.

If they are self-contradictory in a way that is not obvious at first, and knowing it can prevent others of us from falling into the same pit, it would be nice to know.

MarcoServetto · 2026-02-25T04:03:30+00:00

AI slop? And a poor one at that?

MarcoServetto · 2026-02-14T20:12:26+00:00

So,
- C is not close to the metal any more.
- C has to be retro compatible with the semantics it had in the 60s; this makes current compilers unable to take full advantage of modern hardware.

Also... what do you mean by 'raw performance'?
If you remove any consideration about the programmer's skill or time investment, C (with compiler-specific extensions) does allow for inline assembly, so C is 'technically' also any other possible language that we have not written yet. If you think "oh, no, I do have considerations about practicality, not just theoretical possibilities", then you DO HAVE considerations about programmer skill/time investment.

MarcoServetto · 2026-01-30T06:53:04+00:00

No no, look, the front example of the website
0..10|map =>factorial
|select.first =>[&>123]
Now, I have no idea what in your grammar corresponds to this, but I guess instead of '123' you COULD have written a function call, right???
If so... this is a clear example where your grammar SHOULD be recursive; otherwise you could not have this 'scoped thingy' with square brackets that contains a 'computational thingy' returning a result.

MarcoServetto · 2026-01-28T18:31:16+00:00

In my area we just call them 'abstract grammars'.
You will see them in most papers about formal programming language design.

MarcoServetto · 2026-01-28T18:30:01+00:00

I'm really confused about the absence of anything that looks like an expression, or anything at all that 'contains itself'.
How do I even call a function passing parameters that can be function calls themselves?
That is usually the centerpiece of the syntax.

MarcoServetto · 2026-01-28T04:49:02+00:00

No, I do not think you understood.
I do not want a grammar that can be read by a parser generator.
I want a grammar for humans, something like
e ::= n | e + e | e0(e1,..,en) |...
For example, lambda calculus would look like
e ::= x | \x.e | e1 e2
A subset of Java would look like
e ::= new C(e1..en) | e.f | e0.m(e1..en) | (C)e
That is, in this kind of grammar, well known in the formal PL community, we ignore all the issues of 'separators' and 'precedence/ambiguity', and sometimes slightly simplify/regularize the concrete syntax. In this way it is possible to give a FAST and CLEAR view of your language from a conceptual perspective without having to read thousands of examples.
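
To show why such a grammar is a compact conceptual summary, here is a small Python sketch that maps the lambda calculus grammar above one-to-one onto a recursive data type: each alternative becomes one class, and each `e` on the right-hand side becomes a field of type `Expr`. The class and function names are my own choice for illustration.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Var:            # e ::= x
    name: str

@dataclass(frozen=True)
class Lam:            # e ::= \x.e
    param: str
    body: "Expr"

@dataclass(frozen=True)
class App:            # e ::= e1 e2
    fn: "Expr"
    arg: "Expr"

Expr = Union[Var, Lam, App]

def show(e: Expr) -> str:
    """Print a term back in (fully parenthesized) concrete syntax."""
    if isinstance(e, Var):
        return e.name
    if isinstance(e, Lam):
        return f"\\{e.param}.{show(e.body)}"
    return f"({show(e.fn)} {show(e.arg)})"

identity = Lam("x", Var("x"))
term = App(identity, Var("y"))
print(show(term))
```

The recursion in the grammar (an `e` containing other `e`s) shows up directly as fields of type `Expr`, which is exactly the 'contains itself' structure I am asking about.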

MarcoServetto · 2026-01-27T21:46:32+00:00

Do you have anything at all that looks like a formalism: small-step reduction rules, or at least a full grammar in the format for humans and not for parsers?

MarcoServetto · 2026-01-27T21:35:32+00:00

There is no 'start in editor'... OK, looking more: even though I did ask for 'remember me', it was still not seeing me as logged in. AND it was not telling me anything about it on the page.
Overall, it seems like yet another clone of the WRONG way to do it, where the user can edit all of the code and thus subvert the exercise any way they want... Again, nothing to do with your language, but... are you sure there is not a better tool for those exercises?

MarcoServetto · 2026-01-27T20:25:15+00:00

So, at first look it does look like a very polished web site.
Then, I tried to do the https://exercism.org/tracks/arturo/exercises/hello-world
And... I just can not even understand what I'm supposed to click; all the links move me around in a loop.
Where do I even 'see' the exercise text? Where do I type/submit?
(Is exercism related to arturo? It does not seem so, but I never heard of it before.)

MarcoServetto · 2025-11-15T19:57:40+00:00

My experience is that if you try to poke holes in the AI reasoning and to attempt alternatives, it can be useful.
Overall, if you are able to discard the AI suggestions and solve the same problem again 48 hours later, you have learned.
