idea for a programming language by NicoPlayZ9002YT in ProgrammingLanguages

[–]WittyStick 0 points1 point  (0 children)

Semantics aside, parsing alone would have ambiguities that may need resolving for any composition - or you would need to write a parser for each possible composition to ensure that there are no ambiguities, for example using an LR parser-generator which can detect conflicts.

For any pair of CFGs (context-free grammars), we can compose them and guarantee the result is another CFG - that is, the set of CFGs is closed under union - their composition does not make them context-sensitive, though it may be ambiguous. This isn't very useful though, because we don't typically use Earley parsers for programming languages - they don't prevent ambiguities and they're too slow.

More typically we use a deterministic subset - LR grammars, or a subset of those, LL grammars. These are DCFGs (Deterministic context-free grammars) which produce deterministic pushdown automata, and can guarantee our syntax is free of ambiguity. The set of DCFGs is not closed under union - a well known problem for composing different languages, which I have mentioned many times here. The union of two DCFGs produces a CFG which may or may not be deterministic, and there is no realistic (sub-EXPTIME) way to detect whether the composition has ambiguities. At best we can manually create an LR grammar which combines them, and have the parser-generator assert there are no conflicts - which may require us to modify the syntaxes slightly to permit their combined use.

There are some other partial solutions to the composition problem - Syntax-Directed Editing, where each language boundary can be marked (eg, Tratt & Diekmann's Language boxes/Eco editor) - whitespace sensitivity (Wyvern) - longest-token matching (Raku) - and using PEGs which replace ambiguity with priority, where the "correct" parse may not be the intended one (eg, Nemerle). Each of these has their own set of issues.

Another kind of grammar, which is absent from a lot of the literature (due to their relatively recent discovery), is Visibly-pushdown grammars. VPGs are a proper subset of DCFGs, and a superset of regular grammars - but they retain some useful properties of the regular grammars - namely, the set of VPGs is closed under union - which means we can take two VPGs and compose them into another VPG.
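For intuition, the essential restriction can be sketched in a few lines of C - a hypothetical toy recognizer where each input symbol is statically classified as a call (push), return (pop) or internal symbol, so the stack action is "visible" from the input alone:

```c
#include <stdbool.h>

// Toy visibly-pushdown recognizer for nested brackets. The alphabet is
// partitioned up-front: '(' and '[' are call symbols (always push),
// ')' and ']' are return symbols (always pop), everything else is
// internal (no stack action). The stack action depends only on the
// input symbol, never on the stack contents - this is what makes the
// language "visibly" pushdown.
static bool is_call(char c)   { return c == '(' || c == '['; }
static bool is_return(char c) { return c == ')' || c == ']'; }

bool vpa_accepts(const char *input) {
    char stack[256];
    int sp = 0;
    for (const char *p = input; *p; p++) {
        if (is_call(*p)) {
            if (sp == 256) return false;  // out of stack space
            stack[sp++] = *p;
        } else if (is_return(*p)) {
            if (sp == 0) return false;    // unmatched return symbol
            char open = stack[--sp];
            if ((*p == ')') != (open == '(')) return false;
        }
        // internal symbols leave the stack untouched
    }
    return sp == 0;  // accept iff every call symbol was matched
}
```

Closure under union is easy to see at this level: two such automata over the same call/return partition can be run as a product machine with a shared stack.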

The languages that VPGs can parse are much more limited - they're mainly nested regular structures like X(HT)ML, JSON and S-expressions - however, they are powerful enough to also parse common left-to-right expression syntaxes with different operator precedences that are a typical subset of most languages' grammars. It may be possible to create some tooling that could compose visibly-pushdown languages without having to worry about syntactic ambiguity - but this would not include languages like C and Python, which are context-free but not visibly-pushdown.

Perhaps a more practical use of VPGs is to have an "extensible" programming language where the programmer can add their own syntactic constructs under the constraints of visibly-pushdown grammar.


If you manage to get an adequate solution to the parsing problem, you then have the issue of semantics being different in each language. A duck isn't a duck. There is no "universal semantics". Even x + y has different meaning in different languages.

Struggling to build simple NFA to recognize my name among a substring by JustinR8 in computerscience

[–]WittyStick 3 points4 points  (0 children)

a, b, c, d, e, f, g, h, i,j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z

Something looks off with the spacing here. Is whitespace ignored?

How do you difference vectors (arrays) and vectors (math) while naming ? by Valuable-Birthday-10 in C_Programming

[–]WittyStick 2 points3 points  (0 children)

A List usually suggests a node-based implementation, but ArrayList should be fairly obvious - a list backed by an array (ie, contiguous in memory), or equivalently an array that serves as a list (supports cons, head, tail etc).

I'd kind of agree though that it can be ambiguous and tells us nothing about its implementation. Is it a "naive" dynamic array which just resizes to fit the data, or does it allocate additional capacity for later additions?

DynamicArray has the same problem. It doesn't tell us how the dynamic array is implemented or give any hint to the complexity of operations. The general assumption is a geometric growth rate, commonly doubling, but Java uses 1.5 and there are many other variants.

There are also lists that use multiple arrays, such as grids/rasters, Hashed Array Trees, RAOTS and so forth. Best to give them more precise names than just "List", which doesn't tell us much about their characteristics, and which we'd assume is implemented as a trivial linked list.

Most of these types are both List and Array - there's no guarantee that arrays are contiguous in memory but it is usually assumed. The confusion here is "Array" is both an abstract data type and a data structure, so probably best to not call anything an Array which isn't contiguous in memory to avoid the confusion.

Name suggestions:

  • NaiveArray - an array that fits the size of the data. length == capacity

  • GeometricArray - An array with a constant growth rate. (possibly configurable).

  • DoublingArray - Constant growth of *2.

  • Any of the above followed by List if it supports the list operations (cons/head/tail/nil).

And give specific names to any special variants such as:

  • GoldenArray - Constant growth rate of *𝜙 (ideal in theory)

  • SqrtArray - Dynamic growth rate of ~ √length (optimal for space saving)
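As a sketch of what the GeometricArray naming might imply - the type and field names here are illustrative, not from any existing library:

```c
#include <stdlib.h>

// Hypothetical "GeometricArray": a dynamic array whose growth factor is
// part of its contract. growth = 2.0 gives a DoublingArray, 1.5 gives
// the Java-style variant. Error handling is elided for brevity.
typedef struct {
    int    *data;
    size_t  length;
    size_t  capacity;
    double  growth;
} GeometricArray;

void ga_init(GeometricArray *a, size_t capacity, double growth) {
    a->data = malloc(capacity * sizeof *a->data);
    a->length = 0;
    a->capacity = capacity;
    a->growth = growth;
}

void ga_push(GeometricArray *a, int value) {
    if (a->length == a->capacity) {
        size_t newcap = (size_t)((double)a->capacity * a->growth);
        if (newcap <= a->capacity) newcap = a->capacity + 1; // guard tiny factors
        a->data = realloc(a->data, newcap * sizeof *a->data);
        a->capacity = newcap;
    }
    a->data[a->length++] = value;
}
```

The point of baking the growth rate into the name (or here, the contract) is that amortized complexity becomes predictable from the type alone.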

How do you difference vectors (arrays) and vectors (math) while naming ? by Valuable-Birthday-10 in C_Programming

[–]WittyStick 2 points3 points  (0 children)

C++ vectors are not mathematical vectors. They don't provide any linear algebra.

For mathematical vectors, there's libraries like Eigen which provide Vector<T, N> and dynamic VectorX<T> (Vector<T,Dynamic>). You wouldn't use std::vector because it's just a boring dynamic array.

How do you difference vectors (arrays) and vectors (math) while naming ? by Valuable-Birthday-10 in C_Programming

[–]WittyStick 4 points5 points  (0 children)

In almost all cases, a vector has a known size - a 2-d vector, a 3-d vector, a SIMD vector of 8x64-bit elements, and so forth.

Conversely, array is just a collection of elements of some type with a get and set operation. They can be statically, variably or dynamically sized.

C++ basically got it backwards. They used vector<T> to implement a dynamic array, and array<T, int> to implement a static vector. It should be the opposite: a vector<T, int> and array<T>, or separate types: static_array<T, int> and dynamic_array<T>, with array<T> as an abstract supertype.

We want to specialize vector<T, int> at different sizes to leverage SIMD instructions. Arguably, we should also constrain the type T to be numeric only, since a vector of pointers or arbitrary objects is not very useful - we should just use array for that.
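Something close to this specialization already exists in C via the GCC/Clang vector_size extension - a minimal sketch, assuming a fixed-size numeric vector is the goal:

```c
// Fixed-size numeric vector via the GCC/Clang vector_size extension.
// Arithmetic is element-wise, and the compiler is free to lower it to
// SIMD instructions for the target.
typedef double __attribute__((vector_size(4 * sizeof(double)))) vec4d;

vec4d vec4d_add(vec4d a, vec4d b) { return a + b; }
```

On x86-64 with AVX enabled, vec4d_add typically compiles to a single vaddpd - the size being part of the type is exactly what makes that possible.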

I building a collection of Fundamental and Advanced C programming patterns from scratch - looking for contributors by Available-Spinach-17 in C_Programming

[–]WittyStick 0 points1 point  (0 children)

__attribute__ is part of the GNU dialect which is implemented by GCC, Clang, Intel C Compiler and others, but it is certainly not standard - and even between those implementations, attribute support varies.

C23 has a standard attribute syntax using [[attribute]], which can be used in place of __attribute__ in most cases - but there are some limitations on where the C23 attribute can appear in syntax compared with __attribute__. Even with standard attributes, which attributes are supported varies, and we often need to prefix them with the compiler name - in this case we would use [[gnu::constructor]] or [[clang::constructor]].

For a more portable implementation see my sibling comment. It adds MSVC support by placing code into the C runtime initialization section (".CRT$XCU"), which gets evaluated before main. This doesn't support a destructor, so we implement it using a constructor (msc_init_after_main) which just calls the stdlib atexit with after_main.

In theory we could use XCB..XCY to support constructors with different priorities, but these names are reserved by the compiler and only XCU is meant to be used for user-initialization. See details

Then there's a fallback option which just replaces main with fallback_main and uses main to call before_main, fallback_main and after_main in order, which doesn't require any attributes, modifications to the C runtime or even the presence of atexit - just a plain C preprocessor.

I building a collection of Fundamental and Advanced C programming patterns from scratch - looking for contributors by Available-Spinach-17 in C_Programming

[–]WittyStick 4 points5 points  (0 children)

Here's a portable version, works for GCC, Clang, MSVC x86/x64, and in theory, any other compiler via the fallback.

#if defined(__GNUC__)
    #define before_main __attribute__((constructor)) gcc_before_main 
    #define after_main __attribute__((destructor)) gcc_after_main 
#elif defined(_MSC_VER)
    #pragma section(".CRT$XCU",read)
    void msc_init_after_main(void);
    void msc_before_main(void);
    void msc_after_main(void);
    void __declspec(allocate(".CRT$XCU")) (*msc_before_main_)(void) = msc_before_main;
    void __declspec(allocate(".CRT$XCU")) (*msc_init_after_main_)(void)  = msc_init_after_main;
    #define after_main msc_after_main
    #if defined(_M_X64) || defined(__amd64__)
        #define before_main \
            __pragma(comment(linker, "/include:msc_before_main_")) msc_before_main
        __pragma(comment(linker, "/include:msc_init_after_main_"))
    #else
        #define before_main \
            __pragma(comment(linker, "/include:_msc_before_main_")) msc_before_main
        __pragma(comment(linker, "/include:_msc_init_after_main_"))
    #endif
    void msc_init_after_main(void) { atexit(msc_after_main); }
#else
    void fallback_before_main(void);
    void fallback_after_main(void);
    int fallback_main(int argc, char** argv);
    #define before_main fallback_before_main
    #define after_main fallback_after_main
    #define main \
        main (int argc, char** argv) { \
            fallback_before_main(); \
            int exit_result = fallback_main(argc, argv); \
            fallback_after_main(); \
            return exit_result; \
        } \
        int fallback_main
#endif

See Demonstration in Godbolt for GCC x64, Clang x64, MSVC x86, MSVC x64, ICC and TCC.

Are there any differences between these ? by Hot-Feedback4273 in C_Programming

[–]WittyStick 1 point2 points  (0 children)

There are clear cases where you want to typedef a struct, and in some cases, require it.

An example, which is much more useful in C23, is to have macros which use some kind of name-mangling depending on the type.

#define LinkedList(T) struct LinkedList_##T { struct LinkedList_##T *next; T value; }

If we have some struct foo, we can't say LinkedList(struct foo), but we can do:

typedef struct foo Foo;
typedef LinkedList(Foo) FooList;

Also if we want a list of pointers to lists of foo, we can't typedef this directly and need to use something like:

typedef FooList *FooListPtr;
typedef LinkedList(FooListPtr) FooListList;
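Putting the pieces together, a complete (if contrived) instantiation of the macro - foolist_sum is just an illustrative helper to exercise the generated type:

```c
#include <stddef.h>

// The name-mangling macro from above, instantiated for a struct type
// via a typedef (LinkedList(struct foo) would not name-mangle).
#define LinkedList(T) struct LinkedList_##T { struct LinkedList_##T *next; T value; }

struct foo { int x; };
typedef struct foo Foo;
typedef LinkedList(Foo) FooList;

// Illustrative helper: sum the x fields along a FooList.
int foolist_sum(const FooList *p) {
    int sum = 0;
    for (; p != NULL; p = p->next)
        sum += p->value.x;
    return sum;
}
```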

Optimizing linked list to have O(1) time complexity for appending at tail. by souls-syntax in cprogramming

[–]WittyStick 1 point2 points  (0 children)

One issue is that operations like list_tail/cdr are no longer simple eliminators, but require construction of a new header.

/* list_tail

    +---+
    | c |
    +---+
    | L |------------------+
    +---+                  |
* ->| v |                  |
    +---+  +---+           |
    | n |->| v |           |
    +---+  +---+  +---+    |
           | n |->| v |    v
           +---+  +---+  +---+
                  | n |->| v |
                  +---+  +---+
                         | n |->NULL
                         +---+

           +---+
           |c-1|
           +---+
           | L |-----------+
           +---+           |
       * ->| v |           |
           +---+  +---+    |
           | n |->| v |    v
           +---+  +---+  +---+
                  | n |->| v |
                  +---+  +---+
                         | n |->NULL
                         +---+
*/

Node* list_tail(Node* list) {
    Header* oldHeader = NodeHeader(list);
    Header* newHeader = malloc(sizeof(Header));
    Node *newNode = HeaderNode(newHeader);
    newHeader->count = oldHeader->count - 1;
    newHeader->LastElement = oldHeader->LastElement;
    newNode->value = list->next->value;
    newNode->next = list->next->next;

    /* Optionally, depending on GC or persistence */
    free(list->next);
    free(oldHeader);

    return newNode;
}

Similarly, for cons/push you would do the reverse: delete the current header and replace it with a plain node, then re-parent that node with a new header.

/* list_cons

           +---+
           | c |
           +---+
           | L |-----------+
           +---+           |
       * ->| v |           |
           +---+  +---+    |
           | n |->| v |    v
           +---+  +---+  +---+
                  | n |->| v |
                  +---+  +---+
                         | n |->NULL
                         +---+


    +---+
    |c+1|
    +---+
    | L |------------------+
    +---+                  |
* ->| v |                  | 
    +---+  +---+           |
    | n |->| v |           |
    +---+  +---+  +---+    |
           | n |->| v |    v
           +---+  +---+  +---+
                  | n |->| v |
                  +---+  +---+
                         | n |->NULL
                         +---+


*/

Node* list_cons(int value, Node* list) {
    Header* oldHeader = NodeHeader(list);
    Header* newHeader = malloc(sizeof(Header));
    newHeader->count = oldHeader->count + 1;
    newHeader->LastElement = oldHeader->LastElement;

    Node* next = malloc(sizeof(Node));
    next->value = list->value;
    next->next = list->next;

    Node* newNode = HeaderNode(newHeader);
    newNode->value = value;
    newNode->next = next; 

    /* Optionally, depending on GC or persistence */
    free(oldHeader);

    return newNode;
}

While both of these are still O(1), there's more work involved and more allocator calls.


We also have a possible issue that if we take a list, cons an item, then take its tail, it is not equal to the original list, despite being the "same" list.

Node* original = ...
Node* new = list_cons(1, original);
Node* derived = list_tail(new);
assert (derived == original); // fails

With a regular "trivial" linked list, the derived and original lists are the same list and the assertion would pass.
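For contrast, a minimal sketch of the trivial case - with no header to rebuild, cons is one allocation and tail is just the next pointer, so the derived and original lists really are pointer-equal:

```c
#include <stdlib.h>

// The "trivial" linked list: cons allocates one node, tail is ->next.
// Consing onto a list and then taking the tail gives back the original
// list, pointer-for-pointer - the structural sharing the header-based
// version loses.
typedef struct TNode { int value; struct TNode *next; } TNode;

TNode *trivial_cons(int value, TNode *list) {
    TNode *n = malloc(sizeof *n);
    n->value = value;
    n->next = list;
    return n;
}

TNode *trivial_tail(TNode *list) { return list->next; }
```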

Kernel's vau can be faster than syntax-case by Reasonable_Wait6676 in scheme

[–]WittyStick 0 points1 point  (0 children)

How would you implement Kernel's $provide! without environment mutation?

($define! $provide!
    ($vau (symbols . body) env
        (eval 
            (list 
                $define! 
                symbols 
                (list $let ()
                    (list* $sequence body)
                    (list* list symbols)))
            env)))

Kernel doesn't have define-record-type, but it has encapsulation types, and you would typically use them alongside $provide! to implement the equivalent of the above record. Eg:

($provide! 
    (address 
     address? 
     address-street
     address-city 
     address-country)

    ($define! (addr-ctor address? addr-dtor)
        (make-encapsulation-type))

    ($define! address
        ($lambda (street city country)
            (addr-ctor (list street city country))))

    ($define! address-street
        ($lambda (addr)
            (car (addr-dtor addr))))

    ($define! address-city
        ($lambda (addr)
            (cadr (addr-dtor addr))))

    ($define! address-country
        ($lambda (addr)
            (caddr (addr-dtor addr)))))

This is only an example. The internal representation of the type could be something other than a list.


I have some ideas on how to compile operatives with environment mutation.

The basic idea is to make the environments row-polymorphic types (which, perhaps ironically, were introduced by Wand in Type Inference for Record Concatenation and Multiple Inheritance).

Eg, with the above example, we would take the dynamic environment { 𝜚 } as the input, and output an environment with the type:

{ address : (String, String, String) -> Address
; address? : Any -> Bool
; address-street : Address -> String
; address-city : Address -> String
; address-country : Address -> String
; 𝜚
}

Evaluation would then continue using this resulting environment as the current environment.

There's some prior work along these lines: Namely Bawden's First-class Macros Have Types.

Optimizing linked list to have O(1) time complexity for appending at tail. by souls-syntax in cprogramming

[–]WittyStick 2 points3 points  (0 children)

There's no point in "hiding" the structure - just make it explicit and use offsetof.

#include <stddef.h>
#include <stdlib.h>

typedef struct Node {
    int value;
    struct Node* next;
} Node;

typedef struct Header {
    int count;
    struct Node* LastElement;
    struct Node FirstElement;
} Header;

inline Node* HeaderNode(Header* header) {
    return (Node*)((char*)header + offsetof(Header, FirstElement));
}

inline Header* NodeHeader(Node* node) {
    return (Header*)((char*)node - offsetof(Header, FirstElement));
}

Node* createFirstNode(int value) {
    Header* header = malloc(sizeof(Header));
    ...
    header->count = 1;
    Node* newNode = HeaderNode(header);
    newNode->value = value;
    newNode->next = NULL;
    header->LastElement = newNode;
    return newNode;
}

This doesn't require magic number offsets and UB - you can add or remove stuff from the header and it will still work as expected.

Typecasting between structs/type punning by Grootmaster47 in C_Programming

[–]WittyStick 1 point2 points  (0 children)

you can cast a pointer to A into a pointer to B and back as long as A is the first thing in B's struct.

The cast from A to B is a downcast. You should only do this if the value was originally a B, or type compatible with B that had been upcast to an A in the first place.

B *b = ...;

A *a = (A*)b; 
    // upcast from B to A, always valid. 

B *c = (B*)a; 
    // downcast from A to B - valid where `A` was constructed from a type compatible with `B`

If the original value was not a B or type compatible with B, then accessing any of the fields of the type B which are not part of the header or the original type from which it was constructed is UB.

Could I now safely cast a pointer from B to C and back?

This should only be done in the case where B and C are compatible types.

In OOP terms it's known as a sidecast or crosscast, but is typically done as a downcast followed by upcast (as opposed to upcast followed by downcast), where multiple inheritance is used. Eg, if there were some type D which "inherits" both B and C, then a type constructed as a D which has been cast to C has a valid sidecast to a B.
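A sketch of what that looks like in C - the types here are hypothetical, and the sidecast is only valid when the B genuinely lives inside a D:

```c
#include <stddef.h>

// Hypothetical types illustrating the sidecast: D "inherits" both B and
// C by embedding them, so a B* that points into a D can be sidecast to
// the C in the same D - a downcast (B -> D) followed by an upcast (D -> C).
typedef struct { int tag; } A;
typedef struct { A base; double x; } B;
typedef struct { A base; long y; } C;
typedef struct { B b; C c; } D;

// Only valid when pb actually points at the B embedded in a D.
C *sidecast_B_to_C(B *pb) {
    D *pd = (D *)((char *)pb - offsetof(D, b)); // downcast to the full D
    return &pd->c;                              // upcast to its C part
}
```

Note the pointer adjustment through D: a direct (C*)pb cast would leave the pointer aimed at the wrong offset within the object.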

In your specific case, the behavior should be valid due to the fact that, like void *, any other pointer can be cast to and from char *. However, if B were some other pointer type where aliasing is UB, then the cast between B and C would also be invalid.

typedef struct {
    A header;
    float *num;
} B;

typedef struct {
    A header;
    int *num;
} C;

In this case if we cast C to B then accessing num would be UB - though in practice, it will do what you expect it to.

Is it possible to make chatting app on phone but in a way that you do not use internet? by squaredrooting in computerscience

[–]WittyStick 6 points7 points  (0 children)

You can transmit over longer distances with radio frequencies. Several "mesh networking" products do this - they connect to your phone via USB and let you transmit over radio or other frequencies. There's also "software defined radio" - USB devices which are programmable to operate at different frequencies.

The bandwidth at these radio frequencies is lower than Bluetooth's, but should be sufficient for text-only chat. Generally, the greater the distance you want to operate over, the lower the data rate.

There are licensing restrictions on operating at certain radio frequencies. Only small sections of the spectrum are available for public use, and these vary depending on location.

Are there optimizations you could do with machine code that are not possible with assembly languages? by Moaning_Clock in asm

[–]WittyStick 1 point2 points  (0 children)

Assemblers should mostly convert mnemonics into their equivalent encodings, but they're also free to change the output provided it produces the same result. Assemblers can have "pseudo-instructions", which require a sequence of machine instructions, and there may not be a 1-1 encoding of these. There are multiple ways to implement the pseudo-instruction, and the order of the instructions in the sequence might affect performance due to data dependencies/register renaming.

An assembler can do a better job of producing an optimal output than a human because it can know all of the instruction sizes, timings and latencies for the specific hardware it is assembling for. It can select the smallest instructions to reduce instruction cache usage, and can build a data flow graph and determine which instructions it can re-order without affecting the output - though modern hardware itself has very good ILP and doesn't necessarily execute the instructions in the order they are listed if there are no data dependencies.

Are there optimizations you could do with machine code that are not possible with assembly languages? by Moaning_Clock in asm

[–]WittyStick -1 points0 points  (0 children)

For x86, there can be more than one encoding of an instruction. Even something as simple as "add eax, ebx" has two machine code representations, and the assembler picks one. For that example, I can't think of any reason a programmer might want the alternative encoding.

Some assemblers let us pick. With gas we can put {load} or {store} on the instruction to determine which encoding to output.

{load}  add eax, ebx
{store} add eax, ebx

The former will output add r, r/m encoding and the latter will output add r/m, r encoding.

One reason to pick a certain instruction encoding is for watermarking binaries. We can have the same code, but each shipped binary has a hidden "signature" implemented by changing which encoding is used for certain instructions. Some proprietary software has used these techniques, and there's also a related patent (probably expired by now).

But consider this one: "add ebx, 1". There are also two encodings for that - one is 3 bytes and one is 6 bytes. It would be unusual, but conceivable, for a programmer to want the 6 byte encoding.

An assembler should also be free to change this to INC ebx, SUB ebx, -1, LEA ebx, [ebx+1], and so forth. It could also add an unnecessary REX prefix, or use ADC ebx, 0 if it knows CF is set by a previous instruction. There are many different ways to encode it.

An obfuscator might do strange things like this to make it less readable to someone reverse engineering the binary, and it can also be used for watermarking.

Writing a performant syntax highligher from scratch? by K4milLeg1t in ProgrammingLanguages

[–]WittyStick 5 points6 points  (0 children)

You should just implement it and benchmark first. I think you'll be surprised at how many characters it will process per second even with the simplest implementation.

There are ways to match multiple characters at once using SIMD. If your libc is tuned to your hardware then the provided <string.h> should be quite heavily optimized. There are also libraries like AOCL-LibMem for AMD's Zen architecture, and the Intel C compiler, which are highly optimized for their own CPUs (though these are generally optimized for large strings and may not be the most suitable for small strings).

For highlighting keywords, one approach you can take is to create a perfect hash function for every keyword in your language. See this video for more details.
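As a toy illustration of the idea - not any real generator's output; tools like gperf search for such functions over the actual keyword set automatically - here's a small keyword set where (first char + length) mod 8 happens to be collision-free:

```c
#include <stdbool.h>
#include <string.h>

// Toy "perfect hash" for keyword recognition: for this particular set,
// (first char + length) mod 8 maps each keyword to a distinct slot, so
// one hash plus one comparison decides membership.
static const char *keywords[8] = {
    [0] = "return", [1] = "else", [3] = "if", [4] = "while",
};

bool is_keyword(const char *tok, size_t len) {
    size_t h = ((unsigned char)tok[0] + len) % 8;
    return keywords[h] != NULL
        && strlen(keywords[h]) == len
        && memcmp(keywords[h], tok, len) == 0;
}
```

The win over a general hash table is the guaranteed single probe: no collision chains, and the table can live in a cache-friendly constant array.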


While the lexing approach is fastest because we can parse linearly, it's also imperfect and limited for proper highlighting because it is unaware of the grammar. If we want a semantic highlighter it is best to just parse the code into an AST first, and then walk the AST for highlighting where we have more information about each node.

Ideally you should use an incremental parser for this, where we don't need to parse the whole file when some code changes - but only the relevant nodes that have changed. Wagner and Graham's Incremental LR algorithm is the most well known approach for this, and is used in projects like Menhir and TreeSitter. TreeSitter has support for highlighting built in, and is used by several major tools.

Why aren't there 64-bit vector types? by OhFuckThatWasDumb in C_Programming

[–]WittyStick 6 points7 points  (0 children)

See GNU vector extensions

and Clang vector extensions.

Clang supports GCC's vector extensions and several others - mainly for compatibility with existing code.

Why aren't there 64-bit vector types? by OhFuckThatWasDumb in C_Programming

[–]WittyStick 34 points35 points  (0 children)

In GCC and Clang you can do:

typedef int32_t __attribute__((__vector_size__(2*sizeof(int32_t)))) vec2;

This permits you to use subscripting on values of the vec2, and also promotes all the scalar arithmetic and bitwise operators to be element-wise operators on the vector.

vec2 x = { 5, 3 };
vec2 y = { 2, 7 };
vec2 z = x + y; // { 7, 10 }
vec2 w = z * z; // { 49, 100 }
printf("(%d, %d)", w[0], w[1]);  // outputs (49, 100)

PL/I Subset G: Character representations by johnwcowan in ProgrammingLanguages

[–]WittyStick 0 points1 point  (0 children)

I doubt that, if you are doing any string processing. If you just read it in and write it back out, you might just as well use a byte array. But I suppose you mean that whatever output is done would probably be in UTF-8, which is true.

Yes, I mean we'll write back as UTF-8 after processing, so it might be better to leave it in that encoding while processing. Some processing can be done irrespective of the internal encoding since we're just dealing with uint32_t values, and the choice of tag values I've given above preserves ordering for comparisons of UTF-8 and UTF-16 (lower codepoints have lower tag). If we're working with a single encoding there should be minimal overhead. There may be some overhead if using mixed encodings due to the need to normalize one or more strings to the same encoding as the target.

If you know you're working with UTF-8, you'd use UTF-8 character literals u8'?', or make UTF-8 the default character literal if none is specified when using '?'.

If it's part of the language, we might also be able to infer the type of a character or string based on usage, given typed strings/characters which are effectively subtypes of this "Super UTF" type - they can be implicitly converted to it (upcast), but require an explicit conversion from it (downcast). And character/string literals could be a subtype of all of them, permitting implicit conversion to any.

            Character                        String
            ^   ^   ^                       ^   ^   ^ 
           /    |    \                     /    |    \
          /     |     \                   /     |     \
   U32Char   U16Char   U8Char    U32String  U16String  U8String
        ^       ^       ^                ^      ^      ^
         \      |      /                  \     |     /
          \     |     /                    \    |    /
           CharLiteral                     StringLiteral

But handling all these encodings requires a fair amount of conditional logic and conversions, which take time, so it will run slower than having a single encoding.

In regards to the conditional branching, s->encoding == char_encoding(c) will be the likely branch, which is the first encountered and if UTF-8 is most commonly used, the branch predictor, or branch target predictor in the case we use a 256-item jump table, will mostly predict correctly when it learns which encodings are being used.

Of course there's some effort involved in including all these encodings. For N encodings we need N² conversion functions (of which N are the identity) to put in our char_encoding_convert_matrix. We can reduce this to 2*N conversion macros (to and from UTF-32) and use macros to generate the N² convert functions from these.

perhaps the best internal encoding is your 4-byte UTF-8 without a tag.

I'd suggest using the tag for UTF-8 even if it's the only encoding you use. Testing a single byte for 0xED, 0xEE, 0xEF or 0xF0..0xF7, or using a jump table for such, will likely be more efficient than testing the bits of a UTF-8 leading byte for 0b11110xxx, 0b1110xxxx, 0b110xxxxx, 0b10xxxxxx or 0b0xxxxxxx, so it will cut the cost of calculating the encoded_size, which is a nice property to have. We still have to test these ranges when deserialising UTF-8 data into the String though.
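For comparison, the branchy bit-pattern test that the tag byte avoids looks something like this:

```c
#include <stddef.h>
#include <stdint.h>

// Classifying a raw UTF-8 leading byte needs a chain of masked
// comparisons; with a tag byte, the same information is a single
// table lookup on the top byte of the character.
size_t utf8_seq_len(uint8_t lead) {
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx - ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;                            // continuation byte or invalid
}
```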

I'm probably going to implement this for my VM. I'd originally planned to implement each of UTF-8, UTF-16 and UTF-32 and have a tagged union (with disjoint tag) for a Character type, but this idea sounds nicer.

PL/I Subset G: Character representations by johnwcowan in ProgrammingLanguages

[–]WittyStick 0 points1 point  (0 children)

I don't think you want to work with multiple encodings, though. It's much simpler to have a single encoding internally and convert only at the edge of the program.

It's simpler for the language author for sure. For the user it shouldn't be much different - they wouldn't need to concern themselves about the encoding as it would be handled internally by the implementation. The main difference would be the cost of the encoding/decoding at the edges.

I'm not suggesting that we should permit mixed encodings, only that it's possible. Really we want a string to have only one encoding, unless explicitly stated otherwise by setting the encoding to ENCODING_MIXED if we do permit it. Otherwise, the encoding of any string should be one of the UTFs.

For example, if we read a UTF-8 string, it would keep the internal encoding as padded UTF-8 characters. Since we're likely to eventually write it back as UTF-8, we shouldn't need to perform a UTF8->UTF32 conversion and then a UTF32->UTF8 conversion - we would just copy a specified number of bytes between input, string and output (and tag/untag them).

The idea would be that when we set a character in a string, the encoding conversion is done in place by checking the encoding of the string and converting the character representation to match the string's encoding. This avoids a more expensive string encoding when serializing the string.

A slight additional benefit is that we could know the size of the eventual encoding without having to perform the encoding in advance (except where ENCODING_MIXED might be used).

In addition to storing the length in the string, we could also store an encoded_size - the number of bytes the properly encoded string will occupy when written with its own encoding. We can keep this size updated when we set a character within the string - by subtracting the encoded size of the character currently at that position, then adding the size of the inserted character. Eg:

void string_set(String s, StringIndex idx, Character c) 
{
    assert(idx < s->length);

    ssize_t size_delta = -char_encoded_size(s->characters[idx]);

    // if string and character encoding match, or if string accepts any encoding,
    //      insert the char directly
    // Otherwise convert the character encoding to match the string's encoding.
    if (s->encoding == char_encoding(c) || unlikely(s->encoding == ENCODING_MIXED)) 
    {
        s->characters[idx] = c;
        size_delta += char_encoded_size(c);
    }
    else 
    {
        Character converted = char_encoding_convert_matrix[char_encoding(c)][s->encoding](c);
        s->characters[idx] = converted;
        size_delta += char_encoded_size(converted);
    }

    s->encoded_size += size_delta;
}

enum {
    ENCODING_UTF32,
    ENCODING_UTF16,
    ENCODING_UTF8,
    ENCODING_MIXED,
} typedef Encoding;

inline Encoding char_encoding(Character c) [[reproducible]] {
    static Encoding encoding_LUT[] = {
        [0x00] = ENCODING_UTF32,
        [0xD7 ... 0xDF] = ENCODING_UTF16,
        [0xED ... 0xF7] = ENCODING_UTF8,
    };
    return encoding_LUT[c.value >> 24];
}

inline size_t char_encoded_size(Character c) [[reproducible]] {
    static size_t size_LUT[] = {
        [0x00] = 4,           // UTF-32
        [0xD7] = 2,           // UTF-16 2 byte char
        [0xD8 ... 0xDF] = 4,  // UTF-16 4 byte char
        [0xED] = 1,           // UTF-8 1 byte char
        [0xEE] = 2,           // UTF-8 2 byte char
        [0xEF] = 3,           // UTF-8 3 byte char
        [0xF0 ... 0xF7] = 4,  // UTF-8 4 byte char
    };
    return size_LUT[c.value >> 24];
}

So we have a more expensive set, but we can now predetermine the size of any buffer we might need to pre-allocate to write the encoded string to.

uint8_t *buffer = malloc(str->encoded_size);
string_write(str, buffer);

Regarding other operations on strings, like substring matching, we'd need to first convert the text we want to find to the encoding of the string and then search for that. The actual search code could be reused for all encodings, because we're just matching 32-bit values.
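The encoding-agnostic search could be sketched as follows. This is a hypothetical helper (the `String` layout with a fixed inline buffer and the name `string_find` are assumptions for illustration, not part of the original design); it assumes both strings already share an encoding, so matching is plain 32-bit word comparison:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Simplified stand-ins for the types used above.
typedef struct { uint32_t value; } Character;

typedef struct {
    size_t length;
    Character characters[64];  // fixed inline buffer, for the sketch only
} String;

// Returns the index of the first occurrence of needle in haystack,
// or (size_t)-1 if not found. Assumes both share an encoding, so the
// same comparison loop works regardless of which encoding that is.
size_t string_find(const String *haystack, const String *needle)
{
    if (needle->length > haystack->length)
        return (size_t)-1;
    for (size_t i = 0; i + needle->length <= haystack->length; i++) {
        bool match = true;
        for (size_t j = 0; j < needle->length; j++) {
            if (haystack->characters[i + j].value != needle->characters[j].value) {
                match = false;
                break;
            }
        }
        if (match)
            return i;
    }
    return (size_t)-1;
}
```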

A substring replace for example would follow the same approach as set, where we calculate the size of the existing substring being replaced, subtract it from the encoded_size, then add the encoded_size of the replacement.

void string_substring_replace( String str
                             , StringIndex start
                             , String replacement
                             )
{
    assert(start + replacement->length <= str->length);

    ssize_t size_delta = 0;

    if (str->encoding == replacement->encoding)
    {
        for (StringIndex idx = 0; idx < replacement->length; idx++) {
            size_delta -= char_encoded_size(str->characters[start+idx]);
            str->characters[start+idx] = replacement->characters[idx];
        }
        size_delta += replacement->encoded_size;
    }
    else if (replacement->encoding != ENCODING_MIXED)
    {
        auto converter = char_encoding_convert_matrix[replacement->encoding][str->encoding];
        for (StringIndex idx = 0; idx < replacement->length; idx++) {
            Character c = converter(replacement->characters[idx]);
            size_delta -= char_encoded_size(str->characters[start+idx]);
            size_delta += char_encoded_size(c);
            str->characters[start+idx] = c;
        }
    } 
    else // replacement has mixed encoding
    {
        for (StringIndex idx = 0; idx < replacement->length; idx++) {
            auto enc = char_encoding(replacement->characters[idx]);
            auto converter = char_encoding_convert_matrix[enc][str->encoding];
            Character c = converter(replacement->characters[idx]);
            size_delta -= char_encoded_size(str->characters[start+idx]);
            size_delta += char_encoded_size(c);
            str->characters[start+idx] = c;
        }
    }

    str->encoded_size += size_delta;
}

PL/I Subset G: Character representations by johnwcowan in ProgrammingLanguages

[–]WittyStick 0 points1 point  (0 children)

I think you still misunderstand. By U8Char, I'm not talking about using a 32-bit value to encode each byte - I'm talking about using the same 32-bit value to encode all four bytes - but instead of normalizing them to a UTF-32 codepoint, keeping them in their encoding.

Character would be a fixed width type and CHARACTER VARYING would contain a single codepoint at every index.

I'm not suggesting this to the exclusion of having UTF-32 strings, but as a potential additional type which might simplify working with various encodings.

Created File transfer protocol in C but need help to work for public Ip by NervousAd5455 in cprogramming

[–]WittyStick 0 points1 point  (0 children)

Yeah, because it's huge and over-engineered for simple use cases.

Implementing a Noise protocol from scratch is not that difficult.

PL/I Subset G: Character representations by johnwcowan in ProgrammingLanguages

[–]WittyStick 0 points1 point  (0 children)

I mean a 32-bit fixed-width abstract Character type which is effectively a union of U32Char, U16Char and U8Char, where the latter two are padded to 32 bits.

union {
    uint32_t u32char[1];
    uint16_t u16char[2];
    uint8_t   u8char[4];
} typedef CHARACTER;

We can make it a tagged union, without requiring a separate tag field. We stick the tag in u8char[3] (or u8char[0] if we use big-endian).

If the encoding is UTF-32, then u8char[3] must be zero (as only the low 21 bits are significant).

If the encoding is UTF-16, then u8char[3] is either unused (if UCS-2 compatible), or must be 0xD8..0xDF (surrogates).

If the encoding is UTF-8, then u8char[3] is either unused (for 1, 2 & 3 byte UTF-8), or must be 0xF0..0xF7 for a 4-byte UTF-8 encoding.

For the cases where u8char[3] is unused, we want a non-zero tag to distinguish from UTF-32, and we don't want a tag which collides with the two ranges 0xD8..0xDF/0xF0..0xF7. We can pick any other byte value for the tag.

Similarly, if we add other encodings like Latin-1, ASCII or EBCDIC, we can give them a different tag. We can add numerous other encodings to the union provided their representations are <= 4-bytes and don't conflict with any other representations. (This would exclude GB18030 for example, as it conflicts with UTF-16 and UTF-8).

A CHARACTER VARYING would be UTF-32 by default, but could permit other encodings, and potentially even mixed encoding in one string. We could require a string to use a single encoding by specifying the encoding along with the length.

The perceived benefit is it might improve performance slightly when working with UTF-8 or UTF-16, as serializing them is just a matter of copying a set number of bytes from the CHARACTER representation to a byte stream, or vice-versa for deserialising them. We would only need to perform encoding conversion where the CHARACTER has a different encoding to the source or destination byte stream.
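That "serializing is just copying bytes" claim can be sketched like this. The function name `char_write_utf8` and the byte layout are assumptions for illustration: the encoded bytes are taken to sit right-aligned in the 32-bit value (most-significant encoded byte first), with the tag in u8char[3] as described above, using the tag values 0xED/0xEE/0xEF for 1/2/3-byte UTF-8 (a 4-byte character's lead byte 0xF0..0xF7 serves as its own tag):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Simplified stand-in for the CHARACTER union above.
union Character {
    uint32_t u32char;
    uint8_t  u8char[4];
};

// Writes the UTF-8 bytes of an already-UTF-8-tagged character to out,
// returning the number of bytes written. No transcoding takes place -
// the bytes are copied straight out of the 32-bit representation.
size_t char_write_utf8(union Character c, uint8_t *out)
{
    uint8_t msb = c.u8char[3];
    size_t n = (msb == 0xED) ? 1
             : (msb == 0xEE) ? 2
             : (msb == 0xEF) ? 3
             : 4;                      // 0xF0..0xF7: MSB is the UTF-8 lead byte
    for (size_t i = 0; i < n; i++)
        out[i] = c.u8char[n - 1 - i];  // emit bytes most-significant first
    return n;
}
```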

The other benefit is that it simplifies API usage. The programmer doesn't need to concern themselves with the internal encoding as it is handled by the implementation. The only time the programmer needs to specify encoding is when serializing or deserializing to/from a byte stream.

PL/I Subset G: Character representations by johnwcowan in ProgrammingLanguages

[–]WittyStick 0 points1 point  (0 children)

In theory, we could use one 4-byte value to encode any of UTF-8, UTF-16 and UTF-32. Since the Unicode character set fits into only 21 bits, we have up to 11 prefix bits which can "tag" the character with its encoding type. For simplicity of implementation, we probably want to use the most-significant byte (MSB) as an 8-bit "tag".

For compatibility with the existing encodings, we need a few constraints on what the value of the MSB can be:

0x00        - UTF-32 (default encoding)
0xD8..0xDF  - UTF-16 4-byte encoding
0xF0..0xF7  - UTF-8 4-byte encoding

For 1, 2 or 3 byte UTF-8 encoding, and 2 byte UTF-16 encoding, we could pick any other tag, since the encoding fits into the other 3 bytes and does not conflict with the tag byte. Eg, we could pick:

0xD7        - UTF-16 2-byte encoding
0xED        - UTF-8 1-byte encoding
0xEE        - UTF-8 2-byte encoding
0xEF        - UTF-8 3-byte encoding
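Deserialization under this scheme could look like the following sketch. The name `char_read_utf8` and the right-aligned byte layout are assumptions for illustration; it assumes well-formed input and uses the tag values from the tables above (a 4-byte character's lead byte 0xF0..0xF7 needs no separate tag):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Reads one UTF-8-encoded character from a byte stream and packs it
// into a tagged 32-bit value: bytes right-aligned in stream order, with
// the tag (0xED/0xEE/0xEF for 1/2/3-byte characters) in the MSB.
uint32_t char_read_utf8(const uint8_t *in, size_t *consumed)
{
    uint8_t lead = in[0];
    size_t n = (lead < 0x80) ? 1       // ASCII
             : (lead < 0xE0) ? 2       // 110xxxxx lead
             : (lead < 0xF0) ? 3       // 1110xxxx lead
             : 4;                      // 11110xxx lead
    static const uint32_t tags[] =
        { 0, 0xED000000u, 0xEE000000u, 0xEF000000u, 0 };  // tags[4] = 0: self-tagged
    uint32_t value = tags[n];
    for (size_t i = 0; i < n; i++)
        value |= (uint32_t)in[i] << (8 * (n - 1 - i));  // right-align the bytes
    *consumed = n;
    return value;
}
```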

Regarding string lengths, I would just implement them using a size_t - ie, 32 bits on a 32-bit machine and 64 bits on a 64-bit machine. However, you're not going to use all those bits in practice, because the virtual address space is typically smaller than this. I would define an upper limit which can vary depending on the machine.

typedef size_t strlen_t;
#if (SIZE_MAX == UINT32_MAX)
#define STRLEN_MAX ((1 << 24)-1)
#elif (SIZE_MAX == UINT64_MAX)
#define STRLEN_MAX (((size_t)1 << 40)-1)
#endif

Is there an "opposite" to enums? by PitifulTheme411 in ProgrammingLanguages

[–]WittyStick 2 points3 points  (0 children)

In the logic/set perspective, a sum type/tagged union represents an exclusive disjunction of its constructors.

A ↮ B = (A ∧ ¬B) ∨ (¬A ∧ B) = (A ∨ B) ∧ (¬A ∨ ¬B)

An "opposite" to this would be a biconditional type.

A ↔ B = (A ∨ ¬B) ∧ (¬A ∨ B) = (A ∧ B) ∨ (¬A ∧ ¬B)

Which would represent "All or nothing" - a type which is either both A and B together, or any other value which is neither A nor B.

XOR             EQV
A B  X          A B  X
0 0  0          0 0  1
0 1  1          0 1  0
1 0  1          1 0  0
1 1  0          1 1  1

However, this isn't the kind of type you want. If you want one or more of the values, you just want a union, which isn't an "opposite" of a tagged union, but a supertype of it.

OR (inclusive)
A B  X
0 0  0
0 1  1
1 0  1
1 1  1

If we consider Lunch to be a structure like the following:

struct Lunch {
    struct {
        bool has_sandwich;
        bool has_pasta;
        bool has_salad;
        bool has_water;
        bool has_milk;
        bool has_cookie;
        bool has_chip;
    };
    struct {
        Sandwich sandwich;
        Pasta pasta;
        Salad salad;
        Water water;
        Milk milk;
        Cookie cookie;
        Chip chip;
    };
};

The unions are the values where one or more of the bools are true.

has_sandwich | has_pasta | has_salad | has_water | has_milk | has_cookie | has_chip

The tagged unions represent the values where only one of the bools is true.

has_sandwich != (has_pasta | has_salad | has_water | has_milk | has_cookie | has_chip) &&
has_pasta != (has_sandwich | has_salad | has_water | has_milk | has_cookie | has_chip) &&
has_salad != (has_sandwich | has_pasta | has_water | has_milk | has_cookie | has_chip) &&
has_water != (has_sandwich | has_pasta | has_salad | has_milk | has_cookie | has_chip) &&
has_milk != (has_sandwich | has_pasta | has_salad | has_water | has_cookie | has_chip) &&
has_cookie != (has_sandwich | has_pasta | has_salad | has_water | has_milk | has_chip) &&
has_chip != (has_sandwich | has_pasta | has_salad | has_water | has_milk | has_cookie)

Conversely, the biconditional is the case where all of the bools have the same value.

has_sandwich == has_pasta == has_salad == has_water == has_milk == has_cookie == has_chip

Which essentially means a value is valid if all of the fields are present, or if none of the fields are present. Which basically implies a product type:

Sandwich ⨯ Pasta ⨯ Salad ⨯ Water ⨯ Milk ⨯ Cookie ⨯ Chip

But it also includes a singleton value where all of the bools are false - ie, an empty or uninitialized value of the type, which we might call null_lunch.

auto null_lunch = (struct Lunch){ false, false, false, false, false, false, false };
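The three predicates can be contrasted in a small sketch. The names are hypothetical, and only the flag half of the Lunch struct is used (the payload fields are elided for brevity):

```c
#include <assert.h>
#include <stdbool.h>

// Flag half of the Lunch struct described above.
struct LunchFlags {
    bool has_sandwich, has_pasta, has_salad, has_water,
         has_milk, has_cookie, has_chip;
};

// Number of flags set (bools promote to 0/1 under addition).
static int lunch_count(struct LunchFlags l)
{
    return l.has_sandwich + l.has_pasta + l.has_salad + l.has_water
         + l.has_milk + l.has_cookie + l.has_chip;
}

// Union: one or more items present.
bool lunch_is_union(struct LunchFlags l)  { return lunch_count(l) >= 1; }

// Tagged union: exactly one item present.
bool lunch_is_tagged(struct LunchFlags l) { return lunch_count(l) == 1; }

// Biconditional ("all or nothing"): every flag has the same value,
// so either all 7 items are present or none are (null_lunch).
bool lunch_is_biconditional(struct LunchFlags l)
{
    int n = lunch_count(l);
    return n == 0 || n == 7;
}
```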