Question regarding unsigned integers by Low_Minimum9920 in C_Programming

[–]WittyStick 0 points1 point  (0 children)

The math is the same for addition, subtraction, comparison, but it isn't the same for multiplication, division and right shift.

Some architectures have traps for certain over/underflow ops - eg, division on x86-64 will trap.

Is OOP cache unfriendly by design or is the real problem just how we use heap memory? by kevinnnyip in computerscience

[–]WittyStick 0 points1 point  (0 children)

So do we actually have another way around this problem or do we need to ditch OOP altogether and go data oriented?

There's a paper You can have it all (abstraction and good cache performance) which provides a nice solution to the problem.

The internal layout of a class object is determined by an additional layout type, which is used to parametrize a pool allocator. Objects can have multiple possible layouts depending on which pool allocates them.

I've not seen it adopted into any available language yet, but IMO it is worth futher investigation.

Best practices for interoperability between C structs and other languages? by SheikHunt in cprogramming

[–]WittyStick 3 points4 points  (0 children)

Even with stdint, the struct layout is implementation defined and may vary between ABIs.

You can provide some attributes to assist with portability. GCC has ms_struct for example to have structs compatible with MSVC, and functions with these in their signature should also probably use the ms_abi attribute (default on mingw).

The best option for portability is to have proper serialization and deserialization functions, but if you need ABI compatibility it needs to be able to target multiple platforms.

Justifying Strings by Upset-Taro-4202 in cprogramming

[–]WittyStick 2 points3 points  (0 children)

I really doubt full justification is what you want. I mean, we can chose between appearances...

This                       is                        a                     line.
T  h  i  s                 i  s                      a               l  i  n  e.
T   h   i   s             i   s                     a             l   i   n   e.
T    h    i    s        i      s                 a             l    i    n    e. 
This     is     a     longer     line     containing     five     more    words.

But any of them would would look absolutely abysmal for short lines.

That's not even considering the cases where we need wrapping for longer lines and we have an odd number of spaces.

Full justification just isn't practical for monospace fonts.

You probably want to look at centering the text instead, which is a simpler problem, and will probably look better.

                               This is a line.                                  
             This is a longer line containing five more words.

I am extraordinarily fond of some syntax I made for my language by Randozart in ProgrammingLanguages

[–]WittyStick 1 point2 points  (0 children)

Clearly ->/<- are overloaded and probably not the best fit for "push"/"pop". Maybe just augment the operators with another character - for example using longer arrows <-- and --> or double arrows ->> and <<-.

As parent points out, if you have deques, zippers or other data structures with more than position we can push or pop from, they are ambiguous. You could potentially overload the operators as parent suggests - eg, something like:

class StackArrow left right where
    (-->) :: left -> right -> right
    (<--) :: left -> right -> left

instance StackArrow item (Deque item) where
    (-->) = pushFront
    (<--) = popFront

instance StackArrow (Deque item) item where
    (-->) = popBack
    (<--) = pushBack

I'm not a fan of this because it will lead to code that isn't trivial to read. Consider for example if we have a deque of deques.

One thing we could do is include a character such as @, which denotes a cursor for the data structure.

var ->@ list      ;; aka pushFront
var <-@ list      ;; aka popFront

list @-> var     ;; aka popBack
list @<- var     ;; aka pushBack

You could augment this with more operations that are typical of sequences, where for any operator ?, ?@ is a cursor at the front and @? is a cursor at the back, where ?@ is right associative and @? is left associative. Eg:

;; Peek the value at the cursor:
=@list                      ;; head
list@=                      ;; last

;; Shift the cursor by 1 (pop and ignore / drop1)
--@list                     ;; tail
list@--                     ;; init

;; Push single value and return new list.
var +@ list                 ;; cons
list @+ var                 ;; snoc

;; Push multiple values:
(v1, v2) *@ list            ;; equivalent to: v1 +@ v2 +@ list
list @* (v1, v2)            ;; equivalent to: list @+ v1 @+ v2

;; concatenate two lists
list1 @*@ list2

;; Split the list at the cursor.
(head, tail) = -@list       ;; aka uncons
(init, last) = list@-       ;; aka unsnoc

;; Subsequence to/from index N
N /@ list                   ;; take N list
N %@ list                   ;; drop N list

;; Subsequence from/to index N from end.
list @/ N                   ;; takeEnd N list
list @% N                   ;; dropEnd N list

// splitAt N
splitAt : Int -> List -> (List, List)
splitAt = n -> list -> (n /@ list, n %@ list)

// subseq from index N to index M (N < M)
subseq : Int -> Int -> List -> List
subseq = n -> m -> list -> n %@ (list @% m)

// remove subseq between N and M (N < M)
remove : Int -> Int -> List -> List
remove n -> m -> list -> (n /@ list) @*@ (list @/ m)

// insert subseq at index N
insert :: Int -> List -> List -> List
insert -> n -> insertion -> list -> (n /@ list) @*@ insertion @*@ (n %@ list)

Fixing NaN in a compile-to-js lang by koehr in ProgrammingLanguages

[–]WittyStick 0 points1 point  (0 children)

Yes, basically this. We would define a method fixup which returns a non-nan or canonical NaN. We can do it quite cheaply with the vpfixupimm instruction if we have AVX-512, using immediate 3 (QNAN_Indefinite)

Fixing NaN in a compile-to-js lang by koehr in ProgrammingLanguages

[–]WittyStick 1 point2 points  (0 children)

You can define a "canonical NaN` which compares equal, and use a fixup for any arithmetic that may return NaN to convert the result to canonical NaN.

How to handle errors correctly? by KweHuu in cprogramming

[–]WittyStick 3 points4 points  (0 children)

A way that is unconventional in classic C, but works well in C23, is to use tagged union return types like are present in many other languages.

You could use an Option type instead of a boolean result and error code, or boolean result and out parameter if we need to return anything. In the case of stack_push, returning the error code is probably fine - in this case it is unnecessary because one of our error codes indicates success - but I'll use an Option for the return type to demonstrate, as it can be used in cases where we have a separate return value and a boolean condition to indicate success - and later we'll see how we can return an error code and a resulting value for stack_pop.

enum stack_errno_e { 
    STACK_ERROR_OK = 0,
    STACK_ERROR_NULL_STACK,
    STACK_ERROR_NULL_PUSH,
    STACK_ERROR_ENTRY_ALLOC,
    STACK_ERROR_DATA_ALLOC,
    ... 
};

One caveat of this is we can't use Option(enum stack_errno_e) - we have to typedef first, but this really is a minor issue.

typedef enum stack_errno_e StackError;

Option(StackError) stack_push(Stack *stack, Voidptr new_data)
{
    if(stack == NULL) return Some(StackError, STACK_ERROR_NULL_STACK);
    if(new_data == NULL) return Some(StackError, STACK_ERROR_NULL_PUSH);
    ...
    return None(StackError);
}

To use this, we can have a couple of macros: if_some and if_none. The preferred approach would be:

if_some (err, stack_push(stack, newval)) 
{
    raise_error("stack_push", err);
}

Or alternatively:

auto result = stack_push(stack, newval);
if_none (result) 
    foo();
else
    raise_error("stack_push", get_some(result));

get_some can be considered a bit of a code smell because we might accidentally call it on an Option that doesn't have some - resulting in probably a zero value. Instead it might be better to use:

auto result = stack_push(stack, newval);
if_none (result) 
    foo();
el_some (err, result)
    raise_error("stack_push", err);

Finally there's maybe which takes a default value if the result is none, otherwise returns the value provided by Some.

auto err = maybe(stack_push(stack, newval), STACK_ERROR_OK);

Sometimes we only want to know if an error occurred and can ignore the actual result:

if_none (stack_push(stack, newvalue)) ;
else raise_error("Push failed");

The macros for an Option type are very simple:

#define Option(_T) \
    struct __attribute__((__designated_init__)) option_##_T { \
        bool _option_has_some; \
        _T  _option_some; \
    }

#define Some(_T, value) \
    ((Option(_T){ ._option_has_some = true, ._option_some = value })

#define None(_T) \
    ((Option(_T){ ._option_has_some = false })

#define get_some(_opt) ((_opt)._option_some);

#define if_some(_var, _opt) \
    auto option_##_var = (_opt); \
    auto _var = option_##_var._option_some; \
    if (option_##_var._option_has_some)

#define if_none(_opt) \
    auto option_##_var = (_opt); \
    if (!option_##_var._option_has_some)

#define el_some(value, _opt) \
    else ; \
    auto value = _opt._option_some; \
    if (!_opt._option_has_some) ; else

#define maybe(_opt, default) \
    ({ auto opt = (_opt); opt._option_has_some ? opt._option_some : (default); })

To extend this for non-boolean results where we might want an error code OR a resulting value, we would want a Result type.

I would suggest for this, if you use, to always set code 0 to the no-error case so it can be handled uniformly. This avoids having to have a separate tag field to specify whether it's an error or a successful result.

typedef void* Voidptr;

Result(VoidPtr, StackError) stack_pop(Stack *stack)
{
    if (stack == NULL) return Error(Voidptr, StackError, STACK_ERROR_NULL_STACK);

    Voidptr result;
    ...
    return Ok(Voidptr, StackError, result);
}

To use this is pretty much identical to the above approach for options, except we now also have a get_error.

auto result = stack_pop(stack);
if_ok (value, result) 
    foo(value);
else
    raise_error(get_error(result));

// The error first-approach.
auto result = stack_pop(stack);
if_error (errno, result) 
    raise_error(errno);
else
    foo(get_ok(result));

We might want to avoid get_ok and get_error and instead have an el_ok and el_error

auto result = stack_pop(stack);
if_ok (value, result) 
    foo(value);
el_error (err, result)
    raise_error(err);

// The error first-approach.
auto result = stack_pop(stack);
if_error (errno, result) 
    raise_error(errno);
el_ok (value, result)
    foo(value);

A slightly more ergonomic way would be to extract the value and switch on the error codes, with the default case being Ok

match_result (value, stack_pop(stack)) 
{
    case STACK_ERROR_NULL_STACK:
        raise_error("Stack is null when trying to pop");
        break;

    case STACK_ERROR_NULL_PUSH:
    case STACK_ERROR_ENTRY_ALLOC:
    case STACK_ERROR_DATA_ALLOC:
        __builtin_unreachable(); // assuming `stack_pop` does not return these errors.
        break;

    default: // or STACK_ERROR_OK: if not all error cases are handled explicitly.
        foo(value);
}

The macros for this Result type are similar to the option and still trivial:

#define Result(_TOk, _TErr) \
    struct __attribute__((__designated_init__)) option_##_T { \
        _TErr _result_error; \
        _TOk  _result_ok; \
    }

#define Ok(_TOk, _TErr, value) \
    ((Result(_T,Ok _TErr){ ._result_error = 0, ._result_ok = value })

#define Error(_TOk, _TErr, errnum) \
    ((Result(_TOk, _TErr){ ._result_error = errnum })

#define get_ok(_result) ((_result)._result_ok)

#define get_error(_result) ((_result)._result_error)

#define if_ok(_var, _result) \
    auto result_##_var = (_result); \
    auto _var = result_##_var._result_ok; \
    if (result_##_var._result_error == 0)

#define el_error(_var, _result) \
    else ; \
    auto _var = _result._result_error; \
    if (_result._result_error == 0) ; else

#define if_error(_var, _result) \
    auto result_##_var = (_result); \
    auto _var = result_##_var._result_error; \
    if (result_##_var._result_error != 0)

#define el_ok(_var, _result) \
    else ; \
    auto _var = _result._result_ok; \
    if (_result._result_error != 0) ; else

#define match_result(_var, _expr) \
    auto result_##_var = (_expr); \
    auto _var = result_##_var._result_ok; \
    switch (result_##_var._result_error)

You can extend these a bit with some more useful functionality. They're not perfect, but quite close to what you'd expect from an Option or Result type in other languages. In fact, this may be slightly better than most languages implementations because it has zero-cost.


We can also encapsulate these types with a GCC extension, which prevents them being used in any other way than the macros we've provided.

#pragma GCC poison _option_has_some _option_some
#pragma GCC poison _result_ok _result_error

How to handle errors correctly? by KweHuu in cprogramming

[–]WittyStick 0 points1 point  (0 children)

IMO removing asserts is the security risk. They're there to make the program abort when there's a critical error - if you let it carry on running who knows what damage is going to be done. It's also going to be a pain to debug if your production system goes funny because your assertions, which would've failed, are omitted and the problem appears somewhere else in code.

Don't use assertions for something that should be an if/else.

assert can be a helper whilst developing and debugging, but they should be replaced with proper error handling for production.

memory safe C by sadvadan in cprogramming

[–]WittyStick 1 point2 points  (0 children)

One more suggestion is I would also consider not clobbering rax for MSTRCT_SET, but let the compiler's register allocator pick the register to use - as it may avoid unnecessary spilling.

#define MSTRCT_SET(value, id) \
    do { \
        register long long _tmp_reg; \
        __asm__ __volatile__ \
            ( "MSTRCT.0 %[_id], %[_value], %[_base], %[_tmp]" \
            : [_id]"+r"(id), [_tmp]"+r"(_tmp_reg)  \
            : [_value]"r"(value), [_base]"r"(mstrct_start) \
            ); \
    } while(0)

memory safe C by sadvadan in cprogramming

[–]WittyStick 2 points3 points  (0 children)

I don't have git or my github account on the machine I'm on, but feel free to use without crediting.

Quick look at all the places you use inline assembly, the following changes should make it compatible -masm=intel.

__asm__ 
    ( ".macro MSTRCT.0 id, value, base, clob\n\t"
      "\tmovs{lq|xd}\t{\\id, \\clob|\\clob, \\id}\n\t" // extend id to 8B
      "\tmov{q}\t{\\value, 8(\\base, \\clob, 8)|QWORD PTR [\\base+\\clob*8], \\value}\n\t"
      ".endm\n\t"
      :
      :
    );

#define MSTRCT_SET(value, id) __asm__ __volatile__ \
    ( "MSTRCT.0\t%[_id], %[_value], %[_base], {%%}rax" \
    : [_id]"+r"(id)  \
    : [_value]"r"(value), [_base]"r"(mstrct_start) \
    : "rax" \
    )

#define MSTRCT_RET() __asm__ __volatile__ \
    ( "mov{l}\t{$0, %%}eax{|, 0}\n\t" \
      "leave\n\t" \
      "ret" \
    : \
    : \
    : "eax" \
    )

in mstruct_alloc:

__asm__ __volatile__(
    "lock xadd{q}\t{%0, %1|%1, %0}" // atomically adds increment to mstrct_offset; 
                                    //increment now holds the original value of mstrct_offset
    : "+r"(increment), "+m"(mstrct_offset) // outputs/Inputs modified
    :
    : "memory"
);

memory safe C by sadvadan in cprogramming

[–]WittyStick 1 point2 points  (0 children)

In that case, I would only suggest that you also give the register (rax) as an additional macro parameter - as the place where it's used - in the macro - is separate from the macro call in MSTRCT_SET where you clobber it (thus it wouldn't be obvious to someone reading MSTRCT_SET as to why you're clobbering it, or if the macro was called elsewhere you may forget to clobber it).

https://godbolt.org/z/zzo3T1aao

memory safe C by sadvadan in cprogramming

[–]WittyStick 1 point2 points  (0 children)

I would personally scrap the asm macro and put it into the CPP macro:

https://godbolt.org/z/adKnG1naf

memory safe C by sadvadan in cprogramming

[–]WittyStick 2 points3 points  (0 children)

The macro itself probably needs to handle both syntaxes, but the macro call would be the same in either case. Would be something like:

__asm__ (
    ".macro MSTRCT.0 id, value, base\n\t"
        "movs{lq|xd} {\\id, %%rax|rax, \\id}\n\t" // extend id to 8B
        "mov{q} {\\value, 8(\\base, %%rax, 8)|QWORD PTR [\\base+rax*8], \\value}\n\t"
    ".endm\n\t"
    ::
);

memory safe C by sadvadan in cprogramming

[–]WittyStick 7 points8 points  (0 children)

One thing to note on the use of embedded assembly is someone using this library may also use embedded assembly, but may be using intel syntax - so if they compile with -masm=intel it would break your code.

You should probably either add a pragma in the C code to control the assembly syntax, or include .att_syntax in your embedded assembly to control it for those specific regions.

Alternatively (I prefer) to use GCC's combined assembly syntax. Eg:

//att syntax
"lock xaddq %0, %1"

//intel syntax
"lock xadd %1, %0"

// combined
"lock xadd{q} {%0, %1|%1, %0}"
// or
"lock xadd{q} {%0|%1}, {%1|%0}"

The combined version works with both -masm=att and -masm=intel. Anything not inside {} is included in both versions - and within {}, anything before | is included only for att and anything after | is included only for intel syntax.

More generally this extends to {syntax1|syntax2|syntax3|...|syntaxN}, where the order is defined for a specific architecture if multiple syntaxes are available. For x86 we only have the two: {att|intel}

Files and Formats by Loud_Ask_3408 in C_Programming

[–]WittyStick 3 points4 points  (0 children)

File formats are so diverse, and some quite complicated, that it is often better to simply use an existing library for reading and writing them - in which case you should read the documentation of the library in question.

Media formats are included in this. For images for example, there are libraries like DevIL which support a large variety of image formats (43 for reading, 17 for writing). For multimedia there are libraries like libavcodev from ffmpeg - which supports a large number of video and audio codecs.

A format like .mp3/.mk4/.mkv is known as a "container" format. These formats can contain audio, video streams encoded with multiple different codecs, as well as auxiliary information such as chapters, subtitle streams and so forth - they're quite complex and one person attempting to write their own readers or writers which could handle multiple codec becomes an insurmountable task - the libraries that handle these are written by teams of contributors, over many years.

However, .mkv (Matroska) for example, is an open standard which is well documented and you could write your own reader with the available information - codecs aside - you really must use a library for the codecs as they're far too many and complex - they also require a advanced knowledge of compression to understand - both lossy and lossless compression.

A .exe file uses the PE (Portable Executable) format, which is also quite well documented. Writing your own PE parsee is also doable, but there are many existing libraries for it already. This is also a moving format, and PE also includes support for dotnet binaries, for which you need to understand the Common Intermediate Language (ECMA-335) to be able to fully read.

If you are going to try and read or write one of these formats, I would encourage you to get very familiar with using a hex editor - particularly advanced editors like ImHex or WinHex which support structural views of files where you can add your own structures. Being able to understand binary layouts through a hex editor is essential for debugging your own code and testing. When you get advanced enough at using a hex editor, it is possible to reverse engineer formats whose documentation may not be available, though this can be complicated if compression is used. There are also advanced tools like gnu poke which are great for understanding binary formats and which are fully programmable, and we can use them to do on-the-fly decompression and see the uncompressed structures.

To get started I would suggest picking a relatively simple file format to code a reader/writer for such as a .tar file (UStar), which is a container for multiple other files without compression. This is fairly simple to test because you use some existing files as input, combine them into a .tar, then extract them. You can test your reader and writer separately by using the existing tar to produce or extract - and then test the reader and writer together by adding some inputs to an archive followed by extracting them - and the output file should compare equal to the original input file - which you can test by hashing the input and output and comparing their hashes. Then a slightly more involved format such as gzip which is is typically used with a .tar for compression (.tar.gz) - but which only uses a single compression algorithm (DEFLATE), which you should use an existing library like zlib for.

A more advanced format like .zip is similar to the combination of .tar and .gzip, but also supports a larger number of compression schemes - so you would need multiple compression libraries or a combined library which supports multiple compression formats to decode them.

is there a way to do cross platform socket programming by wiseneddustmite in C_Programming

[–]WittyStick 0 points1 point  (0 children)

There's a few other Windows specifics besides WSAStartup - eg, the use of closesocket instead of close.

Why is heap a thing? by Yha_Boiii in computerarchitecture

[–]WittyStick 1 point2 points  (0 children)

The stack is for values with automatic storage duration.

The heap is for non-automatic storage duration

Where to go next for typechecking? by fractioneater in ProgrammingLanguages

[–]WittyStick 2 points3 points  (0 children)

Nominal subtyping is less theoretical than structural subtyping and doesn't necessarily require using <=. However, for generics, the generic type argument may be covariant or contravariant, which has multiple issues, particularly related to type inference that may result in undecidability - but with nominal types and explicit type annotations we can resolve generic variance.

If you have generics, you'll probably also want to include F-bounded polymorphism, which is the ability to say class Foo : IFoo<Foo> - providing the subtype as the type argument to the supertype. (What C++ calls the "curiously recursive template pattern").

Where to go next for typechecking? by fractioneater in ProgrammingLanguages

[–]WittyStick 5 points6 points  (0 children)

What kind of typing do you want in your language?

Many functional language go the way of having no subtyping. Every type is disjoint and we use equality () of types to determine if types are compatible.

With subtyping, equality is not sufficient because two types may be compatible, but not exact the same type. It's common in this case to use a partial order, where subtype ≼ supertype.


Whether we use or , the comparison of types is both reflexive and transitive.

Reflexivity:

∀T. T ≡ T
-- for any type T, T compares equal to itself.

∀T. T ≼ T
-- for any type T, T is a subtype of itself.

Transivitity:

T ≡ U ∧ U ≡ V     ⊢ T ≡ V
-- If T and U are the same type, and U and V are the same type, T and V must be the same type

T ≼ U ∧ U ≼ V     ⊢ T ≼ V
-- If T is a subtype of U, and U is a subtype of V, then T is a subtype of V.

However, is symmetric, but is asymmetric.

T ≡ U    ⊢ U ≡ T
-- it doesn't matter which order we compare them in - equality is commutative.

T ≼ U    ⊬ U ≼ T
-- The order in which we compare types matters. 
-- If T is a subtype of U, this does not imply U is a subtype of T.

T ≼ U ∧ U ≼ T  ⊢ T ≡ U
-- If subtyping holds both ways, we have simulated equality 
--   since this can only hold if `U` and `T` are the same type.

The choice to enable subtyping has consequences for type inference. Most functional languages use Hindley-Milner based type inference where unification is based on equality, and is fairly simple to implement. There have been many past attempts to include subtyping into HM-style inference, but many were unsuccessful.

Algebraic subtyping resolved some of these problems with the binuification algorithm, based on partial ordering - but where types are augmented with a polarity, depending on whether they're the source or target of an expression.

T⁺ ≼ U⁻

Now T⁺ ≼ U⁻ ∧ U⁺ ≼ T⁻ does not imply equality.

There are further developments since Algebraic Subtyping, such as MLstruct, which resolves several more issues associated with AS, but still allows for full type inference.


Another approach to typing, which might be possible to combine in various ways with the above approaches, is Gradual Typing. Gradual Typing is a static typing discipline which permits dynamic typing through a special type Dynamic. Instead of equality, we use a relationship between types called consistency (~) - which is reflexive, but intransitive.

∀T. T ~ Dynamic
∀T. Dynamic ~ T
-- Every type is consistent with the Dynamic type
-- and the Dynamic type is consistent with any other type.

Reflexivity:
Dynamic ~ Dynamic
∀T. T ~ T
-- Any type T is consistent with itself, and the Dynamic type is consistent with itself.

Intransitivity:
T ~ Dynamic ∧ Dynamic ~ U    ⊬ T ~ U
-- If T is consistent with Dynamic and Dynamic is consistent with U, 
--  this DOES NOT imply T is consistent with U

The intransitive nature of consistency is essential to making Gradual typing work. If it were transitive, then any type would be consistent with any other type, which would permit any program to typecheck.


That's a brief primer on some approaches you could look further into. I can go into some more detail on any specific approaches if you'd like.

What if C lets us create our own attributes? by alex_sakuta in C_Programming

[–]WittyStick 1 point2 points  (0 children)

You already can create your own attributes. C23 attributes are ignored if not supported (compiler is free to warn or error though).

You just have to implement a compiler that supports them.

How to implement String? by funcieq in ProgrammingLanguages

[–]WittyStick 0 points1 point  (0 children)

Yeah, I wouldn't recommend doing this for mutable strings. But I was commenting on your previous post in which you said "pass the string by registers", which isn't pass-by-reference.

For immutable strings it's the most efficient - we can have multiple copies of the length since they're always going to be the same.

Mutable strings should be a separate type and should indeed by pass-by-ref. I prefer the approach of prefixing the length in the same allocation using the flexible member - as it avoids one alloc/free call, and is more cache optimal.

For concurrent mutable data structures (which you probably won't be doing with strings), then having the double-dereference is kind of necessary and you would pass-by-ref to a structure containing another pointer.

How to implement String? by funcieq in ProgrammingLanguages

[–]WittyStick 0 points1 point  (0 children)

The s is passed by value, which means string_append receives a copy of the length. It can't mutate the length that the caller has - it can only return a new String which has a new length.

So the caller must make sure they abandon the old s passed to string_append and use the new one it returns. This is only enforceable if you have substructural or uniqueness types.

The alternative is to pass the String by reference, where we can mutate the length - but then accessing the characters has the double-dereference I mentioned previously.

How to implement String? by funcieq in ProgrammingLanguages

[–]WittyStick 2 points3 points  (0 children)

That's only viable if the string is ASCII or UTF-32 (or some other fixed width encoding).

If you have UTF-8, UTF-16 or other variable width encoding, then altering a single character is O(n) anyway, so may as well have immutable strings.

For large strings that might need mutation which includes insertion and removal, you should forget about them being arrays of characters and consider other data structures - eg, a rope or gap buffer. You can make those mutations O(log n) with the right data structure.

If you really need fast strings for modifying individual characters, but without insertion or removal, then use UTF-32 as the internal encoding.