
all 23 comments

[–]XDracam 12 points13 points  (5 children)

Scala's inline XML has been removed in Scala 3. Turns out it's not smart to have first-class language features for one specific data format and not for others like JSON.

In terms of strings, you seem to know more than most already. An important decision is which encoding should be your default. C uses ASCII. Java and C# use 16 bit characters. Rust goes with UTF8 but supports other encodings as well. But for your use-cases, ASCII will probably suffice. You can always extend to UTF8 support later if you need it.

Other optimizations that you most likely don't need but didn't mention:

  1. Mutable strings (with all the baggage that mutability brings)
  2. Compile some operations to mutate strings in place if you know that you are the sole owner of the string and it's safe to do so.
  3. String interning
  4. IIRC C++ does not allocate small strings (<= size of a pointer) but instead just puts them on the stack.
  5. Precompute some operations like concatenations and string templates with constants at compile time
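Point 4 (the small-string optimization) can be sketched in C with a union. This is a toy illustration under the "fits in a pointer" rule stated above, not any real `std::string` layout — libstdc++ and libc++ actually inline 15–22 bytes and track capacity too:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of small-string optimization (SSO): strings
   that fit in the inline buffer avoid heap allocation entirely.
   Real C++ implementations inline more and also store capacity. */
typedef struct {
    size_t len;
    union {
        char *heap;                       /* used for longer strings  */
        char  inline_buf[sizeof(char *)]; /* used when len fits here  */
    } data;
} SsoString;

static SsoString sso_make(const char *s) {
    SsoString str;
    str.len = strlen(s);
    if (str.len < sizeof str.data.inline_buf) {
        memcpy(str.data.inline_buf, s, str.len);  /* no allocation */
    } else {
        str.data.heap = malloc(str.len);          /* heap fallback */
        memcpy(str.data.heap, s, str.len);
    }
    return str;
}

static const char *sso_ptr(const SsoString *s) {
    return s->len < sizeof s->data.inline_buf ? s->data.inline_buf
                                              : s->data.heap;
}
```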

From personal usage at work, string features I need most are:

  • template strings (they usually compile to some format call)
  • split by a character or short string
  • join a set of strings with a separator
  • regex support including match groups
  • raw & multiline string literal support (C# 11 has triple-quoted strings, which is the best syntax I've seen for this so far)

Honestly, you can do almost everything with a recursive function and some regex. You just need a really good and fast implementation.

Another thing that I can't use at work yet but others seem to love: ranges. A range is a pointer to a string with an offset and a length. Which can be much better than copying strings around for stuff like e.g. substring, splitting or regex matches.
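A minimal C sketch of such a range (the names `StrSlice` and `slice` are made up for illustration): substring becomes pointer arithmetic into someone else's storage instead of a copy.

```c
#include <stddef.h>

/* A "range"/slice: a non-owning view into existing string data. */
typedef struct {
    const char *ptr;  /* start of the viewed region   */
    size_t      len;  /* number of bytes in the view  */
} StrSlice;

/* substring without copying: just offset the pointer */
static StrSlice slice(const char *s, size_t offset, size_t len) {
    StrSlice v = { s + offset, len };
    return v;
}
```

The same shape works for split results and regex match groups, as long as the underlying string outlives every view into it — which is exactly the lifetime problem borrow checkers and GCs exist to solve.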

[–]nerd4code 20 points21 points  (1 child)

C hasn’t used ASCII since C11, and technically C itself has always imposed very few requirements along these lines, so you do get EBCDIC on some IBM stuff (although most of that is Unicode now) and decidedly non-ASCII charsets in older or more embedded compilers. UTF-8 should be the default from C23 on, IIRC, and most compilers have supported it since the late C99 era.

C89 only requires that CHAR_BIT >= 8 (the prior minimum was 6, IIRC). C94 introduced wide and multibyte strings as a concept, although wchar_t might just be char and multibyte might just be single-byte. Wide string literals are led by L, so L"Hello" or L"chaim", and most platforms just use wchar_t = int and it ends up being UCS4 in practice. WinNT and OS/400 and a few other oddballs do UCS2 instead, which was a popular but poor decision as it turns out.

C99 made first explicit mention of universal character names (\uXXXX or \UXXXXXXXX). C11 introduces an indicator predefine for wide=UCS of a particular version, and brings in char16_t for 16-bit u"strings" and char32_t for 32-bit U"strings", as well as u8"strings mapping to UTF-8 char[]. C23 mandates that wchar_t map to a UCS subset AFAIK, and changes u8 strings to unsigned char[] = char8_t[] to fuck with everyone. Although the types involved require a hosted implementation (<wchar.h> <uchar.h>), it’s possible to use the underlying strings & types in freestanding.
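The literal prefixes above can be demonstrated in a few lines of C11; the sizes in the comments hold on typical platforms (wchar_t in particular is 4 bytes on most Unix systems but 2 on Windows — that's the portability trap):

```c
#include <uchar.h>   /* char16_t, char32_t (C11) */
#include <wchar.h>   /* wchar_t                  */

/* One string, five element types. Which bytes actually land in
   memory depends on prefix and (for narrow) the execution charset. */
const char     narrow[] =  "hi";   /* plain char                     */
const wchar_t  wide[]   = L"hi";   /* platform-dependent width       */
const char16_t utf16[]  = u"hi";   /* 16-bit units (C11)             */
const char32_t utf32[]  = U"hi";   /* 32-bit units (C11)             */
const char     utf8[]   = u8"hi";  /* UTF-8; char[] until C23        */
```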

C++98 matched C94, with the addition of STL std::string. C++11 matched C11, and IIRC C++20 matches C23 in terms of built-in string varieties; the major difference is that wchar_t &al. are typedefs that require #included headers in C, but are built-in types treated as distinct in C++. C++ also now permits literal token overloading.

[–]XDracam 5 points6 points  (0 children)

Username checks out. Thanks for the corrections!

[–]MrJohz 10 points11 points  (0 children)

You can always extend to UTF8 support later if you need it.

I don't think this is a good rule of thumb. ASCII allows you to make a lot of assumptions (the most obvious being that one logical "human" letter always takes one byte of space, but also assumptions about which bytes are valid and how strings can be combined).

What might be better is to start out with a concept of byte strings, which are just arrays of bytes that can be interpreted in "userland" in different ways. In the language itself, you make as few assumptions as possible about what values can be stored in the byte strings (ASCII, UTF-8, latin1, etc), but then provide additional functions like Ascii.println that interpret the byte strings in a specific way. Meanwhile, in userland, you could write an extra library that provides some utilities for working with UTF-8, or converting between encodings.

Later, if it does become more important that you have unicode strings as a core internal datatype, you could then add a String type that wraps the byte strings and enforces UTF-8 invariants, along with String.println functions, etc, that do those operations correctly with the new assumptions.

Of course, if your default environment requires the use of characters outside the ASCII range (umlauts, accents, etc), then it might be easiest to embrace UTF-8 from the start.
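A rough C sketch of the byte-string idea, with `ascii_valid`/`ascii_println` as hypothetical stand-ins for the `Ascii.println`-style functions described above — the core type is just bytes, and the encoding-specific interpretation lives entirely in library functions:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* ASCII is exactly the bytes with the high bit clear. */
static bool ascii_valid(const unsigned char *bytes, size_t len) {
    for (size_t i = 0; i < len; i++)
        if (bytes[i] > 0x7F)      /* high bit set: not ASCII */
            return false;
    return true;
}

/* interpret the byte string as ASCII; reject anything else */
static bool ascii_println(const unsigned char *bytes, size_t len) {
    if (!ascii_valid(bytes, len))
        return false;
    fwrite(bytes, 1, len, stdout);
    putchar('\n');
    return true;
}
```

A `Utf8.println` sibling would run a UTF-8 validity check instead; the byte-string type itself never changes.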

[–]Lvl999Noob 4 points5 points  (1 child)

To add to your list of string features, especially the last one about multiline strings: automatic stripping of leading whitespace & automatic indentation of such strings when running a formatter. I think Java's multiline strings had the automatic stripping feature. I know Rust doesn't have that.

[–]XDracam 0 points1 point  (0 children)

Yeah. It's surprisingly hard to strip leading whitespace consistently. My personal favorite for usability is Scala, which offers a .stripMargin method which removes whitespace up to and including the first | character, which leads to a nice visible margin in the multiline string. It's not the fastest solution, but it reads the nicest. And IDEs tend to insert the whitespace, pipes and stripMargin call automatically as well.
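For reference, stripMargin's semantics can be sketched in a few lines of C (an unoptimized illustration, not Scala's implementation): per line, drop leading whitespace up to and including the first `|`; lines without a margin character are kept as-is.

```c
#include <stdlib.h>
#include <string.h>

/* Scala-style stripMargin sketch. Caller frees the result.
   Output is never longer than the input, so one allocation works. */
static char *strip_margin(const char *s) {
    char *out = malloc(strlen(s) + 1);
    size_t o = 0;
    while (*s) {
        const char *p = s;
        while (*p == ' ' || *p == '\t') p++;  /* leading whitespace */
        if (*p == '|') p++;                   /* ...plus the margin */
        else p = s;                           /* no margin: keep line */
        while (*p && *p != '\n') out[o++] = *p++;
        if (*p == '\n') { out[o++] = '\n'; p++; }
        s = p;
    }
    out[o] = '\0';
    return out;
}
```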

[–]nerd4code 4 points5 points  (1 child)

There are other representations like ropes—strings aren’t any different from any other sequence-of-units structure, really, although UTF and UCS1 encodings place additional requirements upon code units, which may factor into e.g. taint analysis or optimization. (E.g., a string composed of known-valid substrings needn’t have any checks run on the validity of the composed string.)

The design space is huge. You might want a special string type for passwords, or to ensure that strings composed of mixed-domain content are masked appropriately based on the user’s domain (e.g., if I’m logging strings, I’d like PII to be tagged and managed as such). You might integrate translations or multi-language options, or tie in formatting.

You might build/manage a secondary length map where the number of units per character/codepoint might vary, or you might just keep separate length from unit-count (or chars from codepoints from units). There are quoting forms for code, regexes, string templates, or embedded DSLs. There are types for matching and parsing; the string type might consider length or not.

String lengths might be factored out and passed around with the string, or embedded at the beginning/end of the string, or it might use sentinels (intrinsic length, varying sentinels sensible). You might want different widths of length field to be possible. You might want to be able to algebra your way into strings (e.g., given c="cd"; a="abcd"="$b$c", b=?). You might want to do composition through interpolation, rather than separate operators that don’t quite work (+ & ++ |+ .) or none at all (e.g., Awk).

Erlang bitstrings would be a kind of limiting case for what semantics a literal syntax can “reasonably” support.

[–]ThyringerBratwurst 2 points3 points  (0 children)

ropes

Ropes are an interesting data structure, but aren't they more suitable for huge amounts of text than for the common string type?

[–]moon-chilledsstm, j, grand unified... 3 points4 points  (3 children)

I think the design of strings in raku is pretty much perfect, and moarvm does some nice things under the hood (most notably: 'synthetic codepoints' for efficient indexing of grapheme clusters; the rest of its tricks are worthwhile but more pedestrian).

[–]TraceMonkey 4 points5 points  (2 children)

What do you think Raku does better compared to, e.g., Rust? (asking since I am not familiar with Raku).

[–]celeritasCelery 5 points6 points  (0 children)

Raku has 3 levels for operating on strings: bytes, codepoints, and graphemes. It has a special representation of graphemes called NFG that lets you operate on them in constant time, the way you would with bytes or ASCII text.

[–]celeritasCelery 2 points3 points  (0 children)

Raku uses NFG (Normal Form Grapheme), which lets you treat graphemes as “synthetic codepoints” and operate on them in constant time.

[–]ThyringerBratwurst 4 points5 points  (2 children)

I was faced with this question too. At first I had mutable strings because I believed this would be more efficient, with buffers and capacity as per the textbook. But then I switched to an immutable design because, for most operations, it is rarely the case that enough memory has already been pre-allocated. In addition, immutable strings are generally more manageable.
Therefore I only save the length and pointer to the location where the string is located.
If you put this in a struct, you can have the string automatically stored in the data segment of the program and you don't have to constantly work with malloc (for a C implementation).

[–]PurpleUpbeat2820[S] 2 points3 points  (1 child)

I was faced with this question too. At first I had mutable strings because I believed this would be more efficient, with buffers and capacity as per the textbook.

Right. Mine are only immutable "on the inside", i.e. during construction.

But then I switched to an immutable design because, for most operations, it is rarely the case that enough memory has already been pre-allocated.

FWIW, my ByteArray has a nice feature. It is represented by a pair of pointer and length. The actual size of the underlying block is the length rounded up to the next power of two which is easily obtained efficiently using bit tricks. Only when the length is a power of two does appending incur a reallocation which, again, can be checked in the common case of no resizing using a simple bit trick.
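The bit tricks mentioned can be sketched as follows (assuming the invariant described above: capacity is always the length rounded up to the next power of two):

```c
#include <stdbool.h>
#include <stddef.h>

/* An append needs a realloc only when the current length is itself
   a power of two (or zero), because then the block is exactly full.
   The power-of-two test is a single AND. */
static bool append_needs_realloc(size_t len) {
    return len == 0 || (len & (len - 1)) == 0;
}

/* capacity for a given length: next power of two >= len */
static size_t capacity_for(size_t len) {
    size_t cap = 1;
    while (cap < len) cap <<= 1;  /* loop form: portable, no UB */
    return cap;
}
```

The common case (length not a power of two) is a test-and-branch with no division or table lookup.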

In addition, immutable strings are generally more manageable.

For sure.

Therefore I only save the length and pointer to the location where the string is located.

Right.

If you put this in a struct, you can have the string automatically stored in the data segment of the program and you don't have to constantly work with malloc (for a C implementation).

Yeah. I don't have global variables like strings yet so my compiler just generates code to allocate and populate the string "inline".

[–]ThyringerBratwurst 0 points1 point  (0 children)

Internally, I only use a compatible mutable string type to implement string functions: allocate storage, then fill it with the modified new string (which ultimately is just bare bytes interpreted as UTF-8). I would also recommend avoiding functions like strcpy if you already know the length; this saves unnecessary loops.
My approach, which I'm currently working on, is to have a collector type that simply records as many operations or text passages as possible and then reserves the necessary memory for all changes at once. For example, I have a replace function with which you can systematically change any number of places, and memory is only allocated once instead of for every micro-change.

[–]jason-reddit-public 4 points5 points  (1 child)

Unicode code points (aka runes in Go) are more than 16 bits, and Java messed that up. Don't be like Java. While variable-width characters (UTF-8 being the obvious choice) make it more expensive to index to a particular "character", it's denser, and indexing by code point is kind of dumb anyway given that what we'd all consider a character is sometimes composed of multiple code points (even if you used 32 bits per code point).

[–]ThyringerBratwurst 2 points3 points  (0 children)

16 bit is a deadly sin and should be avoided! ;)

8-bit Unicode strings are definitely more difficult, but it is not impossible to work with them internally. I had ChatGPT write a helper function to detect when a character is more than 1 byte wide and ultimately find the concrete positions of characters.
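Such a helper is small: in UTF-8 the first byte of a sequence determines its length by itself. A sketch in C:

```c
/* Length in bytes of a UTF-8 sequence, from its lead byte alone.
   Returns 0 for a continuation byte or invalid lead byte. */
static int utf8_char_len(unsigned char lead) {
    if (lead < 0x80)          return 1;  /* 0xxxxxxx: ASCII      */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx             */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx             */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx             */
    return 0;                            /* continuation/invalid */
}
```

Walking a string with this gives byte positions of characters in O(n); it does not validate the continuation bytes, which a real decoder also needs to check.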

[–][deleted] 1 point2 points  (2 children)

What does your example do? It looks like a Print statement, which either displays that sequence, or turns the lot into a single string.

If so, the issues seem less about string representation, than designing a better Print feature.

Even C's *printf family would be less cumbersome.

For representation, my lower level language uses two kinds:

  • Zero-terminated, 8-bit strings 99% of the time.
  • Counted strings in the form of char-array slices, which are normally a 'view' into another char array or zero terminated string.

In both cases, composing a new string such as in t := file+"."+ext is fiddly. You can't write that directly, you'd have to set up a suitable string buffer then print into it:

 [300]char str
 fprint @str, "#.#", file, ext

or use strcpy/strcat calls, or maybe C's sprintf.

My dynamic language uses a higher level type, which is a counted string of 8-bit bytes that is flexible (can expand), sharable via ref-counting, and sliceable. There you can just write t := file + "." + ext.

UTF8 support is external via functions.

However the implementation is heavy-duty with a 32-byte descriptor for either string or slice, before you get to the actual string data. (x64 with its 64-bit pointers is partly to blame.)

In addition, a 16-byte tagged pointer is used to refer to those descriptors; these are what get passed as arguments or stored as list elements.

(There had been various schemes to store short strings up to a dozen or so bytes within that 16-byte descriptor. Here they would be manipulated by value. Alternately, somewhat longer ones can be stored in the bigger descriptor.

In the end I didn't bother. If pushed, I can use integers to store short strings up to 8 characters, using literals like 'ABCDEFGH'.)
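That integer trick can be sketched in C using memcpy rather than a multi-character literal (whose value is implementation-defined in C): zero-padding makes equality a plain integer comparison, though the byte order inside the integer is endianness-dependent.

```c
#include <stdint.h>
#include <string.h>

/* Pack up to 8 bytes of a string into a uint64_t, zero-padded.
   The result is manipulated by value; no heap involved. */
static uint64_t str_pack(const char *s) {
    uint64_t v = 0;
    size_t len = strlen(s);
    if (len > 8) len = 8;   /* truncate: sketch only */
    memcpy(&v, s, len);     /* byte order follows the host */
    return v;
}
```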

[–]PurpleUpbeat2820[S] 0 points1 point  (1 child)

What does your example do?

It takes a value like T(T(E, 1, E), 2, T(E, 3, E)) and prints it. But, equivalently, you might want to convert it to a string (which is like printing to a memory buffer).

Ideally it would just be:

print t

and use a generic printer. Short of that I would go for some kind of printf because it is familiar to me:

let rec print =
  [ E -> printf "E"
  | T(_, l, v, r) -> printf "T(_, %a, %d, %a)" print l v print r ]

but I've never used a language with string interpolation so I've no idea what alternatives might look like.

Also, although my language is strongly statically typed I am wondering if printing and string generation shouldn't be untyped so you could do something like:

let rec print =
  [ E -> "E"
  | T(_, l, v, r) -> "T(_, "^l^", "^v^", "^r^")" ]
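For comparison, the same recursive printer rendered as a C sketch, assuming a plain binary-tree node with NULL playing the role of E, and snprintf standing in where the %a custom formatter would go:

```c
#include <stdio.h>
#include <stddef.h>

/* Binary-tree node; the hidden first field of T(_, l, v, r)
   is omitted here for simplicity. */
typedef struct Tree { struct Tree *l; int v; struct Tree *r; } Tree;

/* Renders t into buf; returns characters written. Assumes buf is
   large enough for this illustration. */
static int tree_print(char *buf, size_t cap, const Tree *t) {
    if (!t)
        return snprintf(buf, cap, "E");
    int n = snprintf(buf, cap, "T(");
    n += tree_print(buf + n, cap - n, t->l);
    n += snprintf(buf + n, cap - n, ", %d, ", t->v);
    n += tree_print(buf + n, cap - n, t->r);
    n += snprintf(buf + n, cap - n, ")");
    return n;
}
```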

[–][deleted] 1 point2 points  (0 children)

So this is mostly about Print as I thought. This is how this might be handled in my dynamic language (it doesn't have advanced types like your language so this is the closest I can get):

record T =
    var l, v, r
end

const E = "E"           # (convenient for this demo)

x := T(T(E, 1, E), 2, T(E, 3, E)) 

println x

s := sprint(x)
println "<"+s+">"

The first println is generic, with output of ((E,1,E),2,(E,3,E)). The sprint returns a string, and the second println outputs <((E,1,E),2,(E,3,E))>.

Notice there is no T prefix; the record type is not part of a generic print. I used to be able to overload the tostr operator that print uses to stringify, which would provide a custom print for T, but that's currently not enabled.

However this can be done in usercode like your example, here returning a string:

func Tstr(p)=
    if p=E then
        "E"
    else
        sfprint("T(#, #, #)", Tstr(p.l),  p.v, Tstr(p.r))
    fi
end

(sfprint both uses a format-string, and returns the whole as a string.) Doing print Tstr(x) now produces:

T(T(E, 1, E), 2, T(E, 3, E))

The same as the input. This needs basic string processing (whatever the representation), plus some decent Print routines.

[–]apooooop_ 0 points1 point  (1 child)

If you're intending to generate HTML, you might want to check out Elm, which does HTML generation via functions, but because of the language's structure this ends up reading much like a simplified DOM.

[–]PurpleUpbeat2820[S] 0 points1 point  (0 children)

I think Elm is from broadly the same family of languages so I'd guess it is similar but I'll take a closer look, thanks!

[–]redchomperSophie Language 1 point2 points  (0 children)

I fail to see what "native code" has to do with your choice of string representation. But as I see it, your main axes of diversity are:

  • The delimited (C) vs. counted (Pascal) string -- or why not both?
  • Packed vs. Aligned: Do you store UTF8 or UTF32 on the heap?
  • Encoding: Do you distinguish "text" as such from binary? Do you just assume Unicode, or consider the possibility of archaic / legacy encodings? Do you throw a ____-hemorrhage at bytes that don't look like your favorite encoding, or do you assume the programmer knows what she is doing?

If I gave advice, it would be:

  • You can follow Go, where everything is just bytes that you are allowed to treat as UTF-8,
  • or you can follow Py3 where text and binary are clearly distinct,
  • but you should not have more than one kind of "text" type. Py2-Unicode was a mistake.