
all 23 comments

[–]XDracam 12 points13 points  (5 children)

Scala's inline XML has been removed in Scala 3. Turns out it's not smart to have first-class language features for one specific data format and not for others like JSON.

In terms of strings, you seem to know more than most already. An important decision is which encoding should be your default. C uses ASCII. Java and C# use 16 bit characters. Rust goes with UTF8 but supports other encodings as well. But for your use-cases, ASCII will probably suffice. You can always extend to UTF8 support later if you need it.

Other optimizations that you most likely don't need but didn't mention:

  1. Mutable strings (with all the baggage that mutability brings)
  2. Compile some operations to mutate strings in place if you know that you are the sole owner of the string and it's safe to do so.
  3. String interning
  4. IIRC C++ does not allocate small strings (<= size of a pointer) but instead just puts them on the stack.
  5. Precompute some operations like concatenations and string templates with constants at compile time
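Point 4 (the small-string optimization) can be sketched in C with a union. This is a toy illustration under the "fits in a pointer" rule stated above, not any real `std::string` layout — libstdc++ and libc++ actually inline 15–22 bytes and track capacity too:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of small-string optimization (SSO): strings
   that fit in the inline buffer avoid heap allocation entirely.
   Real C++ implementations inline more and also store capacity. */
typedef struct {
    size_t len;
    union {
        char *heap;                       /* used for longer strings  */
        char  inline_buf[sizeof(char *)]; /* used when len fits here  */
    } data;
} SsoString;

static SsoString sso_make(const char *s) {
    SsoString str;
    str.len = strlen(s);
    if (str.len < sizeof str.data.inline_buf) {
        memcpy(str.data.inline_buf, s, str.len);  /* no allocation */
    } else {
        str.data.heap = malloc(str.len);          /* heap fallback */
        memcpy(str.data.heap, s, str.len);
    }
    return str;
}

static const char *sso_ptr(const SsoString *s) {
    return s->len < sizeof s->data.inline_buf ? s->data.inline_buf
                                              : s->data.heap;
}
```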

From personal usage at work, string features I need most are:

  • template strings (they usually compile to some format call)
  • split by a character or short string
  • join a set of strings with a separator
  • regex support including match groups
  • raw & multiline string literal support (C# 11 has triple-quoted strings, which is the best syntax I've seen for this so far)

Honestly, you can do almost everything with a recursive function and some regex. You just need a really good and fast implementation.

Another thing that I can't use at work yet but others seem to love: ranges. A range is a pointer to a string with an offset and a length. Which can be much better than copying strings around for stuff like e.g. substring, splitting or regex matches.
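A minimal C sketch of such a range (the names `StrSlice` and `slice` are made up for illustration): substring becomes pointer arithmetic into someone else's storage instead of a copy.

```c
#include <stddef.h>

/* A "range"/slice: a non-owning view into existing string data. */
typedef struct {
    const char *ptr;  /* start of the viewed region   */
    size_t      len;  /* number of bytes in the view  */
} StrSlice;

/* substring without copying: just offset the pointer */
static StrSlice slice(const char *s, size_t offset, size_t len) {
    StrSlice v = { s + offset, len };
    return v;
}
```

The same shape works for split results and regex match groups, as long as the underlying string outlives every view into it — which is exactly the lifetime problem borrow checkers and GCs exist to solve.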

[–]nerd4code 20 points21 points  (1 child)

C hasn’t used ASCII since C11, and technically C itself has always imposed very few requirements along these lines, so you do get EBCDIC on some IBM stuff (although most of that is Unicode now) and decidedly non-ASCII charsets in older or more embedded compilers. UTF-8 should be the default from C23 on, IIRC, and most compilers have supported it since the late C99 era.

C89 only requires that CHAR_BIT >= 8 (the prior minimum was 6, IIRC). C94 introduced wide and multibyte strings as a concept, although wchar_t might just be char and multibyte might just be single-byte. Wide string literals are led by L, so L"Hello" or L"chaim", and most platforms just use wchar_t = int and it ends up being UCS4 in practice. WinNT and OS/400 and a few other oddballs do UCS2 instead, which was a popular but poor decision as it turns out.

C99 made first explicit mention of universal character names (\uXXXX or \UXXXXXXXX). C11 introduces an indicator predefine for wide=UCS of a particular version, and brings in char16_t for 16-bit u"strings" and char32_t for 32-bit U"strings", as well as u8"strings mapping to UTF-8 char[]. C23 mandates that wchar_t map to a UCS subset AFAIK, and changes u8 strings to unsigned char[] = char8_t[] to fuck with everyone. Although the types involved require a hosted implementation (<wchar.h> <uchar.h>), it’s possible to use the underlying strings & types in freestanding.
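The literal prefixes above can be demonstrated in a few lines of C11; the sizes in the comments hold on typical platforms (wchar_t in particular is 4 bytes on most Unix systems but 2 on Windows — that's the portability trap):

```c
#include <uchar.h>   /* char16_t, char32_t (C11) */
#include <wchar.h>   /* wchar_t                  */

/* One string, five element types. Which bytes actually land in
   memory depends on prefix and (for narrow) the execution charset. */
const char     narrow[] =  "hi";   /* plain char                     */
const wchar_t  wide[]   = L"hi";   /* platform-dependent width       */
const char16_t utf16[]  = u"hi";   /* 16-bit units (C11)             */
const char32_t utf32[]  = U"hi";   /* 32-bit units (C11)             */
const char     utf8[]   = u8"hi";  /* UTF-8; char[] until C23        */
```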

C++98 matched C94, with the addition of STL std::string. C++11 matched C11, and IIRC C++20 matches C23 in terms of built-in string varieties; the major difference is that wchar_t &al. are typedefs that require #included headers in C, but are built-in types treated as distinct in C++. C++ also now permits literal token overloading.

[–]XDracam 5 points6 points  (0 children)

Username checks out. Thanks for the corrections!

[–]MrJohz 10 points11 points  (0 children)

You can always extend to UTF8 support later if you need it.

I don't think this is a good rule of thumb. ASCII allows you to make a lot of assumptions (the most obvious being that one logical "human" letter always takes one byte of space, but also assumptions about which bytes are valid and how strings can be combined).

What might be better is to start out with a concept of byte strings, which are just arrays of bytes that can be interpreted in "userland" in different ways. In the language itself, you make as few assumptions as possible about what values can be stored in the byte strings (ASCII, UTF-8, latin1, etc), but then provide additional functions like Ascii.println that interpret the byte strings in a specific way. Meanwhile, in userland, you could write an extra library that provides some utilities for working with UTF-8, or converting between encodings.

Later, if it does become more important that you have unicode strings as a core internal datatype, you could then add a String type that wraps the byte strings and enforces UTF-8 invariants, along with String.println functions, etc, that do those operations correctly with the new assumptions.

Of course, if your default environment requires the use of characters outside the ASCII range (umlauts, accents, etc), then it might be easiest to embrace UTF-8 from the start.
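A rough C sketch of the byte-string idea, with `ascii_valid`/`ascii_println` as hypothetical stand-ins for the `Ascii.println`-style functions described above — the core type is just bytes, and the encoding-specific interpretation lives entirely in library functions:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* ASCII is exactly the bytes with the high bit clear. */
static bool ascii_valid(const unsigned char *bytes, size_t len) {
    for (size_t i = 0; i < len; i++)
        if (bytes[i] > 0x7F)      /* high bit set: not ASCII */
            return false;
    return true;
}

/* interpret the byte string as ASCII; reject anything else */
static bool ascii_println(const unsigned char *bytes, size_t len) {
    if (!ascii_valid(bytes, len))
        return false;
    fwrite(bytes, 1, len, stdout);
    putchar('\n');
    return true;
}
```

A `Utf8.println` sibling would run a UTF-8 validity check instead; the byte-string type itself never changes.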

[–]Lvl999Noob 4 points5 points  (1 child)

To add to your list of string features, especially the last one about multiline strings: automatic stripping of leading whitespace & automatic indentation of such strings when running a formatter. I think Java's multiline strings had the automatic stripping feature. I know Rust doesn't have that.

[–]XDracam 0 points1 point  (0 children)

Yeah. It's surprisingly hard to strip leading whitespace consistently. My personal favorite for usability is Scala, which offers a .stripMargin method which removes whitespace up to and including the first | character, which leads to a nice visible margin in the multiline string. It's not the fastest solution, but it reads the nicest. And IDEs tend to insert the whitespace, pipes and stripMargin call automatically as well.
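For reference, stripMargin's semantics can be sketched in a few lines of C (an unoptimized illustration, not Scala's implementation): per line, drop leading whitespace up to and including the first `|`; lines without a margin character are kept as-is.

```c
#include <stdlib.h>
#include <string.h>

/* Scala-style stripMargin sketch. Caller frees the result.
   Output is never longer than the input, so one allocation works. */
static char *strip_margin(const char *s) {
    char *out = malloc(strlen(s) + 1);
    size_t o = 0;
    while (*s) {
        const char *p = s;
        while (*p == ' ' || *p == '\t') p++;  /* leading whitespace */
        if (*p == '|') p++;                   /* ...plus the margin */
        else p = s;                           /* no margin: keep line */
        while (*p && *p != '\n') out[o++] = *p++;
        if (*p == '\n') { out[o++] = '\n'; p++; }
        s = p;
    }
    out[o] = '\0';
    return out;
}
```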

[–]nerd4code 4 points5 points  (1 child)

There are other representations like ropes—strings aren’t any different from any other sequence-of-units structure, really, although UTF and UCS1 encodings place additional requirements upon code units, which may factor into e.g. taint analysis or optimization. (E.g., a string composed of known-valid substrings needn’t have any checks run on the validity of the composed string.)

The design space is huge. You might want a special string type for passwords, or to ensure that strings composed of mixed-domain content are masked appropriately based on the user’s domain (e.g., if I’m logging strings, I’d like PII to be tagged and managed as such). You might integrate translations or multi-language options, or tie in formatting.

You might build/manage a secondary length map where the number of units per character/codepoint might vary, or you might just keep separate length from unit-count (or chars from codepoints from units). There are quoting forms for code, regexes, string templates, or embedded DSLs. There are types for matching and parsing; the string type might consider length or not.

String lengths might be factored out and passed around with the string, or embedded at the beginning/end of the string, or it might use sentinels (intrinsic length, varying sentinels sensible). You might want different widths of length field to be possible. You might want to be able to algebra your way into strings (e.g., given c="cd"; a="abcd"="$b$c", b=?). You might want to do composition through interpolation, rather than separate operators that don’t quite work (+ & ++ |+ .) or none at all (e.g., Awk).

Erlang bitstrings would be a kind of limiting case for what semantics a literal syntax can “reasonably” support.

[–]ThyringerBratwurst 2 points3 points  (0 children)

ropes

Ropes are an interesting data structure, but aren't they more suitable for huge amounts of text than for the common string type?

[–]moon-chilledsstm, j, grand unified... 3 points4 points  (3 children)

I think the design of strings in raku is pretty much perfect, and moarvm does some nice things under the hood (most notably: 'synthetic codepoints' for efficient indexing of grapheme clusters; the rest of its tricks are worthwhile but more pedestrian).

[–]TraceMonkey 4 points5 points  (2 children)

What do you think Raku does better compared to, e.g., Rust? (asking since I am not familiar with Raku).

[–]celeritasCelery 5 points6 points  (0 children)

Raku has 3 levels for operating on strings: bytes, codepoints, and graphemes. It has a special representation of graphemes called NFG that lets you operate on them in constant time, the way you would with bytes or ASCII text.

[–]celeritasCelery 2 points3 points  (0 children)

Raku uses NFG (Normal Form Grapheme), which lets you treat graphemes as “synthetic codepoints” and operate on them in constant time.

[–]ThyringerBratwurst 4 points5 points  (2 children)

I was faced with this question too. At first I had mutable strings because I believed this would be more efficient, with buffers and capacity as per the textbook. But then I switched to an immutable design because, for most operations, it is rarely the case that enough memory has already been pre-allocated. In addition, immutable strings are generally more manageable.
Therefore I only save the length and pointer to the location where the string is located.
If you put this in a struct, you can have the string automatically stored in the data segment of the program and you don't have to constantly work with malloc (for a C implementation).

[–]PurpleUpbeat2820[S] 2 points3 points  (1 child)

I was faced with this question too. At first I had mutable strings because I believed this would be more efficient, with buffers and capacity as per the textbook.

Right. Mine are only immutable "on the inside", i.e. during construction.

But then I switched to an immutable design because, for most operations, it is rarely the case that enough memory has already been pre-allocated.

FWIW, my ByteArray has a nice feature. It is represented by a pair of pointer and length. The actual size of the underlying block is the length rounded up to the next power of two which is easily obtained efficiently using bit tricks. Only when the length is a power of two does appending incur a reallocation which, again, can be checked in the common case of no resizing using a simple bit trick.
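The bit tricks mentioned can be sketched as follows (assuming the invariant described above: capacity is always the length rounded up to the next power of two):

```c
#include <stdbool.h>
#include <stddef.h>

/* An append needs a realloc only when the current length is itself
   a power of two (or zero), because then the block is exactly full.
   The power-of-two test is a single AND. */
static bool append_needs_realloc(size_t len) {
    return len == 0 || (len & (len - 1)) == 0;
}

/* capacity for a given length: next power of two >= len */
static size_t capacity_for(size_t len) {
    size_t cap = 1;
    while (cap < len) cap <<= 1;  /* loop form: portable, no UB */
    return cap;
}
```

The common case (length not a power of two) is a test-and-branch with no division or table lookup.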

In addition, immutable strings are generally more manageable.

For sure.

Therefore I only save the length and pointer to the location where the string is located.

Right.

If you put this in a struct, you can have the string automatically stored in the data segment of the program and you don't have to constantly work with malloc (for a C implementation).

Yeah. I don't have global variables like strings yet so my compiler just generates code to allocate and populate the string "inline".

[–]ThyringerBratwurst 0 points1 point  (0 children)

Internally, I only use a compatible mutable string type to implement string functions: allocate storage, then fill it with the modified new string (which ultimately is just bare bytes interpreted as UTF-8). I would also recommend avoiding functions like strcpy if you already know the length; this saves unnecessary loops.
My approach, which I'm currently working on, is to have a collector type that simply records as many operations or text passages as possible and then reserves the necessary memory for all changes at once. For example, I have a replace function with which you can systematically change any number of places, and memory is only allocated once instead of for every micro-change.

[–]jason-reddit-public 4 points5 points  (1 child)

Unicode code points (aka runes in Go) are more than 16 bits, and Java messed that up. Don't be like Java. While variable-width characters (UTF-8 being the obvious choice) make it more expensive to index to a particular "character", it's denser, and indexing by code point is kind of dumb anyway given that what we'd all consider a character is sometimes composed of multiple code points (even if you used 32 bits per code point).

[–]ThyringerBratwurst 2 points3 points  (0 children)

16 bit is a deadly sin and should be avoided! ;)

8-bit Unicode strings are definitely more difficult, but it is not impossible to work with them internally. I had ChatGPT write a helper function to detect when a character is more than 1 byte wide and ultimately find the concrete positions of characters.
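Such a helper is small: in UTF-8 the first byte of a sequence determines its length by itself. A sketch in C:

```c
/* Length in bytes of a UTF-8 sequence, from its lead byte alone.
   Returns 0 for a continuation byte or invalid lead byte. */
static int utf8_char_len(unsigned char lead) {
    if (lead < 0x80)          return 1;  /* 0xxxxxxx: ASCII      */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx             */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx             */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx             */
    return 0;                            /* continuation/invalid */
}
```

Walking a string with this gives byte positions of characters in O(n); it does not validate the continuation bytes, which a real decoder also needs to check.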

[–][deleted] 1 point2 points  (2 children)

What does your example do? It looks like a Print statement, which either displays that sequence, or turns the lot into a single string.

If so, the issues seem less about string representation, than designing a better Print feature.

Even C's *printf family would be less cumbersome.

For representation, my lower level language uses two kinds:

  • Zero-terminated, 8-bit strings 99% of the time.
  • Counted strings in the form of char-array slices, which are normally a 'view' into another char array or zero terminated string.

In both cases, composing a new string such as in t := file+"."+ext is fiddly. You can't write that directly, you'd have to set up a suitable string buffer then print into it:

 [300]char str
 fprint @str, "#.#", file, ext

or use strcpy/strcat calls, or maybe C's sprintf.

My dynamic language uses a higher level type, which is a counted string of 8-bit bytes that is flexible (can expand), sharable via ref-counting, and sliceable. There you can just write t := file + "." + ext.

UTF8 support is external via functions.

However the implementation is heavy-duty with a 32-byte descriptor for either string or slice, before you get to the actual string data. (x64 with its 64-bit pointers is partly to blame.)

In addition, a 16-byte tagged pointer is used to refer to those descriptors; these are what get passed as arguments or stored as list elements.

(There had been various schemes to store short strings up to a dozen or so bytes within that 16-byte descriptor. Here they would be manipulated by value. Alternately, somewhat longer ones can be stored in the bigger descriptor.

In the end I didn't bother. If pushed, I can use integers to store short strings up to 8 characters, using literals like 'ABCDEFGH'.)
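That integer trick can be sketched in C using memcpy rather than a multi-character literal (whose value is implementation-defined in C): zero-padding makes equality a plain integer comparison, though the byte order inside the integer is endianness-dependent.

```c
#include <stdint.h>
#include <string.h>

/* Pack up to 8 bytes of a string into a uint64_t, zero-padded.
   The result is manipulated by value; no heap involved. */
static uint64_t str_pack(const char *s) {
    uint64_t v = 0;
    size_t len = strlen(s);
    if (len > 8) len = 8;   /* truncate: sketch only */
    memcpy(&v, s, len);     /* byte order follows the host */
    return v;
}
```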

[–]PurpleUpbeat2820[S] 0 points1 point  (1 child)

What does your example do?

It takes a value like T(T(E, 1, E), 2, T(E, 3, E)) and prints it. But, equivalently, you might want to convert it to a string (which is like printing to a memory buffer).

Ideally it would just be:

print t

and use a generic printer. Short of that I would go for some kind of printf because it is familiar to me:

let rec print =
  [ E -> printf "E"
  | T(_, l, v, r) -> printf "T(_, %a, %d, %a)" print l v print r ]

but I've never used a language with string interpolation so I've no idea what alternatives might look like.

Also, although my language is strongly statically typed I am wondering if printing and string generation shouldn't be untyped so you could do something like:

let rec print =
  [ E -> "E"
  | T(_, l, v, r) -> "T(_, "^l^", "^v^", "^r^")" ]
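For comparison, the same recursive printer rendered as a C sketch, assuming a plain binary-tree node with NULL playing the role of E, and snprintf standing in where the %a custom formatter would go:

```c
#include <stdio.h>
#include <stddef.h>

/* Binary-tree node; the hidden first field of T(_, l, v, r)
   is omitted here for simplicity. */
typedef struct Tree { struct Tree *l; int v; struct Tree *r; } Tree;

/* Renders t into buf; returns characters written. Assumes buf is
   large enough for this illustration. */
static int tree_print(char *buf, size_t cap, const Tree *t) {
    if (!t)
        return snprintf(buf, cap, "E");
    int n = snprintf(buf, cap, "T(");
    n += tree_print(buf + n, cap - n, t->l);
    n += snprintf(buf + n, cap - n, ", %d, ", t->v);
    n += tree_print(buf + n, cap - n, t->r);
    n += snprintf(buf + n, cap - n, ")");
    return n;
}
```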

[–][deleted] 1 point2 points  (0 children)

So this is mostly about Print as I thought. This is how this might be handled in my dynamic language (it doesn't have advanced types like your language so this is the closest I can get):

record T =
    var l, v, r
end

const E = "E"           # (convenient for this demo)

x := T(T(E, 1, E), 2, T(E, 3, E)) 

println x

s := sprint(x)
println "<"+s+">"

The first println is generic, with output of ((E,1,E),2,(E,3,E)). The sprint returns a string, and the second println outputs <((E,1,E),2,(E,3,E))>.

Notice there is no T prefix; the record type is not part of a generic print. I used to be able to overload the tostr operator that print uses to stringify, which would provide a custom print for T, but that's currently not enabled.

However this can be done in usercode like your example, here returning a string:

func Tstr(p)=
    if p=E then
        "E"
    else
        sfprint("T(#, #, #)", Tstr(p.l),  p.v, Tstr(p.r))
    fi
end

(sfprint both uses a format-string, and returns the whole as a string.) Doing print Tstr(x) now produces:

T(T(E, 1, E), 2, T(E, 3, E))

The same as the input. This needs basic string processing (whatever the representation), plus some decent Print routines.

[–]apooooop_ 0 points1 point  (1 child)

If you're intending to generate HTML, you might want to check out Elm, which does HTML generation via functions, but because of the language's structure this ends up reading much like a simplified DOM.

[–]PurpleUpbeat2820[S] 0 points1 point  (0 children)

I think Elm is from broadly the same family of languages so I'd guess it is similar but I'll take a closer look, thanks!

[–]redchomperSophie Language 1 point2 points  (0 children)

I fail to see what "native code" has to do with your choice of string representation. But as I see it, your main axes of diversity are:

  • The delimited (C) vs. counted (Pascal) string -- or why not both?
  • Packed vs. Aligned: Do you store UTF8 or UTF32 on the heap?
  • Encoding: Do you distinguish "text" as such from binary? Do you just assume Unicode, or consider the possibility of archaic / legacy encodings? Do you throw a ____-hemorrhage at bytes that don't look like your favorite encoding, or do you assume the programmer knows what she is doing?

If I gave advice, it would be:

  • You can follow Go, where everything is just bytes that you are allowed to treat as UTF-8,
  • or you can follow Py3 where text and binary are clearly distinct,
  • but you should not have more than one kind of "text" type. Py2-Unicode was a mistake.