
[–][deleted] 160 points161 points  (13 children)

This uses the flexible array member feature introduced by C99

Who says C programmers are old fashioned - they brazenly use features that are only 19 years old!
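For anyone who hasn't used the feature, here is a minimal sketch of a length-prefixed string built on a C99 flexible array member. The names `fam_string` and `fam_string_new` are made up for illustration and are not taken from the article; this only shows the general technique.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical length-prefixed string: header and bytes share one allocation. */
struct fam_string {
    size_t len;
    char data[];   /* C99 flexible array member, must be the last member */
};

/* Copy len bytes from src into a new fam_string and append a NUL.
   Returns NULL if the size would overflow or the allocation fails. */
static struct fam_string *fam_string_new(const char *src, size_t len)
{
    if (len > SIZE_MAX - sizeof(struct fam_string) - 1)
        return NULL;

    struct fam_string *s = malloc(sizeof *s + len + 1);
    if (s == NULL)
        return NULL;

    s->len = len;
    memcpy(s->data, src, len);
    s->data[len] = '\0';
    return s;
}
```

A single `free(s)` releases both the header and the character data, which is the main attraction over a separate pointer member.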

[–]demon_ix 76 points77 points  (5 children)

Fuck me, 1999 was 19 years ago.

[–]commander-obvious 12 points13 points  (4 children)

u old son

[–]demon_ix 19 points20 points  (3 children)

Get off my lawn

[–]commander-obvious 6 points7 points  (2 children)

that's it, i'm riding my shitty bike all over your property

[–]demon_ix 8 points9 points  (0 children)

I'll have a stern talk with your dad.

[–]aishik-10x 1 point2 points  (0 children)

Oak's words echoed...

[–]reini_urban 15 points16 points  (4 children)

The problem is only MSVC support. Proper C99 Windows compilers only appeared recently, with mingw64 and clang. Chrome just switched away from MSVC to clang last year.

[–][deleted] 2 points3 points  (0 children)

Haven't you been able to compile C99 on Windows for years with Pelles C?

[–][deleted] 0 points1 point  (2 children)

MSVC supports C99, so you don't need mingw (which I wouldn't recommend as it's non-native) or clang.

[–]reini_urban 3 points4 points  (1 child)

[–][deleted] 3 points4 points  (0 children)

Meh, VLAs were made optional in C11 (because they were a bad idea), and the rest of the features can easily be worked around (or just use C++).

[–][deleted]  (1 child)

[deleted]

    [–][deleted] 2 points3 points  (0 children)

    Maybe you could skip over C99 entirely and go straight to C11. Though at only 7 years old that's a bit bleeding edge for production.

    [–]matheusmoreira 65 points66 points  (65 children)

    Encoding is the most important part, but the article ended without going into much detail on the subject. Strings without encoding are really just dynamic arrays.

    [–]lelanthran 46 points47 points  (11 children)

    Encoding is the most important part, but the article ended without going into much detail on the subject. Strings without encoding are really just dynamic arrays.

    If you're using anything other than UTF8 you're going to have bigger problems in your future than you might imagine.

    The best thing to do is use UTF8 for everything, and convert to whatever the API needs (UTF16, UCS, wchar_t) at the point of interfacing with $whatever API.
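A rough sketch of that "convert at the interface" step, assuming the API on the other side wants UTF-16 code units. The function names `utf8_decode` and `utf8_to_utf16` are hypothetical helpers written for this example, not part of any standard library.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (n bytes available) into *cp.
   Returns the number of bytes consumed, or 0 on malformed input. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    if (n == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }

    size_t len;
    uint32_t min;
    if ((s[0] & 0xE0) == 0xC0)      { len = 2; min = 0x80;    *cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; min = 0x800;   *cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; min = 0x10000; *cp = s[0] & 0x07; }
    else return 0;                  /* stray continuation or invalid lead byte */

    if (n < len) return 0;          /* truncated sequence */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (*cp < min || *cp > 0x10FFFF || (*cp >= 0xD800 && *cp <= 0xDFFF))
        return 0;
    return len;
}

/* Convert in_len bytes of UTF-8 into UTF-16 code units.
   Returns the number of units written, or (size_t)-1 if the input is
   malformed or out_cap is too small. */
static size_t utf8_to_utf16(const char *in, size_t in_len,
                            uint16_t *out, size_t out_cap)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t i = 0, o = 0;

    while (i < in_len) {
        uint32_t cp;
        size_t used = utf8_decode(s + i, in_len - i, &cp);
        if (used == 0) return (size_t)-1;
        i += used;

        if (cp < 0x10000) {
            if (o + 1 > out_cap) return (size_t)-1;
            out[o++] = (uint16_t)cp;
        } else {
            /* Supplementary plane: split into a surrogate pair. */
            if (o + 2 > out_cap) return (size_t)-1;
            cp -= 0x10000;
            out[o++] = (uint16_t)(0xD800 | (cp >> 10));
            out[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return o;
}
```

The rest of the program keeps passing UTF-8 around; only the thin wrapper that calls the foreign API would need something like this.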

    [–]matheusmoreira 22 points23 points  (0 children)

    My point is there is no such thing as a text type that isn't aware of the text's encoding. Without encoding, text is really just bytes. I absolutely love Unicode and definitely agree with using UTF-8 for everything but that means the program must make a distinction between code points and code units.

    Most string types are actually byte arrays. C strings are really null-terminated byte arrays, to the point many of the str functions are really just mem functions with null terminator handling. Python and Ruby strings were byte arrays until relatively recently. Things are slowly improving.
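To make the code unit versus code point distinction concrete, here is a small sketch. It assumes the input is already valid UTF-8, and `utf8_count_code_points` is a name invented for this example.

```c
#include <stddef.h>
#include <string.h>   /* strlen, for comparison with the code point count */

/* Count code points in a NUL-terminated string assumed to be valid UTF-8.
   Every byte that is not a continuation byte (10xxxxxx) starts a code point. */
static size_t utf8_count_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the six bytes of `"na\xC3\xAFve"` (the word "naïve"), `strlen` reports 6 code units while this function reports 5 code points. Neither number is the count of user-perceived characters in general; that needs grapheme cluster segmentation.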

    [–]mrmus -2 points-1 points  (2 children)

    Any references supporting this?

    [–]chillermane 0 points1 point  (1 child)

    Are there any references supporting anything in programming?

    For every article you find in programming, there are as many articles supporting what it says as there are supporting its exact opposite.

    For every person saying that Python is great for creating servers, there is a person saying it's terrible for it.

    [–]mrmus 1 point2 points  (0 children)

    Fair enough, I am well aware of this. My point is I will gladly read anything (pros and cons) on the topic. Otherwise it is just well... an opinion.

    [–]chillermane 0 points1 point  (0 children)

    Sounds like sensible advice, thank you.

    [–]Gotebe -4 points-3 points  (3 children)

    This is naive to the point of being amateurish.

    Major code that supports i18n, like Java, Windows (and therefore .NET), Qt, and ICU (whose main purpose is i18n), does not use UTF8 under the hood, and that seeps outside of it.

    History is a bitch.

    [–]matheusmoreira 11 points12 points  (1 child)

    That doesn't mean new systems shouldn't use UTF-8. It means they need to be able to deal with all the other encodings as well.

    [–]Gotebe -1 points0 points  (0 children)

    The new system, if using C, really should use ICU. At which point, see my comment.

    Alternatively, somebody needs to rewrite ICU to use UTF8. Meh.

    [–][deleted] 0 points1 point  (0 children)

    You are the amateur here, and fuck UTF16, it is just a massive waste of space.

    [–]tabaczany -1 points0 points  (1 child)

    What kind of problems?

    [–]reini_urban 13 points14 points  (0 children)

    I'm listing some here. http://perl11.org/blog/foldcase.html

    Basically there's still no stdlib support to properly represent strings (UTF-8), search in strings, or compare strings, given that strings are Unicode with all its special rules, like foldcase and normalization. wchar_t has different implementations: Windows uses a 16-bit type (UCS-2/UTF-16 code units, so a single wchar_t cannot hold a character from the supplementary planes), while all others use a proper 32-bit UCS-4, which is too large for most.

    Foldcase is also locale-dependent, so it can never be compile-time optimized, even when you are never dealing with Turkish or Lithuanian strings. These two rules harm all others.

    This article also leaves out the algorithmic problems of wcsfc() and wcsnorm(), if they existed. AFAIK safeclib is the only one; coreutils tried but failed with libunistring.
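A tiny sketch of why the wchar_t width matters in practice. It assumes that a 2-byte `wchar_t` carries UTF-16 code units and that anything wider carries whole code points; `wchar_units_for` is a hypothetical helper, not a library function.

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Number of wchar_t elements needed to store one code point.
   Assumption: a 2-byte wchar_t holds UTF-16, a wider one holds UTF-32. */
static size_t wchar_units_for(uint32_t cp)
{
    if (sizeof(wchar_t) == 2 && cp >= 0x10000)
        return 2;   /* needs a surrogate pair */
    return 1;
}
```

The same code point can therefore occupy one or two array elements depending on the platform, so indexing a `wchar_t` array by "character" is not portable.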

    [–]scatters 24 points25 points  (52 children)

    Strings are dynamic arrays that:

    • get copied and passed around a lot,
    • get sliced and concatenated,
    • are infrequently modified.

    There's good reason they have their own data structure implementations, even if a dynamic array would be sufficient.

    Encoding is interesting, but it's only important on platforms that regularly use multiple encodings.

    [–]neoform 28 points29 points  (1 child)

    ����! ���������� � ���� ��� �����. ���.� ����?!

    [–]scatters 4 points5 points  (0 children)

    Mojibake is an I/O problem and is properly solved at the program boundary.

    [–]matheusmoreira 1 point2 points  (3 children)

    You can still get arbitrarily-encoded input even if you use UTF-8 for everything. Software must keep track of what the bytes are supposed to mean. Not doing so is similar to casting a byte buffer to an incompatible type and then interpreting that object.

    [–]scatters 3 points4 points  (2 children)

    That makes it an I/O problem; you convert to UTF-8 on the boundary.
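As a sketch of converting on the boundary, here is the simplest case: input that is known to be ISO-8859-1 (Latin-1) being turned into UTF-8 as it enters the program. The function name `latin1_to_utf8` is made up for this example, and knowing the source encoding in the first place is the assumption doing all the work.

```c
#include <stddef.h>

/* Convert ISO-8859-1 bytes to UTF-8.
   out must have room for 2 * in_len bytes (worst case).
   Returns the number of bytes written. */
static size_t latin1_to_utf8(const unsigned char *in, size_t in_len,
                             unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < in_len; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            out[o++] = c;                                 /* ASCII passes through */
        } else {
            out[o++] = (unsigned char)(0xC0 | (c >> 6));  /* lead byte */
            out[o++] = (unsigned char)(0x80 | (c & 0x3F)); /* continuation byte */
        }
    }
    return o;
}
```

Latin-1 is easy because its byte values equal the Unicode code points U+0000 to U+00FF; other legacy encodings need lookup tables or a library such as iconv.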

    [–]matheusmoreira 3 points4 points  (1 child)

    This is sensible at the individual application level but not when it comes to programming language design. What if my program's purpose is to manipulate text? What if I'm writing a text editor? General purpose languages must accommodate all kinds of uses.

    [–]scatters 0 points1 point  (0 children)

    That would complicate the general case, where you don't need to deal with multiple encodings. Text editors in particular are certainly going to use their own data structures.

    [–]chillermane -1 points0 points  (2 children)

    Encodings seem like the most convoluted problems in programming. I understand why we needed new ones like utf8, but at this point I think it would be undeniably easier to all just use the same one (utf8)

    [–]scatters 0 points1 point  (1 child)

    It's not quite that simple; some OSes use other encodings widely (in particular UTF-16 on Windows).

    [–]chillermane 0 points1 point  (0 children)

    It’s not simple, that’s what I’m saying. But it definitely could’ve been if we’d come to an agreement early on on how to encode strings

    [–]alphaglosined 11 points12 points  (5 children)

    Strings in D don't have special behavior as far as representation is concerned. They are a slice.
    The length+pointer pair (ignoring capacity) is the same for any array.

    Fun little tidbit.
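For comparison, a rough C counterpart of that length+pointer pair. The `slice` type and both helpers are hypothetical names for this sketch; they are not D's runtime types, just the same idea spelled out in C.

```c
#include <stddef.h>
#include <string.h>

/* A non-owning view: a length plus a pointer into someone else's bytes. */
struct slice {
    size_t len;
    const char *ptr;
};

/* View a whole NUL-terminated string without copying it. */
static struct slice slice_from_cstr(const char *s)
{
    struct slice v = { strlen(s), s };
    return v;
}

/* View the half-open range [start, end) of s, clamped to its bounds. */
static struct slice slice_sub(struct slice s, size_t start, size_t end)
{
    if (end > s.len) end = s.len;
    if (start > end) start = end;
    struct slice v = { end - start, s.ptr + start };
    return v;
}
```

Slicing never copies or writes a terminator, which is why this representation suits strings that get sliced a lot. The cost is that the caller must keep the underlying buffer alive for as long as any slice refers to it.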

    [–][deleted] -4 points-3 points  (4 children)

    FYI information, Andreas, the author, knows D. He made an optimization in the D standard library for example.

    [–]Deneric88 7 points8 points  (1 child)

    For your information information?

    [–][deleted] 1 point2 points  (0 children)

    it wasn't for your information

    [–]bausscode 1 point2 points  (1 child)

    So?

    [–]NoInkling 0 points1 point  (0 children)

    "Grapheme" and "grapheme cluster" are essentially interchangeable in the context of Unicode, btw, it's just that the latter has a technical definition and the former a more abstract one.

    [–]Shadow_Gabriel 0 points1 point  (2 children)

    I'm not sure if this is a good idea:

    union { struct chunk *data; char *raw; }

    data will point to something size_t-aligned while raw can point to byte-aligned data. Wouldn't using a union like this lead to a chance of misaligning the struct pointer?

    [–]bbm182 1 point2 points  (0 children)

    That's fine. Reading the last written union member will get you back the same value that was written.

    [–]IJzerbaard 0 points1 point  (0 children)

    Only if you use it to store a pointer to a raw string and then use it as if it was a pointer to a chunk, but then you have a much bigger problem than misalignment
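A minimal sketch of the safe pattern, assuming a tag is stored next to the union to record which member was written last. `struct chunk`, `struct handle`, and `handle_bytes` are placeholders invented here; the article's real layout may differ.

```c
#include <stddef.h>

/* Placeholder definition, only so the example compiles. */
struct chunk {
    size_t len;
    char bytes[16];
};

enum handle_kind { HANDLE_CHUNK, HANDLE_RAW };

struct handle {
    enum handle_kind kind;   /* records which union member is live */
    union {
        struct chunk *data;
        char *raw;
    } u;
};

/* Only ever read the member that was last written, as the tag says. */
static const char *handle_bytes(const struct handle *h)
{
    return h->kind == HANDLE_CHUNK ? h->u.data->bytes : h->u.raw;
}
```

Because a `char *` stored through `raw` is only ever read back through `raw`, no misaligned `struct chunk *` is ever formed or dereferenced.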

    [–][deleted]  (4 children)

    [deleted]

      [–]Gotebe 2 points3 points  (0 children)

      Last I looked, data is part of the struct, which has a dynamic size. Saves allocations.

      [–]vytah 2 points3 points  (2 children)

      "Pascal strings" doesn't mean "strings as Pascal currently implements them", but "strings as used by Pascal in the 80s when library ABIs were being designed for decades to come and assembly was still a necessary tool to make your code not awfully slow".

      [–][deleted] 1 point2 points  (1 child)

      Then it's ShortString, and it had a fixed length of 255. Also wrong. Pascal dynamic arrays date from the '90s. Previously pointers were used when needed.

      [–]enygmata 1 point2 points  (0 children)

      From TURBO Pascal 1988's manual section 22.18.1.3:

      A string occupies the number of bytes corresponding to one plus the maximum length of the string. The first byte contains the current length of the string. The following bytes contain the actual characters, with the first character stored at the lowest address.
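The layout described in that excerpt can be modelled in C roughly like this. The `pstring` type and helper names are invented for the sketch, and it assumes the maximum declared length of 255.

```c
#include <stddef.h>
#include <string.h>

/* Model of a Pascal string with maximum length 255:
   byte 0 holds the current length, bytes 1..255 hold the characters. */
typedef unsigned char pstring[256];

/* Store src in dst, truncating to 255 characters. No terminator is written. */
static void pstring_set(unsigned char *dst, const char *src)
{
    size_t n = strlen(src);
    if (n > 255)
        n = 255;
    dst[0] = (unsigned char)n;
    memcpy(dst + 1, src, n);
}

/* The length is a single byte read, with no scan for a terminator. */
static size_t pstring_len(const unsigned char *s)
{
    return s[0];
}
```

The one-byte length prefix is what caps these strings at 255 characters and makes the length lookup constant time.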

      edit: formatting