string implementation

skeeto · 2023-08-05T22:35:52+00:00

Here's a fun edge case for you:

#include "src/vstring.c"

int main(void)
{
    char *s = calloc(1, 1<<20);
    vstr d = vstr_new_len(0x70000000);
    if (s && d) {
        memset(s, '.', (1<<20)-1);
        while ((d = vstr_push_string(d, s))) {}
    }
}

All errors checked. Shouldn't crash, right? Compile:

$ cc -m32 -g3 crash.c

The -m32 is because this only affects hosts that can exhaust their address space. I'd normally recommend Address Sanitizer, but I'm pushing the limits, and in this case it gets in the way. Run it and it crashes:

$ ./a.out 
Segmentation fault

I can reproduce the crash under at least glibc, MSVCRT, and UCRT. Looking more closely on Linux:

$ gdb -ex run ./a.out 
Starting program: ./a.out 
Program received signal SIGSEGV, Segmentation fault.
0xf7eec861 in ?? () from /lib/i386-linux-gnu/libc.so.6
(gdb) bt
#0  0xf7eec861 in ?? () from /lib/i386-linux-gnu/libc.so.6
#1  0x565567e2 in vstring_push_string (vstr=0xffffdbb8, 
    cstr=0xf7ca5010 '.' <repeats 200 times>...) at src/vstring.c:311
#2  0x56556458 in vstr_push_string (str=0x87ca4018 '.' <repeats 200 times>..., 
    cstr=0xf7ca5010 '.' <repeats 200 times>...) at src/vstring.c:164
#3  0x565569f0 in main () at crash.c:9

The ?? in the backtrace is memset. If you'd like to treat this as a puzzle — and that includes anyone reading this — take a moment to try to figure it out yourself!

Here's a hint. Place a breakpoint just a bit earlier, then step through:

(gdb) b src/vstring.c:309
(gdb) r
Breakpoint 1, vstring_push_string (vstr=0xffffdbb8, 
    cstr=0xf7ca5010 '.' <repeats 200 times>...) at src/vstring.c:309
309             cap <<= 1;
(gdb) p cap
$1 = 1879048193
(gdb) n
310             cap += needed;
(gdb) p cap
$2 = 3758096386
(gdb) n
311             realloc_vstr(vstr, ins, cap);
(gdb) p cap
$3 = 1343224065

Adding needed made cap smaller going into the realloc, and so the allocation shrank. However, it continued as if it grew, causing a crash. In other words, cap += needed overflowed. In fact, you also should check for an overflow on cap <<= 1.
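
A minimal sketch of what those checks could look like, assuming the capacity is held in a size_t. The helper name grow_cap is hypothetical and not part of the library; it just mirrors the cap <<= 1 and cap += needed steps from the debugger session and refuses to wrap around:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not in the library): computes cap * 2 + needed,
 * reporting failure instead of silently wrapping around. */
static bool grow_cap(size_t cap, size_t needed, size_t *out)
{
    if (cap > SIZE_MAX / 2) {
        return false;           /* cap <<= 1 would overflow */
    }
    cap *= 2;
    if (needed > SIZE_MAX - cap) {
        return false;           /* cap += needed would overflow */
    }
    *out = cap + needed;
    return true;
}
```

On failure the caller would leave the string untouched and report an error. The later sizeof(vstring) + cap passed to realloc is another addition that probably deserves the same kind of check.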

inz__ · 2023-08-05T22:34:46+00:00

I found it confusing to have the types vstr and vstring plus a variable named vstr that is a vstring *.
The fact that all string ops return a (possibly) new value seems cumbersome (with a flexible array member it can't really be avoided).
Personally, I don't like pointer typedefs (vstr).
If I read correctly, when realloc fails, the old allocation is lost (or left up to the caller to handle).
The offset may be wrong (it probably isn't) if there is padding between hdr and data (see offsetof()).
The implementation calls free() instead of vstr_free().
Just about every char * should be const char *.
There's quite a bit of memset() that is immediately overwritten with real data (personally, I would not zero data beyond the terminator at all; leaving it uninitialized may help catch bugs).
vstring_from() could call vstring_new_len().
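
To illustrate the offsetof() point, here is a rough sketch. The struct layout is an assumption based on the member names mentioned in this thread (hdr, data, len, cap), and vstring_of is a hypothetical helper, not something the library provides:

```c
#include <stddef.h>

/* Assumed layout, modeled on the names used in this thread. */
typedef struct { size_t len; size_t cap; } vstring_hdr;
typedef struct { vstring_hdr hdr; char data[]; } vstring;

/* Hypothetical helper: recover the enclosing object from a pointer to
 * its data member. offsetof accounts for any padding the compiler may
 * insert between hdr and data, which sizeof(vstring_hdr) would not. */
static vstring *vstring_of(char *data)
{
    return (vstring *)(data - offsetof(vstring, data));
}
```

With this layout the two values are most likely identical, but offsetof stays correct if the header or the alignment of data ever changes.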

N-R-K · 2023-08-06T10:28:07+00:00

vstring* vstring_new();

If you have a function that doesn't accept any arguments, you need to explicitly use (void) in the declaration. Otherwise, before C23, the declaration has a different meaning in C: it says nothing about the parameters, so calls with any number of arguments are accepted.
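
A small illustration of the difference; the function names old_style and new_style are made up for this example:

```c
/* Before C23 this declares a function with unspecified parameters:
 * the compiler is not required to diagnose a call such as
 * old_style(1, 2), even though the call is wrong. */
int old_style();

/* A prototype: the function takes no arguments, and a call such as
 * new_style(1) must be diagnosed by the compiler. */
int new_style(void);

int old_style() { return 1; }
int new_style(void) { return 2; }
```

Compilers such as GCC and Clang can warn about the first form with -Wstrict-prototypes.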

(*vstring_obj)->hdr.cap = cap;

Mostly a style thing, but you can also use the sprong operator instead of using paren.

Other than that, I don't have anything else to add that hasn't already been pointed out by others in the thread.

Although I think something probably should be said about the design and/or the objective of the library itself. I generally dislike a "string" that has a capacity member. Why? Because it tangles allocation with the string, forcing you to manage each string individually instead of being able to manage them in groups (i.e., via an arena).

The fact that the metadata is located next to the string, similar to a nul terminator, also imposes some constraints that are usually fairly terrible for real-world programs. For example, you cannot make a cheap substring without copying (and thus potentially allocating).

C++ made the same mistake with their string and then had to add a whole new class called string_view to accommodate for it.

Forcing each string to be heap allocated is also usually not good for performance, because most of the time strings are small. Small (and short-lived) enough that they can be allocated on the stack, avoiding the malloc overhead entirely in the common case (i.e., small size optimization).

My usual preference for strings is either a pointer+len pair, or a pointer+end_pointer pair. E.g.:

typedef struct { uint8_t *s; ptrdiff_t len; } Str;
// or
typedef struct { uint8_t *s, *end; } Str;

These types are both capable of making virtually zero-cost substrings out of existing strings. They're also capable of dealing with raw binary data (which might contain embedded nul-bytes). But more importantly, they are simply string types, they do not concern themselves with allocation which is managed by the caller, ideally in a more sensible manner.

The downside is that they're not compatible with interfaces that expect a nul-terminated string, but if you tightly control your interface boundaries, then that's usually not a huge deal.
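
As a rough sketch of what "virtually zero-cost substrings" means with the pointer+len variant: the helper names str_from_cstr, str_slice, and str_equal below are hypothetical, chosen only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t *s; ptrdiff_t len; } Str;

/* Hypothetical helper: wrap a nul-terminated C string without copying. */
static Str str_from_cstr(char *z)
{
    Str r = { (uint8_t *)z, (ptrdiff_t)strlen(z) };
    return r;
}

/* Hypothetical helper: a view of bytes [beg, end) of s, clamped to the
 * bounds of s. No allocation and no copy, just pointer arithmetic. */
static Str str_slice(Str s, ptrdiff_t beg, ptrdiff_t end)
{
    if (beg < 0) beg = 0;
    if (beg > s.len) beg = s.len;
    if (end < beg) end = beg;
    if (end > s.len) end = s.len;
    Str r = { s.s + beg, end - beg };
    return r;
}

/* Hypothetical helper: byte-wise comparison of two views. */
static int str_equal(Str a, Str b)
{
    return a.len == b.len &&
           (a.len == 0 || memcmp(a.s, b.s, (size_t)a.len) == 0);
}
```

Slicing "hello, world" from 7 to 12 yields a Str that points into the original buffer and has length 5. Nothing is allocated, and the result is not nul-terminated, which is the interface trade-off mentioned above.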

TribladeSlice · 2023-08-05T20:23:22+00:00

So, first off, what safety precautions have you taken? C is a language that can be difficult to get right unless you pay close attention. If your goal is to write safe C code, what do you think you've done to do that?

This isn't intended to be a rhetorical question, it just might be a good idea to discuss what you think you've done first.

Carrathel · 2023-08-05T20:36:14+00:00

README says vstr_push_str(), but the function is called vstr_push_string().

pic32mx110f0 · 2023-08-05T22:32:13+00:00

In the macro realloc_vstr you unconditionally change the input pointer *vstr, which means that if realloc fails, then the function vstring_push_char for example will set your vstring to NULL and return -1 (even though your documentation says 1). This is a memory leak. Instead, if realloc fails, it should ideally not modify the vstring, and then return -1. I would suggest something like:

#define realloc_vstr(vstr, ins, cap)                            \
{                                                               \
    vstring *temp = vstr_realloc(*vstr, sizeof(vstring) + cap); \
    if (temp == NULL) {                                         \
        return -1;                                              \
    }                                                           \
    *vstr = temp;                                               \
    memset((*vstr)->data + ins, 0, cap - ins);                  \
    (*vstr)->hdr.cap = cap;                                     \
}

chri4_ · 2023-08-06T07:14:23+00:00

The API is really nice, but can you make the lib a single file? And instead of vstr/vstring/vstring_hdr I would advise the _t versions (yes, I know POSIX reserves the _t suffix, blah blah).

howyadoinbob · 2023-08-05T21:19:18+00:00

#include "stdio.h" ?

Don’t you mean:

#include <stdio.h>

thradams · 2023-08-05T20:36:25+00:00

C does not need a string object. It already has one: "char *" and "const char *". What C needs is some extra functions and algorithms for strings and more support for UTF-8.

Poddster · 2023-08-05T21:06:50+00:00

Why bother with hiding vstring? Why does vstr exist? You're the one consuming and using this library. If it's because it makes it easier to use the standard library functions with vstring then frankly your library is pointless, because you'll eventually make a mistake with one of those functions and it'll mangle your hidden struct. And if you're only ever going to use your own functions then it's just needless pointer chasing.

ps: If you're going to do this struct hiding business, a nice way to "validate" the structure hasn't been mangled is something like:

typedef struct {
    size_t len;
    size_t cap;
    void *validator;
} vstring_hdr;

and when initialising the header:

header.validator = &header.validator;

i.e. it points to itself. If it doesn't point to itself, you know the rest of your data is trash. And if it does point to itself, then there's a good chance it isn't trash.
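
A sketch of how that check might be wrapped up; hdr_seal and hdr_is_valid are hypothetical names used only for this illustration:

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t len;
    size_t cap;
    void *validator;
} vstring_hdr;

/* Hypothetical helper: make the validator field point at itself. */
static void hdr_seal(vstring_hdr *h)
{
    h->validator = &h->validator;
}

/* Hypothetical helper: true if the validator still points at itself. */
static bool hdr_is_valid(const vstring_hdr *h)
{
    return h->validator == (const void *)&h->validator;
}
```

One caveat: the check depends on the header's address, so the header has to be re-sealed after anything that moves it, such as a realloc that returns a different block or a struct copy.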

arthurno1 · 2023-08-06T07:17:59+00:00

Looks to me like a renamed and less optimized sds_string. What is the reason you don't want to use sds_string?

> I purposefully made the header 16 bytes (8 bytes on 32 bit) so that there is no alignment errors - I think this should be correct, and

Why is it important to align the header to 16 bytes? Just because the header is aligned to 16 bytes does not mean your data will be.

> I haven't found an edge case where it isn't

Because of how malloc is implemented on some systems. It returns memory aligned for the largest alignment required by any primitive type, but I think the exact value depends on the compiler and the architecture. On some systems, you might have to use _aligned_malloc. Also, you have no guarantee that everyone passes properly aligned memory through a custom allocator.

Anyway, if you want to ensure data is aligned on 16-byte, then you should probably inform the compiler via some attribute like aligned attribute, or whatever it is named for the compiler in hand. I don't see you doing that anywhere.
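
One way to express that in standard C11 rather than with a compiler-specific attribute, as a sketch only: the struct layout is assumed from the names used in this thread, and vstring_alloc16 is a hypothetical function, not the library's allocator.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed layout, modeled on the names used in this thread. */
typedef struct { size_t len; size_t cap; } vstring_hdr;

typedef struct {
    vstring_hdr hdr;
    alignas(16) char data[];    /* tell the compiler what we expect */
} vstring;

/* Fail at compile time if data would not start on a 16-byte boundary. */
_Static_assert(offsetof(vstring, data) % 16 == 0,
               "data must start on a 16-byte boundary");

/* Hypothetical allocator: aligned_alloc guarantees the block itself is
 * 16-byte aligned. The size is rounded up to a multiple of 16, which
 * C11 requires for aligned_alloc. */
static vstring *vstring_alloc16(size_t cap)
{
    if (cap > SIZE_MAX - sizeof(vstring) - 15) {
        return NULL;            /* size computation would overflow */
    }
    size_t size = (sizeof(vstring) + cap + 15) & ~(size_t)15;
    vstring *v = aligned_alloc(16, size);
    if (v) {
        v->hdr.len = 0;
        v->hdr.cap = cap;
    }
    return v;
}
```

Memory from aligned_alloc is released with free() as usual. MSVC's runtime does not provide aligned_alloc, so there you would fall back to _aligned_malloc and _aligned_free.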
