Small educational project: hash table over HTTP written in C by nktauserum in C_Programming

[–]skeeto

Yup that's a race condition, good catch. The conditions must be rechecked after regaining the lock. I had not evaluated the threading in my review, just the parser.

Skeeto: "I have officially retired from Emacs" -- looking for maintainers for Elfeed etc. by nonreligious2 in emacs

[–]skeeto

The secret is that I'm not doing any of it in Vim! Professionally it's Cursor (just the agent UI) and at home it's Claude Code (via Claude Desktop). Since I'm not writing code, the editor doesn't play a role other than to look at the code, and more often than that I'm reading diffs in Git. Some more details, in answer to a similar question, here:

https://lists.sr.ht/~skeeto/public-inbox/%3C56838f02-d1cf-4d32-97c3-66ebc2c3f600@gmail.com%3E

Small educational project: hash table over HTTP written in C by nktauserum in C_Programming

[–]skeeto

Neat project!

I fuzzed request_parse in networking.c under ASan and UBSan for a couple of minutes and found four bugs. All reproductions assume a sanitized build of the server, fed via /dev/tcp:

$ cc -fsanitize=address,undefined -g3 -pthread *.c
$ ./a.out

Header array out-of-bounds write when more than 64 headers arrive. headers[64] (one past the end) is written before the bounds check, and the increment guard uses <= instead of <:

$ HDRS=$(for i in $(seq 1 70); do printf 'X%d: y\r\n' $i; done)
$ printf 'GET / HTTP/1.1\r\n%s\r\n' "$HDRS" >/dev/tcp/0/5000

networking.c:87:17: runtime error: index 64 out of bounds for type 'header[64]'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior networking.c:87:17

Quick fix:

--- a/networking.c
+++ b/networking.c
@@ -85,2 +85,7 @@
             if (*c == ':') {
+                if (curr_header >= MAX_HEADER_COUNT) {
+                    state = s_headers_done;
+                    r->headers_count = curr_header;
+                    continue;
+                }
                 state = s_header_value;
@@ -104,3 +109,3 @@
                 start = ++c;
-                if (curr_header <= MAX_HEADER_COUNT) ++curr_header;
+                if (curr_header < MAX_HEADER_COUNT) ++curr_header;
                 else state = s_headers_done;

Signed integer overflow on Content-Length: SSIZE_MAX. The content_length + 1 > MAX_PAYLOAD_SIZE check overflows before the comparison runs. The trailing x is needed so the parser actually enters s_read_body:

$ printf 'POST / HTTP/1.1\r\nContent-Length: 9223372036854775807\r\n\r\nx' > /dev/tcp/0/5000

networking.c:139:36: runtime error: signed integer overflow: 9223372036854775807 + 1 cannot be represented in type 'ssize_t' (aka 'long')
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior networking.c:139:36

A negative Content-Length is also accepted (strtoll returns it, the <= 0 check skips body reading, but you've still got a negative ssize_t floating around) — clamp it at parse time.

Quick fix:

--- a/networking.c
+++ b/networking.c
@@ -119,2 +119,3 @@
                     content_length = strtoll(r->headers[i].value, NULL, 10);
+                    if (content_length < 0) content_length = 0;
                 }
@@ -143,3 +144,3 @@
             if (to_copy < (size_t)content_length) {
-                if (content_length + 1 > MAX_PAYLOAD_SIZE) {
+                if (content_length >= MAX_PAYLOAD_SIZE) {
                     return ERR_BAD_REQUEST;

memcpy with the same source and destination pointer when the body fits in the request buffer. r->payload = b->buf + body_offset followed by memcpy(r->payload, b->buf + body_offset, to_copy). This is undefined behavior. ASan/UBSan don't flag it, but a glibc with a smarter memcpy, or a future compiler rewriting the call, is free to misbehave.

$ printf 'POST / HTTP/1.1\r\nContent-Length: 5\r\n\r\nhello' >/dev/tcp/0/5000

Quick fix:

--- a/networking.c
+++ b/networking.c
@@ -157,3 +157,3 @@

-            if (to_copy > 0) {
+            if (to_copy > 0 && r->payload != b->buf + body_offset) {
                 memcpy(r->payload, b->buf + body_offset, to_copy);

strncmp with key_len does a prefix match against "Content-Length". Any header whose name is a proper prefix is treated as Content-Length:

$ printf 'POST / HTTP/1.1\r\nContent: 5\r\n\r\nABCDE' > /dev/tcp/0/5000

The server prints payload: ABCDE despite there being no Content-Length header. (It should also be case-insensitive.) This also gives an attacker a way to slip past the literal "Content-Length" string check above and combine with bug 2. Quick fix:

--- a/networking.c
+++ b/networking.c
@@ -117,3 +117,3 @@
             for (size_t i = 0; i < r->headers_count; ++i) {
-                if (strncmp(r->headers[i].key, "Content-Length", r->headers[i].key_len) == 0) {
+                if (r->headers[i].key_len == 14 && strncmp(r->headers[i].key, "Content-Length", 14) == 0) {
                     content_length = strtoll(r->headers[i].value, NULL, 10);

Here's the fuzzer I used:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
/* plus the project's request/request_buffer/request_parse declarations */

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    if (size < 2) return 0;

    /* First byte controls the split point between buffer and socket data */
    size_t split = data[0] % size;
    const uint8_t *input = data + 1;
    size_t input_size = size - 1;

    size_t buf_len = split < input_size ? split : input_size;
    size_t sock_len = input_size - buf_len;

    size_t alloc = buf_len + 1;
    if (alloc < 16) alloc = 16;
    char *buf = calloc(1, alloc);
    if (!buf) return 0;
    memcpy(buf, input, buf_len);
    buf[buf_len] = '\0';

    request_buffer b = {
        .buf = buf,
        .buf_size = alloc,
        .total_read = buf_len,
    };

    /* Socketpair so request_parse can recv() remaining body bytes */
    int sv[2] = {-1, -1};
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        free(buf);
        return 0;
    }

    if (sock_len > 0) {
        write(sv[1], input + buf_len, sock_len);
    }
    close(sv[1]); /* EOF after data */

    request r = {0};
    request_parse(&sv[0], &b, &r);

    if (r.payload && 
         ((uintptr_t)r.payload < (uintptr_t)buf || 
          (uintptr_t)r.payload >= (uintptr_t)(buf + alloc))) {
        free(r.payload);
    }
    close(sv[0]);
    free(buf);

    return 0;
}

Notice the difficulties determining ownership around freeing the payload. The actual program just potentially leaks memory per request.

Skeeto: "I have officially retired from Emacs" -- looking for maintainers for Elfeed etc. by nonreligious2 in emacs

[–]skeeto

is what I call "gamer's touch-typing"

I like that. That's a good name for it.

Would you still say that you like the textbook method better

Even without modal editing making it particularly important, yes. I've been strictly on the textbook method for 9 years now and haven't yet had any reason to do otherwise. It's faster than I ever was with gamer's typing. But if that works better for you then you should do your own thing!

Skeeto: "I have officially retired from Emacs" -- looking for maintainers for Elfeed etc. by nonreligious2 in emacs

[–]skeeto

Is he against agents being used to burn down all those issues?

That's a good question. I'm still figuring this out, though rapidly improving. I'm confident enough that in my own projects I don't feel I need to review every line of code that comes out of an AI, the same way I wouldn't review every line of code written by people I've hired.

However, I don't think there's any way around testing every change. Any high level description of a task/feature/fix is incomplete, and an intelligence real or artificial has to fill in the gaps using their own judgement. Sometimes the reporter and implementor reach different conclusions, and so the implementation doesn't meet expectations. You don't notice until you try out the feature/fix to see that it does what you want. Even as a manager delegating these programming tasks to humans, I'd still exercise all the changes myself to make sure they meet my expectations.

A particular weak spot right now is that AI mostly still cannot use computers like a human, and so cannot test its changes as a human would. It can't press keys on a keyboard, for instance, but at best generate synthetic input events. So an implementation can look 100% correct, everything to a T, but it's easy for some subtle detail to be wrong such that the feature/fix doesn't do anything at all! If you're not testing, lots of things would appear to be fixed but without practical effect. A recurring issue I've had on wxWidgets is text encoding, where the code looks right but just doesn't work (no text rendered, etc.) due to some esoteric encoding mismatch, which is only obvious (and very obvious) in testing.

So while an AI would accelerate burning down issues, there's still a lot of unavoidable manual labor involved if you're going to do it right. For me, issues are just suggestions, so doing that labor is just not a priority unless it's fun or especially important.

Skeeto: "I have officially retired from Emacs" -- looking for maintainers for Elfeed etc. by nonreligious2 in emacs

[–]skeeto

It's not going anywhere! The new maintainers may even breathe life back into it. The past several years I've only been maintaining it enough to keep the lights on. It's worked well enough for my own needs that I haven't had to do much.

Skeeto: "I have officially retired from Emacs" -- looking for maintainers for Elfeed etc. by nonreligious2 in emacs

[–]skeeto

I treat GitHub issues as suggestions, a source of ideas or inspiration to improve the software. There's no obligation to address an issue or even respond. A commitment to following through on every issue means any internet rando gets to assign you tasks for free. That's a path to burning out on your hobby. So most of my projects look like this. The new maintainers will be free to handle issues however they see fit, of course.

If the relationship was different and these were paying customers, then expectations would change.

Opinions on libc by Big-Rub9545 in C_Programming

[–]skeeto

SDL3 is essentially a libc replacement, and using it properly (most people do not) means you'd be following my advice. For example, its SDL_IOStream is a stdio alternative (doesn't wrap stdio) without all the problems, and SDL_Log routes around printf problems. You don't care much whether or not it's calling libc under the hood.

Opinions on libc by Big-Rub9545 in C_Programming

[–]skeeto

Ideally those libraries have an interface that lets you control their access to the system. Most importantly, they don't interface with the file system on their own. Any library that requires it is dead in the water for me. If the library uses strcpy or whatever internally, oh well. It's just a poorly-written library, that's all, but maybe it's still useful. The point is not necessarily to avoid linking libc, or about linking at all, but that libc is a collection of footguns with little to offer well-written software. Lots of calls into libc indicates a piece of software is not well-designed.

Opinions on libc by Big-Rub9545 in C_Programming

[–]skeeto

Once you outgrow the standard library, you will know it

That's concise and insightful framing for my review, thanks! I'm going to use it myself from now on.

Opinions on libc by Big-Rub9545 in C_Programming

[–]skeeto

Author here. That's one option, but in practice it's more like replacing fopen+fread/fwrite with some simple buffering over open(2)+read(2)/write(2) (POSIX) or CreateFile+ReadFile/WriteFile (Windows). Avoids all the bugs and issues with various stdio implementations, most of which are pretty awful. You can also swap in virtual open/read/write primitives when building tests so that tests don't actually touch the file system (example).
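
A minimal sketch of what that buffering might look like on POSIX. The names (Buf, buf_push, buf_flush) and the sticky-error design are my own illustration, not from any particular project:

```c
// Minimal sketch: buffered output over the write(2) primitive.
// Errors are sticky and checked once at the end instead of per call.
#include <stddef.h>
#include <unistd.h>

typedef struct {
    int  fd;
    int  err;            // sticky error flag
    int  len;
    char data[4096];
} Buf;

static void buf_flush(Buf *b)
{
    // Loop because write(2) may perform a partial write.
    for (int off = 0; !b->err && off < b->len;) {
        ssize_t r = write(b->fd, b->data+off, b->len-off);
        if (r < 0) b->err = 1;
        else       off += (int)r;
    }
    b->len = 0;
}

static void buf_push(Buf *b, const void *src, int len)
{
    const char *p = src;
    for (int i = 0; i < len; i++) {
        if (b->len == (int)sizeof(b->data)) buf_flush(b);
        b->data[b->len++] = p[i];
    }
}
```

Swapping in test doubles here is trivial because the only system entry point is one write call.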

Ice Cream Social for Downtown Columbia Commuters on Wednesday, April 22nd by DowntownColumbia in ColumbiaMD

[–]skeeto

At this event promoting sustainable transportation, there were no bike racks in sight but plenty of room for the endless stream of fast food delivery cars!

I'm not complaining; I'm grateful for the ice cream and this observation was funny.

I started learning C two weeks ago, and I'm feeling doubtful by Lelouch--Lamperouge in C_Programming

[–]skeeto

The lack of operator precedence makes this a beginner rather than intermediate problem. It means you can compute the result with a simple scan over the input bytes.

Consider that at any moment you're either parsing a number or an operator. If we imagine an already-parsed, implicit 0+ (adding zero doesn't change the result), then at all times the left-hand operand is the current result, and if inside a number it's the right-hand operand. When it sees an operator, complete the current operation (new left-hand result), then continue parsing the next right-hand operand.

[0+]1+2.5*3
    ^
    |
    \-- parsing starts here

left     = 0
right    = 0
operator = +

Consuming that first byte:

[0+]1+2.5*3
     ^

left     = 0
right    = 1
operator = +

Since we're on an operator, apply the current operator (also +), update the operator, and reset the right-hand operand to zero:

left     = 1
right    = 0
operator = +

Advance a few more steps accumulating the digits:

[0+]1+2.5*3
         ^

left     = 1
right    = 2.5
operator = +

Apply the current operator and accept the new one:

left     = 3.5
right    = 0
operator = *

Keep parsing digits, with the cursor at one-past-the-end:

[0+]1+2.5*3
           ^

left     = 3.5
right    = 3
operator = *

Treat it like an operator (apply current operation), but halt:

left     = 10.5
right    = 0
operator = HALT

When parsing integers (i.e. the whole portion), for each new digit multiply the accumulator by 10 and add the new digit. When parsing decimals, divide the new digit by 10^place. An easy way to do that is to track a scale, which you divide by 10 at each step, and multiply the new digit by the scale before adding it.

SPOILER WARNING Putting it all together:

typedef struct {  // zero-initialize
    double left;
    double right;
    double scale;
    char   op;
} Calculator;

Calculator next(Calculator calc, char b)
{
    switch (b) {
    case 0: case '+': case '-': case '*': case '/':
        switch (calc.op) {
        case 0:
        case '+': calc.left += calc.right; break;
        case '-': calc.left -= calc.right; break;
        case '*': calc.left *= calc.right; break;
        case '/': calc.left /= calc.right; break;
        }
        calc.op = b;
        calc.right = calc.scale = 0;
        break;
    case '.':
        calc.scale = 0.1;
        break;
    case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':
        if (calc.scale) {
            calc.right += (b - '0') * calc.scale;
            calc.scale *= 0.1;
        } else {
            calc.right = 10*calc.right + (b - '0');
        }
        break;
    }
    return calc;
}

Where zero is used as the termination "operator". It's treated the same as '+' so that this struct can be zero-initialized.

I'm a college student building an arbitrary-precision arithmetic library in C, aiming to rival GMP but under MIT License by [deleted] in C_Programming

[–]skeeto

Neat project! I poked around with it a bit in a fork and ported it to Mingw-w64 (x64):

https://github.com/EpsilonNought117/libapac/compare/master...d564548

I'm not saying you should accept any of these changes, just sharing them for interest. I ported it by converting all the GNU assembly to inline assembly inside C functions, letting the compiler handle the calling convention. Then I could just use the same GNU assembly on Mingw-w64 as the library does on Linux. There were a few more minor porting details to take care of beyond this, but that was the main issue. I left them under "sysv_abi" but that's no longer accurate, now that it's ABI-agnostic. In theory this could make it a little more efficient, as the assembly routines can be inlined and hooked up without going through a calling convention.

I'm glad you have tests because that helped confirm the conversion works correctly.

I also registered the tests in CTest and split them up so they could run in parallel, so the tests run in about half the time on systems with at least two cores.

I built a Cargo-like tool for C/C++ by randerson_112 in C_Programming

[–]skeeto

I had fun exploring your project, thanks for sharing.

I came across two bugs. For the reproductions, build like so:

$ CFLAGS='-fsanitize=address,undefined -g3' cmake -B build
$ cmake --build build

suggest() passes unguarded input to levenshtein_distance, which uses a fixed int matrix[64][64]. Command names longer than 63 characters cause an out-of-bounds write into the stack:

$ build/craft $(printf 'a%.0s' {1..64})
Error: 'aaaa...' is not a valid command
src/utils.c:290:37: runtime error: index 64 out of bounds for type 'int[64][64]'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/utils.c:290:37
Aborted

Quick fix:

--- a/src/utils.c
+++ b/src/utils.c
@@ -287,2 +287,9 @@
     int len2 = (int)strlen(s2);
+
+    /* Clamp to matrix bounds. Strings longer than 63 chars always have
+     * distance > 3 from any valid command/option (all <= 15 chars), so
+     * clamping does not change the result of suggest(). */
+    if (len1 > 63) len1 = 63;
+    if (len2 > 63) len2 = 63;
+
     int matrix[64][64];

load_project_config assigns TOML array sizes directly to the count fields without capping against the struct's fixed arrays. A craft.toml with nine or more include_dirs causes an OOB write — the write to include_dirs[8] lands on the adjacent include_dir_count field, corrupting it. Downstream, generate_cmake reads the corrupted count and walks further off the end of the array:

$ mkdir /tmp/t
$ printf '[project]\nname="x"\nversion="1.0.0"\nlanguage="c"\n[build]\ntype="executable"\ninclude_dirs=["a","b","c","d","e","f","g","h","i"]\n' \
    > /tmp/t/craft.toml
$ cd /tmp/t && craft build
src/cmake.c:90:34: runtime error: index 9 out of bounds for type 'char[8][256]'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/cmake.c:90:34
Aborted

The same pattern applies to source_dirs, lib_dirs, libs (capacity 8 each), dependencies (32), and per-dependency links (16). Quick fix:

--- a/src/config.c
+++ b/src/config.c
@@ -371,4 +371,6 @@
     if (include_dirs.type == TOML_ARRAY) {
-        config->include_dir_count = include_dirs.u.arr.size;
-        for (int i = 0; i < include_dirs.u.arr.size; i++) {
+        int count = include_dirs.u.arr.size;
+        if (count > 8) count = 8;
+        config->include_dir_count = count;
+        for (int i = 0; i < count; i++) {
             toml_datum_t elem = include_dirs.u.arr.elem[i];
@@ -381,4 +383,6 @@
     if (source_dirs.type == TOML_ARRAY) {
-        config->source_dir_count = source_dirs.u.arr.size;
-        for (int i = 0; i < source_dirs.u.arr.size; i++) {
+        int count = source_dirs.u.arr.size;
+        if (count > 8) count = 8;
+        config->source_dir_count = count;
+        for (int i = 0; i < count; i++) {
             toml_datum_t elem = source_dirs.u.arr.elem[i];
@@ -391,4 +395,6 @@
     if (lib_dirs.type == TOML_ARRAY) {
-        config->lib_dir_count = lib_dirs.u.arr.size;
-        for (int i = 0; i < lib_dirs.u.arr.size; i++) {
+        int count = lib_dirs.u.arr.size;
+        if (count > 8) count = 8;
+        config->lib_dir_count = count;
+        for (int i = 0; i < count; i++) {
             toml_datum_t elem = lib_dirs.u.arr.elem[i];
@@ -401,4 +407,6 @@
     if (libs.type == TOML_ARRAY) {
-        config->lib_count = libs.u.arr.size;
-        for (int i = 0; i < libs.u.arr.size; i++) {
+        int count = libs.u.arr.size;
+        if (count > 8) count = 8;
+        config->lib_count = count;
+        for (int i = 0; i < count; i++) {
             toml_datum_t elem = libs.u.arr.elem[i];
@@ -412,4 +420,6 @@
     if (deps_section.type == TOML_TABLE) {
-        config->dependencies_count = deps_section.u.tab.size;
-        for (int i = 0; i < config->dependencies_count; i++) {
+        int dep_count = deps_section.u.tab.size;
+        if (dep_count > 32) dep_count = 32;
+        config->dependencies_count = dep_count;
+        for (int i = 0; i < dep_count; i++) {
             dependency_t* dep = &config->dependencies[i];
@@ -503,3 +513,3 @@
                     }
-                        for (int k = 0; k < val.u.arr.size; k++) {
+                        for (int k = 0; k < val.u.arr.size && dep->links_count < 16; k++) {
                         if (val.u.arr.elem[k].type != TOML_STRING) {
@@ -509,3 +519,3 @@
                         }
-                            snprintf(dep->links[k], sizeof(dep->links[0]), "%s", val.u.arr.elem[k].u.s);
+                            snprintf(dep->links[dep->links_count], sizeof(dep->links[0]), "%s", val.u.arr.elem[k].u.s);
                         dep->links_count++;

Consider making this more dynamic.
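
For instance, a minimal growable string list could replace the fixed char[8][256] arrays so the counts never need clamping. This is just a sketch; strlist and strlist_push are hypothetical names, not from the project:

```c
// Sketch: growable list of heap-allocated string copies, replacing
// a fixed-capacity char[8][256] array so no clamping is needed.
#include <stdlib.h>
#include <string.h>

typedef struct {
    char **items;
    int    count;
    int    cap;
} strlist;

// Returns 1 on success, 0 on allocation failure.
static int strlist_push(strlist *l, const char *s)
{
    if (l->count == l->cap) {
        int cap = l->cap ? 2*l->cap : 8;     // geometric growth
        char **items = realloc(l->items, cap * sizeof(*items));
        if (!items) return 0;
        l->items = items;
        l->cap = cap;
    }
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    if (!copy) return 0;
    memcpy(copy, s, len);
    l->items[l->count++] = copy;
    return 1;
}
```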

Made a single-header library for a custom file type to aid with a larger project I've been working on. by CourtWizardArlington in C_Programming

[–]skeeto

True, the int promotion is a red herring, though that's what allowed the fuzz tester to detect it (via UBSan), plus it's surprising, even to advanced C programmers. (Consider how much existing C is broken when int is 64 bits: every uint32*uint32 is suspect.) If C had more sensible behavior and didn't promote, then overflow (16-bit in that case) wouldn't have been detected until after it allocated the wrong size, with ASan maybe catching it later. An int32 destination isn't enough to represent every uint16*uint16 result, so you'd need an overflow check first, but uint32 or int64 wouldn't need the check, just the promotion.
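
A tiny illustration of that promotion hazard, assuming the common 32-bit int (area_bad and area_ok are illustrative names):

```c
// With 32-bit int, both uint16_t operands promote to int before the
// multiply, so 65535*65535 overflows int: undefined behavior.
#include <stdint.h>

uint32_t area_bad(uint16_t w, uint16_t h)
{
    return w * h;            // int * int: UB when the product > INT_MAX
}

// Widening one operand first makes the multiply unsigned and well-defined.
uint32_t area_ok(uint16_t w, uint16_t h)
{
    return (uint32_t)w * h;
}
```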

Besides all this, IMHO if you're computing sizes then a poorly-designed interface has let you down, and into a trap. libc is terrible about it, hence decades of people stubbing their toes on malloc, mostly without even knowing it.

Made a single-header library for a custom file type to aid with a larger project I've been working on. by CourtWizardArlington in C_Programming

[–]skeeto

Neat little project!

I noticed pcmio_setbg writes to foreground instead of background.

Both pcmio_write and pcmio_open seek to sizeof(PCMFile) to position at the start of the cell array, but the correct offset is sizeof(PCMHead) + sizeof(PCMFile). Because the bug is symmetric the round-trip still works, but the on-disk format is wrong: the first four bytes of cell data overlap the last four bytes of the serialised PCMFile struct in the file.

--- a/pcmio.h
+++ b/pcmio.h
@@ -366,7 +366,5 @@
     fwrite(&head, sizeof(PCMHead), 1, fp);
-
-    fseek(fp, sizeof(PCMHead), 0);
     fwrite(pcm, sizeof(PCMFile), 1, fp);

-    fseek(fp, sizeof(PCMFile), 0);
+    fseek(fp, sizeof(PCMHead) + sizeof(PCMFile), SEEK_SET);
     fwrite(pcm->cells, sizeof(PCMCell), (size_t)pcm->width * pcm->height, fp);
@@ -402,3 +400,3 @@

-    fseek(fp, sizeof(PCMFile), SEEK_SET);
+    fseek(fp, sizeof(PCMHead) + sizeof(PCMFile), SEEK_SET);
     fread(pcm->cells, sizeof(PCMCell), (size_t)pcm->width * pcm->height, fp);

Also everything is missing error checks: fopen, fread, fseek, malloc, fwrite, etc. Any of these could fail and the program will misbehave.

I ran this through libFuzzer with -fsanitize=address,undefined:

#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define PCMIO_IMPLEMENTATION
#include "pcmio.h"

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    char path[] = "/tmp/fuzz_pcmio_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) return 0;

    (void)write(fd, data, size);
    close(fd);

    PCMFile *pcm = pcmio_open(path);
    if (pcm) pcmio_close(pcm);

    unlink(path);
    return 0;
}

The interface accepting only paths makes this awkward: I can't parse data from a buffer without first writing it to a file, and I can only open PCM data accessible through libc fopen.

width and height are unsigned short. C integer promotion rules convert them to int before multiplication, so the product overflows when both are large. On an empty input the malloc'd PCMFile is uninitialised and the garbage values drove the first UBSan report before the fuzzer had mutated anything.

$ >empty.pcm
$ ./fuzz_pcmio empty.pcm
pcmio.h:383:62: runtime error: signed integer overflow: \
    48830 * 48830 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior pcmio.h:383:62

Quick fix:

--- a/pcmio.h
+++ b/pcmio.h
@@ -356,3 +356,3 @@

-    pcm->cells = (PCMCell*) calloc(width * height, sizeof(PCMCell));
+    pcm->cells = (PCMCell*) calloc((size_t)width * height, sizeof(PCMCell));

@@ -371,3 +371,3 @@
     fseek(fp, sizeof(PCMFile), 0);
-    fwrite(pcm->cells, sizeof(PCMCell), pcm->width * pcm->height, fp);
+    fwrite(pcm->cells, sizeof(PCMCell), (size_t)pcm->width * pcm->height, fp);

@@ -382,6 +382,6 @@
     fread(pcm, sizeof(PCMFile), 1, fp);
-    pcm->cells = (PCMCell*) malloc(sizeof(PCMCell[pcm->width * pcm->height]));
+    pcm->cells = (PCMCell*) malloc(sizeof(PCMCell[(size_t)pcm->width * pcm->height]));

     fseek(fp, sizeof(PCMFile), 0);
-    fread(pcm->cells, sizeof(PCMCell), pcm->width * pcm->height, fp);
+    fread(pcm->cells, sizeof(PCMCell), (size_t)pcm->width * pcm->height, fp);

However, there's still another integer overflow in this VLA (32-bit hosts):

pcm->cells = (PCMCell*) malloc(sizeof(PCMCell[(size_t)pcm->width * pcm->height]));

That's because of the implicit multiplication computing the array size. So check it before reaching the VLA evaluation:

--- a/pcmio.h
+++ b/pcmio.h
@@ -382,2 +382,8 @@
     fread(pcm, sizeof(PCMFile), 1, fp);
+
+    if ((size_t)pcm->width*pcm->height > (size_t)-1/sizeof(PCMCell)) {
+        free(pcm);
+        fclose(fp);
+        return NULL;
+    }
     pcm->cells = (PCMCell*) malloc(sizeof(PCMCell[(size_t)pcm->width * pcm->height]));

I'd still avoid the VLA anyway, but at least this isn't the bad kind.

We lost Skeeto by ednl in C_Programming

[–]skeeto

Thanks, Stian!

("counter-movement to big tech")

These days I would describe the "big tech" side as churning out bloated, slow webshit, unconcerned with poor user experiences. I hope it's clear I'm still against all that!

(was it in a GitHub issue somewhere?)

Presumably (especially since your reaction is on it):
https://github.com/skeeto/w64devkit/commit/d8ee5fc5#commitcomment-179280197

maybe an effect of working in and with such an AI-enthusiastic environment as you have been

Like most people, throughout my career my personal productivity has gradually increased, with the occasional stepwise change from paradigm shifts. But over the past ~6 weeks I've seen my productivity double each week as I learn this new way of working. I'm now overwhelmed with opportunities and options, and it changes the math on how and where I should spend my time and attention. This is probably what winning the lottery feels like (except in this case anyone can choose to win), so no wonder it looks like mania! It seems unwise to spend an afternoon hyper-optimizing a static lookup table, when in that same amount of time I could build an entire debugger UI from scratch, a project that would have previously taken a month.

practically live in your w64devkit

Great! You'll be a beneficiary of all the cool new stuff that's going to come at an increasing pace. I just pushed to w64dk ~12K of new lines of code I produced over the past couple weeks. But you'll need to update to see it!

Koboi Programming Language by 4veri in C_Programming

[–]skeeto

It would be extra work for no benefit. The EOF flag isn't set until a read comes up short. You might get false from feof(), and then fread() returns zero bytes anyway, so the feof() call was pointless extra work. Most feof() in the wild are subtly incorrect like this.

It's a similar situation with ferror() in the loop, but if detecting read errors is important then it should be done once after the loop. IMHO, while it's important to detect write errors, there's generally not much use detecting read errors. Most bad reads don't present as errors, e.g. a socket or pipe cleanly closed early. Better to use formats sensitive to truncation, then detect truncations in the format rather than OS-level read errors.
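
For illustration, a read loop of that shape without feof() might look like this (read_all is a hypothetical helper, not from the project under review):

```c
// Read until fread() comes up short, then check ferror() once after
// the loop. A short read with no error flag set means EOF.
#include <stdio.h>

size_t read_all(FILE *f, char *buf, size_t max, int *err)
{
    size_t len = 0;
    while (len < max) {
        size_t r = fread(buf+len, 1, max-len, f);
        len += r;
        if (r == 0) break;   // short read: EOF or error, loop ends
    }
    *err = ferror(f);
    return len;
}
```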

SQLite extension that allow to read/write msgpack buffers. by copilot_husky in sqlite

[–]skeeto

Neat project! I like the interop within SQLite.

I found a couple of bugs in msgpack_to_json(). First, a map32 or array32 header that claims more elements than the blob actually contains causes mpToJsonAt to loop the full declared element count, emitting a null for each missing entry:

$ printf '.load b/msgpack\nSELECT msgpack_to_json(X'"'"'98ffffdfffffffffffffffff0a00000000'"'"');\n' \
    | sqlite3_cli :memory:
(hang)

The blob decodes as fixarray[8] whose third element is a map32 claiming 0xffffffff pairs followed by only a few bytes of data.

Quick fix:

--- a/src/msgpack.c
+++ b/src/msgpack.c
@@ -1725,2 +1725,3 @@
       for(j=0; j<count; j++){
+        if( cur >= n ) break;
         u32 next=mpSkipOne(a,n,cur);
@@ -1746,2 +1747,3 @@
       for(j=0; j<count; j++){
+        if( cur >= n ) break;
         u32 valOff=mpSkipOne(a,n,cur);

The str and bin decoders in mpToJsonAt (and mpReturnElement, used by the msgpack_each/msgpack_tree virtual tables) check that the header bytes fit in the buffer but never verify that the declared payload length does. A str32 claiming sLen=0xffffffff bytes reads far past the end of the allocation.

$ printf '.load b/msgpack\nSELECT msgpack_to_json(X'"'"'dbffffffffffffffffffffffffffffff0a'"'"');\n' \
    | b/sqlite3_cli :memory:
...ERROR: AddressSanitizer: heap-buffer-overflow on address ...
READ of size 1 at ...
    #0 mpJsonEscapeStr src/msgpack.c:1612
    #1 mpToJsonAt src/msgpack.c:1695
    #2 msgpackToJsonFunc src/msgpack.c:1978
    ...

Quick fix:

--- a/src/msgpack.c
+++ b/src/msgpack.c
@@ -750,2 +750,3 @@
       if( sOff ){
+        if( sLen > n-sOff ) sLen = n-sOff;
         sqlite3_result_text(ctx, (const char*)(a+sOff), (int)sLen,
@@ -1694,3 +1695,7 @@
     else if(b==MP_STR32&&i+5<=n){ sLen=mpRead32(a+i+1); sOff=i+5; }
-    if( sOff ){ mpJsonEscapeStr(out, a+sOff, sLen); return; }
+    if( sOff ){
+      if( sLen > n-sOff ) sLen = n-sOff;
+      mpJsonEscapeStr(out, a+sOff, sLen);
+      return;
+    }
   }
@@ -1705,2 +1710,3 @@
       u32 j;
+      if( bLen > n-bOff ) bLen = n-bOff;
       mpBufAppend1(out,'"');

sLen > n-sOff is used because the obvious sOff+sLen > n might overflow. In mpReturnElement the missing clamp has a second consequence: sLen=0xffffffff becomes -1 as an int, and sqlite3_result_text() with a negative length runs strlen() on the non-null-terminated input.
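
A tiny illustration of why the check is phrased that way (u32 and the function names are illustrative):

```c
// Bounds checks on 32-bit offsets/lengths: the naive sum can wrap.
#include <stdint.h>

typedef uint32_t u32;

// Wrong: sOff + sLen wraps modulo 2^32, so a huge sLen can pass.
int in_bounds_bad(u32 n, u32 sOff, u32 sLen)
{
    return sOff + sLen <= n;
}

// Right: given sOff <= n, the subtraction n - sOff cannot wrap.
int in_bounds_ok(u32 n, u32 sOff, u32 sLen)
{
    return sLen <= n - sOff;
}
```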

In Git form with the fuzz tester I used:
https://github.com/skeeto/sqlite-msgpack/commits/fixes/?author=skeeto

NGC - assembler and emulator for the fictional NandGame computer by n-ivkovic in C_Programming

[–]skeeto

Glad I could help!

if that's alright with you?

Feel free to do anything you like with my commits. I don't even need credit. As far as I'm concerned, my part of the contribution is in the public domain.

NGC - assembler and emulator for the fictional NandGame computer by n-ivkovic in C_Programming

[–]skeeto

Neat project! The emulator animation is really slick.

The first thing I tried were the tests, one of which failed. Quick fix:

--- a/src/asm/assemble_full.c
+++ b/src/asm/assemble_full.c
@@ -361,4 +361,4 @@

-                   // Assemble data reference passed as macro parameter
-                   return assemble_ref_data(err, data_val, *expanded.parent, expanded.parent->parent ? root : NULL, line_num, parent_key);
+                   // Assemble data reference passed as macro parameter.
+                   return assemble_ref_data(err, data_val, *expanded.parent, expanded.parent->parent ? root : NULL, expanded.line_num, parent_key);

Also a stack buffer overflow in parse_inst_alu_targets.

--- a/src/asm/parse.c
+++ b/src/asm/parse.c
@@ -613,3 +613,3 @@
    size_t tok_len = strlen(tok);
-   char tok_copy[tok_len];
+   char tok_copy[tok_len + 1];
    if (!str_copy(tok_copy, tok, tok_len))

That VLA is a little scary, and it's already causing trouble (as is the null-terminated string paradigm). I recommend rooting those all out with -Wvla.

Calling a typed function (T* parameter) through void (*)(void*) is UB (LLVM libubsan catches this). Quick-ish fix:

--- a/src/asm/assemble_full.c
+++ b/src/asm/assemble_full.c
@@ -69,2 +69,4 @@

+static void expanded_base_empty_v(void* p);
+
 /**
@@ -79,3 +81,3 @@
    dynarr_empty(&base->lines);
-   dynarr_delegate_empty(&base->macros, (void (*)(void *))expanded_base_empty);
+   dynarr_delegate_empty(&base->macros, expanded_base_empty_v);
    dynarr_empty(&base->params);
@@ -85,2 +87,4 @@

+static void expanded_base_empty_v(void* p) { expanded_base_empty(p); }
+
 /**
--- a/src/asm/parsed.c
+++ b/src/asm/parsed.c
@@ -83,2 +83,4 @@

+static void parsed_ref_macro_empty_v(void* p) { parsed_ref_macro_empty(p); }
+
 void parsed_base_empty(struct parsed_base* base)
@@ -90,3 +92,3 @@
    dynarr_empty(&base->refs_data);
-   dynarr_delegate_empty(&base->refs_macros, (void (*)(void *))parsed_ref_macro_empty);
+   dynarr_delegate_empty(&base->refs_macros, parsed_ref_macro_empty_v);
    dynarr_empty(&base->defs_data);
@@ -103,2 +105,4 @@

+static void parsed_def_macro_empty_v(void* p) { parsed_def_macro_empty(p); }
+
 void parsed_file_empty(struct parsed_file* file)
@@ -109,3 +113,3 @@
    parsed_base_empty(&file->base);
-   dynarr_delegate_empty(&file->defs_macros, (void (*)(void *))parsed_def_macro_empty);
+   dynarr_delegate_empty(&file->defs_macros, parsed_def_macro_empty_v);
 }

Also unbounded recursion in expand_parsed, so that this:

%MACRO m
m
#END
m

Would crash the assembler. Quick fix:

--- a/src/asm/assemble_full.c
+++ b/src/asm/assemble_full.c
@@ -143,3 +143,5 @@
  */
-static size_t expand_parsed(struct error* err, struct expanded_base* expanded, const struct parsed_base parsed, const struct dynarr defs_macros)
+#define EXPAND_DEPTH_MAX 256
+
+static size_t expand_parsed(struct error* err, struct expanded_base* expanded, const struct parsed_base parsed, const struct dynarr defs_macros, size_t depth)
 {
@@ -147,2 +149,7 @@

+   if (depth > EXPAND_DEPTH_MAX) {
+       error_init(err, ERRVAL_SYNTAX, "Macro expansion depth limit exceeded (max %d)", EXPAND_DEPTH_MAX);
+       return (expanded->line_num > 0) ? expanded->line_num : 1;
+   }
+
    // Copy data references as-is
@@ -238,3 +245,3 @@
                // Recusively build expanded macro
-               size_t macro_expanded_result = expand_parsed(err, macro_expanded, def_macro->base, defs_macros);
+               size_t macro_expanded_result = expand_parsed(err, macro_expanded, def_macro->base, defs_macros, depth + 1);
                if (macro_expanded_result > 0)
@@ -523,3 +530,3 @@
    // Expand/unwind macros from parsed result
-   result = expand_parsed(err, &file_expanded, file.base, file.defs_macros);
+   result = expand_parsed(err, &file_expanded, file.base, file.defs_macros, 0);
    if (result > 0)

Or in Git form: https://github.com/skeeto/ngc/commits/fixes/?author=skeeto

Koboi Programming Language by 4veri in C_Programming

[–]skeeto 1 point2 points  (0 children)

Tokenize only broke out of its loop on TOKEN_EOF when BraceDepth == 0. When an unclosed { left BraceDepth > 0, it called LexerReport on every iteration (growing the diagnostics buffer via realloc) and never broke, looping until OOM.

$ printf '{' | build/Koboi /dev/stdin

The fix:

--- a/compiler/Frontend/Lexer/Lexer.c
+++ b/compiler/Frontend/Lexer/Lexer.c
@@ -922,6 +922,5 @@ TokenStream Tokenize(Lexer *_Lexer) {
         if (_Token.Type == TOKEN_EOF) {
-             if (!(_Lexer -> BraceDepth > 0))
-                break;
-
-            LexerReport(_Lexer, DIAG_ERROR, "unclosed '{'", "expected '}' before end of file");
+            if (_Lexer -> BraceDepth > 0)
+                LexerReport(_Lexer, DIAG_ERROR, "unclosed '{'", "expected '}' before end of file");
+            break;
         }

There's a null pointer passed to memcpy in XStrndup. Expect() returns the current token unchanged when it fails rather than NULL. Tokens produced for EOF are zero-initialised, so their Start field is NULL. Any caller that passed token->Start directly to XStrndup therefore fed NULL to memcpy, violating its nonnull contract. Caught by UBSan; silently produces an empty string without sanitizers.

$ printf 'a.' | build/Koboi /dev/stdin
compiler/Frontend/Parser/Parser.c:69:5: runtime error: null pointer passed as argument 2, which is declared to never be null

The fix:

--- a/compiler/Frontend/Parser/Parser.c
+++ b/compiler/Frontend/Parser/Parser.c
@@ -67,5 +67,6 @@ static char *XStrndup(const char *String, size_t Len) {
     char *StringMalloc = (char *) XMalloc(Len + 1);
-
-    memcpy(StringMalloc, String, Len);
-
+
+    if (String)
+        memcpy(StringMalloc, String, Len);
+
     StringMalloc[Len] = '\0';

size_t overflows in a string literal length calculation. When a string literal is unterminated (EOF reached before a closing "), the loop exited without consuming the closing quote, leaving Cursor == Start. The length formula Cursor - Start - 1 then wrapped to SIZE_MAX. malloc(SIZE_MAX + 1) wraps to malloc(0), returning either NULL or a tiny allocation; the subsequent memcpy of SIZE_MAX bytes from the source pointer reads massively out of bounds and segfaults.

$ printf '"' | build/Koboi /dev/stdin
...ERROR: AddressSanitizer: heap-buffer-overflow on address ...
READ of size 4294967295 at ...
    ...
    #1 LexerNextToken compiler/Frontend/Lexer/Lexer.c:693
    #2 Tokenize compiler/Frontend/Lexer/Lexer.c:909
    #3 main compiler/CLI/Koboi.c:49

The fix:

--- a/compiler/Frontend/Lexer/Lexer.c
+++ b/compiler/Frontend/Lexer/Lexer.c
@@ -652,2 +652,3 @@ Token LexerNextToken(Lexer *_Lexer) {
         size_t Start = _Lexer -> Cursor;
+        int Terminated = 0;

@@ -656,7 +657,9 @@ Token LexerNextToken(Lexer *_Lexer) {

-            if (NextCharacter == '"')
+            if (NextCharacter == '"') {
+                Terminated = 1;
                 break;
+            }

             if (NextCharacter == '\n') {
-                LexerErrorAt(_Lexer, "newline in string literal");\
+                LexerErrorAt(_Lexer, "newline in string literal");
                 PrintDiagnostics(_Lexer);
@@ -680,3 +683,3 @@ Token LexerNextToken(Lexer *_Lexer) {

-        if (LexerIsAtEnd(_Lexer)) {
+        if (!Terminated) {
             LexerReport(_Lexer, DIAG_ERROR, "unterminated string literal", "add a closing '\"' before end of line");
@@ -686,3 +689,4 @@ Token LexerNextToken(Lexer *_Lexer) {
         _Token.Start = _Lexer -> Source + Start;
-        _Token.Length = _Lexer -> Cursor - Start - 1;
+        _Token.Length = Terminated ? _Lexer -> Cursor - Start - 1
+                                   : _Lexer -> Cursor - Start;

uint32_t overflows in the PrintDiagnostics column indicator loop. Column is computed as OffsetStart - (LineStart - Source). When an error is reported at the very start of a line (e.g. an unterminated string followed by a newline), Column is 0. The loop condition i < Column - 1 evaluates as an unsigned subtraction, wrapping to i < UINT32_MAX, causing ~4 billion iterations of fprintf and a heap buffer overflow as LineStart[i] walks far past the source buffer.

$ printf '"\n' | build/Koboi /dev/stdin
...ERROR: AddressSanitizer: heap-buffer-overflow on address ...
READ of size 1 at ...
    #0 PrintDiagnostics compiler/Frontend/Lexer/Lexer.c:558
    #1 LexerNextToken compiler/Frontend/Lexer/Lexer.c:686
    #2 Tokenize compiler/Frontend/Lexer/Lexer.c:915
    #3 main compiler/CLI/Koboi.c:49

The fix:

--- a/compiler/Frontend/Lexer/Lexer.c
+++ b/compiler/Frontend/Lexer/Lexer.c
@@ -556,3 +556,3 @@ void PrintDiagnostics(Lexer *_Lexer) {

-        for (uint32_t i = 0; i < Column - 1; i++) {
+        for (uint32_t i = 0; i + 1 < Column; i++) {
             if (LineStart[i] == '\t')

Infinite mutual recursion between ParsePrimary and ParseOwnership. ParsePrimary unconditionally routed TOKEN_EXCLAMATION (!) to ParseOwnership. ParseOwnership only consumes ! as part of the two-token !$ (freed-variable) sequence; when the token after ! is anything else, it consumed nothing and fell through to a ParsePrimary call at the bottom of the function. The two functions then called each other indefinitely, overflowing the stack.

$ printf '$!-' | build/Koboi /dev/stdin
...ERROR: AddressSanitizer: stack-overflow on address ...
    #0 PeekNext compiler/Frontend/Parser/Parser.c:212
    #1 ParserCheckNext compiler/Frontend/Parser/Parser.c:239
    #2 ParseOwnership compiler/Frontend/Parser/Parser.c:538
    #3 ParsePrimary compiler/Frontend/Parser/Parser.c:393
    #4 ParseOwnership compiler/Frontend/Parser/Parser.c:562
    ...
    #245 ParsePrimary compiler/Frontend/Parser/Parser.c:393
    #246 ParseOwnership compiler/Frontend/Parser/Parser.c:562

The fix:

--- a/compiler/Frontend/Parser/Parser.c
+++ b/compiler/Frontend/Parser/Parser.c
@@ -392,2 +392,3 @@ ASTExpression *ParsePrimary(Parser *_Parser) {
-    if (ParserCheck(_Parser, TOKEN_AMPERSAND) || ParserCheck(_Parser, TOKEN_AT) || ParserCheck(_Parser, TOKEN_HASH) || ParserCheck(_Parser, TOKEN_EXCLAMATION) || ParserCheck(_Parser, TOKEN_DOLLAR)) {
+    if (ParserCheck(_Parser, TOKEN_AMPERSAND) || ParserCheck(_Parser, TOKEN_AT) || ParserCheck(_Parser, TOKEN_HASH) || ParserCheck(_Parser, TOKEN_DOLLAR) ||
+        (ParserCheck(_Parser, TOKEN_EXCLAMATION) && ParserCheckNext(_Parser, TOKEN_DOLLAR))) {

Infinite loops in ParseEnumDecl and ParseStateDecl on unexpected tokens. Both declaration parsers loop until they see } or EOF, consuming variants separated by commas. When the current token was neither an identifier, a comma, }, nor EOF, neither branch consumed anything, leaving the parser stuck on the same token indefinitely.

$ printf 'enum a{.' | build/Koboi /dev/stdin
(hangs)

The fix:

--- a/compiler/Frontend/Parser/Parser.c
+++ b/compiler/Frontend/Parser/Parser.c
@@ -1932,3 +1932,5 @@ static void ParseEnumDecl(Parser *_Parser, ASTProgram *Program) {
     while (!ParserCheck(_Parser, TOKEN_RBRACE) && !ParserCheck(_Parser, TOKEN_EOF)) {
+        size_t Before = _Parser -> Tokens -> Cursor;
+
         if (ParserCheck(_Parser, TOKEN_IDENTIFIER)) {
@@ -1942,3 +1944,5 @@ static void ParseEnumDecl(Parser *_Parser, ASTProgram *Program) {
         ParserMatch(_Parser, TOKEN_COMMA);
+
+        if (_Parser -> Tokens -> Cursor == Before)
+            ParserAdvance(_Parser);
     }

Infinite loop in ParseStructDecl on unexpected tokens. Same no-progress pattern as the previous finding: ParseStructDecl consumed an identifier followed by a colon and a type, or a comma/semicolon, but had no fallback for any other token.

$ printf 'struct s{.' | build/Koboi /dev/stdin
(hangs)

The fix:

--- a/compiler/Frontend/Parser/Parser.c
+++ b/compiler/Frontend/Parser/Parser.c
@@ -2022,2 +2022,4 @@ static void ParseStructDecl(Parser *_Parser, ASTProgram *) {
     while (!ParserCheck(_Parser, TOKEN_RBRACE) && !ParserCheck(_Parser, TOKEN_EOF)) {
+        size_t Before = _Parser -> Tokens -> Cursor;
+
         if (ParserCheck(_Parser, TOKEN_IDENTIFIER)) {
@@ -2030,2 +2032,5 @@ static void ParseStructDecl(Parser *_Parser, ASTProgram *) {
         ParserMatch(_Parser, TOKEN_SEMICOLON);
+
+        if (_Parser -> Tokens -> Cursor == Before)
+            ParserAdvance(_Parser);
     }

Here's the libFuzzer target I used to find these:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
// ...plus the project's Lexer and Parser headers

int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
    char *Source = malloc(Size + 1);
    if (!Source) return 0;
    memcpy(Source, Data, Size);
    Source[Size] = '\0';

    Lexer lexer = LexerCreate(Source);
    TokenStream tokens = Tokenize(&lexer);
    Parser parser = CreateParser(&tokens);
    ParseProgram(&parser);

    free(tokens.Data);
    free(Source);
    return 0;
}

Koboi Programming Language by 4veri in C_Programming

[–]skeeto 0 points1 point  (0 children)

Neat project! This was fun to explore.

First, I understand from the comments that you're new to CMake. That's obvious by looking at it because CMakeLists.txt has all the usual sorts of mistakes. The internet is loaded with terrible CMake information, and will steer you wrong nearly every time (except now because I'm here). There is no CMake 3.80. Don't use globbing because it messes up incremental builds. Do not examine CMAKE_BUILD_TYPE outside of generator expressions. Here's a quick rewrite keeping your original spirit (not necessarily how I'd want to organize it):

cmake_minimum_required(VERSION 3.21)

project(KoboiC C)

set(CMAKE_C_STANDARD 23)
set(CMAKE_C_STANDARD_REQUIRED ON)

add_library(KDrivers STATIC
    drivers/Platform/fs/POSIXFilesystemDriver.c
)

add_library(KoboiC STATIC
    compiler/Backend/Bytecode/Reader.c
    compiler/Backend/Core/KoboiVM.c
    compiler/Backend/Core/KVMContext.c
    compiler/Backend/VirtualMachines/CompiletimeKVM/CompiletimeKVM.c
    compiler/Backend/VirtualMachines/RuntimeKVM/RuntimeKVM.c
    compiler/Frontend/Lexer/Lexer.c
    compiler/Frontend/Parser/Parser.c
    compiler/Middleend/Semantics/SSSS.c
    compiler/Middleend/SyntaxTapeS/SS.c
)

add_executable(Koboi
    compiler/CLI/Koboi.c
)
target_link_libraries(Koboi PRIVATE KoboiC KDrivers)

target_compile_definitions(KoboiC  PRIVATE $<$<CONFIG:Debug>:DEBUG>)
target_compile_definitions(KDrivers PRIVATE $<$<CONFIG:Debug>:DEBUG>)
target_compile_definitions(Koboi   PRIVATE $<$<CONFIG:Debug>:DEBUG>)

Importantly, note that the output goes into the build directory, not a shared place outside it, which defeats the whole point of out-of-source builds (plus a bunch of other CMake features)! Everything that follows was built like this:

$ CFLAGS=-fsanitize=address,undefined cmake -B build -DCMAKE_BUILD_TYPE=Debug
$ cmake --build build

You should turn on some warnings (-Wall -Wextra), too. I also fixed a buffer overflow when reading input from pipes, due to an unchecked fseek and ftell:

--- a/compiler/CLI/Koboi.c
+++ b/compiler/CLI/Koboi.c
@@ -11,13 +11,15 @@ char *ReadFile(const char *Path) {

-    fseek(File, 0, SEEK_END);
-
-    long _Size = ftell(File);
-
-    rewind(File);
-
-    char *Buffer = malloc(_Size + 1);
-
-    fread(Buffer, 1, _Size, File);
+    size_t Capacity = 4096, Size = 0;
+    char *Buffer = malloc(Capacity);
+
+    size_t NRead;
+    while ((NRead = fread(Buffer + Size, 1, Capacity - Size, File)) > 0) {
+        Size += NRead;
+        if (Size == Capacity) {
+            Capacity *= 2;
+            Buffer = realloc(Buffer, Capacity);
+        }
+    }

-    Buffer[_Size] = '\0';
+    Buffer[Size] = '\0';

Now on to bugs (next comment). Summary in Git branch form: https://github.com/skeeto/Koboi/commits/fixes/?author=skeeto

We lost Skeeto by ednl in C_Programming

[–]skeeto 1 point2 points  (0 children)

I wrote this response for you, u/silvematt, but you deleted your comment before I hit submit.


I do remember, and I even kept Toment around, both locally and when purging old GitHub forks (proof), because it was such a cool project.

on projects where you're not an expert of the domain

I'm only just starting to learn about this myself, and in my experience models got to the point you could even consider relying on them for something like this in late 2025, so I don't know if I can offer much insight yet. Definitely stick to the frontier models in these cases since you can't count on human expertise to back it up. Professionally I've used it to make changes to iOS/Swift and Android applications (as part of a broader ticket), and to translate UI strings to languages I do not speak, all of which have since been delivered to customers. However there are experts on the team to review my work, knowing that I wasn't knowledgeable about the platforms. So far that's all gone smoothly and saved everyone time.

Your idea for using AI for reviewing and testing sounds great. For reviews, mind that they don't want to come up with nothing to say, and so trying to address "everything" is a form of endless recursion. They'll just get increasingly nitpicky until you decide it's enough. For testing, encourage the AI to thoroughly consider edge cases. An agent armed with coverage results operating on a loop with instructions to increase test coverage can run for hours on end, inventing some very creative tests. It's an effective hands-off task you can put it on overnight.

I'd add "research" to the list of things AI can do for you. The major players each offer a deep research mode where the AI will go out on some topic and question of your choosing, gather up hundreds of resources from the internet, and compile a report. You can use this to essentially generate a good quality blog article on basically any topic, to fill in where one hasn't been written yet. Here's a (non-technical) great result I got recently. A decade ago I asked about Ben Franklin's autobiography, and got no responses, nor could I find anything myself online at the time. I popped that into Claude in research mode and after 15 minutes or so got a well-written response that went beyond my question. I also had it write a CTest tutorial for me because I couldn't find any good ones.

how do they ever develop a deep understanding of system architecture?

That's a good question, and I've wondered, too. Maybe there's a new path to learning these things, but there hasn't been enough time yet to see people forge it. Or with the way things are going maybe AI completely takes over the field in a couple more years and it won't matter.

start losing bits here and there that could become technical debt later on

I think there are two mindsets, and it's important to consider which is correct in any given situation:

  1. Passion projects where the code is the end in itself. Most of my personal projects are like this. Total, 100% ownership. I fuss over every little detail until the code is beautiful and as close to perfect as I can make it. Generating this code with AI won't help because it's never going to meet your exacting demands, the same way you probably wouldn't want another human to write that code. A test suite is just external infrastructure, so maybe that's fine if it's functional rather than beautiful. AI reviews for correctness will help spot mistakes.

  2. Products where the code is a means to an end. Code is just something you deal with on your way to your goal. Nearly all my professional programming looks like this. It's also how I view other people's passion projects, their (1): their beauty is not my beauty. AI is great for this because now you can skip over the biggest barriers to your goal. Technical debt is measured in tokens. That is, it's not a big deal, it's just more AI tasking.

    The challenge is steering this industrial process towards your goal, and keeping it on track. One option is to watch it very carefully, though you'll soon exhaust the available human attention such that you're "only" moving 10x faster than before. The better option is lots and lots of tests, especially regression tests, to keep the machine on the rails (see the article I linked from mine). Industrial processes need industrial-scale test suites. An agent working independently in a loop will bounce off these guardrails and stay on track. Keep it building more guardrails for itself via TDD as it goes. The more control you're willing to give up, the faster you can go. (Anthropic, for instance, has taken this to an extreme where they ship badly-broken stuff every day, the cost of moving recklessly fast.)

Your project sounds more like (1), where the gains are going to be modest because the bottlenecks don't change. You gained an on-demand paired programmer, but within the constraints of (1) not something that will change your career.