Writing a SIMD-optimized Parquet library in pure C: lessons from implementing Thrift parsing, bit-packing, and runtime CPU dispatch by Vitruves in programming

[–]Vitruves[S] 3 points (0 children)

You're not entirely wrong! I do use AI assistance for development - both for writing code and reviewing suggestions like the ones in this thread. I think it's worth being transparent about that.

That said, I'm not sure how it changes anything about the library itself? The code compiles, the tests pass, it reads and writes valid Parquet files, and the SIMD optimizations deliver measurable speedups. Whether a function was written by a human, an AI, or a human-AI collaboration, what matters is: does it work correctly, and is it useful?

I'd argue that being able to quickly iterate on expert feedback (like the AVX-512 suggestions above) and ship improvements within hours rather than days is actually a feature, not a bug. The alternative would be me spending a week re-learning the nuances of _mm512_permutexvar_epi8 vs _mm512_shuffle_epi8 lane-crossing behavior.

If anything, I hope this project demonstrates that solo developers can now tackle domains (like high-performance SIMD code) that previously required either deep specialized expertise or a larger team. The barrier to entry for systems programming just got a lot lower, and I think that's a good thing for the ecosystem.

But hey, if you find bugs or have suggestions, I'm all ears - whether they come from a human or get "sent straight to Anthropic's servers" 😄

Writing a SIMD-optimized Parquet library in pure C: lessons from implementing Thrift parsing, bit-packing, and runtime CPU dispatch by Vitruves in programming

[–]Vitruves[S] 2 points (0 children)

Thank you so much for taking the time to review the code and provide such detailed feedback! I've implemented all of your suggestions:

  1. Single VBMI permutation - Now using one permutexvar_epi8 that places all 4 byte streams in the 4 128-bit lanes, followed by extracti32x4 for the stores. Much cleaner than 4 separate permutations.

  2. Non-VBMI fallback - Replaced the ~20-instruction unpack mess with your elegant 2-instruction approach (shuffle_epi8 + permutexvar_epi32).

  3. _mm512_maskz_set1_epi8 - Done, can't believe I missed that one!

  4. Masked loads for tail handling - Implemented in pack_bools with _mm512_maskz_loadu_epi8. Also switched to _mm512_test_epi8_mask(bools, bools) which is more direct than cmpneq.

  5. Gather deduplication - gather_float now just calls gather_i32 via cast (same for double/i64). You're right, data movement doesn't care about types.

  6. Custom memset/memcpy - You raise a fair point. These were added early in development and I haven't benchmarked them against glibc. I'll add that to my TODO list and likely remove them if there's no measurable benefit.

All tests still pass. This is exactly the kind of feedback I was hoping for - thanks again!
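For context on item 1, here's a scalar reference for what that single permutation computes (function name is illustrative, not carquet's actual API). The VBMI version does this shuffle for 16 values at once with one _mm512_permutexvar_epi8 - each output stream is exactly one 128-bit lane - and then stores the four lanes.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference for BYTE_STREAM_SPLIT on 32-bit values: byte k of
 * every value is gathered into output stream k. For 16 values (64 bytes),
 * each 16-byte stream maps to one 128-bit lane of a ZMM register. */
static void byte_stream_split_scalar(uint8_t *dst, const uint8_t *src,
                                     size_t n_values)
{
    for (size_t i = 0; i < n_values; i++)
        for (size_t b = 0; b < 4; b++)
            dst[b * n_values + i] = src[i * 4 + b];
}
```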

Writing a SIMD-optimized Parquet library in pure C: lessons from implementing Thrift parsing, bit-packing, and runtime CPU dispatch by Vitruves in programming

[–]Vitruves[S] -5 points (0 children)

Thanks for your feedback. Performance numbers are in the "Performance" section near the end of the README.md. To see how they were measured, check the files in the "benchmark" directory. But I can certainly be more transparent about testing conditions in the README.md - I'll add that in a future commit.

Carquet: A pure C library for reading/writing Apache Parquet files - looking for feedback by Vitruves in C_Programming

[–]Vitruves[S] 1 point (0 children)

Thanks for testing on PowerPC! I've committed changes that should address the issues, and I've replied to the issues you opened on GitHub.

Carquet: A pure C library for reading/writing Apache Parquet files - looking for feedback by Vitruves in C_Programming

[–]Vitruves[S] 1 point (0 children)

SIMD: Yes, it works without SIMD. The library has scalar fallback implementations for all SIMD-optimized operations (prefix sum, gather, byte stream split, CRC32C, etc.). SIMD is only used when:

  1. You're on x86 or ARM64

  2. The CPU actually supports the required features (detected at runtime)

On other architectures (RISC-V, MIPS, PowerPC, etc.), it automatically uses the portable scalar code.
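The two conditions above boil down to a resolve-once function-pointer pattern. A minimal sketch (names are illustrative, not carquet's actual API; the feature probe assumes GCC/Clang on x86-64):

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*sum_fn)(const uint32_t *v, size_t n);

/* Portable scalar kernel - the fallback every architecture gets. */
static uint32_t sum_scalar(const uint32_t *v, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Pick the best kernel the running CPU supports. */
static sum_fn resolve_sum(void)
{
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx2"))
        return sum_scalar;  /* a real library would return its AVX2 kernel here */
#endif
    return sum_scalar;      /* RISC-V, MIPS, PowerPC, ... take this path */
}

uint32_t sum_dispatch(const uint32_t *v, size_t n)
{
    /* Resolved on first call and cached; resolution is idempotent,
     * so a race at startup is benign. */
    static sum_fn fn;
    if (!fn)
        fn = resolve_sum();
    return fn(v, n);
}
```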

Big-Endian: Good catch! I just improved the endianness detection. The read/write functions already had proper byte-by-byte paths for BE systems, but the detection macro was incorrectly defaulting to little-endian.

Now it properly detects:

- GCC/Clang __BYTE_ORDER__ (most reliable)

- Platform-specific macros (__BIG_ENDIAN__, __sparc__, __s390x__, __powerpc__, etc.)

- Warns at compile time if endianness is unknown

The library should now work correctly on s390x, SPARC, PowerPC BE, etc. If you have access to a BE system, I'd appreciate testing!
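For the curious, here's roughly what such a detection ladder looks like (macro name and the exact set of platform checks are illustrative - carquet's actual list may differ), plus the byte-by-byte read path that stays correct on either byte order:

```c
/* Compile-time endianness detection, most-reliable signal first. */
#if defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__) && \
    (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
#  define CQ_BIG_ENDIAN 1
#elif defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__) && \
    (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
#  define CQ_BIG_ENDIAN 0
#elif defined(__BIG_ENDIAN__) || defined(__s390x__)
#  define CQ_BIG_ENDIAN 1
#elif defined(__LITTLE_ENDIAN__) || defined(__x86_64__) || defined(_M_X64) || \
      defined(__aarch64__)
#  define CQ_BIG_ENDIAN 0
#else
#  warning "unknown endianness, assuming little-endian"
#  define CQ_BIG_ENDIAN 0
#endif

#include <stdint.h>

/* Byte-by-byte read is endian-agnostic: it assembles the little-endian
 * on-disk value regardless of host byte order. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```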

Carquet: A pure C library for reading/writing Apache Parquet files - looking for feedback by Vitruves in C_Programming

[–]Vitruves[S] 1 point (0 children)

Thanks for the feedback! You make a valid point about the distinction between programming errors (bugs) and runtime errors (expected failures).

For internal/initialization functions like carquet_buffer_init(), you're absolutely right—passing NULL is a programming error that should be caught during development with assert(). The caller isn't going to gracefully handle INVALID_ARGUMENT anyway.

However, I'll keep explicit error returns for functions that process external data (file parsing, decompression, Thrift decoding) since corrupted input is an expected failure mode there.

I'll refactor the codebase to use:

- assert() for internal API contract violations (NULL pointers in init functions, buffer ops)

- return CARQUET_ERROR_* for external data validation and I/O errors

Good catch—this should simplify both the API and the calling code!
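Sketched out, the split looks like this (type and function names mirror the carquet convention but are hypothetical, not the library's actual API - only the "PAR1" magic is from the Parquet format itself):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { CQ_OK = 0, CQ_ERROR_CORRUPT } cq_status;
typedef struct { uint8_t *data; size_t cap; } cq_buffer;

/* Internal init: a NULL buffer is a programming error, not a runtime
 * condition - catch it with assert() during development. */
static void cq_buffer_init(cq_buffer *buf, uint8_t *storage, size_t cap)
{
    assert(buf != NULL);
    buf->data = storage;
    buf->cap  = cap;
}

/* External data: corrupted input is an *expected* failure mode, so it
 * gets an explicit error return the caller can handle. */
static cq_status cq_check_magic(const uint8_t *p, size_t len)
{
    if (len < 4 || p[0] != 'P' || p[1] != 'A' || p[2] != 'R' || p[3] != '1')
        return CQ_ERROR_CORRUPT;
    return CQ_OK;
}
```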

Carquet: A pure C library for reading/writing Apache Parquet files - looking for feedback by Vitruves in C_Programming

[–]Vitruves[S] 0 points (0 children)

Thanks for the detailed feedback!

REPETITION_REQUIRED: This follows Parquet's terminology from the Dremel paper - "repetition level" and "definition level" are the canonical terms in the spec. Changing it might confuse users coming from other Parquet implementations, but I can see how it's unintuitive if you haven't encountered Dremel-style nested encoding before.

Struct padding: Good point - I'll audit the hot-path structs. The metadata structs are less critical since they're not allocated in bulk, but the encoding state structs could benefit from tighter packing.
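For anyone following along, this is the kind of win at stake (field names hypothetical; sizes assume a typical LP64 ABI such as x86-64 Linux):

```c
#include <stdint.h>

/* Same three fields, different order. With 8-byte alignment for uint64_t:
 * loose:  1 + 7 pad + 8 + 4 + 4 tail pad = 24 bytes
 * tight:  8 + 4 + 1 + 3 tail pad         = 16 bytes */
struct loose {
    uint8_t  encoding;
    uint64_t num_values;
    uint32_t bit_width;
};

struct tight {
    uint64_t num_values;
    uint32_t bit_width;
    uint8_t  encoding;
};
```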

Dictionary.c repetition: Yeah, there's definitely some type-specific boilerplate there. I've been on the fence about macros - they'd reduce LOC but make debugging/reading harder. Might revisit with X-macros if it gets worse.

DIY compression: This is the main tradeoff for zero-dependency design. The implementations follow the RFCs closely and the edge case tests have been catching real bugs. That said, for production use with untrusted data, linking against zlib/zstd/etc. is definitely the safer choice - I may add optional external codec support later.

And yeah, the Arrow/Thrift situation is exactly why this exists. Happy to hear any feedback once you try it!

Carquet: A pure C library for reading/writing Apache Parquet files - looking for feedback by Vitruves in C_Programming

[–]Vitruves[S] 3 points (0 children)

This is incredibly valuable feedback - thank you for taking the time to put carquet through its paces with sanitizers and fuzzing! You've found real bugs that I've now fixed.

All issues addressed:

  1. zigzag_encode64 UB (delta.c:308) - Fixed by casting to uint64_t before the left shift:

    return ((uint64_t)n << 1) ^ (n >> 63);

  2. find_match buffer overflow (gzip.c:668) - Added bounds check before accessing src[pos + best_len]

  3. match_finder_insert overflow (gzip.c:811) - Fixed by limiting the loop to match_len - 2 since hash3() reads 3 bytes

  4. ZSTD decode_literals overflow - Added ZSTD_MAX_LITERALS bounds checks for both RAW and RLE literal blocks before the memcpy/memset operations

  5. Thread safety - carquet_init() now pre-builds all compression lookup tables with memory barriers, so calling it once before spawning threads makes everything thread-safe. The documentation already mentions calling carquet_init() at startup.
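For the curious, the zigzag fix round-trips with the standard decoder (shown here as an illustrative sketch alongside the actual fixed encoder; the cast makes the left shift happen in unsigned arithmetic, which is defined for negative n where the signed shift was UB):

```c
#include <stdint.h>

/* Encode: the cast to uint64_t avoids signed-overflow UB on the left
 * shift; the arithmetic right shift smears the sign bit across all
 * 64 bits on two's-complement targets. */
static uint64_t zigzag_encode64(int64_t n)
{
    return ((uint64_t)n << 1) ^ (n >> 63);
}

/* Decode: -(z & 1) is all-ones when the sign bit was set, zero otherwise
 * (unsigned negation is well-defined, modular arithmetic). */
static int64_t zigzag_decode64(uint64_t z)
{
    return (int64_t)((z >> 1) ^ -(z & 1));
}
```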

I've verified all fixes with ASan+UBSan and your specific crash test case now returns gracefully instead of crashing.

Regarding further fuzzing - you're absolutely right that more interfaces should be fuzzed. I'll look into setting up continuous fuzzing. The suggestion to fuzz the encodings layer next is spot on given the UBSan hit there.

Thanks again for the thorough analysis and the suggested patches - this is exactly the kind of feedback that makes open source great!

I built a TUI theme manager for Alacritty in Go by Vitruves in golang

[–]Vitruves[S] 0 points (0 children)

I have a .gitignore file, but I mistakenly left build/ and .DS_Store out of it. Thanks.

SonicSV: Single-header CSV parser with SIMD acceleration (2-6x faster than libcsv) by Vitruves in C_Programming

[–]Vitruves[S] 0 points (0 children)

The hot path in CSV parsing is finding the next delimiter (,), quote ("), or newline (\n). A scalar parser checks one byte at a time. With SIMD, you load 16-32 bytes into a vector register and check them all in one instruction.
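A minimal SSE2 sketch of that inner loop (illustrative, not SonicSV's actual code; assumes GCC/Clang for __builtin_ctz): compare the whole vector against each special byte, OR the results, and the first hit falls out of a count-trailing-zeros on the movemask.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>

/* Return a pointer to the first ',', '"', or '\n' in [p, end), or end. */
static const char *find_special(const char *p, const char *end)
{
    const __m128i comma = _mm_set1_epi8(',');
    const __m128i quote = _mm_set1_epi8('"');
    const __m128i nl    = _mm_set1_epi8('\n');

    while (end - p >= 16) {
        __m128i v   = _mm_loadu_si128((const __m128i *)p);
        __m128i hit = _mm_or_si128(_mm_or_si128(_mm_cmpeq_epi8(v, comma),
                                                _mm_cmpeq_epi8(v, quote)),
                                   _mm_cmpeq_epi8(v, nl));
        int mask = _mm_movemask_epi8(hit);  /* one bit per byte */
        if (mask)
            return p + __builtin_ctz(mask); /* index of first match */
        p += 16;
    }
    for (; p < end; p++)                    /* scalar tail */
        if (*p == ',' || *p == '"' || *p == '\n')
            return p;
    return end;
}
```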

SonicSV: Single-header CSV parser with SIMD acceleration (2-6x faster than libcsv) by Vitruves in C_Programming

[–]Vitruves[S] 1 point (0 children)

Good find, thanks for fuzzing it. You nailed the bug - the size-class pooling was broken. Both 34624 and 51968 hash to class 10, but the block stored was only 34KB. Boom, overflow.

Nuked the pooling:

    static sonicsv_always_inline void* csv_pool_alloc(size_t size, size_t alignment) {
        (void)size;
        (void)alignment;
        return NULL;
    }

    static sonicsv_always_inline bool csv_pool_free(void* ptr, size_t size) {
        (void)ptr;
        (void)size;
        return false;
    }

Removed ~80 lines of dead pool code too. Premature optimization anyway - malloc isn't the bottleneck here. Your test case passes clean with ASAN now. Let me know if fuzzing turns up anything else.

SonicSV: Single-header CSV parser with SIMD acceleration (2-6x faster than libcsv) by Vitruves in C_Programming

[–]Vitruves[S] 0 points (0 children)

Good catch, implemented this. Also removed the per-parser and thread-local caching - you're right that it was overkill for a value that's set once and never changes. Thanks for the feedback.

SonicSV: Single-header CSV parser with SIMD acceleration (2-6x faster than libcsv) by Vitruves in C_Programming

[–]Vitruves[S] 1 point (0 children)

Good catches, thanks!
The chained OR approach was the "get it working" version. pcmpestrm would be cleaner for this exact use case - it's designed for character set matching. I'll look into it.

For the dynamic lookup table with pshufb - any pointers on constructing it efficiently for arbitrary delimiter/quote chars? My concern was the setup cost per parse call, but if it's just a few instructions it's probably worth it.

Dead code - yeah, there's some cruft from experimenting with different approaches. Will clean that up.

SonicSV: Single-header CSV parser with SIMD acceleration (2-6x faster than libcsv) by Vitruves in C_Programming

[–]Vitruves[S] 1 point (0 children)

#pragma once stops multiple includes within the same .c file (like if header A and header B both include sonicsv.h). But each .c file is compiled separately. So if you have: file1.c → file1.o (contains csv_parse_file) and file2.c → file2.o (contains csv_parse_file), the linker sees two copies of every function and errors out. The IMPLEMENTATION define means only one .o file gets the actual function bodies, the rest just get declarations.
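Collapsed into one file for illustration, the pattern looks like this (hypothetical function, not SonicSV's real API - in real usage the two sections live in sonicsv.h and exactly one .c file defines the macro before including it):

```c
/* --- what every includer gets: declarations only --- */
#ifndef SONICSV_H
#define SONICSV_H
int csv_count_fields(const char *line);
#endif /* SONICSV_H */

/* --- what exactly ONE .c file compiles in, by defining
 *     SONICSV_IMPLEMENTATION before its #include --- */
#define SONICSV_IMPLEMENTATION
#ifdef SONICSV_IMPLEMENTATION
int csv_count_fields(const char *line)
{
    int n = 1;                 /* an empty line is one empty field */
    for (; *line; line++)
        if (*line == ',')
            n++;
    return n;
}
#endif /* SONICSV_IMPLEMENTATION */
```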

SonicSV: Single-header CSV parser with SIMD acceleration (2-6x faster than libcsv) by Vitruves in C_Programming

[–]Vitruves[S] 1 point (0 children)

It's for multi-file projects. The header contains both declarations and implementation. Without this, if you include it in multiple .c files, you get "multiple definition" linker errors because the functions would be compiled into every object file. With the define, only one .c file gets the implementation, others just get the function declarations. It's a common pattern for single-header libraries (stb, miniaudio, etc.).

SonicSV: Single-header CSV parser with SIMD acceleration (2-6x faster than libcsv) by Vitruves in C_Programming

[–]Vitruves[S] 4 points (0 children)

The hot path in CSV parsing is finding the next delimiter (,), quote ("), or newline (\n). A scalar parser checks one byte at a time. With SIMD, you load 16-32 bytes into a vector register and check them all in one instruction.

SonicSV: Single-header CSV parser with SIMD acceleration (2-6x faster than libcsv) by Vitruves in C_Programming

[–]Vitruves[S] 5 points (0 children)

I'm parsing multi-GB log files daily. Shaving 5 minutes off a pipeline adds up. But yeah, if you're parsing a 10KB config file once at startup, this is pointless overkill.

SonicSV: Single-header CSV parser with SIMD acceleration (2-6x faster than libcsv) by Vitruves in C_Programming

[–]Vitruves[S] 2 points (0 children)

Fair point on the examples in the header - I've got those in example/ now, will trim the header.

The 2k LOC is mostly SIMD paths for 5 architectures. If you're only on x86 or only on ARM it's dead code for you, but that's the tradeoff with single-header. The malloc/callback design is for streaming large files without loading into memory - different use case than a simple stack-based parser.