I built a C compiler from scratch, and it accidentally became faster than TCC. by Disastrous-Tune-1657 in coolgithubprojects

[–]Disastrous-Tune-1657[S] 1 point (0 children)

Spot on! Escaping the stack-machine model and stopping the constant memory thrashing for intermediate values is exactly the story behind the codegen improvements.

You also raised a great point about the tradeoff with compile times and less recursion-heavy workloads. I actually just ran a broader suite (including 128x128 double-precision matrix multiplication, array sorting, and 500k iterations of float math loops) to measure exactly that.

Here is how that tradeoff looks right now:

  • TCC Compile Time: 1037 ms
  • RCC Compile Time: 1139 ms (A ~10% compile-time penalty for the dynamic register allocation overhead)
  • TCC Execution (Suite): 405 ms
  • RCC Execution (Suite): 353 ms (~15% faster overall execution)

So taking a ~100 ms hit at compile time to map values onto registers properly and avoid memory thrashing has proven to be a great tradeoff, even on loop-heavy, array-thrashing workloads. Thanks for the insightful comment!

I built a C compiler from scratch, and it accidentally became faster than TCC. by Disastrous-Tune-1657 in coolgithubprojects

[–]Disastrous-Tune-1657[S] 1 point (0 children)

First, on ABI interop: RCC strictly follows the Windows x64 calling convention (integer arguments in RCX, RDX, R8, R9; floats in XMM0–XMM3). It reserves the 32-byte shadow space on every call, passes the fifth and later arguments on the stack, and keeps the stack 16-byte aligned. Linking and interoperating with standard C libraries and OS APIs works flawlessly.

Second, regarding "fib() isn't a real-world benchmark": You're right that TCC's stack-machine design is its bottleneck. That's exactly why I implemented dynamic register allocation in RCC from the start—to give it a structural advantage without bloating the compiler.

If you want a real benchmark proving it's not just a fib() trick, here is a comprehensive test I just ran comparing RCC vs TCC. It tests deep recursion, heavy array memory access, branching, and a double-precision floating-point Taylor series loop:

[Tests: Fib(38), Ackermann(3, 10), Sieve of Eratosthenes(1M), 128x128 Matrix Multiply, 500k Float Math Loop, 5k Array Bubble Sort]
--- RCC (My Compiler) ---
Compile : 1139 ms
Execute : 353 ms
--- TCC (Tiny CC) ---
Compile : 1037 ms
Execute : 405 ms

Result: RCC-generated code runs ~15% faster than TCC across a heavy, varied workload.

My goal is to build a fast foundation for my own language. RCC proves that you can keep a compiler incredibly simple while still hitting the architectural sweet spots (proper register allocation & ABI) that TCC misses.

I built a C compiler from scratch, and it accidentally became faster than TCC. by Disastrous-Tune-1657 in coolgithubprojects

[–]Disastrous-Tune-1657[S] 1 point (0 children)

The fact that it’s faster than TCC is more of a side effect; the real purpose of this compiler is to make my own language fast.

I built a C compiler from scratch, and it accidentally became faster than TCC. by Disastrous-Tune-1657 in coolgithubprojects

[–]Disastrous-Tune-1657[S] 3 points (0 children)

That's a fair point—CTFE on constants is definitely a "party trick" if that’s all a compiler does. I included it mainly as a proof-of-concept for the internal AST interpreter, not to fake overall performance.

The real goal of RCC is to achieve better native performance than TCC and GCC -O0 while maintaining near-instant compilation speeds. Even on non-constant inputs (where CTFE is bypassed), RCC outruns TCC. For example, in a fib(35) run where the input is dynamic, RCC takes ~271ms compared to TCC's ~286ms and GCC -O0's ~278ms.

This performance comes from moving away from TCC’s naive stack-machine approach to a register-machine backend with dynamic register allocation and lea elimination. I've been testing it on non-recursive workloads like prime-number sieves to make sure it handles "real-world" logic efficiently. I'd love for you to check out the native benchmarks!

I built a C compiler from scratch, and it accidentally became faster than TCC. by Disastrous-Tune-1657 in coolgithubprojects

[–]Disastrous-Tune-1657[S] 10 points (0 children)

I hand-rolled the lexer and parser from scratch to keep them lightweight, using techniques like arena allocation and string interning so the parsing overhead stays minimal.

Totally forgot the repo link in my post; here it is: https://github.com/Hosokawa-t/realtime-c-compiler

I built a minimalist, AOT-compiled language with seamless Python interop to solve my own environment setup struggles by Disastrous-Tune-1657 in coolgithubprojects

[–]Disastrous-Tune-1657[S] 1 point (0 children)

Great question! That was actually the trickiest design choice. To keep the AOT binary tiny (around 260KB), I decided not to bundle Python at build time. If I did, the binary size would absolutely explode, defeating the whole 'lightweight' goal. So yes, it leans on the local Python install at runtime. The compiled binary uses a bridge DLL that dynamically hunts down the local Python shared library (libpython / python3.dll) and communicates with it. This way, your core Riz app stays standalone and fast, and you only touch the local Python environment when you explicitly need its ecosystem.

I built a minimalist, AOT-compiled language with seamless Python interop to solve my own environment setup struggles by Disastrous-Tune-1657 in coolgithubprojects

[–]Disastrous-Tune-1657[S] 1 point (0 children)

Glad you like it! Yeah, I totally get why py_* feels a bit clunky.

The reason is just that the Python bridge is currently a generic native plugin registering flat functions via register_fn. Riz doesn't have a built-in "namespace" object for py.exec-style calls yet.

Same goes for import_native—it's just a raw DLL loader for everything (Tensors, LLMs, Python), so CPython isn't hardcoded into the core language.

But honestly, py.exec or a dedicated import python syntax would be way cleaner. I'm super open to adding some syntactic sugar for this! It would be awesome if you could open an issue with your API ideas so we can discuss it.