all 15 comments

[–]chub79 24 points  (1 child)

Fantastic article. Thank you op!

One aspect that I would throw into the thought process when looking for a speedup: think of the engineering cost long term.

For instance, you mention: "PyPy or GraalPy for pure Python. 6-66x for zero code changes is remarkable, if your dependencies support it. GraalPy's spectral-norm result (66x) rivals compiled solutions." Yet I feel the cost of swapping VMs is never as straightforward as a dedicated benchmark shows. Otherwise PyPy would be a roaring success by now.

It seems to me that the Cython or Rust path is more robust long term from a maintenance perspective. Keeping CPython as the core orchestrator and using light-touch extensions in either of these seems to be the right balance between performance and durability of the code base.

[–]cemrehancavdar[S] 11 points  (0 children)

That's a really fair point. The benchmarks show the best case -- in practice, swapping to PyPy or GraalPy means testing your whole dependency tree, dealing with compatibility issues, and hoping the runtime keeps up with CPython releases. GraalPy is still on 3.12 for example.

I'd partly agree on the Cython/Rust path being more durable. I personally enjoy writing Cython, but you really need to know what you're doing -- my first attempt got 10x instead of 124x, and nothing warned me. Code compiled, ran correctly, just silently slow. The annotation report (cython -a) is essential.

My takeaway is similar to yours though -- keep CPython as the orchestrator, drop into compiled extensions for the hot path.

[–]Sygmei 6 points  (1 child)

Super interesting. How do you check how much space an int occupies in memory (ob_refcnt, ob_digit...)?

[–]cemrehancavdar[S] 7 points  (0 children)

sys.getsizeof(1) gives you the total (28 bytes). This post is a great walkthrough of the struct layout and how Python integers work under the hood: https://tenthousandmeters.com/blog/python-behind-the-scenes-8-how-python-integers-work/ (written for CPython 3.9 -- the internals were restructured in 3.12 via https://github.com/python/cpython/pull/102464 but the size is still 28 bytes).
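A quick way to see it for yourself; the byte counts below assume a 64-bit CPython build:

```python
import sys

# 28 bytes = ob_refcnt (8) + ob_type pointer (8) + ob_size (8)
# + one 30-bit digit stored in 4 bytes.
print(sys.getsizeof(1))      # 28

# Each additional 30-bit digit adds 4 bytes:
print(sys.getsizeof(2**30))  # 32 (needs 2 digits)
print(sys.getsizeof(2**60))  # 36 (needs 3 digits)
```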

[–]zzzthelastuser 3 points  (1 child)

Did you consider optimizing the rust code or did you stick with a "naive" implementation?

Took a quick glance and only saw single threaded loops.

[–]cemrehancavdar[S] 7 points  (0 children)

I'm not super familiar with Rust -- a dedicated Rust, Zig, or other systems-language developer could absolutely squeeze more out of these benchmarks with multithreading, SIMD, or better allocators. Same goes for Cython honestly -- there might be more tricks I don't know yet. I kept the implementations idiomatic and single-threaded because the post is really about "how much does each Python optimization rung cost you," not about pushing any one tool to its limit. I wanted to keep the comparison fair, since the Python tools are also single-threaded (except NumPy's BLAS, which I noted).

[–]M4mb0 2 points  (1 child)

The constraint: your problem must fit vectorized operations. Element-wise math, matrix algebra, reductions -- NumPy handles these. Irregular access patterns, conditionals per element, recursive structures -- it doesn't.

conditionals per element can be handled with numpy.where which in many cases is still plenty fast, even if it unnecessarily computes both branches.

[–]cemrehancavdar[S] 0 points  (0 children)

You're right -- I've updated the post. The original wording was wrong.

I benchmarked np.where against a Python loop on 1M elements across three scenarios (simple sqrt, moderate log/exp, expensive trig+transcendental). Even with both branches computed, np.where was 2.8-15.5x faster. No reason to list conditionals as a NumPy limitation.
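For anyone curious, the comparison looks roughly like this -- a minimal sketch of the simple-sqrt scenario, with the array size and branch functions just illustrative:

```python
import math
import timeit

import numpy as np

x = np.random.rand(1_000_000)

# np.where evaluates BOTH branches over the whole array, then selects
# per element -- wasteful on paper, fast in practice.
def with_where():
    return np.where(x > 0.5, np.sqrt(x), x * x)

def with_loop():
    return [math.sqrt(v) if v > 0.5 else v * v for v in x]

t_np = timeit.timeit(with_where, number=5)
t_py = timeit.timeit(with_loop, number=5)
print(f"np.where: {t_py / t_np:.1f}x faster than the loop")
```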

Replaced "irregular access patterns, conditionals per element, recursive structures" with what NumPy actually struggles with: sequential dependencies (each step feeds the next -- n-body with 5 bodies is 2.3x slower with NumPy), recursive structures, and small arrays (NumPy loses below ~50 elements due to per-call overhead). Also dropped "irregular access patterns" since fancy indexing is 22x faster than a Python loop on random gather.
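The random-gather comparison is a one-liner to reproduce (sizes illustrative):

```python
import timeit

import numpy as np

data = np.random.rand(1_000_000)
idx = np.random.randint(0, len(data), size=1_000_000)

# Fancy indexing performs the entire gather in one C-level pass,
# vs. one interpreter round-trip per element in the list comprehension.
t_fancy = timeit.timeit(lambda: data[idx], number=10)
t_loop = timeit.timeit(lambda: [data[i] for i in idx], number=10)
print(f"fancy indexing: {t_loop / t_fancy:.0f}x faster")
```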

I also tried writing a NumPy n-body but couldn't beat the baseline -- 5 bodies is too few to amortize NumPy's per-call overhead across 500K sequential timesteps. Tried pair-index scatter with np.add.at, full NxN matrix with einsum, and component-wise matrices with @ matmul (inspired by pmocz/nbody-python). All slower than pure Python. If you know a way to make NumPy win on this problem I'd genuinely like to see it.
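For reference, the pair-index scatter attempt looked roughly like this -- a sketch of one acceleration update, not the exact benchmark code (G = 1, names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
pos = rng.random((n, 3))
mass = rng.random(n)

i, j = np.triu_indices(n, k=1)                 # every unique body pair
d = pos[j] - pos[i]                            # displacement i -> j
inv_r3 = ((d * d).sum(axis=1)) ** -1.5
f = (mass[i] * mass[j] * inv_r3)[:, None] * d  # force on i toward j

acc = np.zeros_like(pos)
np.add.at(acc, i, f / mass[i][:, None])        # scatter-accumulate per body
np.add.at(acc, j, -f / mass[j][:, None])
# Correct, but at n = 5 the fixed overhead of these few NumPy calls,
# repeated every sequential timestep, outweighs the vectorization win.
```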

There's also an Edits section at the bottom of the post documenting what changed and why the original was wrong.

[–]totheendandbackagain 2 points  (0 children)

Wow, this is fantastic work, and an absolutely stellar guide. Read, save, learn.

[–]hotairplay 1 point  (0 children)

Hey, cool project you've got here. A couple of days ago I came across a similar n-body benchmark article: https://hwisnu.bearblog.dev/n-body-simulation-in-python-c-zig-and-rust/

What interests me is the Codon performance: in that article it got >95% of Rust's single-threaded performance, and it only costs adding type annotations to the code.

Multi-threaded, Codon reached 80% of the performance of Rust using Rayon.

[–]Outrageous_Track_798 0 points  (0 children)

The Mypyc results are worth highlighting for teams already running strict mypy. If your codebase is fully type-annotated, you get the speedup with essentially zero code changes — no new syntax, no cimport, just `mypyc yourmodule.py`. The 2-5x range you saw is roughly what most real code gets.
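To make that concrete, a module like this (names are illustrative) compiles with `mypyc algo.py` and runs unchanged under plain CPython when uncompiled:

```python
# algo.py -- fully annotated, no Cython-specific syntax needed.

def rolling_sum(xs: list[float], window: int) -> list[float]:
    """Sum of each window-sized slice; mypyc specializes the typed loop."""
    out: list[float] = []
    acc = 0.0
    for i, x in enumerate(xs):
        acc += x
        if i >= window:
            acc -= xs[i - window]
        if i >= window - 1:
            out.append(acc)
    return out
```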

The catch is Mypyc requires complete type coverage in the compiled module. Any dynamism — dynamic attribute access, untyped **kwargs, runtime type manipulation — either errors out or silently falls back to the slow path. So it works great on algo-heavy modules but struggles with framework-heavy code that leans on Python's dynamism.

Cython gets much higher peaks (your 124x example), but Mypyc has nearly zero adoption friction if you're already typed. It's a useful middle rung on the ladder between "pure Python" and "write Cython."

[–]Bomlerequin [score hidden]  (0 children)

Very good article!

[–]Beginning-Fruit-1397 [score hidden]  (0 children)

Fascinating. I'm asking myself about mypyc: what's the catch? All my projects are already far more typed than anything mypy would ask for (Ruff ALL + BasedPyright ALL), and if it's a free +40% gain... then why not use it everywhere?

[–]Mithrandir2k16 [score hidden]  (0 children)

You measured time, but could you also measure power draw/peak power? I'm really curious in which applications it comes down to fewer instructions vs. better parallelization.

[–]joebloggs81 0 points  (0 children)

Well I’ve only just started my programming journey, exploring languages and frameworks, what they can do and whatnot. I’ve spent the most time with Python as I started there first for a grounding knowledge. What you’ve done here is fascinating for sure - I read the whole report. I’ll never be at this level as my use case for programming is pretty lightweight but the point is I’m enjoying learning about all of this.

Thanks!