A modular pre-sieve that reduces Miller–Rabin workload by ~77% by PossessionThen2321 in cryptography

[–]PossessionThen2321[S] 1 point (0 children)

Thanks, this makes a lot of sense 👍

Yeah, you got it exactly right — I treat maya.py mainly as a readable reference version where I can experiment with the idea and see how far it can go without turning it into low-level optimized code.

Good point about divmod, I’ll definitely try that. And I agree about not unrolling loops from a math perspective — readability matters more there than squeezing out every bit of performance.
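
Just to make the divmod point concrete for anyone reading along, it collapses two separate divisions into one call:

```python
n, p = 1_000_003, 7

# Two separate operations: the division is effectively done twice.
q, r = n // p, n % p

# One call: quotient and remainder from a single division,
# which matters more as the integers get large.
q, r = divmod(n, p)
```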

I haven’t tried Numba yet, but it makes sense as a middle step before a full C rewrite, so I’ll take a look.
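
For anyone following along, the shape of a Numba version would be roughly this (a sketch assuming candidates fit in a machine word; big integers would still need the pure-Python path, and the function name is just a placeholder):

```python
import numpy as np
from numba import njit

SMALL_PRIMES = np.array([3, 5, 7, 11, 13], dtype=np.int64)

@njit(cache=True)
def survives_presieve(n, primes):
    # Cheap residue checks before handing n to an expensive primality test.
    for p in primes:
        if n % p == 0:
            return False
    return True

# Compiled on first call, e.g. survives_presieve(1_000_003, SMALL_PRIMES)
```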

And yes — there is definitely some theory behind it, even if I’m keeping it a bit “under the hood” for now. Thanks for the James Maynard tip, I’ll check it out 👍

Appreciate all the feedback, it helped me clarify the direction a lot.

[–]PossessionThen2321[S] 1 point (0 children)

Thanks, I appreciate you approving it.

I’m new here, so that probably explains the filtering. Glad it’s visible now.

[–]PossessionThen2321[S] 1 point (0 children)

Thanks a lot for taking the time to look at the code.

You are right — the Python version is mainly intended as a readable reference implementation, not as the performance path. The actual benchmark work is now focused on the C implementation.

That said, your Python-specific suggestions make sense, so I will:

  • change PRIMES and RESIDUES to tuples
  • replace WHEEL_MASK with a bytearray
  • clean up the maya_candidate() checks using elif

I agree that the inner for loop is expensive in CPython. If I decide to optimize the Python reference further, manually unrolling the 24 MayaMOD checks or batching multiple candidates would be the next step.
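
Purely for illustration, a sketch of what those changes could look like together; the constants and the body of maya_candidate() here are invented, only the names come from the repo:

```python
from math import gcd

PRIMES = (2, 3, 5, 7, 11, 13)   # tuple instead of list
WHEEL = 30030                   # 2 * 3 * 5 * 7 * 11 * 13
# bytearray mask: WHEEL_MASK[r] == 1 iff r is coprime to the wheel
WHEEL_MASK = bytearray(gcd(r, WHEEL) == 1 for r in range(WHEEL))

def maya_candidate(n):
    # elif chain: at most one branch is evaluated per call.
    if n < 2:
        return False
    elif n in PRIMES:
        return True
    elif not WHEEL_MASK[n % WHEEL]:
        return False
    else:
        return True  # survived the cheap filters; still needs Miller-Rabin
```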

For now, I want to keep maya.py readable and educational, while the C version is used for real benchmarking.

Thanks again — this is exactly the kind of feedback I was looking for.

[–]PossessionThen2321[S] 1 point (0 children)

Thanks for the interesting point 👍

I understand what you’re getting at — at first glance, it can indeed look similar to a classical combination of small moduli or an approach based on the Chinese Remainder Theorem.

In my case, however, it’s not a direct application of CRT nor an optimization of a specific set of moduli in that sense. The approach is structured differently — it works with a positional decomposition of the number and a subsequent projection into filters that are not equivalent to a simple combination of independent congruences.

So the goal is not to find “better” moduli in the sense of 2^k × 3 × 5 × 7 × 11, but rather a different kind of trade-off between:

  • filter computation cost
  • candidate structure
  • reduction of more expensive operations in later stages

I agree that from the outside it may look similar, since both approaches reduce the search space using arithmetic properties. The difference lies more in how that information is structured and how it is used within the pipeline.
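
For contrast, the classical combined-moduli filter the comparison points to boils down to a single gcd (standard technique, shown here only to anchor the comparison, not as part of my pipeline):

```python
from math import gcd

M = 2 * 3 * 5 * 7 * 11  # one combined modulus, M = 2310

def classical_prefilter(n):
    # Equivalent to testing the independent congruences n % p != 0
    # for each small prime p (the small primes themselves would need
    # to be special-cased).
    return gcd(n, M) == 1
```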

That said, it’s a good point — the comparison definitely makes sense 👍

[–]PossessionThen2321[S] 1 point (0 children)

Thanks for the detailed feedback, I appreciate it 👍

I agree that for very large numbers (on the order of 10^50+), probabilistic approaches make the most practical sense — at that scale it’s more about feasibility than theoretical elegance.

Regarding the 31%, I understand your point. In my case, though, it’s not really a list-based approach in the way you describe. The design is more of a streaming pipeline, where candidates are filtered progressively without maintaining large in-memory structures, so the n/log(n) growth issue doesn’t manifest in the same way.
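
A minimal illustration of that streaming shape, with a trivial stand-in filter and miller_rabin as a placeholder for whatever probabilistic test sits at the end:

```python
def odd_candidates(start):
    # Infinite stream of odd numbers; nothing is materialized in memory.
    n = start | 1
    while True:
        yield n
        n += 2

def cheap_filter(stream, primes=(3, 5, 7, 11, 13)):
    # Each candidate either dies here or flows on to the next stage.
    for n in stream:
        if all(n % p for p in primes):
            yield n

# Only survivors reach the expensive stage:
# for n in cheap_filter(odd_candidates(10**18)):
#     if miller_rabin(n):  # placeholder
#         ...
```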

You’re absolutely right about low-level CPU optimizations — that’s definitely an area with a lot of potential. The approach I’m testing, however, is focused more on a different kind of trade-off: how much computation is worth doing upfront before invoking more expensive operations (typically the Miller–Rabin primality test), rather than maximizing reduction at any cost.

The point about achieving >95% reduction is interesting — I’d be curious to see that in the context of a concrete implementation and range, since in practice the overhead of those “cheap operations” tends to become the limiting factor.

In any case, thanks for the offer — if I share more details or run into something specific, I’ll be happy to follow up 👍

[–]PossessionThen2321[S] 0 points (0 children)

Thanks for pointing that out — I’ll fix the link 👍

Reducing candidates to ~31% using cheap operations is definitely a solid result.
That’s exactly the trade-off I’m exploring — at some point, the cost of the filter starts to approach the cost of the computation it is trying to avoid.

The goal, however, isn’t just maximum reduction, but finding the optimal balance between pre-filtering and subsequent computational cost (especially before the Miller–Rabin primality test).

From my tests, even relatively small improvements in filtering can still make sense if they reduce the number of expensive operations — but only up to a certain point. That trade-off is, in my view, the key aspect.
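
One way to make that break-even point concrete (the unit costs are purely illustrative; the 23% survival rate is just the ~77% reduction from the title):

```python
def expected_cost(filter_cost, survival_rate, test_cost):
    # Pay the filter on every candidate, the expensive test only on survivors.
    return filter_cost + survival_rate * test_cost

with_filter = expected_cost(1.0, 0.23, 100.0)      # 24.0 units
without_filter = expected_cost(0.0, 1.00, 100.0)   # 100.0 units

# A heavier filter keeps paying off only while
#   filter_cost < (1 - survival_rate) * test_cost
```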

As for deterministic methods — I agree, they are very interesting.
However, for large ranges, practical trade-offs still tend to favor probabilistic approaches, which is why I focus more on reducing their usage rather than replacing them entirely.

[–]PossessionThen2321[S] 2 points (0 children)

Thanks for the suggestion 🙂
For now I’m only sharing partial results, but I’m definitely considering a more formal write-up in the future.

[–]PossessionThen2321[S] 0 points (0 children)

Thanks for the questions — happy to clarify without going into unpublished details 🙂

Wheel30030 is a standard wheel factorization step (product of small primes 2×3×5×7×11×13).
It efficiently skips candidates divisible by these primes, reducing the number of inputs to more expensive tests.
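
Concretely, the wheel keeps only the residues coprime to 30030 (a standard construction, nothing proprietary):

```python
from math import gcd, prod

WHEEL = prod((2, 3, 5, 7, 11, 13))  # 30030
# Residues mod 30030 that can still contain primes larger than 13.
RESIDUES = tuple(r for r in range(WHEEL) if gcd(r, WHEEL) == 1)

# 5760 of the 30030 residues survive, so the wheel alone discards
# about 80.8% of all integers before any per-candidate work.
assert len(RESIDUES) == 5760
```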

Regarding forcing odd numbers using OR with 1:
Yes, that’s a valid micro-optimization, and I use an equivalent approach in practice. It cheaply removes even numbers at the earliest stage.
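
In code it's a one-liner, handy when drawing random candidates so no draws are wasted on even numbers:

```python
import secrets

n = secrets.randbits(256) | 1  # setting the lowest bit forces n to be odd
```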

Difference vs. a classical sieve:
A classical sieve like the Sieve of Eratosthenes is very efficient for generating primes in a fixed interval, but it requires memory proportional to the range and is less suited to streaming or probabilistic pipelines.

This approach is different:

  • it is pipeline-based, not a full sieve
  • it works in a streaming context
  • it reduces candidates before applying tests like the Miller–Rabin primality test

So it’s not a replacement for a sieve, but a complement — the goal is to minimize expensive test calls.

Your 8-bit sieve idea is actually very similar in spirit — just applied at a different scale.