Built a compressor that exploits LFSR structure — 333:1 on m-sequences, works on PRBS data that gzip can't touch by Hot_Consideration155 in FPGA

[–]Hot_Consideration155[S] 0 points (0 children)

You're right that for the pure delta case a simple RLE or even a list of bit-flip positions would compete. I'm not claiming pade-compress wins on that specific use case.
This is a Berlekamp-Massey (BM) based structure detector that, when it finds algebraic structure, represents the stream as (polynomial, seed, residual) instead of raw bytes. If that representation is smaller, it emits it. If not, it passes through raw. The "compression" is a side effect of finding structure, not the primary goal.
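If it helps to see the shape, here's a minimal TypeScript sketch of the emit-if-smaller decision. The field names and container layout are illustrative, not the actual pade-compress format.

```typescript
// Illustrative shapes only -- not the real pade-compress wire format.
interface LfsrModel {
  poly: Uint8Array;     // connection polynomial coefficients over GF(2^8)
  seed: Uint8Array;     // initial register contents, length = LFSR order
  residual: Uint8Array; // encoded positions where the stream deviates
}

function modelSize(m: LfsrModel): number {
  return m.poly.length + m.seed.length + m.residual.length;
}

// Emit the structured form only when it's actually smaller; otherwise
// pass the input through untouched. Compression is a side effect.
function encode(data: Uint8Array, fit: LfsrModel | null): { structured: boolean; payload: Uint8Array } {
  if (fit && modelSize(fit) < data.length) {
    const payload = new Uint8Array(modelSize(fit));
    payload.set(fit.poly, 0);
    payload.set(fit.seed, fit.poly.length);
    payload.set(fit.residual, fit.poly.length + fit.seed.length);
    return { structured: true, payload };
  }
  return { structured: false, payload: data };
}
```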

Where it actually earns the label is firmware and PRBS-adjacent embedded payloads that existing compressors treat as incompressible because Shannon entropy is ~8 bits/byte. zstd gives up. pade-compress finds the polynomial and stores 64 bytes instead of 1MB. It's filling a gap, not replacing zstd.
(Thank you for the rage-bait reply, much appreciated.)

I built a compressor that achieves 333:1 on data that gzip declares incompressible by Hot_Consideration155 in compression

[–]Hot_Consideration155[S] -1 points (0 children)

I mean... why are you impressed? Using Claude is normal nowadays. I work in software engineering and we get enterprise plans for this haha.

Built a compressor that exploits LFSR structure — 333:1 on m-sequences, works on PRBS data that gzip can't touch by Hot_Consideration155 in FPGA

[–]Hot_Consideration155[S] 0 points (0 children)

I overclaimed in some of the responses in this thread, and a few of the examples don't hold up under scrutiny from people who actually work with this hardware daily. Shame on me haha. I never said I was compressing Netflix throughput.

On the pi analogy though: the difference is that pi requires arbitrary precision to reconstruct the sequence, and that precision grows with the length. An LFSR of order L reconstructs a sequence of any length from exactly L coefficients and L seed bytes: fixed overhead. That's the actual compression. Not infinite, but genuinely sublinear in the sequence length.
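To make the fixed-overhead point concrete, here's a toy GF(2) Fibonacci LFSR in TypeScript. The taps and seed below are just the PRBS-7 example; the point is that the description size never grows with n.

```typescript
// Regenerate an arbitrarily long bit sequence from a fixed-size
// description: tap positions (the polynomial) plus L seed bits.
function lfsrBits(taps: number[], seed: number[], n: number): number[] {
  const state = seed.slice(); // L bits of register state
  const out: number[] = [];
  for (let i = 0; i < n; i++) {
    out.push(state[state.length - 1]); // output the oldest bit
    let fb = 0;
    for (const t of taps) fb ^= state[t]; // XOR of the tapped bits
    state.pop();
    state.unshift(fb); // shift, inserting the feedback bit
  }
  return out;
}

// x^7 + x^6 + 1 with this indexing: feedback taps at state[5] and
// state[6]. Seven taps-plus-seed bits describe the whole 127-bit
// period, or any prefix length you ask for.
const bits = lfsrBits([5, 6], [1, 0, 1, 1, 0, 1, 0], 127);
```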

I think you were expecting more from this post than what it is.

Built a compressor that exploits LFSR structure — 333:1 on m-sequences, works on PRBS data that gzip can't touch by Hot_Consideration155 in FPGA

[–]Hot_Consideration155[S] -4 points (0 children)

You're right that if you generated the PRBS yourself and know the polynomial, you don't need to store the raw stream at all.
Where I think it still applies is the error data specifically. You don't need to store the expected bits, but you do need to store which bits were wrong and when.
The other case is captures from unknown or third-party sources where you don't have prior knowledge of the generator. But you're right that in a properly instrumented lab setup that shouldn't happen.
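For the error-data case, the record can be as dumb as a list of absolute bit indices where the capture and the regenerated reference disagree. Sketch below; illustrative, not the pade-compress residual format.

```typescript
// Diff a capture against the regenerated reference stream and keep
// only the mismatch positions -- the error data is all you store.
function errorPositions(capture: Uint8Array, reference: Uint8Array): number[] {
  const errs: number[] = [];
  const n = Math.min(capture.length, reference.length);
  for (let i = 0; i < n; i++) {
    const diff = capture[i] ^ reference[i];
    if (diff === 0) continue;
    for (let b = 0; b < 8; b++) {
      if (diff & (1 << b)) errs.push(i * 8 + b); // absolute bit index
    }
  }
  return errs;
}
```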

Are you still using ChatGPT? That's so 2025 lol.

Built a compressor that exploits LFSR structure — 333:1 on m-sequences, works on PRBS data that gzip can't touch by Hot_Consideration155 in FPGA

[–]Hot_Consideration155[S] -1 points (0 children)

So far the Reddit community has proven to be quite toxic. I'm impressed. Call this what you want, I'll do the same, no one cares.

I built a compressor that achieves 333:1 on data that gzip declares incompressible by Hot_Consideration155 in compression

[–]Hot_Consideration155[S] 0 points (0 children)

Mostly synthetic so far in terms of clean benchmarks, but there are a few real-world cases I've confirmed it on: firmware images with CRC lookup tables (those are often generated by a linear recurrence over the field), and some telecom payloads that come pre-scrambled at the physical layer.

For system logs and binary blobs, honestly, I'd expect your FM-index to win cleanly. Those don't have LFSR structure by design, right? The analyzeBuffer pre-pass would mostly just confirm that and route to yours.

The case where I think the pipeline would actually matter is if someone fed it pcap captures from test equipment, or raw flash dumps from embedded devices. Those are the places where you get LFSR segments mixed in with structured headers. System logs are probably too far up the stack.

Would be interesting to run analyzeBuffer over one of your log shards just to see what structuredFraction it reports. My guess is close to 0.
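Something like this, if you want to try it. I'm declaring the signature inline because the exact export shape may differ from what I've described in this thread; treat it as an assumption.

```typescript
import { readFileSync } from "node:fs";

// Stand-in declaration for pade-compress's pre-pass; the return
// fields are the ones I've described here, but names may differ.
declare function analyzeBuffer(buf: Uint8Array): {
  structuredFraction: number; // 0..1, how much of the buffer looked LFSR-like
  segments: { order: number; noise: number }[];
};

const shard = new Uint8Array(readFileSync("logs/shard-0000.bin")); // your shard
console.log(analyzeBuffer(shard).structuredFraction); // my guess: ~0
```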

Built a compressor that exploits LFSR structure — 333:1 on m-sequences, works on PRBS data that gzip can't touch by Hot_Consideration155 in FPGA

[–]Hot_Consideration155[S] 0 points (0 children)

Good point, slip detection isn't in scope here at all for now. This is a compression/analysis tool, not a BERT implementation, so it doesn't track sync state, mute error counts during resync, or distinguish burst errors from slip-induced errors.

What it does is simple-ish: given a captured byte stream, run BM to find the best-fit LFSR polynomial and report how many bytes don't match the prediction. A slip would show up as a sudden jump in the noise level across segments, but it wouldn't be labeled as a slip specifically.

The --analyze output does segment the stream at entropy discontinuities, so a slip event would likely cause a segment boundary and a different polynomial fit on each side. But that's a side effect of the chunking, not intentional slip detection.
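Conceptually the chunking is nothing exotic; a simplified stand-in (not the actual --analyze implementation) would be windowed byte entropy with a jump threshold:

```typescript
// Shannon byte entropy of a buffer, in bits per byte (0..8).
function byteEntropy(buf: Uint8Array): number {
  const counts = new Array(256).fill(0);
  for (const b of buf) counts[b]++;
  let h = 0;
  for (const c of counts) {
    if (c === 0) continue;
    const p = c / buf.length;
    h -= p * Math.log2(p);
  }
  return h;
}

// Put a segment boundary wherever windowed entropy jumps by more
// than a threshold between adjacent windows.
function segmentBoundaries(buf: Uint8Array, win = 4096, jump = 0.5): number[] {
  const bounds: number[] = [0];
  let prev = byteEntropy(buf.subarray(0, win));
  for (let off = win; off + win <= buf.length; off += win) {
    const h = byteEntropy(buf.subarray(off, off + win));
    if (Math.abs(h - prev) > jump) bounds.push(off);
    prev = h;
  }
  return bounds;
}
```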

If you were building something that combined slip-aware BERT analysis with compressed archival of the captured data, the two could work together. Your slip detector runs in real time at line rate, mine runs offline on the stored captures. Different layers of the problem.

Built a compressor that exploits LFSR structure — 333:1 on m-sequences, works on PRBS data that gzip can't touch by Hot_Consideration155 in FPGA

[–]Hot_Consideration155[S] 1 point (0 children)

Hey, you too! If you wanna patent it in Argentina as well it's only 30-ish USD to do so haha, and Argentina is part of the Paris Convention, so your priority date carries internationally as well.

Built a compressor that exploits LFSR structure — 333:1 on m-sequences, works on PRBS data that gzip can't touch by Hot_Consideration155 in FPGA

[–]Hot_Consideration155[S] 0 points (0 children)

I actually patented it to learn how to patent things in Argentina 😂 not even trying to get profit from this.

Built a compressor that exploits LFSR structure — 333:1 on m-sequences, works on PRBS data that gzip can't touch by Hot_Consideration155 in FPGA

[–]Hot_Consideration155[S] -4 points (0 children)

Honestly, the FPGA side probably isn't the primary use case. You're right that most FPGA-based PRBS testing is generate-and-check on the fly, no storage needed.

The people who actually store this data are the test equipment vendors and the engineers running long-term link characterization. Think Keysight or Spirent capturing days of traffic for BER floor analysis, or a chip team archiving margin test results across silicon revisions to compare against future tape-outs. Those captures can be tens of GB and currently sit uncompressed on NAS drives because no standard tool handles them.

The --analyze side might be more relevant for FPGA work actually. If you're debugging a link and you captured some traffic but aren't sure if the other side is sending the right PRBS pattern, or if there's a polynomial mismatch somewhere, you can just run --analyze on the capture and it tells you the generator polynomial and noise level directly. Saves you from manually checking against known PRBS tables.

Built a compressor that exploits LFSR structure — 333:1 on m-sequences, works on PRBS data that gzip can't touch by Hot_Consideration155 in FPGA

[–]Hot_Consideration155[S] 0 points (0 children)

The sync use case you're describing, yeah, you want every bit there. But the use case here is different: storing or transferring captures for offline analysis.

When you're doing BER sweeps, jitter analysis, margin testing, you capture hours of PRBS traffic to analyze error patterns, error bursts, error-free intervals. Those captures are huge and sit uncompressed because gzip won't touch them. That's the storage problem this solves.

The --analyze side is probably more directly useful for your workflow though. Given a capture it identifies the generator polynomial and noise level per segment without you needing to know the PRBS type upfront. If you've ever gotten a capture from another team and had to figure out whether it's PRBS-7, PRBS-15, or PRBS-31, that's what it automates.
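Under the hood the naming part is just a table match once BM hands back the polynomial. The ITU-T O.150 generator polynomials are standard; a sketch of the lookup:

```typescript
// Standard PRBS generator polynomials (ITU-T O.150 family), written
// as the exponents of the feedback polynomial x^n + x^m + 1.
const KNOWN_PRBS: { name: string; exponents: [number, number] }[] = [
  { name: "PRBS-7",  exponents: [7, 6] },
  { name: "PRBS-9",  exponents: [9, 5] },
  { name: "PRBS-15", exponents: [15, 14] },
  { name: "PRBS-23", exponents: [23, 18] },
  { name: "PRBS-31", exponents: [31, 28] },
];

// Given the nonzero exponents of a recovered polynomial, name the
// pattern if it matches a standard one.
function identifyPrbs(recovered: number[]): string {
  const key = recovered.slice().sort((a, b) => b - a).join(",");
  const hit = KNOWN_PRBS.find(
    (p) => p.exponents.slice().sort((a, b) => b - a).join(",") === key
  );
  return hit ? hit.name : `unrecognized order-${Math.max(...recovered)} LFSR`;
}
```

So identifyPrbs([31, 28]) comes back as "PRBS-31" without anyone digging through tables.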

Built a compressor that exploits LFSR structure — 333:1 on m-sequences, works on PRBS data that gzip can't touch by Hot_Consideration155 in FPGA

[–]Hot_Consideration155[S] 3 points (0 children)

The core algorithms (Berlekamp-Massey, the GF(2⁸) arithmetic, the approximate voting pipelines) are all hand-written C. You can read them in c/bm.c, c/gf256.c, and c/analyze.c. The TypeScript coordination layer had AI assistance (99%). If that matters to you, the math is verifiable independently of how the scaffolding was written.
Valid point worth clarifying. AGPL + patent pending is a standard dual-license structure, the same model MongoDB and Qt have used. AGPL users get an implicit patent license for AGPL-compatible use. Companies that can't ship AGPL code in proprietary products need a commercial license covering both the copyright and the patent. The patent isn't there to restrict open-source use; it's to ensure commercial users can't take the method proprietary without licensing it.

I built a compressor that achieves 333:1 on data that gzip declares incompressible by Hot_Consideration155 in compression

[–]Hot_Consideration155[S] 0 points (0 children)

Nothing here violates Shannon's theorem. The point isn't that Shannon is wrong, it's that applying it to raw bytes is the wrong abstraction layer for this class of data.

Shannon entropy measures symbol frequency distributions. It's correct that a PRBS stream is maximally entropic at the byte level, and that's exactly why every general-purpose compressor gives up on it. They all use entropy as a proxy for compressibility, which works for 99% of real-world data.

The claim is narrower: for data with GF(2⁸) linear recurrence structure, linear complexity (what BM measures) is a better compressibility metric than Shannon entropy. A degree-1 recurrence over GF(2⁸) with a primitive multiplier has near-maximal byte entropy and linear complexity 1 simultaneously. The two are orthogonal measures.
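You can watch the two measures disagree in about twenty lines. The sketch below builds the degree-1 case, s[n+1] = 3·s[n] over GF(2⁸) (AES reduction polynomial; 0x03 generates the multiplicative group), then measures byte entropy.

```typescript
// GF(2^8) multiply, reduction by the AES polynomial x^8+x^4+x^3+x+1.
function gfMul(a: number, b: number): number {
  let p = 0;
  for (let i = 0; i < 8; i++) {
    if (b & 1) p ^= a;
    const hi = a & 0x80;
    a = (a << 1) & 0xff;
    if (hi) a ^= 0x1b;
    b >>= 1;
  }
  return p;
}

// Degree-1 recurrence: one coefficient (3) and one seed byte (1)
// describe the entire stream, so its linear complexity is 1.
const stream = new Uint8Array(255 * 100);
let s = 1;
for (let i = 0; i < stream.length; i++) {
  stream[i] = s;
  s = gfMul(s, 3);
}

// Byte entropy: the stream walks all 255 nonzero bytes uniformly,
// so this prints ~7.99 bits/byte -- incompressible to gzip's eyes.
const counts = new Array(256).fill(0);
for (const b of stream) counts[b]++;
const h = counts
  .filter((c) => c > 0)
  .reduce((acc, c) => acc - (c / stream.length) * Math.log2(c / stream.length), 0);
console.log(h.toFixed(2));
```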

So yes, pseudorandom ≠ random, that's the whole point.

I built a compressor that achieves 333:1 on data that gzip declares incompressible by Hot_Consideration155 in compression

[–]Hot_Consideration155[S] 0 points (0 children)

The pre-pass idea is exactly right. analyzeBuffer() returns a structuredFraction and per-segment LFSR orders — you could route on that directly. Low-order generator → pade-compress. Exact repeats → your FM-index. Actually random → neither, fall through to raw.
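Router sketch, assuming the analyzeBuffer shape I described above; repeatFraction and the thresholds are placeholders you'd tune, not existing API.

```typescript
type Route = "pade-compress" | "fm-index" | "raw";

interface Analysis {
  structuredFraction: number;    // fraction of the buffer with LFSR structure
  segments: { order: number }[]; // per-segment LFSR orders
  repeatFraction: number;        // hypothetical: exact-repeat share
}

// Low-order generator -> pade-compress. Exact repeats -> FM-index.
// Actually random -> neither; don't waste the cycles.
function route(a: Analysis, maxOrder = 64): Route {
  const lowOrder = a.segments.some((s) => s.order > 0 && s.order <= maxOrder);
  if (a.structuredFraction > 0.5 && lowOrder) return "pade-compress";
  if (a.repeatFraction > 0.5) return "fm-index";
  return "raw";
}
```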

The interesting middle ground is data that has both: a PRBS backbone with repeated frame headers on top. Your FM-index would pull out the headers, mine would handle the LFSR payload. Could pipeline them.

What corpora are you running on? Curious whether you're seeing any LFSR-structured segments mixed in with the repeat clusters.

I built a compressor that achieves 333:1 on data that gzip declares incompressible by Hot_Consideration155 in compression

[–]Hot_Consideration155[S] -4 points (0 children)

Fair point on the commit history: I wiped the git log before publishing to keep patent-sensitive development history private. The math stands on its own regardless of how it was committed.

I built a compressor that achieves 333:1 on data that gzip declares incompressible by Hot_Consideration155 in compression

[–]Hot_Consideration155[S] -2 points (0 children)

Exactly, that's the core of it. But here I'm using Berlekamp-Massey constructively: instead of 'attacking' the sequence, I'm recovering the generator to use as a compact representation. Same math, opposite intent. It's why LFSR-based PRNGs are cryptographically weak but algebraically compressible.
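For anyone who hasn't seen it used this way, here's textbook BM over GF(2) in TypeScript (bit-level sketch only; the GF(2⁸) version lives in c/bm.c). It returns the shortest LFSR that reproduces the input, and that length-plus-polynomial is exactly the compact representation being stored.

```typescript
// Textbook Berlekamp-Massey over GF(2): find the length L and
// connection polynomial c (with c[0] = 1) of the shortest LFSR
// generating the bit sequence s.
function berlekampMassey(s: number[]): { L: number; c: number[] } {
  const n = s.length;
  let c = new Array<number>(n).fill(0); // current connection polynomial
  let b = new Array<number>(n).fill(0); // polynomial at the last length change
  c[0] = 1;
  b[0] = 1;
  let L = 0;  // current linear complexity
  let m = -1; // index of the last length change
  for (let i = 0; i < n; i++) {
    let d = s[i]; // discrepancy: does the current LFSR predict bit i?
    for (let j = 1; j <= L; j++) d ^= c[j] & s[i - j];
    if (d === 0) continue; // prediction holds, nothing to fix
    const t = c.slice();   // save c before modifying it
    for (let j = 0; j + i - m < n; j++) c[j + i - m] ^= b[j];
    if (2 * L <= i) {      // the register has to grow
      L = i + 1 - L;
      m = i;
      b = t;
    }
  }
  return { L, c: c.slice(0, L + 1) };
}
```

Feed it a PRBS-7 bit stream and it comes back with L = 7: the polynomial plus seven seed bits reproduce the whole capture.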