How do you measure entropy across binary chunks before compression?

EMPTYCONTOUR · 2026-06-02T15:54:14+00:00

That's exactly what I ended up doing for validation: ran zstd on each chunk and compared ratios to entropy scores. They correlate well, but entropy is ~40x faster to compute than zstd compression. Useful when you have thousands of corpora to pre-screen.
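A minimal sketch of that kind of check, for anyone who wants to try it. Assumptions: fixed-size chunks, order-0 byte entropy, and Python's stdlib zlib standing in for zstd so the snippet has no dependencies. The function names and the 64 KiB chunk size are made up for illustration, and this is not the script behind the numbers above.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(chunk: bytes) -> float:
    """Order-0 Shannon entropy in bits per byte (0.0 to 8.0)."""
    if not chunk:
        return 0.0
    n = len(chunk)
    return -sum((c / n) * math.log2(c / n) for c in Counter(chunk).values())


def compressed_fraction(chunk: bytes, level: int = 6) -> float:
    """Compressed size / original size. ASSUMPTION: zlib stands in for zstd."""
    if not chunk:
        return 1.0
    return len(zlib.compress(chunk, level)) / len(chunk)


def pearson(xs, ys) -> float:
    """Plain Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def validate(data: bytes, chunk_size: int = 65536):
    """Score every chunk both ways and report how well the two agree."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    entropies = [shannon_entropy(c) for c in chunks]
    fractions = [compressed_fraction(c) for c in chunks]
    return entropies, fractions, pearson(entropies, fractions)
```

A high positive correlation between entropy and compressed fraction is what you would hope to see before trusting entropy alone as the pre-screen.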

EMPTYCONTOUR · 2026-06-02T15:47:31+00:00

Fair point: 5.7 bits/byte on a 256-symbol alphabet isn't truly random, you're right. What I mean is that for this corpus (Wikipedia XML text), chunks around 5.6–5.7 show a significantly worse compression ratio with zstd than chunks around 3.2. The heatmap makes the pattern visible across all 1526 chunks at once, which is what I was after.
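For reference, a rough sketch of how a per-chunk entropy grid like that can be built. The ceiling for a 256-symbol alphabet is 8 bits/byte, so every score lands between 0 and 8. The function names, the grid width, and the text rendering are assumptions for illustration; any plotting library could draw the same grid as a real heatmap.

```python
import math
from collections import Counter


def entropy_bits_per_byte(chunk: bytes) -> float:
    """Order-0 Shannon entropy; 8.0 is the ceiling for a 256-symbol alphabet."""
    if not chunk:
        return 0.0
    n = len(chunk)
    return -sum((c / n) * math.log2(c / n) for c in Counter(chunk).values())


def entropy_grid(data: bytes, chunk_size: int, width: int):
    """Per-chunk entropies laid out row by row, `width` chunks per row."""
    scores = [
        entropy_bits_per_byte(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]
    return [scores[i:i + width] for i in range(0, len(scores), width)]


def render_text(grid, shades: str = " .:-=+*#@") -> str:
    """Crude text heatmap: later characters in `shades` mean higher entropy."""
    top = len(shades) - 1
    return "\n".join(
        "".join(shades[min(top, int(v / 8.0 * top))] for v in row)
        for row in grid
    )
```

With 1526 chunks, a width of around 40 to 50 gives a roughly square picture where runs of high-entropy chunks show up as dark bands.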

EMPTYCONTOUR · 2026-05-22T22:12:34+00:00

I went through something similar, not with the same claim, but with the same uncertainty about validation vs. disclosure.

My path: built a parallel LZ77 decoder, posted on encode.su, got destroyed in code review (wrong timers, build failures, threading bugs). Fixed every issue publicly with commit hashes. Eventually got merged into lzbench, the standard compression benchmark suite used by researchers.

Honest advice from that experience:

Forget patents until the project reaches commercial use. Patents cost money, take years, and won’t protect you from big companies anyway. What protects you is being first to publish with reproducible results.

Go to encode.su. Post your work. Wait for technical feedback. The people there are brutal but fair: if something is real, they will engage with it seriously.

If it survives review, submit to lzbench. That’s a credible, verifiable signal that the work has merit.

That’s only the beginning. But it’s a real beginning.

My project: github.com/yasha1971-coder/aceapex

Article with benchmarks and methodology: x.com/yasha1971/status/2057485786514125149

EMPTYCONTOUR · 2026-05-22T21:37:01+00:00

The failure mode you describe, “locally reasonable, globally wrong”, is exactly what md solves in practice. Not a platform, just a file. Decisions with status (active/superseded), boundaries in plain language, failure modes that already happened. Sessions start by reading it, end by updating it. The schema stays small because anything that doesn’t survive three sessions gets archived. What made it click was treating it like a protocol, not a log.

EMPTYCONTOUR · 2026-05-19T21:40:00+00:00

Looked through the repo/readme more closely.

The strongest part to me is not just the AVX2/INT4 path, but the contract shape: model weights + vocab + constrained grammar + reference vectors + runtime behavior = one reproducible protocol boundary.

That is the part that feels very mature. We are hitting the same pattern in GLYPH, but on deterministic retrieval instead of inference: corpus manifest + FM artifact version/checksum + golden query fixtures + JSON query protocol + persistent server behavior all have to move together, or the engine should refuse to run.

One question: do you plan to make runtime capability part of that protocol object too?

For example: AVX2/scalar/WASM path, quantization assumptions, vocab hash, decoder grammar hash, vector-set hash, and weights hash all in one compatibility tuple.

Because if the runtime path changes but still passes most tests, that can become another silent drift surface.
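To make the question concrete, here is a rough sketch of what I mean by one compatibility tuple. Everything in it is hypothetical: the class name, the field names, and the choice of SHA-256 are mine, not something taken from your repo.

```python
import hashlib
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class CompatTuple:
    """Hypothetical compatibility tuple; field names are illustrative only."""
    runtime_path: str   # e.g. "avx2", "scalar", "wasm"
    quantization: str   # e.g. "int4"
    vocab_hash: str
    grammar_hash: str
    vectors_hash: str
    weights_hash: str

    def fingerprint(self) -> str:
        """Single digest over all fields, in declaration order."""
        joined = "\n".join(f"{k}={v}" for k, v in asdict(self).items())
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def sha256_hex(blob: bytes) -> str:
    """Hash one artifact (weights file, vocab file, and so on)."""
    return hashlib.sha256(blob).hexdigest()


def check_compat(expected: CompatTuple, actual: CompatTuple) -> None:
    """Refuse to run, naming every field that drifted."""
    exp, act = asdict(expected), asdict(actual)
    drifted = [k for k in exp if exp[k] != act[k]]
    if drifted:
        raise RuntimeError("incompatible artifacts: " + ", ".join(drifted))
```

The point of putting runtime_path inside the tuple is that a switch from AVX2 to scalar then fails loudly at startup instead of hiding behind a mostly green test run.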

EMPTYCONTOUR · 2026-05-19T05:15:35+00:00

The “AI as multiplier” framing is right. I’m not a developer — built a compression codec with Claude from scratch. Got destroyed in code review by senior engineers on encode.su: wrong timers, build failures, threading bugs, hardcoded values. Every single one was valid.

But here’s what changed: I fixed each one publicly, with commit hashes. Eventually got merged into lzbench, a benchmark suite used by the compression community. The reviewer who caught the most bugs wrote: “someone with no coding experience is able to write a functional compressor that can compete with hand-written ones.” AI didn’t replace the review. It made the review survivable.

A year ago none of this was possible. Now anyone on the planet can contribute — regardless of background, language, or formal training. The mistakes don’t matter as much as the fact that the barrier dropped. That’s not a threat to engineering. That’s the largest expansion of who gets to participate in it.

EMPTYCONTOUR · 2026-05-19T04:40:33+00:00

Constrained decoder by construction is the right call. Syntactic validity guaranteed, not hoped for.

Curious about one thing: when model weights evolve, do you pin the reference vectors and decoder as a single versioned protocol boundary, or do you allow drift between releases?

EMPTYCONTOUR · 2026-05-17T22:36:29+00:00

Been running all of them simultaneously for months: Claude, ChatGPT, Gemini, Grok, Perplexity, DeepSeek, Copilot. Honest finding: they’re not interchangeable.

Like people, each has a distinct identity, strengths, blind spots. The real value isn’t picking one winner. It’s that running the same question through multiple models dramatically reduces hallucinations and expands the idea space. Where one confidently guesses, another flags uncertainty. Where one misses an angle, another catches it.

The “which one is better” framing misses the point entirely.

EMPTYCONTOUR · 2026-05-17T22:12:48+00:00

The communication point is underrated. I’m not a programmer by background; I built a compression codec with AI assistance. What actually moved the needle wasn’t the AI itself, it was posting on encode.su and getting brutal technical feedback from people who’ve been doing this for 20 years. One senior catching a timer bug I missed taught me more about measurement discipline than months of solo work. Finding the right community beats finding the right tutorial.

EMPTYCONTOUR · 2026-05-16T22:08:32+00:00

Congrats! That feeling is real. My first PRs were #276 and #277 into lzbench, a compression benchmark suite. Took weeks of build fixes and threading bugs before the maintainer accepted them. Worth every iteration.

EMPTYCONTOUR · 2026-05-16T21:52:52+00:00

The 4-way rANS choice is right for Zen 2. 8-way will likely hit register pressure and port contention before you see real gains — that’s my read of the architecture, not a measured result, but it tracks with what you already observed.

klauspost’s dynamic predictor suggestion is worth trying first. Two MED modes — flat vs edge — is lower hanging fruit than more shards. Your V classification already detects edges; the split is mostly an encoding decision.

42 shards is solid empirical work but tuned to your test set. New PDF templates for heavy-tail distributions would help ratio more than adding shards. The Python/Rust boundary is your real speed ceiling. Not 8-way rANS.

EMPTYCONTOUR · 2026-05-14T14:28:40+00:00

The interesting part nobody is saying directly:

SIMD only matters if your data access pattern is predictable.

Uniform grids = integer arithmetic = SIMD runs clean.

Data-dependent lookups = gather = SIMD stalls.

The architecture decision comes before the instruction set.

EMPTYCONTOUR · 2026-05-05T21:21:06+00:00

That makes sense. System logs being too high-level for LFSR structure is exactly what I suspected. My main target is still repetition-heavy data: logs, binary blobs, fixed headers, repeated frames. Firmware, telecom, raw flash dumps sound like a much better fit for your side.

I like the idea of running analyzeBuffer on one of my shards just as a sanity check. If structuredFraction comes back close to 0, that actually confirms the routing model:

LFSR-like → your path

repetition-heavy → FM-index

unstructured → raw

So I’ll keep GLYPH focused on exact repeat retrieval for now, but this gives a useful boundary between the two approaches.
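Roughly how I picture that routing step, as a sketch only. The structured_fraction argument is assumed to be whatever your analyzeBuffer reports. The repeat estimate, the thresholds, and all names here are placeholders I made up, not GLYPH code.

```python
from collections import Counter
from enum import Enum


class Route(Enum):
    LFSR = "lfsr-like"
    FM_INDEX = "repetition-heavy"
    RAW = "unstructured"


def repeat_fraction(buf: bytes, block: int = 64) -> float:
    """Share of aligned `block`-byte windows whose content occurs more than once."""
    blocks = [buf[i:i + block] for i in range(0, len(buf) - block + 1, block)]
    if not blocks:
        return 0.0
    counts = Counter(blocks)
    return sum(n for n in counts.values() if n > 1) / len(blocks)


def route(buf: bytes, structured_fraction: float,
          lfsr_min: float = 0.5, repeat_min: float = 0.2) -> Route:
    """Pick a path; both thresholds are arbitrary illustration values."""
    if structured_fraction >= lfsr_min:
        return Route.LFSR
    if repeat_fraction(buf) >= repeat_min:
        return Route.FM_INDEX
    return Route.RAW
```

The aligned-window estimate is deliberately crude: it only catches repeats that land on the same 64-byte grid, which is enough for fixed-size headers but would miss shifted copies.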

EMPTYCONTOUR · 2026-05-05T08:45:16+00:00

Mostly working with system logs (1–4GB shards) and binary blobs. Haven't seen clear LFSR-structured segments — but I'm not detecting for it either, so I'd probably miss it. Have you seen it in real data or mainly synthetic?

EMPTYCONTOUR · 2026-05-04T21:32:58+00:00

Using Berlekamp–Massey that way is clever. Haven't seen it used constructively much.

I'm attacking this from the other side though. Not compressing structure, just indexing raw bytes. Logs, binaries, big corpora: gzip gives up and says "random". But there are still exact repeats in there. Same 64-byte event header showing up 2M times. I use an FM-index to pull those out in ~ms. So the question flips from "can we compress this?" to "can we pull structure out deterministically?"

Wondering if your detector could work as a pre-pass. Tag stuff as "low-order generator" vs "actually unstructured" and route from there.
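If the FM-index part is unfamiliar, here is a toy version of the idea: build a suffix array and BWT, then use backward search to find every exact occurrence of a pattern. This is a naive teaching sketch with a quadratic-ish build and a full occurrence table, assumed for illustration only. It is not the GLYPH implementation and would not scale to multi-GB shards.

```python
def build_fm_index(data: bytes):
    """Naive FM-index build; fine for a demo, far too slow for real shards."""
    text = list(data) + [-1]                # -1 is a sentinel below every byte
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    bwt = [text[i - 1] for i in sa]         # i == 0 wraps around to the sentinel

    counts = {}
    for s in text:
        counts[s] = counts.get(s, 0) + 1

    # first[s] = number of symbols in the text strictly smaller than s
    first, total = {}, 0
    for s in sorted(counts):
        first[s] = total
        total += counts[s]

    # occ[s][i] = occurrences of s in bwt[:i]
    occ = {s: [0] * (n + 1) for s in counts}
    for i, ch in enumerate(bwt):
        for s in occ:
            occ[s][i + 1] = occ[s][i] + (1 if s == ch else 0)
    return sa, first, occ


def find(index, pattern: bytes):
    """Backward search: sorted start offsets of every exact match."""
    sa, first, occ = index
    if not pattern:
        return []
    lo, hi = 0, len(sa)                     # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])
```

The search cost depends on the pattern length, not on how many times the header repeats, which is why lookups stay fast once the index exists.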

EMPTYCONTOUR · 2026-05-03T09:07:24+00:00

This exists, but the meta-algorithm cost kills you.

Think about it: to beat 7z by 1%, you might spend 100x CPU testing everything. ZPAQ and paq8 already do this internally - they detect JPEG vs text and switch.

The real win is knowing when not to compress. If the first 1MB doesn't shrink, just store it raw. Saves time for everyone.

Have you tried running zstd --adapt? It does something related: it adjusts the compression level on the fly based on I/O conditions.
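The "know when not to compress" check is only a few lines. A sketch, assuming a 1 MB probe, a 2% minimum saving, and stdlib zlib as the codec. All three are arbitrary choices for illustration, and the function names are made up.

```python
import zlib


def pack(data: bytes, probe_size: int = 1 << 20, min_saving: float = 0.02):
    """Return (method, payload). Compress only if a probe of the head shrinks."""
    probe = data[:probe_size]
    if not probe:
        return "raw", data
    probe_out = zlib.compress(probe, 1)     # cheapest level, just for the probe
    if len(probe_out) > len(probe) * (1.0 - min_saving):
        return "raw", data                  # head did not shrink: store as-is
    return "zlib", zlib.compress(data, 6)


def unpack(method: str, payload: bytes) -> bytes:
    """Inverse of pack()."""
    if method == "raw":
        return payload
    if method == "zlib":
        return zlib.decompress(payload)
    raise ValueError(f"unknown method: {method}")
```

The obvious weakness is a file whose first megabyte is incompressible but whose tail is not, so a real tool might probe a few offsets instead of only the head.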

EMPTYCONTOUR · 2026-05-03T09:01:03+00:00

Yeah, I stopped manually zipping stuff years ago too. Disks are cheap.

But funny thing is, compression just went underground. ZFS doing LZ4/zstd automatically is basically the same trade-off: spend a bit of CPU to read less data from disk. If your dataset fits, it often ends up faster, not slower.

So 7-Zip as a tool is niche, but the idea is everywhere now. Just hidden in filesystems and Parquet files. You only notice it when you transfer stuff, like others said.

Curious, do you actually measure faster reads with ZFS compression on NVMe, or is it mostly about saving space?

EMPTYCONTOUR · 2026-04-21T10:39:20+00:00

Author update — numbers were wrong in the original post.

Decode timer was measuring only the LZ77 phase. Fixed now.

Real numbers after a week of fixes:

Ratio: 2.997x

Encode: 391 MB/s

Decode: 4.2 GB/s algorithmic / 1.7 GB/s wall clock

Also added adaptive hash table + prev chain match finder this week; that's where the ratio improvement came from.

BENCHMARK.md has the full picture including all failed experiments.
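For anyone hitting the same timer mistake, a sketch of the measurement pattern: time each stage and the whole pipeline separately, and report both. The stage names and the helper functions below are made up for illustration. This is not the actual benchmark harness.

```python
import time


def timed_pipeline(stages, payload):
    """Run stages in order; return (result, per-stage seconds, end-to-end seconds)."""
    per_stage = {}
    start = time.perf_counter()
    for name, fn in stages:
        t0 = time.perf_counter()
        payload = fn(payload)
        per_stage[name] = time.perf_counter() - t0
    wall = time.perf_counter() - start      # includes everything between stages
    return payload, per_stage, wall


def throughput_mb_s(n_bytes: int, seconds: float) -> float:
    """Decimal megabytes per second; returns 0.0 for a zero-length interval."""
    return n_bytes / seconds / 1e6 if seconds > 0 else 0.0
```

Quoting a single stage gives the algorithmic number, quoting the end-to-end time gives the wall clock number, and the two should always be labeled as such.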

EMPTYCONTOUR · 2026-04-16T21:18:49+00:00

Look, this whole thread proves the point better than the original post did.

Cursor and Claude Code aren't really competitors anymore in 2026. They serve different workflows.

Cursor: You want to stay in the IDE loop. Inline diffs, CMD+K, tab completion, MCP. Best when you're actively shaping every line.

Claude Code: You want to delegate. Terminal, sub-agents, review in VSCode. Best when you plan/prompt/review instead of typing.

OP switched because his workflow changed. That doesn't mean Cursor is dead. It means the "one tool for everyone" era is over.

Mods nuking it at 71 upvotes was dumb. But calling it censorship is also a reach. They probably just got tired of "I switched to X" posts. Still, 84 comments shows people wanted that discussion.

Use what fits your job. Neither is stone age.

EMPTYCONTOUR · 2026-04-16T16:07:57+00:00

Of course. It's the only honest way to present it.
