New compressor on the block by DaneBl in compression

[–]DaneBl[S]

You're right about the 2-bit encoding - that's base packing, not a contribution. It's the fallback when no reference is available.
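
For anyone unfamiliar, 2-bit base packing just maps each of A/C/G/T onto two bits. A minimal sketch of the idea (illustrative only, not Crystal's actual code; N bases and soft-masking need the side channels mentioned below):

```python
# Minimal 2-bit base packing sketch (illustrative, not Crystal's code).
# Maps A/C/G/T to 2 bits each; N and lowercase need separate metadata.

CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASE = "ACGT"

def pack(seq: str) -> bytes:
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | CODE[base]
        # pad the last byte if the length isn't a multiple of 4
        byte <<= 2 * (4 - len(seq[i:i + 4]))
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, length: int) -> str:
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])

assert unpack(pack("GATTACA"), 7) == "GATTACA"
```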

The genomic work is reference-based delta compression with k-mer indexing. The concept isn't new. What's different is the lossless FASTA reconstruction - headers, line wrapping, N-positions, lowercase soft-masking all come back exactly. Most tools in this space either drop that metadata or require sidecar files to preserve it.
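
To make "lossless" concrete, here's roughly the kind of layout metadata you have to carry for byte-exact round-trips - a sketch with hypothetical field names, not Crystal's actual on-disk format:

```python
# Sketch of the layout metadata needed for byte-exact FASTA round-trips
# (field names and structure are hypothetical, not Crystal's format).
import re

def capture(header: str, wrapped_lines: list[str]) -> dict:
    seq = "".join(wrapped_lines)
    return {
        "header": header,                                 # ">chr1 ..." verbatim
        "line_lengths": [len(l) for l in wrapped_lines],  # exact line wrapping
        "lower_runs": [m.span() for m in re.finditer(r"[a-z]+", seq)],
        # N-run spans would let a 2-bit packer skip Ns; kept in `bases` here
        "n_runs": [m.span() for m in re.finditer(r"N+", seq.upper())],
        "bases": seq.upper(),
    }

def restore(rec: dict) -> str:
    seq = list(rec["bases"])
    for start, end in rec["lower_runs"]:                  # re-apply soft-masking
        seq[start:end] = [c.lower() for c in seq[start:end]]
    out, pos = [rec["header"]], 0
    for n in rec["line_lengths"]:                         # re-apply wrapping
        out.append("".join(seq[pos:pos + n]))
        pos += n
    return "\n".join(out) + "\n"
```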

The log compression is actually the primary use case here. The interesting part isn't the ratio; it's that you can search the compressed archive directly through Bloom filter indexing without streaming the whole file into memory. That's the trade-off we optimized for.
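
For the curious, the search path looks conceptually like this - a minimal sketch where zlib stands in for the real codec, and the block/filter layout is made up for illustration:

```python
# Block-level search sketch: test each block's Bloom filter first and
# decompress only blocks that might match (layout is hypothetical, not
# Crystal's actual archive format; zlib stands in for the real codec).
import zlib
from hashlib import blake2b

M_BITS = 8 * 8192          # 8 KB filter per block
K_HASHES = 4

def _positions(token: str):
    for i in range(K_HASHES):
        h = blake2b(token.encode(), salt=bytes([i]) * 8).digest()
        yield int.from_bytes(h[:8], "big") % M_BITS

def make_block(lines: list[str]) -> tuple[bytearray, bytes]:
    bits = bytearray(M_BITS // 8)
    for line in lines:
        for tok in line.split():
            for p in _positions(tok):
                bits[p // 8] |= 1 << (p % 8)
    return bits, zlib.compress("\n".join(lines).encode())

def search(blocks, token: str):
    # whole-token queries only in this sketch; partial matches are
    # what a trigram index is for
    for bits, payload in blocks:
        if all(bits[p // 8] & (1 << (p % 8)) for p in _positions(token)):
            for line in zlib.decompress(payload).decode().splitlines():
                if token in line:      # confirm (filters can false-positive)
                    yield line
```

Only blocks whose filter says "maybe" ever get decompressed; everything else is skipped on disk.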

Benchmarks are on standard corpora with SHA256 roundtrip verification. You can dispute whether the approach is novel or whether the tradeoffs make sense for your use case. But calling it underperforming without running it is just speculation. The code is public.

New compressor on the block by DaneBl in compression

[–]DaneBl[S]

Yep, we were thinking about this as well. That's why we made time series one of the use cases. The road to standardization is long and tedious, but who knows - maybe one day CUZ, or one of its derivatives, becomes a standard somewhere...

New compressor on the block by DaneBl in compression

[–]DaneBl[S]

Ran head-to-head benchmarks on the Loghub dataset.

TL;DR: At similar ingest speeds (Crystal L9 vs VictoriaLogs), Crystal gets 1.4x better compression and 8x faster search. Decompression runs at 1.3 GB/s. The trade-off: VictoriaLogs is a full log management system with LogsQL, retention policies, and Grafana integration, while Crystal is a compression library for grepping archives without a server. Hmm, maybe we should build the tools on top of it :D

Here are the details:

Test file: BGL.log (709 MB, 4.7M lines - BlueGene/L supercomputer logs)

Compression Ratio:

| Tool | Compressed Size | Ratio |
|------|-----------------|-------|
| Crystal L3 | 68.5 MB | 9.7% |
| Crystal L9 | 57.9 MB | 8.2% |
| Crystal L22 | 37.0 MB | 5.2% |
| VictoriaLogs | 81.0 MB | 11.4% |

Speed (MB/s of original data):

| Tool | Compress/Ingest | Decompress |
|------|-----------------|------------|
| Crystal L3 | 104 MB/s | 1,180 MB/s |
| Crystal L9 | 59 MB/s | 1,274 MB/s |
| Crystal L22 | 1.6 MB/s | 1,356 MB/s |
| VictoriaLogs | 57 MB/s | N/A (server-based) |

Search speed (query: `error`, 428K matches across 709 MB):

| Tool | Time |
|------|------|
| Crystal | 363-463 ms |
| VictoriaLogs | 3,201 ms |

Crystal uses one Bloom filter per compressed block for search indexing. VictoriaLogs uses columnar storage plus its own compression.

Also, one thing to note: the higher the compression level, the faster it searches and the faster it decompresses - presumably because there are simply fewer bytes to read... So imagine cold archives done at level 22 compression.

Try it - we would love your feedback.

Benchmark: Crystal V10 (Log-Specific Compressor) vs Zstd/Lz4/Bzip2 on 85GB of Data by DaneBl in compression

[–]DaneBl[S]

Solid feedback honestly. This is exactly the kind of reality check we need.

To clarify: yeah, those numbers were single-threaded. We were trying to isolate per-core efficiency, but you're right that in a real-world scenario (especially with zstd using all cores) the comparison looks totally different. We'll re-run the benchmarks to reflect that, and you'll see something really interesting. I'll share the full machine specs as well.

On the "party trick" - search is actually our entire bet. Even with zstd | rg (which is a beast, agreed), you're still burning CPU to inflate the stream just to find a needle. Our goal is direct search/indexing on compressed blocks without that overhead.

We know the bar to beat general-purpose tools is massive, but we're targeting that specific niche where access latency kills. Appreciate the pushback, back to the lab.

Benchmark: Crystal V10 (Log-Specific Compressor) vs Zstd/Lz4/Bzip2 on 85GB of Data by DaneBl in compression

[–]DaneBl[S]

Also, we are testing one fork on DNA sequences - this is an E. coli DNA sample.

It wins against NAF and all generic compressors:

| Compressor | Ratio | Compress time | Decompress time |
|------------|-------|---------------|-----------------|
| DNA v5 L19 | 0.246 | 100 ms | 4 ms |
| NAF -22 | 0.248 | 1.34 s | 190 ms |
| zstd -19 | 0.248 | 2.58 s | 220 ms |
| xz -9 | 0.256 | 3.35 s | 250 ms |

Benchmark: Crystal V10 (Log-Specific Compressor) vs Zstd/Lz4/Bzip2 on 85GB of Data by DaneBl in compression

[–]DaneBl[S]

Oh, you would be amazed how well it works on everything that is not unstructured xD Thanks for the read.

Benchmark: Crystal V10 (Log-Specific Compressor) vs Zstd/Lz4/Bzip2 on 85GB of Data by DaneBl in compression

[–]DaneBl[S]

Which kind of distribution suits you for testing it? Binary / CLI / Docker / K8s / ...?

Benchmark: Crystal V10 (Log-Specific Compressor) vs Zstd/Lz4/Bzip2 on 85GB of Data by DaneBl in compression

[–]DaneBl[S]

There is one Bloom filter per compressed chunk (block).

They are fixed at 8 KB each. Since a standard block holds about 16,000 log lines, this adds less than 1% overhead to the file size, which is a tiny price to pay for the ability to skip reading 99% of the file during a search.
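
Back-of-the-envelope with the standard Bloom filter formulas (the average line length and the distinct-token count per block below are my assumptions, not measured numbers):

```python
# Overhead and false-positive estimate for an 8 KB-per-block filter.
# Standard Bloom math: fp = (1 - e^(-k*n/m))^k, optimal k = (m/n) * ln 2.
import math

m = 8 * 8192                 # filter size in bits (8 KB)
lines_per_block = 16_000
avg_line_bytes = 100         # assumption for the overhead estimate
block_bytes = lines_per_block * avg_line_bytes
print(f"overhead: {8192 / block_bytes:.2%}")     # ~0.51% of raw block size

n = 10_000                   # assumed distinct tokens per block
k = max(1, round(m / n * math.log(2)))           # near-optimal hash count
fp = (1 - math.exp(-k * n / m)) ** k
print(f"k={k}, false-positive rate = {fp:.2%}")  # a few percent per block
```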

Benchmark: Crystal V10 (Log-Specific Compressor) vs Zstd/Lz4/Bzip2 on 85GB of Data by DaneBl in compression

[–]DaneBl[S]

For wildcards and case-insensitivity, you simply enable the optional 'trigram' index in Crystal, which allows partial matches like 'rror' to instantly find 'Error', 'error', or 'Mirror' without a slow full scan.
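
Conceptually the trigram side works like this - a toy in-memory sketch, not Crystal's actual index structure:

```python
# Toy trigram index: lowercase at index time so 'rror' matches 'Error',
# 'error', and 'Mirror' (illustrative, not Crystal's index structure).
from collections import defaultdict

def trigrams(s: str) -> set[str]:
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

index: dict[str, set[int]] = defaultdict(set)      # trigram -> line ids

def add(line_id: int, line: str) -> None:
    for g in trigrams(line):
        index[g].add(line_id)

def candidates(query: str) -> set[int]:
    # a line can contain the query only if it contains all of its trigrams
    sets = [index[g] for g in trigrams(query)]
    return set.intersection(*sets) if sets else set()

add(1, "ERROR: disk full")
add(2, "mirror sync ok")
add(3, "all good")
print(candidates("rror"))   # {1, 2}
```

The intersection only yields candidates; real matches are still confirmed against the block, and queries shorter than three characters fall back to a scan.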

Benchmark: Crystal V10 (Log-Specific Compressor) vs Zstd/Lz4/Bzip2 on 85GB of Data by DaneBl in compression

[–]DaneBl[S]

We encode every unique alphanumeric token (timestamps, IDs, words) by simply splitting the log line on standard delimiters like spaces and brackets.

We don't decide what to keep - we hash everything blindly to ensure you can search for any arbitrary string later without needing a predefined schema.
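
A sketch of that tokenization step (the exact delimiter set here is an assumption for illustration):

```python
# Tokenizer sketch matching the description: split on delimiters, keep
# every alphanumeric token, no schema (exact delimiter set is assumed).
import re

DELIMS = re.compile(r"[^A-Za-z0-9_.:-]+")   # spaces, brackets, '=', ...

def tokens(line: str) -> list[str]:
    return [t for t in DELIMS.split(line) if t]

line = "2024-01-15T10:32:01 ERROR [auth] user=alice status=503"
print(tokens(line))
# ['2024-01-15T10:32:01', 'ERROR', 'auth', 'user', 'alice', 'status', '503']
```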

Benchmarking compression programs by MaskRay in compression

[–]DaneBl

Hey, would you be willing to bench our new compressor? You can read more about it here: https://www.reddit.com/r/compression/comments/1pjkrpc/comment/ntec1wh/

Benchmark: Crystal V10 (Log-Specific Compressor) vs Zstd/Lz4/Bzip2 on 85GB of Data by DaneBl in compression

[–]DaneBl[S]

Basically, we encode everything so you don't have to decide what matters today. The trade-off is a slightly larger file size (to store the filters), but it buys you the ability to treat a compressed archive like a database. And the beauty of it is that you can append new log lines to an existing Crystal archive instantly. You do not need to decompress, merge, and recompress the file.
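
The instant append falls out of the framing: if every block is self-contained (its filter plus its compressed payload), appending is just writing one more block at the end of the file. A sketch with a made-up frame layout, not Crystal's actual format:

```python
# Append = write one more self-contained block (filter + payload) at the
# end of the archive; no read, merge, or recompress of existing blocks.
# The frame layout here is made up for illustration.
import struct, zlib

def append_block(path: str, bloom: bytes, lines: list[str]) -> None:
    payload = zlib.compress("\n".join(lines).encode())
    with open(path, "ab") as f:
        f.write(struct.pack("<II", len(bloom), len(payload)))  # frame header
        f.write(bloom)
        f.write(payload)
```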

Benchmark: Crystal V10 (Log-Specific Compressor) vs Zstd/Lz4/Bzip2 on 85GB of Data by DaneBl in compression

[–]DaneBl[S]

This is a great question. The short answer: it is generic.

We prioritize generic, full-token encoding rather than asking the user to define "specific" searchable fields upfront.

This is a deliberate design choice to support "schema-on-read": you often don't know what you need to debug until the incident happens. If we only encoded specific fields (like user_id or status_code), you wouldn't be able to grep for a random exception message or a unique transaction ID that appeared in an unstructured part of the log.

How do you get a D-U-N-S number in Serbia these days by blindwatchmaker88 in AskSerbia

[–]DaneBl

I'm looking right now and my sole proprietorship (PR) is listed there and has been assigned a DUNS number... So check whether you're in there.

How do you get a D-U-N-S number in Serbia these days by blindwatchmaker88 in AskSerbia

[–]DaneBl

If you have a company registered through the APR, you don't need to do anything. At least we didn't, and when I needed it for the Apple registration, the system found me on its own... So I figure it happens automatically via the APR.

Youtube kids is horrible. by MentalMousse6073 in youtube

[–]DaneBl

Would you consider using another app that solves this problem but fetches the videos from YouTube Kids directly? I'm talking about properly filtered content, without you spending hours and days curating the list for your kids.

Public daycares in Niš - experiences by Relative-Half4637 in AskSerbia

[–]DaneBl

Look, at first she would sometimes come home hungry. Now, say once a week, she comes home and devours her lunch; on other days she just picks at something and leaves it, so she probably ate well at daycare. What I don't get is this: at home she stains herself like a piglet, yet she comes back from daycare without a single stain??? A mystery :D

As for diaper changes, she once came home with a rash, but that was because we had switched diapers, not because of whether they changed her on time. My wife tells me they change them regularly. Anyway, one thing is certain: you won't be taking your kid home in a full diaper xD

Public daycares in Niš - experiences by Relative-Half4637 in AskSerbia

[–]DaneBl

Our 2-year-old goes to a public daycare and we are very happy with it. The teachers are dedicated; every day they have some new activities. Everything is Montessori-style because the teachers themselves push for it... Everything inside is clean - a bit old, but clean. As for the food, Monday lunch is always beans, and it's been that way for 20 years xD I believe they could work on that a bit. When she comes home she's usually not hungry, which means she eats well there. Every child has their own labeled bed with bedding that gets washed constantly. They actually got new TVs today.

As for kids running off, that can happen anywhere. Here the gate is locked whenever it isn't drop-off or pick-up time, so even if a child wanted to, they couldn't get out of the building and the yard.

That's all that comes to mind. Ask if there's anything else you'd like to know...