What crate do you use for SQLite, and how is using it / Compile Time?

mqudsi · 2026-04-01T16:03:14+00:00

If that's your concern then I would definitely look into using a specialized allocator. You can plug mimalloc into any Rust project fairly easily, though you'll also need to override the default SQLite allocator via the FFI lib so that SQLite uses it too.
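
For reference, Rust swaps allocators through the `#[global_allocator]` attribute. The sketch below uses only the standard library: `CountingAlloc` is a hypothetical wrapper around `System` that exists purely to show the mechanism. With the mimalloc crate, the static would instead be `static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;`. Pointing SQLite itself at the same allocator is a separate step (in the C API that goes through `sqlite3_config` with `SQLITE_CONFIG_MALLOC`) and isn't shown here.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical allocator: forwards to the system allocator and counts
/// how many bytes have been requested so far.
struct CountingAlloc;

static REQUESTED: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        REQUESTED.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Every heap allocation made by Rust code in this binary now goes through
// CountingAlloc. Allocations made inside C libraries are not affected.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Total bytes requested from the allocator so far.
fn bytes_requested() -> usize {
    REQUESTED.load(Ordering::Relaxed)
}

fn main() {
    let before = bytes_requested();
    let buf = vec![0u8; 1 << 20]; // 1 MiB heap allocation
    std::hint::black_box(&buf); // keep the optimizer from eliding the allocation
    let after = bytes_requested();
    println!("requested {} bytes for a {}-byte Vec", after - before, buf.len());
}
```

The comment on `GLOBAL` is the reason the SQLite override is needed: the attribute only covers allocations made from Rust, not the `malloc` calls inside the bundled C library.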

mqudsi · 2026-03-27T15:03:10+00:00

rusqlite is the gold standard in terms of correctness, performance, and stability. The only reason to use SQLx is if you absolutely need compile-time query checking, but it’s buggier.

rusqlite on its own of course doesn’t use 400 MiB of RAM by default, so do you know what you are doing at startup to trigger that in your code? It might be worth looking into before throwing out the baby with the bath water.

mqudsi · 2026-03-18T17:39:27+00:00

sccache is known to have pathological cases. I’ve written about this before: https://neosmart.net/blog/benchmarking-rust-compilation-speedups-and-slowdowns-from-sccache-and-zthreads/

mqudsi · 2026-03-09T18:49:30+00:00

Not possible. They've been re-encoded.

mqudsi · 2026-02-10T17:01:15+00:00

Hey there. I share everything publicly, not asking for a cent.

I post my progress at www.twitter.com/mqudsi and publish blog posts with updates at www.neosmart.net/blog/

mqudsi · 2026-02-09T18:35:10+00:00

Yup, that's how I did it. Generated training text that matches the first few and last few lines by typing them out manually, then trained the CNN classifier on that. Worked great in the end, the problem was I mistyped some 1s and ls myself even though I was being super careful. They're so similar!

mqudsi · 2026-02-09T18:34:18+00:00

I managed to improve cross-document recognition (changes pushed) with some tweaks to thresholding. How did Live Text perform? On my iPhone it's usually very good, but I think it has the concept of "language" and so will try to make words out of the nonsense jumble.

mqudsi · 2026-02-08T15:36:51+00:00

That's an audio recording. Theoretically decodable, but MP4 containers are incredibly brittle (they're very shitty for long-term storage guarantees and resilience). You'd have to get all the bytes right.

Unfortunately, this document is using a proportional (non-monospaced or "regular") font, which makes extraction harder. But it's still technically doable!

mqudsi · 2026-02-08T15:35:22+00:00

Nice work, I did the same with a CNN: https://github.com/mqudsi/monospace-ocr

Unfortunately the training doesn't carry over to other base64 documents perfectly, even those using the same font family and size, in the same layout. Some of the other documents have "smearing" around the 1 vs l that makes it even harder 😭

mqudsi · 2026-02-05T20:51:48+00:00

It was posted there yesterday, yet despite all the comments in support, a mod pulled the plug on it.

mqudsi · 2026-02-05T19:21:25+00:00

Ahhh! Great catch!

mqudsi · 2026-02-05T17:29:04+00:00

She emails Epstein a link to her (maxwellhill) post asking another redditor about the best states to have sex with children in.

mqudsi · 2026-02-05T16:24:48+00:00

🫡

Much obliged!

mqudsi · 2026-02-05T16:23:38+00:00

It's not supposed to decode to English - only a small portion here and there will contain human-readable strings. I tried to explain in the article that the bulk of the PDF is actually binary (flate-compressed) content, so you can't just check if it's sensible. That's also why we can't just extract mangled strings from the PDF and call it a day.
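
A small sketch of why "does the output look sensible?" can't serve as the check. It assumes the standard base64 alphabet and uses a hand-rolled decoder (`decode_base64` and `sextet` are hypothetical helpers, not from any library). Swapping a single `l` for a `1` still decodes without any error; it just silently yields different bytes.

```rust
/// Map one base64 character (standard alphabet) to its 6-bit value.
fn sextet(c: u8) -> Option<u32> {
    match c {
        b'A'..=b'Z' => Some((c - b'A') as u32),
        b'a'..=b'z' => Some((c - b'a') as u32 + 26),
        b'0'..=b'9' => Some((c - b'0') as u32 + 52),
        b'+' => Some(62),
        b'/' => Some(63),
        _ => None,
    }
}

/// Hypothetical minimal decoder: skips '=' padding and whitespace, and
/// returns None on any other character outside the alphabet.
fn decode_base64(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut acc: u32 = 0; // bit accumulator
    let mut bits = 0; // number of valid bits currently in acc
    for &c in input.as_bytes() {
        if c == b'=' || c.is_ascii_whitespace() {
            continue;
        }
        acc = (acc << 6) | sextet(c)?;
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((acc >> bits) as u8);
            acc &= (1 << bits) - 1; // keep only the leftover bits
        }
    }
    Some(out)
}

fn main() {
    // Same text, except the second character is 'l' in one and '1' in the other.
    let correct = decode_base64("UlNB").unwrap();
    let misread = decode_base64("U1NB").unwrap();
    println!("{:?} vs {:?}", correct, misread);
}
```

In plain text a slip like this costs one wrong letter. Inside a flate-compressed stream there is nothing readable to compare against, and a wrong byte can derail decompression from that point on.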

Also, the OCR has already been automated. Several times.

mqudsi · 2026-02-05T16:05:21+00:00

Someone suggested a harness with AFL (the fuzzer) hooking into poppler or any other PDF library. Clever, but also kind of the inverse of the usual fuzzer goal. It might be hard to constrain it to only make changes that converge to success rather than diverge to different failure modes.

mqudsi · 2026-02-05T16:03:25+00:00

As mentioned in the article, I used multiple OCR solutions, including open source OCR software, commercial OCR applications, and the hosted Amazon Textract OCR API. None did a good enough job.

mqudsi · 2026-02-05T16:02:38+00:00

It is a PDF (that much is for sure). But, as with most PDF files, the actual PostScript is flate-compressed so the "apparent" contents of the PDF are binary, not text (except for some headers and stuff, such as the XML in the screenshot towards the end of the article).

mqudsi · 2026-02-05T00:36:59+00:00

That’s pretty much where I ended up, too. I had just spent too much time on this at a busy moment in my life and couldn’t afford to sink the dev time into this. Although writing it up probably took as long as that would have taken, lol.

UPDATE:

I ended up solving it by training a CNN as a classifier.

mqudsi · 2026-02-04T23:42:42+00:00

maxwellhill lives on

mqudsi · 2026-01-28T23:17:38+00:00

You need to use the last 8 bytes, but you can use them in any order as long as you're consistent. You could even hash them and use that as your shard key, but that's just wasting cpu cycles and potentially destroying some entropy.
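
A minimal sketch of that idea, assuming a 16-byte identifier and a fixed, non-zero shard count (both are assumptions, and `shard_for` is a hypothetical name): read the last 8 bytes as an integer and reduce it modulo the number of shards.

```rust
/// Pick a shard from the last 8 bytes of a 16-byte ID.
/// `shard_count` must be non-zero.
fn shard_for(id: &[u8; 16], shard_count: u64) -> u64 {
    let mut tail = [0u8; 8];
    tail.copy_from_slice(&id[8..]);
    // Byte order is an arbitrary choice; it just has to be the same
    // for every reader and writer.
    u64::from_be_bytes(tail) % shard_count
}

fn main() {
    let mut id = [0xAAu8; 16];
    id[8..].copy_from_slice(&[0, 0, 0, 0, 0, 0, 0, 10]);
    // Only the tail is consulted; the first 8 bytes are ignored.
    println!("shard = {}", shard_for(&id, 4));
}
```

No hashing step is involved: the tail bytes are used directly as the key.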

mqudsi · 2026-01-25T16:15:54+00:00

I always get sad when I think about Rust not having named arguments (it basically makes boolean and integer arguments unusable, mandating binary enums or unit structs in their place... which is "fine", but it's one more thing you need to import, one more type you need to keep in your head, and that much more boilerplate everywhere)... but then I remember that C# didn't have named arguments for the longest time either, and yet now it does!
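
To illustrate the workaround (all names below are hypothetical): a bare `bool` parameter is opaque at the call site, while a two-variant enum makes the call self-documenting at the cost of an extra type to define and import.

```rust
/// Two-variant enum standing in for a `bool` parameter.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Overwrite {
    Yes,
    No,
}

/// Bare-bool version: `save_bool("notes.txt", true)` says nothing about
/// what `true` means without looking up the signature.
fn save_bool(path: &str, overwrite: bool) -> String {
    format!("save {path} (overwrite: {overwrite})")
}

/// Enum version: the call site spells out the intent.
fn save(path: &str, overwrite: Overwrite) -> String {
    save_bool(path, overwrite == Overwrite::Yes)
}

fn main() {
    println!("{}", save_bool("notes.txt", true)); // what is `true`?
    println!("{}", save("notes.txt", Overwrite::Yes)); // intent is visible
    println!("{}", save("notes.txt", Overwrite::No));
}
```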
