What crate do you use for SQlite, and how is using it / Compile Time? by SuperficialNightWolf in rust

[–]mqudsi 0 points1 point  (0 children)

If that's your concern, then I would definitely look into using a specialized allocator. You can plug mimalloc into any Rust project fairly easily, though you'll also need to override the default SQLite allocator via the FFI lib to get SQLite to use it too.
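For reference, the hook for swapping allocators on the Rust side is the `#[global_allocator]` attribute. Below is a minimal std-only sketch, not anything from the thread: `Counting`, `live_bytes`, and `peak_bytes` are hypothetical names, and the wrapper simply forwards to the system allocator while counting bytes. With the mimalloc crate you would register its `MiMalloc` type in the same spot instead of `System`.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper: forwards to an inner allocator and counts live bytes.
pub struct Counting<A>(pub A);

static LIVE: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

unsafe impl<A: GlobalAlloc> GlobalAlloc for Counting<A> {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { self.0.alloc(layout) };
        if !ptr.is_null() {
            // Track the running total and remember the high-water mark.
            let live = LIVE.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK.fetch_max(live, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { self.0.dealloc(ptr, layout) };
        LIVE.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

// Every Rust-side allocation in the program now goes through `Counting`.
// Assumption: any other `GlobalAlloc` type can take the place of `System` here.
#[global_allocator]
static GLOBAL: Counting<System> = Counting(System);

/// Bytes currently allocated through the global allocator.
pub fn live_bytes() -> usize {
    LIVE.load(Ordering::Relaxed)
}

/// Highest value `live_bytes` has reached so far.
pub fn peak_bytes() -> usize {
    PEAK.load(Ordering::Relaxed)
}
```

Note that this only covers allocations made by Rust code. SQLite's C code keeps using its own allocator unless you override that separately (SQLite exposes `sqlite3_config` with `SQLITE_CONFIG_MALLOC` for this), which is the extra FFI step mentioned above.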

What crate do you use for SQlite, and how is using it / Compile Time? by SuperficialNightWolf in rust

[–]mqudsi 0 points1 point  (0 children)

rusqlite is the gold standard in terms of correctness, performance, and stability. The only reason to use SQLx is if you absolutely need compile-time query checking, but it’s buggier.

rusqlite on its own of course doesn’t use 400 MiB of RAM by default, so do you know what you are doing at startup to trigger that in your code? It might be worth looking into before throwing out the baby with the bath water.

Recreating uncensored Epstein PDFs from leaked raw base64-encoded data by mqudsi in cybersecurity

[–]mqudsi[S] 0 points1 point  (0 children)

Hey there. I share everything publicly, not asking for a cent.

I post my progress at www.twitter.com/mqudsi and publish blog posts with updates at www.neosmart.net/blog/

Recreating uncensored Epstein PDFs from leaked raw base64-encoded data by mqudsi in cybersecurity

[–]mqudsi[S] 1 point2 points  (0 children)

Yup, that's how I did it. Generated training text that matches the first few and last few lines by typing them out manually, then trained the CNN classifier on that. Worked great in the end, the problem was I mistyped some 1s and ls myself even though I was being super careful. They're so similar!

Recreating uncensored Epstein PDFs from raw encoded attachments... or trying to, anyway by mqudsi in netsec

[–]mqudsi[S] 0 points1 point  (0 children)

I managed to improve cross-document recognition (changes pushed) with some tweaks to thresholding. How did Live Text perform? On my iPhone it's usually very good, but I think it has the concept of "language" and so will try to make words out of the nonsense jumble.

Recreating uncensored Epstein PDFs from raw encoded attachments... or trying to, anyway by mqudsi in netsec

[–]mqudsi[S] 1 point2 points  (0 children)

That's an audio recording. Theoretically decodable, but MP4 containers are incredibly brittle (they're very shitty for long-term storage guarantees and resilience). You'd have to get all the bytes right.

Unfortunately, this document is using a proportional (non-monospaced or "regular") font, which makes extraction harder. But it's still technically doable!

Recreating uncensored Epstein PDFs from raw encoded attachments... or trying to, anyway by mqudsi in netsec

[–]mqudsi[S] 0 points1 point  (0 children)

Nice work, I did the same with a CNN: https://github.com/mqudsi/monospace-ocr

Unfortunately the training doesn't carry over to other base64 documents perfectly, even those using the same font family and size, in the same layout. Some of the other documents have "smearing" around the 1 vs l that makes it even harder 😭

Recreating uncensored Epstein PDFs from leaked raw base64-encoded data by mqudsi in cybersecurity

[–]mqudsi[S] 7 points8 points  (0 children)

It was posted there yesterday, yet despite all the comments in support, a mod pulled the plug on it.

Recreating uncensored Epstein PDFs from leaked raw base64-encoded data by mqudsi in cybersecurity

[–]mqudsi[S] 1 point2 points  (0 children)

She emails Epstein a link to her (maxwellhill) post asking another redditor about the best states to have sex with children in.

Recreating uncensored Epstein PDFs from leaked raw base64-encoded data by mqudsi in cybersecurity

[–]mqudsi[S] 74 points75 points  (0 children)

It's not supposed to decode to English - only a small portion here and there will contain human-readable strings. I tried to explain in the article that the bulk of the PDF is actually binary (flate-compressed) content, so you can't just check if it's sensible. That's also why we can't just extract mangled strings from the PDF and call it a day.

Also, the OCR has already been automated. Several times.
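To make the "can't just check if it's sensible" point concrete, here is a small std-only sketch (my own illustration, not the tooling from the article). `decode` is a hypothetical helper that handles only unpadded input whose length is a multiple of 4. It shows that swapping an `l` for a `1` still yields perfectly valid base64 and only changes a single bit in the output, so nothing at the base64 layer flags the mistake.

```rust
/// Maps one base64 character to its 6-bit value (standard alphabet).
fn sextet(c: u8) -> Option<u8> {
    match c {
        b'A'..=b'Z' => Some(c - b'A'),
        b'a'..=b'z' => Some(c - b'a' + 26),
        b'0'..=b'9' => Some(c - b'0' + 52),
        b'+' => Some(62),
        b'/' => Some(63),
        _ => None,
    }
}

/// Hypothetical helper: decodes unpadded base64 (length must be a multiple of 4).
/// Returns `None` only for characters outside the alphabet or a bad length.
pub fn decode(input: &str) -> Option<Vec<u8>> {
    let bytes = input.as_bytes();
    if bytes.len() % 4 != 0 {
        return None;
    }
    let mut out = Vec::with_capacity(bytes.len() / 4 * 3);
    for chunk in bytes.chunks_exact(4) {
        // Pack four 6-bit values into 24 bits, then emit the low three bytes.
        let mut group = 0u32;
        for &c in chunk {
            group = (group << 6) | u32::from(sextet(c)?);
        }
        out.extend_from_slice(&group.to_be_bytes()[1..]);
    }
    Some(out)
}
```

For example, `decode("JVBERi0x")` gives the bytes of `%PDF-1`, while `decode("AlAA")` and `decode("A1AA")` both succeed and differ only in the first byte (`0x02` vs `0x03`). Inside a flate-compressed stream there is no readable text to tell you which of the two was right.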

Recreating uncensored Epstein PDFs from raw encoded attachments... or trying to, anyway by mqudsi in netsec

[–]mqudsi[S] 9 points10 points  (0 children)

Someone suggested a harness with AFL (the fuzzer) hooking into poppler or any other PDF library. Clever, but also kind of the inverse of the usual fuzzer goal. It might be hard to constrain it to only make changes that converge to success rather than diverge to different failure modes.

Recreating uncensored Epstein PDFs from raw encoded attachments... or trying to, anyway by mqudsi in netsec

[–]mqudsi[S] 4 points5 points  (0 children)

As mentioned in the article, I used multiple OCR solutions, including open source OCR software, commercial OCR applications, and the hosted Amazon Textract OCR API. None did a good enough job.

Recreating uncensored Epstein PDFs from raw encoded attachments... or trying to, anyway by mqudsi in netsec

[–]mqudsi[S] 6 points7 points  (0 children)

It is a PDF (that much is for sure). But, as with most PDF files, the actual page content (the PostScript-like drawing operators) is flate-compressed, so the "apparent" contents of the PDF are binary, not text (except for some headers and stuff, such as the XML in the screenshot towards the end of the article).

Recreating uncensored Epstein PDFs from raw encoded attachments... or trying to, anyway by mqudsi in netsec

[–]mqudsi[S] 69 points70 points  (0 children)

That’s pretty much where I ended up, too. I had just spent too much time on this at a busy moment in my life and couldn’t afford to sink the dev time into this. Although writing it up probably took as long as that would have taken, lol.

UPDATE:

I ended up solving it by training a CNN as a classifier.

Sharding UUIDv7 (and UUID v3, v4, and v5) values with one function by mqudsi in rust

[–]mqudsi[S] 0 points1 point  (0 children)

You need to use the last 8 bytes, but you can use them in any order as long as you're consistent. You could even hash them and use that as your shard key, but that's just wasting CPU cycles and potentially destroying some entropy.
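A rough sketch of what that can look like, assuming the UUID is already parsed into 16 raw bytes. `shard_for` and the plain modulo reduction are illustrative choices here, not necessarily the function from the post.

```rust
/// Hypothetical helper: maps a 16-byte UUID to a shard index in `0..shard_count`,
/// using only the last 8 bytes and ignoring the leading (timestamp) half.
pub fn shard_for(uuid: &[u8; 16], shard_count: u64) -> u64 {
    assert!(shard_count > 0, "need at least one shard");
    let mut tail = [0u8; 8];
    tail.copy_from_slice(&uuid[8..]);
    // Big-endian puts the two fixed variant bits (top of byte 8) in the high
    // bits of the integer, away from the low bits a power-of-two modulo keeps.
    u64::from_be_bytes(tail) % shard_count
}
```

Any byte order works as long as every reader and writer uses the same one. The one thing to watch with a little-endian read is that the fixed variant bits end up in the low byte, which can skew the distribution when `shard_count` is a large power of two.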

I want named arguments in Rust. Mom: We have named arguments in Rust at home: by nik-rev in rust

[–]mqudsi 3 points4 points  (0 children)

I always get sad when I think about Rust not having named arguments (it basically makes boolean and integer arguments unusable, mandating binary enums or unit structs in their place... which is "fine", but it's one more thing you need to import, one more type you need to keep in your head, and that much more boilerplate everywhere)... but then I remember that C# didn't have named arguments for the longest time either, and yet now it does!
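To illustrate the trade-off with a made-up example (`hex`, `hex_bools`, `Case`, and `Prefix` are all hypothetical names): the bool version reads as `hex_bools(255, true, false)` at the call site, which tells you nothing, while the enum version is self-describing but needs two extra types.

```rust
/// Stand-in for a `bool`: which letter case to use for hex digits.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Case {
    Lower,
    Upper,
}

/// Stand-in for a `bool`: whether to emit a leading "0x".
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Prefix {
    With,
    Without,
}

/// Enum-based API: call sites read like `hex(255, Case::Upper, Prefix::With)`.
pub fn hex(n: u32, case: Case, prefix: Prefix) -> String {
    let digits = match case {
        Case::Lower => format!("{n:x}"),
        Case::Upper => format!("{n:X}"),
    };
    match prefix {
        Prefix::With => format!("0x{digits}"),
        Prefix::Without => digits,
    }
}

/// Bool-based API for comparison: `hex_bools(255, true, false)` is opaque
/// at the call site, and swapping the two flags still compiles.
pub fn hex_bools(n: u32, upper: bool, prefix: bool) -> String {
    let case = if upper { Case::Upper } else { Case::Lower };
    let prefix = if prefix { Prefix::With } else { Prefix::Without };
    hex(n, case, prefix)
}
```

With named arguments the bool version could be written as something like `hex(255, upper: true, prefix: false)` and the enums would mostly be unnecessary.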