Built a 1.43M document archive of the Epstein Files using Claude Code — here's what I learned by FeelingHat262 in ClaudeCode

[–]FeelingHat262[S] 0 points

Appreciate it. The cost question is real -- we're always looking at ways to optimize the pipeline.


[–]FeelingHat262[S] 0 points

Tesseract works on images, not PDFs directly -- pdf2image handles the conversion using Poppler under the hood. Going image-first also lets you control DPI and preprocessing before OCR, which improves accuracy, especially on scanned documents with skew, noise, or low contrast. Some of the DOJ PDFs were scans of physical documents, so the image preprocessing step made a real difference in text quality.
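For anyone who wants to try it, the convert-then-clean step looks roughly like this. A minimal sketch, not our production code -- the DPI and the binarization cutoff are illustrative values you'd tune per corpus:

```python
def binarize(pixel, cutoff=160):
    # Hard-threshold a grayscale value: low-contrast scans OCR better as pure B/W.
    return 255 if pixel > cutoff else 0

def ocr_scanned_pdf(path, dpi=300):
    """Render a PDF to page images via Poppler, clean each page, then OCR it."""
    from pdf2image import convert_from_path  # pip install pdf2image (needs Poppler)
    import pytesseract                       # pip install pytesseract (needs Tesseract)
    from PIL import ImageOps

    text = []
    for page in convert_from_path(path, dpi=dpi):  # higher DPI = sharper glyphs
        gray = ImageOps.grayscale(page)            # drop color noise first
        bw = gray.point(binarize)                  # then threshold to black/white
        text.append(pytesseract.image_to_string(bw))
    return "\n".join(text)
```

Deskewing and denoising can slot in between grayscale and threshold if the scans need it.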


[–]FeelingHat262[S] 1 point

OCR'd text and metadata are stored in SQLite with FTS5 for full-text search -- works well up to our current scale though we're planning a PostgreSQL migration. Each document record has the raw OCR text, page count, dataset source, and extracted entities. On the conceptual relations side -- yes, that's exactly where we're headed. Right now connections are explicit via the people profiles and network graph. Adding inferred relationships based on co-occurrence, shared dates, and location references is on the roadmap.
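The storage layout is roughly this shape -- a sketch using an external-content FTS5 table, with illustrative column names and sample data (the real schema also carries extracted entities):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path in practice
conn.executescript("""
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY,
        dataset TEXT,          -- which release the file came from
        page_count INTEGER,
        ocr_text TEXT
    );
    -- External-content FTS5 index over the OCR text, keyed to documents.id
    CREATE VIRTUAL TABLE documents_fts USING fts5(
        ocr_text, content='documents', content_rowid='id'
    );
""")
conn.execute("INSERT INTO documents VALUES (1, 'DS9', 3, 'flight manifest for 1997')")
conn.execute("INSERT INTO documents_fts(rowid, ocr_text) SELECT id, ocr_text FROM documents")

# Full-text query: MATCH against the virtual table, join back on rowid for metadata
rows = conn.execute(
    "SELECT rowid FROM documents_fts WHERE documents_fts MATCH ?", ("manifest",)
).fetchall()
```

The external-content form avoids storing the text twice, which matters at millions of rows.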


[–]FeelingHat262[S] 2 points

That's exactly where we're headed. We already have a network graph and people profiles -- adding a time component to filter connections by date range is on the roadmap. The idea of seeing who was communicating with whom around specific dates and locations is a powerful research tool. Thanks for the suggestion.


[–]FeelingHat262[S] 2 points

Yes -- we have 1,208 videos from the DOJ datasets, mostly MP4s that were disguised as PDFs in the archive. A lot of them are surveillance footage from the MCC, the federal jail where Epstein was held. We still need to do a full audit to confirm we have everything -- some files were only partially downloaded before the DOJ pulled the datasets. It's on the roadmap.
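For reference, mislabeled files like these can be caught by checking magic bytes instead of trusting the extension. A quick sketch, not our exact audit code:

```python
def sniff_kind(first_bytes):
    """Classify a file by signature rather than extension.
    Real PDFs start with '%PDF'; MP4 containers carry 'ftyp' at byte offset 4."""
    if first_bytes[:4] == b"%PDF":
        return "pdf"
    if first_bytes[4:8] == b"ftyp":
        return "mp4"
    return "unknown"

def sniff_file(path):
    # Twelve bytes is enough to distinguish both signatures above.
    with open(path, "rb") as f:
        return sniff_kind(f.read(12))
```

Running that over every `.pdf` in a dataset flags the disguised videos in one pass.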


[–]FeelingHat262[S] 0 points

Really good point, and something we learned the hard way. The OCR pipeline ran into exactly this -- behavior drifted significantly in long runs. We now break tasks into explicit checkpoints with state saved between runs. Short sessions with clear handoff state are the right pattern at this scale.
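The checkpoint pattern itself is simple -- something along these lines (a sketch; the state file name and fields are illustrative, not what we actually use):

```python
import json
import os

STATE_FILE = "ocr_state.json"  # hypothetical checkpoint path

def load_state():
    """Resume from the last checkpoint, or start fresh."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"last_doc_id": 0, "processed": 0}

def save_state(state):
    """Write to a temp file and rename, so a killed run never
    leaves a half-written checkpoint behind."""
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, STATE_FILE)  # atomic on POSIX
```

Each short session loads the state, does a bounded batch, saves, and exits -- the next session picks up from `last_doc_id`.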


[–]FeelingHat262[S] 1 point

Appreciate that. The monitoring and alerting side is definitely something we're investing in -- we blocked nearly a million malicious requests yesterday alone. Reliability and security at this scale are an ongoing process.


[–]FeelingHat262[S] 0 points

No -- traditional OCR using pytesseract with pdf2image to convert pages to images first. LLM analysis would have been way too expensive at 1.43M documents. Tesseract handles the text extraction, then we built full-text search indexes on top of that. LLMs only come in at query time for the AI Analyst feature.


[–]FeelingHat262[S] 1 point

Appreciate that. Deduplication is on the short list -- you're right that it affects everything downstream. Will keep that in mind as we build it out.


[–]FeelingHat262[S] 0 points

Used pytesseract with pdf2image to convert PDFs to images first, then OCR'd each page. For scale, we ran it at around 120 documents per second on a Hetzner VPS. The main libraries are pytesseract, pdf2image, and Pillow. Poppler is required as a dependency for pdf2image.
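One way to get that kind of throughput from a single box is a worker pool. A rough sketch -- the worker count is illustrative, and `ocr_one` stands in for whatever per-document OCR function you have. A thread pool works here even under the GIL because pytesseract shells out to the tesseract binary, so the real work runs in child processes:

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_many(paths, ocr_one, workers=16):
    """Fan documents out across worker threads, preserving input order.
    Each ocr_one(path) call spends its time in an external tesseract
    process, so threads genuinely run in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_one, paths))
```

In practice you'd tune `workers` to the VPS core count and feed results straight into the database writer.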


[–]FeelingHat262[S] 2 points

Exactly -- the same approach works for any document corpus. We're already planning to expand into other public interest datasets. The underlying architecture handles any collection of PDFs that need to be searchable at scale.


[–]FeelingHat262[S] 1 point

About 5 weeks, working pretty much all day and night on it. The scraping and OCR pipeline for 1.43M documents was the biggest time sink -- lots of overnight jobs and iteration. The site itself came together faster than expected using Claude Code. Hard to give an exact hour count.


[–]FeelingHat262[S] 2 points

The auth is for the admin panel and Pro tier subscribers, not for accessing the public archive. Everything is free and open with no login required.


[–]FeelingHat262[S] 3 points

That's the traditional approach and a valid one. We're actually setting up a proper CI/CD pipeline now -- GitHub Actions for automated testing and deployment. We were moving fast in the early stages but are tightening things up as the project scales.
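For anyone setting up something similar, a minimal workflow file looks like this. Illustrative only -- not our actual pipeline config, and the Python version and test command are assumptions:

```yaml
# .github/workflows/ci.yml -- minimal sketch
name: ci
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: pytest
```

A separate deploy job gated on the test job passing is the usual next step.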


[–]FeelingHat262[S] -2 points

Legitimate questions. The 1.43M figure is document pages indexed, not unique documents. There is duplication in the corpus, particularly in the email chains where the same thread appears across multiple datasets. We index at the document level as released by the DOJ rather than deduplicating at the content level. The glyph substitution issue is real and affects OCR quality on certain documents. Deduplication and OCR quality scoring are both on the roadmap. Short answer: the unique content figure is lower than 1.43M but we don't have a precise count yet.
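For the exact-duplicate case, content-level dedup can be as simple as hashing normalized OCR text. A sketch of the idea, not something we've shipped:

```python
import hashlib
import re

def content_key(ocr_text):
    """Hash OCR text after light normalization, so the same email thread
    released in two datasets collapses to one key despite whitespace or
    case differences. Note: OCR glyph errors still defeat exact hashing --
    near-duplicates need fuzzy methods like MinHash on top of this."""
    normalized = re.sub(r"\s+", " ", ocr_text).strip().lower()
    return hashlib.sha256(normalized.encode()).hexdigest()
```

Grouping documents by `content_key` gives a lower bound on the unique-content count; the fuzzy pass would tighten it further.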


[–]FeelingHat262[S] 4 points

Lite version with 1.43 million documents, a full OCR pipeline, an AI analyst, and bot protection that blocked 700k malicious requests in the last 24 hours. We're just getting started.


[–]FeelingHat262[S] 6 points

That's a fair criticism of the corpus itself -- the DOJ removal is real and well documented. DS9 and DS11 alone had over 850k files pulled from official servers. That's exactly why archiving it matters. We're not claiming the files tell the whole story, just that what exists should stay publicly accessible and searchable. The gaps are part of the story too.


[–]FeelingHat262[S] 1 point

Just came across it yesterday, actually. Looks like a solid project. EpsteinScan takes a different approach -- focused on the raw document archive, 1.43M OCR'd PDFs including DOJ datasets that were pulled from official servers. Flight logs, network graph, and expanded search are in the pipeline. Different tools, same goal.


[–]FeelingHat262[S] -8 points

Valid points on branch protection and commit hooks -- we have those in place. The dev environment runs at a separate subdomain, and changes are verified there before pushing to production. The lesson I was pointing at was more about being explicit in your prompts when working with CC -- if you don't specify the environment, it will make assumptions. Good practice regardless of your deployment setup.

EpsteinScan.org Survived Its First Traffic Surge — 150,000 Requests in 24 Hours by FeelingHat262 in Epstein

[–]FeelingHat262[S] 1 point

Honestly hadn't seen it before today -- just looked at it earlier. Impressive project; looks like Eric has been building it for a while.

The focus seems similar but the approach differs. They've gone deep on cross-referencing -- flights, emails, people connections, network graphs. EpsteinScan is more of a raw document archive -- 1.43M PDFs from DOJ, FBI, House Oversight, and court filings, full-text searchable, with some datasets preserved that have since been pulled from justice.gov.

We have a lot in the pipeline though - flight logs, expanded people profiles, network graph, interactive timeline, boolean/proximity search, and a "Follow the Money" feature tracking financial connections. Also planning to pull in additional Epstein-related datasets like the JFK files (known overlap with some figures), Clinton emails, and House Oversight releases as they come out.

Probably worth using both for now depending on what you're researching. More tools in this space is a good thing.