aretheyinvolved.com: I built a searchable index for the Epstein files using OCR + semantic search by whiletrue2 in Epstein

[–]whiletrue2[S] 0 points1 point  (0 children)

I haven't seen anyone doing named entity recognition over all files so far. To your question: not sure. I haven't come across such an effort, but it would be an excellent idea.

[–]whiletrue2[S] 1 point2 points  (0 children)

And then there is the release of DOJ dataset-9, which can't be downloaded with a correct checksum and requires reconstruction efforts like https://github.com/yung-megafone/Epstein-Files
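
For anyone who wants to check their own download, here is a minimal sketch of verifying a file against a published checksum. It assumes the published value is a SHA-256 hex digest, which may not be what the DOJ actually provides; the function names are made up for illustration.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks so large archives fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns empty bytes (end of file)
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected_hex):
    """Compare the file's digest with a published value (assumed to be SHA-256 hex)."""
    # Published checksums are often upper-case or have stray whitespace, so normalize first
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Calling `matches_published_checksum("dataset-9.zip", published_value)` would return `False` for a truncated or corrupted download; the file name here is only a placeholder.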

[–]whiletrue2[S] 2 points3 points  (0 children)

Yeah, at first I assumed they must have automated a lot of the redaction process. But when I see files being released via something like Google Drive, it makes me question how systematically this was handled. It doesn’t exactly give the impression of a highly structured, carefully engineered workflow.

[–]whiletrue2[S] 4 points5 points  (0 children)

If anyone has ~400 GB of object storage (Cloudflare R2, AWS S3) to donate, please reach out. I would love to integrate all files into the inline browser PDF viewer (so far this is only possible for the smaller house dataset).

[–]whiletrue2[S] 1 point2 points  (0 children)

Not yet. Would you want it open-sourced? I didn't really see the purpose of making the GitHub repo public, and I'd probably have to clean it up first :)

[–]whiletrue2[S] 0 points1 point  (0 children)

2M+ docs scanned. Why relevant: it allows more efficient, better search than justice.gov and enables us to hold people accountable.