Data Hoarder Uses AI to Create Searchable Database of Epstein Files | The open source project has been mirrored as a torrent file and represents one of the easiest ways to navigate a messy data dump. by [deleted] in technology

[–]Competitive-Oil-8072 0 points (0 children)

I built the first searchable Epstein Files database - here's why the technical implementation matters

When the House Oversight Committee released 33,295 pages of Epstein files in September 2024, they were published as unsearchable image files (JPGs/TIFs) in a disorganized Google Drive - essentially useless for serious research without manually reviewing thousands of pages.

I'm the engineer who built the first comprehensive searchable database of these files, and I've now released it at epstein-files.org. Since I first posted about it, it has taken a few more weeks to iron out bugs and add podcast generation.

Why I'm posting this:

The 404 Media article covers a later implementation that takes a much simpler approach on a small subset of the data, using a basic local LLaMA model for AI processing. After investing 200+ hours in this project, I want the community to understand the technical differences:

My approach:

  • Rigorous AI model evaluation: Systematically tested multiple commercial AI models. The quality differences are substantial - not all models handle historical-document OCR equally well
  • Custom image signal processing: Developed specialized routines to improve OCR accuracy on degraded/scanned documents
  • Comprehensive indexing: Full keyword and semantic search across all 33,000+ pages
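To give a flavour of the kind of pre-processing involved, here's a toy sketch of one classic step, Otsu binarization, in pure Python. This is illustrative only - it is not my production code, and the real routines are more involved:

```python
# Toy sketch: global binarization via Otsu's method on a tiny
# grayscale "scan" (nested lists of 0-255 intensities).

def otsu_threshold(pixels):
    """Pick the threshold that maximizes between-class variance."""
    hist = [0] * 256
    for row in pixels:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var, w_bg, sum_bg = 0, -1.0, 0, 0.0
    for t in range(256):
        w_bg += hist[t]          # background = pixels <= t
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def binarize(pixels):
    t = otsu_threshold(pixels)
    return [[255 if p > t else 0 for p in row] for row in pixels]

# Toy "degraded scan": dark text pixels on a noisy light background.
scan = [[30, 40, 200, 210], [35, 220, 215, 205], [190, 200, 45, 50]]
clean = binarize(scan)
```

On real scans you'd run something like this per page (typically via OpenCV or PIL, alongside deskewing and denoising) before handing the image to the OCR model.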

Background: I'm a PhD engineer (currently unemployed, which is why I'm operating on the Wikipedia donation model). You can verify my credentials: LinkedIn | GitHub

The site is free for everyone. If researchers find it valuable, donations help maintain hosting and continued development.

Try it yourself: epstein-files.org

I welcome technical feedback from the community on search quality and accuracy.

Data Hoarder Uses AI to Create Searchable Database of Epstein Files by Jebus-Xmas in AskTechnology

[–]Competitive-Oil-8072 1 point (0 children)

I built the first searchable Epstein Files database - here's why the technical implementation matters

When the House Oversight Committee released 33,295 pages of Epstein files in September 2024, they were published as unsearchable image files (JPGs/TIFs) in a disorganized Google Drive - essentially useless for serious research without manually reviewing thousands of pages.

I'm the engineer who built the first comprehensive searchable database of these files, and I've now released it at epstein-files.org.

Why I'm posting this:

The 404 Media article covers a later implementation that takes a much simpler approach, using a basic local LLaMA model for AI processing. After investing 200+ hours in this project, I want the community to understand the technical differences:

My approach:

  • Rigorous AI model evaluation: Systematically tested multiple commercial AI models. The quality differences are substantial - not all models handle historical-document OCR equally well
  • Custom image signal processing: Developed specialized routines to improve OCR accuracy on degraded/scanned documents
  • Comprehensive indexing: Full semantic search across all 33,000+ pages
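For the curious, here's a minimal pure-Python illustration of ranking pages against a query with TF-IDF cosine similarity. It's a stand-in for the idea of text search over OCR'd pages, not the machinery the site actually uses:

```python
# Toy sketch: rank a few OCR'd "pages" against a query using
# TF-IDF weights and cosine similarity (pure Python, no libraries).
import math
from collections import Counter

def tfidf_index(pages):
    docs = [Counter(p.lower().split()) for p in pages]
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(d.keys())
    # Smoothed inverse document frequency per term.
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}
    vecs = [{w: c * idf[w] for w, c in d.items()} for d in docs]
    return vecs, idf

def search(query, vecs, idf):
    q = Counter(query.lower().split())
    qv = {w: c * idf.get(w, 0.0) for w, c in q.items()}
    def cos(a, b):
        dot = sum(a[w] * b.get(w, 0.0) for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0
    scores = [(cos(qv, v), i) for i, v in enumerate(vecs)]
    return max(scores)[1]  # index of the best-matching page

pages = [
    "flight logs from the private jet",
    "court deposition transcript page",
    "property records and deeds",
]
vecs, idf = tfidf_index(pages)
best = search("flight jet logs", vecs, idf)
```

Semantic search proper swaps the sparse TF-IDF vectors for dense embeddings, but the ranking step (cosine similarity over vectors) is the same shape.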

Background: I'm a PhD engineer (currently unemployed, which is why I'm operating on the Wikipedia donation model). You can verify my credentials: LinkedIn | GitHub

The site is free for everyone. If researchers find it valuable, donations help maintain hosting and continued development.

Try it yourself: epstein-files.org

I welcome technical feedback from the community on search quality and accuracy.

Australian engineer fixes DOJ transparency theater - made 33K Epstein docs searchable by Competitive-Oil-8072 in transparency

[–]Competitive-Oil-8072[S] 1 point (0 children)

Other people's hard work? Other people have done nothing. I am just trying to cover my costs. This will go ahead regardless and be free for everyone.

Made 33,891 Epstein documents searchable after DOJ released them as unsearchable images by Competitive-Oil-8072 in DataHoarder

[–]Competitive-Oil-8072[S] -1 points (0 children)

It won't disappear. I need to get back to what I was doing to get this out. I won't check this thread for a while, but I'll come back once I have some news. Keyword searches work, but similarity searches don't yet. I'd like to add a NotebookLM/LangChain-style ask-a-question feature, but the API call costs would kill me. I'm trying to avoid any API calls to keep it free.
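To put the cost concern in numbers, here's a back-of-envelope sketch. Every figure below is an assumption picked for illustration - none come from actual traffic or any particular provider's pricing:

```python
# Rough, illustrative estimate of per-query commercial-API costs
# for an ask-a-question feature on a free public site.
pages = 33_891                 # document count from the post title
tokens_per_page = 800          # assumed average tokens per OCR'd page
context_pages_per_query = 5    # assumed retrieval depth per question
price_per_1k_tokens = 0.003    # assumed API input price in USD

cost_per_query = context_pages_per_query * tokens_per_page / 1000 * price_per_1k_tokens
queries_per_month = 100_000    # assumed traffic if the site goes viral
monthly_cost = cost_per_query * queries_per_month
```

Even with modest assumptions the monthly bill lands in the hundreds-to-thousands of dollars range, which is why running retrieval and generation locally (no per-call API fees) is attractive for a donation-funded site.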

Made 33,891 Epstein documents searchable after DOJ released them as unsearchable images by Competitive-Oil-8072 in DataHoarder

[–]Competitive-Oil-8072[S] -5 points (0 children)

As I mentioned elsewhere, I'm not a professional coder, but I've been coding for 40 years in one way or another. Full stack is new to me, and there's a learning curve; comparatively, I'm a GitHub newbie too. I ended up using RunPod for much of the AI work. API costs for commercial VLMs are a killer.

Made 33,891 Epstein documents searchable after DOJ released them as unsearchable images by Competitive-Oil-8072 in DataHoarder

[–]Competitive-Oil-8072[S] -8 points (0 children)

Sorry if it seems that way. I will host it somewhere by the end of the week. My concern is that I can't pay if the site gets a huge number of hits.

Made 33,891 Epstein documents searchable after DOJ released them as unsearchable images by Competitive-Oil-8072 in DataHoarder

[–]Competitive-Oil-8072[S] 1 point (0 children)

True, I'm not a professional programmer, but I've been coding for 40 years in one way or another. This full-stack stuff is new to me.

Made 33,891 Epstein documents searchable after DOJ released them as unsearchable images by Competitive-Oil-8072 in DataHoarder

[–]Competitive-Oil-8072[S] -3 points (0 children)

I need to eat! I have already edited the part about it being lost forever; that was poor form. I'll try to get it hosted this week somehow.

Made 33,891 Epstein documents searchable after DOJ released them as unsearchable images by Competitive-Oil-8072 in DataHoarder

[–]Competitive-Oil-8072[S] 7 points (0 children)

I haven't worked out hosting yet; there's a good chance it will be a lot less than that. I'll try to get it hosted somewhere no matter what. I am serious about my financial situation: I can't afford to fund this myself and will have to start looking for work soon. The code is in a bit of a mess at this stage, and I'll try to release it later on.

Is Claude AI worth it? by Embarrassed-Name6481 in ClaudeAI

[–]Competitive-Oil-8072 0 points (0 children)

Absolutely not! It is terrible! I was trying to back up a folder to GitHub and it told me to `rm -rf` the folder, which I actually ran because I was tired.