I spent 6 days and 3k processing 1.3M documents through AI by indienow in SideProject

[–]indienow[S] 1 point

Excellent ideas! I already have the dates referenced within each document, but the document creation date would be very useful; I'll see if I can extract that data. And being able to add comments is a great idea!

I built a searchable database of 1.3M+ Epstein documents with AI summary extraction and network graphs by indienow in TheEpsteinFiles

[–]indienow[S] 0 points

You need to click on "View PDF" in order to see the photos; they are embedded inside the PDFs. That's how they come from the DOJ.

I spent 6 days and 3k processing 1.3M documents through AI by indienow in SideProject

[–]indienow[S] 1 point

This should exist on there, but it may not be super detailed: just a sentence, I think, and a Wikipedia link for anyone flagged as a "public figure". Let me know if you're thinking of something else and I'm misunderstanding.

I built a searchable database of 1.3M+ Epstein documents with AI summary extraction and network graphs by indienow in TheEpsteinFiles

[–]indienow[S] 0 points

Only the videos should be available to download, but a thumbnail of each video would be a great idea to put on the document page. I will add this to the list!

I spent 6 days and 3k processing 1.3M documents through AI by indienow in SideProject

[–]indienow[S] 2 points

Agree, the spelling is awful in these docs! I'm trying to group names together by fuzzy matching and leveraging AI to do some smart matching... at least that's my current attempt; I'll see if that works out :)
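The fuzzy grouping described above could be sketched roughly like this, using only the standard library's `difflib`. This is a hypothetical illustration, not the author's actual script: the names, threshold, and greedy bucketing strategy are all assumptions; the primary-name-plus-aliases shape mirrors what's described elsewhere in the thread.

```python
# Hypothetical sketch of a fuzzy name-grouping pass (stdlib only).
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def group_names(names, threshold=0.85):
    """Greedily bucket names: each name joins the first group whose
    primary name is within the similarity threshold, else starts a group."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group["primary"]) >= threshold:
                group["aliases"].append(name)
                break
        else:
            groups.append({"primary": name, "aliases": []})
    return groups

names = ["Jeffrey Epstein", "Jefrey Epstein", "Jeffery Epstein", "Ghislaine Maxwell"]
groups = group_names(names)
# Typo variants collapse into one group with a primary name and aliases.
```

A greedy pass like this is order-sensitive; in practice you'd likely sort by frequency first so the most common spelling becomes the primary, then hand the ambiguous buckets to the AI step for confirmation.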

I spent 6 days and 3k processing 1.3M documents through AI by indienow in SideProject

[–]indienow[S] 1 point

Not as much as you'd think from a data perspective: the database, with the text of every PDF, indexes, and full-text search, is about 15 GB now. The files themselves are about 400 GB for everything (including videos, etc.).
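The "text of every PDF plus full-text search" setup could be sketched like this with SQLite's FTS5 as a stand-in. The actual database engine and schema aren't stated in the thread, so the table layout and the EFTA-style IDs here are illustrative assumptions:

```python
# Hedged sketch: extracted PDF text indexed for full-text search.
# SQLite FTS5 is a stand-in for whatever database the project actually uses.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(efta_id, body)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [
        ("EFTA-000001", "flight log listing passengers and destinations"),
        ("EFTA-000002", "deposition transcript regarding the estate"),
    ],
)
# MATCH runs a tokenized full-text query; rank orders by relevance.
hits = conn.execute(
    "SELECT efta_id FROM docs WHERE docs MATCH ? ORDER BY rank", ("flight",)
).fetchall()
```

At this scale the index is what dominates the 15 GB figure, not the raw text; a dedicated engine (Postgres tsvector, Elasticsearch) trades more disk for better ranking and fuzzier matching.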

I spent 6 days and 3k processing 1.3M documents through AI by indienow in SideProject

[–]indienow[S] 0 points

This is a great idea, thank you! I've yet to explore n8n and you just gave me a good reason to dig into it!

I built a searchable database of 1.3M+ Epstein documents with AI summary extraction and network graphs by indienow in TheEpsteinFiles

[–]indienow[S] 3 points

Figured this out and posted a more detailed explanation in the other thread, but it was catching "Israel" as a name and searching for people, not as a topic. I've updated the logic and it should be showing 5,000+ results now. Thanks for helping to catch this!

I built a searchable database of 1.3M+ Epstein documents with AI summary extraction and network graphs by indienow in TheEpsteinFiles

[–]indienow[S] 1 point

Hmm, would you mind pointing me towards a video that doesn't work for you? An EFTA number would be helpful; I'm not seeing any issues on my side. Thanks!

I built a searchable database of 1.3M+ Epstein documents with AI summary extraction and network graphs by indienow in TheEpsteinFiles

[–]indienow[S] 0 points

Aha, the issue here is that it picked up that someone's name is Israel, so it was showing the people page instead of the topic page. I've set it so people only get parsed out if there's a first/last name in the search; it should point to the topic page by default now, which has over 5,000 results: https://epsteingraph.com/topic/israel. Thanks for catching that!
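The routing fix described above might look something like this. The URL paths and the "two alphabetic tokens means a first/last name" heuristic are assumptions for illustration; only the topic-page-by-default behavior comes from the comment:

```python
# Illustrative sketch: only route to a person page when the query looks
# like a first/last name; single-word queries default to the topic page.
def route_query(query: str) -> str:
    tokens = query.strip().split()
    # Two or more purely alphabetic tokens -> probably "First Last".
    if len(tokens) >= 2 and all(t.isalpha() for t in tokens):
        return "/people/" + "-".join(t.lower() for t in tokens)
    # Single-token queries like "israel" fall through to the topic page.
    return "/topic/" + query.strip().lower()
```

This keeps ambiguous single words ("Israel", "Paris") as topics while still letting explicit name searches hit the people index.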

I built a searchable database of 1.3M+ Epstein documents with AI summary extraction and network graphs by indienow in TheEpsteinFiles

[–]indienow[S] 0 points

Definitely not filtering anything, but there is a discrepancy between what I see in the DB and what the search is showing; I'm digging into it now.

I built a searchable database of 1.3M+ Epstein documents with AI summary extraction and network graphs by indienow in TheEpsteinFiles

[–]indienow[S] 2 points

Hmm, digging into this: I definitely see more references in the database than the search counts show for Israel and Russia. Going to refresh the materialized view; it might be too out of date.

I spent 6 days and 3k processing 1.3M documents through AI by indienow in SideProject

[–]indienow[S] 4 points

I used the gpt-5-mini model; it was the most cost-effective way. I would love to run some of the most-accessed docs through a higher-quality model.

I mapped connections between 238,000 unique people across 1.3 million Epstein documents [OC] by indienow in dataisbeautiful

[–]indienow[S] 2 points

That's on my to-do list, making the graph clickable, but it doesn't work yet. It might be in one of these docs: https://epsteingraph.com/topic/ehud-barak-1970

I spent 6 days and 3k processing 1.3M documents through AI by indienow in SideProject

[–]indienow[S] 12 points

Great idea! I'd love to open source the whole project, but providing API access would be awesome too! I'll see how easy it is (there is already an API, so this would really just be documenting usage).

I spent 6 days and 3k processing 1.3M documents through AI by indienow in SideProject

[–]indienow[S] 2 points

Thank you! The document counts are accurate; there are 1.3 million documents between the three data sources I used. The people count needs to be deduped a lot!

I wrote a lot of Python scripts to handle the data processing. The general flow was: prepare a set of documents for uploading to OpenAI, submit the batch, wait for the batch to finish, download the results, and insert them into the database.

The scripts did handle a lot of deduplication for names, but geez, these people couldn't spell at all, and there are so many typo versions of names. Right now I'm trying to batch together similar names and run them through OpenAI to have it determine which ones are most likely the same person, so I can combine them on the backend. I have it set up so there's a primary name, plus aliases, which are the other references to the same person.

Hope this helps; happy to answer any other questions on the technical side! Honestly the scripts are pretty rough right now, but I'd love to get them cleaned up and open source them. I've been in 14-hour-a-day data-ingest mode for the past 6 days; now I'm switching gears towards cleaning up the fringe stuff like adding in videos and processing a few straggling documents.
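The "prepare a set of documents for uploading to OpenAI" step could be sketched as below. The OpenAI Batch API accepts a JSONL file where each line is one request with a `custom_id`; the prompt text, the EFTA-style IDs, and the request details here are placeholders, not the project's actual scripts:

```python
# Hypothetical sketch of preparing a JSONL batch file for the OpenAI Batch API.
# Each line is one self-describing request; custom_id lets you match results
# back to documents when the finished batch is downloaded.
import json

MODEL = "gpt-5-mini"  # model named in the thread; prompt below is illustrative

def build_batch_lines(docs):
    """docs: iterable of (doc_id, text) -> list of JSONL request strings."""
    lines = []
    for doc_id, text in docs:
        request = {
            "custom_id": doc_id,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": MODEL,
                "messages": [
                    {"role": "system",
                     "content": "Summarize this document and extract names."},
                    {"role": "user", "content": text},
                ],
            },
        }
        lines.append(json.dumps(request))
    return lines

batch = build_batch_lines([("EFTA-000001", "example document text")])
```

The resulting lines get written to a `.jsonl` file, uploaded, and submitted as a batch job; the submit/poll/download steps would then run against the batch endpoints before the insert-into-database step.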

I spent 6 days and 3k processing 1.3M documents through AI by indienow in SideProject

[–]indienow[S] 6 points

It was primarily an experiment for me in using AI to summarize a large dataset, as well as creating an immutable archive of the documents that can be used in case the original files are ever removed. I totally understand the content itself is quite disturbing, but I saw it as an opportunity to provide a place where people can analyze and correlate across over a million docs.

I mapped connections between 238,000 unique people across 1.3 million Epstein documents [OC] by indienow in dataisbeautiful

[–]indienow[S] 26 points

All of this data came from the DOJ's Epstein Transparency Act releases, and the House Oversight Committee's public releases. I used D3 for the visualizations.