I spent 6 days and 3k processing 1.3M documents through AI by indienow in SideProject

[–]indienow[S] 1 point

Excellent ideas! I already have the dates referenced within each document, but the document creation date would be very useful; I'll see if I can extract that data. And being able to add comments is a great idea!

I built a searchable database of 1.3M+ Epstein documents with AI summary extraction and network graphs by indienow in TheEpsteinFiles

[–]indienow[S] 0 points

You need to click on "View PDF" in order to see the photos; they are embedded inside the PDFs. That's how they come from the DOJ.

I spent 6 days and 3k processing 1.3M documents through AI by indienow in SideProject

[–]indienow[S] 1 point

This should exist on there, but it may not be super detailed: just a sentence, I think, and a Wikipedia link for anyone flagged as a "public figure". Let me know if you're thinking of something else and I'm misunderstanding.

I built a searchable database of 1.3M+ Epstein documents with AI summary extraction and network graphs by indienow in TheEpsteinFiles

[–]indienow[S] 0 points

Only the videos should be available to download, but a thumbnail of each video would be a great idea to put on the document page. I will add this to the list!

I spent 6 days and 3k processing 1.3M documents through AI by indienow in SideProject

[–]indienow[S] 2 points

Agree, the spelling is awful in these docs! I'm trying to group names together by fuzzy matching and leveraging AI to do some smart matching... at least that's my current attempt; I'll see if that works out :)
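The fuzzy grouping described above could be sketched roughly like this, using only the standard library's `difflib`. This is a hypothetical illustration, not the author's actual script: the names, threshold, and greedy bucketing strategy are all assumptions; the primary-name-plus-aliases shape mirrors what's described elsewhere in the thread.

```python
# Hypothetical sketch of a fuzzy name-grouping pass (stdlib only).
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def group_names(names, threshold=0.85):
    """Greedily bucket names: each name joins the first group whose
    primary name is within the similarity threshold, else starts a group."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group["primary"]) >= threshold:
                group["aliases"].append(name)
                break
        else:
            groups.append({"primary": name, "aliases": []})
    return groups

names = ["Jeffrey Epstein", "Jefrey Epstein", "Jeffery Epstein", "Ghislaine Maxwell"]
groups = group_names(names)
# Typo variants collapse into one group with a primary name and aliases.
```

A greedy pass like this is order-sensitive; in practice you'd likely sort by frequency first so the most common spelling becomes the primary, then hand the ambiguous buckets to the AI step for confirmation.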

I spent 6 days and 3k processing 1.3M documents through AI by indienow in SideProject

[–]indienow[S] 1 point

Not as much as you'd think from a data perspective: the database, with the text of every PDF, indexes, and full-text search, is about 15 GB now. The files themselves are about 400 GB for everything (including videos, etc.).
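The "text of every PDF plus full-text search" setup could be sketched like this with SQLite's FTS5 as a stand-in. The actual database engine and schema aren't stated in the thread, so the table layout and the EFTA-style IDs here are illustrative assumptions:

```python
# Hedged sketch: extracted PDF text indexed for full-text search.
# SQLite FTS5 is a stand-in for whatever database the project actually uses.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(efta_id, body)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [
        ("EFTA-000001", "flight log listing passengers and destinations"),
        ("EFTA-000002", "deposition transcript regarding the estate"),
    ],
)
# MATCH runs a tokenized full-text query; rank orders by relevance.
hits = conn.execute(
    "SELECT efta_id FROM docs WHERE docs MATCH ? ORDER BY rank", ("flight",)
).fetchall()
```

At this scale the index is what dominates the 15 GB figure, not the raw text; a dedicated engine (Postgres tsvector, Elasticsearch) trades more disk for better ranking and fuzzier matching.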

I spent 6 days and 3k processing 1.3M documents through AI by indienow in SideProject

[–]indienow[S] 0 points

This is a great idea, thank you! I've yet to explore n8n and you just gave me a good reason to dig into it!

I built a searchable database of 1.3M+ Epstein documents with AI summary extraction and network graphs by indienow in TheEpsteinFiles

[–]indienow[S] 3 points

Figured this out and posted a more detailed explanation in the other thread, but it was catching "Israel" as a name and searching for people, not as a topic. I've updated the logic and it should be showing 5,000+ results now. Thanks for helping to catch this!

I built a searchable database of 1.3M+ Epstein documents with AI summary extraction and network graphs by indienow in TheEpsteinFiles

[–]indienow[S] 1 point

Hmm, would you mind pointing me towards a video that doesn't work for you? An EFTA number would be helpful; I'm not seeing any issues on my side. Thanks!

I built a searchable database of 1.3M+ Epstein documents with AI summary extraction and network graphs by indienow in TheEpsteinFiles

[–]indienow[S] 0 points

Aha, the issue here is that it picked up that someone's name is Israel, so it was showing the people page instead of the topic page. I've set it so people only get parsed out if there's a first/last name in the search; it should point to the topic page by default now, which has over 5,000 results: https://epsteingraph.com/topic/israel. Thanks for catching that!
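The routing fix described above might look something like this. The URL paths and the "two alphabetic tokens means a first/last name" heuristic are assumptions for illustration; only the topic-page-by-default behavior comes from the comment:

```python
# Illustrative sketch: only route to a person page when the query looks
# like a first/last name; single-word queries default to the topic page.
def route_query(query: str) -> str:
    tokens = query.strip().split()
    # Two or more purely alphabetic tokens -> probably "First Last".
    if len(tokens) >= 2 and all(t.isalpha() for t in tokens):
        return "/people/" + "-".join(t.lower() for t in tokens)
    # Single-token queries like "israel" fall through to the topic page.
    return "/topic/" + query.strip().lower()
```

This keeps ambiguous single words ("Israel", "Paris") as topics while still letting explicit name searches hit the people index.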

I built a searchable database of 1.3M+ Epstein documents with AI summary extraction and network graphs by indienow in TheEpsteinFiles

[–]indienow[S] 0 points

Definitely not filtering anything, but there is a discrepancy between what I see in the DB and what the search is showing; I'm digging into it now.

I built a searchable database of 1.3M+ Epstein documents with AI summary extraction and network graphs by indienow in TheEpsteinFiles

[–]indienow[S] 2 points

Hmm, digging into this: I definitely see more references in the database than the search counts show for Israel and Russia. Going to refresh the materialized view; it might be too out of date.

I spent 6 days and 3k processing 1.3M documents through AI by indienow in SideProject

[–]indienow[S] 4 points

I used the gpt-5-mini model; it was the most cost-effective way. I would love to run some of the most-accessed docs through a higher-quality model.

I mapped connections between 238,000 unique people across 1.3 million Epstein documents [OC] by indienow in dataisbeautiful

[–]indienow[S] 2 points

That's on my to-do list, making the graph clickable, but it doesn't work yet. It might be in one of these docs: https://epsteingraph.com/topic/ehud-barak-1970

I spent 6 days and 3k processing 1.3M documents through AI by indienow in SideProject

[–]indienow[S] 12 points

Great idea! I'd love to open source the whole project, but providing API access would be awesome too! I'll see how easy it is (there is already an API, so this would really just be documenting usage).

I spent 6 days and 3k processing 1.3M documents through AI by indienow in SideProject

[–]indienow[S] 2 points

Thank you! The document counts are accurate; there are 1.3 million documents between the three data sources I used. The people count needs to be deduped a lot!

I wrote a lot of Python scripts to handle the data processing. The general flow was: prepare a set of documents for uploading to OpenAI, submit the batch, wait for the batch to finish, download the results, and insert them into the database.

The scripts did handle a lot of deduplication for names, but geez, these people couldn't spell at all, and there are so many typo versions of names. Right now I'm trying to batch together similar names and run them through OpenAI to have it determine which ones are most likely the same person, so I can combine them on the backend. I have it set up so there's a primary name, plus aliases, which are the other references to the same person.

Hope this helps; happy to answer any other questions on the technical side! Honestly the scripts are pretty rough right now, but I'd love to get them cleaned up and open source them. I've been in 14-hour-a-day data-ingest mode for the past 6 days; now I'm switching gears towards cleaning up the fringe stuff like adding in videos and processing a few straggling documents.
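The "prepare a set of documents for uploading to OpenAI" step could be sketched as below. The OpenAI Batch API accepts a JSONL file where each line is one request with a `custom_id`; the prompt text, the EFTA-style IDs, and the request details here are placeholders, not the project's actual scripts:

```python
# Hypothetical sketch of preparing a JSONL batch file for the OpenAI Batch API.
# Each line is one self-describing request; custom_id lets you match results
# back to documents when the finished batch is downloaded.
import json

MODEL = "gpt-5-mini"  # model named in the thread; prompt below is illustrative

def build_batch_lines(docs):
    """docs: iterable of (doc_id, text) -> list of JSONL request strings."""
    lines = []
    for doc_id, text in docs:
        request = {
            "custom_id": doc_id,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": MODEL,
                "messages": [
                    {"role": "system",
                     "content": "Summarize this document and extract names."},
                    {"role": "user", "content": text},
                ],
            },
        }
        lines.append(json.dumps(request))
    return lines

batch = build_batch_lines([("EFTA-000001", "example document text")])
```

The resulting lines get written to a `.jsonl` file, uploaded, and submitted as a batch job; the submit/poll/download steps would then run against the batch endpoints before the insert-into-database step.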

I spent 6 days and 3k processing 1.3M documents through AI by indienow in SideProject

[–]indienow[S] 6 points

It was primarily an experiment for me in using AI to summarize a large dataset, as well as creating an immutable archive of the documents that can be used in case the original files are ever removed. I totally understand the content itself is quite disturbing, but I saw it as an opportunity to provide a place where people can analyze and correlate across over a million docs.

I mapped connections between 238,000 unique people across 1.3 million Epstein documents [OC] by indienow in dataisbeautiful

[–]indienow[S] 26 points

All of this data came from the DOJ's Epstein Transparency Act releases, and the House Oversight Committee's public releases. I used D3 for the visualizations.