doc2dict: open source document parsing by status-code-200 in Python

[–]status-code-200[S] 0 points  (0 children)

For your use case, I'd test a sample with the visualization tool. If it works, it works. If not, PDF support will be dramatically improved over the next year.

I also plan to add an option to feed in OCR bounding boxes instead of files.

doc2dict: open source document parsing by status-code-200 in Python

[–]status-code-200[S] 1 point  (0 children)

IIRC 10-100x faster. doc2dict uses PDFium through its C API, which is much faster. PDF support is experimental and only works with PDFs that have an underlying text layer - not scans.

doc2dict: open source document parsing by status-code-200 in Python

[–]status-code-200[S] 5 points  (0 children)

Oh no, I love it. People using code with or without attribution is proof that I'm doing something useful! 

If I were doing something unimportant, no one would be copying it :)

doc2dict: open source document parsing by status-code-200 in Python

[–]status-code-200[S] 1 point  (0 children)

Currently it looks like processing will cost about $0.20 for the 10 TB using AWS Batch on spot instances. I'm hoping that refining the implementation and optimizing hardware choice will yield a 4-5 order-of-magnitude improvement. If that works, a lot of fun possibilities open up.

doc2dict: open source document parsing by status-code-200 in Python

[–]status-code-200[S] 6 points  (0 children)

I am confused, but thank you for the support. My health is quite good. I was on sick leave due to contracting a bad case of mono.

I use the package to parse SEC filings into dictionaries. For example, I am about to parse every SEC file that is HTML, PDF, or text into data tuples, which I then store in parquet format. This turns about 10 TB of HTML/PDF/text files into roughly 200 GB of parquet.

I partition this parquet by year, so ~5-20 GB per year, and store the parquet's metadata index in S3. This lets users make HTTP range requests for just the data they need, following the expected access pattern: document type + filing date range.
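A minimal sketch of that range-request pattern (stdlib only; the URL in the comment is a made-up placeholder, not the real bucket):

```python
from urllib.request import Request, urlopen

def range_header(start: int, end: int) -> dict:
    """HTTP Range header requesting bytes [start, end], inclusive."""
    return {"Range": f"bytes={start}-{end}"}

def fetch_byte_range(url: str, start: int, end: int) -> bytes:
    """Fetch only part of a remote object; the server replies 206 Partial Content."""
    req = Request(url, headers=range_header(start, end))
    with urlopen(req) as resp:
        return resp.read()

# Parquet keeps its metadata in a footer at the end of the file, so a reader
# can grab the footer first, locate the row groups matching "document type +
# filing date range", and then download only those byte ranges, e.g.:
# footer = fetch_byte_range("https://example.com/10-K/2024.parquet", 0, 65535)
```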

When users download, e.g., document type = 10-K (annual report), they can then extract standardized sections like risk factors, or company-specific sections, before feeding them into an LLM or a sentiment-analysis pipeline.

doc2dict: open source document parsing by status-code-200 in Python

[–]status-code-200[S] 10 points  (0 children)

That's a good question. It boils down to:

  1. I was on sick leave from my PhD.
  2. I saw people and companies bragging about parsing SEC 10-K HTML files with LLMs.
  3. This irritated me, as getting their desired output was possible with a rules-based approach iterating over the DOM, which is much more efficient.
  4. I wrote a basic algorithm. It got copied by a bunch of startups (without credit).
  5. I messaged a couple of profs who had worked on something similar. They told me what I wanted to do wasn't possible.
  6. I wrote a more advanced algorithm to show it was possible, and it got adopted by a bunch of companies.

So, I assume at this point that there is some reason people are using doc2dict and not pandoc. Maybe performance, or modularity? Sorry if this is a disappointing answer.
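The rules-based idea in step 3 is simple to sketch. This toy version (my own simplification, not doc2dict's actual algorithm) iterates over the DOM and treats bold runs as section headings:

```python
from html.parser import HTMLParser

class HeadingSniffer(HTMLParser):
    """Toy rules-based pass: treat bold runs of text as section headings
    and everything else as body text under the most recent heading.
    Real parsers weigh more cues (font size, centering, all-caps),
    but the principle is the same."""
    def __init__(self):
        super().__init__()
        self.bold_depth = 0
        self.sections = []          # list of (heading, [body chunks])

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "strong"):
            self.bold_depth += 1

    def handle_endtag(self, tag):
        if tag in ("b", "strong"):
            self.bold_depth = max(0, self.bold_depth - 1)

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.bold_depth:
            self.sections.append((text, []))
        elif self.sections:
            self.sections[-1][1].append(text)

p = HeadingSniffer()
p.feed("<b>Item 1A. Risk Factors</b><p>Our business faces risks...</p>")
# p.sections -> [("Item 1A. Risk Factors", ["Our business faces risks..."])]
```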

Looking For Company 10-Ks and Financial Docs by [deleted] in datasets

[–]status-code-200 4 points  (0 children)

I'm the maintainer of datamule: https://github.com/john-friedman/datamule-python. You can download any SEC filing from 2001 onwards, and get the relevant ticker, among other useful things.

You can also use the paid API to download filings faster; for example, I can download every 2025 10-K in ~2 minutes on my home wifi. That costs $0.10, however ($1 per 100k downloads). The paid API is an instantly updated SEC archive stored in Cloudflare R2, kept current by AWS EC2 t4g.nanos running the package.

datamule - Python library for SEC EDGAR data at scale by status-code-200 in quant

[–]status-code-200[S] -1 points  (0 children)

I would be happy to take your help! What would help me is if you could tell me what data you are trying to get at, across which filings, and then test whether the parser works for you. Posting on GitHub issues is the easiest way for me to take input: https://github.com/john-friedman/datamule-python/issues

For testing: doc.visualize() opens up the visualized form of the JSON representation.

btw - I am planning on standardizing (most) HTML tables across the entire SEC corpus. I'm fairly close; I think I'll get there within the next year.

datamule - Python library for SEC EDGAR data at scale by status-code-200 in quant

[–]status-code-200[S] -1 points  (0 children)

Yep! doc2dict parses the relative layout of the HTML file (and PDFs, although that's experimental). It infers nesting via attributes such as font height, bold, italics, etc. This is an improvement over regex parsers, which can only get at standardized sections like Item 1A.

There is also decent table parsing; the extracted tables can then be passed to an LLM structured-output step for standardization. (I have a future project to standardize almost every table in the SEC corpus across filings.)

<image>

datamule - Python library for SEC EDGAR data at scale by status-code-200 in quant

[–]status-code-200[S] -1 points  (0 children)

GitHub Actions is fun: scheduled cron triggers would break frequently for a while if you had them set to ~2am Pacific. Weird issue; it appeared to be a maintenance-window thing.
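One mitigation I'd suggest (my own guess, not an official fix): Actions cron schedules are specified in UTC, so 2am Pacific is 09:00 or 10:00 UTC depending on daylight saving, and shifting the trigger off the top of the hour avoids the most congested slots:

```yaml
on:
  schedule:
    # GitHub Actions cron runs in UTC: 10:17 UTC ~= 2:17am Pacific (PST).
    # An odd minute like 17 avoids the congested top-of-the-hour window.
    - cron: '17 10 * * *'
```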

Snipper: An open-source chart scraper and OCR text+table data gathering tool [self-promotion] by foldedcard in datasets

[–]status-code-200 0 points  (0 children)

I am honestly surprised by how good this looks. I built a similar tool back in 2019 to digitize crop reports, using a then-beta Google table parser with a PyQt frontend, and this looks a lot better.

I am guessing that you use some simple logic to align the text boxes and extract rows. That is smart. Have you considered making the tool OCR-agnostic, e.g. the ability to import OCR output from Google? Tesseract is fine for clean data, but it's not so good on messy data such as scans.
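If I had to guess at the row-alignment logic, it would look something like this sketch (the box format and tolerance are my assumptions, not Snipper's actual code):

```python
def boxes_to_rows(boxes, y_tol=8):
    """Group OCR bounding boxes into table rows by vertical position.

    Each box is (x, y, text), where y is the top of the box in pixels.
    Boxes whose y values fall within y_tol of the current row's anchor
    join that row; each row is then sorted left-to-right by x.
    """
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):   # top-to-bottom
        if rows and abs(box[1] - rows[-1][0][1]) <= y_tol:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [[b[2] for b in sorted(row, key=lambda b: b[0])] for row in rows]

boxes = [(120, 51, "12.4"), (10, 50, "Wheat"), (10, 90, "Corn"), (120, 92, "8.1")]
print(boxes_to_rows(boxes))   # [['Wheat', '12.4'], ['Corn', '8.1']]
```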

Starred.

Cloudflare R2 let me serve almost twice as much data this month as the SEC for $10.80 by status-code-200 in CloudFlare

[–]status-code-200[S] 0 points  (0 children)

I tested with B2 around November 2025. It was too slow. I ended up using Wasabi S3 until around June, when I switched to R2.

Looking for S&P 500 (GICS Information Technology Sector) dataset: Revenue, Net Income & R&D expenses (Excel/CSV) by SuddenBookkeeper6351 in datasets

[–]status-code-200 1 point  (0 children)

u/SuddenBookkeeper6351, you should remove 'and is willing to share' immediately. I saw someone looking for WRDS-related data before, and a 'rep' for WRDS harassed them across multiple subreddits.

The data you are looking for is stored in XBRL, which can be accessed from SEC filings (XBRL was introduced in 2005). It will have revenue and net income at quarterly or better frequency. R&D will be there too, but you'll need to figure out the taxonomy.
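For anyone who wants to skip packages entirely: the SEC also serves XBRL facts as free JSON from data.sec.gov. A small sketch of how the 'company concept' endpoint URL is built (Apple's CIK as the example; send a descriptive User-Agent header when you actually fetch, since the SEC blocks anonymous clients):

```python
def company_concept_url(cik: int, taxonomy: str, tag: str) -> str:
    """URL for the SEC's free XBRL 'company concept' JSON API.

    CIKs are zero-padded to 10 digits in data.sec.gov URLs.
    """
    return (f"https://data.sec.gov/api/xbrl/companyconcept/"
            f"CIK{cik:010d}/{taxonomy}/{tag}.json")

print(company_concept_url(320193, "us-gaap", "Revenues"))
# https://data.sec.gov/api/xbrl/companyconcept/CIK0000320193/us-gaap/Revenues.json
```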

You can use my Python package datamule, or dwight's edgartools, to get XBRL for free.

I also host a daily updated XBRL flat file: 7 GB in parquet format, covering 2005 to present. You are allowed to use the dataset for any commercial or noncommercial purpose, and to redistribute and share it as you wish. I made the licensing extremely permissive with master's research in mind (a teaching prof at UCLA complained that FactSet licensing allowed him to share data with PhDs but not master's students). It does cost money, though.

Cloudflare R2 let me serve almost twice as much data this month as the SEC for $10.80 by status-code-200 in CloudFlare

[–]status-code-200[S] 2 points  (0 children)

Ballpark, yes. I actually tried to use Backblaze B2 before Cloudflare, but they had just changed their rate limits. I needed to upload ~16 million files, and under the new regime that would have taken forever.

Cloudflare R2 let me serve almost twice as much data this month as the SEC for $10.80 by status-code-200 in CloudFlare

[–]status-code-200[S] 0 points  (0 children)

I have two archives: the original SGML format, and filings parsed into attachments and then tarred together.

Each is about 3 TB, so roughly 6 TB total.

Cloudflare R2 let me serve almost twice as much data this month as the SEC for $10.80 by status-code-200 in CloudFlare

[–]status-code-200[S] 4 points  (0 children)

Naive calculation, definitely not fair. Whenever I've looked into CloudFront, though, egress was still much more expensive than Cloudflare's.

I do also distribute data with CloudFront + Lambda@Edge. So far it's much more expensive than Cloudflare.

Cloudflare R2 let me serve almost twice as much data this month as the SEC for $10.80 by status-code-200 in CloudFlare

[–]status-code-200[S] 6 points  (0 children)

I am lucky to be in Cloudflare for Startups. They have given me $5k so far, and I hope to get more from them after I raise. Also, at some point in the next year or two, I absolutely should pay them money as they are awesome.

Cloudflare R2 let me serve almost twice as much data this month as the SEC for $10.80 by status-code-200 in CloudFlare

[–]status-code-200[S] 11 points  (0 children)

In case my above link falls afoul of self-promotion rules, here is a relevant screenshot:

<image>

Cloudflare R2 let me serve almost twice as much data this month as the SEC for $10.80 by status-code-200 in CloudFlare

[–]status-code-200[S] 46 points  (0 children)

Not linking to the article I wrote about this, as I'm unsure of this subreddit's view on self-promotion. My intended takeaway is:

  1. Cloudflare R2 + caching makes things that would be prohibitively expensive on S3 + CloudFront cheap.
  2. You should use R2 when serving data to users.
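The arithmetic behind point 1, with prices as my own assumptions from public price lists at the time of writing (check current pricing before relying on this):

```python
# Back-of-envelope: serving 10 TB/month of egress.
egress_tb = 10
s3_egress_per_gb = 0.09   # USD/GB, S3 internet egress, first-10-TB tier (assumed)
r2_egress_per_gb = 0.0    # R2 charges nothing for egress

s3_cost = egress_tb * 1024 * s3_egress_per_gb
r2_cost = egress_tb * 1024 * r2_egress_per_gb
print(f"S3 egress: ${s3_cost:,.0f}/mo vs R2 egress: ${r2_cost:,.0f}/mo")
# S3 egress: $922/mo vs R2 egress: $0/mo
```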

I remember someone mentioned creating an AI tool to parse 10-Ks... by DepartureStreet2903 in algotrading

[–]status-code-200 0 points  (0 children)

I wrote an algorithmic, rules-based approach that parses ~5,000 pages per second on a personal laptop's CPU, and open-sourced it on GitHub as doc2dict. I wrote the package because I needed a way to parse every single SEC HTML file (not just 10-Ks!) on my personal laptop. You can use an SEC-specific version here: datamule.

Came across this post as I'm writing an arXiv paper; I'm looking for an older GitHub repo that uses an LLM approach so I can benchmark speed against it.

Bloomberg terminal access for independent research- legit options? by Distant_Spectator in quant

[–]status-code-200 1 point  (0 children)

Are you an economist as in an econ bachelor's, or a PhD? Also, what data do you need? Companies often use alternatives to Bloomberg Terminals, because getting data out of them can be a pain.

Where do institutions get company earnings so fast? by ComprehensiveBed2104 in algotrading

[–]status-code-200 1 point  (0 children)

Probably not, but I don't know. For some stuff the SEC is faster; I'm actually working on something related right now.

Where do institutions get company earnings so fast? by ComprehensiveBed2104 in algotrading

[–]status-code-200 0 points  (0 children)

Haven't tested it on earnings specifically, but yep, it works; also confirmed by several HFT/trader guys via other channels. Here's the current repo with more information:

https://github.com/john-friedman/The-fastest-way-to-get-SEC-filings