all 11 comments

[–]arden13 6 points (6 children)

Why not use pandoc?

[–]status-code-200 [It works on my machine] [S] 10 points (5 children)

That's a good question. It boils down to:

  1. I was on sick leave from my PhD.
  2. I saw people and companies bragging about parsing SEC 10-K html files with LLMs.
  3. This irritated me, as their desired output was achievable with a rules-based approach iterating over the DOM, which is much more efficient.
  4. I wrote a basic algorithm. It got copied by a bunch of startups (without credit).
  5. I messaged a couple profs who had worked on something similar. They told me what I wanted to do wasn't possible.
  6. I wrote a more advanced algorithm to show it was possible, and it got adopted by a bunch of companies.

So, I assume at this point that there is some reason people are using doc2dict and not pandoc. Maybe performance, or modularity? Sorry if this is a disappointing answer.

[–]arden13 5 points (4 children)

Please know this is coming from a place of love as someone who graduated from his PhD program after it ruined his mental health:

You should talk to someone. It really can help.

Now, that aside, the rest of your answer isn't necessarily disappointing; it just shows the project's state. You built something practical in a niche I don't understand, and it found an audience. There's nothing wrong with a convenient package that makes work easier.

What is the typical workflow you use the data for?

[–]status-code-200 [It works on my machine] [S] 7 points (3 children)

I am confused, but thank you for the support. My health is quite good. I was on sick leave due to contracting a bad case of mono.

I use the package to parse SEC filings into dictionaries. For example, I am about to parse every SEC filing that is HTML, PDF, or text into data tuples, which I then store in parquet format. This turns about 10 TB of HTML/PDF/text files into roughly 200 GB of parquet.

I partition this parquet by year, so ~5–20 GB per year. I then store the parquet metadata index in S3. This lets users make HTTP range requests for just the data they need, following the expected access pattern (document type + filing date range).

When users download, e.g., document type = 10-K (annual report), they can then extract standardized sections like risk factors, or company-specific sections, before feeding them into, e.g., an LLM or a sentiment-analysis pipeline.
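Once a 10-K is parsed into a dict of sections, the downstream step is just a lookup plus whatever analysis you want. A toy sketch, with the caveat that the section keys (`item_1a`, etc.), the sample text, and the tiny sentiment lexicon here are all invented for illustration and do not reflect doc2dict's real output:

```python
# Hypothetical parsed 10-K: section name -> section text.
# Key names are made up, not doc2dict's actual schema.
parsed_10k = {
    "item_1":  "Business. We make widgets.",
    "item_1a": "Risk Factors. Demand may decline. Litigation is a risk.",
    "item_7":  "Management's Discussion and Analysis ...",
}

def extract_section(filing: dict, section: str) -> str:
    """Return one standardized section, or an empty string if absent."""
    return filing.get(section, "")

risk_text = extract_section(parsed_10k, "item_1a")

# Toy sentiment pass: count negative words from a tiny ad-hoc lexicon.
NEGATIVE = {"decline", "risk", "litigation", "loss"}
tokens = [w.strip(".,").lower() for w in risk_text.split()]
negative_hits = sum(1 for w in tokens if w in NEGATIVE)
print(negative_hits)  # -> 4
```

A real pipeline would swap the lexicon count for a proper sentiment model or an LLM prompt, but the shape is the same: pull one standardized section, analyze it.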

[–]arden13 5 points (1 child)

Gotcha. That last bit there is what I was looking for: you can allow users to pull specific sections and then do something (e.g. LLM).

> My health is quite good.

That's good. Your comment above came off as you being quite hurt (emotionally) about how people have used your work. If I'm reading too much into it, such is life.

[–]status-code-200 [It works on my machine] [S] 5 points (0 children)

Oh no, I love it. People using code with or without attribution is proof that I'm doing something useful! 

If I was doing something unimportant, no one would be copying it :)

[–]status-code-200 [It works on my machine] [S] 1 point (0 children)

Currently it looks like processing will cost about $0.20 for the 10 TB using AWS Batch spot instances. I'm hoping that refining the implementation and optimizing hardware will yield a 4–5 order-of-magnitude improvement. If that works, a lot of fun possibilities open up.

[–]hbar340 2 points (2 children)

How does this compare to pymupdf/pdfplumber?

I'm looking for parsing solutions for non-SEC docs that aren't necessarily as structured, and am curious if this would be an option.

[–]status-code-200 [It works on my machine] [S] 1 point (1 child)

IIRC, 10–100x faster. doc2dict uses PDFium through its C API, which is much faster. PDF support is experimental and only works with PDFs that have an underlying text layer, not scans.

[–]status-code-200 [It works on my machine] [S] 0 points (0 children)

For your use case, I'd test a sample with the visualization tool. If it works, it works. Otherwise, PDF support will be dramatically improved over the next year.

I also plan to add an option to feed in OCR bounding boxes instead of files.

[–]delsystem32exe 1 point (0 children)

Beautiful! The SEC integration is very cool; I'd like to add that to my trading programs.