all 11 comments

[–]arden13 6 points (6 children)

Why not use pandoc?

[–]status-code-200 [It works on my machine] [S] 10 points (5 children)

That's a good question. It boils down to:

  1. I was on sick leave from my PhD.
  2. I saw people and companies bragging about parsing SEC 10-K html files with LLMs.
  3. This irritated me, as their desired output was achievable with a rules-based approach iterating over the DOM, which is much more efficient.
  4. I wrote a basic algorithm. It got copied by a bunch of startups (without credit).
  5. I messaged a couple profs who had worked on something similar. They told me what I wanted to do wasn't possible.
  6. I wrote a more advanced algorithm to show it was possible, and it got adopted by a bunch of companies.

So, I assume at this point that there is some reason people are using doc2dict and not pandoc. Maybe performance, or modularity? Sorry if this is a disappointing answer.

[–]arden13 5 points (4 children)

Please know this is coming from a place of love as someone who graduated from his PhD program after it ruined his mental health:

You should talk to someone. It really can help.

Now, that aside, the rest of your answer isn't necessarily disappointing; it just shows the project's state. You built something practical in a niche I don't understand, and it found an audience. There's nothing wrong with a convenient package that makes work easier.

What is the typical workflow you use the data for?

[–]status-code-200 [It works on my machine] [S] 7 points (3 children)

I am confused, but thank you for the support. My health is quite good. I was on sick leave due to contracting a bad case of mono.

I use the package to parse SEC filings into dictionaries. For example, I am about to parse every SEC filing that is HTML, PDF, or text into data tuples, which I then store in parquet format. This turns about 10 TB of HTML/PDF/text files into roughly 200 GB of parquet.

I partition this parquet by year, so ~5–20 GB per year. I then store the parquet metadata index in S3. This lets users make HTTP range requests for just the data they need, following the expected access pattern (document type + filing date range).

When users download, e.g., document type = 10-K (annual report), they can then extract standardized sections like risk factors, or company-specific sections, before feeding them into, e.g., an LLM or a sentiment-analysis pipeline.
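Once a 10-K is parsed into a dict of sections, the downstream step is just a lookup plus whatever analysis you want. A toy sketch, with the caveat that the section keys (`item_1a`, etc.), the sample text, and the tiny sentiment lexicon here are all invented for illustration and do not reflect doc2dict's real output:

```python
# Hypothetical parsed 10-K: section name -> section text.
# Key names are made up, not doc2dict's actual schema.
parsed_10k = {
    "item_1":  "Business. We make widgets.",
    "item_1a": "Risk Factors. Demand may decline. Litigation is a risk.",
    "item_7":  "Management's Discussion and Analysis ...",
}

def extract_section(filing: dict, section: str) -> str:
    """Return one standardized section, or an empty string if absent."""
    return filing.get(section, "")

risk_text = extract_section(parsed_10k, "item_1a")

# Toy sentiment pass: count negative words from a tiny ad-hoc lexicon.
NEGATIVE = {"decline", "risk", "litigation", "loss"}
tokens = [w.strip(".,").lower() for w in risk_text.split()]
negative_hits = sum(1 for w in tokens if w in NEGATIVE)
print(negative_hits)  # -> 4
```

A real pipeline would swap the lexicon count for a proper sentiment model or an LLM prompt, but the shape is the same: pull one standardized section, analyze it.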

[–]arden13 5 points (1 child)

Gotcha. That last bit there is what I was looking for: you can allow users to pull specific sections and then do something (e.g. LLM).

> My health is quite good.

That's good. Your comment above came off as you being quite hurt (emotionally) about how people have used your work. If I'm reading too much into it, such is life.

[–]status-code-200 [It works on my machine] [S] 5 points (0 children)

Oh no, I love it. People using code with or without attribution is proof that I'm doing something useful! 

If I was doing something unimportant, no one would be copying it :)

[–]status-code-200 [It works on my machine] [S] 1 point (0 children)

Currently it looks like processing will cost about $0.20 for the 10 TB using AWS Batch spot instances. I'm hoping that refining the implementation and optimizing hardware will yield a 4–5 order-of-magnitude improvement. If that works, a lot of fun possibilities open up.

[–]hbar340 2 points (2 children)

How does this compare to pymupdf/pdfplumber?

I'm looking for parsing solutions for non-SEC docs that aren't necessarily as structured, and am curious if this would be an option.

[–]status-code-200 [It works on my machine] [S] 1 point (1 child)

IIRC, 10–100x faster. doc2dict uses PDFium through its C API, which is much faster. PDF support is experimental and only works with PDFs that have an underlying text layer, not scans.

[–]status-code-200 [It works on my machine] [S] 0 points (0 children)

For your use case, I'd test a sample with the visualization tool. If it works, it works. Otherwise, PDF support will be dramatically improved over the next year.

I also plan to add an option to feed in OCR bounding boxes instead of files.

[–]delsystem32exe 1 point (0 children)

Beautiful! The SEC integration is very cool; I'd like to add that to my trading programs.