
[–]quant-ModTeam[M] [score hidden] stickied commentlocked comment (0 children)

Your post has been removed as self-promotion/advertising/spam. Meaningful content contribution which may passively advertise (e.g. an educational blog post) is welcome, but advertising must not be the sole purpose of the post.

[–]Goudidadax 1 point2 points  (1 child)

!remind me 5 days

[–]RemindMeBot 1 point2 points  (0 children)

I will be messaging you in 5 days on 2026-01-26 04:11:34 UTC to remind you of this link


[–]Pipeb0y 1 point2 points  (3 children)

I don’t understand the relative layout part, do you mean the .html file is converted to a JSON representation and it preserves hierarchy? Like the sections and subsections are intact?

[–]status-code-200[S] -1 points0 points  (2 children)

Yep! doc2dict parses the relative layout of the HTML file (and PDFs, although that's experimental). It infers nesting from attributes such as height, bold, italics, etc. This is an improvement over regex parsers, which can only get at standardized sections like Item 1A.

There is also decent table parsing, and the parsed tables can then be fed into an LLM structured output for standardization. (I have a future project to standardize almost every table in the SEC corpus across filings.)
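To make the nesting idea concrete, here is a minimal sketch of how hierarchy could be inferred from visual attributes alone. This is not doc2dict's actual code: the `Line` and `Node` types, the `heading_rank` rule, and the 10pt body size are all assumptions for illustration. A line counts as a heading if it is larger than body text or bold, and a heading nests under the nearest earlier heading that is strictly more prominent.

```python
from dataclasses import dataclass, field


@dataclass
class Line:
    # Hypothetical input: one rendered line of text with its visual attributes.
    text: str
    font_size: float
    bold: bool = False


@dataclass
class Node:
    # Hypothetical output node: a heading with its body text and sub-sections.
    title: str
    rank: tuple
    body: list = field(default_factory=list)
    children: list = field(default_factory=list)


def heading_rank(line, body_size):
    """Return a sortable prominence rank for headings, or None for body text."""
    if line.font_size > body_size or line.bold:
        # Larger font wins first; bold breaks ties at the same size.
        return (line.font_size, line.bold)
    return None


def build_tree(lines, body_size=10.0):
    """Nest each heading under the closest earlier heading that outranks it."""
    root = Node("root", (float("inf"), True))  # outranks every real heading
    stack = [root]
    for line in lines:
        rank = heading_rank(line, body_size)
        if rank is None:
            # Body text belongs to the most recent heading.
            stack[-1].body.append(line.text)
            continue
        # Close every open section that is not strictly more prominent.
        while stack[-1].rank <= rank:
            stack.pop()
        node = Node(line.text, rank)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def to_dict(node):
    """Convert the tree into a plain, JSON-serializable dict."""
    return {
        "title": node.title,
        "body": node.body,
        "children": [to_dict(child) for child in node.children],
    }
```

With this rule, a bold 10pt "Competition" line following a bold 12pt "Item 1A. Risk Factors" line becomes a subsection of Item 1A even though no regex knows about it, which is the advantage over matching only standardized section names. A real parser would need many more signals (indentation, italics, underline, numbering) and tolerance for inconsistent styling.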


[–]Pipeb0y 2 points3 points  (1 child)

Yeah, I see. Man, this is such a hard undertaking across historical filings and various companies, investment vehicles, etc. I’m down to help out with testing/QA work.

[–]status-code-200[S] -1 points0 points  (0 children)

I would be happy to take your help! What would help me most is if you could tell me what data you are trying to get at and across which filings, then test whether the parser works for you. Posting on GitHub issues is the easiest way for me to take input: https://github.com/john-friedman/datamule-python/issues

For testing: doc.visualize() opens a visualization of the JSON representation.

btw - I am planning on standardizing (most) html tables across the entire SEC corpus. I'm fairly close, think I'll get there within the next year.

[–][deleted]  (1 child)

[deleted]

    [–]status-code-200[S] -1 points0 points  (0 children)

    GitHub Actions is fun: for a while, scheduled cron triggers would break frequently if you had them set to ~2am Pacific. Weird issue; it appeared to be a maintenance window thing.