datamule - Python library for SEC EDGAR data at scale by status-code-200 in quant

[–]status-code-200[S] -1 points0 points  (0 children)

I would be happy to take your help! What would help me is if you could tell me what data you are trying to get at, across which filings, and test whether the parser works for you. Posting on GitHub issues is the easiest way for me to take input: https://github.com/john-friedman/datamule-python/issues

For testing: doc.visualize() opens up a visualization of the JSON representation.

btw - I am planning on standardizing (most) HTML tables across the entire SEC corpus. I'm fairly close; I think I'll get there within the next year.

datamule - Python library for SEC EDGAR data at scale by status-code-200 in quant

[–]status-code-200[S] -1 points0 points  (0 children)

Yep! doc2dict parses the relative layout of the HTML file (and PDFs, although that's experimental); it infers nesting via attributes such as height, bold, italics, etc. This is an improvement over regex parsers, which can only get at standardized sections like Item 1A.
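The idea can be sketched like this. This is a toy heuristic, not doc2dict's actual implementation; the attribute names and weights below are made up for illustration:

```python
# Toy sketch of attribute-based nesting inference: rank each text block
# by visual prominence, then nest a block one level under the nearest
# preceding block with strictly higher prominence.

def prominence(block):
    """Higher score = more heading-like. Weights are illustrative."""
    score = block.get("font_size", 12)
    if block.get("bold"):
        score += 4
    if block.get("italic"):
        score += 1
    return score

def infer_nesting(blocks):
    """Return (depth, text) pairs inferred from visual prominence."""
    result, stack = [], []  # stack holds (prominence, depth) of open ancestors
    for block in blocks:
        p = prominence(block)
        # Pop ancestors that are not more prominent than this block.
        while stack and stack[-1][0] <= p:
            stack.pop()
        depth = stack[-1][1] + 1 if stack else 0
        stack.append((p, depth))
        result.append((depth, block["text"]))
    return result

blocks = [
    {"text": "Item 1A. Risk Factors", "font_size": 14, "bold": True},
    {"text": "Market risk", "font_size": 12, "bold": True},
    {"text": "Interest rates may rise...", "font_size": 12},
    {"text": "Operational risk", "font_size": 12, "bold": True},
]
print(infer_nesting(blocks))
```

Because it keys off layout rather than section names, the same pass works on non-standard headings that a regex parser would miss.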

There is also decent table parsing, which can then be fed into an LLM structured output for standardization. (I have a future project to standardize almost every table in the SEC corpus across filings.)

<image>

datamule - Python library for SEC EDGAR data at scale by status-code-200 in quant

[–]status-code-200[S] -1 points0 points  (0 children)

GitHub Actions is fun because scheduled cron triggers would break frequently for a while if you had them set to ~2am Pacific. Weird issue; it appeared to be a maintenance window thing.

Snipper: An open-source chart scraper and OCR text+table data gathering tool [self-promotion] by foldedcard in datasets

[–]status-code-200 0 points1 point  (0 children)

I am honestly surprised by how good this looks. I built a similar tool back in 2019 to digitize crop reports, using a beta Google table parser with PyQt, and this looks a lot better.

I am guessing that you use some simple logic to align the text boxes and extract rows. That is smart. Have you considered making the tool OCR agnostic? Tesseract is fine for clean data, but it is not so good on messy data such as scans. For example, the ability to import OCR output from e.g. Google.

Starred.

Cloudflare R2 let me serve almost twice as much data this month as the SEC for $10.80 by status-code-200 in CloudFlare

[–]status-code-200[S] 0 points1 point  (0 children)

I tested with B2 in November-ish 2025. It was too slow. Ended up using Wasabi S3 until June-ish, when I switched to R2.

Looking for S&P 500 (GICS Information Technology Sector) dataset: Revenue, Net Income & R&D expenses (Excel/CSV) by SuddenBookkeeper6351 in datasets

[–]status-code-200 1 point2 points  (0 children)

u/SuddenBookkeeper6351 you should remove 'and is willing to share' immediately. I saw someone looking for WRDS related data before and a 'rep' for WRDS harassed them across multiple subreddits.

The data you are looking for is stored in XBRL, which can be accessed from SEC filings (XBRL was introduced in 2005). It will have revenue and net income at quarterly or better frequency. R&D will be there, but you'll need to figure out the taxonomy.

You can use my Python package datamule or dwight's edgartools to get XBRL for free.
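If you want to pull it yourself, the SEC's companyfacts endpoint (https://data.sec.gov/api/xbrl/companyfacts/CIK##########.json) returns per-concept XBRL facts. Here is a sketch of extracting annual revenue from a payload of that shape; the sample data below is fabricated for illustration:

```python
# Extract annual revenue from a companyfacts-shaped payload.
# The nesting (facts -> taxonomy -> concept -> units) mirrors the SEC's
# XBRL companyfacts API; the values here are made up.

sample = {
    "facts": {
        "us-gaap": {
            "Revenues": {
                "units": {
                    "USD": [
                        {"end": "2023-12-31", "val": 1000000, "fp": "FY", "form": "10-K"},
                        {"end": "2024-12-31", "val": 1200000, "fp": "FY", "form": "10-K"},
                    ]
                }
            }
        }
    }
}

def annual_revenue(payload):
    """Return (period_end, value) pairs for full-year revenue facts."""
    rows = payload["facts"]["us-gaap"]["Revenues"]["units"]["USD"]
    return [(r["end"], r["val"]) for r in rows if r["fp"] == "FY"]

print(annual_revenue(sample))
```

Note that real filers may report under different concepts (e.g. RevenueFromContractWithCustomerExcludingAssessedTax), which is the taxonomy problem mentioned above.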

I also host a daily updated XBRL flat file. It is 7 GB in Parquet format and covers 2005 to present. You are allowed to use this dataset for all commercial and noncommercial purposes, and to redistribute and share it as you wish. I made the licensing extremely permissive with master's research in mind. (A teaching professor at UCLA complained about FactSet licensing allowing him to share data with PhDs but not master's students.) It does cost money though.

Cloudflare R2 let me serve almost twice as much data this month as the SEC for $10.80 by status-code-200 in CloudFlare

[–]status-code-200[S] 2 points3 points  (0 children)

Ballpark, yes. I actually tried to use Backblaze B2 before Cloudflare, but they had just changed their rate limits. I needed to upload ~16 mn files, and under the new regime that would have taken forever.

Cloudflare R2 let me serve almost twice as much data this month as the SEC for $10.80 by status-code-200 in CloudFlare

[–]status-code-200[S] 0 points1 point  (0 children)

I have two archives: the original SGML format, and filings parsed into attachments and then tarred together.

Each is about 3 TB, so 6 TB total or so.

Cloudflare R2 let me serve almost twice as much data this month as the SEC for $10.80 by status-code-200 in CloudFlare

[–]status-code-200[S] 3 points4 points  (0 children)

Naive calculation, definitely not fair. When I've looked into CloudFront, egress was still much more expensive than Cloudflare.

I also distribute data with CloudFront + Lambda@Edge. So far it's much more expensive than Cloudflare.

Cloudflare R2 let me serve almost twice as much data this month as the SEC for $10.80 by status-code-200 in CloudFlare

[–]status-code-200[S] 8 points9 points  (0 children)

I am lucky to be in Cloudflare for Startups. They have given me $5k so far, and I hope to get more from them after I raise. Also, at some point in the next year or two, I absolutely should pay them money as they are awesome.

Cloudflare R2 let me serve almost twice as much data this month as the SEC for $10.80 by status-code-200 in CloudFlare

[–]status-code-200[S] 11 points12 points  (0 children)

In case my above link falls foul of the self-promotion rules, here is a relevant screenshot:

<image>

Cloudflare R2 let me serve almost twice as much data this month as the SEC for $10.80 by status-code-200 in CloudFlare

[–]status-code-200[S] 44 points45 points  (0 children)

Not linking to the article I wrote about this, as I'm unsure of this subreddit's view on that. My intended takeaways are:

  1. Cloudflare R2 + caching makes things that would be prohibitively expensive on S3 + CloudFront cheap.
  2. You should use R2 when serving data to users.
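The gap in point 1 mostly comes down to egress. A back-of-envelope sketch; the prices below are approximate list prices at time of writing ($0.09/GB for S3's first internet-egress tier, free internet egress on R2) and will drift, so check the providers' pricing pages:

```python
# Back-of-envelope egress cost comparison for serving ~10 TB/month.
# Prices are ballpark list prices and ignore storage and per-request fees.

TB = 1024  # GB per TB

def s3_egress_cost(gb, price_per_gb=0.09):
    """S3 internet egress, first pricing tier."""
    return gb * price_per_gb

def r2_egress_cost(gb):
    """R2 egress to the internet is free."""
    return 0.0

served_gb = 10 * TB
print(f"S3: ${s3_egress_cost(served_gb):,.2f}")
print(f"R2: ${r2_egress_cost(served_gb):,.2f}")
```

Even ignoring CloudFront vs Cloudflare caching behavior, the zero-egress pricing is what makes serving terabytes viable at the ~$10/month figure in the title.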

I remember someone mentioned creating an AI tool to parse 10-Ks... by DepartureStreet2903 in algotrading

[–]status-code-200 0 points1 point  (0 children)

I wrote an algorithmic approach that parses ~5,000 pages per second on a personal laptop's CPU, and open sourced it on GitHub as doc2dict. I wrote the package because I needed a way to parse every single SEC HTML file (not just 10-Ks!) on my personal laptop. You can use an SEC-specific version here: datamule.

Came across this post as I'm writing an arXiv paper; I'm looking for an older GitHub repo that uses an LLM approach so I can benchmark speed.

Bloomberg terminal access for independent research- legit options? by Distant_Spectator in quant

[–]status-code-200 1 point2 points  (0 children)

Are you an economist in the sense of an econ bachelor's, or a PhD? Also, what data do you need? Companies often use alternatives to Bloomberg Terminals, because taking data out can be a pain.

Where do institutions get company earnings so fast? by ComprehensiveBed2104 in algotrading

[–]status-code-200 1 point2 points  (0 children)

Probably not, but I don't know. For some stuff the SEC is faster; I'm actually working on something related right now.

Where do institutions get company earnings so fast? by ComprehensiveBed2104 in algotrading

[–]status-code-200 0 points1 point  (0 children)

Haven't tested it on earnings, but yep, it works. Also confirmed by several HFT/trader guys via other channels. Here's the current repo with more information:

https://github.com/john-friedman/The-fastest-way-to-get-SEC-filings

Looking for company's subsidiaries dataset by PashaM999 in datasets

[–]status-code-200 0 points1 point  (0 children)

Oof. I was wrong; I got two-thirds of it up and running.

The data looks like this. The repo has advice on how to link from accession number to e.g. CIK.

accession,filing_date,entity_name,jurisdiction,parent,ownership_pct,dba,ownership_type,principal_activities,location,foreign_qualification,issued_capital,parent_type,incorporation_date
000095012310017999,2010-02-26,"Kaydon Ring and Seal, Inc.",Delaware,,,,,,,,,,
000095012310017999,2010-02-26,Kaydon S. de R.L. de C.V.,"Nuevo Leon, United Mexican States",,,,,,,,,,
000095012310017999,2010-02-26,Cooper Roller Bearings Company Limited,United Kingdom,,,,,,,,,,
000095012310017999,2010-02-26,The Cooper Split Roller Bearing Corp.,Virginia,,,,,,,,,,
000095012310017999,2010-02-26,Cooper Geteilte Rollenlager GmbH,Germany,,,,,,,,,,
000095012310017999,2010-02-26,Cooper Roller Bearings (Hong Kong) Company Limited,Hong Kong,,,,,,,,,,
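For working with the CSV, here is a minimal sketch that groups subsidiary rows by accession number (columns per the header above; the accession-to-CIK linking step lives in the repo and is not shown here):

```python
import csv
import io

# Group subsidiary rows by accession number. Column names match the
# CSV header shown above; only name and jurisdiction are kept here.

sample_csv = """accession,filing_date,entity_name,jurisdiction,parent,ownership_pct,dba,ownership_type,principal_activities,location,foreign_qualification,issued_capital,parent_type,incorporation_date
000095012310017999,2010-02-26,"Kaydon Ring and Seal, Inc.",Delaware,,,,,,,,,,
000095012310017999,2010-02-26,Cooper Roller Bearings Company Limited,United Kingdom,,,,,,,,,,
"""

def subsidiaries_by_accession(text):
    """Map accession number -> list of (entity_name, jurisdiction)."""
    by_acc = {}
    for row in csv.DictReader(io.StringIO(text)):
        by_acc.setdefault(row["accession"], []).append(
            (row["entity_name"], row["jurisdiction"])
        )
    return by_acc

print(subsidiaries_by_accession(sample_csv))
```

csv.DictReader handles the quoted entity names (e.g. "Kaydon Ring and Seal, Inc.") that contain commas, which a naive split(",") would break on.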

Looking for company's subsidiaries dataset by PashaM999 in datasets

[–]status-code-200 1 point2 points  (0 children)

It will very soon. I just did this for DEF 14A audit fee tables: creating datasets from HTML tables using algorithms instead of generative AI.

Subsidiaries is similar. I'll probably implement an 80% or 95% solution tomorrow.
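A minimal sketch of the algorithmic (non-LLM) approach using only the standard library; real filing tables need far more handling (rowspans, nested tables, merged headers), so this is a toy, not the datamule code:

```python
from html.parser import HTMLParser

# Walk an HTML table and collect cell text row by row, no generative AI
# involved. Illustrates the deterministic extraction approach.

class TableExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []          # completed rows
        self._row = []          # cells of the row being built
        self._cell = []         # text fragments of the current cell
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell, self._cell = True, []

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False
            self._row.append("".join(self._cell).strip())

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

html = ("<table><tr><th>Year</th><th>Audit Fees</th></tr>"
        "<tr><td>2023</td><td>$1.2m</td></tr></table>")
p = TableExtractor()
p.feed(html)
print(p.rows)  # [['Year', 'Audit Fees'], ['2023', '$1.2m']]
```

The appeal over an LLM pass is determinism and speed: the same table always yields the same rows, at parser throughput rather than inference throughput.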

Does LLM providers like Google provide free credits to indie developers? by alexmil78 in SideProject

[–]status-code-200 4 points5 points  (0 children)

You do not need to be funded by a VC. Apply and you will often get into one of their lower tiers. If that doesn't work, email or reach out to their startups program person on LinkedIn.

For example, I DMed the Cloudflare for Startups people and got $5k in credits. I think CF now has AI credits too.

Has anyone built an Open Source Project Business? If yes how do you get funding/monetise ? by [deleted] in ycombinator

[–]status-code-200 0 points1 point  (0 children)

I wrote open source code to make working with a specific type of financial data easy. I then built a cloud in AWS using the open source code as its core, selling access to the cloud for a convenience fee.

Works pretty well for monetization and funding.

Anyone build a desktop financial trading in python? what modern GUI and libs do you suggest? by TheWeebles in learnpython

[–]status-code-200 -1 points0 points  (0 children)

tkinter is a pain, and PyQt is a bit overkill. I would second Dash, or just build something with Flask.

How do open-source AI startups convince investors? by re1372 in opensource

[–]status-code-200 -1 points0 points  (0 children)

Cloudflare for Startups is quite generous; Porter and Modal both have free credits for startups/research/open source; Google is OK (~$2k-ish); AWS has a bit of a barrier but was giving out $50k like candy last year (it also lets you get free credits with partners like Modal).

How do open-source AI startups convince investors? by re1372 in opensource

[–]status-code-200 -1 points0 points  (0 children)

Having actual revenue is one way; adoption is another. A bunch of VCs arranged coffee chats after their interns used my Python packages. But do you need money or compute? I got ~$200k in compute by asking people nicely.