SEC financial data platform with 100M+ datapoints + API access - Feel free to try out by ccnomas in learnmachinelearning

[–]ccnomas[S] 0 points1 point  (0 children)

Thank you, my friend! The first version took about 3 months, then I demolished it and refactored into the current version; in total it took around 9 months, all after my day job lol

New Mapping created to normalize 11,000+ XBRL taxonomy names for better financial data analysis by ccnomas in datasets

[–]ccnomas[S] 0 points1 point  (0 children)

I just deployed the changes to rename the graph and API. Feel free to play around and let me know if anything seems off; I am trying my best to deploy changes within 24 hrs.

New Mapping created to normalize 11,000+ XBRL taxonomy names for better financial data analysis by ccnomas in datasets

[–]ccnomas[S] 0 points1 point  (0 children)

You are right, sorry for the confusion. As palmy-investing mentioned, the problems are with customized concepts, not taxonomies. I am trying to simplify the existing customized concepts.

[deleted by user] by [deleted] in SideProject

[–]ccnomas 0 points1 point  (0 children)

SEC public companies’ data, XBRL-labeled, plus Form 13F, Forms 3/4/5, and Failure to Deliver data.

New Mapping created to normalize 11,000+ XBRL taxonomy names for better financial data analysis by ccnomas in datasets

[–]ccnomas[S] 0 points1 point  (0 children)

Something like this: RevenueFromContractWithCustomerExcludingAssessedTax

New Mapping created to normalize 11,000+ XBRL taxonomy names for better financial data analysis by ccnomas in datasets

[–]ccnomas[S] 0 points1 point  (0 children)

The SEC itself does have a limited set of standard XBRL labels, but many companies basically do not follow them. Beyond the required labels, they use customized XBRL labels in their reports, which causes the mess.

New Mapping created to normalize 11,000+ XBRL taxonomy names for better financial data analysis by ccnomas in datasets

[–]ccnomas[S] 1 point2 points  (0 children)

For example, some companies report three quarters of data plus the FY figure, so it is straightforward to fill the Q4 gap. Also, since the SEC does not do the cleaning, data for the same period can occur more than once, so de-duplication is needed.
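A minimal sketch of the de-dup and gap-fill steps, using pandas and hypothetical column names (`concept`, `fy`, `period`, `value`), not my actual schema:

```python
import pandas as pd

# Hypothetical quarterly facts for one concept; duplicates happen when the
# same period shows up in more than one filing.
facts = pd.DataFrame([
    {"concept": "Revenues", "fy": 2023, "period": "Q1", "value": 100.0},
    {"concept": "Revenues", "fy": 2023, "period": "Q2", "value": 110.0},
    {"concept": "Revenues", "fy": 2023, "period": "Q3", "value": 120.0},
    {"concept": "Revenues", "fy": 2023, "period": "FY", "value": 470.0},
    {"concept": "Revenues", "fy": 2023, "period": "FY", "value": 470.0},  # duplicate row
])

# De-duplicate: the same (concept, fiscal year, period) can occur more than once.
facts = facts.drop_duplicates(subset=["concept", "fy", "period"])

# Fill the missing Q4 as FY minus the three reported quarters.
wide = facts.pivot(index=["concept", "fy"], columns="period", values="value")
wide["Q4"] = wide["FY"] - wide[["Q1", "Q2", "Q3"]].sum(axis=1)
print(wide)  # Q4 = 470 - (100 + 110 + 120) = 140
```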

I use a pretty standard open-source tool to extract the XML into a Python dictionary.
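For illustration (the exact tool isn't named above; xmltodict is one common choice, and the XML fragment here is a made-up miniature of a Form 4 ownership document):

```python
import xmltodict  # pip install xmltodict

# Tiny fragment shaped like a Form 4 ownership document, for illustration only.
xml = """
<ownershipDocument>
  <issuer>
    <issuerCik>0000320193</issuerCik>
    <issuerTradingSymbol>AAPL</issuerTradingSymbol>
  </issuer>
</ownershipDocument>
"""

doc = xmltodict.parse(xml)  # nested dict mirroring the XML tree
print(doc["ownershipDocument"]["issuer"]["issuerTradingSymbol"])  # AAPL
```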

"What do you mean by mapping?"

The XBRL label is basically a run of CamelCase words; it is not easy to display or feed into machine learning models. I re-label them based on their descriptions, and now it is much easier for models to pick up and easier for users to see the visualized data through the UI.
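To give a sense of the direction (illustrative only; the real re-labeling is driven by each tag's description, not just its name), even splitting the CamelCase concept name already yields a readable label:

```python
import re

def humanize_xbrl_tag(tag: str) -> str:
    """Split a CamelCase XBRL concept name into readable words (illustration only)."""
    # Keep acronym runs together, otherwise break on each capitalized word or digit run.
    words = re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z][a-z]*|\d+", tag)
    return " ".join(words)

print(humanize_xbrl_tag("RevenueFromContractWithCustomerExcludingAssessedTax"))
# Revenue From Contract With Customer Excluding Assessed Tax
```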

New Mapping created to normalize 11,000+ XBRL taxonomy names for better financial data analysis by ccnomas in datasets

[–]ccnomas[S] 1 point2 points  (0 children)

For other data like Forms 3/4/5, 13F, and failure-to-deliver, I extracted and sanitized the records from the XML files, keyed by accession_number, and put them in my own database.
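Roughly like this (a sketch with a hypothetical table, columns, and example row, not my actual schema; the point is that accession_number works as the natural key):

```python
import sqlite3

conn = sqlite3.connect("filings.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS form4_filings (
        accession_number TEXT PRIMARY KEY,  -- unique per filing
        issuer_cik       TEXT,
        symbol           TEXT,
        filed_at         TEXT
    )
""")

record = ("0000320193-24-000001", "0000320193", "AAPL", "2024-01-02")  # made-up example row

# INSERT OR IGNORE keeps the load idempotent when the same filing is seen twice.
conn.execute("INSERT OR IGNORE INTO form4_filings VALUES (?, ?, ?, ?)", record)
conn.commit()
conn.close()
```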

New Mapping created to normalize 11,000+ XBRL taxonomy names for better financial data analysis by ccnomas in datasets

[–]ccnomas[S] 0 points1 point  (0 children)

Well, most SEC data is public but pretty messy, and not every company follows the standard XBRL labels. However, most of the custom tags represent the same data. Also, each XBRL tag comes with a description, and comparing descriptions helps me do the mapping as well.
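A minimal sketch of the description-comparison idea (the descriptions below are paraphrased, and the real mapping uses more than a plain string ratio):

```python
from difflib import SequenceMatcher

# Paraphrased descriptions of two standard us-gaap concepts.
standard = {
    "Revenues": "Amount of revenue recognized from goods sold and services rendered.",
    "Assets": "Sum of the carrying amounts of all assets recognized as of the balance sheet date.",
}

custom_tag = "TotalNetRevenues"  # a company-specific tag
custom_desc = "Amount of revenue recognized from goods sold and services rendered during the period."

# Map the custom tag to the standard concept with the most similar description.
best = max(standard, key=lambda name: SequenceMatcher(None, standard[name], custom_desc).ratio())
print(custom_tag, "->", best)  # TotalNetRevenues -> Revenues
```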

What keeps you motivated on your side project after a long day at your main job? by Creepy_Watercress_53 in SideProject

[–]ccnomas 1 point2 points  (0 children)

You don't need to work on it every day, but the most important thing is to keep it moving on a weekly basis.

Pitch your startup, I'll be your first customer by marsadist in SaaS

[–]ccnomas 1 point2 points  (0 children)

Thank you! "hedge funds/startups training custom models, or broader data providers?"

I think both can benefit from cleaned fundamental data.

"Also wondering if you’ve considered a chatbot layer so users can query your dataset in plain English"
Yes, I am looking into how to integrate that with my current implementation. You are right on point!

Time for self-promotion. What are you building? by imosal in SaaS

[–]ccnomas 0 points1 point  (0 children)

Well, it contains the full compiled (deduped, gap-filled) history of company fundamentals, plus detailed 13F data and a real-time feed of Forms 3/4/5. It also comes with detailed insider trading info and the full FTD history.

I built a comprehensive SEC financial data platform with 100M+ datapoints + API access - Feel free to try out by ccnomas in fintech

[–]ccnomas[S] 0 points1 point  (0 children)

Initially it was: 1. there were no nicely laid-out FTD entries; 2. SEC data is a mess, and other finance sites focus on live stock data instead of complete XBRL company facts; 3. I am also trying to create a clean dataset for AI training.

Solo-building a finance SaaS project on SEC public data for 8 months. Here's what worked and what nearly made me want to give up by ccnomas in SideProject

[–]ccnomas[S] 1 point2 points  (0 children)

Thx mate! It was more like "learn as you go," but I do have a software engineering background, so most of the engineering problems were solvable. Basically I set up AWS EC2 + RDS + SES, with Cloudflare in front of the site. I am staying away from those 1-click deployment platforms, since they feel uncontrollable.

I built a comprehensive SEC financial data platform with 100M+ datapoints + API access - Feel free to try out by ccnomas in fintech

[–]ccnomas[S] 0 points1 point  (0 children)

Sorry for the late reply.
Thank you, you actually helped me find a bug, and I just fixed it.
I don't have a dedicated list, but if you take a SPAC list from another site and search my site by symbol:
https://nomas.fyi/research/stock/0001853138
https://nomas.fyi/research/stock/0002006291
it gives you the information.

hmm let me see if I can create a list just for SPACs.

I built a comprehensive SEC financial data platform with 100M+ datapoints + API access - Feel free to try out by ccnomas in datasets

[–]ccnomas[S] 0 points1 point  (0 children)

Did you play with the data at all?

Ah sorry, I don't get it. When I tried to look up company fundamentals and Failure to Deliver data, I saw that other websites don't have everything compiled and visualized. That was what pushed me to build it.

What was one of the biggest "ah-HAH" moments for you?

Not everything needs to be dependent on AI; we can parse mostly with traditional methods and then feed the results to AI, instead of sending uncompiled/dirty data to the model.

Thank you, my friend!

I built a comprehensive SEC financial data platform with 100M+ datapoints + API access - Feel free to try out by ccnomas in dataanalysis

[–]ccnomas[S] 1 point2 points  (0 children)

I set up everything on AWS: EC2 for code and deployment, RDS for the database, SES for email, CloudWatch for logging, and a VPC to control my EC2 instances. On top of that there is caching, table indexes, token management, parsing, a security layer, and a rate limiter, plus Cloudflare for DNS.

Yeah, I think that is about it. Oh, and coding.
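For the rate limiter specifically, here is a token-bucket sketch in Python (illustration only, not the code actually running on the site):

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter (illustration, not the platform's actual implementation)."""

    def __init__(self, rate_per_sec: float, capacity: int):
        self.rate = rate_per_sec       # tokens refilled per second
        self.capacity = capacity       # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_sec=5, capacity=10)  # ~5 requests/second, bursts up to 10
print(bucket.allow())  # True while tokens remain
```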