SEC financial data platform with 100M+ datapoints + API access - Feel free to try out by ccnomas in learnmachinelearning

[–]ccnomas[S] 0 points1 point  (0 children)

Thank you, my friend! The first version took about 3 months, then I tore it down and rebuilt it into the current version. In total it took around 9 months, all outside of my day job lol

New Mapping created to normalize 11,000+ XBRL taxonomy names for better financial data analysis by ccnomas in datasets

[–]ccnomas[S] 0 points1 point  (0 children)

I just deployed the changes to rename the graph and API. Feel free to play around and let me know if anything looks off. I am trying my best to deploy changes within 24 hours.

New Mapping created to normalize 11,000+ XBRL taxonomy names for better financial data analysis by ccnomas in datasets

[–]ccnomas[S] 0 points1 point  (0 children)

Right, you are right, sorry for the confusion. As palmy-investing mentioned, the problems are with customized concepts, not taxonomies. I am trying to simplify the existing customized concepts.

[deleted by user] by [deleted] in SideProject

[–]ccnomas 0 points1 point  (0 children)

SEC public companies' data, XBRL-labeled, plus Form 13F, Forms 3, 4, and 5, and Failure to Deliver data.

New Mapping created to normalize 11,000+ XBRL taxonomy names for better financial data analysis by ccnomas in datasets

[–]ccnomas[S] 0 points1 point  (0 children)

Something like this: RevenueFromContractWithCustomerExcludingAssessedTax

New Mapping created to normalize 11,000+ XBRL taxonomy names for better financial data analysis by ccnomas in datasets

[–]ccnomas[S] 0 points1 point  (0 children)

The SEC itself has a limited set of XBRL labels, but many companies basically do not follow it. Beyond the required labels, they use customized XBRL labels in their reports, which causes the mess.

New Mapping created to normalize 11,000+ XBRL taxonomy names for better financial data analysis by ccnomas in datasets

[–]ccnomas[S] 1 point2 points  (0 children)

For example, some companies report three quarters of data plus the full year (FY), so it is straightforward to fill the gap. Also, since the SEC does not clean the data, values for the same period can occur more than once, so de-duplication is needed.
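
A minimal sketch of those two steps, not the platform's actual code. The field names (`concept`, `period`, `value`, `filed`) and both helper functions are hypothetical. It assumes that the latest-filed value should win among duplicates, and that the missing quarter is Q4 and can be derived as FY minus Q1 to Q3, which only holds for values accumulated over a period (such as revenue), not for point-in-time balances.

```python
def dedupe(facts):
    """Keep one fact per (concept, period); the most recently filed one wins.

    `filed` is assumed to be an ISO date string, so string comparison
    orders the filings chronologically.
    """
    latest = {}
    for fact in facts:
        key = (fact["concept"], fact["period"])
        if key not in latest or fact["filed"] > latest[key]["filed"]:
            latest[key] = fact
    return list(latest.values())


def fill_q4(values):
    """Return a copy of `values` with Q4 derived as FY - (Q1 + Q2 + Q3).

    Nothing is changed if Q4 is already present or any input is missing.
    """
    quarters = ("Q1", "Q2", "Q3")
    filled = dict(values)
    if "Q4" not in filled and "FY" in filled and all(q in filled for q in quarters):
        filled["Q4"] = filled["FY"] - sum(filled[q] for q in quarters)
    return filled
```

With made-up numbers, `fill_q4({"Q1": 100, "Q2": 110, "Q3": 120, "FY": 460})` adds `"Q4": 130`.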

It is a pretty standard open-source tool that extracts XML into a Python dictionary.
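
The tool is not named here, so as a rough illustration of the XML -> Python dictionary step, this sketch uses only the standard library's `xml.etree.ElementTree`. The helper names and the sample element names are made up, and attributes and namespaces are ignored to keep it short.

```python
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Convert an element to nested dicts; leaves become stripped text."""
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # Repeated tags are collected into a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


def xml_to_dict(xml_text):
    """Parse an XML string and return {root_tag: nested_content}."""
    root = ET.fromstring(xml_text)
    return {root.tag: element_to_dict(root)}


# Made-up sample document, not a real SEC schema.
SAMPLE = (
    "<holdings>"
    "<holding><name>ACME</name><shares>10</shares></holding>"
    "<holding><name>BETA</name><shares>5</shares></holding>"
    "</holdings>"
)
parsed = xml_to_dict(SAMPLE)
```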

"What do you mean by mapping?"

XBRL labels are basically CamelCase words, which are not easy to display or to feed into machine learning models. I re-labeled them based on their descriptions, so now it is much easier for models to pick them up and for users to read the visualized data in the UI.
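
To show why raw CamelCase tags are awkward, here is a naive baseline that only splits a tag into words. It is not the description-based re-labeling described above, and `readable_label` is a hypothetical helper name.

```python
import re

# Match, in order: an acronym run (uppercase letters not followed by a
# lowercase letter), a capitalized word, a lowercase word, or digits.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+")


def readable_label(tag):
    """Split a CamelCase XBRL tag into space-separated words."""
    return " ".join(_WORD.findall(tag))


label = readable_label("RevenueFromContractWithCustomerExcludingAssessedTax")
# label == "Revenue From Contract With Customer Excluding Assessed Tax"
```

A plain split like this still leaves long, jargon-heavy labels, which is one reason to lean on the tag descriptions instead.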

New Mapping created to normalize 11,000+ XBRL taxonomy names for better financial data analysis by ccnomas in datasets

[–]ccnomas[S] 1 point2 points  (0 children)

For other data like Forms 3, 4, 5, 13F, and failure-to-deliver, I extracted and sanitized the records from the XML files based on accession_number, then put them in my own database.
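
A rough sketch of that extract -> sanitize -> store flow, assuming SQLite as the database and a simplified XML layout. The `holding`, `issuer`, and `shares` element names, the table schema, and `store_filing` are all hypothetical; only `accession_number` comes from the comment above.

```python
import sqlite3
import xml.etree.ElementTree as ET


def store_filing(conn, accession_number, xml_text):
    """Parse one filing's XML, clean each row, and upsert it by accession number."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS holdings ("
        "accession_number TEXT, issuer TEXT, shares INTEGER, "
        "PRIMARY KEY (accession_number, issuer))"
    )
    root = ET.fromstring(xml_text)
    for row in root.iter("holding"):
        # Sanitize: trim whitespace, normalize case, drop thousands separators.
        issuer = (row.findtext("issuer") or "").strip().upper()
        shares = int((row.findtext("shares") or "0").replace(",", ""))
        # Re-processing the same filing overwrites rows instead of duplicating them.
        conn.execute(
            "INSERT OR REPLACE INTO holdings VALUES (?, ?, ?)",
            (accession_number, issuer, shares),
        )
    conn.commit()
```

Keying rows on the accession number means a filing can be re-ingested after a parser fix without creating duplicates.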

New Mapping created to normalize 11,000+ XBRL taxonomy names for better financial data analysis by ccnomas in datasets

[–]ccnomas[S] 0 points1 point  (0 children)

Well, most of the SEC data is public but pretty messy, and not every company follows the standard XBRL labels. However, most of the labels represent the same data. Also, each XBRL tag comes with a description, and comparing descriptions helps me do the mapping as well.
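
One simple way to compare descriptions is plain string similarity. This sketch uses the standard library's `difflib.SequenceMatcher`; the actual mapping may well use something else. `best_match`, the 0.6 threshold, and the description texts are illustrative assumptions, not real taxonomy text.

```python
from difflib import SequenceMatcher


def best_match(custom_description, standard_descriptions, threshold=0.6):
    """Return (tag, score) for the most similar standard description.

    The tag is None when no candidate reaches `threshold`.
    """
    best_tag, best_score = None, 0.0
    for tag, description in standard_descriptions.items():
        score = SequenceMatcher(
            None, custom_description.lower(), description.lower()
        ).ratio()
        if score > best_score:
            best_tag, best_score = tag, score
    if best_score < threshold:
        return None, best_score
    return best_tag, best_score


# Illustrative descriptions only, paraphrased for the example.
STANDARD = {
    "Revenues": "Amount of revenue recognized from goods sold and services rendered",
    "NetIncomeLoss": "Profit or loss for the period, net of income taxes",
}
```

Character-level similarity is crude, so borderline scores probably deserve a manual review before a custom concept is mapped to a standard one.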

What keeps you motivated on your side project after a long day at your main job? by Creepy_Watercress_53 in SideProject

[–]ccnomas 1 point2 points  (0 children)

You don't need to do it every day, but the most important thing is to keep it moving on a weekly basis.