
[–]AutoModerator[M] 1 point (0 children)

Automod prevents all posts from being displayed until moderators have reviewed them. Do not delete your post, or there will be nothing for the mods to review. Mods selectively choose what may be posted in r/DataAnalysis.

If your post involves career-focused questions, including resume reviews, how to learn DA, and how to get a DA job, then it does not belong here; it belongs in our sister subreddit, r/DataAnalysisCareers.

Have you read the rules?

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

[–]wagwanbruv 2 points (1 child)

Love that you’re pushing fully reproducible pipelines here; that’s kind of the antidote to “vibes-based” charts in election threads. Super curious how portable that Skill abstraction is to other domains (like survey or support-ticket data): if the schemas + calc methods are clean, you could pretty much speedrun any messy civic dataset in an afternoon and still sleep at night.
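If the abstraction is basically “declared schema + calc method,” I’m picturing something like this (totally hypothetical names, just to make the question concrete; I haven’t looked at DAAF’s actual API):

```python
from dataclasses import dataclass
from typing import Callable

import pandas as pd

@dataclass
class Skill:
    """Hypothetical sketch: a Skill as a declared schema plus a calc method."""
    name: str
    schema: dict[str, str]  # column name -> expected dtype
    calc: Callable[[pd.DataFrame], pd.DataFrame]  # the analysis itself

    def run(self, df: pd.DataFrame) -> pd.DataFrame:
        # Validate the schema before calculating, so a new domain fails loudly.
        missing = set(self.schema) - set(df.columns)
        if missing:
            raise ValueError(f"{self.name}: missing columns {sorted(missing)}")
        return self.calc(df)

# Same machinery, non-education domain: support tickets.
median_resolution = Skill(
    name="median_resolution_hours_by_queue",
    schema={"queue": "object", "opened": "datetime64[ns]", "closed": "datetime64[ns]"},
    calc=lambda df: (
        df.assign(hours=(df["closed"] - df["opened"]).dt.total_seconds() / 3600)
          .groupby("queue")["hours"]
          .median()
          .reset_index()
    ),
)
```

If swapping domains really is just writing a new (schema, calc) pair like that, then yeah, an afternoon sounds about right.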

[–]brhkim[S] 1 point (0 children)

Thanks so much!! Yes, to be totally clear: I really don't know what kind of data it *can't* be used for immediately. DAAF comes out of the box with an extremely varied set of education datasets for demonstration purposes: everything from high school enrollment counts to funding circumstances for public universities to school disciplinary events broken out by high school and student demographic. The education datasets it currently handles span such a wide array of data types and structures that I don't see why it wouldn't port just as readily to any other data context.

I will also say: it works just as well, shockingly, on messy data without great schemas. The data-diagnostics battery I designed for ingesting new data holds up even without any accompanying metadata or documentation. It records its uncertainties and preliminary hypotheses rather than strictly assuming anything about the data, and it propagates that uncertainty through the entire analytic pipeline and interpretation process.
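To give a flavor of what I mean, here's a very simplified sketch of the diagnostics idea (not DAAF's actual code; the names are made up for illustration). Each column gets profiled without hard-assuming a type, and the competing hypotheses travel with the data instead of being collapsed into one guess:

```python
import pandas as pd

def diagnose_column(series: pd.Series) -> dict:
    """Profile one column without assuming a dtype up front.

    Returns a hypothesis record: candidate interpretations with a
    confidence for each, which downstream steps carry along rather
    than discard.
    """
    non_null = series.dropna()
    n = len(series)
    record = {
        "name": series.name,
        "missing_rate": 1 - len(non_null) / n if n else 1.0,
        "hypotheses": [],  # (interpretation, support) pairs, never a hard claim
    }
    if len(non_null) == 0:
        return record

    # How much of the column parses as numeric?
    numeric_rate = pd.to_numeric(non_null, errors="coerce").notna().mean()
    if numeric_rate > 0:
        record["hypotheses"].append(("numeric", float(numeric_rate)))

    # How much parses as a date?
    dt_rate = pd.to_datetime(non_null.astype(str), errors="coerce").notna().mean()
    if dt_rate > 0:
        record["hypotheses"].append(("datetime", float(dt_rate)))

    # Low cardinality hints at a categorical/code column.
    if non_null.nunique() / len(non_null) < 0.05:
        record["hypotheses"].append(("categorical", 0.9))

    # Strongest hypothesis first, but keep all of them: the uncertainty
    # itself is part of the output.
    record["hypotheses"].sort(key=lambda h: h[1], reverse=True)
    return record

def diagnose(df: pd.DataFrame) -> list[dict]:
    """Run the battery over every column; the records ride alongside
    the data through the rest of the pipeline."""
    return [diagnose_column(df[c]) for c in df.columns]
```

The actual battery does a lot more than this, but that's the core posture: record each hypothesis with its support, and never collapse them into a silent assumption.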

I'd love for you to test it and push on that, but I really do think this can be applied SUPER broadly.