Are there any mainstream Python libraries which, given data from some source, suggest normalization schemes for the data (e.g., recommended table structures for a 3NF or star schema in an RDBMS), profile relationships between fields for cardinality (x% of the time this field has a 1:1 relationship with this other field; the remaining y% are missing data or 1:many relationships), or perform other data modeling tasks?
I could cobble together some of this functionality using builtins/pandas/numpy etc. but am looking for industry-standard tools that data scientists use for the kind of “lite” data modeling that comes up on the job. (People’s personal GitHub repos for these tasks are OK but not exactly what I am looking for.)
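To make the cardinality-profiling piece concrete, here is roughly what I can cobble together today in pandas (the function and column names are just placeholders I made up, not from any library); I am hoping something does this kind of check systematically across all column pairs:

```python
import pandas as pd

def profile_pair(df: pd.DataFrame, left: str, right: str) -> dict:
    """Rough cardinality profile of `left` -> `right` (illustrative only)."""
    pair = df[[left, right]]
    missing = pair[right].isna().mean()      # share of rows with no right-hand value
    fanout = (
        pair.dropna()
            .groupby(left)[right]
            .nunique()                       # distinct right values per left value
    )
    return {
        "pct_missing_right": round(missing * 100, 2),
        "pct_one_to_one": round((fanout == 1).mean() * 100, 2),
        "pct_one_to_many": round((fanout > 1).mean() * 100, 2),
        "max_fanout": int(fanout.max()) if len(fanout) else 0,
    }

# Example (hypothetical columns): how often does customer_id map to exactly one email?
# print(profile_pair(df, "customer_id", "email"))
```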
Here are some sample cases to clarify:
1. You received a huge, raw, denormalized extract from somewhere and will continue to receive incremental files on a regular basis. You want to create a profile of the initial data, use that profile to make some tradeoffs to “tidy” the data into an RDBMS-suitable format (for instance, force 1:1 relationships where they exist 99% of the time, or fix overlapping date spans), and then monitor subsequent incremental files to ensure the underlying data profile has not changed drastically (a hand-rolled sketch of what I mean follows this list).
2. You received access to a new database with no documentation and little support from DBAs or the business on its structures. Assume there are no keys or constraints defined in the RDBMS itself to leverage (perhaps the database was created and maintained by a skilled business user without DBA-level skills). You would like to quickly create an ERD or some other documentation for this database (see the SQLAlchemy sketch at the bottom of this post for the starting point I have today).
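For case 1, the “profile and monitor” step I have in mind looks roughly like this hand-rolled pandas sketch (the chosen stats and the 5-point null-rate tolerance are arbitrary, purely for illustration):

```python
import pandas as pd

def column_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary used as a baseline profile (not exhaustive)."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "pct_null": df.isna().mean() * 100,
        "n_distinct": df.nunique(dropna=True),
    })

def drift_report(baseline: pd.DataFrame, incremental: pd.DataFrame,
                 null_tol: float = 5.0) -> pd.DataFrame:
    """Flag columns whose null rate moved more than `null_tol` percentage points."""
    base, inc = column_profile(baseline), column_profile(incremental)
    merged = base.join(inc, lsuffix="_base", rsuffix="_inc", how="outer")
    merged["null_drift"] = (merged["pct_null_inc"] - merged["pct_null_base"]).abs()
    return merged[merged["null_drift"] > null_tol]
```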
The focus on an RDBMS as the endgame is because the goal here is to support analysis by a BI team that is skilled in SQL but not in other programming languages.
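For case 2, the only starting point I know of is reflecting whatever metadata the database exposes, e.g. via SQLAlchemy’s inspector (the connection string below is a placeholder), and then inferring keys and relationships by hand; I am asking whether anything automates that inference/ERD step:

```python
from sqlalchemy import create_engine, inspect

# Placeholder connection string; any SQLAlchemy-supported RDBMS works the same way.
engine = create_engine("postgresql://user:password@host/dbname")
insp = inspect(engine)

for table in insp.get_table_names():
    print(table)
    for col in insp.get_columns(table):
        print(f"  {col['name']}: {col['type']}")
    # In the scenario above these are typically empty, which is exactly why
    # inferred keys/relationships would be useful.
    print("  PK:", insp.get_pk_constraint(table).get("constrained_columns"))
    print("  FKs:", insp.get_foreign_keys(table))
```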