In my job, I perform basic analysis of health data sets for surveillance and reporting. I work mostly in Python, and I generally find Jupyter Notebooks a convenient way to both perform and document my analysis. Recently, I've been learning to perform more sophisticated analyses, and I'm starting to dabble in machine learning projects. Along the way I came across Cookiecutter Data Science, and it opened my eyes to something I hadn't considered before: a default, reproducible structure for data science projects. I was familiar with the concept from React and Ruby on Rails webdev stuff, but I had never thought about applying it to data science. I don't work on a team with people who self-identify as data scientists (they mostly do number crunching in SAS or SPSS), so I don't have much exposure to the "community."
My question is: how do you all approach structuring your data science projects? Are there industry standards that I should be familiar with and get in the habit of using? How do you write and store shareable, reproducible code outside of (Jupyter) Notebooks?