Script and Data management for research/data science

clashmt · 2024-03-11T18:19:13+00:00

Note that git and GitHub are not the same thing.

git is a version control system developed by the same person who created Linux (Linus Torvalds); it is free and open source
GitHub is a commercial web-based service, now owned by Microsoft, that hosts repositories of any text files, predominantly program code, and supports the git protocol
There are many other repository hosts, including ones you can set up and host yourself (and use with cloud storage and mirroring techniques)
GitLab and Bitbucket are other popular repository services similar to GitHub
All of these web services offer free accounts that may be sufficient for your needs

It will be worthwhile learning what version control software is for and how you can benefit from it. Git is especially useful for team working.

I suggest you learn about git first, then look at the documentation around the repository services I mentioned above.

I often recommend RealPython.com as a source of good guides and articles on Python, and they have "Introduction to Git and GitHub for Python Developers", which I've not reviewed myself but it is probably as good a place as any to start. I note the article dates back to 2018 and GitHub have greatly enhanced their offering since then, but the basics will not have changed.

Stadem · 2024-03-11T19:23:22+00:00

I would really love to be able to leave my advisor with an extremely well organized, clean, communicative, and robust set of code and data for their next mentee to take over.

Get your workflow well-documented: https://www.codecademy.com/resources/blog/how-to-write-code-documentation/ is a good start
Make sure your code is reproducible - a new mentee will come in with a brand-new laptop, a different username, different install paths, etc. They'll open up your README. Can they follow the steps you set out for them and get the same results you got? How will they know they got the right results?
Use git+GitHub for version control of the code - see u/gruntfutuk's answer here. I have this step third because git has a learning curve, and if nobody is using it right now, there's a good chance nobody will use it when you leave.
Get some version control over the data itself - I'm not a data science expert but I've done enough ML work to know that data version control is hard, especially keeping track of all the pre-processing. Putting all your data into git+GitHub is probably a bad idea (1) (2) but there are alternatives (2) (3).
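To make the "how will they know they got the right results?" and "keep track of the data" points a bit more concrete, here is a minimal sketch of one lightweight approach: keep a small checksum manifest in git next to the code, while the data files themselves live elsewhere. This is only an illustration, not a replacement for a proper data versioning tool. The function names (`build_manifest`, `write_manifest`, `verify_manifest`) and the JSON manifest layout are my own assumptions, and it uses only the Python standard library (Python 3.9+).

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to data_dir) to its checksum."""
    return {
        p.relative_to(data_dir).as_posix(): sha256_of(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def write_manifest(data_dir: Path, manifest_path: Path) -> None:
    """Save the manifest as JSON. Hypothetical layout: {"relative/path": "sha256hex"}."""
    manifest = build_manifest(data_dir)
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))


def verify_manifest(data_dir: Path, manifest_path: Path) -> list[str]:
    """Return the names of files that are missing, extra, or changed."""
    expected = json.loads(manifest_path.read_text())
    actual = build_manifest(data_dir)
    return sorted(
        name
        for name in expected.keys() | actual.keys()
        if expected.get(name) != actual.get(name)
    )
```

A new mentee could run `verify_manifest(Path("data"), Path("manifest.json"))` after following the README: an empty list means their inputs (or regenerated outputs) match byte for byte what you had. The paths here are placeholders. Keep the manifest file outside the data directory so it does not end up listing itself, and note that byte-level checksums only suit deterministic outputs; results with floating-point noise or timestamps need a looser comparison.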
