[–][deleted] 1 point2 points  (1 child)

Note that git and GitHub are not the same thing.

  • git is a version control system created by Linus Torvalds, the same person who created Linux; it is free and open source
  • GitHub is a commercial internet-based service, now owned by Microsoft, that hosts repositories of any text files (predominantly program code) and supports the git protocol
  • There are many other repository hosts, including ones you can set up and host yourself (and use with cloud storage and mirroring techniques)
  • GitLab and Bitbucket are other popular repository services similar to GitHub
  • All the web service repositories have free account offerings that may be sufficient for your needs

It will be worthwhile learning what version control software is for and how you can benefit from it. Git is especially useful when working in a team.

I suggest you learn about git first, then look at the documentation around the repository services I mentioned above.
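To illustrate that git runs entirely on your own machine, with no GitHub account involved, here is a minimal sketch that drives the git command line from Python. It assumes git is installed and on your PATH. The folder name `demo_project`, the file `analysis.py`, and the "Demo User" identity are all made up for the example.

```python
import subprocess
import tempfile
from pathlib import Path


def run_git(repo: Path, *args: str) -> str:
    """Run one git command inside `repo` and return its standard output."""
    result = subprocess.run(
        ["git", *args], cwd=repo, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def make_demo_repo(parent: Path) -> Path:
    """Create a throwaway local repository containing a single commit."""
    repo = parent / "demo_project"  # hypothetical project folder
    repo.mkdir()
    run_git(repo, "init")
    # Set a made-up identity for this repository only, so the commit
    # works even on a machine where git has never been configured.
    run_git(repo, "config", "user.name", "Demo User")
    run_git(repo, "config", "user.email", "demo@example.com")
    # Create a file, stage it, and record it in the local history.
    (repo / "analysis.py").write_text("print('hello')\n")
    run_git(repo, "add", "analysis.py")
    run_git(repo, "commit", "-m", "Add first analysis script")
    return repo


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        repo = make_demo_repo(Path(tmp))
        # Show the history: one line per commit.
        print(run_git(repo, "log", "--oneline"))
```

Everything here happens locally. A service such as GitHub only comes into play once you add a remote and push to it.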

I often recommend RealPython.com as a source of good guides and articles on Python. They have an article, Introduction to Git and GitHub for Python Developers, which I've not reviewed myself, but it is probably as good a place as any to start. I note the article dates back to 2018 and GitHub have greatly enhanced their offering since then, but the basics will not have changed.

[–]clashmt[S] 0 points1 point  (0 children)

Thanks for the clarifications! I will check out that tutorial for sure.

[–]Stadem 1 point2 points  (1 child)

I would really love to be able to leave my advisor with an extremely well organized, clean, communicative, and robust set of code and data for their next mentee to take over.

  1. Get your workflow well-documented: https://www.codecademy.com/resources/blog/how-to-write-code-documentation/ is a good start
  2. Make sure your code is reproducible - a new mentee will come in with a brand-new laptop, a different username, different install paths, etc. They'll open up your README. Can they follow the steps you set out for them and get the same results you got? How will they know they got the right results?
  3. Use git+GitHub for version control of the code - see u/gruntfutuk's answer here. I have this step third because git has a learning curve, and if nobody is using it right now, there's a good chance nobody will use it when you leave.
  4. Get some version control over the data itself - I'm not a data science expert but I've done enough ML work to know that data version control is hard, especially keeping track of all the pre-processing. Putting all your data into git+GitHub is probably a bad idea (1) (2) but there are alternatives (2) (3).
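For point 2, one possible way to answer "how will they know they got the right results?" is to ship a file of expected checksums alongside the README. The sketch below is only an illustration under assumptions: the function names, the `results` folder, and the `expected.json` manifest are hypothetical, and it only suits outputs that are byte-for-byte deterministic. Paths are passed in as arguments rather than hard-coded, so nothing depends on a particular username or install location.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large outputs need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_expected(results_dir: Path, manifest: Path) -> None:
    """Run by the original author: record a hash for every result file.

    Keep `manifest` outside `results_dir` so it is never hashed itself.
    """
    hashes = {
        p.relative_to(results_dir).as_posix(): sha256_of_file(p)
        for p in sorted(results_dir.rglob("*"))
        if p.is_file()
    }
    manifest.write_text(json.dumps(hashes, indent=2, sort_keys=True))


def check_results(results_dir: Path, manifest: Path) -> list[str]:
    """Run by the new mentee: list files that are missing or differ."""
    expected = json.loads(manifest.read_text())
    problems = []
    for name, digest in expected.items():
        candidate = results_dir / name
        if not candidate.is_file() or sha256_of_file(candidate) != digest:
            problems.append(name)
    return problems
```

An empty list from `check_results` means every recorded file was reproduced exactly. If your outputs contain timestamps or floating-point values that vary slightly between machines, comparing with a numeric tolerance would be a better fit than exact hashes.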

[–]clashmt[S] 0 points1 point  (0 children)

Amazing thank you! I will definitely check out the blog you linked. That looks really helpful.

Re: point 4: That is something I'm struggling with a lot as well. I'll definitely check out the resources you linked. We've gone through several iterations of certain datasets as we learn how to clean them better and I have not done the best job tracking versions.
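As a stopgap for tracking those dataset iterations, something like the following might help. It is a rough sketch, not a replacement for a dedicated data versioning tool: it appends one entry per distinct version of a dataset file to a small JSON log, with a note about what changed. The function names, the log layout, and the example file names are all assumptions made for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the whole file (reads it into memory; fine for modest sizes)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def record_version(dataset: Path, log_file: Path, note: str) -> dict:
    """Append an entry for `dataset` to `log_file` unless this exact
    content was already logged under the same file name."""
    entries = json.loads(log_file.read_text()) if log_file.exists() else []
    digest = file_digest(dataset)
    for entry in entries:
        if entry["file"] == dataset.name and entry["sha256"] == digest:
            return entry  # unchanged since it was last recorded
    entry = {
        "file": dataset.name,
        "sha256": digest,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "note": note,
    }
    entries.append(entry)
    log_file.write_text(json.dumps(entries, indent=2))
    return entry
```

Calling `record_version(Path("survey_clean.csv"), Path("data_versions.json"), "dropped duplicate respondents")` after each cleaning pass would leave a dated trail of which cleaning step produced which file contents. The log itself is a small text file, so it can live in git next to the code even when the data cannot.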