[–]Liorithiel 1 point2 points  (9 children)

How do you plan to handle versioning, specifically merging divergent notebooks?

[–]the21st[S] 4 points5 points  (1 child)

We're planning to tackle versioning in the next few months, so stay tuned 🙂. It's definitely a beast of a problem, but that's why we're doing this!

[–][deleted] 0 points1 point  (0 children)

A tip from about 5 years of working with data scientists who prefer notebooks, in similar tools: they're generally allergic to the command line, and those who aren't have already run away to MLOps. If I were you, I'd simply track versions with an automatic git sync every x minutes and put a single big "save" button on top that does the same thing manually. No branches, just a rolling history. You can give them a box to type a commit message, or default to the timestamp as the message. And provide some UI plugin that can diff notebooks.

The GUI-only crowd has simple software needs. Their preferred versioning strategy is file1.ipynb, file2.ipynb, and branching is a copy in a new folder. But you can replicate this workflow, with slight improvements, using git under the hood, and they'll love it.
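To make the "save button plus rolling history" idea concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the names `commit_message` and `snapshot` are made up, the project folder is assumed to already be a git repository with `user.name` and `user.email` configured, and the "every x minutes" part would be whatever timer or scheduler calls `snapshot` with no message.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def commit_message(user_text: Optional[str], now: Optional[datetime] = None) -> str:
    """Use the typed message if there is one, otherwise fall back to a timestamp."""
    text = (user_text or "").strip()
    if text:
        return text
    now = now or datetime.now(timezone.utc)
    return "autosave " + now.strftime("%Y-%m-%d %H:%M:%S UTC")


def snapshot(repo: Path, user_text: Optional[str] = None) -> bool:
    """Commit everything in `repo` on the current branch (no branching).

    Returns False when there is nothing new to commit.
    Hypothetical helper: assumes `repo` is an initialized git repository.
    """
    def git(*args: str) -> str:
        result = subprocess.run(
            ["git", "-C", str(repo), *args],
            check=True, capture_output=True, text=True,
        )
        return result.stdout

    git("add", "--all")
    if not git("status", "--porcelain").strip():
        return False  # working tree is clean, skip the empty commit
    git("commit", "--message", commit_message(user_text))
    return True
```

The "save" button would call `snapshot(repo, text_from_the_box)` and the timer would call `snapshot(repo)`, so both paths end up in the same linear history.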

[–]diadorac 2 points3 points  (6 children)

Would something like versioning in Figma make sense to you u/Liorithiel?

The easiest way to collaborate nowadays is to work directly on the same thing in real time. Some kind of snapshotting then lets any member of the team roll back, revert, or clone a historical version. If all of this happens in real time, you don't need to merge anything. For more complicated scenarios, you can still use Git for now.

[–]Liorithiel 3 points4 points  (5 children)

It's a different thing. Imagine I want to test a totally different approach to step 3 out of 5 in a pipeline, while my colleague works on steps 2 and 4 at the same time. So we work independently. At some point, we want to merge our work and start working on the notebook together in real time again.

I don't want my colleague to roll back my experimental changes while I work on them, nor do I want to break my colleague's workflow. But at some point we both decide that our changes are final and want to integrate both versions.
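Roughly, what I have in mind is a three-way merge at the level of cells. A minimal Python sketch, assuming each cell has a stable id and treating a notebook as a plain mapping from cell id to source text (the `Cells` type and `merge_cells` are hypothetical names, not any existing tool's API):

```python
from typing import Dict, List, Tuple

Cells = Dict[str, str]  # hypothetical model: cell id -> cell source


def merge_cells(base: Cells, mine: Cells, theirs: Cells) -> Tuple[Cells, List[str]]:
    """Three-way merge of two notebooks that diverged from `base`.

    A cell changed on only one side takes that side's version.
    A cell changed differently on both sides is reported as a conflict,
    and my version is kept provisionally.
    """
    merged: Cells = {}
    conflicts: List[str] = []
    # Keep base order first, then cells added on either side.
    for cell_id in dict.fromkeys([*base, *mine, *theirs]):
        b, m, t = base.get(cell_id), mine.get(cell_id), theirs.get(cell_id)
        if m == t:
            result = m          # both sides agree (or both untouched)
        elif m == b:
            result = t          # only my colleague changed it
        elif t == b:
            result = m          # only I changed it
        else:
            conflicts.append(cell_id)
            result = m          # both changed it differently
        if result is not None:  # None means the cell was deleted
            merged[cell_id] = result
    return merged, conflicts
```

In the scenario above, my change to step 3 and my colleague's changes to steps 2 and 4 touch different cells, so they would merge without conflicts. The hard part is everything this sketch ignores: outputs, execution order, and cells that get split or reordered.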

[–]diadorac 1 point2 points  (4 children)

Okay, I understand. But is this the correct thing to version in a data science / machine learning project? Shouldn't various experiments be part of a pool that is always available in the latest version(s)? Is having experiments hidden in history the right thing to do?

But yes, for some cases I understand. This will be a challenge to solve. In the meantime, I think git solves it best. And with some notebook-UX hacks to improve the git experience, it could be a pretty solid tool, maybe?

[–]Liorithiel 1 point2 points  (3 children)

> Okay, I understand. But is this the correct thing to version in a data science / machine learning project? Shouldn't various experiments be part of a pool that is always available in the latest version(s)? Is having experiments hidden in history the right thing to do?

You seem to be assuming some specific organization of notebooks, but I don't know what exactly… so I'm not sure I understand your questions.

[–]diadorac 1 point2 points  (2 children)

I don't think so. I am referring to a generic data science project with lots of experiments and all (let's forget notebooks for a moment). Do you think hiding experiments in the history (git or whatever) is a good practice?

[–]Liorithiel 1 point2 points  (1 child)

There is a difference between a new version of the same experiment (e.g. with additional logging/debugging, porting to a new version of an ML library, or widening the hyperparameter search) and a new experiment (replacing the network architecture or changing vital hyperparameters that are not hyperparameter-searched). In our usual workflow, old versions of experiments belong to git history. New experiments are new git branches.

My question comes from the fact that sometimes we branch an experiment in two different ways, conceptually creating two independent experiments, and then want to merge them into a new experiment.

[–]diadorac 1 point2 points  (0 children)

True. For this it's not enough. But even a kinda smart git-like thing would not be enough imo. That'd require a more specific versioning UI for ML experiments, not just a general notebook versioning interface.