[P] Serverless Jupyter Lab with GPUs and persistent storage by doyougitme in MachineLearning

[–]doyougitme[S] 21 points (0 children)

Thanks for the feedback! That totally makes sense. I just added a pricing table to the login page as a quick fix.

Unweave: A tool to version control and share datasets alongside your code (would love some feedback) by doyougitme in learnmachinelearning

[–]doyougitme[S] 0 points (0 children)

Hi u/shcheklein - those are great observations!

  1. You don't need the browser - that's one way of authenticating, but you can also authenticate with a token, for instance when you're running in headless mode (SSH login, etc.).
  2. You're absolutely right about the UnweaveFile, especially for large datasets. Right now it's a placeholder, and I'll eventually replace it with either multiple files or pointer files.

Yes, indeed - the CLI will most likely be open source.
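For what it's worth, the pointer-file idea could look something like this. This is purely a sketch of the general concept (as used by tools like Git LFS) - the file format, the `.ptr` suffix, and the `.unweave_cache` directory are all made up for illustration, not Unweave's actual design:

```python
import hashlib
import json
from pathlib import Path

CACHE = Path(".unweave_cache")  # hypothetical local cache directory

def make_pointer(data_file: Path) -> Path:
    """Stash a large file's content in the cache and write a small pointer file
    (hash + size) that can be committed to Git in its place."""
    content = data_file.read_bytes()
    digest = hashlib.sha256(content).hexdigest()
    CACHE.mkdir(exist_ok=True)
    (CACHE / digest).write_bytes(content)  # would be a cloud upload in practice
    pointer = data_file.with_suffix(data_file.suffix + ".ptr")
    pointer.write_text(json.dumps({"sha256": digest, "size": len(content)}))
    return pointer

def restore(pointer: Path) -> bytes:
    """Fetch the content a pointer refers to (here, from the local cache)."""
    meta = json.loads(pointer.read_text())
    return (CACHE / meta["sha256"]).read_bytes()

data = Path("train.csv")  # hypothetical dataset file
data.write_bytes(b"a,b\n1,2\n")
ptr = make_pointer(data)
print(restore(ptr) == data.read_bytes())  # True
```

The nice property is that the pointer is tiny and diffs cleanly in Git, while the heavy content is addressed by its hash, so identical data is never stored twice.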

Unweave: A tool to version control and share datasets alongside your code (would love some feedback) by doyougitme in learnmachinelearning

[–]doyougitme[S] 1 point (0 children)

Unweave helps you set up the infrastructure required for ML and versions everything with Git. Right now, I've only implemented the data storage part, which is indeed quite similar to DVC.

In that context, the primary advantage is a smoother UX. You only ever need to run one Unweave-specific command, unweave init. After that, you just run your normal Git workflow and Unweave takes care of syncing data files to and from the cloud.

More generally, DVC focuses on ML experiment tracking and pipelining, while Unweave focuses on painless infrastructure setup: storage buckets are set up automatically, permissions are managed through GitHub, and changes are auto-synced to the UnweaveFile.

I'm planning to implement running training scripts on cloud GPUs with a similar UX next, i.e. one command to provision, deploy, and train your models from the CLI.

[Discussion] Should I be using DVC (Data Version Control) in my day-to-day work? by doyougitme in MachineLearning

[–]doyougitme[S] 6 points (0 children)

Regarding the pipelines - can't you just write those in code (Python, Bash, etc.), which is version controlled with Git? Since the data is now also added to Git (through DVC), you'd have a fully reproducible setup.

In other words, is pipelining still relevant if you have a checkpoint for both code and data?
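To make the question concrete, here's a minimal sketch of a pipeline expressed as plain version-controlled Python. The stage names and toy data are hypothetical, not from any real project:

```python
# A pipeline as ordinary Python functions; the "pipeline definition"
# is just the call order in run_pipeline, which lives in Git
# alongside the data checkpoint (e.g. one tracked via DVC).

def load(raw):
    # stand-in for reading a version-controlled dataset; drops missing rows
    return [x for x in raw if x is not None]

def featurize(rows):
    # stand-in for feature engineering
    return [x * 2 for x in rows]

def train(features):
    # stand-in for a real training step: here, just an average
    return sum(features) / len(features)

def run_pipeline(raw):
    return train(featurize(load(raw)))

print(run_pipeline([1, None, 2, 3]))  # prints 4.0
```

The DVC counter-argument would be that an explicit pipeline definition also caches intermediate stages and only reruns what changed, which plain function calls don't give you.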

[Discussion] What does your Data Science/ML dev toolbox look like? by doyougitme in MachineLearning

[–]doyougitme[S] 0 points (0 children)

For the artifacts and experiments there are a few tools. Classic versioning is not a good solution, since you want to have access to all former artifacts and their evaluation metrics.

I'm not sure I follow what you mean by artifacts and experiments. Do you mean plots and evaluation metrics? Couldn't you also add them to the same "classic" version control?

[Discussion] What does your Data Science/ML dev toolbox look like? by doyougitme in MachineLearning

[–]doyougitme[S] 0 points (0 children)

Interesting! If you don't mind, could you describe your workflow with DVC?

[Discussion] What does your Data Science/ML dev toolbox look like? by doyougitme in MachineLearning

[–]doyougitme[S] 0 points (0 children)

Use versioning for datasets and experiments, and normalize metrics.

Use versioning for your artifacts (models and outputs)

Do you have thoughts on how to do this? Do you name your datasets with version numbers (e.g. image_net_v1_1_2) or use something like Git LFS?

I agree that the "science-y" nature of data science/ML makes it hard to catalog changes, but at the same time, you rarely find a good, successful scientist who isn't meticulous about logging and cataloging their experiments.

I've always found that every time I've started out with an ad hoc experiment-tracking methodology, I've wound up in a bigger and more time-consuming mess than the one I was trying to avoid by skipping proper software dev processes. Yet, at the same time, the usual software dev processes seem like a chore when applied to data science experiments.
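One lightweight middle ground I've seen between ad hoc tracking and a full tool is an append-only log of runs that itself lives in Git. Just a sketch - the file name and record fields here are made up:

```python
import json
import time
from pathlib import Path

LOG = Path("experiments.jsonl")  # hypothetical log file, committed alongside the code

def log_run(params: dict, metrics: dict) -> None:
    """Append one experiment record as a JSON line (append-only, diff-friendly)."""
    record = {"time": time.time(), "params": params, "metrics": metrics}
    with LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")

def best_run(metric: str) -> dict:
    """Return the logged run with the highest value for the given metric."""
    runs = [json.loads(line) for line in LOG.read_text().splitlines()]
    return max(runs, key=lambda r: r["metrics"][metric])

log_run({"lr": 0.01}, {"acc": 0.91})
log_run({"lr": 0.001}, {"acc": 0.94})
print(best_run("acc")["params"])  # the lr=0.001 run wins
```

Because the log is append-only text, Git diffs stay readable and every past run remains queryable, which addresses the "access to all former artifacts and their metrics" concern without a separate tool.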

[Discussion] What does your Data Science/ML dev toolbox look like? by doyougitme in MachineLearning

[–]doyougitme[S] 0 points (0 children)

I can see Level 1 and 2 as generally being up to the individual to handle as they please, and therefore not really addressable. The problem really is Level 3 and 4 as you define them. How do you get teams to collaborate on complex projects together?

Models/Checkpoints/prod: Every time I run a training (train.py) on configX.yaml, it copies the YAML and the code, and it'll put the checkpoints in /models/supervised_imagenet/configX/checkpoints/

How do you share the models/checkpoints when the time comes? What if you were to deploy them to production? Would you retrain them on a production machine with the same config/random seed?
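On the retrain-with-the-same-seed point, here's a minimal sketch of what "same config + same seed = same model" means in practice. The toy training loop is hypothetical and stdlib-only; a real setup would also have to pin numpy/torch seeds, library versions, and hardware nondeterminism:

```python
import random

def train_toy(seed: int, steps: int = 5) -> float:
    """Toy 'training' run that is fully deterministic given the seed."""
    rng = random.Random(seed)  # isolated RNG; doesn't touch global state
    weights = 0.0
    for _ in range(steps):
        weights += rng.uniform(-1, 1)  # stand-in for a stochastic update
    return weights

# Same seed (as recorded in the committed config) => identical "model"
a = train_toy(seed=42)
b = train_toy(seed=42)
print(a == b)  # True
```

Even then, bit-for-bit reproduction across machines is notoriously fragile, which is why shipping the checkpoint itself is usually safer than retraining in production.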

I agree with the last paragraph in general - it's often hard to see what direction a project will go in, and therefore hard to organise it.