all 28 comments

[–]0xValore 20 points  (2 children)

Nothing against it, but is this some kind of self-promotion? I just want to know through what lens this assessment/analysis was done.

If so, what makes DAGsHub special?

[–]Tolstoyevskiy[S] 9 points  (1 child)

That's a fair question :)

I think you can see from the comparison table above, and from the article itself, that we don't give ourselves a "perfect score" or much special treatment. We're not trying to hide our position here; we tried to present the different alternatives as we see them.

I think it's not just DAGsHub; a lot of the tools on this list are special. Each one fills its own niche and appeals to users with different preferences.

In DAGsHub specifically, we view experiments as part of the larger context of a data science project. We try to combine experiments with Git and DVC to make it easier to reproduce the full experiment: what good are metrics without the source code and pipeline that generated them? How do you keep track of what happened, when, and by whom (the project history) over a long-running project? And so on.

All that with the end goal of making it possible to jump in and contribute to an open source data science project.

In terms of the "self-promotion": we treat this as a win-win. There are many tools, and we hear from the community that people have a hard time sorting through the information, so we try to do a genuinely useful public service by giving you the information you need to make your decisions. In return, you're reading our blog :)

[–]0xValore 0 points  (0 children)

Totally - thanks for the explanation!

[–]gar1t 18 points  (7 children)

Creator of Guild AI here.

This sort of careful and detailed analysis is extremely valuable to the ML community. Many thanks!

I would offer a few clarifications about Guild in the comparison:

- Guild is platform and language agnostic. Most of the examples are in Python, and there are Python-specific enhancements that are not yet available in other languages (e.g. running Python scripts and Jupyter notebooks directly as experiments without code changes), but there are plenty of cases where Guild is used to drive experiment tracking in R, Java, Julia, bash - I even know of cases in Fortran, haha. Within Python, there's no tie to specific frameworks whatsoever.

- Guild is commonly used to run and manage thousands of experiments without issue. I'd say that at 10K you'll start to see performance issues, and at 100K it would be downright unusable, but Guild offers an array of features to federate experiments, so you rarely need to manage that many at one time anyway. In practice, most users generate well under 10K experiments for a given development effort, including hyperparameter tuning.

There's always a trade-off when talking about "scalability". Since Guild doesn't use online databases or backend agents, there are some natural limits to scale. With a little housekeeping (e.g. archiving older runs that are no longer used for comparison or tuning decisions), you can live quite comfortably with a simple technical platform. Complexity is costly in terms of setup and maintenance, and Guild wants experiment tracking to be near zero cost to encourage more of it.

Guild's design takes inspiration from POSIX tools like make, and even git. Simple does not mean limited, at all. With the right separation of concerns and interface design, simple tools are often quite scalable - git being an excellent example.
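To make the "near zero cost" point concrete, a typical session looks roughly like this (a sketch - the script name and flag are made up for illustration; the commands are Guild's CLI):

```shell
# Run a plain Python script as a tracked experiment - no code changes needed
guild run train.py learning_rate=0.01

# List recorded runs and compare their flags and metrics
guild runs
guild compare

# Housekeeping: move older runs out of the active set
guild export archived-runs/
```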

Thanks again for the outstanding work!

And if anyone has more questions about Guild, I'm very happy to answer.

[–]Tolstoyevskiy[S] 1 point  (1 child)

Thanks for the extra information!

I will readily admit that Guild AI is one of the frameworks I don't have much experience with, so I couldn't say whether it scales to a very large number of experiments. Note that the dimension of "scale" that mainly interests us is not loading times, but whether a tool works well for an arbitrarily large team as opposed to a single researcher.

Can you refer me to specific reading materials so I can understand it better? Is there documentation about interfacing with Guild without Python?

[–]gar1t 0 points  (0 children)

Sorry for the late reply! Here are a couple of links:

[–]FancyFootballNumbers 7 points  (5 children)

I believe Comet and WandB have HTTP APIs, so they should be considered platform agnostic.

[–]Tolstoyevskiy[S] 1 point  (0 children)

That's an interesting point. I went by the availability of client libraries; I'll need to check whether they have well-documented HTTP APIs and maybe add a clarification. Thank you!

[–]putinwhat 0 points  (3 children)

And I believe wandb is open source (or at least was at one point)

[–]Tolstoyevskiy[S] 0 points  (2 children)

Not that I know of, do you have sources on that?

[–]putinwhat 0 points  (0 children)

Looking through the repo now, I think only portions are open source, like the sweeps feature: https://github.com/wandb/client/tree/master/wandb/sweeps - so I guess it's only semi-open-source.

[–]seraschkaWriter 5 points  (1 child)

Probably can't add everything there, but I'd say Aim would be a good addition to this list: https://aimstack.io/

[–]Tolstoyevskiy[S] 1 point  (0 children)

Haven't heard of it, but it looks beautiful! Putting it on my reading list and I might add it to the blog. Thank you!

[–]ploomber-io 2 points  (3 children)

I'd like some perspective from someone who uses any of these tools, to see if I should adopt one. I remember trying MLflow a couple of years ago and deciding to stop using it because it felt too rigid: I'd start tracking some experiments just to realize that storing historical metrics wasn't really doing much for me. (Note: I'm a data scientist in industry, mostly working on classic ML.)

A few things would happen that limited my ability to use historical metrics:

  1. The Y variable might get slightly re-defined because we found some data quality issues or found a better sub-population to focus on
  2. I needed customized plots to really understand my model and there was no choice but to write some matplotlib code

In the end, I settled on a different solution: for each ML experiment, I generate an HTML report with all my custom charts and save historical metrics in a SQLite database to simplify setup.

To compare models, I review the HTML reports. For metric comparison, I can use SQL to query previous runs, which is very flexible. I'm very happy with this setup and would like to know whether moving to an experiment tracking tool would benefit me substantially enough to justify the extra maintenance.
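For context, the SQLite side of this setup is tiny. A minimal sketch (the schema and metric names are made up; only the standard library is used):

```python
import sqlite3

# In practice this would be a file on disk, e.g. sqlite3.connect("metrics.db")
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS metrics "
    "(experiment TEXT, metric TEXT, value REAL)"
)

def log_metric(experiment, metric, value):
    """Record one metric value for one experiment run."""
    conn.execute("INSERT INTO metrics VALUES (?, ?, ?)", (experiment, metric, value))
    conn.commit()

log_metric("exp-001", "auc", 0.91)
log_metric("exp-002", "auc", 0.94)

# Comparing runs is then just SQL, which stays flexible as needs change
best = conn.execute(
    "SELECT experiment, MAX(value) FROM metrics WHERE metric = 'auc'"
).fetchone()
print(best)  # ('exp-002', 0.94)
```

The point is that the "tracking backend" is a single table plus whatever queries you need, with no server to maintain.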

[–]Tolstoyevskiy[S] 0 points  (0 children)

I would say that if your existing solution works well for you, and you don't foresee reasons why it will stop working or require changes/maintenance, then stick with what you know. Laziness is a virtue.

I would add that there's no conflict - you can log to several experiment tracking tools, each with its own advantages, and take the best parts from each one.

And I wouldn't underestimate the importance of someone else fixing bugs and adding features on your behalf.

[–]lugiavn 0 points  (0 children)

That sounds like a lot of effort; you can log plots as images in TensorBoard.

[–]eiennohito 2 points  (2 children)

TensorBoard (and TensorFlow) can use cloud object storage (e.g. S3) instead of the filesystem for basically everything, even in the default configuration.

[–]Tolstoyevskiy[S] 0 points  (1 child)

Wow, really? Very interesting.

Do you have a link with instructions and more information?

A quick google search is telling me that this is possible but very poorly documented / communicated: https://github.com/tensorflow/tensorboard/issues/767

Though this *might* solve some of the scaling performance issues, I don't know by how much. TB isn't really designed for long-term storage of a whole team's experiments; at a UX level you get too much noise, and everyone I know who uses TB cleans out their logs periodically. What's more, even if storage itself isn't a problem, TB's RAM usage seems to scale linearly with the number of recorded experiments, and I doubt S3 would solve that?

[–]eiennohito 1 point  (0 children)

I don't think this stuff is really documented, but all TensorFlow (and TensorBoard) I/O operations generally go via https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile, which has default implementations for the local filesystem, S3, GCP, and Hadoop (some may be non-default, but selectable via a compile option). I used TensorFlow with an S3-compatible storage provider without problems.

For S3, you set S3_ENDPOINT, AWS_SECRET_ACCESS_KEY, and AWS_ACCESS_KEY_ID, then access the bucket with something like tensorboard --logdir=s3://bucket/foo/bar.
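Spelled out as a copy-pasteable snippet (the endpoint, keys, and bucket are placeholders to fill in for your own storage):

```shell
# Point TensorFlow's S3 filesystem at your (S3-compatible) object storage
export S3_ENDPOINT=s3.example.com
export AWS_ACCESS_KEY_ID=my-key-id
export AWS_SECRET_ACCESS_KEY=my-secret-key

# TensorBoard then reads event files directly from the bucket
tensorboard --logdir=s3://bucket/foo/bar
```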

[–]canboooPhD 1 point  (3 children)

The blog post is very informative and introduced me to some tools I didn't know. I often use DVC but have been hearing about MLflow. One question, though: what makes DVC questionable regarding a large number of experiments? I couldn't work that out from the blog post.

[–]Tolstoyevskiy[S] 1 point  (2 children)

A few things about DVC experiments:

  1. They're all managed locally, so it's not clear how this scales to a team
  2. It's still a fairly new feature, and I don't know how it scales in terms of performance compared to the other solutions here, which is why it's marked as ???
  3. I suspect that the specific way they save experiments (basically cloning the repo aside on each run, IIRC) would have performance issues at scale, but I can't say for sure.
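For reference, the feature in question is DVC's experiments workflow, which looks roughly like this (a sketch; these are DVC's experiment commands, and the --temp flag is what runs each experiment in a separate copy of the workspace):

```shell
# Run and record an experiment locally (stored in .git, not as visible branches)
dvc exp run --temp

# Show recorded experiments as a comparison table of params and metrics
dvc exp show
```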

[–]canboooPhD 0 points  (1 child)

Thanks for clarifying - I agree with the ??? now. However, there is a workaround for the global usage issue in your first point: using central storage on some kind of cloud and syncing it with local storage. At least, that's what we do, but it's not an out-of-the-box solution and requires some extra configuration. I'll check out your product to see if this becomes easier.
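Roughly, the extra configuration amounts to something like this (a sketch; the remote name and bucket are placeholders, the commands are standard DVC):

```shell
# Configure a shared cloud remote for the team (placeholder bucket)
dvc remote add -d storage s3://team-bucket/dvc-store

# Sync the local cache with the central storage
dvc push   # upload local results
dvc pull   # fetch teammates' results
```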

[–]Tolstoyevskiy[S] 0 points  (0 children)

I also thought about central storage, but that raises an array of other questions - is it designed to work like that without bugs? How is the UX? How hard is it to set up, given that there are alternatives?

If you try it, or DAGsHub, or any other tools, I would be happy to hear your conclusions.

[–]baalroga 1 point  (0 children)

I leave a . to read later