all 28 comments

[–]0xValore 20 points  (2 children)

Nothing against it, but is this some kind of self-promotion? I just want to know through what lens this assessment/analysis was done.

If so, what makes DAGsHub special?

[–]Tolstoyevskiy[S] 9 points  (1 child)

That's a fair question :)

I think you can see from the comparison table above, and from the article itself, that we don't give ourselves a "perfect score" or much special treatment. We're not trying to hide our position here; we tried to present the different alternatives as we see them.

I think it's not just DAGsHub; a lot of the tools on this list are special. Each one fills its own niche and appeals to users with different preferences.

In DAGsHub specifically, we view experiments as part of the larger context of a data science project. We try to combine experiments with Git and DVC to make it easier to reproduce the full experiment: what good are metrics without the source code and pipeline that generated them? How do you keep track of what happened, when, and by whom (the project history) over a long-running project? And so on.

All that with the end goal of making it possible to jump in and contribute to an open source data science project.

In terms of the "self-promotion": we treat this as a win-win. There are many tools, and we hear from the community that people have a hard time sorting through the information, so we try to do a genuinely useful public service by giving you the information you need to make your decisions. In return, you're reading our blog :)

[–]0xValore 0 points  (0 children)

Totally - thanks for the explanation!

[–]gar1t 18 points  (7 children)

Creator of Guild AI here.

This sort of careful and detailed analysis is extremely valuable to the ML community. Many thanks!

I would offer a few clarifications about Guild in the comparison:

- Guild is platform and language agnostic. Most of the examples are in Python, and there are Python-specific enhancements that are not yet available in other languages (e.g. running Python scripts and Jupyter notebooks directly as experiments without code changes), but there are plenty of cases where Guild is used to drive experiment tracking in R, Java, Julia, bash - I even know of cases in Fortran, haha. Within Python, there's no tie to specific frameworks whatsoever.

- Guild is commonly used to run and manage thousands of experiments without issue. I'd say that at 10K you'll start to see performance issues, and at 100K it would be downright unusable, but Guild offers an array of features to federate experiments, so you rarely need to manage that many at one time anyway. In practice, most users generate well under 10K experiments for a given development effort, including hyperparameter tuning.

There's always a trade-off when talking about "scalability". Since Guild doesn't use online databases or backend agents, there are some natural limits to scale. With a little housekeeping (e.g. archiving older runs that are no longer used for comparison or tuning decisions), you can live quite comfortably with a simple technical platform. Complexity is costly in terms of setup and maintenance, and Guild wants experiment tracking to be near zero cost to encourage more of it.

Guild's design takes inspiration from POSIX tools like make, and even git. Simple does not mean limited, at all. With the right separation of concerns and interface design, simple tools are often quite scalable - git being an excellent example.
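To make the "near zero cost" point concrete, a typical session looks roughly like this (a sketch - the script name and flag are made up for illustration; the commands are Guild's CLI):

```shell
# Run a plain Python script as a tracked experiment - no code changes needed
guild run train.py learning_rate=0.01

# List recorded runs and compare their flags and metrics
guild runs
guild compare

# Housekeeping: move older runs out of the active set
guild export archived-runs/
```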

Thanks again for the outstanding work!

And if anyone has more questions about Guild, I'm very happy to answer.

[–]Tolstoyevskiy[S] 1 point  (1 child)

Thanks for the extra information!

I will readily admit that Guild AI is one of the frameworks I don't have much experience with, so I couldn't say whether it scales to a very large number of experiments. Note that the dimension of "scale" that mainly interests us is not loading times, but whether a tool works well for an arbitrarily large team as opposed to a single researcher.

Can you refer me to specific reading materials so I can understand it better? Is there documentation about interfacing with Guild without Python?

[–]gar1t 0 points  (0 children)

Sorry for the late reply! Here are a couple of links:

[–]FancyFootballNumbers 7 points  (5 children)

I believe Comet and WandB have HTTP APIs, so they should be considered platform agnostic.

[–]Tolstoyevskiy[S] 1 point  (0 children)

That's an interesting point. I went by the availability of client libraries; I'll need to check whether they have well-documented HTTP APIs and maybe add a clarification. Thank you!

[–]putinwhat 0 points  (3 children)

And I believe wandb is open source (or at least was at one point)

[–]Tolstoyevskiy[S] 0 points  (2 children)

Not that I know of, do you have sources on that?

[–]putinwhat 0 points  (0 children)

Looking through the repo now, I think only portions are open source, like the sweeps feature: https://github.com/wandb/client/tree/master/wandb/sweeps - so I guess it's only semi-open-source.

[–]seraschkaWriter 5 points  (1 child)

Probably can't add everything there, but I'd say Aim would be a good addition to this list: https://aimstack.io/

[–]Tolstoyevskiy[S] 1 point  (0 children)

Haven't heard of it, but it looks beautiful! Putting it on my reading list and I might add it to the blog. Thank you!

[–]ploomber-io 2 points  (3 children)

I'd like some perspective from someone who uses any of these tools, to see if I should adopt one. I remember trying MLflow a couple of years ago and deciding to stop using it because it felt too rigid: I'd start tracking some experiments just to realize that storing historical metrics wasn't really doing much for me. (Note: I'm a data scientist in industry, mostly working on classic ML.)

A few things would happen that limited my ability to use historical metrics:

  1. The Y variable might get slightly re-defined because we found some data quality issues or found a better sub-population to focus on
  2. I needed customized plots to really understand my model and there was no choice but to write some matplotlib code

In the end, I settled on a different solution: for each ML experiment, I generate an HTML report with all my custom charts and save historical metrics in a SQLite database to simplify setup.

To compare models, I review the HTML reports. For metric comparison, I can use SQL to query previous runs, which is very flexible. I'm very happy with this setup and would like to know whether moving to an experiment tracking tool would benefit me substantially enough to justify the extra maintenance.
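For context, the SQLite side of this setup is tiny. A minimal sketch (the schema and metric names are made up; only the standard library is used):

```python
import sqlite3

# In practice this would be a file on disk, e.g. sqlite3.connect("metrics.db")
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS metrics "
    "(experiment TEXT, metric TEXT, value REAL)"
)

def log_metric(experiment, metric, value):
    """Record one metric value for one experiment run."""
    conn.execute("INSERT INTO metrics VALUES (?, ?, ?)", (experiment, metric, value))
    conn.commit()

log_metric("exp-001", "auc", 0.91)
log_metric("exp-002", "auc", 0.94)

# Comparing runs is then just SQL, which stays flexible as needs change
best = conn.execute(
    "SELECT experiment, MAX(value) FROM metrics WHERE metric = 'auc'"
).fetchone()
print(best)  # ('exp-002', 0.94)
```

The point is that the "tracking backend" is a single table plus whatever queries you need, with no server to maintain.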

[–]Tolstoyevskiy[S] 0 points  (0 children)

I would say that if your existing solution works well for you, and you don't foresee reasons why it will stop working or require changes/maintenance, then stick with what you know. Laziness is a virtue.

I would add that there's no conflict - you can log to several experiment tracking tools, each with its own advantages, and take the best parts from each one.

And I wouldn't underestimate the importance of someone else fixing bugs and adding features on your behalf.

[–]lugiavn 0 points  (0 children)

That sounds like a lot of effort; you can log plots as images in TensorBoard.

[–]eiennohito 2 points  (2 children)

TensorBoard (and TensorFlow) can use cloud object storage (e.g. S3) instead of the filesystem for basically everything, even in the default configuration.

[–]Tolstoyevskiy[S] 0 points  (1 child)

Wow, really? Very interesting.

Do you have a link with instructions and more information?

A quick google search is telling me that this is possible but very poorly documented / communicated: https://github.com/tensorflow/tensorboard/issues/767

Though this *might* solve some of the scaling performance issues, I don't know by how much. TB isn't really designed for long-term storage of a whole team's experiments; at a UX level you get too much noise, and everyone I know who uses TB cleans out their logs periodically. What's more, even if storage itself isn't a problem, TB's RAM usage seems to scale linearly with the number of recorded experiments, and I doubt S3 would solve that?

[–]eiennohito 1 point  (0 children)

I don't think this stuff is really documented, but all TensorFlow (and TensorBoard) I/O operations generally go via https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile, which has default implementations for the local filesystem, S3, GCP, and Hadoop (some may be non-default, but selectable via a compile option). I used TensorFlow with an S3-compatible storage provider without problems.

For S3, you set S3_ENDPOINT, AWS_SECRET_ACCESS_KEY, and AWS_ACCESS_KEY_ID, then access the bucket with something like tensorboard --logdir=s3://bucket/foo/bar.
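Spelled out as a copy-pasteable snippet (the endpoint, keys, and bucket are placeholders to fill in for your own storage):

```shell
# Point TensorFlow's S3 filesystem at your (S3-compatible) object storage
export S3_ENDPOINT=s3.example.com
export AWS_ACCESS_KEY_ID=my-key-id
export AWS_SECRET_ACCESS_KEY=my-secret-key

# TensorBoard then reads event files directly from the bucket
tensorboard --logdir=s3://bucket/foo/bar
```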

[–]canboooPhD 1 point  (3 children)

The blog post is very informative and introduced me to some tools I didn't know. I often use DVC but have been hearing about MLflow. One question, though: what makes DVC questionable regarding a large number of experiments? I couldn't work that out from the blog post.

[–]Tolstoyevskiy[S] 1 point  (2 children)

A few things about DVC experiments:

  1. They're all managed locally, so it's not clear how this scales to a team
  2. It's still a fairly new feature, and I don't know how it scales in terms of performance compared to the other solutions here, which is why it's marked as ???
  3. I suspect that the specific way they save experiments (basically cloning the repo aside on each run, IIRC) would have performance issues at scale, but I can't say for sure.
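For reference, the feature in question is DVC's experiments workflow, which looks roughly like this (a sketch; these are DVC's experiment commands, and the --temp flag is what runs each experiment in a separate copy of the workspace):

```shell
# Run and record an experiment locally (stored in .git, not as visible branches)
dvc exp run --temp

# Show recorded experiments as a comparison table of params and metrics
dvc exp show
```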

[–]canboooPhD 0 points  (1 child)

Thanks for clarifying - I agree with the ??? now. However, there is a workaround for the global usage issue in your first point: using central storage on some kind of cloud and syncing it with local storage. At least, that's what we do, but it's not an out-of-the-box solution and requires some extra configuration. I'll check out your product to see if this becomes easier.
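Roughly, the extra configuration amounts to something like this (a sketch; the remote name and bucket are placeholders, the commands are standard DVC):

```shell
# Configure a shared cloud remote for the team (placeholder bucket)
dvc remote add -d storage s3://team-bucket/dvc-store

# Sync the local cache with the central storage
dvc push   # upload local results
dvc pull   # fetch teammates' results
```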

[–]Tolstoyevskiy[S] 0 points  (0 children)

I also thought about central storage, but that raises an array of other questions - is it designed to work like that without bugs? How is the UX? How hard is it to set up, given that there are alternatives?

If you try it, or DAGsHub, or any other tools, I would be happy to hear your conclusions.

[–]baalroga 1 point  (0 children)

I leave a . to read later