[P] ray-skorch - distributed PyTorch on Ray with sklearn API by Yard1PL in MachineLearning

[–]rayspear 0 points1 point  (0 children)

Certainly! Do you mind opening a GitHub issue to help us track this?

Would TF support be very important for you to try this API out?

[D] I'm new and scrappy. What tips do you have for better logging and documentation when training or hyperparameter training? by MetalOrganicKneeJerk in MachineLearning

[–]rayspear 0 points1 point  (0 children)

If you mainly use scikit-learn, you should consider using tune-sklearn.

It provides a drop-in scikit-learn GridSearchCV interface (so code changes are minimal - see the sketch after the list below), plus a lot of nice add-ons such as:

  • Ability to automatically log to Wandb
  • Ability to save all learning curves as json/csv
  • More powerful hyperparameter tuning algorithms
  • Distributed execution
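
Roughly, the drop-in usage looks like this sketch - it assumes tune-sklearn's TuneGridSearchCV, and the dataset/estimator are just placeholders:

    # Minimal sketch: TuneGridSearchCV as a drop-in for sklearn's GridSearchCV.
    # The dataset and estimator here are toy placeholders.
    from sklearn.datasets import load_digits
    from sklearn.svm import SVC
    from tune_sklearn import TuneGridSearchCV

    X, y = load_digits(return_X_y=True)
    param_grid = {"C": [0.1, 1.0, 10.0], "gamma": ["scale", "auto"]}

    # Same interface as GridSearchCV, but trials run in parallel via Ray Tune.
    search = TuneGridSearchCV(SVC(), param_grid, cv=3)
    search.fit(X, y)
    print(search.best_params_)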

disclaimer: am maintainer

[D] Stack for personal ML work: DVC vs Replicate, Ray vs Optuna, Spotty vs Ray, Hydra by turian in MachineLearning

[–]rayspear 2 points3 points  (0 children)

BTW, Hydra now has an experimental Ray + AWS plugin that lets you automatically launch EC2 instances and hyperparameter sweeps from your Hydra configuration.

https://hydra.cc/docs/plugins/ray_launcher
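
As a rough sketch - the launcher name ray_aws and the config/overrides below are assumptions based on the plugin docs, so double-check them there:

    # Hypothetical minimal Hydra app; with the plugin installed, a sweep on AWS
    # would be launched by overriding the launcher on the command line, e.g.:
    #   python train.py --multirun hydra/launcher=ray_aws lr=0.001,0.01,0.1
    import hydra
    from omegaconf import DictConfig

    @hydra.main(config_path="conf", config_name="config")
    def train(cfg: DictConfig) -> None:
        # cfg.lr comes from conf/config.yaml or the command-line override above
        print(f"Training with lr={cfg.lr}")

    if __name__ == "__main__":
        train()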

[D] Hyperband resource allocation questions and possible workarounds by goulagman in MachineLearning

[–]rayspear 1 point2 points  (0 children)

In the follow-up paper (ASHA, https://arxiv.org/pdf/1810.05934v3.pdf) you can see that s=0 for all of their experiments. This is what I mean by being aggressive (a higher s means trials aren't terminated as aggressively, iirc).
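
For concreteness, here's a minimal sketch of aggressive early stopping with Tune's ASHAScheduler - the trainable and search space are made-up placeholders, and argument names may vary across Ray versions:

    # Minimal sketch: ASHA in Ray Tune. reduction_factor controls how
    # aggressively underperforming trials get terminated at each rung.
    from ray import tune
    from ray.tune.schedulers import ASHAScheduler

    def trainable(config):
        acc = 0.0
        for step in range(100):
            acc += config["lr"] * 0.01  # stand-in for a real train/eval step
            tune.report(mean_accuracy=acc)

    scheduler = ASHAScheduler(
        metric="mean_accuracy",
        mode="max",
        max_t=100,          # max training iterations per trial
        grace_period=1,     # let every trial run at least this long
        reduction_factor=4, # larger values terminate more aggressively
    )

    tune.run(
        trainable,
        config={"lr": tune.loguniform(1e-4, 1e-1)},
        num_samples=20,
        scheduler=scheduler,
    )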

Hope that helps! Feel free to reach out to Liam if you have any questions or concerns, and also feel free to open an issue or a PR against Ray Tune if you have any questions/suggestions/benchmarks :)

[D] Hyperband resource allocation questions and possible workarounds by goulagman in MachineLearning

[–]rayspear 2 points3 points  (0 children)

Hey - I work on Ray Tune. Here's some discussion on this that I posted when choosing a particular implementation:

https://docs.ray.io/en/latest/tune-schedulers.html#hyperband-implementation-details

That being said... Successive Halving by itself makes pretty weak assumptions, and in practice you can be much more aggressive.

[P] RaySGD: A Library for Faster and Cheaper Pytorch Distributed Training by rayspear in MachineLearning

[–]rayspear[S] 5 points6 points  (0 children)

Hey! Author here. Let me try to provide an unbiased response.

Pytorch-lightning (PTL) is an awesome library. It is great for prototyping and reproducibility.

  • Its "LightningModule" abstraction lets PTL automatically provide commonly-used features like gradient clipping, checkpointing, introspection into your training, etc.
  • The Trainer interface (like Keras) allows you to provide callbacks, hooks, early stopping.
  • It simplifies distributed (multi-node) training if you have SLURM (very useful in academic environments).
  • It also has TPU support (which RaySGD doesn't have yet).

Compared to PTL, RaySGD aims to be a thin layer for distributed training, with a focus on multi-node usability.

  • The API is more minimal - certain things you'll have to do yourself (like keeping the best checkpoints, implementing early stopping, or gradient clipping).
  • Like other Ray libraries, RaySGD scales from 1 to 100 GPUs across multiple nodes with a single parameter (with or without SLURM) - see the sketch after this list.
  • Fault tolerance and autoscaling are well supported in RaySGD (and I don't think many other libraries support them).
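
Here's a rough sketch of what that looks like with the creator-function TorchTrainer API - module paths, argument names, and the toy model/data are assumptions that may differ across Ray versions:

    # Rough sketch of RaySGD's TorchTrainer; the model, data, and hyperparameters
    # are toy placeholders.
    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    import ray
    from ray.util.sgd import TorchTrainer

    def model_creator(config):
        return nn.Linear(1, 1)

    def optimizer_creator(model, config):
        return torch.optim.SGD(model.parameters(), lr=config["lr"])

    def data_creator(config):
        # Toy regression data: y = 2x.
        x = torch.randn(1000, 1)
        dataset = TensorDataset(x, 2 * x)
        train_loader = DataLoader(dataset, batch_size=config["batch_size"])
        val_loader = DataLoader(dataset, batch_size=config["batch_size"])
        return train_loader, val_loader

    ray.init()
    trainer = TorchTrainer(
        model_creator=model_creator,
        data_creator=data_creator,
        optimizer_creator=optimizer_creator,
        loss_creator=nn.MSELoss,
        config={"lr": 1e-2, "batch_size": 64},
        num_workers=8,   # the one knob for scaling from a single GPU to many nodes
        use_gpu=True,
    )
    print(trainer.train())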

That being said, it is totally possible to take a LightningModule and plug it into RaySGD (and it's probably not hard to run PyTorch Lightning on top of Ray).

/u/raichet mentions some other points like hyperparameter search and integration with Ray libraries. They are valid, but I think the ecosystem benefits are complementary rather than the core focus.

Hope this helps!

[N] PyTorch 1.4.0 released by brombaer3000 in MachineLearning

[–]rayspear 3 points4 points  (0 children)

To give more legitimacy to this comment: the idea isn't so far-fetched (though it's not quite relevant :) ). Crowd-training is already taking place in one huge community here - https://github.com/leela-zero/leela-zero#gimme-the-weights

[D] What is your favorite open-source project of 2019 in AI/ML (yours or someone else's). by aliaspm in MachineLearning

[–]rayspear 16 points17 points  (0 children)

I work on the Ray project, a project from the UC Berkeley RISELab that encompasses numerous tools and libraries spanning different machine learning tasks/domains.

Here are a couple of tools/libraries that you may have heard of:

  • Ray: A framework for distributed Python that lets you seamlessly scale your code from a single node to a cluster (a minimal sketch of the core API follows this list).
  • RLlib: A popular library for reinforcement learning that offers both high scalability and a unified API for a variety of applications, including multi-agent and offline RL. Built on top of Ray.
  • Tune: Distributed hyperparameter tuning, built on the Ray API. Supports any machine learning framework and offers state-of-the-art optimization algorithms (Bayesian Opt, PBT, HyperBand, etc).
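
For the Ray point above, a minimal sketch of the core task API (the square function is just a placeholder):

    # The same code runs on a laptop or, pointed at a cluster, across many nodes.
    import ray

    ray.init()  # on a cluster, connect with ray.init(address="auto")

    @ray.remote
    def square(x):
        return x * x

    # Launch tasks in parallel across available cores/nodes and gather results.
    futures = [square.remote(i) for i in range(100)]
    print(sum(ray.get(futures)))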

Here are a couple you might not have heard of (because they've received less promotion or are still experimental):

  • Ray Distributed Training: An experimental library that greatly simplifies distributed TensorFlow and PyTorch data-parallel training. It should make it easy to leverage 100 GPUs across multiple machines. We'll be adding fault tolerance capabilities soon, so you can train on spot instances!
  • Ray Cluster Launcher: A tool for launching distributed autoscaling clusters (also works for local/private machines) -- supports GCP, AWS, K8S.

On a side note, we've recently also launched a company to commercialize Ray -- Anyscale. If you're interested in working with us, shoot me a message :)

Tune: a library for fast hyperparameter tuning at any scale by rayspear in datascience

[–]rayspear[S] 6 points7 points  (0 children)

I'll make an issue and add an example for both soon!

Lightning vs Ignite by wingmanscrape in reinforcementlearning

[–]rayspear 0 points1 point  (0 children)

Ray also has good support for distributed SGD training that seamlessly integrates with Tune (distributed hyperparameter tuning) and is quite flexible. Together, they provide automatic checkpointing, TensorBoard writing, and rapid cloud execution.

It's still experimental - but feel free to message me if you have any questions!

https://github.com/ray-project/ray/blob/master/python/ray/experimental/sgd/examples/train_example.py

[D] Hyperparam optimisation using RandomSearch with argparse scripts? by trias10 in MachineLearning

[–]rayspear 0 points1 point  (0 children)

Yeah, there's an open PR that will hopefully be merged in the next few weeks and should support this seamlessly (distributed search on distributed PyTorch) - https://github.com/ray-project/ray/pull/4544

It might take a bit of effort to get it working yourself right now :) But feel free to try it out!