Ray 1.9 makes Train beta, launches Ray Job Submission server

The team behind the Ray framework for building distributed applications has pushed version 1.9 of its project into the open — a step that also marks the beginning of the beta phase for deep learning module Ray Train.

Ray Train was added to the framework to “simplify distributed deep learning” by offering an abstraction layer over distributed machine learning frameworks like PyTorch Distributed, Horovod, and TensorFlow. It also comes with support for well-known tools in the discipline, such as Jupyter notebooks, MLflow, and TensorBoard.

With the graduation into beta, Train can work together with Ray Client and Ray Datasets. Other changes made to support users in their training endeavours include a simplification of single-worker training workflows, and the addition of the APIs train.torch.prepare_model(...) and train.torch.prepare_data_loader(...), which prepare PyTorch models and DataLoaders for distributed training.
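
As a rough sketch of how the two helpers slot into a training function, the snippet below wraps a toy PyTorch model and DataLoader before running them on two workers. The model, data, and hyperparameters are illustrative, not taken from the release notes:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

import ray.train.torch  # makes the train.torch utilities available
from ray import train
from ray.train import Trainer

def train_func():
    # prepare_model() wraps the model for distributed training
    # (e.g. in DistributedDataParallel) and moves it to this worker's device.
    model = train.torch.prepare_model(nn.Linear(4, 1))

    # prepare_data_loader() adds a distributed sampler and device placement.
    dataset = TensorDataset(torch.randn(128, 4), torch.randn(128, 1))
    loader = train.torch.prepare_data_loader(DataLoader(dataset, batch_size=16))

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    for _ in range(2):  # two illustrative epochs
        for X, y in loader:
            loss = loss_fn(model(X), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

trainer = Trainer(backend="torch", num_workers=2)
trainer.start()
trainer.run(train_func)
trainer.shutdown()
```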

For the 1.9 release, the Ray team looked into ways of submitting and monitoring applications without an active connection. The result is Ray Job Submission server, which — along with CLI and SDK clients — is meant to tackle the problem by offering a way of submitting local applications to a remote cluster and managing them as jobs. The new addition is currently in alpha, so not fit for production, though the documentation promises “mostly stable” APIs.
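
For a feel of the SDK side, here is a minimal sketch of submitting a local script as a job. The address, script name, and working directory are placeholders, and the import path shown matches later Ray releases, so the module layout in the 1.9 alpha may differ:

```python
from ray.job_submission import JobSubmissionClient

# Point the client at the job server exposed via a (hypothetical)
# cluster's dashboard address.
client = JobSubmissionClient("http://127.0.0.1:8265")

job_id = client.submit_job(
    entrypoint="python my_script.py",          # command executed on the cluster
    runtime_env={"working_dir": "./project"},  # local files shipped with the job
)
print(client.get_job_status(job_id))
```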

Users who were missing a remote file storage option for runtime environments will find corresponding support in the latest release. It also includes garbage collection and logging improvements for runtime_env, and fixes race condition issues affecting threaded actors, core workers, and named actors.
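
In practice, the remote storage support means a runtime environment’s working_dir can point at a packaged archive in remote storage rather than a local directory. A small sketch, with the bucket URI being a placeholder:

```python
import ray

# Workers download and unpack the (hypothetical) zipped project from
# remote storage instead of having files uploaded from the driver.
ray.init(runtime_env={"working_dir": "s3://my-bucket/my_project.zip"})
```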

Meanwhile, reinforcement learning library RLlib, which comes bundled with the Ray framework, underwent an architecture refactoring. After some modifications, using tf2 together with eager tracing shouldn’t be much slower than working with tf. Since experiments with sparse-reward and reward-at-end environments suggest more stable learning behaviour when the last timestep in V-trace calculations isn’t dropped, this can now be forced by setting vtrace_drop_last_ts in the APPO/IMPALA config to false.
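
Both behaviours are plain config entries. A sketch assuming the agents module layout of the Ray 1.x line, with CartPole as a stand-in environment:

```python
import ray
from ray.rllib.agents.impala import ImpalaTrainer

ray.init()

config = {
    "env": "CartPole-v0",            # illustrative environment
    "framework": "tf2",
    "eager_tracing": True,           # tf2 with tracing, close to tf speed after the refactor
    "vtrace_drop_last_ts": False,    # keep the last timestep in V-trace calculations
}

trainer = ImpalaTrainer(config=config)
result = trainer.train()
print(result["episode_reward_mean"])
```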

RLlib also looks to lose the build_trainer and build_(tf_)?policy utility functions in one of the upcoming releases. While this might still take a while, developers currently working with those functions could do worse than familiarising themselves with a newly added proof of concept that uses sub-classing to build custom trainer classes instead, sketched below. Details on this and the included bug fixes can be found in the release notes and the project documentation.
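
The general shape of that sub-classing approach, as it later stabilised in RLlib, looks roughly like the following. The overridden method names mirror that later pattern and are an assumption about the proof of concept, not a quote from it:

```python
from ray.rllib.agents.trainer import Trainer
from ray.rllib.agents.ppo.ppo import DEFAULT_CONFIG
from ray.rllib.agents.ppo.ppo_torch_policy import PPOTorchPolicy

# Instead of build_trainer(...), derive a custom trainer class directly.
class MyCustomTrainer(Trainer):
    @classmethod
    def get_default_config(cls):
        # Start from an existing algorithm's defaults (assumed hook name).
        config = DEFAULT_CONFIG.copy()
        config["lr"] = 1e-4  # hypothetical override
        return config

    def get_default_policy_class(self, config):
        # Hand back the policy this trainer should run (assumed hook name).
        return PPOTorchPolicy
```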

Since the last release, model serving library Ray Serve has learned to save its internal state to external storage and recover upon failure, and it now comes fitted with replica autoscaling and a native pipeline API for model composition. Hyperparameter tuning library Tune was reworked in areas like experiment analysis, testing, and the cloud checkpointing API/SyncConfig, while Ray Datasets started to offer groupby and aggregations.
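
The new Datasets groupby follows the familiar group-then-aggregate pattern. A small sketch, assuming a simple integer dataset and the convenience count() aggregation:

```python
import ray

ray.init()

# Group the integers 0..99 by remainder mod 3, then count items per group.
ds = ray.data.range(100)
print(ds.groupby(lambda x: x % 3).count().take())
```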