Learn the foundations of distributed training of machine learning models with Ray Train.
Learn how to use Ray Train V2 to run distributed PyTorch data-parallel training on an Anyscale cluster by training a ResNet-18 model on MNIST across multiple GPUs. You’ll also understand when to use Ray Train, the basics of PyTorch DistributedDataParallel (DDP), and how to set up data, visualization, checkpointing, and metric reporting for scalable training runs.
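As a rough illustration of the kind of script this module works toward, here is a minimal sketch. It assumes `ray[train]`, `torch`, `torchvision`, and `filelock` are installed. The helper `global_batch_size`, the hyperparameters, the `/tmp/mnist` download path, and the worker count are illustrative assumptions, not the course's own code.

```python
def global_batch_size(per_worker_batch_size: int, num_workers: int) -> int:
    """Each DDP worker consumes its own batch, so the effective batch is the product."""
    return per_worker_batch_size * num_workers


def train_loop_per_worker(config: dict) -> None:
    # Heavy imports are deferred so they resolve on the training workers.
    import torch
    from filelock import FileLock
    from torch import nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms
    from torchvision.models import resnet18
    import ray.train
    import ray.train.torch

    # ResNet-18 expects 3-channel input; MNIST is 1-channel, so swap the first conv.
    model = resnet18(num_classes=10)
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    model = ray.train.torch.prepare_model(model)  # moves to device, wraps in DDP

    # Guard the download so workers on the same node do not race (assumed path).
    with FileLock("/tmp/mnist.lock"):
        data = datasets.MNIST("/tmp/mnist", train=True, download=True,
                              transform=transforms.ToTensor())
    loader = DataLoader(data, batch_size=config["batch_size"], shuffle=True)
    # Adds a DistributedSampler and moves each batch to the worker's device.
    loader = ray.train.torch.prepare_data_loader(loader)

    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(config["epochs"]):
        if ray.train.get_context().get_world_size() > 1:
            loader.sampler.set_epoch(epoch)  # reshuffle differently each epoch
        for images, labels in loader:
            loss = loss_fn(model(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Report the last batch loss of the epoch as a simple metric.
        ray.train.report({"epoch": epoch, "loss": loss.item()})


def main() -> None:
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    num_workers, batch_size = 2, 128  # illustrative values
    print("global batch size:", global_batch_size(batch_size, num_workers))
    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"batch_size": batch_size, "lr": 1e-3, "epochs": 2},
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=True),
    )
    result = trainer.fit()
    print(result.metrics)


if __name__ == "__main__":
    main()
```

Set `use_gpu=False` to try the same sketch on a CPU-only machine.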
Learn how to integrate **Ray Data** into a **Ray Train** workflow by streaming batches from distributed Ray Datasets instead of a local PyTorch `DataLoader`. You’ll prepare data in Parquet, load it as a Ray Dataset, apply distributed preprocessing/transforms, and build a Ray Data–backed training loop that scales across a cluster.
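The Ray Data workflow described above might look roughly like the following sketch. It assumes the Parquet file has an `image` column of pixel arrays and a `label` column. The path in `PARQUET_PATH`, the column names, and the `normalize_images` transform are hypothetical placeholders, and the training step is reduced to counting samples to keep the example short.

```python
import numpy as np

# Placeholder location; point this at your own Parquet data.
PARQUET_PATH = "/mnt/cluster_storage/mnist.parquet"


def normalize_images(batch: dict) -> dict:
    """Scale uint8 pixels to float32 in [0, 1]; Ray Data applies this per batch in parallel."""
    batch["image"] = np.asarray(batch["image"]).astype("float32") / 255.0
    return batch


def train_loop_per_worker(config: dict) -> None:
    import ray.train

    # Each worker streams its own shard instead of using a local DataLoader.
    shard = ray.train.get_dataset_shard("train")
    for epoch in range(config["epochs"]):
        seen = 0
        for batch in shard.iter_torch_batches(batch_size=config["batch_size"]):
            images, labels = batch["image"], batch["label"]
            seen += len(labels)  # the forward/backward pass would go here
        ray.train.report({"epoch": epoch, "samples_seen": seen})


def main() -> None:
    import ray.data
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    # Lazily read the Parquet data and attach the distributed transform.
    train_ds = ray.data.read_parquet(PARQUET_PATH).map_batches(normalize_images)
    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"batch_size": 128, "epochs": 1},
        scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
        datasets={"train": train_ds},  # key matches get_dataset_shard("train")
    )
    trainer.fit()


if __name__ == "__main__":
    main()
```

Because the dataset is passed through `datasets=`, Ray Train splits it across workers, so the loop needs no sampler logic of its own.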
Learn how to make Ray Train jobs resilient to worker or node failures by adding checkpoint loading/saving (model, optimizer, epoch) to your training loop. You’ll also configure automatic retries with `FailureConfig` and run training that can recover and resume after interruptions.
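A minimal sketch of the checkpoint-and-retry pattern described above follows. The tiny linear model, random data, `resume_epoch` helper, file name `state.pt`, run name, and storage path are all illustrative assumptions; only the overall shape (load on start, save on report, retry via `FailureConfig`) reflects the module description.

```python
import os
import tempfile
from typing import Optional


def resume_epoch(last_completed_epoch: Optional[int]) -> int:
    """Start at 0 on a fresh run, otherwise at the epoch after the last saved one."""
    return 0 if last_completed_epoch is None else last_completed_epoch + 1


def train_loop_per_worker(config: dict) -> None:
    import torch
    from torch import nn
    import ray.train
    import ray.train.torch
    from ray.train import Checkpoint

    model = nn.Linear(4, 2)  # stand-in for a real model
    state = None
    checkpoint = ray.train.get_checkpoint()  # set when resuming after a failure
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            state = torch.load(os.path.join(ckpt_dir, "state.pt"), map_location="cpu")
        model.load_state_dict(state["model"])  # load before the DDP wrap

    model = ray.train.torch.prepare_model(model)
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    if state is not None:
        optimizer.load_state_dict(state["optimizer"])
    start = resume_epoch(state["epoch"] if state is not None else None)

    device = ray.train.torch.get_device()
    for epoch in range(start, config["epochs"]):
        x = torch.randn(32, 4, device=device)  # random stand-in data
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        with tempfile.TemporaryDirectory() as tmp:
            ckpt = None
            if ray.train.get_context().get_world_rank() == 0:
                # Unwrap DDP so saved keys match the plain model on reload.
                raw = model.module if hasattr(model, "module") else model
                torch.save({"model": raw.state_dict(),
                            "optimizer": optimizer.state_dict(),
                            "epoch": epoch}, os.path.join(tmp, "state.pt"))
                ckpt = Checkpoint.from_directory(tmp)
            # Every worker calls report; only rank 0 attaches a checkpoint.
            ray.train.report({"epoch": epoch, "loss": loss.item()}, checkpoint=ckpt)


def main() -> None:
    from ray.train import FailureConfig, RunConfig, ScalingConfig
    from ray.train.torch import TorchTrainer

    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"lr": 0.01, "epochs": 5},
        scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
        run_config=RunConfig(
            name="fault_tolerant_run",             # hypothetical run name
            storage_path="/mnt/cluster_storage",   # placeholder shared storage
            failure_config=FailureConfig(max_failures=3),  # retry up to 3 times
        ),
    )
    trainer.fit()


if __name__ == "__main__":
    main()
```

With `max_failures=3`, a failed run is restarted up to three times, and each restart picks up from the latest reported checkpoint rather than from epoch 0.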