Model Training Foundations (Ray Train)

2. Intro

Learn how to use Ray Train V2 to run distributed PyTorch data-parallel training on an Anyscale cluster by training a ResNet-18 model on MNIST across multiple GPUs. You’ll also understand when to use Ray Train, the basics of DDP, and how to set up data, visualization, checkpointing, and metric reporting for scalable training runs.

📚 · Introduction to Ray Train
Imports
Define ResNet-18 Model for MNIST
Define the Ray Train Loop (DDP per-worker)
Define `train_loop_config`
Configure Scaling with `ScalingConfig`
Wrap the Model with `prepare_model()`
Build the DataLoader with `prepare_data_loader()`
Report Training Metrics
Save Checkpoints and Report Metrics
Create the `TorchTrainer`
Inspect the Training Results
Load a Checkpoint for Inference

+11 more lessons
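
The pieces named in this module (a per-worker loop, `train_loop_config`, `ScalingConfig`, `prepare_model()`, `prepare_data_loader()`, metric reporting, and `TorchTrainer`) fit together roughly as in the sketch below. It is a minimal illustration, not the course's notebook: the linear model, the random tensors, and the hyperparameter values are stand-ins chosen here for brevity, whereas the course itself trains ResNet-18 on MNIST.

```python
# Hypothetical hyperparameters; Ray Train passes this dict to every worker.
TRAIN_LOOP_CONFIG = {"lr": 0.01, "batch_size": 64, "epochs": 2}


def train_loop_per_worker(config):
    # Imports are deferred so these definitions load even where Ray or
    # PyTorch is not installed; they are needed only when training runs.
    import torch
    from torch.utils.data import DataLoader, TensorDataset
    import ray.train
    from ray.train.torch import prepare_data_loader, prepare_model

    # Stand-in model and random data (assumption, not the course's ResNet-18/MNIST).
    # prepare_model moves the model to this worker's device and wraps it in
    # DistributedDataParallel when more than one worker is running.
    model = prepare_model(torch.nn.Linear(784, 10))
    dataset = TensorDataset(torch.randn(256, 784), torch.randint(0, 10, (256,)))
    # prepare_data_loader shards the data across workers and moves batches
    # to the worker's device.
    loader = prepare_data_loader(
        DataLoader(dataset, batch_size=config["batch_size"], shuffle=True)
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(config["epochs"]):
        # Reshuffle differently each epoch when a distributed sampler is in use.
        if ray.train.get_context().get_world_size() > 1:
            loader.sampler.set_epoch(epoch)
        for features, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(features), labels)
            loss.backward()
            optimizer.step()
        # Every worker reports; here only metrics, no checkpoint.
        ray.train.report({"epoch": epoch, "loss": loss.item()})


def build_trainer(num_workers=2, use_gpu=False):
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    return TorchTrainer(
        train_loop_per_worker,
        train_loop_config=TRAIN_LOOP_CONFIG,
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    )
```

On a machine or cluster with Ray and PyTorch available, `result = build_trainer().fit()` would launch the run, and `result.metrics` would hold the last reported metrics.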

3. Train And Data

Learn how to integrate **Ray Data** into a **Ray Train** workflow by streaming batches from distributed Ray Datasets instead of a local PyTorch `DataLoader`. You’ll prepare data in Parquet, load it as a Ray Dataset, apply distributed preprocessing/transforms, and build a Ray Data–backed training loop that scales across a cluster.

🔄 · Integrating Ray Train with Ray Data
Define Training Loop with Ray Data
Build DataLoader from Ray Data
Prepare Dataset for Ray Data
Define Image Transformation
Configure `TorchTrainer` with Ray Data

+4 more lessons
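
As a rough sketch of how this module's steps connect, the example below reads Parquet into a Ray Dataset, applies a batch transform, hands the dataset to `TorchTrainer`, and streams batches inside the training loop in place of a PyTorch `DataLoader`. The column names `image` and `label`, the 28x28 image shape, the linear model, and the `parquet_path` argument are assumptions made for illustration; the course's actual schema and model may differ.

```python
# Hypothetical hyperparameters for the sketch.
TRAIN_LOOP_CONFIG = {"lr": 0.01, "batch_size": 64, "epochs": 1}


def normalize_images(batch):
    # Ray Data passes batches as dicts of NumPy arrays by default.
    # Assumes a uint8 "image" column; scales pixel values to [0, 1].
    batch["image"] = batch["image"].astype("float32") / 255.0
    return batch


def train_loop_per_worker(config):
    # Deferred imports: only needed when the loop runs on a Ray Train worker.
    import torch
    import ray.train
    from ray.train.torch import prepare_model

    # Stand-in model (assumption): flattened 28x28 images, 10 classes.
    model = prepare_model(torch.nn.Linear(28 * 28, 10))
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    loss_fn = torch.nn.CrossEntropyLoss()

    # Each worker receives its own shard of the dataset registered as "train".
    shard = ray.train.get_dataset_shard("train")

    for epoch in range(config["epochs"]):
        # Batches arrive as dicts of tensors, already on the worker's device.
        for batch in shard.iter_torch_batches(
            batch_size=config["batch_size"],
            dtypes={"image": torch.float32, "label": torch.long},
        ):
            optimizer.zero_grad()
            logits = model(batch["image"].flatten(start_dim=1))
            loss = loss_fn(logits, batch["label"])
            loss.backward()
            optimizer.step()
        ray.train.report({"epoch": epoch, "loss": loss.item()})


def build_trainer(parquet_path, num_workers=2, use_gpu=False):
    import ray
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    # Lazy, distributed read and preprocessing; nothing executes until training starts.
    train_ds = ray.data.read_parquet(parquet_path).map_batches(normalize_images)
    return TorchTrainer(
        train_loop_per_worker,
        train_loop_config=TRAIN_LOOP_CONFIG,
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
        datasets={"train": train_ds},
    )
```

Passing the dataset through `datasets={"train": ...}` is what lets each worker call `get_dataset_shard("train")` and receive only its portion of the data.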

4. Fault Tolerance

Learn how to make Ray Train jobs resilient to worker or node failures by adding checkpoint loading/saving (model, optimizer, epoch) to your training loop. You’ll also configure automatic retries with `FailureConfig` and run training that can recover and resume after interruptions.

🛡️ · Fault Tolerance in Ray Train
Modify Training Loop to Enable Checkpoint Loading
Save Full Checkpoint with Extra State
Launch Fault-Tolerant Training
Manual Restoration from Checkpoints
Clean Up Cluster Storage
🎉 Wrapping Up & Next Steps

+5 more lessons
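
The resume logic this module describes can be sketched as follows: on startup, each worker checks for an existing checkpoint and restores model, optimizer, and epoch from it; after each epoch, rank 0 saves that same state and all workers report. `FailureConfig(max_failures=...)` then lets Ray Train restart the run from the latest checkpoint. The tiny model, the random data, the file name `state.pt`, the run name, and the `storage_path` argument are illustrative assumptions, not taken from the course.

```python
# Hypothetical hyperparameters for the sketch.
TRAIN_LOOP_CONFIG = {"lr": 0.01, "epochs": 5}


def train_loop_per_worker(config):
    # Deferred imports: only needed when the loop runs on a Ray Train worker.
    import os
    import tempfile
    import torch
    import ray.train
    from ray.train import Checkpoint
    from ray.train.torch import get_device, prepare_model

    model = torch.nn.Linear(4, 1)  # stand-in model (assumption)
    start_epoch, saved = 0, None

    # After a restart, get_checkpoint() returns the latest reported checkpoint.
    checkpoint = ray.train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            saved = torch.load(os.path.join(ckpt_dir, "state.pt"), map_location="cpu")
        model.load_state_dict(saved["model"])
        start_epoch = saved["epoch"] + 1  # resume after the last finished epoch

    model = prepare_model(model)
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    if saved is not None:
        optimizer.load_state_dict(saved["optimizer"])

    device = get_device()
    for epoch in range(start_epoch, config["epochs"]):
        x = torch.randn(32, 4, device=device)  # random stand-in batch
        y = torch.randn(32, 1, device=device)
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()

        with tempfile.TemporaryDirectory() as tmp_dir:
            ckpt = None
            # Only rank 0 writes the checkpoint, but every worker must call report().
            if ray.train.get_context().get_world_rank() == 0:
                # Unwrap DDP so the saved keys match the unwrapped model on reload.
                base = model.module if hasattr(model, "module") else model
                torch.save(
                    {
                        "model": base.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "epoch": epoch,
                    },
                    os.path.join(tmp_dir, "state.pt"),
                )
                ckpt = Checkpoint.from_directory(tmp_dir)
            ray.train.report({"epoch": epoch, "loss": loss.item()}, checkpoint=ckpt)


def build_trainer(storage_path, num_workers=2, max_failures=3):
    from ray.train import FailureConfig, RunConfig, ScalingConfig
    from ray.train.torch import TorchTrainer

    return TorchTrainer(
        train_loop_per_worker,
        train_loop_config=TRAIN_LOOP_CONFIG,
        scaling_config=ScalingConfig(num_workers=num_workers),
        run_config=RunConfig(
            name="fault-tolerant-run",  # hypothetical run name
            storage_path=storage_path,  # should be reachable from every node
            failure_config=FailureConfig(max_failures=max_failures),
        ),
    )
```

Saving the epoch alongside the model and optimizer state is what allows a restarted worker to continue from `start_epoch` instead of repeating finished epochs.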
