Learn the fundamentals of deploying LLMs with Ray.
Learn the foundations of production LLM serving with Ray Serve LLM, including how real-time inference differs from training and the core challenges of deploying models at scale. You’ll understand key performance concepts and optimizations—such as KV caching, batching strategies, and model parallelization—to meet latency and throughput SLOs.
Learn how to configure and deploy a medium-sized LLM (4–8 GPUs) using Ray Serve LLM, from defining an `LLMConfig` to launching an OpenAI-compatible serving endpoint. You’ll practice running the deployment locally or on an Anyscale Ray cluster and sending inference requests to the live service.
In this module, you’ll explore advanced, production-ready Ray Serve LLM capabilities beyond basic deployment through hands-on examples. You’ll learn how to extend a base model with LoRA adapters to serve multiple specialized behaviors from a single scalable LLM service.