LLM Serving Foundations

1. Intro Serve LLM

Learn the foundations of production LLM serving with Ray Serve LLM, including how real-time inference differs from training and the core challenges of deploying models at scale. You’ll understand key performance concepts and optimizations, such as KV caching, batching strategies, and model parallelization, that help meet latency and throughput service-level objectives (SLOs).

Introduction to Ray Serve LLM: Foundations of Large Language Model Serving
What is LLM Serving?
Key Concepts and Optimizations
Challenges in LLM Serving
Ray Serve LLM + Anyscale Architecture
Getting Started with Ray Serve LLM
Key Takeaways

2. Deploy Medium LLM

Learn how to configure and deploy a medium-sized LLM (4–8 GPUs) using Ray Serve LLM, from defining an `LLMConfig` to launching an OpenAI-compatible serving endpoint. You’ll practice running the deployment locally or on an Anyscale Ray cluster and sending inference requests to the live service.

Deploy a Medium-Sized LLM with Ray Serve LLM
Overview: Why Medium-Sized Models?
Setting up Ray Serve LLM
Local Deployment & Inference
Deploying to Anyscale Services
Advanced Topics: Monitoring & Optimization
Summary & Outlook

3. Advanced LLM Features

In this module, you’ll work through hands-on examples of advanced, production-ready Ray Serve LLM capabilities that go beyond basic deployment. You’ll learn how to extend a base model with LoRA adapters to serve multiple specialized behaviors from a single scalable LLM service, and how to get structured JSON output and set up tool calling.

Advanced LLM Features with Ray Serve LLM
Overview: Advanced Features Preview
Example: Deploying LoRA Adapters
Example: Getting Structured JSON Output
Example: Setting up Tool Calling
How to Choose an LLM?
Conclusion: Next Steps
