Blog | llm-d

Distinguished Engineer, IBM

Deploying llm-d is not just a question of choosing a model server and adding GPUs. In a production inference deployment, operators have to choose routing policies, admission behavior, batching settings, KV-cache reuse strategies, prefill/decode placement, and autoscaling rules under concrete TTFT, ITL, throughput, and cost constraints.

These choices are coupled. A routing change that improves cache locality can concentrate load. A prefill/decode threshold that helps one workload can hurt another. An admission policy that protects critical traffic can reduce total served volume. A change in any one policy can shift TTFT, inter-token latency, throughput, SLO compliance, and accelerator cost in ways that are difficult to predict analytically.

The only reliable way to confirm those tradeoffs is to measure them in a GPU-backed llm-d cluster. But using cluster runs as the first step in every policy or capacity-planning experiment is too slow and expensive. BLIS provides a faster inner loop: a calibrated discrete-event simulator for distributed inference systems like llm-d. Developers can evaluate candidate policies and deployment configurations locally, then reserve cluster validation for the candidates most likely to matter.
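As a rough illustration of what "discrete-event simulation" means here, the sketch below advances a clock from event to event instead of running real work. It is not BLIS code: the single FIFO server, the fixed `service_ms`, and every name in it are assumptions made up for this example.

```python
import heapq
from collections import deque

ARRIVAL, COMPLETION = 0, 1  # event kinds; arrivals sort first on time ties

def simulate(arrivals_ms, service_ms):
    """Toy discrete-event loop for one FIFO server (hypothetical model).

    arrivals_ms: arrival time of each request, in milliseconds.
    service_ms:  fixed service time per request (an assumption).
    Returns the end-to-end latency of each request, in arrival-list order.
    """
    events = [(t, ARRIVAL, i) for i, t in enumerate(arrivals_ms)]
    heapq.heapify(events)
    waiting, busy, latency = deque(), False, {}
    while events:
        now, kind, req = heapq.heappop(events)  # jump straight to the next event
        if kind == COMPLETION:
            latency[req] = now - arrivals_ms[req]
            busy = False
        else:
            waiting.append(req)
        if not busy and waiting:
            nxt = waiting.popleft()
            busy = True
            heapq.heappush(events, (now + service_ms, COMPLETION, nxt))
    return [latency[i] for i in range(len(arrivals_ms))]

latencies = simulate([0, 10, 20], service_ms=30)
print(latencies)
```

A real simulator would replace the fixed service time with calibrated models of batching, KV-cache state, and routing, but the event-queue skeleton stays the same.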

Blog key takeaways

BLIS is a discrete-event simulator: it models admission, routing, scheduling, KV cache, batching, and prefill/decode placement without loading model weights or occupying GPUs.
Calibrated fidelity: Median 7–9% error on end-to-end and inter-token latency across 36 validation experiments spanning 8B–141B parameter models, H100/A100/L40S GPUs, and diverse workloads. Approximately 200× faster than equivalent cluster runs.
Admission control case study: An AI-native policy-search loop using BLIS discovered a probabilistic admission controller that reduced critical-tier TTFT p90 by up to 97% and end-to-end latency by up to 50%, validated on a real llm-d cluster.
Capacity planning: BLIS evaluates hundreds of deployment configurations in minutes, producing ranked Pareto-optimal candidates before any GPU time is spent.
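To make "Pareto-optimal candidates" concrete, here is a minimal sketch of filtering simulated configurations down to the non-dominated set. The two metrics (`ttft_p90_ms`, `cost_per_hour`), the `Candidate` type, and the sample numbers are all hypothetical and are not taken from BLIS or from the post.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    name: str
    ttft_p90_ms: float    # lower is better (hypothetical metric)
    cost_per_hour: float  # lower is better (hypothetical metric)

def dominates(a, b):
    """True if a is at least as good as b on both metrics and better on one."""
    no_worse = a.ttft_p90_ms <= b.ttft_p90_ms and a.cost_per_hour <= b.cost_per_hour
    better = a.ttft_p90_ms < b.ttft_p90_ms or a.cost_per_hour < b.cost_per_hour
    return no_worse and better

def pareto_front(candidates):
    """Keep only candidates that no other candidate dominates, cheapest first."""
    front = [c for c in candidates
             if not any(dominates(o, c) for o in candidates)]
    return sorted(front, key=lambda c: c.cost_per_hour)

# Made-up simulation results for four deployment configurations.
candidates = [
    Candidate("a", 120.0, 40.0),
    Candidate("b", 90.0, 60.0),
    Candidate("c", 150.0, 45.0),  # slower and pricier than "a"
    Candidate("d", 90.0, 80.0),   # same latency as "b" but pricier
]
front_names = [c.name for c in pareto_front(candidates)]
print(front_names)
```

Only the surviving candidates would then need cluster validation.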

llm-d v0.7: From Feature Introduction to Production Hardening

May 12, 2026 · 14 min read

llm-d maintainers

If v0.6 was about proving what llm-d could do—OTel integration, prefill/decode disaggregation, initial multi-accelerator images—then v0.7 is about making sure you can actually deploy it. The theme across every category is the same: remove friction, broaden hardware reach, and give operators the documentation and CI coverage to trust the system in production.

Recent external validations demonstrated llm-d's performance gains. Those capabilities remain and continue to improve, but v0.7's investment is in making them accessible: onboarding guides, tested installation paths, and confidence that the guides work on your target platform.

Production-Grade LLM Inference at Scale with KServe, llm-d, and vLLM

April 21, 2026 · 5 min read

Yuan Tang

Senior Principal Software Engineer, Red Hat

Scott Cabrinha

Staff Site Reliability Engineer, Tesla

Director of Engineering, Red Hat

Sai Krishna

Staff Software Engineer, Tesla

The Problem with "Simple" LLM Deployments

Everyone is racing to run Large Language Models (LLMs) in the cloud, on-prem, and even on edge devices. The real challenge, however, isn't the first deployment; it's scaling, managing, and maintaining hundreds of LLMs efficiently. We initially approached this challenge with a straightforward vLLM deployment wrapped in a Kubernetes StatefulSet.

Predicted-Latency Based Scheduling for LLMs

March 13, 2026 · 28 min read

Kaushik Mitra

Software Engineer, Google

Benjamin Braun

Software Engineer, Google

Abdullah Gharaibeh

Senior Staff Software Engineer, Google

Distinguished Engineer, Google

Not all LLM requests cost the same. A short prompt might complete in milliseconds, while a long one can occupy a GPU for seconds. If we can predict how long a request will take on each candidate server before dispatching it, we can make substantially better routing decisions. This post describes a system that does exactly that: a lightweight ML model trained online from live traffic that replaces manually tuned heuristic weights with direct latency predictions.
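The idea can be sketched in a few lines: predict each candidate server's latency for the incoming request, dispatch to the lowest prediction, and nudge the model with the latency that was actually observed. Everything below is an illustrative assumption (the two features, the linear model, the learning rate, the pod names) and is not the system the post describes.

```python
from dataclasses import dataclass, field

@dataclass
class LatencyPredictor:
    """Toy online linear model: latency_ms ~ bias + w . features (hypothetical)."""
    weights: list = field(default_factory=lambda: [0.0, 0.0])
    bias: float = 0.0
    lr: float = 1e-7  # small step size because the features are large token counts

    def predict(self, features):
        return self.bias + sum(w * x for w, x in zip(self.weights, features))

    def update(self, features, observed_ms):
        """One stochastic-gradient step on squared error."""
        err = self.predict(features) - observed_ms
        self.weights = [w - self.lr * err * x
                        for w, x in zip(self.weights, features)]
        self.bias -= self.lr * err

def pick_server(predictor, prompt_tokens, queued_tokens_by_server):
    """Route to the server with the lowest predicted latency."""
    return min(
        queued_tokens_by_server,
        key=lambda s: predictor.predict([prompt_tokens, queued_tokens_by_server[s]]),
    )

# Made-up starting weights: 0.5 ms per prompt token, 0.1 ms per queued token.
predictor = LatencyPredictor(weights=[0.5, 0.1], bias=10.0)
servers = {"pod-a": 4000, "pod-b": 500}  # queued tokens per server (made up)
choice = pick_server(predictor, 1000, servers)

features = [1000, servers[choice]]
before = predictor.predict(features)
predictor.update(features, observed_ms=660.0)  # feed back the observed latency
after = predictor.predict(features)
print(choice, before, after)
```

The point of the design is that no scoring weights are tuned by hand; the feedback step keeps the predictions aligned with live traffic.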

Native KV Cache Offloading to Any Filesystem with llm-d

February 10, 2026 · 11 min read

Kfir Toledo

Research Staff Member, IBM

Danny Harnik

Senior Technical Staff Member, IBM

Effi Ofer

Research Staff Member, IBM

Or Ozeri

Research Staff Member, IBM

Guy Margalit

Senior Technical Staff Member, IBM Storage CTO Office

llm-d is a distributed inference platform spanning multiple vLLM instances. KV cache hits are critical to achieving high inference throughput. Yet, in a distributed environment, cache hits do not occur across different nodes as the KV cache is local to each vLLM instance. In addition, this local cache is limited in size, further limiting KV data reuse. This blog presents a new way to offload KV cache to storage, tackling both aforementioned challenges – KV cache sharing and KV cache scale. llm-d's filesystem (FS) backend is a KV cache storage connector for vLLM that offloads KV blocks to shared storage based on vLLM's native Offloading Connector. While the llm-d FS backend can speed up serving of single requests (improve TTFT), its main goal is rather to preserve stable throughput and low latency at scale, as concurrency and context lengths grow. This is accomplished by significantly enlarging the cache space and enabling KV reuse across multiple replicas and nodes in llm-d.

While there are a number of existing solutions for KV cache offload to storage (e.g. LMCache or Dynamo KVBM), the new connector offers simplicity, can run with llm-d and vLLM as the only dependencies, and exhibits improved performance over state-of-the-art shared storage connectors.

llm-d 0.5: Sustaining Performance at Scale

February 4, 2026 · 13 min read

Director of Engineering, Red Hat

Distinguished Engineer, Google

Distinguished Engineer, IBM

In our previous release (v0.4), we focused on improving the end-to-end latency of production inference, introducing speculative decoding and extending prefill/decode disaggregation across a broader set of accelerator architectures. That work established llm-d’s ability to deliver state-of-the-art latency along the critical serving path. Sustaining low latency increasingly depended on how KV-cache pressure is handled once GPU memory is saturated, whether cached state can be reused across replicas instead of being repeatedly rebuilt, and how requests are routed when workloads mix adapters, models, and availability requirements.

With v0.5, llm-d expands its focus from peak performance to the operational rigor required to sustain performance at scale. This release prioritizes reproducibility, resilience, and cost efficiency, with concrete improvements across the following areas:

Developer Experience and reproducibility: We have simplified the benchmarking workflow with dedicated, in-guide benchmark support, allowing users to validate each “well-lit path” with a single command.
Hierarchical KV Offloading: A new storage architecture decouples cache capacity from GPU memory through native CPU and filesystem tiers.
Advanced Scheduling: Cache-aware routing now supports LoRA adapters and active-active high availability.
Resilient Networking: A new transport backend (UCCL) improves stability in congested networks.
Autoscaling Updates: We have introduced scale-to-zero capabilities for cost-efficient intermittent workloads.

llm-d 0.4: Achieve SOTA Performance Across Accelerators

December 2, 2025 · 10 min read

Director of Engineering, Red Hat

Distinguished Engineer, Google

Distinguished Engineer, IBM

llm-d’s mission is to provide the fastest time to SOTA inference performance across any accelerator and cloud. In our 0.3 release we enabled wide expert parallelism for large mixture-of-experts models to provide extremely high output token throughput - a key enabler for reinforcement learning - and we added preliminary support for multiple non-GPU accelerator families.

This release brings the complement to expert parallelism throughput: improving the end-to-end request latency of production serving. We reduce DeepSeek per-token latency by up to 50% with speculative decoding and vLLM optimizations for latency-critical workloads. We add dynamic disaggregated serving support for Google TPU and Intel XPU to further reduce time-to-first-token latency when traffic is unpredictable, while our new well-lit path for prefix cache offloading helps you leverage CPU memory and high-performance remote storage to increase hit rates and reduce tail latency. For users with multiple model deployments, our workload autoscaler preview takes real-time server capacity and traffic into account to reduce the time a model deployment spends queuing requests, lessening the operational toil of running multiple models over constrained accelerator capacity.

These OSS inference stack optimizations, surfaced through our well-lit paths, ensure you reach SOTA latency on frontier OSS models in real world scenarios.

llm-d 0.3: Wider Well-Lit Paths for Scalable Inference

October 10, 2025 · 10 min read

Director of Engineering, Red Hat

Distinguished Engineer, Google

Distinguished Engineer, IBM

In our 0.2 release, we introduced the first well-lit paths: tested blueprints for scaling inference on Kubernetes. With our 0.3 release, we double down on the mission: to provide a fast path to deploying high-performance, hardware-agnostic, easy-to-operationalize inference at scale.

This release delivers:

Expanded hardware support, now including Google TPU and Intel accelerators
TCP and RDMA over RoCE validated for disaggregation
A predicted latency based balancing preview that improves P90 latency by up to 3x in long-prefill workloads
Wide expert parallel (EP) scaling to 2.2k tokens per second per H200 GPU
The GA release of the Inference Gateway (IGW v1.0).

Taken together, these results redefine the operating envelope for inference. llm-d enables clusters to run hotter before scaling out, extracting more value from each GPU, and still meet strict latency objectives. The result is a control plane built not just for speed, but for predictable, cost-efficient scale.

KV-Cache Wins You Can See: From Prefix Caching in vLLM to Distributed Scheduling with llm-d

September 24, 2025 · 21 min read

Maroon Ayoub

Research Scientist & Architect, IBM

Danny Harnik

Senior Technical Staff Member, IBM

Tyler Smith

Member of Technical Staff, Red Hat

Kellen Swain

Software Engineer, Google

Xining Wang

Senior Technical Expert, Alibaba Cloud

Hang Yin

Senior R&D Engineer, Alibaba Cloud

Kay Yan

Principal Software Engineer, DaoCloud

The llm-d project provides a series of “well-lit paths” - tested, benchmarked solutions for deploying large language models in production. Our first path, Intelligent Inference Scheduling, established a baseline for AI-aware routing by balancing both cluster load and prefix-cache affinities. The default configuration for that path uses an approximate method for the latter, predicting cache locality based on request traffic.

This blog illuminates a more advanced and powerful path: precise prefix-cache aware scheduling.

We take a deep dive into the next generation of this feature, which moves beyond prediction and gives the scheduler direct introspection into distributed vLLM caches. This precision is key to maximizing cache hit rates and achieving a new level of performance and cost-efficiency in your distributed deployments.

Blog key takeaways

KV-cache hit rates directly impact your bottom line: With 10x cost differences between cached and uncached tokens, cache efficiency isn't just a performance optimization — it's a fundamental cost and performance driver
This isn't theoretical: Real production workloads like conversational AI and agentic workflows naturally create the prefix-heavy patterns where this approach excels
vLLM's prefix caching breaks in distributed deployments: Standard load balancers scatter related requests across pods, destroying cache locality and forcing expensive re-computation
Precise prefix-cache aware scheduling delivers order-of-magnitude gains: Our benchmarks show 57x faster response times and double the throughput on identical hardware
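The takeaways above can be illustrated with a small sketch of prefix-aware pod selection: hash the request's token blocks so that each hash covers the whole prefix up to that block, count how many leading blocks each pod already holds, and prefer the pod with the longest match, breaking ties by load. The block size, the hashing scheme, the pod names, and the load numbers are assumptions for illustration only and are not llm-d's or vLLM's actual implementation.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; tiny and hypothetical, for readability

def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash full blocks so each hash identifies the entire prefix so far."""
    hashes, parent = [], b""
    usable = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for i in range(0, usable, block_size):
        block = ",".join(map(str, tokens[i:i + block_size])).encode()
        parent = hashlib.sha256(parent + block).digest()
        hashes.append(parent)
    return hashes

def prefix_hits(request_hashes, cached_hashes):
    """Number of leading request blocks already present in a pod's cache."""
    hits = 0
    for h in request_hashes:
        if h not in cached_hashes:
            break
        hits += 1
    return hits

def pick_pod(tokens, pod_caches, pod_load):
    """Prefer the longest cached prefix; break ties with the lowest load."""
    req = block_hashes(tokens)
    return max(pod_caches,
               key=lambda p: (prefix_hits(req, pod_caches[p]), -pod_load[p]))

system_prompt = list(range(8))  # shared 8-token prefix (made up)
pod_caches = {
    "pod-a": set(block_hashes(system_prompt + [100, 101, 102, 103])),
    "pod-b": set(),
}
pod_load = {"pod-a": 5, "pod-b": 1}  # queued requests per pod (made up)

warm = pick_pod(system_prompt + [200, 201, 202, 203], pod_caches, pod_load)
cold = pick_pod([900, 901, 902, 903, 904, 905, 906, 907], pod_caches, pod_load)
print(warm, cold)
```

A plain round-robin balancer would ignore `pod_caches` entirely, which is how related requests end up scattered across pods.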

Intelligent Inference Scheduling with llm-d

September 3, 2025 · 10 min read

Nili Guy

R&D Manager, AI Infrastructure, IBM

Vita Bortnikov

IBM Fellow, IBM

Etai Lev Ran

Cloud Architect, IBM

Director of Engineering, Red Hat