BLIS: Evolving llm-d at Simulation Speed
Deploying llm-d is not just a question of choosing a model server and adding GPUs. In a production inference deployment, operators have to choose routing policies, admission behavior, batching settings, KV-cache reuse strategies, prefill/decode placement, and autoscaling rules under concrete TTFT, ITL, throughput, and cost constraints.
These choices are coupled. A routing change that improves cache locality can concentrate load. A prefill/decode threshold that helps one workload can hurt another. An admission policy that protects critical traffic can reduce total served volume. A change in any one policy can shift TTFT, inter-token latency, throughput, SLO compliance, and accelerator cost in ways that are difficult to predict analytically.
The only reliable way to confirm those tradeoffs is to measure them in a GPU-backed llm-d cluster. But using cluster runs as the first step in every policy or capacity-planning experiment is too slow and expensive. BLIS provides a faster inner loop: a calibrated discrete-event simulator for distributed inference systems like llm-d. Developers can evaluate candidate policies and deployment configurations locally, then reserve cluster validation for the candidates most likely to matter.
- BLIS is a discrete-event simulator: — it models admission, routing, scheduling, KV cache, batching, and prefill/decode placement without loading model weights or occupying GPUs.
- Calibrated fidelity: Median 7–9% error on end-to-end and inter-token latency across 36 validation experiments spanning 8B–141B parameter models, H100/A100/L40S GPUs, and diverse workloads. Approximately 200× faster than equivalent cluster runs.
- Admission control case study: An AI-native policy-search loop using BLIS discovered a probabilistic admission controller that reduced critical-tier TTFT p90 by up to 97% and end-to-end latency by up to 50%, validated on a real llm-d cluster.
- Capacity planning: BLIS evaluates hundreds of deployment configurations in minutes, producing ranked Pareto-optimal candidates before any GPU time is spent.




























