Simple is Better: Multiplication May Be All You Need for LLM Request Scheduling

arXiv CS Monday 08 June 2026, 04:00 UTC By Dingyan Zhang, Jinbo Han, Kaixi Zhang, Xingda Wei, Sijie Shen, Chenguang Fang, Wenyuan Yu, Jingren Zhou, Rong Chen 1 min read

Key Points

Announce Type: replace Abstract: High-quality LLM request scheduling requires meeting two key objectives: ensuring the routed instance has KVCache to accelerate request execution, and ensuring that the workload is balanced across instances. Achieving both objectives is challenging because pursuing one may compromise the other. Current approaches use various combinators (e.g., linear combinations) to compute a scheduling score that combines indicators for the two objectives.

arXiv:2603.15202v3 Announce Type: replace Abstract: High-quality LLM request scheduling requires meeting two key objectives: ensuring the routed instance has KVCache to accelerate request execution, and ensuring that the workload is balanced across instances. Achieving both objectives is challenging because pursuing one may compromise the other. Current approaches use various combinators (e.g., linear combinations) to compute a scheduling score that combines indicators for the two objectives. These approaches are complex: they either require significant workload-specific hyperparameter tuning or model-hardware-aware simulator development, yet could still lead to suboptimal performance. In this paper, we show that using a simple multiplication of two carefully chosen indicators: one KVCache-aware (new prefill tokens if routed to an instance) and one load-balancing-aware (current batch size of the instance), as the scheduling score (LMETRIC) can achieve both objectives simultaneously without any hyperparameter tuning. The key idea is that the simply multiplied score considers both objectives in a manner similar to a linear combination, but the original hyperparameters cancel out during comparison, so no tuning is needed to find the best parameters. The two indicators are chosen based on our analysis of LLM characteristics. Our extensive experiments show that this simple approach can reduce TTFT by 92% and 39%, and TPOT by 24% and 51%, compared to vLLM-v1 and an in-production scheduler on real-world workloads covering chatbots and coding agents. We also derive the mathematical conditions under which multiplication may fail, and find that such conditions are extremely rare in practice and can be detected (and mitigated) beforehand. LMETRIC has been deployed in production and canary release confirms its effectiveness

LLM (ORG) KVCache (ORG) linear (ORG) LMETRIC (ORG)

Originally published by arXiv CS Read original →

Simple is Better: Multiplication May Be All You Need for LLM Request Scheduling

Related Stories

Link between poverty and access to nature | Letter

The Last Evolution, by John W Campbell Jr. (1932)

Genetically modified worms can now produce and deliver drugs inside a living body, scientists say

Indonesia Landslides Devastated Endangered Orangutans, Study Finds