SLO
Related Articles from SNS
Beyond Greedy Chunking: SLO-Aware Sliding-Window Scheduling for LLM Inference
arXiv:2606.05933v1 Announce Type: new Abstract: With the rapid growth of interactive applications in large language model (LLM) online services, maintaining high system throughput while ensuring user-perceived latency has become a key issue in inference scheduling. Existing LLM service systems rely on coarse-grained output constraints, making it difficult to effectively handle resource contention among multiple requests, resulting in low resource utilization efficiency and limited support...
Scalable Joint Resource Allocation for SLO-Constrained LLM Inference in Heterogeneous GPU Clouds
arXiv:2604.07472v2 Announce Type: replace Abstract: Serving large language model (LLM) inference in cloud environments requires jointly optimizing model selection, GPU provisioning, parallelism configuration, and workload routing under latency, accuracy, memory, and budget constraints. While mixed-integer linear programming (MILP) can model this problem, its computational cost limits frequent re-optimization under demand variability. Existing heuristics often optimize individual components...
Emergence-as-Code as a Foundation for Self-Governing Reliable Systems
arXiv:2602.05458v2 Announce Type: replace Abstract: Service-level objective (SLO)-as-code tools make per-service reliability declarative, but users experience journeys: end-to-end executions whose availability and tail latency emerge from topology, routing, redundancy, timeouts/fallbacks, shared failure domains, and tail amplification. Journey objectives are therefore often maintained outside code and drift away from the effective runtime graph. We propose Emergence-as-Code (EmaC), a...
Harmonia: End-to-End RAG Serving Optimization
arXiv:2505.07833v2 Announce Type: replace Abstract: Retrieval-Augmented Generation (RAG) improves the reliability of large language models by integrating external knowledge, but serving RAG pipelines efficiently is challenging because requests traverse heterogeneous components spanning LLM inference, databases, and CPU-side processing. We present Harmonia, an end-to-end RAG serving framework that addresses these bottlenecks through (i) a flexible pipeline specification interface for...
The Smart Bird Feeders Everyone’s Talking About (and Actually Buying) (2026)
you’ve probably seen a smart bird feeder or know someone who has one. They’re easily recognizable with their clear housing, cameras, and solar panels. Perhaps a friend or family member has sent you a photo or video of a bright goldfinch or handsome woodpecker (guilty).
NetKV: Network-Aware Decode Instance Selection for Disaggregated LLM Inference
Announce Type: new Abstract: Disaggregated LLM inference forces the KV cache to traverse the datacenter network before decoding begins, so transfer time enters directly into the Time to First Token (TTFT) budget. Current schedulers route on compute load and prefix-cache locality alone, ignoring the topological distance and dynamic congestion between prefill and decode instances.
Shift from a Leader-Follower to a Leader-Leader Approach
What a U.S. Navy Captain Can Teach Us About Engineering Leadership. Even though today we lead people, we most likely climbed the engineering ladder through technical excellence. Our code was cleaner, our architectures more elegant and scalable, and the solutions we built worked. Now, when we lead a team of engineers, we may feel that our efficiency has faded.