SLO

Related Articles from SNS

Beyond Greedy Chunking: SLO-Aware Sliding-Window Scheduling for LLM Inference

arXiv:2606.05933v1. Announce type: new. Abstract: With the rapid growth of interactive applications in large language model (LLM) online services, maintaining high system throughput while ensuring user-perceived latency has become a key issue in inference scheduling. Existing LLM service systems rely on coarse-grained output constraints, making it difficult to effectively handle resource contention among multiple requests, resulting in low resource utilization efficiency and limited support...

arXiv CS 5d ago

Scalable Joint Resource Allocation for SLO-Constrained LLM Inference in Heterogeneous GPU Clouds

arXiv:2604.07472v2. Announce type: replace. Abstract: Serving large language model (LLM) inference in cloud environments requires jointly optimizing model selection, GPU provisioning, parallelism configuration, and workload routing under latency, accuracy, memory, and budget constraints. While mixed-integer linear programming (MILP) can model this problem, its computational cost limits frequent re-optimization under demand variability. Existing heuristics often optimize individual components...

arXiv CS 2d ago

Emergence-as-Code as a Foundation for Self-Governing Reliable Systems

arXiv:2602.05458v2. Announce type: replace. Abstract: Service-level objective (SLO)-as-code tools make per-service reliability declarative, but users experience journeys: end-to-end executions whose availability and tail latency emerge from topology, routing, redundancy, timeouts/fallbacks, shared failure domains, and tail amplification. Journey objectives are therefore often maintained outside code and drift away from the effective runtime graph. We propose Emergence-as-Code (EmaC), a...

arXiv CS 1d ago

Harmonia: End-to-End RAG Serving Optimization

arXiv:2505.07833v2. Announce type: replace. Abstract: Retrieval-Augmented Generation (RAG) improves the reliability of large language models by integrating external knowledge, but serving RAG pipelines efficiently is challenging because requests traverse heterogeneous components spanning LLM inference, databases, and CPU-side processing. We present Harmonia, an end-to-end RAG serving framework that addresses these bottlenecks through (i) a flexible pipeline specification interface for...

arXiv CS 1d ago

The Smart Bird Feeders Everyone’s Talking About (and Actually Buying) (2026)

you’ve probably seen a smart bird feeder or know someone who has one. They’re easily recognizable with their clear housing, cameras, and solar panels. Perhaps a friend or family member has sent you a photo or video of a bright goldfinch or handsome woodpecker (guilty).

Wired 1d ago

NetKV: Network-Aware Decode Instance Selection for Disaggregated LLM Inference

Announce type: new. Abstract: Disaggregated LLM inference forces the KV cache to traverse the datacenter network before decoding begins, so transfer time enters directly into the Time to First Token (TTFT) budget. Current schedulers route on compute load and prefix-cache locality alone, ignoring the topological distance and dynamic congestion between prefill and decode instances.

arXiv CS 7d ago

Shift from a Leader-Follower to a Leader-Leader Approach

What a U.S. Navy Captain Can Teach Us About Engineering Leadership. Even though we lead people today, most of us climbed the engineering ladder through technical excellence. Our code was cleaner, our architectures were more elegant and scalable, and the solutions we built worked. Now that we lead a team of engineers, we may feel that our efficiency has faded.

Hacker News 9d ago