Adaptive GPU Multiplexing
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing
arXiv:2511.04791v2 Announce Type: replace Abstract: Modern LLM serving systems must sustain high throughput while meeting strict latency SLOs across two distinct inference phases: compute-intensive prefill and memory-bound decode phases. Existing approaches either (1) aggregate both phases on shared GPUs, leading to interference between prefill and decode phases, which degrades Time-Between-Tokens (TBT); or (2) disaggregate the two phases across GPUs, improving latency but wasting resources...
DynaGraph: Lightweight Multi-Model Interaction Framework via Dynamic Topological Reconfiguration
arXiv:2605.29511v2 Announce Type: replace Abstract: Tackling complex reasoning tasks typically relies on massive monolithic LLMs, which suffer from severe computational redundancy. While task decomposition through structured pipelines or multi-agent collaborations offers an alternative, these approaches inevitably fall into a critical dilemma: predefined static topologies are highly vulnerable to cascading errors, whereas unconstrained dynamic agents suffer from trajectory divergence and...