Throughput Optimization

Related Articles from SNS

Throughput Optimization for Multi-AP IEEE P802.11bq Networks Based on Combinatorial Multi-Armed Bandits

arXiv:2606.03528v1 Announce Type: new Abstract: This paper addresses distributed throughput optimization for dense multi-AP IEEE P802.11bq networks. We develop a packet-level model that jointly captures cross-link carrier-sense multiple access with collision avoidance (CSMA/CA), sub-7GHz RTS/CTS exchange, beam-training overhead, directional mmWave interference, signal-to-interference-plus-noise-ratio (SINR)-based MCS selection, and retransmissions. The resulting configuration problem is...

arXiv CS 7d ago

Microarchitectural Co-Optimization for Sustained Throughput of RISC-V Multi-Lane Chaining Vector Processors

arXiv:2604.22314v2 Announce Type: replace Abstract: Modern RISC vector processors rely on multi-lane parallelism and chaining to achieve high sustained throughput, yet practical execution often deviates from the ideal reference due to microarchitectural inefficiencies. This work targets the open-source RVV processor Ara and analyzes its sustained-throughput loss under a fixed hardware configuration. We first establish an ideal multi-lane chaining model that decomposes ideal execution into...

arXiv CS 6d ago

Breaking the Bubble: Asynchronous Pipeline Parallel Training with Bounded Weight Inconsistency

arXiv:2606.07881v1 Announce Type: new Abstract: Pipeline parallelism is essential for training large neural networks, but existing schedules trade off throughput, memory, and optimization consistency. Synchronous pipelines preserve forward/backward weight consistency but suffer from bubbles; asynchronous pipelines remove bubbles but introduce weight-version mismatch, typically requiring weight stashing, prediction, or correction mechanisms. We introduce PACI (Pipeline Asynchronous training...

arXiv CS 1d ago

DriftSched: Adaptive QoS-Aware Scheduling under Runtime Token Drift for Multi-Tenant GPU Inference

arXiv:2606.02982v1 Announce Type: new Abstract: The rapid growth of large language model (LLM) inference services has increased the demand for efficient multi-tenant GPU scheduling. While modern inference runtimes such as vLLM improve throughput through continuous batching and optimized memory management, accurately estimating the runtime cost of heterogeneous inference requests remains a significant challenge.

arXiv CS 7d ago

ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information

Announce Type: new Abstract: Asynchronous reinforcement learning can improve language-model post-training throughput by decoupling response generation from policy optimization, but stale responses introduce distribution drift. Standard behavior-corrected methods control this drift with behavior-policy probabilities, importance ratios, or clipping, which requires token-aligned, versioned, and numerically consistent behavior log-probabilities across rollout and learner systems. We ask whether...

arXiv CS 7d ago

BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

arXiv:2411.16102v2 Announce Type: replace Abstract: Offline batch inference, which leverages the flexibility of request batching to achieve higher throughput and lower costs, is becoming more popular for latency-insensitive applications. Meanwhile, recent progress in model capability and modality makes requests more diverse in compute and memory demands, creating unique opportunities for throughput improvement by resource overlapping.

arXiv CS 1d ago

DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing

arXiv:2511.04791v2 Announce Type: replace Abstract: Modern LLM serving systems must sustain high throughput while meeting strict latency SLOs across two distinct inference phases: compute-intensive prefill and memory-bound decode phases. Existing approaches either (1) aggregate both phases on shared GPUs, leading to interference between prefill and decode phases, which degrades Time-Between-Tokens (TBT); or (2) disaggregate the two phases across GPUs, improving latency but wasting resources...

arXiv CS 8d ago

Deterministic versus Stochastic Optimization for Joint Path Planning and Dynamic Time Splitting in Multiple-UAV-Cached IoT Networks

Announce Type: new Abstract: This paper examines wireless-powered Internet of Things (IoT) networks involving multiple unmanned aerial vehicles (UAVs) equipped with backscatter and caching technologies to relay and transmit signals. For data communication and energy harvesting (EH), the source transmits information and power to UAVs using the dynamic time splitting (DTS) method. UAVs use harvested energy for passive communication (backscatter) and for active communication (transmitting...

arXiv CS 1d ago

Accelerating Bidiagonalization of Banded Matrices through Memory-Aware Bulge-Chasing on GPUs

arXiv:2510.12705v3 Announce Type: replace Abstract: The reduction of a banded matrix to bidiagonal form is a critical step in the calculation of Singular Values, a cornerstone of scientific computing and AI. Although inherently parallel, this step has traditionally been considered unsuitable for GPUs due to its memory-bound nature. However, recent advances in GPU architectures, such as increased L1 memory per Streaming Multiprocessor or Compute Unit and larger L2 caches, have shifted this...

arXiv CS 1d ago