BlendServe: Optimizing Offline Inference for Auto-regressive Large Models

Related Articles from SNS

BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

arXiv:2411.16102v2. Abstract: Offline batch inference, which leverages the flexibility of request batching to achieve higher throughput and lower costs, is becoming more popular for latency-insensitive applications. Meanwhile, recent progress in model capability and modality makes requests more diverse in compute and memory demands, creating unique opportunities for throughput improvement by resource overlapping.

arXiv CS