BlendServe: Optimizing Offline Inference for Auto-regressive Large Models
BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching
arXiv:2411.16102v2
Abstract: Offline batch inference, which leverages the flexibility of request batching to achieve higher throughput and lower costs, is becoming more popular for latency-insensitive applications. Meanwhile, recent progress in model capability and modality makes requests more diverse in compute and memory demands, creating unique opportunities for throughput improvement by resource overlapping.
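To make the idea of resource overlapping concrete, the toy sketch below pairs compute-heavy requests with memory-heavy ones so that each batch mixes the two kinds of demand. It is not BlendServe's algorithm, which the abstract above does not describe; the `Request` type, its `compute` and `memory` fields (in arbitrary units), and the `blend_batches` helper are all hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    # Hypothetical request descriptor; demands are in arbitrary units.
    name: str
    compute: float  # assumed compute demand
    memory: float   # assumed memory demand


def blend_batches(requests):
    """Greedily pair the most memory-heavy request with the most compute-heavy one.

    Assumes compute + memory > 0 for every request.
    """
    # Sort by the fraction of total demand that is compute:
    # memory-heavy requests come first, compute-heavy requests last.
    ordered = sorted(requests, key=lambda r: r.compute / (r.compute + r.memory))
    batches = []
    lo, hi = 0, len(ordered) - 1
    while lo < hi:
        # Take one request from each end so their demands complement each other.
        batches.append((ordered[lo], ordered[hi]))
        lo += 1
        hi -= 1
    if lo == hi:
        # An odd request out is scheduled on its own.
        batches.append((ordered[lo],))
    return batches


if __name__ == "__main__":
    demo = [
        Request("a", 9, 1),
        Request("b", 1, 9),
        Request("c", 8, 2),
        Request("d", 2, 8),
    ]
    for batch in blend_batches(demo):
        print([r.name for r in batch])
```

Pairing from opposite ends of the sorted list is only the simplest way to mix demands; a real scheduler would also have to respect memory capacity, batch size limits, and actual hardware cost models.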