Active Prefill Control

Related Articles from SNS

Fairness-Aware and Latency-Controllable Scheduling for Chunked-Prefill LLM Serving

arXiv:2606.09061v1. Abstract: As large language models (LLMs) are increasingly deployed with highly heterogeneous workloads, chunked-prefill execution has emerged as a mainstream serving architecture. Balancing scheduling fairness and latency stability in such environments is critical; otherwise, severe head-of-line blocking and request starvation will degrade user experience. However, existing systems rely on rigid First-Come, First-Served (FCFS) policies and static token...

arXiv CS 1d ago

Spike-Aware C++ INT8 Inference for Sparse Spiking Language Models on Commodity CPUs

arXiv:2606.03026v1. Abstract: Spiking language models expose activation sparsity that dense Transformer runtimes do not directly exploit. This paper studies that property from a systems perspective. Building on the SymbolicLight V1 spike-gated language model family, we implement a C++ CPU inference runtime that treats sparse binary spike states as an execution primitive rather than only applying post-hoc weight compression.

arXiv CS 7d ago