Microarchitectural Co-Optimization for Sustained Throughput of RISC-V Multi-Lane Chaining Vector Processors
arXiv:2604.22314v2 Announce Type: replace
Abstract: Modern RISC vector processors rely on multi-lane parallelism and chaining to achieve high sustained throughput, yet practical execution often deviates from the ideal reference due to microarchitectural inefficiencies. This work targets the open-source RVV processor Ara and analyzes its sustained-throughput loss under a fixed hardware configuration. We first establish an ideal multi-lane chaining model that decomposes ideal execution into prologue startup, steady-state progression, and tail drain, and uses this reference to characterize real-execution deviations. Based on this model, we attribute Ara's bottlenecks to three critical paths: memory-side data supply and transaction progression, dependence-and-issue control, and operand delivery and result propagation. To address these bottlenecks, we propose coordinated optimizations, including a descriptor-driven memory front end with next-VL prefetch, early read-dependence release with dynamic local issue control, and multi-source forwarding with dual-source operand queues. Experimental results show that, without increasing raw memory bandwidth or changing the main processor configuration, Ara-Opt achieves a geometric-mean speedup of 1.33x over baseline Ara. Under roofline-based normalization, the geometric-mean gap-closed ratio reaches 12.2%. In particular, scal, axpy, ger, and gemm achieve speedups of approximately 2.41x, 1.60x, 1.52x, and 1.42x, with corresponding gap-closed ratios of 93.7%, 88.9%, 78.3%, and 59.3%, respectively. These results show that the proposed optimizations recover lost sustained throughput under essentially unchanged hardware resources and move regular streaming and high-throughput workloads closer to the roofline-based performance bound.
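The abstract reports two kinds of figures: per-kernel speedups summarized by a geometric mean, and a roofline-normalized "gap-closed ratio". The abstract does not give the exact formulas, so the following is only a minimal sketch under stated assumptions: the roofline bound is taken as min(peak compute, bandwidth x arithmetic intensity), and the gap-closed ratio is assumed to be the fraction of the baseline-to-bound gap that the optimized design recovers. All function and parameter names are hypothetical and are not taken from the paper or from the Ara code base.

```python
import math


def roofline_bound(peak_ops_per_cycle, bytes_per_cycle, ops_per_byte):
    """Classic roofline: attainable throughput is capped by compute or memory."""
    return min(peak_ops_per_cycle, bytes_per_cycle * ops_per_byte)


def gap_closed_ratio(baseline, optimized, bound):
    """ASSUMED definition: share of the baseline-to-roofline gap that is recovered.

    0.0 means no improvement over baseline; 1.0 means the bound is reached.
    """
    gap = bound - baseline
    if gap <= 0:
        raise ValueError("baseline already meets or exceeds the bound")
    return (optimized - baseline) / gap


def geometric_mean(values):
    """Geometric mean of strictly positive values (e.g. per-kernel speedups)."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    # Illustrative numbers only; they are NOT measurements from the paper.
    bound = roofline_bound(peak_ops_per_cycle=16.0, bytes_per_cycle=8.0, ops_per_byte=0.5)
    print("bound:", bound)
    print("gap closed:", gap_closed_ratio(baseline=2.0, optimized=3.0, bound=bound))

    # Speedups quoted in the abstract for scal, axpy, ger, and gemm. This is a
    # subset of the evaluated kernels, so its geometric mean is not expected to
    # equal the 1.33x reported over the full benchmark set.
    print("geomean of quoted speedups:", geometric_mean([2.41, 1.60, 1.52, 1.42]))
```

Under this assumed definition, a kernel that starts close to its roofline bound can show a high gap-closed ratio with a modest speedup, while a kernel far from the bound can show the opposite, which is one reason the two metrics are reported side by side.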