CUCo: An Agentic Framework for Compute and Communication Co-design

Key Points

arXiv:2603.02376v2. Abstract: Computation and communication in distributed LLM training and inference are traditionally optimized in isolation. Expert-crafted systems such as DeepEP, FLUX, and TokenWeave show the potential of co-design, but they require deep systems expertise and hardware-specific tuning. CUCo is an agentic framework that automates compute-communication co-design of CUDA kernels. It combines a structured design-space formalization with a correctness-first fast-path agent for reliable baselines and an evolution-driven slow-path agent for high-performance strategies. CUCo achieves up to 1.57x speedup across four multi-GPU workloads and discovers a two-stream overlap strategy on a DeepSeek-V3 MoE layer that hides dispatch behind local compute, at an LLM inference cost under $10 per workload.
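
To make the idea of hiding dispatch behind local compute concrete, here is a minimal timing model in Python. It is a sketch of the general two-stream overlap principle, not CUCo's implementation or the kernel it discovered. The phase names (`dispatch`, `local_compute`, `remote_compute`) and all durations are hypothetical values chosen for illustration; none of them come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    """One step of a hypothetical MoE layer, with an assumed duration."""
    name: str
    duration_ms: float


def serial_time(phases):
    """Total time when every phase runs back to back on one stream."""
    return sum(p.duration_ms for p in phases)


def two_stream_time(comm, independent, dependent):
    """Total time when `comm` and `independent` run concurrently on two
    streams, and `dependent` starts only after both have finished."""
    overlapped = max(comm.duration_ms, independent.duration_ms)
    return overlapped + dependent.duration_ms


# Hypothetical durations, for illustration only (not measurements).
dispatch = Phase("dispatch", 3.0)              # token dispatch (communication)
local_compute = Phase("local_compute", 4.0)    # work that needs no remote tokens
remote_compute = Phase("remote_compute", 5.0)  # work that needs dispatched tokens

serial = serial_time([dispatch, local_compute, remote_compute])
overlapped = two_stream_time(dispatch, local_compute, remote_compute)
speedup = serial / overlapped

print(f"serial: {serial} ms, overlapped: {overlapped} ms, speedup: {speedup:.2f}x")
```

In this model the communication phase is fully hidden whenever the independent compute takes at least as long as the dispatch. When dispatch is the longer of the two, only part of it is hidden and the benefit shrinks accordingly.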
Originally published by arXiv CS