SWARM+: Scalable and Resilient Multi-Agent Consensus for Decentralized Data-Aware Workload Management
arXiv:2603.19431v3 Announce Type: replace
Abstract: Distributed scientific workflows are increasingly executed across heterogeneous and geo-distributed computing environments, where centralized workload orchestration becomes a scalability and resilience bottleneck. This paper presents SWARM+, a decentralized workload management system that coordinates workload placement through hierarchical multi-agent consensus, reducing coordination overhead and dramatically improving scalability, while tolerating failures and dynamic membership changes. SWARM+ enables data-aware scheduling policies that incorporate resource availability, data transfer node (DTN) connectivity, and data locality into workload placement decisions. We evaluate SWARM+ on the distributed FABRIC testbed using heterogeneous scientific workloads derived from production workflow traces obtained from the Pegasus Workflow Management System (WMS). Experimental results show that SWARM+ scales coordination to 990 distributed agents with approximately 1 s per-job selection time at 110 agents. SWARM+ demonstrates balanced workload distribution, maintains over 97% job completion under distributed failures with graceful degradation (mean ~95% job completion) during correlated site outages, tolerates coordinator agent failures gracefully, improves schedule quality by employing data-aware policies, and reduces both selection time and scheduling latency by 97-98% when compared to the prior SWARM system.
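To make the idea of a data-aware scheduling policy concrete, the following is a minimal sketch of how one agent might score itself for a job using the three inputs the abstract names: resource availability, DTN connectivity, and data locality. It is not the SWARM+ implementation. The `Agent` and `Job` types, the `placement_cost` and `select_agent` functions, the weights, and the cost formula are all hypothetical choices made for illustration, and the multi-agent consensus step that SWARM+ uses to agree on a placement is not modeled here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Agent:
    # Hypothetical view of one scheduling agent's local state.
    name: str
    free_cores: int
    free_memory_gb: float
    dtn_reachable: frozenset   # names of DTNs this agent can reach
    local_datasets: frozenset  # datasets already staged at this agent's site


@dataclass(frozen=True)
class Job:
    # Hypothetical job description.
    cores: int
    memory_gb: float
    dtn: str                   # DTN the job must be able to reach
    datasets: frozenset        # input datasets the job reads


def placement_cost(job: Job, agent: Agent,
                   w_load: float = 1.0, w_data: float = 1.0) -> Optional[float]:
    """Return a cost (lower is better), or None if the agent is infeasible."""
    # Resource availability: the job must fit in the agent's free capacity.
    if agent.free_cores <= 0 or agent.free_memory_gb <= 0:
        return None
    if agent.free_cores < job.cores or agent.free_memory_gb < job.memory_gb:
        return None
    # DTN connectivity: treated here as a hard constraint (an assumption).
    if job.dtn not in agent.dtn_reachable:
        return None
    # Load term: fraction of the tightest free resource the job would consume.
    load = max(job.cores / agent.free_cores,
               job.memory_gb / agent.free_memory_gb)
    # Data locality term: fraction of input datasets that are not local.
    missing = len(job.datasets - agent.local_datasets)
    data_penalty = missing / len(job.datasets) if job.datasets else 0.0
    return w_load * load + w_data * data_penalty


def select_agent(job: Job, agents: Iterable[Agent]) -> Optional[str]:
    """Pick the feasible agent with the lowest cost; ties break by name."""
    feasible = []
    for agent in agents:
        cost = placement_cost(job, agent)
        if cost is not None:
            feasible.append((cost, agent.name))
    return min(feasible)[1] if feasible else None


agents = [
    Agent("a", 8, 32.0, frozenset({"dtn1"}), frozenset()),
    Agent("b", 8, 32.0, frozenset({"dtn1"}), frozenset({"d1"})),
    Agent("c", 64, 256.0, frozenset({"dtn2"}), frozenset({"d1"})),
]
job = Job(cores=4, memory_gb=16.0, dtn="dtn1", datasets=frozenset({"d1"}))
chosen = select_agent(job, agents)
```

In this toy setup, agent `c` has the most free capacity but is excluded because it cannot reach the required DTN, and agent `b` is preferred over `a` because it already holds the input dataset. In a decentralized system, each agent would compute such a cost locally and the agents would then agree on a winner through consensus rather than through a single central `select_agent` call.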