SWARM+: Scalable and Resilient Multi-Agent Consensus for Decentralized Data-Aware Workload Management

arXiv:2603.19431v2

Abstract: Distributed scientific workflows are increasingly executed across heterogeneous and geo-distributed computing environments, where centralized workload orchestration becomes a scalability and resilience bottleneck. This paper presents SWARM+, a decentralized workload management system that coordinates workload placement through hierarchical multi-agent consensus, reducing coordination overhead and improving scalability while tolerating failures and dynamic membership changes. SWARM+ enables data-aware scheduling policies that incorporate resource availability, data transfer node (DTN) connectivity, and data locality into workload placement decisions. We evaluate SWARM+ on the distributed FABRIC testbed using heterogeneous scientific workloads derived from production workflow traces obtained from the Pegasus Workflow Management System (WMS). Experimental results show that SWARM+ scales coordination to 990 distributed agents and achieves sub-second per-job selection time with 110 agents. SWARM+ demonstrates balanced workload distribution, maintains over 95% job completion under distributed failures, degrades gracefully during correlated site outages, tolerates coordinator agent failures, improves schedule quality by employing data-aware policies, and reduces both selection time and scheduling latency by 97-98% compared to the prior SWARM system.
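The abstract names three inputs to data-aware placement: resource availability, DTN connectivity, and data locality. As a rough illustration only, the sketch below shows one way such inputs could be folded into a per-agent cost. It is not the SWARM+ algorithm: the `Agent` and `Job` types, the field names, the weights, and the `placement_cost` and `select_agent` functions are all hypothetical, and the consensus step that SWARM+ uses to agree on a placement is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Hypothetical view of one agent's site; all fields are illustrative."""
    name: str
    free_cores: int
    dtn_reachable: bool         # can the site reach a DTN holding the input?
    local_datasets: frozenset   # datasets already stored at the site


@dataclass
class Job:
    """Hypothetical job description."""
    name: str
    cores: int
    dataset: str


def placement_cost(job: Job, agent: Agent) -> float:
    """Lower is better; infinite cost means the agent cannot run the job."""
    if agent.free_cores < job.cores:
        return float("inf")             # resource availability
    if job.dataset in agent.local_datasets:
        transfer = 0.0                  # data locality: nothing to move
    elif agent.dtn_reachable:
        transfer = 1.0                  # assumed flat penalty for a DTN transfer
    else:
        return float("inf")             # no path to the input data
    load = job.cores / agent.free_cores  # share of free capacity consumed
    return transfer + load


def select_agent(job: Job, agents: list) -> "Agent | None":
    """Return the lowest-cost feasible agent, or None if none qualifies."""
    best = min(agents, key=lambda a: placement_cost(job, a), default=None)
    if best is None or placement_cost(job, best) == float("inf"):
        return None
    return best
```

In a decentralized setting each agent would presumably compute such a cost locally and exchange it with its peers; how SWARM+ aggregates those values through hierarchical consensus is described in the paper, not here.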
Originally published by arXiv CS.