MVAPICH2
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
PartRePer-MPI: Combining Fault Tolerance and Performance for MPI Applications
arXiv:2310.16370v2 Announce Type: replace Abstract: As we have entered Exascale computing, the faults in high-performance systems are expected to increase considerably. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a much higher frequency resulting in an excessive amount of overhead which would not be sustainable for many scientific applications. Replication allows for fast recovery from failures by simply dropping the...