Tag archive for: Adaptive Routing

Fault-tolerant Routing for Multiple Permanent and Non-permanent Faults in HPC Systems

The interconnection network links the processing units of modern high-performance computing (HPC) systems and carries the communication between them. In this context, network faults have an extremely high impact, since most routing algorithms were not designed to tolerate them. As a result, even a single fault may stall messages in the network, preventing applications from finishing, or may lead to deadlocked configurations. In this paper we introduce a fault-tolerant routing method designed to handle a large number of dynamic permanent and non-permanent link faults. Because failures appear randomly during system operation, our method provides escape paths for stalled messages while avoiding deadlock. Our proposal routes around faulty areas using multipath routing, exploiting communication path redundancy as long as alternative paths are available. The performance evaluation consists of synthetic test scenarios that demonstrate correctness and test scenarios based on the availability traces of real high-performance systems. Experiments show that our method allows applications to complete their executions successfully even in the presence of a large number of faults, with performance degradation below 3% for a 1024-node system with up to 200 simultaneous link failures.
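To make the idea of routing around faulty links concrete, the following is a minimal sketch and not the method proposed in the paper. It assumes a 2D mesh topology and uses a plain breadth-first search over healthy links to find an alternative path. The names `mesh_links` and `find_path` are hypothetical, and the sketch does not address deadlock avoidance, which the paper treats as a separate requirement.

```python
from collections import deque


def mesh_links(width, height):
    """Return the set of undirected links of a width x height 2D mesh.

    Assumption: a mesh is used only for illustration; the abstract does
    not state which topology the authors evaluate.
    """
    links = set()
    for x in range(width):
        for y in range(height):
            if x + 1 < width:
                links.add(frozenset({(x, y), (x + 1, y)}))
            if y + 1 < height:
                links.add(frozenset({(x, y), (x, y + 1)}))
    return links


def find_path(links, faulty, src, dst):
    """Breadth-first search for a shortest path that uses only healthy links.

    Returns the path as a list of nodes, or None when every route between
    src and dst crosses a faulty link.
    """
    # Build an adjacency map from the links that are still healthy.
    neighbors = {}
    for link in links - faulty:
        a, b = tuple(link)
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the parent pointers back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in neighbors.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


if __name__ == "__main__":
    links = mesh_links(4, 4)
    broken = {frozenset({(1, 0), (2, 0)})}
    print(find_path(links, set(), (0, 0), (3, 0)))
    print(find_path(links, broken, (0, 0), (3, 0)))
```

In this sketch a single faulty link on the direct route lengthens the path by two hops, and a route exists only while path redundancy remains, which mirrors the abstract's condition "as long as alternative paths are available". A real router would make this decision in a distributed way and would also need a deadlock-free escape mechanism.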