Scheduling Message Processing for Reducing Rollback Propagation
- Yi-Min Wang
- W. Kent Fuchs
Published by Institute of Electrical and Electronics Engineers, Inc.
Traditional checkpointing and rollback recovery techniques for parallel systems have typically assumed that the communication pattern is specified by program behavior. In this paper we exploit the property that the communication pattern can often be changed at run time without affecting program correctness. We describe a scheduling algorithm for message processing, together with its implementation, that reduces rollback propagation. The algorithm incorporates a user-transparent prioritized scheme based upon the run-time communication and checkpointing history. Communication-trace-driven simulation of several parallel programs written in the Chare Kernel language demonstrates that the probability of rollback propagation can be reduced at the cost of slight additional performance degradation.
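The abstract does not spell out the priority rule, so the following is only an illustrative sketch of the general idea and not the authors' algorithm. It assumes that each message is tagged with the sender's checkpoint interval at send time, and that the receiver learns when senders take checkpoints. Under that assumption, a message whose send is already covered by a later sender checkpoint cannot be undone by a sender rollback, so processing it first creates no new rollback dependency. All names here (`Message`, `RollbackAwareQueue`, `record_checkpoint`, `send_interval`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Message:
    sender: int         # id of the sending process
    send_interval: int  # assumed tag: sender's checkpoint interval at send time
    payload: str


class RollbackAwareQueue:
    """Hypothetical receiver-side queue that prefers dependency-free messages."""

    def __init__(self) -> None:
        self._pending: List[Message] = []
        # Highest checkpoint index this receiver knows each sender has taken.
        self._last_ckpt: Dict[int, int] = {}

    def record_checkpoint(self, sender: int, ckpt_index: int) -> None:
        """Note that `sender` has completed checkpoint number `ckpt_index`."""
        known = self._last_ckpt.get(sender, 0)
        self._last_ckpt[sender] = max(known, ckpt_index)

    def enqueue(self, msg: Message) -> None:
        self._pending.append(msg)

    def is_safe(self, msg: Message) -> bool:
        """True if the send is covered by a later checkpoint of the sender."""
        return self._last_ckpt.get(msg.sender, 0) > msg.send_interval

    def next_message(self) -> Optional[Message]:
        """Return the oldest safe message, else the oldest message overall."""
        if not self._pending:
            return None
        for i, msg in enumerate(self._pending):
            if self.is_safe(msg):
                return self._pending.pop(i)
        return self._pending.pop(0)


queue = RollbackAwareQueue()
queue.enqueue(Message(sender=1, send_interval=0, payload="a"))
queue.enqueue(Message(sender=2, send_interval=0, payload="b"))
queue.record_checkpoint(sender=2, ckpt_index=1)  # sender 2 has checkpointed
first = queue.next_message()   # "b": covered by sender 2's checkpoint
second = queue.next_message()  # "a": nothing safe remains, fall back to FIFO
```

Reordering like this is only valid when the program tolerates processing messages in any order, which is the property the abstract relies on. The linear scan keeps the sketch short; a real scheduler would also have to bound how long an unsafe message can be deferred.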
© 1992 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.