Checkpointing and Rollback for Distributed Applications

If no precautions are taken and the computer system supporting an application fails, the program must be restarted from the beginning. In parallel or distributed systems, the probability that some node or network link will fail grows with the number of nodes. In addition, for large-scale parallel programs, the time required to restart the computation from the beginning can be substantial.

For uniprocessor systems, if the program state is periodically stored in a checkpoint, the program can roll back to its latest checkpoint rather than restart from the beginning. Recording and restarting from a checkpoint in a parallel system is more complex. In a message passing system, if each node independently records a periodic local checkpoint, an inconsistent global state may be recorded. For example, after recovery it could appear as though a message was sent but never received, or received but never sent.
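
To make the uniprocessor case concrete, the following Python sketch periodically saves the program state and rolls back to the latest checkpoint on restart. The file name, checkpoint interval, and toy computation are assumptions made for the example, not part of any particular system.

    import os
    import pickle

    CHECKPOINT_FILE = "app.ckpt"    # hypothetical file name
    CHECKPOINT_INTERVAL = 100       # iterations between checkpoints (assumed)

    def save_checkpoint(state):
        # Write to a temporary file and rename atomically, so a failure
        # in mid-save cannot corrupt the latest good checkpoint.
        tmp = CHECKPOINT_FILE + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, CHECKPOINT_FILE)

    def load_checkpoint():
        # Roll back: restore the most recent checkpoint if one exists.
        if os.path.exists(CHECKPOINT_FILE):
            with open(CHECKPOINT_FILE, "rb") as f:
                return pickle.load(f)
        return None     # no checkpoint yet: start from the beginning

    def run():
        state = load_checkpoint() or {"iteration": 0, "total": 0}
        while state["iteration"] < 10000:
            state["total"] += state["iteration"]    # the actual computation
            state["iteration"] += 1
            if state["iteration"] % CHECKPOINT_INTERVAL == 0:
                save_checkpoint(state)

    if __name__ == "__main__":
        run()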

Conceptually, the simplest solution is to retain every local checkpoint of every process. When an error occurs, the program is rolled back to the set of local checkpoints that constitutes the most recent consistent global checkpoint. The program may have to roll back past more than one local checkpoint to reach a consistent global state; in the worst case, it must roll all the way back to the beginning. This extended rollback is called the "domino effect". One way to avoid the domino effect altogether is to use a centralized or "coordinated" checkpointing algorithm that ensures all local checkpoints together form a consistent global state. However, coordinating a checkpoint requires substantial overhead, so the "failure-free" (normal) operation of the system suffers a loss of performance.
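
A minimal sketch of the coordinated approach appears below, with a barrier standing in for the coordinator and threads standing in for processes. It assumes that any message sent before the barrier is also delivered before the barrier completes; a real coordinated protocol must enforce this quiescence (or record in-flight channel state) explicitly, which is the source of the failure-free overhead mentioned above.

    import threading

    NUM_PROCS = 4   # process count assumed for this sketch

    class Proc(threading.Thread):
        """One simulated process; a shared barrier plays the coordinator."""

        def __init__(self, pid, barrier):
            super().__init__()
            self.pid = pid
            self.barrier = barrier
            self.state = 0
            self.checkpoint = None

        def run(self):
            for step in range(1, 31):
                self.state += step      # normal, failure-free computation
                if step % 10 == 0:
                    # Phase 1: everyone quiesces at the barrier, so (under
                    # the delivery assumption above) no message crosses the
                    # checkpoint line.
                    self.barrier.wait()
                    self.checkpoint = self.state    # consistent local checkpoint
                    # Phase 2: resume only after every process has saved.
                    self.barrier.wait()

    barrier = threading.Barrier(NUM_PROCS)
    procs = [Proc(i, barrier) for i in range(NUM_PROCS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print([p.checkpoint for p in procs])   # one consistent global checkpoint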

The domino effect can also be avoided with uncoordinated checkpoints by logging all messages between processes. During recovery, messages are replayed from the message logs until the logs are exhausted. However, the overhead of saving all previous checkpoints and all messages is undesirable given the volume of information recorded, and it is impractical for embedded systems with limited resources. Techniques have been developed to limit the domino effect without logging every message and saving every checkpoint. If only those messages that would cause a domino effect are logged, the remaining messages can be regenerated on the fly by re-executing parts of other processes.
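
The sketch below illustrates receiver-side message logging combined with an independent local checkpoint: the process logs each message before applying it, and recovery replays the log in order after restoring the checkpoint. It assumes a deterministic message handler; the class and method names are illustrative, not from any particular system.

    import pickle

    class LoggingProcess:
        """Uncoordinated checkpointing with receiver-side message logging."""

        def __init__(self):
            self.state = 0
            self.checkpoint = None
            self.msg_log = []   # messages received since the last checkpoint

        def take_checkpoint(self):
            # An independent local checkpoint; messages logged before it
            # are no longer needed for recovery.
            self.checkpoint = pickle.dumps(self.state)
            self.msg_log.clear()

        def receive(self, msg):
            self.msg_log.append(msg)    # log before applying (pessimistic)
            self.apply(msg)

        def apply(self, msg):
            self.state += msg           # deterministic message handler

        def recover(self):
            # Roll back to the local checkpoint, then replay the logged
            # messages in order; determinism reproduces the lost state.
            self.state = pickle.loads(self.checkpoint)
            for msg in self.msg_log:
                self.apply(msg)

    p = LoggingProcess()
    p.take_checkpoint()
    for m in (3, 5, 7):
        p.receive(m)
    expected = p.state
    p.state = -1        # simulate a failure corrupting volatile state
    p.recover()
    assert p.state == expected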

Distributed Shared Memory (DSM) offers a means of providing programmers with a shared memory abstraction on message passing systems, an abstraction that promises to be the dominant paradigm for high-performance computing. DSM systems provide the shared memory abstraction via a sophisticated run-time system that passes messages to support it. Because DSM systems employ message passing at the lower levels, it is possible to apply techniques similar to those described above, i.e., to checkpoint the underlying message passing layer. However, the checkpointing algorithm can take advantage of patterns in the messages sent to reduce the overhead. This paper describes previous approaches to checkpointing parallel programs and extensions of these approaches to distributed shared memory systems.
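
As one illustration of exploiting message patterns, the sketch below logs only small page-fetch event records (page id, owner, version) at the DSM layer instead of the page contents themselves, on the assumption that deterministic replay of the owning process regenerates identical pages. The classes and the message layer are hypothetical stand-ins, not the API of any real DSM system.

    class FakeNetwork:
        """Stand-in for the DSM message layer (illustrative only)."""

        def __init__(self, directory):
            self.directory = directory  # page id -> (owner, version, data)

        def fetch(self, page_id):
            return self.directory[page_id]

    class DSMNode:
        """DSM run-time that logs page-fetch *events* rather than page
        contents, assuming replay of the owner reproduces the pages."""

        def __init__(self, nid, network):
            self.nid = nid
            self.network = network
            self.pages = {}         # page id -> bytes cached locally
            self.fetch_log = []     # (page id, owner, version) event records

        def read(self, page_id):
            if page_id not in self.pages:
                owner, version, data = self.network.fetch(page_id)
                # Log only the small event record, not the page data; the
                # replayed owner is assumed to produce the same version.
                self.fetch_log.append((page_id, owner, version))
                self.pages[page_id] = data
            return self.pages[page_id]

    net = FakeNetwork({0: (1, 7, b"page-zero-bytes")})
    node = DSMNode(0, net)
    node.read(0)
    print(node.fetch_log)   # [(0, 1, 7)] -- the page bytes are not logged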
