Modeling and Optimization of Nonblocking Checkpointing for Optimistic Simulation on Myrinet Clusters / Quaglia, Francesco; Santoro, A. - In: JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING. - ISSN 0743-7315. - 65, no. 6 (2005), pp. 667-677. [10.1016/j.jpdc.2005.02.006]
Modeling and Optimization of Nonblocking Checkpointing for Optimistic Simulation on Myrinet Clusters
QUAGLIA, Francesco;
2005
Abstract
Checkpointing-and-Communication Library (CCL) is a recently developed software library implementing CPU-offloaded checkpointing functionalities in support of optimistic parallel simulation on Myrinet clusters. Specifically, CCL implements a non-blocking execution mode for the memory-to-memory data copy associated with checkpoint operations, based on the data transfer capabilities of a programmable DMA engine on board Myrinet network cards. Re-synchronization between CPU and DMA activities must sometimes be employed for several reasons, such as maintenance of data consistency, thus adding some overhead to (otherwise CPU cost-free) non-blocking checkpoint operations. In this paper we present a detailed cost model for non-blocking checkpointing and derive a performance-effective re-synchronization semantics, which we call minimum cost re-synchronization (MC). With this semantics, an occurrence of re-synchronization either commits an ongoing DMA-based checkpoint operation (causing suspension of CPU activities) or aborts the operation (possibly increasing the expected rollback cost because fewer checkpoints are committed), whichever has the lower expected overhead according to the cost model. We discuss viable techniques to solve the cost model, then we present the implementation of MC that we have developed within the CCL framework. As we will show, this implementation relies on solutions we introduce to estimate or determine the values of low-level system parameters (e.g., the residual completion time for DMA operations). This paper also reports experimental results demonstrating the performance benefits of this optimized re-synchronization semantics, in terms of increased execution speed, for a Personal Communication System (PCS) simulation application, selected as a testbed among real-world simulation problems.
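The commit-or-abort choice described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's cost model or the CCL API: the class `ResyncState`, the function `mc_decision`, the three cost terms, and the tie-breaking rule are all hypothetical names and simplifications introduced here, assuming only that committing costs the residual DMA completion time and that aborting may raise the expected rollback cost.

```python
from dataclasses import dataclass


@dataclass
class ResyncState:
    """Hypothetical inputs to the decision, all in the same time unit."""
    residual_dma_time: float        # estimated time left for the DMA copy to finish
    rollback_cost_if_commit: float  # expected rollback cost if the checkpoint is kept
    rollback_cost_if_abort: float   # expected rollback cost if the checkpoint is dropped


def mc_decision(state: ResyncState) -> str:
    """Return "commit" or "abort", whichever has the lower expected overhead.

    Assumption: committing suspends the CPU for the residual DMA time,
    while aborting costs nothing now but may increase the rollback cost.
    """
    commit_overhead = state.residual_dma_time + state.rollback_cost_if_commit
    abort_overhead = state.rollback_cost_if_abort
    # Ties are resolved in favor of committing (an arbitrary choice in this sketch).
    return "commit" if commit_overhead <= abort_overhead else "abort"


if __name__ == "__main__":
    nearly_done = ResyncState(5.0, 10.0, 30.0)
    just_started = ResyncState(50.0, 10.0, 30.0)
    print(mc_decision(nearly_done))
    print(mc_decision(just_started))
```

In this sketch a checkpoint whose DMA transfer is almost complete tends to be committed, while one that has barely started tends to be aborted. In practice the residual DMA time and the rollback costs would have to be estimated at run time, which the abstract identifies as part of the implementation work.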