You are here

Raphael Y. de Camargo

Executing long-running parallel applications in Opportunistic Grid environments composed of heterogeneous, shared user workstations, is a daunting task. Machines may fail, become unaccessible, or may switch from idle to busy unexpectedly, compromising the execution of applications. A mechanism for fault-tolerance that supports these heterogeneous architectures is an important requirement for such a system.

Besides, we need to deal with the large amounts of data generated by checkpointing data intensive parallel applications. The classical approach is to employ high-throughput checkpoint servers connected to the computational nodes by high speed networks. In the case of Opportunistic Grid Computing, we do not want to be forced to rely on such dedicated hardware. Instead, we want to use the shared Grid nodes to store application data in a distributed fashion.

In this work, we provide fault-tolerant for the execution of BSP parallel applications on heterogeneous, shared workstations [1,2]. A precompiler instruments application source code to save state periodically into checkpoint files. Generated checkpoints are portable and can be recovered in a machine of different architecture. We implemented a monitoring and recovering infrastructure in the InteGrade Grid middleware.

To deal with the storage problem, we are evaluating several strategies to store checkpoints on distributed non-dedicated repositories [3]. We consider the tradeoff among computational overhead, storage overhead, and degree of fault-tolerance of these strategies. We compare the use of replication, parity information, and information dispersal (IDA).