HPC4U Fault Tolerant Resource Management: Internet Grid Migration Released

CETIC, an ICT research centre, announces that the HPC4U European research project (FP6), which is active in Grid computing technologies, has released cluster middleware that provides fault tolerance for parallel applications and allows job migration over the Grid. The system lets users negotiate Service Level Agreements (SLAs) and then run parallel applications on a given site (e.g. CETIC, Belgium). Jobs are checkpointed regularly in an application-transparent manner. If a failure occurs (e.g. a node outage), the HPC4U cluster middleware can not only restart the job locally but also migrate it over the Grid to a remote site (e.g. University of Paderborn, Germany, or Fujitsu, France), where the computation restarts from the latest checkpoint. This mechanism prevents losing computation time and ensures SLA compliance even when resources fail.

HPC4U Technology: Internet Grid Migration

HPC4U worked on realizing an SLA-aware Grid fabric, which consists of multiple elements. An open-source resource management system, OpenCCS (www.openccs.eu), developed by the University of Paderborn, forms the top layer. It manages the cluster as a whole and serves as the master interface to upper-layer clients. Within HPC4U, the Technical University of Berlin integrated the resource management system with the Globus Toolkit 4 to set up a Grid-ready environment that can migrate jobs over the Internet to available remote resources. This integration also allows Grid middleware components to negotiate Service Level Agreements. The resource management system accepts a job only if its SLA can be fulfilled under the current system conditions. In particular, it is responsible for fulfilling all agreed SLAs, even in case of failures such as resource outages.
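The acceptance and recovery behaviour described above can be illustrated with a small sketch. It is not the OpenCCS implementation: the names Site, Job, can_accept and choose_restart_site, and the simple "enough free nodes and enough time before the deadline" test, are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Site:
    """A cluster site (hypothetical model, not an OpenCCS type)."""
    name: str
    free_nodes: int


@dataclass
class Job:
    """A checkpointed parallel job with a deadline-style SLA (assumed form)."""
    job_id: str
    nodes_needed: int
    remaining_hours: float   # work left, counted from the latest checkpoint
    deadline_hours: float    # time left until the SLA deadline


def can_accept(job: Job, site: Site) -> bool:
    """Accept a job only if the SLA can be met in the site's current state."""
    has_nodes = site.free_nodes >= job.nodes_needed
    meets_deadline = job.remaining_hours <= job.deadline_hours
    return has_nodes and meets_deadline


def choose_restart_site(job: Job, local: Site, remotes: List[Site]) -> Optional[Site]:
    """After a failure, prefer a local restart; otherwise migrate to the
    first remote site that can still honour the SLA. Returns None if no
    site can, in which case the SLA would be violated."""
    for site in [local, *remotes]:
        if can_accept(job, site):
            return site
    return None
```

For example, a four-node job whose local site has no free nodes left after an outage would be placed on the first remote site with at least four free nodes, provided the work remaining since the latest checkpoint still fits before the deadline.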
At the cluster level, the resource management system interacts with several subcomponents that offer fault tolerance mechanisms. IBM's MetaCluster checkpointing subsystem provides process fault tolerance, and the Seanodes storage subsystem (VSM/Metanode and Exanodes) offers storage virtualisation coupled with fault tolerance. The third subsystem, in charge of networking, consists of the Scali MPI libraries and the Dolphin SCI interconnect.