HPC4U Fault Tolerant Resource Management: Internet Grid Migration Released

CETIC, an ICT research centre, announces that the HPC4U European research project (FP6), active in Grid computing technologies, has released cluster middleware that provides fault tolerance for parallel applications and allows job migration over the Grid. The system lets users negotiate Service Level Agreements (SLAs) and then run parallel applications on a given site (e.g. CETIC, Belgium). Jobs are checkpointed regularly in an application-transparent manner. If a failure occurs (e.g. a node outage), the HPC4U cluster middleware can not only restart the job locally but also migrate it over the Grid to a remote site (e.g. University of Paderborn, Germany, or Fujitsu, France), where the computation restarts from the latest checkpoint. This mechanism prevents the loss of computation time and ensures SLA compliance even when resources fail.
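The checkpoint, failure and migration cycle described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the HPC4U implementation: the `Job` class, the `run` function, the site names and the step-based notion of progress are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical job whose progress is counted in abstract steps."""
    total_steps: int
    checkpoint_every: int
    step: int = 0
    last_checkpoint: int = 0
    history: list = field(default_factory=list)  # (site, step at which we started there)


def run(job, site, fail_at=None):
    """Run the job on `site`; return True if it finished, False if the site failed.

    `fail_at` is an illustrative way to inject a failure at a given step.
    """
    job.history.append((site, job.step))
    while job.step < job.total_steps:
        if fail_at is not None and job.step == fail_at:
            # Site failure: everything since the last checkpoint is lost.
            job.step = job.last_checkpoint
            return False
        job.step += 1
        if job.step % job.checkpoint_every == 0:
            # Periodic checkpoint, taken without involving the application.
            job.last_checkpoint = job.step
    return True


def run_with_migration(job, sites, failures):
    """Try each site in turn, resuming from the latest checkpoint after a failure."""
    for site in sites:
        if run(job, site, fail_at=failures.get(site)):
            return site
    return None
```

For example, a 100-step job checkpointed every 10 steps that fails at step 37 on the first site resumes at step 30 on the second site, so only 7 steps of work are repeated rather than 37.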

HPC4U Technology: Internet Grid Migration

HPC4U worked on the realization of an SLA-aware Grid fabric, which consists of multiple elements. An open-source resource management system, OpenCCS (www.openccs.eu), developed by the University of Paderborn, forms the top layer. It manages the cluster as a whole and serves as the master interface to upper-layer clients. Within HPC4U, the Technical University of Berlin integrated the resource management system with the Globus Toolkit 4 to set up a Grid-ready environment able to migrate jobs over the Internet to available remote resources. This integration also allows Grid middleware components to negotiate Service Level Agreements (SLAs). The resource management system accepts only those jobs whose SLA can be fulfilled under the current system conditions, and it is responsible for fulfilling all agreed SLAs even in case of failures such as resource outages.

At the cluster level, the resource management system interacts with several subcomponents that offer fault tolerance mechanisms. IBM's MetaCluster checkpointing subsystem provides process fault tolerance, while the Seanodes storage subsystem (VSM/Metanode and Exanodes) offers storage virtualisation coupled with fault tolerance. The third subsystem, in charge of networking, consists of Scali MPI libraries and the Dolphin SCI interconnect.
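The admission rule mentioned above, under which the resource management system accepts only jobs whose SLA can be fulfilled in the current system condition, might be sketched as follows. This is an illustrative assumption, not the OpenCCS logic: the `SlaRequest` fields, the `can_accept` function and the idea of holding back spare nodes for failover are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SlaRequest:
    """Hypothetical SLA request: resources needed and the agreed deadline."""
    nodes: int
    runtime_hours: float
    deadline_hours: float  # time from now by which the job must be finished


def can_accept(request, free_nodes, spare_nodes=1):
    """Accept only if the SLA looks satisfiable right now.

    Assumption: `spare_nodes` are kept in reserve so that a failed node
    can be replaced without breaking an SLA that was already agreed.
    """
    enough_nodes = request.nodes + spare_nodes <= free_nodes
    meets_deadline = request.runtime_hours <= request.deadline_hours
    return enough_nodes and meets_deadline
```

With 8 free nodes and one node held in reserve, a request for 7 nodes would be accepted and a request for 8 nodes rejected, because the second request leaves no headroom for a node outage.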
