Guided Research

We also offer Guided Research occasionally. Please contact Josef Weidendorfer or Dai Yang with your research interest.

Please check our research interests first; requests with an out-of-scope research topic may not get an answer.

Analysis and Implementation of In-Memory and Near-Node Checkpointing/Restart Mechanism for HPC Applications

Background

In a current project at our chair, we are analyzing modern High Performance Computing (HPC) systems with heterogeneous architectures towards exascale computing. Major challenges in exascale computing include an increasing number of nodes, dynamic resource allocation and organization, and fault resilience. So far, we have developed an extensible yet lightweight library (LAIK) to dynamically manage the application workload for better load balancing and proactive fault tolerance. A central element to achieve full functionality in our library is to provide recovery-based reactive fault tolerance.

In this master’s thesis, a strategy for reactive fault resilience based on in-memory and near-node checkpointing mechanisms is to be developed and integrated into our LAIK library. By the end of this project, our library shall be capable of dynamically recovering from an arbitrary number of node failures. Existing checkpoint/restart approaches for application fault tolerance in HPC shall be analyzed and eventually adapted to be used within LAIK. By demonstrating their functionality on an example MPI-based application, the efficiency and performance of these algorithms shall be assessed and validated. Analysis shall be done on state-of-the-art hardware such as the Linux clusters and the SuperMUC at LRZ (Leibniz Supercomputing Center).

More Information

Contact: Dai Yang

Comparison and Integration of Fault Resilience Mechanisms for Distributed Applications

In a current project at our chair, we are analyzing modern High Performance Computing (HPC) systems with heterogeneous architectures towards exascale computing. Major challenges in exascale computing include an increasing number of nodes, dynamic resource allocation and organization, and fault resilience. So far, we have developed an extensible yet lightweight library (LAIK) to dynamically manage the application workload for better load balancing and proactive fault tolerance. This way, an upcoming failure can be avoided by proactively migrating application data to another physical location. Furthermore, by using our library, a global rebalancing can be triggered, ensuring application load balancing.

Our project partners at RWTH Aachen have developed an application-transparent framework in which running applications can be migrated to another physical location to overcome a failure by using virtualization or container technology.

In this master’s thesis, the performance and complexity of these two approaches as strategies for application fault tolerance are to be compared and analyzed. In addition, a way for the two mechanisms to collaborate (e.g. in decision making) is to be designed and developed.

More Information

Contact: Dai Yang

Porting NPB for Application Integrated Load Balancing and Fault Tolerance

In a current project at our chair, we are analyzing modern High Performance Computing (HPC) systems with heterogeneous architectures towards exascale computing. Major challenges in exascale computing include an increasing number of nodes, dynamic resource allocation and organization, and fault resilience. So far, we have developed an extensible yet lightweight library (LAIK) to dynamically manage the application workload for better load balancing and proactive fault tolerance. This way, an upcoming failure can be avoided by proactively migrating application data to another physical location. Furthermore, by using our library, a global rebalancing can be triggered, ensuring application load balancing.

To assess and improve the performance of our library, runtime results from suitable high performance benchmarks are required. One of the most common benchmark suites is the NAS Parallel Benchmarks (NPB). It mimics the data flow and computations of different kinds of typical HPC applications. These benchmarks are written in C and/or FORTRAN. In this master’s thesis, a selected subset of the NPB is to be ported using our LAIK library. Performance tests and analyses are to be conducted on the ported NPB benchmarks.

More Information

Contact: Dai Yang