Master Seminar SS10
Programming Models and Tools for High Performance Computing

 Prof. Dr. Michael Gerndt

Organization

Information Meeting / Vorbesprechung: 9.02.10, 14:00, 01.06.020

This seminar will be organized as a block course.

The goal of this seminar is to stimulate an intensive discussion of selected topics in high performance computing. Each participant therefore has to deliver a 10-page report before their presentation. All participants are asked to read the reports carefully so that they can take part in the discussions.

Registration

By email to Prof. Dr. Michael Gerndt (gerndt@in.tum.de). Please select one of the topics.

Topics

Student Advisor date

Molecular Dynamics Simulation and Charm++ parallel programming language

Charm++ is a machine-independent parallel programming system. Programs written using this system run unchanged on MIMD machines with or without shared memory. It provides high-level mechanisms and strategies that ease the development of even highly complex parallel applications. Molecular dynamics (MD) simulation is a form of computer simulation in which atoms and molecules are allowed to interact for a period of time according to approximations of known physics, giving a view of the motion of the particles. Because such simulations demand large computing resources, many parallel methods have been applied to them. NAMD is a state-of-the-art parallel molecular dynamics code, based on the Charm++ parallel language, designed for high-performance simulation of large biomolecular systems. Currently, NAMD scales to hundreds of processors on high-end parallel platforms and tens of processors on commodity clusters using Gigabit Ethernet.
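To make the idea of an MD time step concrete, the following is a minimal sequential sketch in Python. It is not how NAMD or Charm++ implement it: the two-particle system, the harmonic bond, the velocity Verlet integrator, and all names and parameter values are assumptions chosen for illustration.

```python
# Minimal molecular dynamics sketch: two particles in 1D joined by a
# harmonic "bond", advanced with the velocity Verlet integrator.
# All names and parameter values are illustrative assumptions.

def force(x1, x2, k=1.0, r0=1.0):
    """Harmonic bond force on particle 1; particle 2 feels the opposite."""
    stretch = (x2 - x1) - r0
    return k * stretch

def total_energy(x1, x2, v1, v2, m=1.0, k=1.0, r0=1.0):
    """Kinetic plus potential energy of the two-particle system."""
    kinetic = 0.5 * m * (v1 * v1 + v2 * v2)
    potential = 0.5 * k * ((x2 - x1) - r0) ** 2
    return kinetic + potential

def simulate(steps, dt=0.01, m=1.0):
    """Advance the system by `steps` time steps and return its final state."""
    x1, x2 = 0.0, 1.5   # the bond starts stretched by 0.5
    v1, v2 = 0.0, 0.0
    f = force(x1, x2)
    for _ in range(steps):
        # Half-step velocity update with the current force.
        v1 += 0.5 * dt * f / m
        v2 -= 0.5 * dt * f / m
        # Full-step position update.
        x1 += dt * v1
        x2 += dt * v2
        # Recompute the force at the new positions, then finish the velocity update.
        f = force(x1, x2)
        v1 += 0.5 * dt * f / m
        v2 -= 0.5 * dt * f / m
    return x1, x2, v1, v2
```

In a realistic simulation the force evaluation over many interacting atoms dominates the run time, which is why it is the part that parallel MD codes distribute across processors.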

  Haowei Huang
High performance computing using accelerators

In the last few years, computational accelerators have emerged and taken a firm foothold. Accelerators are computing components containing functional units, together with memory and control systems, that can be easily added to computers to speed up portions of an application. They can also be aggregated into groups to accelerate larger problems. Accelerators are not a new phenomenon: in the 1980s, for instance, Floating Point Systems sold attached processors like the AP-120B with a peak performance of 12 Mflop/s, easily 10 times faster than the general-purpose systems they were connected to. Today they come in various types, such as General Purpose Graphics Processing Units (GPGPUs), Field Programmable Gate Arrays (FPGAs), and ClearSpeed's floating-point accelerators.

A GPU is a dedicated processor for rendering graphics. There are two dominant producers of high-performance GPU chips: NVIDIA and ATI (now part of AMD). Most GPU programs are written in a shading language such as GLSL, the OpenGL Shading Language (Linux, Windows), or HLSL (Windows). FPGAs have a long history in embedded processing and specialized computing, including DSP, ASIC prototyping, medical imaging, and other compute-intensive areas. An important differentiator between FPGAs and other accelerators is that they are reprogrammable at the hardware level. The dominant FPGA chip vendors are Xilinx and Altera. ClearSpeed Technology produces a board that is designed to accelerate floating-point calculations. This board plugs into a PCI-X slot, has a clock frequency of 500 MHz, and contains 96 floating-point functional units that can each perform a double-precision multiply-add in one cycle.
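The theoretical peak rate implied by the ClearSpeed figures quoted above can be worked out as units times operations per cycle times clock frequency. The sketch below is in Python; the function name is hypothetical, and counting one multiply-add as two floating-point operations is an assumption of this example, not a figure from the vendor.

```python
def peak_flops(units, flops_per_unit_per_cycle, clock_hz):
    """Theoretical peak rate in floating-point operations per second."""
    return units * flops_per_unit_per_cycle * clock_hz

# Figures quoted in the text; counting one multiply-add as two
# floating-point operations is an assumption of this sketch.
clearspeed_peak = peak_flops(units=96, flops_per_unit_per_cycle=2, clock_hz=500e6)
print(f"{clearspeed_peak / 1e9:.0f} Gflop/s")
```

This is an upper bound derived from the stated numbers only; sustained application performance is normally lower.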

  Shaveta Tatwani

Performance Analysis of Petascale Applications with the HPCToolkit

Cutting-edge science and engineering applications require petascale computing. It is, however, a significant challenge to use petascale computing platforms effectively. Consequently, there is a critical need for performance tools that enable scientists to understand impediments to performance on emerging petascale systems. In this paper, we describe HPCToolkit, a suite of multi-platform tools that supports sampling-based analysis of application performance on emerging petascale platforms. HPCToolkit uses sampling to pinpoint and quantify both scaling and node performance bottlenecks. We study several emerging petascale applications on the Cray XT and IBM BlueGene/P platforms and use HPCToolkit to identify specific source lines, in their full calling context, associated with performance bottlenecks in these codes. Such information is exactly what application developers need to know to improve their applications to take full advantage of the power of petascale systems.
  Michael Gerndt
Dynamic instrumentation infrastructure for Grid/Cloud services

Grid/Cloud computing embodies the vision of applications having on-demand, ubiquitous access to distributed services running on diverse managed resources, such as computation, storage, instruments, and networks, that are owned by multiple administrators. Many Grid workflow middleware services require knowledge about the performance behavior of Grid applications and services in order to effectively select, compose, and execute workflows in dynamic and complex Grid systems. Moreover, Grid workflows introduce multiple levels of abstraction, and all levels must be taken into account in order to understand the performance behavior of a workflow. Hence, any instrumentation infrastructure for Grid workflows should help the user or tool to conduct monitoring and analysis in a targeted way. As an outcome of this seminar, the student should understand how to carry out performance analysis for Grid/Cloud applications.

 

  Shajulin Benedict

Partitioned Global Address Space Programming Model

An overview of the concepts, the languages that implement the model, its advantages and disadvantages compared to MPI/OpenMP, and tricks and pitfalls in achieving performance.

 

Falco Cescolini Ventsislav Petkov
Performance tuning techniques for CUDA

CUDA is a parallel computing architecture developed by NVIDIA for its GPUs that allows access to the native instruction set and memory of the parallel computational elements in the GPUs. GPUs are parallel many-core processing units capable of running thousands of threads simultaneously. They offer high performance gains for applications that suit the architecture. Because GPUs are intrinsically different from CPUs, development techniques for the two architectures differ as well.

Briefly outline the differences between the two architectures, with emphasis on the advantages and limitations of CUDA. List performance tuning techniques for CUDA, with a supporting explanation of each. Performance techniques should constitute at least 60% of your report/presentation.

Claudia Simion Houssam Haitof

Performance analysis on GPGPU based Architectures

The use of modern graphics processing units (GPUs) for non-graphical high performance computing has recently become more and more popular. The trend is motivated by their high computational speed and attractive cost/performance ratio. Although modern GPUs permit high throughput and exploit more parallelism, achieving good application performance is a complicated task.

The performance analysis procedure itself is also more difficult. This seminar topic will provide insight into performance analysis of general-purpose applications running on GPUs (GPGPU), the performance analysis tools available, and important optimization issues.

Shulei Zhu

  
Yury Oleynik
Programming Models for Scalable Multicore Architectures

To meet the demand for ever more computational power, the underlying hardware is becoming more and more parallel. The number of computational cores per chip is growing fast, and the use of heterogeneous systems is soaring. This also affects the area of High Performance Computing, where parallelization on clusters or supercomputers has already been in heavy use for many years.

Parallelism at the hardware level can be found in many layers. Almost all general-purpose processing units use several functional units in one core to complete more operations per clock cycle. To make use of these capabilities, the processor schedules and reorders instructions, which is heavily influenced by the dependencies between them.

Logically independent execution streams can be scheduled as separate threads, which improves the utilization of the available functional units and pipelines on multithreaded CPUs.

Within these functional units you can find another level of parallelism, such as VLIW or SWAR (SIMD within a register), which applies computational operations to multiple words at once.
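As an illustration of the SWAR idea, the following Python sketch adds four 8-bit values packed into one 32-bit word using ordinary integer operations. The lane layout and the helper names (`pack`, `unpack`, `swar_add_u8x4`) are assumptions made for this example and do not refer to any particular processor's instruction set.

```python
# SWAR (SIMD within a register) sketch: add four 8-bit lanes packed
# into one 32-bit word with ordinary integer operations.

LOW7 = 0x7F7F7F7F   # low 7 bits of every lane
HIGH = 0x80808080   # top bit of every lane

def swar_add_u8x4(a, b):
    """Lane-wise add of four unsigned bytes, wrapping modulo 256 per lane."""
    # Add the low 7 bits of each lane; carries cannot cross a lane boundary.
    low = (a & LOW7) + (b & LOW7)
    # Fold in the top bits with XOR, which adds them without a carry-out.
    return (low ^ ((a ^ b) & HIGH)) & 0xFFFFFFFF

def pack(lanes):
    """Pack four byte values into one word; lanes[0] is the lowest byte."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & 0xFF) << (8 * i)
    return word

def unpack(word):
    """Split a 32-bit word back into its four byte lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]
```

For example, `unpack(swar_add_u8x4(pack([1, 200, 255, 16]), pack([2, 100, 1, 240])))` yields `[3, 44, 0, 0]`: each lane wraps around independently, and no carry leaks into its neighbor.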

The current trend in CPU architecture is to put more and more cores on a single chip. These cores can differ considerably in their capabilities and in how they connect to an on-chip network for communication with other cores and memory.

The next level combines multiple chips in one machine, then, in the case of smaller machines, many machines in a cluster, and finally even several clusters connected over the Internet, all of which must be addressed by a programming model.


There are many different approaches on how to fetch, compute and manage the data within a processor. SIMD, MIMD, SPMD, Vector Processing or Stream Processing are some common processing models – each with its special advantages and disadvantages. Processing models expose the functionality of the processor and are specified by the vendor. They are optimized for performance and simplicity.
 

  Marcel Meyer
HPC - made for applications: splitting the productivity and efficiency layers

"Writing programs that scale with increasing numbers of cores should be as easy as writing programs for sequential computers." This is the challenge that industry now poses to academia, as multicore architectures and HPC enter the market. The need for more computational power is clearly visible in existing large applications. The question now is whether researchers can meet this requirement by providing suitable programming models and frameworks along with the new architectures.

Projects like ParLab and ACES III have already taken first steps towards a solution. Whether by reviving the programming patterns paradigm or by sticking to high-level languages, the key to the problem seems to lie in splitting the productivity and efficiency layers. Whether this is possible, how it could be realized, and what advances have been made so far are to be researched and discussed in this topic.
 
  Anca Berariu

More information: gerndt@in.tum.de.