ICS 2014 features three half-day tutorials. The tutorials will be given in parallel with the afternoon sessions of the conference.
Wednesday, June 11th, 14:00-18:30
D.K. Panda and Jithin Jose (Ohio State)
Multi-core processors, accelerators (GPGPUs), coprocessors (Xeon Phis) and high-performance interconnects (InfiniBand, 10 GigE/iWARP and RoCE) with RDMA support are shaping the architectures of next-generation clusters. Efficient programming models for designing applications on these clusters, as well as on future exascale systems, are still evolving. Partitioned Global Address Space (PGAS) models provide an attractive alternative to the traditional Message Passing Interface (MPI) model owing to their easy-to-use global shared memory abstractions and lightweight one-sided communication. Hybrid MPI+PGAS programming models are gaining attention as a possible solution to programming exascale systems. These hybrid models help codes designed using MPI take advantage of PGAS models without paying the prohibitive cost of redesigning complete applications. They also enable hierarchical design of applications, using the different models to suit modern architectures.
In this tutorial, we provide an overview of the research and development taking place along these directions and discuss the associated opportunities and challenges as we head toward exascale. We start with an in-depth overview of modern system architectures with multi-core processors, GPU accelerators, Xeon Phi coprocessors and high-performance interconnects. We present an overview of language-based and library-based PGAS models, with a focus on two popular models, UPC and OpenSHMEM. We introduce MPI+PGAS hybrid programming models and highlight the advantages and challenges of designing a unified runtime to support them. We examine the challenges in designing high-performance UPC, OpenSHMEM and unified MPI+UPC/OpenSHMEM runtimes. We present case studies using application kernels to demonstrate how one can exploit hybrid MPI+PGAS programming models to achieve better performance without rewriting the complete code. Using the publicly available MVAPICH2-X software package (http://mvapich.cse.ohio-state.edu/overview/mvapich2x/), we provide concrete case studies and an in-depth evaluation of runtime and application-level designs that are targeted at modern system architectures with multi-core processors, GPUs, Xeon Phis and high-performance interconnects.
Thursday, June 12th, 14:00-18:30
Helmar Burkhart, Danilo Guerrera (University of Basel)
Today’s microarchitectures are fascinating because their compute power allows affordable computational experiments that trigger breakthroughs in many science disciplines. But their performance does not come for free: the programming task is rather complex and time-consuming. High-level approaches to HPC software development promise increased programmer productivity with only a limited performance loss.
In the tutorial we present current approaches such as domain-specific languages and pattern-oriented frameworks and compare them to standard HPC programming practice in terms of both performance and productivity. We will introduce the class of stencil calculations, which are part of many scientific kernels in fields such as image processing, meteorology, computational engineering, and life sciences. Because of their rather low arithmetic intensity, stencils are hard to optimize on manycores. We will demonstrate the practical usage of stencil compilers such as Pochoir and Pluto and will provide a hands-on session using our own tool PATUS (Parallel AutoTUned Stencils), which generates code for architectures such as AMD Magny-Cours, Intel Sandy Bridge and Intel Xeon Phi. We will discuss performance-relevant parameters of modern architectures and introduce the basics of automatic software tuning.
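To make the notion of a stencil and its low arithmetic intensity concrete, here is a minimal sketch of a 5-point Jacobi sweep in plain Python. It is an illustration only and is not code produced by PATUS, Pochoir or Pluto; the function names `jacobi_step` and `arithmetic_intensity` are hypothetical, and the byte counts assume 8-byte doubles and ignore write-allocate traffic.

```python
def jacobi_step(grid):
    """Apply one 5-point Jacobi sweep to the interior of a 2D grid.

    `grid` is a list of equally long lists of floats. Boundary cells
    are copied unchanged; each interior cell becomes the average of
    its four direct neighbours.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # copy so boundary values carry over
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # 3 additions + 1 multiplication = 4 flops per grid point
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new


def arithmetic_intensity(flops_per_point, bytes_per_point):
    """Flops performed per byte moved to or from main memory."""
    return flops_per_point / bytes_per_point


# Assumption: perfect cache reuse, so each point costs one 8-byte read
# and one 8-byte write of main-memory traffic.
best_case = arithmetic_intensity(4, 16)

# Assumption: no cache reuse at all, so four neighbour reads plus one write.
worst_case = arithmetic_intensity(4, 40)

if __name__ == "__main__":
    demo = [[1.0, 1.0, 1.0],
            [1.0, 0.0, 1.0],
            [1.0, 1.0, 1.0]]
    print(jacobi_step(demo)[1][1])
    print(best_case, worst_case)
```

Under these assumptions the kernel performs well under one flop per byte even in the best case, which is why stencil performance on manycores tends to be limited by memory bandwidth rather than by compute throughput, and why transformations that improve data reuse matter so much.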
Friday, June 13th, 13:30-18:00
Eduard Ayguadé, Rosa M. Badia and Vladimir Subotic (UPC Barcelona)
One of the biggest problems in the current computing industry is the increasing gap between highly parallel hardware and mostly sequential software. Parallelism has therefore become a concern for every programmer. However, parallelizing applications is still far from trivial, and the topic is sometimes too narrowly focused on scientific numerical applications.
This session will start by describing how we teach parallelism to undergraduate students (task decomposition, task graph, parallelism and speedup metrics, geometric versus divide-and-conquer, …). Then we will present a set of tools designed to help students retain the knowledge through hands-on practice. Finally, the session will present a top-down approach to exploring different parallelization strategies and understanding their potential benefits. All the tools are publicly available and distributed by the Barcelona Supercomputing Center (BSC-CNS).
Tareador (a tool developed by BSC) provides a very intuitive way for a student (i.e. a future parallel programmer) to find the optimal parallelization strategy for a sequential application. The Tareador API allows a programmer to annotate an arbitrary decomposition of the code into tasks. Tareador then executes the annotated application and evaluates the potential parallelism of the specified task decomposition. Specifically, Tareador provides the user with two outputs:
- Dependency graph of all annotated task instances (identifying the memory objects, or parts of them, that cause dependences).
- Simulation of the potential parallel execution on a specified number of processors (assuming a very simple architecture).
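The two outputs listed above can be illustrated with a small sketch. The code below is not Tareador and does not use its API; it is a hypothetical Python model in which a task decomposition is given as a dictionary of task durations plus a dictionary of dependences, and in which the "very simple architecture" is assumed to be a set of identical processors with no communication cost. The names `critical_path` and `simulate` and the example task graph are invented for illustration.

```python
import heapq


def critical_path(durations, deps):
    """Length of the longest dependence chain (often written T_inf)."""
    finish = {}

    def visit(task):
        if task not in finish:
            longest_pred = max((visit(d) for d in deps.get(task, ())), default=0)
            finish[task] = durations[task] + longest_pred
        return finish[task]

    return max(visit(t) for t in durations)


def simulate(durations, deps, processors):
    """Greedy list scheduling on identical processors; returns the makespan."""
    remaining = {t: len(deps.get(t, ())) for t in durations}
    children = {t: [] for t in durations}
    for task, preds in deps.items():
        for p in preds:
            children[p].append(task)

    ready = sorted(t for t, n in remaining.items() if n == 0)
    running = []  # min-heap of (finish_time, task)
    now, free = 0, processors
    while ready or running:
        # Start as many ready tasks as there are free processors.
        while ready and free > 0:
            task = ready.pop(0)
            heapq.heappush(running, (now + durations[task], task))
            free -= 1
        # Advance time to the next task completion.
        now, task = heapq.heappop(running)
        free += 1
        for child in children[task]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)
    return now


# Hypothetical decomposition: one init task, four independent work
# tasks, and a final reduction that depends on all four.
durations = {"init": 1, "a": 4, "b": 4, "c": 4, "d": 4, "reduce": 1}
deps = {"a": ["init"], "b": ["init"], "c": ["init"], "d": ["init"],
        "reduce": ["a", "b", "c", "d"]}

if __name__ == "__main__":
    t1 = sum(durations.values())           # total sequential work
    t_inf = critical_path(durations, deps)
    print("parallelism T1/T_inf:", t1 / t_inf)
    for p in (1, 2, 4):
        print(p, "processors: speedup", t1 / simulate(durations, deps, p))
```

For this example graph the total work is 18 time units and the critical path is 6, so no number of processors can yield a speedup above 3. Comparing that bound with the simulated speedups for different decompositions is the kind of reasoning the tool is meant to support.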
Throughout the hands-on session, attendees will use these tools to analyze the potential parallelization of a few popular application kernels. We also propose an iterative top-down approach for finding a suitable task decomposition of a sequential code. In the first part of the hands-on session, the participants apply this approach manually in order to thoroughly grasp the essentials of Tareador. In the second part of the session, the participants are encouraged to use Tareador's automatic exploration of potential decompositions in order to quickly explore parallelization strategies in more complicated applications.
At this stage of development, Tareador does not generate the final parallel code, but it gives the programmer enough information to understand the task decomposition and the interactions among tasks. From there, the programmer can implement the decomposition using their favorite parallel programming model (in our case, OpenMP 3.0).