
IT Center Events

HPC Events : 1st VI-HPS Tuning Workshop

07 January 2021

VI-HPS Logo

VI-HPS Tuning Workshop

Wednesday, March 5 – Friday, March 7, 2008

following the

SunHPC 2008 Seminar

Monday, March 3 – Tuesday, March 4, 2008

Time

SunHPC 2008 Seminar:
Monday, March 3, 9:00 – Tuesday, March 4, 17:30

VI-HPS Tuning Workshop:
Wednesday, March 5, 9:00 – Friday, March 7, 12:30

Location

Center for Computing and Communication
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Lecture Room and Lab Room 3

Contents

  • Introduction
  • Programming Tools
  • Soft- and Hardware Platforms
  • Documentation
  • Agenda and Slides
  • Costs
  • Registration and Feedback
  • Logistics

Introduction

The first VI-HPS Tuning Workshop, organized by RWTH Aachen University together with the partners of the Virtual Institute for High Productivity Supercomputing (VI-HPS),
  • the Jülich Supercomputing Centre (JSC) of the Forschungszentrum Jülich,
  • the Center for Information Services and High Performance Computing (ZIH) of the Technische Universität Dresden, and
  • the Innovative Computing Laboratory (ICL) of the University of Tennessee,

took place in Aachen in March 2008.

Besides esteemed tools experts from the VI-HPS partners, we were happy to welcome performance experts from Sun Microsystems as special guests, contributing their experience and assistance in using the Sun performance tools.

The mission of the VI-HPS is to improve the quality and accelerate the development process of complex simulation programs in science and engineering. For this purpose, integrated state-of-the-art programming tools for high-performance computing are developed that assist domain scientists in diagnosing programming errors and optimizing the performance of their applications. As training and support are essential components of the VI-HPS activities, we are happy to invite application programmers to bring in their codes, to learn more about state-of-the-art programming tools for high-performance computing, and to get expert assistance in debugging, tuning and parallelization using MPI and OpenMP.

Performance tuning is still often a matter of some experimentation, but we can give you advice on a best effort basis. Application developers can expect to learn about the execution performance of their applications: this insight can be helpful even where it doesn’t directly lead to performance improvements. To maximize the efficiency of the workshop, we would like to ask you to prepare a test case that reflects a typical production run, but does not take too long to execute. In the ideal case, a run should not take more than 5 to 10 minutes to finish.  It is also important to have an easy way of verifying that the results of this test run are correct.

We provided access to and support for the programming tools and the computing platforms listed below:

The Programming Tools

Developing correct and scalable parallel programs is hard and good programming tools may increase the programmer’s productivity considerably. But developing these tools themselves is an art as well and this tuning workshop will be a unique opportunity for application programmers to directly interact with tool developers when working on their application codes.

In order to get some basic insight into how performance tools work, we would like to point to the introduction „Profiling and Tracing in Linux“ by Sameer Shende (pdf), which largely applies to other operating systems as well. A summary for a first reading is provided here.

The tools which were presented and available in the context of this workshop were

  • Sun Performance Analyzer is a proprietary tool from Sun Microsystems which is freely available for Linux and Solaris together with the Sun Studio compiler suite. The tool primarily targets serial and shared-memory parallel program analysis and has some limited features for MPI program analysis.
  • Acumem Virtual Performance Expert (VPE) is a commercial tool from Acumem to analyze the cache behaviour of a program.
  • ompP is a free profiling tool for OpenMP applications currently developed in the ICL.
    User Guide (pdf)
  • OPARI is a free source-to-source translation tool for instrumenting OpenMP codes, developed at the JSC. OPARI is used in conjunction with ompP, KOJAK and VampirTrace.
  • VampirTrace is a free MPI tracing library provided by the ZIH, generating trace files in the open trace format (OTF). Such a trace file can be displayed with the VAMPIR tool.
    User Guide (pdf)
  • VAMPIR is a commercial framework and graphical analysis tool developed at the ZIH to display and analyze OTF trace files generated by VampirTrace and other tools.
  • KOJAK / SCALASCA serve the performance analysis of MPI applications with a strong focus on improved scalability. As with KOJAK, trace files are analyzed to automatically detect and classify performance properties, which are then displayed by the CUBE graphical utility.
    With the most recent versions, both tools have been merged into SCALASCA 1.0.
    Quick Reference Guide (pdf)
  • PAPI is a free interface to hardware performance counters developed at ICL which can be used by ompP, KOJAK, SCALASCA and VampirTrace.
    User Guide (pdf)
  • MARMOT is a free correctness checking tool for MPI programs developed at ZIH and HLRS.
  • Sun Thread Analyzer is a proprietary tool from Sun Microsystems which is freely available for Linux and Solaris together with the Sun Studio compiler suite. The tool checks for data races and deadlocks in shared memory parallel programs.

We set up a tools matrix giving an overview of the tools mentioned above.

The Soft- and Hardware-Platforms

The following platforms were available for this tuning workshop:

We concentrated on 64-bit addressing mode on all of these platforms!

Agenda and Slides

We started on Wednesday morning with short presentations of the programming tools and moved on to hands-on work on the machines from Wednesday afternoon onwards. While we provided as much time as possible for practical work, we accommodated more detailed presentations on the tools in parallel to the lab time, upon demand, during the following two days.

Social Dinner

Sun Microsystems is sponsoring a social dinner in the restaurant „Kazan“ on Tuesday evening at 19:00.
See our citymap for the location.

Presentations on Wednesday morning from 9:00 to 12:30

Optional presentations between Wednesday afternoon 14:00 and Friday 12:30 in parallel to hands-on sessions

  • VampirTrace for Instrumentation and Run-Time Measurement, including the user’s view as well as „what’s going on behind the scenes“ – Holger Brunst, Andreas Knüpfer, ZIH (slides), Video
  • Vampir and VampirServer — the guided tour – Holger Brunst, Andreas Knüpfer, ZIH (slides), Video
  • TotalView, a short introduction into parallel debugging (~20 min) – Dieter an Mey, RWTH (slides)
  • Acumem VPE – Mats Nilsson (Acumem) (slides)
  • Scalasca guided tour, tips & tricks (60 min) – Brian Wylie, JSC
  • Scalasca performance properties: „the metric tour“ (30 min) – Markus Geimer, JSC
  • Scalasca/1.0 for KOJAK experts, including integration with VAMPIR (15 min, Thursday only!) – Bernd Mohr, JSC

The Costs

There is no seminar fee. All other costs (e.g. travel, hotel, and meals) are at your own expense.

Registration and Feedback

Registration is closed.

In order to improve the setup of future tuning workshops we heavily rely on your feedback. Please take a few minutes to fill out the online form.

HPC Events : HPC December Workshop 2010 Part II: Array Building Blocks Tutorial (Intel Ct)

07 January 2021

HPC December Workshop 2010 Part II

Array Building Blocks Tutorial (Intel Ct)

Thursday, December 9 – Friday, December 10, 2010

Kindly supported by:  

Date and Time

  • Day 1: Thursday, December 9, 09:00 – 17:15
  • Day 2: Friday, December 10, 09:00 – 15:30

Location

Center for Computing and Communication
RWTH Aachen University
Seffenter Weg 23
52074 Aachen

Seminar Rooms 1 & 2 (Kopernikusstr.)

Introduction

Intel® Array Building Blocks (ArBB) supports a high-level, generalized and portable programming model for data-parallel programming. It simplifies the efficient parallelization of computations over large data sets. Programmers do not need to focus on the implementation details of their data-parallel program, but instead can express a program’s algorithms in terms of operations on collections of data. ArBB’s deterministic semantics avoid race conditions and deadlocks by design, improving reliability and maintainability; ArBB can be used for both rapid prototyping and production-stable codes.

Agenda

Thursday, December 09
09:00 – 10:30 Introduction to Intel ArBB (Ct)
10:45 – 12:30 Introduction to Intel ArBB (Ct) continued
14:00 – 17:00 Hands-On Work
17:00 – 17:15 Wrap-up (optional)

Friday, December 10

09:00 – 10:30 ArBB Execution Engine
10:45 – 12:30 ArBB Advanced Programming
14:00 – 15:30 Hands-On Work

Learning Material

Registration

closed

There will be a Social Dinner on Wednesday, December 08, at 7 p.m. at Restaurant Elisenbrunnen.

HPC December Workshop Part I

Be sure to also consider part I of our December workshop: the Tuning Workshop

Contact

Christian Terboven
Tel.: +49 241 80 24375
Fax: +49 241 80 22134
E-mail: terboven@rz.rwth-aachen.de

Thomas Reichstein
Tel.: +49 241 80 24924
Fax: +49 241 80 22134
E-mail: reichstein@rz.rwth-aachen.de

HPC Events : The UltraSPARC T1 („Niagara“) based Sun Fire T2000 Server

07 January 2021

The UltraSPARC T1 („Niagara“) based Sun Fire T2000 Server


Sun Microsystems UltraSPARC T1 processor

  • Processor: UltraSPARC T1
  • Architecture: SPARC V9
  • Address space: 48-bit virtual, 40-bit physical
  • Cores: (up to) 8 cores running 4 threads each
  • Pipelines: 8 integer units with 6 stages; the 4 threads running on a single core share one pipeline
  • Clock speed: 1.0 GHz (or 1.2 GHz)
  • L1 cache (per core): 16 KB instruction cache, 8 KB data cache (4-way set-associative)
  • L2 cache: 3 MB on chip, 12-way associative, 4 banks
  • Memory controller: four 144-bit DDR2-533 SDRAM interfaces, 4 DIMMs per controller – 16 DIMMs total, optional 2-channel operation mode
  • JBUS interface: 3.1 GB/s peak bandwidth, 128-bit address/data bus, 150 – 200 MHz
  • Technology: CMOS, 90 nm, 9-layer Cu metal
  • Power consumption: 72 W


Since November 2005, Sun Microsystems has been offering the new UltraSPARC T1 processor, codenamed „Niagara“. The processor has 8 parallel cores which are 4-way multi-threaded, i.e. each core can run 4 processes quasi-simultaneously. Each core has an integer pipeline (length 6) which is shared between the 4 threads of the core to hide memory latency.

Since there is only one floating point unit (FPU) shared by all cores, the UltraSPARC T1 processor is suited for programs with few or no floating point operations, like web servers or databases. Nevertheless, the UltraSPARC T1 is binary compatible with the UltraSPARC IV CPU.

Each of the 8 processor cores has its own level 1 instruction and data caches. All cores have access to one common level 2 cache of 3 MB and to the common main memory. Thus the processor behaves like a UMA (uniform memory access) system.


The Sun Fire T2000 Server

In December 2005, a Sun Fire T2000 server with a 1 GHz UltraSPARC T1 „Niagara“ processor from Sun Microsystems was installed at the RWTH Aachen University Center for Computing and Communication (CCC). The system has 8 GB of main memory and runs Solaris 10.

After the upgrade of all big Sun Fire servers of the Center with dual-core UltraSPARC IV chips in late 2004, and after the recent purchase of 4 new dual-core Opteron based V40z servers, the installation of the UltraSPARC T1 based Sun Fire T2000 system marks an important step towards the employment of new types of micro-processors with chip multi-processing (CMP) and chip multi-threading (CMT) technologies, which will most likely dominate the market of HPC systems in the future.


An „integer-only“ Machine in a Center of Excellence for Engineering Sciences and Computational Fluid Dynamics?

Why install a machine which is only capable of delivering some 100 MFlop/s in a compute environment dominated by technical applications?

At first sight, this does not seem to fit well. But we want to be prepared for future technologies. Future multi-threading processors will surely be capable of executing floating point operations at the same rate as the Niagara processor executes integer operations today. So we want to look at the question of how to use this kind of architecture properly. Will this kind of machine be able to suit our needs in the future?

The stagnation of the performance growth of single processors is very bad news for the HPTC community. As a consequence, parallelization is getting even more important. For many engineering and scientific applications parallelization is not at all trivial. Unfortunately, in our environment we do not see many codes which are „embarrassingly parallel“. Therefore we think that in the future parallelization has to happen on multiple levels. Hybrid parallelization using MPI plus OpenMP and autoparallelization, and nested parallelization with MPI and with OpenMP, will be needed to keep even more processors busy and to cut down the turnaround time of large simulation jobs.

So we want to investigate to what extent the well-known techniques of MPI and OpenMP programming work on this brand-new chip. Therefore we are looking at several benchmarks and applications which are not dominated by floating point operations.


First Experiences using the Niagara Processor – A Word of Caution!

Of course we are very curious about first performance results using this brand-new processor architecture. And most likely others are curious to see our first benchmark results, too. But please handle these results with care. Performance results can only be as good as the people who run the experiments, and we are new to this system. They can only be as good as the compiler’s support for the architecture, and the Sun Studio compiler version 11 is the first version to support the Niagara processor. Several people were working on the systems at the same time; we try not to interfere with each other, but you never know. Also, time is always short – we might have overlooked something.

So take all these numbers as preliminary!

We have been using the compiler switches -fast [-xtarget=ultraT1] [-g] [-xarch=v9b] throughout the tests unless otherwise noted.

The picture shows the machines which have been used for many of the comparisons below. The Sun Fire T2000 is the 2U silver box on top of two blue Sun Fire E2900 boxes below.

Indeed, after receiving a hint from Sun Microsystems to change a system parameter, we found out that this parameter can heavily impact the performance of the Sun Fire T2000. So we basically have to repeat all our measurements! We are also awaiting another patch for the system software …
So far we have added set consistent_coloring=2 to the /etc/system file.
Have a look at the altered performance curve of John the Ripper.


A Look at the Memory Performance of the Sun Fire T2000

A tiny serial program measures the memory latency using pointer chasing with a long sequence of identical instructions:

p = (char **)*p;

A look into the disassembly reveals that in fact one load instruction follows another:

... 
ld [%g5], %g2 
ld [%g2], %g1 
ld [%g1], %o7 
ld [%o7], %o5 
ld [%o5], %o4 ...

The content of the memory location just read from memory delivers the address for the following load instruction. Thus these load instructions cannot be overlapped. The measured memory latency is: 107 ns
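
For illustration, a minimal serial sketch of such a pointer-chasing kernel could look as follows (illustrative code, not the exact benchmark we used; in practice the chain is often built from a random permutation to defeat hardware prefetching):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)                     /* pointer-sized elements, ~128 MB on 64 bit */

int main(void)
{
    void **buf = malloc(N * sizeof(void *));
    size_t step = 64 / sizeof(void *);  /* one cache line (64 bytes) per hop */
    for (size_t i = 0; i < N; i++)
        buf[i] = (void *)&buf[(i + step) % N];

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    void **p = (void **)buf[0];
    for (long i = 0; i < 100000000L; i++)
        p = (void **)*p;                /* each load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("average load latency: %.1f ns (p=%p)\n", ns / 1e8, (void *)p);
    return 0;
}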

Now how about memory bandwidth? The code segment which is timed is

long *x, *xstart, *xend, mask; 
... 
for ( x = xstart; x < xend; x++ ) *x ^= mask;

So each loop iteration involves one load and one store of a variable of type long. The memory footprint of this loop is always much larger than the level 2 cache, so each load and store operation goes to the main memory. The memory bandwidth which has been measured depends on the size of the long type, which is 4 bytes when compiled for 32-bit addressing and 8 bytes when compiled for 64-bit addressing: 463 MB/s and 873 MB/s respectively.
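
A possible way to time this loop and derive the bandwidth is sketched below (a simplified stand-alone version with illustrative sizes, not the exact benchmark code):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    size_t n = 64UL * 1024 * 1024;          /* much larger than the 3 MB L2 cache */
    long *xstart = calloc(n, sizeof(long));
    long *xend = xstart + n, mask = 0x55;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long *x = xstart; x < xend; x++)   /* one load and one store per element */
        *x ^= mask;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("bandwidth: %.0f MB/s\n",        /* 2 * sizeof(long) bytes moved per iteration */
           2.0 * n * sizeof(long) / sec / 1e6);
    return 0;
}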

Now, the idea of the multi-threading architecture, as explained in a presentation given by Partha Tirumalai and Ruud van der Pas during the SunHPC colloquium in Aachen in October 2004, is to bridge the growing gap between processor and memory speed by overlapping the stall time of one thread waiting for data from memory with the activity of other threads running on the same hardware, thus leading to a much better utilisation of the silicon.

So an obvious experiment is to run the same kernels measuring memory latency and bandwidth several times in parallel, in order to find out to what extent multiple threads running on the same processor, or even on the same core of this processor, interfere with each other. For this purpose we took the same tiny program kernels, parallelized them using MPI and used explicit processor binding to carefully place processes onto the processor cores. The given numbers are for 64-bit addressing mode.
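
The MPI variant of the bandwidth kernel might look roughly like this (a sketch; processor binding was done outside the program with Solaris tools such as pbind and is not shown):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    size_t n = 32UL * 1024 * 1024;                 /* private 256 MB working set per process */
    long *x = calloc(n, sizeof(long)), mask = 0x55;

    MPI_Barrier(MPI_COMM_WORLD);                   /* start all processes together */
    double t0 = MPI_Wtime();
    for (size_t i = 0; i < n; i++) x[i] ^= mask;   /* one load + one store per element */
    double bw = 2.0 * n * sizeof(long) / (MPI_Wtime() - t0) / 1e6;

    double bw_sum;
    MPI_Reduce(&bw, &bw_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d processes: rank 0 sees %.0f MB/s, total %.0f MB/s\n", size, bw, bw_sum);

    free(x);
    MPI_Finalize();
    return 0;
}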

For comparison we include measurement results on the Sun Fire E2900 which is equipped with 12 UltraSPARC IV processors running at 1200 MHz.

# MPI processes | Niagara: # cores used | # threads per core | latency [ns] | BW per process [MB/s] | total BW [MB/s] | SF E2900: latency [ns] | BW per process [MB/s] | total BW [MB/s]
 1 | 1 | 1 | 107 |  863 |   863 | ~232 | ~1813 | ~1813
 2 | 1 | 2 | 107 |  825 |  1650 | ~232 | ~1601 | ~3202
 2 | 2 | 1 | 107 |  861 |  1722 |      |       |
 4 | 1 | 4 | 108 |  705 |  2820 | ~249 | ~1200 | ~4800
 4 | 2 | 2 | 108 |  820 |  3280 |      |       |
 4 | 4 | 1 | 108 |  859 |  3436 |      |       |
 8 | 2 | 4 | 110 |  698 |  5584 | ~250 |  ~857 | ~6856
 8 | 4 | 2 | 109 |  802 |  6416 |      |       |
 8 | 8 | 1 | 109 |  847 |  6776 |      |       |
16 | 4 | 4 | 113 |  669 | 10706 | ~262 |  ~446 | ~7136
16 | 8 | 2 | 113 |  426 |  6816 |      |       |
24 | – | – |  –  |   –  |    –  | ~314 |  ~350 | ~8400
32 | 8 | 4 | 129 |  144 |  4608 |      |       |

measuring memory latency and bandwidth with a parallel kernel program

The experiment nicely shows that the memory performance scales quite well. The memory latency increases only up to 129 ns when running 32 processes. For up to 8 processes it is profitable to distribute them across all eight cores instead of filling some of the cores with processes and leaving others empty. The only surprising exception is the 16-process case, where it seems to be more profitable to start 4 threads on each of 4 cores, leaving the other 4 cores empty.
The kernel program challenges the memory bandwidth considerably and reveals that in such a case the bandwidth might become a limiting factor for performance. The maximum total bandwidth which could be measured is about 12.4 times higher than the bandwidth which can be dedicated to a single process. It seems that the bandwidth is sufficiently scalable for up to eight processes, but may become a limiting factor for more processes in such extreme cases with no data locality.

The lat_mem_rd benchmark, which is part of the LMbench suite, can be used to look a bit closer into memory latency and the memory hierarchy. The pointer chasing mechanism works as described above, but we vary the stride between successive memory accesses and also the memory footprint.

The figure shows the memory performance as measured by the serial lat_mem_rd program for various strides

At first we ran the original serial version, varying the memory footprint between 1 KB and 8 MB and choosing a stride of 8, 16, 32, 64 or 128 bytes. As long as the memory footprint is below 8 KB, all accesses can be satisfied in the L1 cache and the „memory“ latency (the average latency of each load instruction) is 3 ns. When the memory footprint is between 8 KB and 3 MB, the accesses are all satisfied in the L2 cache and the „memory“ latency is about 22 ns, with the only exception being a stride of just 8 bytes. In this case every second load instruction hits the cache line which has previously been fetched into the L1 cache, as the L1 cache line is 16 bytes long. With a stride of 16 bytes or larger, each load instruction misses the L1 cache. If the memory footprint is larger than 3 MB, fetching a cache line of 64 bytes into the L2 cache takes 107 ns, and if the stride is less than 64 bytes, this cache line is reused, leading to lower average latencies.

This kernel program was also parallelized using MPI and run on the Niagara processor with a varying number of processes and with strides of 8, 64 and 8192 bytes. All MPI processes execute the same pointer chasing loop simultaneously, each of course on its private piece of memory. We plot the average latencies over all processes for each measurement.

The figure shows the memory performance as measured by the mpi version of the lat_mem_rd program for a stride of 8 bytes and various number of MPI processes.

This was measured after we changed the system parameters.

With a stride of 8 bytes there is, of course, a lot of cache line reuse. But the most striking information given by the above figure is that for a large memory footprint the latency does not really get worse when the number of MPI processes increases. The L2 cache, which is shared by all cores, leads to a shift of the slope when the number of processes running simultaneously increases, which has to be expected.

The latency for a small memory footprint rises for 16 or more threads. This might be a consequence of sharing the L1 caches between all threads of a single core. Here further investigations would be necessary to understand this effect. This is also the case for larger strides, as can be seen below.

The figure shows the memory performance as measured by the mpi version of the lat_mem_rd program for a stride of 64 bytes and various number of MPI processes.

This was measured before we changed the system parameters, note the difference!

The figure shows the memory performance as measured by the mpi version of the lat_mem_rd program for a stride of 64 bytes and various number of MPI processes.

This was measured after we changed the system parameters, note the difference!

The same is true for a stride of 64 bytes, which leads to cache misses for each load operation. The latency is well below 140 ns in all cases. Again the effect of sharing a common L2 cache is clearly visible.

The figure shows the memory performance as measured by the mpi version of the lat_mem_rd program for a stride of 8192 bytes and various number of MPI processes.

This was measured after we changed the system parameters, which did not make a big difference for this case.

A stride of 8 KB may lead to all kinds of nasty effects which need further investigation. It can be expected that TLB misses played an important role in slowing down the latency in this series of measurements.


The EPCC OpenMP Micro Benchmark

The EPCC OpenMP Micro Benchmark carefully measures the performance of all major OpenMP primitives. The first test focuses on the OpenMP directives and the second test takes a closer look at the performance of parallel loops using the various OpenMP schedule kinds. The Sun Fire T2000 shows a behaviour very similar in all aspects to the Sun Fire E2900 system, except that the T2000 is clearly faster by a factor of roughly two! This correlates nicely with the difference in memory latency.
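
The basic measurement principle of the benchmark is to compare a reference loop with the same work executed inside an OpenMP construct; a minimal sketch in that spirit (not the actual EPCC code) is:

#include <omp.h>
#include <stdio.h>

static void delay(int length)              /* a fixed amount of dummy work */
{
    volatile double a = 0.0;
    for (int i = 0; i < length; i++) a += i;
}

int main(void)
{
    const int reps = 10000, length = 500;

    double t0 = omp_get_wtime();           /* reference: work executed serially */
    for (int r = 0; r < reps; r++) delay(length);
    double tref = omp_get_wtime() - t0;

    t0 = omp_get_wtime();                  /* the same work inside a parallel region */
    for (int r = 0; r < reps; r++) {
        #pragma omp parallel
        delay(length);
    }
    double tpar = omp_get_wtime() - t0;

    printf("overhead of one PARALLEL construct: %.2f us\n",
           (tpar - tref) / reps * 1e6);
    return 0;
}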


NAS Parallel Benchmark NPB 3.2 OpenMP – Integer Sort

The NAS Parallel Benchmark Suite is very well known in the HPC community. It is available in serial, OpenMP and MPI versions and contains one program focusing on integer performance: is. We ran this program for 4 different test cases W, A, B and C, which work on datasets of increasing size.

Comparing the performance of the Sun Fire T2000 system with the Sun Fire E2900 system, it can clearly be seen that the large caches of the UltraSPARC IV processors are profitable for the smaller testcases W, A, and B, whereas for the largest test case C a single Niagara processor running 16 threads outperforms 12 UltraSPARC IV processors.

This was measured after we changed the system parameters, which did not make a big difference for this case.


Integer Stream Benchmark

The OpenMP Stream Benchmark was changed to do integer instead of floating point operations. Both a Fortran and a C++ version were used for the measurements.
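
The core of such an integer stream kernel is the well-known triad loop with integer arrays; a minimal OpenMP sketch of the idea (illustrative, not the code actually used) is:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 20000000L                        /* large enough to exceed the caches */

int main(void)
{
    long long *a = malloc(N * sizeof(long long));
    long long *b = malloc(N * sizeof(long long));
    long long *c = malloc(N * sizeof(long long));
    const long long scalar = 3;

    #pragma omp parallel for               /* initialization (also first touch) */
    for (long i = 0; i < N; i++) { a[i] = 0; b[i] = i; c[i] = 2 * i; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for               /* integer "triad": a = b + scalar * c */
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];
    double t = omp_get_wtime() - t0;

    printf("triad: %.0f MB/s (a[1]=%lld)\n",
           3.0 * N * sizeof(long long) / t / 1e6, a[1]);
    return 0;
}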

Overall the UltraSPARC T1 system scales well, in some cases even up to 32 threads.


Parallel Partitioning of Graphs with ParMETIS

The ParMETIS program package is frequently used for partitioning unstructured graphs, in order to minimize the communication of large-scale numerical simulations. We selected a test case (mtest) which is not dominated by floating point operations. It is used to compute a k-way partitioning of a mesh containing elements of different types on p processors.

The measurements cannot be used to judge scalability, as the partitioning depends on the number of processors used. Here we compare the performance of a 24-way Sun Fire E2900 system, equipped with 12 dual-core UltraSPARC IV processors running at 1200 MHz, with the 1 GHz single-socket Sun Fire T2000 system:

# MPI processes | UltraSPARC IV total MFlop/s | UltraSPARC IV seconds | UltraSPARC T1 seconds | factor UltraSPARC T1 : UltraSPARC IV
 1 |  1 | 1.4 | 3.1 | 2.2
 2 | 15 | 1.5 | 3.3 | 2.2
 4 | 22 | 1.0 | 2.1 | 2.1
 8 | 29 | 0.9 | 1.6 | 2.0
16 | 35 | 0.8 | 1.8 | 2.3
24 | 36 | 0.8 | 2.4 | 2.8
32 |  – |  –  | 3.3 |  –

When using the same number of MPI processes, the UltraSPARC IV based machine solves these problems about two times faster than the Niagara-based system. Remember, we are comparing a 12-socket SMP machine with a single-socket system.


Password Cracking with „John the Ripper“

„John the Ripper“ is a popular password cracking program used by system administrators to search for weak user passwords.
A hardware counter analysis reveals that it does not use many floating point instructions, so it might be a good candidate for the Sun Fire T2000, as a parallel version is publicly available too.
We employed mpich2 to compile and run this parallel version, as the test mode of „John“ uses the SIGALRM signal, which is suppressed by Sun’s MPI implementation (HPC ClusterTools V6).

The performance measure is checks per second and we only looked at traditional DES encryption so far.

The figure shows that the MPI version of John scales very well on the Sun Fire E2900, whereas on the Sun Fire T2000 it scales up to 8 MPI processes and then drops off again. This was measured before we changed the system parameters…
Looking at the absolute performance, the Sun Fire E2900 clearly outperforms the Sun Fire T2000. This was measured before we changed the system parameters…
This figure depicts the scalability of „John“ on the Sun Fire T2000 after we changed the system parameters…

Why does „John the Ripper“ not scale on the Niagara processor?
Now, if the US IV chip does not suffer from stalls, because a program displays such good data locality that data can almost always be kept in the caches, then multiple US IV processors will of course outperform a single Niagara processor. And indeed …


… hardware counter measurements reveal that instruction and data (level 1) cache misses badly hurt the Niagara processor. The number of data cache misses increases when more processes are running per core, as they have to share a common L1 data cache of 8 KB, whereas a single core on the UltraSPARC IV has a data cache of 64 KB.
Plotted are the number of misses per MPI process for the whole test run.
Please note that the scales differ by an order of magnitude between the Sun Fire T2000 (left y-axis) and the Sun Fire E2900 (right y-axis)!

This was measured before we changed the system parameters…

The UltraSPARC IV (US IV) and the Niagara processor are both able to initiate 8 instructions per cycle, the US IV being a dual core 4-issue superscalar processor and the Niagara with 1 instruction per cycle for each of the eight cores.
Now comparing the number of checks per processor chip the 1 GHz Niagara chip even performs a little better than the 1.2 GHz US IV: 1283599 checks/sec versus 1187115 checks/sec.


What if the Niagara could count using floating point numbers …

In many of the PDE solvers running on our machines, a CG-type linear equation solver is used at the heart of the computations. The most time-consuming part of such a solver is frequently a sparse matrix vector multiplication, which accesses the main memory through index lists. Now, how would a processor like the Niagara perform if it had one floating point unit per core, like it has integer units today? In order to find out, we simply changed the data type in one of our sparse matrix vector multiplication codes from double to long long int, knowing that the results would of course not be very meaningful. Well, just an experiment. This is what we get:
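
To illustrate the kind of kernel involved: a sparse matrix vector product in compressed row storage, where the value type can be switched between floating point and integer with a single typedef (a sketch with illustrative names, not our production code):

#include <stddef.h>

/* switch between floating point and integer "arithmetic" with one line */
typedef long long value_t;                 /* was: typedef double value_t; */

/* y = A*x for a matrix in compressed row storage (CRS):
   val[] holds the nonzeros, col[] their column indices,
   row_ptr[i] .. row_ptr[i+1]-1 are the entries of row i */
void spmv_crs(size_t nrows, const size_t *row_ptr, const size_t *col,
              const value_t *val, const value_t *x, value_t *y)
{
    for (size_t i = 0; i < nrows; i++) {
        value_t sum = 0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col[k]];     /* indexed (gather) access to x */
        y[i] = sum;
    }
}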

Sparse matrix vector multiplication on the SF E2900 and the SF T2000 when using 64 bit floating point or integer numbers.
Whereas the Niagara only reaches up to about 100 MFlop/s with floating point numbers, it matches the speed of 12 UltraSPARC IV processors when using integers.

Summary

The single-thread performance of the UltraSPARC IV processor running at 1.2 GHz is about 2 to 3 times higher than the single-thread performance of the 1 GHz Niagara chip for many of the programs which we have looked at so far and which do not execute many floating point operations.
But the lower memory latency even for a high number of threads and the high memory bandwidth lead to a good performance for multi-threaded programs. This is particularly impressive for the class C integer sort benchmark of the NPB OpenMP collection.
Also the ParMETIS results are impressive, when taking into account that the performance of a single Niagara chip is compared to up to 12 UltraSPARC IV chips.



You want to get Hands-on Experience with the Niagara Processor?

Users having accounts on the machines of the CCC can seamlessly use the Sun Fire T2000 system upon request. The system is fully compatible with the other UltraSPARC IV based systems. It is running Solaris 10 and the Sun Studio compilers. Nevertheless we recommend that you load the latest Sun Studio 11 compiler into your environment with

module switch studio studio/11

and then recompile your application using the compiler flag

-xtarget=ultraT1

which will be expanded to

-xarch=v8plus -xcache=8/16/4/4:3072/64/12/32 -xchip=ultraT1

by the compiler. (Keep in mind that the rightmost compiler option dominates if you, for example, want to add the -fast or the -xarch=v9b flag!)

The Solaris 10 operating system treats the system like a full-blown 32-way shared memory machine. Keep in mind that floating point intensive applications will run very slowly! The machine is not built for such applications!

HPC Events : A Short Introduction in Performance Tuning

07 January 2021


Dieter an Mey
Aachen, February 2008

High Performance Computing (HPC)

As described in the short introduction to HPC, the focus of HPC is to reduce the time to solution of computational problems. The target is an efficient implementation of a suitable algorithm in a suitable programming language, leading to an executable program which exploits the available machinery efficiently. Today, this usually results in a cache-friendly parallel program written in C, C++ or Fortran.

Currently there are  two dominating parallelization paradigms:

  • OpenMP for shared-memory parallelization on multiprocessor computers (SMP), with all processors accessing a common memory
  • Message Passing with MPI for distributed-memory parallelization on clusters of computers connected via a fast network

As current supercomputers typically consist of a cluster of multiprocessor computers (SMP cluster), MPI and OpenMP can nicely be combined, a strategy which is called hybrid parallelization. MPI can also be used to run multiple processes on each multiprocessor cluster node.

Performance Tuning Targets

What can be accomplished by parallelization and performance tuning? In an ideal case, parallelization would lead to a speed-up which scales linearly with the number of processors employed, compared to the original serial program running on a single processor. A list of naive expectations is often made:

  • A serial program is expected to run close to the processor’s theoretical peak performance.
  • An OpenMP program running on an SMP with M processors (or processor cores) is expected to run M times faster with M OpenMP threads compared to the serial program running on one processor (core).
  • An MPI program running on an SMP cluster consisting of N nodes with each node having M processors (or processor cores) is expected to run N*M times faster with N*M MPI processes.
  • A hybrid (OpenMP + MPI) parallel program running on an SMP cluster consisting of N nodes with each node having M processors (or processor cores) is expected to run N*M times faster with N MPI processes and M OpenMP threads per MPI process (or with for example N*2 MPI processes and M/2 OpenMP threads per MPI process).

What if a program’s performance does not meet these expectations? Indeed, there are good reasons why these expectations most likely will not be met:

  • While today’s processors typically have a very high peak performance, memory performance lags behind, and in many cases the processors cannot be fed with data fast enough to maintain a high computational speed. As a rule of thumb, modern processors typically achieve some five percent of their peak performance. It may get even worse if the way data is accessed in memory is very unfortunate. In a lucky case, if a program displays nice (temporal and spatial) locality and thus is very cache friendly, performance may get close to some 20 to 50 percent of theoretical peak performance.
  • Multiprocessor computers, multicore processors and, even more so, multithreaded processor cores share a lot of resources. In an SMP the processors share paths to the common memory, cores in a multicore processor frequently share caches and pins, and multithreaded processor cores share virtually everything except for a couple of state registers. All this resource sharing may lead to considerable conflicts limiting the speed-up of parallel programs.
  • In a compute cluster the network between these nodes may be a bottleneck, if the MPI program is communication intensive. In this case the performance of the network, its latency, bandwidth and topology, will be an important factor.

Thus, performance tuning has three major aspects:

  • designing and tuning a sequential program to exploit the serial performance
  • parallelization for shared memory to exploit the node performance
  • parallelization using message passing to exploit the cluster performance

Now, what if a program has been parallelized using OpenMP and/or MPI but it still runs too slowly? What if adding resources, i.e. increasing the number of OpenMP threads and/or MPI processes, even slows down the program?

Parallelization always introduces synchronization and communication overhead. Additionally, any code regions which have not been parallelized limit the scalability of the parallel program according to Amdahl’s Law. Another root of bad scalability is load imbalance, when some processes just do nothing while waiting for others to become ready.
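
As a reminder, Amdahl’s Law quantifies this: if a fraction s of the runtime remains serial, the speed-up with N processors is bounded by

S(N) = 1 / (s + (1 - s) / N)  ≤  1 / s,

so even a small serial share caps the achievable speed-up, no matter how many processors are added.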

Performance tuning tools can be a great convenience for learning more about the runtime behavior of the program and for detecting performance problems.

Tuning Sequential Programs

In order to improve the performance of a single process, the hardware architecture of a processor has to be taken into account. Over the last decades processor speed has improved considerably, whereas memory has primarily increased its capacity but not its access time (latency). As a consequence, processors frequently have to wait for data coming from memory and in many cases cannot run at full speed. In order to overcome this bottleneck, chip designers added fast but small buffers (caches) early on. These caches work transparently to the user and can speed up programs considerably. Still, many compute-intensive applications only run at some five percent of the theoretical peak performance of modern processors, leaving considerable headroom for tuning efforts. The advent of multicore processors further aggravates this memory bottleneck.

In order to improve the cache performance it is beneficial to reuse data residing in a cache in a timely manner (temporal locality). As caches are organized in so-called “cache lines” of typically 64 bytes and data is transported between memory and processors in chunks of cache lines, it is profitable to really use all of the data of such a chunk once it resides in a cache (spatial locality).
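
A small C example of the effect of spatial locality: since C stores matrices row by row, the loop order decides whether consecutive accesses stay within a cache line (a sketch for illustration only):

#include <stdio.h>
#include <time.h>

#define N 4096

static double a[N][N];                     /* 128 MB, stored row by row */

static double seconds(struct timespec t0, struct timespec t1)
{
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void)
{
    struct timespec t0, t1;
    double sum = 0.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)            /* row-wise: consecutive addresses,   */
        for (int j = 0; j < N; j++)        /* every element of a cache line used */
            sum += a[i][j];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double row = seconds(t0, t1);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int j = 0; j < N; j++)            /* column-wise: stride of N*8 bytes,   */
        for (int i = 0; i < N; i++)        /* typically one cache miss per access */
            sum += a[i][j];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("row-wise %.3f s, column-wise %.3f s (sum=%f)\n", row, seconds(t0, t1), sum);
    return 0;
}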

Hardware Counters can provide statistical information about the distribution and frequency of cache misses or the time it takes to fetch data from memory for certain parts of a program. The rate of executed processor instructions and particularly the rate of executed floating point instructions can be measured, which is a suitable measure for program performance.

Tuning OpenMP Programs

  • As all threads of an OpenMP program access a common memory, tuning the single thread’s memory performance is essential for improving the performance of an OpenMP program as well. Taking care about the locality of memory accesses is essential as all modern computer architectures are (cc)NUMA.
  • A nasty aspect of bad performance of OpenMP programs which is particularly hard to detect is called “false sharing”. When threads update different parts of the same cache line, the cache line may be forced to travel between the caches, effectively putting the caches out of commission (see the sketch after this list).
  • Adding OpenMP constructs also introduces administrative overhead to the original program. If this overhead is too large compared to the amount of parallelized work which is distributed to multiple threads, then using OpenMP may even be counterproductive, as it will slow down the application.
  • Furthermore, critical regions may have to be introduced into the program in order to let multiple threads update shared memory locations without data races. These critical regions serialize the program flow and may seriously decrease performance.
  • Load imbalances may cause threads to idle or wait at synchronization points (barriers). OpenMP provides easy-to-use loop scheduling policies to overcome these problems in many cases, which may in turn introduce additional overhead. Another approach to handling non-uniform pieces of work is tasking (OpenMP from v3.0 on).
These handicaps are examples of performance bottlenecks which may need to be detected and resolved to obtain a reasonable performance gain.
Tuning tools provide statistics and visual representations of the program execution revealing such performance obstacles.
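
The following small OpenMP program illustrates false sharing and its cure by padding (a sketch for illustration; compile without aggressive optimization so that the increments are not collapsed):

#include <omp.h>
#include <stdio.h>

#define NTHREADS 8
#define ITERS 100000000L

/* each thread gets its own counter, but adjacent counters share a cache line */
volatile long counters_shared[NTHREADS];

/* padding each counter to its own 64-byte cache line removes the false sharing */
struct padded { volatile long value; char pad[64 - sizeof(long)]; };
struct padded counters_padded[NTHREADS];

int main(void)
{
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(NTHREADS)
    {
        int me = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++) counters_shared[me]++;       /* false sharing */
    }
    double t_false = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    #pragma omp parallel num_threads(NTHREADS)
    {
        int me = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++) counters_padded[me].value++; /* no sharing */
    }
    printf("false sharing: %.2f s, padded: %.2f s\n", t_false, omp_get_wtime() - t0);
    return 0;
}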

Tuning MPI Programs

Again, single processor (core) performance is the point of departure for parallel performance. (Ironically, MPI speedup looks better in case the performance of the single MPI process is worse, because then the granularity of the work chunks is automatically larger.)
  • The major source of inefficiency of MPI programs is communication overhead. How much time is spent in sending messages compared to useful calculation? One opportunity for hiding communication is to overlap communication and computation (see the sketch after this list).
  • Global communication involving many or all MPI processes – for example calculating reductions, residuals, or error estimations – may include costly synchronizations. Sometimes the agglomeration of such reductions can reduce the overhead.
  • Load imbalances may easily become a serious performance problem. Tuning tools providing statistical and visual information about the program execution can easily detect these problems, but solving them may require a lot of program changes and even algorithmic changes.
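
A minimal sketch of overlapping communication and computation with non-blocking MPI calls (an illustrative halo exchange, not taken from a particular application):

#include <mpi.h>
#include <stdio.h>

#define NHALO 1000
#define NINNER 1000000

static double interior[NINNER];

int main(int argc, char **argv)
{
    int rank, size;
    double sendbuf[NHALO], recvbuf[NHALO];
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int right = (rank + 1) % size, left = (rank + size - 1) % size;

    for (int i = 0; i < NHALO; i++) sendbuf[i] = rank;
    for (int i = 0; i < NINNER; i++) interior[i] = i;

    /* 1. start the non-blocking halo exchange */
    MPI_Irecv(recvbuf, NHALO, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, NHALO, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* 2. compute on data that does not depend on the incoming halo */
    double sum = 0.0;
    for (int i = 0; i < NINNER; i++) sum += interior[i] * 0.5;

    /* 3. wait for the exchange to finish, then use the received data */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    sum += recvbuf[0];

    printf("rank %d: sum = %f\n", rank, sum);
    MPI_Finalize();
    return 0;
}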

Performance Tuning Procedure

A major challenge for performance analysis is the choice of a reasonable, representative input data set which still does not take too much time.

The application of tuning tools consists of three major steps:

  • Instrumentation or modification of the program to generate performance data
  • Collecting runtime information
  • Evaluation, analysis and presentation of the collected data (typically after the program run, but some tools also present performance data dynamically at runtime).
What kind of performance information can be collected?
  • timing information (CPU-time and/or real-time) on a statement, loop, routine, or program level
  • hardware performance counter information (cache hits and misses, various kinds of instruction counters, memory access counters, network accesses etc.)
  • number of executions (e.g. floating point operations) on a per-instruction, per-loop, or per-routine basis
  • specific wait times when threads or processes are synchronized, waiting for messages etc.
  • program counter (PC) and sometimes callstack information to relate performance data to the program location at the analysis phase

Program Instrumentation

Instrumentation of the program can be done on various levels:

  • On the source code level,
    • the programmer or a source-to-source pre-processor can add statements to inquire and record timing or hardware counter information and call library routines belonging to some measurement system in order to collect additional application specific information.
    • The compiler can instrument the code (if it supports this task).
  • At link time profiling versions of libraries can be used, or routines can be wrapped by adapter routines collecting performance information.
  • The executable can be enhanced through binary instrumentation.
  • The binary program can be executed under the control of some measuring tool, as simple as a timer and as complex as a tracing utility which has access to a comprehensive dynamic tracing framework provided by the operating system.

Collecting runtime information

How can performance information be collected? Two different methods can be distinguished:

  • Profiling updates summary statistics of execution when an event occurs. It uses the occurrence of an event to keep track of statistics of performance metrics. These statistics are maintained at runtime as the process executes and are stored in profile data files when the program terminates.
    In profiling based on sampling, a hardware interval timer periodically interrupts the execution. Such interrupts can also be triggered by performance counters that measure events in hardware (a minimal sketch of such a sampling profiler follows below).
  • Tracing records a detailed log of timestamped events and their attributes. It reveals the temporal aspect of the execution. This allows the user to see when and where routine transitions, communication and user defined events take place.

The program’s image file and symbol table can be used after the program run to calculate the time spent in each routine.
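
A minimal sketch of such a sampling profiler, using the interval timer of the operating system (illustrative only; real tools additionally record the program counter or the call stack at each sample):

#include <signal.h>
#include <stdio.h>
#include <sys/time.h>

/* which program phase is currently running; set by the program itself */
static volatile sig_atomic_t current_phase = 0;
static volatile sig_atomic_t samples[3];

static void on_sigprof(int sig)
{
    (void)sig;
    samples[current_phase]++;              /* attribute this sample to the running phase */
}

int main(void)
{
    struct sigaction sa;
    sa.sa_handler = on_sigprof;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    sigaction(SIGPROF, &sa, NULL);

    /* deliver SIGPROF every 10 ms of CPU time consumed by this process */
    struct itimerval it = { { 0, 10000 }, { 0, 10000 } };
    setitimer(ITIMER_PROF, &it, NULL);

    volatile double x = 0.0;
    current_phase = 1;                                       /* phase 1 */
    for (long i = 0; i < 300000000L; i++) x += i * 1e-9;
    current_phase = 2;                                       /* phase 2 */
    for (long i = 0; i < 100000000L; i++) x -= i * 1e-9;

    printf("phase 1: %d samples, phase 2: %d samples\n",
           (int)samples[1], (int)samples[2]);
    return 0;
}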

Evaluation, analysis and presentation of the runtime information and performance properties

The output of performance tools can be as simple as a textual summary of a small amount of runtime statistics and as overwhelming as a huge amount of data to be presented by graphical user interfaces. These user interfaces provide different views on the data including timeline visualization revealing the behavior of the application over time.

The purpose of a performance tool is not to overwhelm the user with the sheer amount of data but to pinpoint the most important performance problems.

Sometimes employing a tool presenting only a few lines of text can be more economical, as it can be applied more frequently in scripts and even in batch mode. On the other hand, the costly presentation of the time flow of the application can be very helpful for the programmer to get a feeling for what is going on at runtime and may even lead to the detection of program errors.

Outlook

Today, performance analysis of “regular” parallel programs is well understood. But analyzing “irregular” parallel programs applying nested, hybrid or even recursive parallelization, or changing the number of processes or threads on the fly, is much harder and the subject of active research. On the other hand, the number of processor cores employed in large applications is constantly increasing, leading to enormous amounts of performance data and causing problems when storing, retrieving, analyzing and presenting them in an adequate and efficient way.

 

References

  1. Sameer Shende, Profiling and Tracing in Linux
    http://www.cs.uoregon.edu/research/paraducks/papers/linux99.pdf
  2. Intel VTune Amplifier
  3. Bernd Mohr,  Performance Analysis of MPI Programs with Vampir and Vampirtrace (PDF)

[:en]

Dieter an Mey
Aachen, February 2008

High Performance Computing (HPC)

As described in the short introduction into HPC the focus of HPC is to reduce the time to solution of computational problems. The target is an efficient implementation of a suitable algorithm in a suitable programming language leading to an executable program which exploits the suitable machinery efficiently. Today, this usually results in a cache friendly parallel program written in C, C++ or Fortran.

Currently there are  two dominating parallelization paradigms:

  • OpenMP for shared-memory-parallelization on multiprocessor computers (SMP), with all processors accessing a common memory
  • Message Passing with MPI for distributed-memory parallelization for clusters of computers connect via a fast network
As current supercomputers typically consist of a cluster of multiprocessor computers (SMP cluster), MPI and OpenMP can nicely be combined, a strategy which is called hybrid parallelization. Also MPI can be used to run multiple processes on each multiprocessor cluster node.

Performance Tuning Targets

What can be accomplished by parallelization and performance tuning? In an ideal case parallelization would lead to a speed-up which scales linearly with the number of processors employed compared to the original serial program running on a single processor. A list of naive expectations often will be made:

  • A serial program is expected to run close to the processor’s theoretical peak performance.
  • An OpenMP program running on an SMP with M processors (or processor cores) is expected to run M times faster with M OpenMP threads compared to the serial program running on one processor (core).
  • An MPI program running on an SMP cluster consisting of N nodes with each node having M processors (or processor cores) is expected to run N*M times faster with N*M MPI processes
  • A hybrid (OpenMP + MPI) parallel program running on an SMP cluster consisting of N nodes with each node having M processors (or processor cores) is expected to run N*M times faster with N MPI processes and M OpenMP threads per MPI process (or with for example N*2 MPI processes and M/2 OpenMP threads per MPI process).

What if a program’s performance does not meet these expectations? Indeed, there are good reasons why these expectations most likely will not be met:

  • While today’s processor typically have a very high peak performance, memory performance lags behind and in many cases the processors cannot be fed with data fast enough to maintain a high computational speed. As a rule of thumb, modern processors typically achieve some five percent of their peak performance. But it may get worse, if the way data is accessed in memory is very unfortunate. In a lucky case, if a program displays a nice (temporal and spatial) locality and thus is very cache friendly performance may get close to some 20 to 50 percent of theoretical peak performance.
  • Multiprocessor computers, multicore processors and even more multithreaded processor cores share a lot of resources. In an SMP the processors share paths to the common memory, cores in a multicore processor frequently share caches and pins, multithreaded processor cores share virtually everything except for a couple of state registers. All this resource sharing may lead to considerable conflicts limiting the speed-up of parallel programs.
  • In a compute cluster the network between these nodes may be a bottleneck, if the MPI program is communication intensive. In this case the performance of the network, its latency, bandwidth and topology, will be an important factor.

Thus, performance tuning has three major aspects:

  • designing and tuning a sequential program to exploit the serial performance
  • parallelization for shared memory to exploit the node performance
  • parallelization using message passing to exploit the cluster performance

Now, what if a program has been parallelized using OpenMP and/or MPI but it still runs too slow? What if adding resources, increasing the number of OpenMP threads and/or MPI processes even slows down the program?

Parallelization always introduces synchronization and communication overhead. Additionally any code regions which have not been parallelized limit the scalability of the parallel program according to the Amdahl’s Law. Another root of bad scalability is Load Imbalance, when some processes jost do nothing while waiting for other becaming ready.

Performance tuning tools can be great convenience to learn more about the runtime behavior of the program and to detect performance problems.

Tuning Sequential Programs

In order to improve the performance of a single process, the hardware architecture of a processor has to be taken into account. Over the last decades the processor speed has been improved considerably, whereas memory has primarily increased its capacity but not the access time (latency). As a consequence, processors frequently have to wait for data coming from memory and in many cases cannot run at full speed. In order to overcome this bottleneck, chip designers have added fast but small buffers (caches) early on. These caches work transparently to the user and can speed up programs considerably. Still many compute intense applications only run at some five percent of the theoretically peak performance of modern processors, leaving considerable headroom for tuning efforts. The advent of multicore processors aggravates this memory bottleneck furthermore.

In order to improve the cache performance it is beneficial to reuse data residing in a cache in a timely manner (temporal locality). As caches are organized in so-called “cache lines” of typically 64 bytes and data is transported between memory and processors in chunks of cache lines, it is profitable to really use all of the data of such a chunk once it resides in a cache (spatial locality).

Hardware Counters can provide statistical information about the distribution and frequency of cache misses or the time it takes to fetch data from memory for certain parts of a program. The rate of executed processor instructions and particularly the rate of executed floating point instructions can be measured, which is a suitable measure for program performance.

Tuning OpenMP Programs

  • As all threads of an OpenMP program access a common memory, tuning the single thread’s memory performance is essential for improving the performance of an OpenMP program as well. Taking care about the locality of memory accesses is essential as all modern computer architectures are (cc)NUMA.
  • A nasty aspect of bad performance of OpenMP programs which is particularly hard to detect is called “false sharing”. When threads update different parts of the same cache line, cache lines may be forced to travel between the caches, effectively putting the caches out of commission.
  • Adding OpenMP constructs also introduces administrative overhead to the original program. If this overhead is too large compared to the amount of parallelized work which is distributed to multiple threads, then using OpenMP may be counterproductive at all as it will slow down the application.
  • Furthermore, critical regions may have to be introduced into the program in order to update shared memory locations by multiple threads and avoid data races. These critical regions serialize the program flow and may seriously decrease performance.
  • Load imbalances may cause threads to idle or wait at synchronization points (barriers). OpenMP provides easy-to-use loop scheduling policies to overcome these problems in many cases, which may in turn insert additional overhead. Another approach to handle non-uniformly pieces of work is Tasking (OpenMP from v3.0 on).
These handicaps are examples of performance bottlenecks which may need to be detected and resolved to obtain a reasonable performance gain.
Tuning tools provide statistics and visual representations of the program execution revealing such performance obstacles.

Tuning MPI Programs

Again, single processor (core) performance is the point of departure for parallel performance. (Ironically, MPI speedup looks better in case the performance of the single MPI process is worse, because then the granularity of the work chunks is automatically larger.)
  • The major source of inefficiency of MPI programs is communication overhead. How much time is spent in sending messages compared to useful calculation? An opportunity of hiding communication is by overlapping communication and computation.
  • Global communication involving many or all MPI processes – for example calculating reductions, residuals, or error estimations – may include costly synchronizations. Sometimes the agglomeration of such reductions can reduce the overhead.
  • Load imbalances may easily become a serious performance problem. Tuning tools providing statistical and visual information about the program execution can easily detect these problems, but solving them may require a lot of program and even algorithmical changes.

Performance Tuning Procedure

A major challenge for performance analysis is the choose of reasonable, representative input data set which still not take too much time.

The application of tuning tools consists of three major steps:

  • Instrumentation or modification of the program to generate performance data
  • Collecting runtime information
  • Evaluation, analysis and presentation of the collected data (typically after the program run, but some tools also present performance data dynamically at runtime).
What kind of performance information can be collected?
  • timing information (CPU-time and/or real-time) on a statement, loop, routine, or program level
  • hardware performance counter information (cache hits and misses, various kinds of instruction counters, memory access counters, network accesses etc.)
  • number of executions (e.g. floating point operation) on a per instruction, per loop, or per routine basis
  • specific wait times when threads or processes are synchronized, waiting for messages etc.
  • program counter (PC) and sometimes callstack information to relate performance data to the program location at the analysis phase

Program Instrumentation

Instrumentation of the program can be done on various levels:

  • On the source code level,
    • the programmer or a source-to-source pre-processor can add statements to inquire and record timing or hardware counter information and call library routines belonging to some measurement system in order to collect additional application specific information.
    • The compiler can instrument the code (if it supports this task).
  • At link time profiling versions of libraries can be used, or routines can be wrapped by adapter routines collecting performance information.
  • The executable can be enhanced through binary instrumentation.
  • The binary program can be executed under the control of some measuring tool, as simple as a timer and as complex as a tracing utility which has access to a comprehensive dynamic tracing framework provided by the operating system.

Collecting runtime information

How can performance information be collected? Two different methods can be distinguished:

  • Profiling updates summary statistics of execution when an event occurs. It uses the occurrence of an event to keep track of statistics of performance metrics. These statistics are maintained at runtime as the process executes and are stored in profile data files when the program terminates.
    In profiling based on sampling, a hardware interval timer periodically interrupts the execution. Such interrupts can as well be triggered by performance counters, that measure events in hardware.
  • Tracing records a detailed log of timestamped events and their attributes. It reveals the temporal aspect of the execution. This allows the user to see when and where routine transitions, communication and user defined events take place.

The program’s image file and symbol table can be used after the program run to calculate the time spent in each routine.

Evaluation, analysis and presentation of the runtime information and performance properties

The output of performance tools can be as simple as a textual summary of a small amount of runtime statistics and as overwhelming as a huge amount of data to be presented by graphical user interfaces. These user interfaces provide different views on the data including timeline visualization revealing the behavior of the application over time.

The purpose of a performance tool is not to overwhelm the user with the sheer amount of data but to pinpoint the most important performance problems.

Sometimes employing a tool that presents only a few lines of text can be more economical, as it can be applied more frequently, in scripts and even in batch mode. On the other hand, the costlier presentation of the time flow of the application can be very helpful, giving the programmer a feeling for what is going on at runtime, and it may even lead to the detection of program errors.

Outlook

Today, performance analysis of “regular” parallel programs is well understood. Analyzing “irregular” parallel programs that apply nested, hybrid or even recursive parallelization, or that change the number of processes or threads on the fly, is much harder and the subject of active research. At the same time, the number of processor cores employed in large applications is constantly increasing, leading to enormous amounts of performance data and causing problems in storing, retrieving, analyzing and presenting them in an adequate and efficient way.

 

References

  1. Sameer Shende, Profiling and Tracing in Linux
    http://www.cs.uoregon.edu/research/paraducks/papers/linux99.pdf
  2. Intel VTune Amplifier
  3. Bernd Mohr,  Performance Analysis of MPI Programs with Vampir and Vampirtrace (PDF)


HPC Events : SunHPC 2008

07. Januar 2021 | von


SunHPC 2008 Seminar

Monday, March 3 – Tuesday, March 4, 2008

followed by the first

VI-HPS Tuning Workshop

Wednesday, March 5 – Friday, March 7, 2008

Special Guest:

On Monday at 11:30 Denis Sheahan, Distinguished Engineer, Niagara Architecture Group, Sun Microsystems, will talk about „Niagara 2 Architecture and Performance Overview, and a sneak peek at Victoria Falls“. Niagara 2 – or UltraSPARC T2, the official product name – is a very innovative multicore, multithreading processor from Sun Microsystems.

Time

Location

SunHPC 2008 Seminar:

Monday, March 3, 9:00  –  Tuesday March 4, 17:30

VI-HPS Tuning Workshop:

Wednesday, March 5, 9:00  –  Friday March 7, 12:30

Center for Computing and Communication

RWTH Aachen University

Seffenter Weg 23
52074 Aachen
Lecture Room and Lab Room 3

Contents

  • Introduction
  • Participants
  • Agenda
  • Costs
  • Registration and Feedback
  • Logistics

Introduction

The SunHPC 2008 Seminar is the 8th event in a series of successful introductions to application performance tuning organized by RWTH Aachen University and Sun Microsystems. The format has changed a little over the years; this time it is combined with the first VI-HPS Tuning Workshop.

The SunHPC 2008 Seminar starts with a short overview of serial application performance tuning. This is followed by a detailed tutorial on shared memory parallelization. After an extensive introduction to concepts related to parallelization, automatic parallelization as well as the OpenMP programming model are covered in great detail.
An instructor-led lab offers the opportunity to try things out yourself using the examples made available.

The general philosophy of the first part of the workshop is to build up understanding of key concepts that are relevant to obtain good application performance. Once this is achieved, it is much easier to use the development environment in the best possible way.

The Sun compilers, the Sun Performance Analyzer and the Sun Thread Analyzer are covered in detail. It is shown how these tools can be used to get optimal productivity and performance out of  UltraSPARC T2- and Opteron-based Sun systems.
We also briefly touch upon several third party software products, which augment the programmer’s tool suite on the Sun systems.

Participants

Attendees should be comfortable with C or Fortran programming and interested in learning more about the technical details of application tuning. Although there is no special coverage of C++ and the examples are in Fortran and/or C, C++ programmers will certainly benefit from this course as well.  Prepared lab exercises will be made available to participants. These exercises have been selected to demonstrate features discussed in the presentations.
The workshop language will be English.

The participants are cordially invited to also take part in the following VI-HPS Tuning Workshop to gain more hands-on experience. The Sun tools are part of that workshop as well, and the focus is on tuning your own applications. Suitable preparation of Makefiles and small to medium-sized data sets is of course desirable.

Agenda (please download here)

  • Serial Performance Part 1 (RvdP) Video
  • Serial Performance Part 2 (RvdP) Video
  • Architecture of the UltraSPARC T2 processor (Denis Sheahan) Video
  • Serial Performance, Part 3 (RvdP) Video
  • Introduction into parallelization (RvdP) Video
  • Automatic parallelization
  • Introduction into OpenMP
  • Sun Studio support for OpenMP
  • OpenMP and Performance
  • Case Studies
  • The Acumem Virtual Performance Expert (Mats Nielsson) Video

Social Dinner

Sun Microsystems is sponsoring a social dinner in the restaurant „Kazan“ on Tuesday evening at 19:00.
See our citymap for the location.

The Costs

There is no seminar fee. All other costs (e.g. travel, hotel, and meals) are at your own expense.

Registration and Feedback

Registration is closed.

In order to improve future events we need your feedback. Please take a few minutes to fill out the online form here.

Logistics

Find more information about the logistics here …


HPC Events : aiXcelerate 2017

07. Januar 2021 | von


HPC Tuning Workshop

Tue, Dec 5  – Thu, Dec 7, 2017

IT Center

RWTH Aachen University

sponsored by:


Introduction

This year’s aiXcelerate HPC Tuning Workshop will focus on two topics:

  1. MPI+OPA: Tuning of parallel programs employing message passing with MPI and running on the Intel OmniPath fabric.
  2. KNL: Tuning for Intel’s new Xeon Phi many-core processor

The nodes of Claix, the latest supercomputer of RWTH Aachen University, are equipped with Intel Broadwell processors and connected through a network based on the Intel OmniPath architecture (OPA). Claix has recently been complemented by 16 nodes containing an Intel Xeon Phi 7210 processor – code-named Knights Landing (KNL). At the Jülich Supercomputing Centre a cluster called „Booster“ will soon be installed, which will contain Intel Xeon Phi 7250-F many-core processors and an Intel OmniPath fabric. The KNL nodes in Aachen are thus well suited to prepare and tune programs for Booster.

Researchers from FZ Jülich and RWTH with high demand for compute power can apply for resources on these systems as part of the JARA-HPC partition (>>> more… ).
Researchers from all over Germany can apply for resources on Claix (>>> more… ).

The workshop will consist of presentations open to a broader audience and of a hands-on tuning workshop for a limited number of selected compute projects.
Presentations will be given in English.

We are proud to announce Michael Klemm and Christopher Dahnken, two HPC performance experts from Intel, who will give presentations and support you with your tuning efforts. They will be complemented by experts of the HPC Team of the IT Center at RWTH Aachen University.

Materials

  1. Agenda
  2. Michael Klemm (Intel): MPI & ITAC
  3. Dirk Schmidl (RWTH): NonIntel Tools
  4. Hristo Iliev (RWTH): Open MPI
  5. Paul Kapinos (RWTH): RWTH Compute Cluster environment
  6. Christopher Dahnken (Intel): Intel Xeon Phi Architecture
  7. Christopher Dahnken (Intel): OpenMP SIMD
  8. Michael Klemm (Intel): VTune, also for MPI

aiXcelerate – presentations

  1. MPI+OPA: On Tue morning, Dec 5, we will start at 10:00 with presentations on MPI programming and tuning using Intel MPI on the OmniPath fabric (OPA).
    Please  register here for the presentations on MPI+OPA
  2. KNL: On Wed morning, Dec 6, we will start at 10:00 with presentations on Xeon Phi (KNL) programming and tuning using Intel compiler and tools.
    Please  register here for the presentations on KNL

aiXcelerate – tuning workshop

In the afternoon after the presentations on Tue and Wed and also throughout Thu, Dec 7, we provide you with hands-on opportunity to adapt, analyze and tune your performance critical applications on Claix, with its Broadwell and Xeon Phi processors connected through the OmniPath Fabric.

Participation in the presentations is a prerequisite for participation in the tuning workshop.

Attendees are kindly requested to prepare and bring in their own code. It is assumed that you have a good working knowledge of MPI and/or OpenMP, and of C/C++ or Fortran, whatever your compute project employs. To maximize the efficiency of the workshop, we would like to ask you to prepare one or more test cases that reflect typical production runs but do not take too long to execute – in the ideal case, a run should not take more than 5 to 10 minutes to finish. Members of the local HPC team will support every accepted project in porting the code to Claix. By the start of the workshop the code should be ready to be analyzed.

If you plan to participate in the tuning activities please contact us by sending an email to hpcevent@itc.rwth-aachen.de.

Workshop participants are invited to take part in the social dinner on Wednesday evening in the Restaurant Palladion. Please indicate in your registration whether you would like to take part. Thank you.

Costs

Attendance is free of charge and supported by our sponsors.
Travel and accommodation are at your own expense.

Links to previous Events

Travel Information

If required, please make your own hotel reservation. You can find some popular hotels listed here. You may find a complete list of hotels on the web pages of the Aachen Tourist Service. We recommend that you try to book a room in the „Novotel Aachen City“, „Mercure am Graben“ or „Aachen Best Western Regence“ hotels. These are adequate hotels with reasonable prices within walking distance (20-30 minutes) of the IT Center through the old city of Aachen. An alternative is the hotel „IBIS Aachen Marschiertor“, which is close to the main station and convenient if you are travelling by train and also want to commute to the IT Center by train (3 trains per hour, 2 stops).

Please, download a sketch of the city (pdf, 415 KB) with some points of interest marked.
You may find a description of how to reach us by plane, train or car here.
Bus route 33 connects the city and the stop „Mies-van-der-Rohe-Straße“ every 15 minutes.
Trains between Aachen and Düsseldorf stop at „Aachen West“ station which is a 5 minutes walk away from the IT Center.
From the bus stop and the train station just walk uphill the „Seffenter Weg“. The first building on the left hand side at the junction with „Kopernikusstraße“ is the IT Center.

The weather in Aachen is usually unpredictable. It is always a good idea to carry an umbrella. If you’ll bring one, it might be sunny!

Technical Information for Tuning Workshop Participants to get prepared for KNL

Workshop participants with an interest in tuning for KNL can prepare their jobs on the KNL systems:

KNL login node:

login-knl.hpc.itc.rwth-aachen.de

Job commands for submitting an LSF job to the 15 KNL batch nodes:

#BSUB -P hpclab
#BSUB -R knl
#BSUB -m c64m208k
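# (Hedged reading of the directives above, not part of the original announcement:
#  -P hpclab charges the job to the workshop project "hpclab", -R knl requests
#  the KNL resource, and -m c64m208k restricts the job to the corresponding
#  KNL host group.)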

Please contact us and send us your HPC account so we can add you to the project „hpclab“ for the workshop.

You can find some information on tuning for KNL here.

Contact


HPC Events : aiXcelerate 2018

07. Januar 2021 | von


HPC Tuning Workshop

Mon, Dec 3  – Wed, Dec 5, 2018

IT Center

RWTH Aachen University

sponsored by:


Introduction

The aiXcelerate HPC Tuning Workshop focused on the Intel Skylake (SKL) microarchitecture, SIMD programming, and performance tuning using LIKWID, Intel VTune Amplifier and Intel Advisor.

The nodes of CLAIX-2018, the latest supercomputer of RWTH Aachen University, were equipped with Intel Skylake processors and connected through a network based on the Intel OmniPath architecture (OPA). It was a major extension of the CLAIX-2016 system, which was equipped with Intel Broadwell processors and also connected through OPA.

Researchers from FZ Jülich and RWTH with high demand for compute power can apply for resources on these systems as part of the JARA-HPC partition (>>> more… ).
Researchers from all over Germany can apply for resources on Claix (>>> more… ).

The workshop consisted of presentations open to a broader audience and of a hands-on tuning workshop for a limited number of selected computing projects.
Presentations were given in English.

We were proud to announce Michael Klemm and Christopher Dahnken, two HPC performance experts from Intel, who gave presentations and supported the tuning efforts. Furthermore, a presentation on detecting performance-limiting factors with hardware monitoring using LIKWID was given by Thomas Gruber (born Röhl), the main developer of the LIKWID tool. Experts of the HPC team of the IT Center at RWTH Aachen University assisted in the tuning activities as well.

Materials

  1. Agenda
  2. Sandra Wienke (RWTH): aiXcelerate Welcome & CLAIX-2018 Overview
  3. Marcus Wagner, Paul Kapinos (RWTH): SLURM and Modules for CLAIX-2018
  4. Jonas Hahnfeld (RWTH): Performance Monitoring on CLAIX
  5. Michael Klemm (Intel): Skylake Architecture
  6. Michael Klemm (Intel): Skylake Performance Considerations
  7. Michael Klemm (Intel): Intel VTune Amplifier
  8. Thomas Gruber (FAU Erlangen): LIKWID, Detecting Performance Limiting Factors with Hardware Monitoring

Registration

Closed

aiXcelerate 2018 – Presentations

  1. Dec 3, 11:00-13:00 – On Mon morning, Dec 3, we started at 11:00 with presentations on the new CLAIX-2018 cluster and the Skylake processor microarchitecture
  2. Dec 3, 14:00-15:00 – After lunch on Monday, we continued with presentations on SIMD programming and performance optimization.
  3. Dec 4, 11:00-13:00 – On Tuesday morning, Dec 4, we started at 11:00 with presentations on Likwid, Intel VTune/Amplifier and Advisor.

aiXcelerate 2018 – Tuning Workshop

After the presentations on Mon, Dec 3 and Tue, Dec 4, and also throughout Wed, Dec 5, we provided hands-on opportunities to adapt, analyze and tune your performance-critical applications on CLAIX, with its Skylake (and Broadwell) processors connected through the Intel OmniPath (OPA) fabric.

Participation in the presentations was a prerequisite for participation in the tuning workshop.

Attendees were kindly requested to prepare and bring in their own code. It was assumed that participants had a good working knowledge of MPI and/or OpenMP, and of C/C++ or Fortran, whatever their compute project employed. To maximize the efficiency of the workshop, we asked the participants to prepare one or more test cases that reflected typical production runs but did not take too long to execute – in the ideal case, a run did not take more than 5 to 10 minutes to finish. Members of the local HPC team supported every accepted project in porting the code to Claix. By the start of the workshop the code was expected to be ready to be analyzed.

Furthermore, each participant (or participating group) was kindly asked to shortly present their project (1-2 slides) at the beginning of the tuning workshop and also shortly present the outcome of the tuning efforts at the end of the workshop.

Participants who were interested in the tuning activities contacted us by sending an email to hpcevent@itc.rwth-aachen.de.

Workshop participants were invited to take part in the social dinner on Tuesday, Dec 4 in the restaurant Palladion.

Costs

Attendance was free of charge and supported by our sponsors.
Travel and accommodation were at the participants’ own expense.

Links to previous Events

Travel Information

If required, please make your own hotel reservation. You can find some popular hotels listed here. You may find a complete list of hotels on the web pages of the Aachen Tourist Service. We recommend that you try to book a room in the „Novotel Aachen City“, „Mercure am Graben“ or „Aachen Best Western Regence“ hotels. These are adequate hotels with reasonable prices within walking distance (20-30 minutes) of the IT Center through the old city of Aachen. An alternative is the hotel „IBIS Aachen Marschiertor“, which is close to the main station and convenient if you are travelling by train and also want to commute to the IT Center by train (3 trains per hour, 2 stops).

Please, download a sketch of the city (pdf, 415 KB) with some points of interest marked.
You may find a description of how to reach us by plane, train or car here.
Bus route 33 connects the city and the stop „Mies-van-der-Rohe-Straße“ every 15 minutes.
Trains between Aachen and Düsseldorf stop at „Aachen West“ station which is a 5 minutes walk away from the IT Center.
From the bus stop and the train station just walk uphill the „Seffenter Weg“. The first building on the left hand side at the junction with „Kopernikusstraße“ is the IT Center.

The weather in Aachen is usually unpredictable. It is always a good idea to carry an umbrella. If you’ll bring one, it might be sunny!

Technical Information for Tuning Workshop Participants

Workshop participants should have an account on the HPC Cluster in Aachen. Participants were asked not to bring in their own laptops; the IT Center provided a loan device. In this case, please also be equipped with a ‚PC Pool‘ account (get it via Selfservice).

Please contact us and send us your HPC account (like ab123456) so we can add you to the project „hpclab“ for the workshop.

Contact


PPCES 2020 (Online)

02. Dezember 2020 | von

Parallel Programming in Computational Engineering and Science 2020

Since we had to cancel the PPCES workshop planned for March 16 – 20, 2020 at short notice, we are offering this online workshop on December 2 and 3, 2020.

Online HPC Seminar and Workshop

December 2 – 3 , 2020

IT  Center RWTH Aachen University

Please find information about the preceding Introduction to HPC on March 3, 2020 >>>

About PPCES

This two-day online event will continue the tradition of the annual week-long events that have taken place in Aachen every spring since 2001. This time we will only cover the basics of parallel programming using OpenMP and MPI in Fortran and C/C++ and a first step towards performance tuning. Hands-on exercises for each topic will be included.

The contents of the courses are generally applicable but will be specialized towards CLAIX, the compute cluster currently installed at the RWTH IT Center. It might be helpful to read through the information that was provided during the HPC introduction on March 3 this year, especially if you want to actively use CLAIX after this event.

OpenMP  is a widely used approach for programming shared memory architectures, supported by most compilers nowadays. We will cover the basics of the programming paradigm as well as some advanced topics such as programming NUMA machines. The nodes of the RWTH Compute Cluster contain an increasing number of cores and thus we consider shared memory programming a vital alternative for applications that cannot be easily parallelized with MPI. We also expect a growing number of application codes to combine MPI and OpenMP for clusters of nodes with a growing number of cores.
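
As a quick, hedged taste of the paradigm (a minimal sketch, not taken from the course material): a loop and a reduction can be parallelized with single directives.

/* Minimal OpenMP sketch: a parallel loop and a parallel reduction.
   Build e.g. with: gcc -fopenmp -O2 omp_intro.c */
#include <stdio.h>
#include <omp.h>

#define N 1000000

static double x[N], y[N];

int main(void)
{
    double sum = 0.0;

    #pragma omp parallel for                  /* iterations are distributed over threads */
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
    }

    #pragma omp parallel for reduction(+:sum) /* each thread gets a private partial sum */
    for (int i = 0; i < N; i++) {
        y[i] += 2.5 * x[i];
        sum  += y[i];
    }

    printf("max threads: %d, checksum = %f\n", omp_get_max_threads(), sum);
    return 0;
}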

The Message Passing Interface (MPI) is the de-facto standard for programming large HPC systems. We will introduce the basic concepts and give an overview of some advanced features. Also covered is hybrid parallelization, i.e., the combination of MPI and shared memory programming, which is gaining popularity as the number of cores per cluster node grows.
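
And a correspondingly minimal, hedged MPI sketch (again not taken from the course slides), showing the basic concepts of ranks and blocking point-to-point communication:

/* Minimal MPI sketch: blocking point-to-point communication between two ranks.
   Build and run e.g. with: mpicc mpi_intro.c && mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0) printf("please run with at least 2 processes\n");
    } else if (rank == 0) {
        double msg = 3.14;
        MPI_Send(&msg, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);    /* send to rank 1, tag 0 */
    } else if (rank == 1) {
        double msg;
        MPI_Recv(&msg, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %g from rank 0\n", msg);
    }

    MPI_Finalize();
    return 0;
}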

Prerequisites

Attendees should be comfortable with C/C++ or Fortran programming in a Linux environment and interested in learning more about the technical details of application tuning and parallelization.

All presentations will be given in English.

This event will be an online presentation.

Agenda

Day 1: OpenMP (Wednesday, December 2)

Start End Topic
09:30 09:50 Welcome + OpenMP Overview
09:50 10:15 OpenMP Worksharing
10:15 10:35 OpenMP Scoping
10:35 10:45 Exercise Setup
10:45 11:00 Break
11:00 11:30 Exercises
11:30 12:00 OpenMP & NUMA
12:00 14:00 Break
14:00 14:30 OpenMP Tasking: Basics
14:30 15:00 OpenMP Tasking: Scoping & Synchronization
15:00 15:15 Exercises
15:15 15:30 Break
15:30 16:00 Exercises
16:00 16:30 Solutions to Exercises + Q & A

Day 2: MPI (Thursday, December 3)

Start End Topic
09:30 09:40 Welcome & Overview (10 min)
09:40 10:00 MPI Basics (20 min)
10:00 10:35 Blocking Point-to-point communication (30 min)
10:35 10:45 Exercise Setup
10:45 11:00 Coffee Break
11:00 11:30 Exercises
11:30 12:00 Non-blocking (point-to-point) communication (30 min)
12:00 14:00 Lunch break
14:00 14:30 Collective Communication (30 min)
14:30 15:00 Exercises (30 min)
15:00 15:15 Coffee break
15:15 15:45 ARM Performance Reports (30 min)
15:45 16:00 Exercises (Perf Reports) (15 min)
16:00 16:30 Solutions to Exercises + Q & A (30 min)
16:30 16:45 Wrap-Up

Registration

Please register here until November 25, 2020  >>>  [REGISTRATION CLOSED]
Registered participants will receive further information by email.

Course Material
Day 1: OpenMP

Organization_PPCESonline.pdf

01_OpenMP_Introduction-Overview.pdf

02_OpenMP_Introduction-Worksharing.pdf

03_OpenMP_Introduction-Scoping.pdf

OpenMP_Exercises_2020_online.pdf

04_OpenMP_NUMA.pdf

05_OpenMP_Introduction-Tasking.pdf

06_OpenMP_TaskingAndScoping.pdf

07_OpenMP_TaskingAndSynchronization.pdf

Day 2: MPI

01_MPI-Overview.pdf

02_MPI_Concepts.pdf

03_MPI_Blocking_Point-to-Point_Communication.pdf

Exercises_MPI_2020_online.pdf

04_MPI_Non-blocking-Point-to-Point_Communication.pdf

05_MPI_Blocking_Collective_Communication.pdf

06_MPI_ARM_PerfReports.pdf

Q&A Session

 PPCES-2020_Q&A.pdf

Further Information

The OpenMP part is also available as online tutorial (including videos): https://hpc-wiki.info/hpc/OpenMP_in_Small_Bites

Contact

Tim Cramer / Dieter an Mey
Tel.: +49 (241) 80-24924 / 80-24377
E-mail: hpcevent@itc.rwth-aachen.de