aiXcelerate 2024
Monday, December 9, 9:00 – Wednesday, December 11, 17:30
in cooperation with NVIDIA
Topic: Machine Learning on NVIDIA GPUs
Description
The annual aiXcelerate event at RWTH Aachen University (NHR4CES@RWTH) is a tuning workshop for HPC users. It comprises lectures that are open to everyone and hands-on parts in which (invited) groups apply the learned concepts to their own codes. This year, aiXcelerate covers the topic “Machine Learning (ML) on NVIDIA GPUs” and focuses on using the GPUs of RWTH’s HPC cluster “CLAIX” with frameworks such as PyTorch or TensorFlow. The workshop provides insights into performance analysis and performance tuning of ML codes (it is not an introduction to ML). aiXcelerate will take place with the support of NVIDIA from December 9th to 11th, 2024. Catering is sponsored by NEC and NVIDIA.
Presentations (open to everyone)
The talks are distributed across the morning sessions of the three workshop days. The topic of the first day is “Analyzing Performance of ML Codes”; it covers the automatically-running RWTH performance monitoring system as well as NVIDIA’s Nsight tooling and how to find bottlenecks with it. The second day focuses on “Scaling ML Codes across Multiple GPUs/Nodes”: approaches with PyTorch (Distributed) and TensorFlow + Horovod are presented to speed up ML codes by using more hardware in parallel. On the third day, we look at “Handling Datasets of ML Codes” and present the different options for storing and using ML data at runtime (on CLAIX). Furthermore, checkpointing of ML codes is introduced to save the application’s state at regular intervals and thus provide fault tolerance.
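To give a flavor of the day-1 material, the sketch below shows a hypothetical PyTorch training loop annotated with NVTX ranges so that the individual phases appear as named regions in an Nsight Systems timeline; such a script would typically be recorded with, e.g., `nsys profile -o report python train.py`. Model, data, and step count are made-up stand-ins; the actual workshop examples are provided in the course material (see below).

```python
# Hypothetical training loop annotated with NVTX ranges for Nsight Systems.
# Would typically be run under: nsys profile -o report python train.py
import torch

model = torch.nn.Linear(32, 2).cuda()          # toy stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()
x = torch.randn(64, 32).cuda()                 # synthetic mini-batch
y = torch.randint(0, 2, (64,)).cuda()

for step in range(100):
    torch.cuda.nvtx.range_push(f"step {step}") # named region in the timeline

    torch.cuda.nvtx.range_push("forward")
    loss = loss_fn(model(x), y)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    optimizer.zero_grad()
    loss.backward()
    torch.cuda.nvtx.range_pop()

    optimizer.step()
    torch.cuda.nvtx.range_pop()                # close the "step" range
```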
Talks are offered in hybrid form: the presenters give their talks in person on the premises of the IT Center at RWTH Aachen University, and participants are very welcome to attend in person, too. Additionally, the presentations are streamed live, so participants can also attend remotely. Participants need to choose their setup (on-premise or online) during the registration process.
Code & Tuning Activities (limited seats)
The sessions on “Code and Tuning Activities” (see agenda) target users who bring their own ML codes (e.g., using PyTorch or TensorFlow) and follow the bring-your-own (BYO) code principle. Participants of these sessions work closely together with our HPC/ML experts and get their support to, e.g., analyze and tune the performance of their codes, scale their codes across multiple GPUs, or improve their codes’ data handling.
Code and tuning activities are scheduled on all three days of the aiXcelerate workshop to leave sufficient time to work on the BYO codes. They start after the presentations in the late morning session and continue throughout the remainder of the day. Since participants work directly with our experts, this part can only be attended in person on-premise.
This part focuses on ML users working on the RWTH HPC cluster CLAIX. We (will) invite ML users with suitable compute-time projects on CLAIX. Nevertheless, other interested people are also welcome to register for this part of the workshop. However, since seats are limited, we reserve the right to accept only certain projects/users for this workshop part.
Requirements
To make good use of the time with our HPC/ML experts, it is necessary that…
- … the code you bring is already running on CLAIX (software & hardware environment).
- … a (test) dataset is available that (a) runs on one or a few GPUs for a short time (e.g., a few minutes) and (b) still captures the performance profile of a regular dataset by triggering the same production parts of the application.
- … all needed data has already been transferred to one of CLAIX’s file systems (HOME, WORK, HPCWORK).
- … at least one developer of the targeted application is registered. Several people using the same application are also welcome (please register separately).
Support
The HPC/ML experts will work together with the participants (ML developers). Our experts come from the HPC team of the IT Center (NHR4CES@RWTH) and will be kindly supported by Fabian Berressem, HPC and ML expert at NVIDIA.
Organization
- There is no workshop fee.
- Presentations will be given in English. Slides will be available during or after the event (see “Course Material” below).
- Presentations are given in hybrid form (see above). Please choose your setup during the registration process.
- Code activities are in-person on-premise only.
- Presentations and code activities focus on using the GPUs of RWTH’s HPC cluster “CLAIX”.
- Note: This is not an introduction to machine learning! We assume that you already have knowledge of ML and want to focus on performance analysis and tuning.
- Required and gained skills are listed below (see “Skills”).
Date & Venue
Date: December 9th – December 11th, 2024
Venue: IT Center of RWTH Aachen University, Kopernikusstr. 6, 52074 Aachen, seminar rooms 3 + 4 (or online via Zoom/Webex)
Registration
Registration link (via our NHR4CES website): https://eveeno.com/aixcelerate24
During registration, please specify whether you want to participate in person or online. We need this information to organize sufficient seating and catering and to send the corresponding participation information. If you can no longer attend, please withdraw your registration in the registration system using the link from your confirmation e-mail (or let us know). Thanks!
Registration closing date: November 25th, 2024
Note: Registration is closed.
Agenda
The agenda is subject to slight changes.
Analyzing Performance of ML Codes
Day 1: Monday, December 9th, 2024
Time | Topic | Speaker |
---|---|---|
Organization | ||
9:00 – 9:05 | Welcome | Sandra Wienke (RWTH) |
9:05 – 9:20 | Application for Compute Time at RWTH Aachen University | Tim Cramer (RWTH) |
Performance Analysis of ML Codes | ||
9:20 – 9:50 | Verifying GPU Performance with the RWTH Job Monitoring | Christian Wassermann (RWTH) |
9:50 – 10:20 | Coffee Break | |
10:20 – 11:20 | Performance Analysis on GPUs with Nsight Systems | Fabian Berressem (NVIDIA) |
Code Activities (on-premise only) | ||
11:20 – 12:30 | Code Activities (BYO Code: Tuning with Experts) | |
12:30 – 13:30 | Lunch Break | |
13:30 – 15:00 | Code Activities (BYO Code: Tuning with Experts) | |
15:00 – 15:30 | Coffee Break | |
15:30 – 17:30 | Code Activities (BYO Code: Tuning with Experts) |
Scaling ML Codes across Multiple GPUs/Nodes
Day 2: Tuesday, December 10th, 2024
Time | Topic | Speaker |
---|---|---|
9:00 – 9:05 | Welcome | Sandra Wienke (RWTH) |
9:05 – 9:15 | Multi-GPU Setup on CLAIX (see examples below) | Jannis Klinkenberg (RWTH) |
9:15 – 10:00 | Using ‘PyTorch Distributed’ (see examples below) | Fabian Berressem (NVIDIA) |
10:00 – 10:30 | Using TensorFlow and Horovod (see examples below) | Jannis Klinkenberg (RWTH) |
10:30 – 11:00 | Coffee Break | |
Code Activities (on-premise only) | ||
11:00 – 12:30 | Code Activities (BYO Code: Tuning with Experts) | |
12:30 – 13:30 | Lunch Break | |
13:30 – 15:00 | Code Activities (BYO Code: Tuning with Experts) | |
15:00 – 15:30 | Coffee Break | |
15:30 – 17:30 | Code Activities (BYO Code: Tuning with Experts) |
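As a minimal illustration of the ‘PyTorch Distributed’ approach covered on day 2, a hypothetical data-parallel training script (one process per GPU, launched e.g. with `torchrun --nproc_per_node=<gpus> train_ddp.py`) might look like the sketch below; model, dataset, and hyperparameters are placeholders, and the actual examples used in the workshop are part of the course material.

```python
# Hypothetical sketch: data-parallel training with DistributedDataParallel (DDP).
# Launch with, e.g.: torchrun --nproc_per_node=<gpus> train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model and synthetic dataset as stand-ins for a real workload
    model = torch.nn.Linear(32, 2).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(dataset)        # shards the data across ranks
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                 # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()      # DDP all-reduces the gradients
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```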
Handling Datasets of ML Codes
Day 3: Wednesday, December 11th, 2024
Time | Topic | Speaker |
---|---|---|
9:00 – 9:05 | Welcome | Jannis Klinkenberg (RWTH) |
9:05 – 10:00 | Storage and I/O Options on CLAIX (see examples below) | Dominik Viehhauser (RWTH), Jannis Klinkenberg (RWTH) |
10:00 – 10:15 | Checkpointing of ML Codes (see examples below) | Dominik Viehhauser (RWTH) |
10:15 – 10:45 | Coffee Break | |
Code Activities (on-premise only) | ||
10:45 – 12:30 | Code Activities (BYO Code: Tuning with Experts) | |
12:30 – 13:30 | Lunch Break | |
13:30 – 15:00 | Code Activities (BYO Code: Tuning with Experts) | |
15:00 – 15:30 | Coffee Break | |
15:30 – 17:30 | Code Activities (BYO Code: Tuning with Experts) |
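To illustrate the checkpointing topic of day 3, the sketch below saves and restores model and optimizer state at fixed intervals with PyTorch; file name, interval, and model are hypothetical placeholders, and the actual workshop examples are provided in the course material.

```python
# Hypothetical sketch: interval-based checkpointing of a PyTorch training run.
import torch

def save_checkpoint(model, optimizer, epoch, path="checkpoint.pt"):
    # Store everything needed to resume training after a failure
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1          # epoch to resume from

# Usage inside a (toy) training loop: checkpoint every second epoch
model = torch.nn.Linear(32, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for epoch in range(10):
    # ... run one training epoch here ...
    if epoch % 2 == 0:
        save_checkpoint(model, optimizer, epoch)
```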
Course Material
- Examples for all three days (zip archive)
- Please find the slides linked in the agenda.
Skills
Course level: beginner to intermediate
Target audience:
- ML users
- ML developers
Prerequisites:
- Basic knowledge of GPU hardware architectures
- Knowledge of the machine learning models that are relevant for you
- Knowledge of how to use those models, e.g., with PyTorch or TensorFlow
- Basic knowledge of parallelism
Gained skills:
- Knowledge of automatic performance monitoring on the RWTH HPC cluster
- Usage of the Nsight Systems tool to analyze performance of GPU codes
- Usage of techniques to run ML codes on multiple GPUs or multiple nodes
- Knowledge of various data storage options and their advantages and disadvantages for ML codes
Sponsoring
Catering is sponsored by NEC and NVIDIA.