
aiXcelerate 2024

Monday, December 9, 9:00 – Wednesday, December 11, 17:30

NHR4CES, in cooperation with HPC.NRW Kompetenznetzwerk, European Digital Innovation Hub (EDIH) Rheinland, and KI-Servicezentrum im Westen Deutschlands

Topic: Machine Learning on NVIDIA GPUs

Description

The annual aiXcelerate event at RWTH Aachen University (NHR4CES@RWTH) is a tuning workshop for HPC users. It comprises lectures that are open to everyone and hands-on parts in which (invited) groups apply the learned concepts to their own codes. This year, aiXcelerate covers the topic “Machine Learning (ML) on NVIDIA GPUs” and focuses on using the GPUs of RWTH’s HPC cluster “CLAIX” with frameworks such as PyTorch or TensorFlow. The workshop provides insights into performance analysis and performance tuning of ML codes (it is not an introduction to ML). aiXcelerate will take place with the support of NVIDIA from December 9th to 11th, 2024.

Presentations (open to everyone)

The talks are distributed across the morning sessions of the three workshop days. The topic of the first day is “Analyzing Performance of ML Codes”: it covers the automatically running RWTH performance monitoring system as well as NVIDIA’s Nsight tool and how to find bottlenecks with it. The second day focuses on “Scaling ML Codes across Multiple GPUs/Nodes”: approaches with PyTorch (Distributed) and TensorFlow + Horovod are presented to speed up ML codes by using more hardware in parallel. On the third day, “Handling Datasets of ML Codes”, we present the different options for storing and using ML data at runtime (on CLAIX). Furthermore, checkpointing of ML codes is introduced to save the application’s state at regular intervals and thus provide fault tolerance.
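The data-parallel idea behind frameworks such as PyTorch Distributed or Horovod can be sketched in a few lines: each worker computes a gradient on its own shard of the batch, and an allreduce-style average keeps all model replicas in sync. The following is a toy, pure-Python illustration of that concept (the function names and the simple loss are hypothetical, not actual framework code):

```python
# Toy illustration of data parallelism (hypothetical, not PyTorch Distributed):
# each worker computes a gradient on its shard of the batch, then the gradients
# are averaged so every model replica applies the same update.

def local_gradient(shard, weight):
    # Hypothetical loss: squared error of weight * x against target 1.0.
    # d/dw (w*x - 1)^2 = 2 * (w*x - 1) * x, averaged over the shard.
    return sum(2 * (weight * x - 1.0) * x for x in shard) / len(shard)

def allreduce_average(grads):
    # Stands in for the collective operation (e.g., an NCCL allreduce) that
    # PyTorch Distributed or Horovod performs once per training step.
    return sum(grads) / len(grads)

def data_parallel_step(batch, weight, n_workers, lr=0.1):
    # Split the batch round-robin across workers, compute per-shard gradients,
    # average them, and apply one SGD update.
    shards = [batch[i::n_workers] for i in range(n_workers)]
    grads = [local_gradient(s, weight) for s in shards]
    return weight - lr * allreduce_average(grads)
```

With equally sized shards, one averaged step over several workers produces the same update as a single worker processing the full batch, which is why adding hardware speeds up training without changing the result.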

Talks are offered in hybrid form: the presenters give their talks in person on the premises of the IT Center at RWTH Aachen University, and participants are very welcome to attend in person, too. Additionally, the presentations are streamed live, so participants can also attend remotely. Participants need to choose their setup (on-premise or online) during the registration process.

Code & Tuning Activities (limited seats)

The sessions on “Code and Tuning Activities” (see agenda) target users with their own ML codes (e.g., using PyTorch or TensorFlow) and follow the bring-your-own (BYO) code principle. Participants of these sessions work closely with our HPC/ML experts and receive their support to, e.g., analyze and tune the performance of their codes, scale their codes across multiple GPUs, or improve the data handling of their codes.

Code and tuning activities are scheduled for all three days of the aiXcelerate workshop to have sufficient time to work on the BYO codes. They start after the presentations in the late morning session and continue throughout the remainder of the day. Since participants will work together with our experts, this part can only be attended in-person and on-premise.

This part focuses on ML users working on the RWTH HPC cluster CLAIX. We (will) invite ML users with suitable compute-time projects on CLAIX. Nevertheless, other interested people are also welcome to register for this part of the workshop. However, since seats are limited, we reserve the right to accept only certain projects/users for this workshop part.

Requirements

To make good use of the time with our HPC/ML experts, it is necessary that…

  • … the code you bring already runs on CLAIX (software & hardware environment).
  • … a (test) dataset is available that (a) runs on one or a few GPUs for a short time (e.g., a few minutes) and (b) still captures the performance profile of a regular dataset by triggering the same production parts of the application.
  • … all needed data has already been transferred to one of CLAIX’s file systems (HOME, WORK, HPCWORK).
  • … at least one developer of the targeted application has registered. Several people using the same application may also attend (please register separately).
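Requirement (b) amounts to shrinking the dataset without changing the code path: keep only a handful of samples so a run finishes in minutes, while batching and the rest of the training loop execute exactly as in production. A minimal sketch of this idea (the names `full_dataset` and `train_steps` are placeholders, not part of the workshop material):

```python
# Hypothetical sketch of a short-running test dataset: truncate the data,
# keep the production code path intact.

def make_test_subset(full_dataset, n_samples=64):
    # Keep only the first n_samples so one or a few GPUs finish quickly,
    # while preprocessing, batching, forward and backward passes run as usual.
    return full_dataset[:n_samples]

def train_steps(dataset, batch_size=8):
    # Stand-in for the real training loop: one "step" per batch.
    return [dataset[i:i + batch_size] for i in range(0, len(dataset), batch_size)]
```

A subset built this way still triggers the same kernels and I/O patterns per step, so profiles recorded on it remain representative of a full production run.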

Support

The HPC/ML experts will work together with the participants (ML developers). Our experts are from the HPC team of the IT Center (NHR4CES@RWTH) and will be kindly supported by the HPC and ML expert Fabian Berressem from NVIDIA.

Organization

  • There is no workshop fee.
  • Presentations will be given in English. Slides will be available during or after the event (see “Course Material” below).
  • Presentations are held in hybrid form (see above). Please choose your setup during the registration process.
  • Code activities are in-person on-premise only.
  • Presentations and code activities focus on using the GPUs of RWTH’s HPC cluster “CLAIX”.
  • Note: This is not an introduction to machine learning! We assume that you already have knowledge of ML and want to focus on performance analysis and tuning.
  • Needed and gained skills are mentioned below.

Date & Venue

Date: December 9th – December 11th, 2024
Venue: IT Center of RWTH Aachen University, Kopernikusstr. 6, 52074 Aachen, seminar rooms 3 + 4 (or online via Zoom/Webex)

Registration

Registration link (via our NHR4CES website): https://eveeno.com/aixcelerate24

During the registration, please specify whether you want to participate in-person or online. We need this information to organize sufficient seating and catering, and send the participation information to the corresponding participants. Please also withdraw your registration in the registration system (or let us know) if you cannot attend anymore. Thanks!

Registration closing date: November 25th, 2024

Agenda

The agenda is subject to slight changes.

Analyzing Performance of ML Codes

Day 1: Monday, December 9th, 2024

Time | Topic | Speaker
Organization
9:00 – 9:05 | Welcome | Christian Terboven (RWTH)
9:05 – 9:20 | Application for Compute Time at RWTH Aachen University | Tim Cramer (RWTH)
Performance Analysis of ML Codes
9:20 – 9:50 | Verifying GPU Performance with the RWTH Job Monitoring | Christian Wassermann (RWTH)
9:50 – 10:20 | Coffee Break
10:20 – 11:20 | Performance Analysis on GPUs with Nsight Systems | Fabian Berressem (NVIDIA)
Code Activities (on-premise only)
11:20 – 12:30 | Code Activities (BYO Code: Tuning with Experts)
12:30 – 13:30 | Lunch Break
13:30 – 15:00 | Code Activities (BYO Code: Tuning with Experts)
15:00 – 15:30 | Coffee Break
15:30 – 17:30 | Code Activities (BYO Code: Tuning with Experts)

Scaling ML Codes across Multiple GPUs/Nodes

Day 2: Tuesday, December 10th, 2024

Time | Topic | Speaker
9:00 – 9:05 | Welcome | Christian Terboven (RWTH)
9:05 – 9:15 | Multi-GPU Setup on CLAIX | tba (RWTH)
9:15 – 10:00 | Using ‘PyTorch Distributed’ | Fabian Berressem (NVIDIA)
10:00 – 10:30 | Using TensorFlow and Horovod | Jannis Klinkenberg (RWTH)
10:30 – 11:00 | Coffee Break
Code Activities (on-premise only)
11:00 – 12:30 | Code Activities (BYO Code: Tuning with Experts)
12:30 – 13:30 | Lunch Break
13:30 – 15:00 | Code Activities (BYO Code: Tuning with Experts)
15:00 – 15:30 | Coffee Break
15:30 – 17:30 | Code Activities (BYO Code: Tuning with Experts)

Handling Datasets of ML Codes

Day 3: Wednesday, December 11th, 2024

Time | Topic | Speaker
9:00 – 9:05 | Welcome | Christian Terboven (RWTH)
9:05 – 10:00 | Storage and I/O Options on CLAIX | Dominik Viehhauser (RWTH), Jannis Klinkenberg (RWTH)
10:00 – 10:15 | Checkpointing of ML Codes | Dominik Viehhauser (RWTH)
10:15 – 10:45 | Coffee Break
Code Activities (on-premise only)
10:45 – 12:30 | Code Activities (BYO Code: Tuning with Experts)
12:30 – 13:30 | Lunch Break
13:30 – 15:00 | Code Activities (BYO Code: Tuning with Experts)
15:00 – 15:30 | Coffee Break
15:30 – 17:30 | Code Activities (BYO Code: Tuning with Experts)
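The checkpointing topic of day 3 (saving the application's state at regular intervals for fault tolerance) can be sketched in a framework-agnostic way. This is an assumption of how such a loop could look, using the standard library for illustration; real ML codes would typically use framework facilities such as `torch.save` instead of `pickle`:

```python
# Framework-agnostic sketch of interval checkpointing with resume-on-restart.
import os
import pickle
import tempfile

def save_checkpoint(state, path):
    # Write atomically: dump to a temp file, then rename, so a crash during
    # writing never corrupts the last good checkpoint.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def train(n_steps, ckpt_path, every=100):
    state = {"step": 0, "weights": [0.0]}
    if os.path.exists(ckpt_path):           # resume after a fault
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)
    for step in range(state["step"], n_steps):
        state["weights"][0] += 0.01         # stand-in for a real update
        state["step"] = step + 1
        if state["step"] % every == 0:
            save_checkpoint(state, ckpt_path)
    return state
```

If the job is killed mid-run, restarting it with the same checkpoint path resumes from the last saved step instead of step 0, so at most `every` steps of work are lost.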

Course Material

The course material is coming soon.

Skills

Course level: beginner to intermediate

Target audience:

  • ML users
  • ML developers

Prerequisites:

  • Basic knowledge of GPU hardware architectures
  • Knowledge of machine learning models (that are important for you)
  • Knowledge of how to use those models, e.g., with PyTorch or TensorFlow
  • Basic knowledge of parallelism

Gained skills:

  • Knowledge of automatic performance monitoring on the RWTH HPC cluster
  • Usage of the Nsight Systems tool to analyze performance of GPU codes
  • Usage of techniques to run ML codes on multiple GPUs or multiple nodes
  • Knowledge of various data storage options and their advantages and disadvantages for ML codes
