Alongside theory and experiment, computer-aided simulation is often described as the third pillar of science. It is therefore very important for technical universities such as RWTH Aachen University to keep expanding their capacity and capabilities in high-performance computing. We are delighted that the full installation of the new Aix-la-Chapelle cluster (CLAIX) has been completed! Under the label CLAIX-2023, the new high-performance computer offers powerful Intel Xeon 8468 Sapphire Rapids CPUs, with a total of 96 cores per computing node, for a significant increase in performance. It also has 52 dedicated servers for artificial intelligence and machine learning applications, each with four NVIDIA H100 GPUs, giving a total performance of over 14 PFLOPS in the ML segment alone. (*)
Pilot Phase and Outstanding Features at a Glance
The new system will be made available to researchers at RWTH and at universities throughout Germany. In parallel with the acceptance tests carried out by NEC and IT Center staff, the pilot phase with the first users began in January. These first scientists have already been able to gain experience on the new high-performance computer and to contribute to stable and user-friendly system operation. (**)
The 632 directly water-cooled computing nodes for classic High Performance Computing (HPC) not only deliver a significant increase in performance, but are also state of the art in terms of sustainability and energy efficiency. The two Intel Xeon 8468 Sapphire Rapids CPUs in each computing node provide a total of 96 cores. Compared to the previous system, many applications run around twice as fast in a similar configuration. The nodes come with different memory configurations (256, 512 or 1024 GB RAM), so that jobs can be placed on nodes that match their memory requirements while keeping costs down. The peak performance of this HPC segment is around 4 PFLOPS, and up to 530 million core hours will be allocated each year.
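The core-hour and peak-performance figures can be checked with a short back-of-the-envelope calculation. The sketch below is only an illustration: the node and core counts come from this article, whereas the 2.1 GHz base clock and the 32 double-precision FLOP per cycle and core (two AVX-512 FMA units) are assumptions that do not appear in the article, and the function names are chosen purely for this example.

```python
# Back-of-the-envelope check of the HPC segment figures.
# From the article: 632 nodes, 96 cores per node.
# ASSUMPTIONS (not from the article): 2.1 GHz base clock,
# 32 double-precision FLOP per cycle and core.

NODES = 632
CORES_PER_NODE = 96
HOURS_PER_YEAR = 365 * 24  # non-leap year, no downtime

ASSUMED_CLOCK_GHZ = 2.1
ASSUMED_FLOP_PER_CYCLE = 32


def core_hours_per_year(nodes: int, cores_per_node: int) -> int:
    """Upper bound on core hours if every core ran all year."""
    return nodes * cores_per_node * HOURS_PER_YEAR


def peak_pflops(nodes: int, cores_per_node: int,
                clock_ghz: float, flop_per_cycle: int) -> float:
    """Theoretical peak in PFLOPS for the given per-core assumptions."""
    flops = nodes * cores_per_node * clock_ghz * 1e9 * flop_per_cycle
    return flops / 1e15


if __name__ == "__main__":
    hours = core_hours_per_year(NODES, CORES_PER_NODE)
    peak = peak_pflops(NODES, CORES_PER_NODE,
                       ASSUMED_CLOCK_GHZ, ASSUMED_FLOP_PER_CYCLE)
    print(f"Core hours per year: {hours / 1e6:.1f} million")
    print(f"Estimated peak:      {peak:.2f} PFLOPS")
```

Under these assumptions, the 60,672 cores yield roughly 531 million core hours per year and a peak of about 4.1 PFLOPS, which is in line with the rounded figures quoted above.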
Innovative Infrastructure for AI and ML
To take account of current developments in artificial intelligence and machine learning in particular, 52 servers were also procured specifically for applications in these areas. In addition to the two CPUs, each of these computing nodes is equipped with four very powerful and closely coupled NVIDIA H100 GPUs. With 96 GB of HBM2e memory per GPU and an even more powerful high-speed network in this segment, even very large ML models can be computed. Counting the GPUs alone, the total performance of the ML segment is over 14 PFLOPS.
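The ML segment figures can be put into perspective in the same way. The following sketch uses only numbers from this article (52 nodes, four GPUs per node, 96 GB per GPU, and 14 PFLOPS treated as a lower bound); the per-GPU value it derives is simply that lower bound divided by the GPU count, not a vendor specification, and the function name is chosen for this example.

```python
# Aggregate figures for the ML segment, using only numbers from the article.

ML_NODES = 52
GPUS_PER_NODE = 4
GPU_MEMORY_GB = 96
ML_PEAK_PFLOPS = 14.0  # the article states "over 14", so this is a lower bound


def ml_segment_summary(nodes: int, gpus_per_node: int,
                       gpu_memory_gb: int, peak_pflops: float) -> dict:
    """Return GPU count, total GPU memory and implied per-GPU peak."""
    total_gpus = nodes * gpus_per_node
    return {
        "total_gpus": total_gpus,
        "gpu_memory_per_node_gb": gpus_per_node * gpu_memory_gb,
        "total_gpu_memory_gb": total_gpus * gpu_memory_gb,
        # 1 PFLOPS = 1000 TFLOPS
        "implied_tflops_per_gpu": peak_pflops * 1000 / total_gpus,
    }


if __name__ == "__main__":
    summary = ml_segment_summary(ML_NODES, GPUS_PER_NODE,
                                 GPU_MEMORY_GB, ML_PEAK_PFLOPS)
    for key, value in summary.items():
        print(f"{key}: {value}")
```

This gives 208 GPUs with 384 GB of GPU memory per node and just under 20,000 GB across the segment. The stated total corresponds to a little over 67 TFLOPS per GPU.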
To simulate large models, highly scalable applications generally use a large number of computing nodes in parallel. To ensure that communication between the computing nodes does not become a bottleneck, the entire system was equipped with a very fast NDR InfiniBand RDMA (Remote Direct Memory Access) network. The system is rounded off with new login nodes and a special interactive partition that enables users to start interactive jobs via a JupyterHub without long waiting times. This modern access option makes it easier to enter the world of high-performance computing, especially for the many students and new employees at RWTH. A new high-performance parallel file system (Lustre) with a total capacity of 26 PiB will also be available for storing and processing research data.
Training Courses and Workshops
As the ageing predecessor system CLAIX-2018 has reached the end of its service life, it will be kept in operation only for a short transition phase, for both economic and ecological reasons. To enable the many users to switch quickly and easily, the software stack has already been designed to support CLAIX-2018 and CLAIX-2023 at the same time. The changeover will be accompanied by numerous events. The “Introduction to High-Performance Computing 2024” from February 5 to 6 offered the first opportunity to get to know the system. The “Porting and Tuning Workshop 2024” from February 26 to March 1, 2024 will focus more closely on the new computing architecture and give users the opportunity to work on their scripts and applications together with experts from the IT Center. In addition, “PPCES 2024” in March and the monthly “HPC Consultation Hour” offer plenty of opportunity for a productive exchange and thus round off the close support for the IT Center’s users.
HPC Landing Page
The HPC landing page is a central point of entry for all current and potential users, and for anyone interested in high-performance computing at RWTH. It provides an overview of current research projects, the services and events on offer, and information on using the RWTH systems. On the landing page you will find further information about CLAIX and its use, and you can take a closer look at high-performance computing at RWTH Aachen University.***
Tim Cramer and Christian Terboven are responsible for the content of this article.