Tag: ‘Slurm’
Runtime limits for GPU Jobs
On 06.03.2026 we changed how many GPU nodes in the c23g partition are allowed to run user GPU jobs with runtimes longer than 24 hours.
The change has the following main goals:
- Increase the throughput and reduce the waiting times for GPU jobs with runtimes shorter than 24 hours.
- Encourage users to submit shorter jobs and make use of resilience methods like checkpointing if necessary.
- Reduce the maintenance downtime of GPU nodes.
The change effectively limits long-running GPU jobs to half of the GPU nodes in the c23g partition. Long-running GPU jobs are user GPU jobs with runtimes longer than 24 hours. GPU jobs with runtimes of less than 24 hours are considered short and can be scheduled on all available GPU nodes of the c23g partition.
We understand that the waiting times of long-running GPU jobs will increase, and we therefore encourage users to adapt their workflows to shorter-running GPU jobs.
This change is necessary to improve the quality of service for users of the c23g partition and to allow for faster maintenance work on the GPU nodes.
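As a rough illustration, a batch script that requests less than 24 hours of walltime remains eligible for all c23g GPU nodes. The partition name is taken from the announcement above; the job name, application, and checkpoint flag are placeholders for your own workload:

```shell
#!/usr/bin/env bash
# Example of a "short" GPU job (< 24 h) on the c23g partition.
# Application name and checkpoint flag below are placeholders.
#SBATCH --job-name=short-gpu-job
#SBATCH --partition=c23g
#SBATCH --gres=gpu:1
#SBATCH --time=23:59:00        # stay below 24 h to count as a short job
#SBATCH --output=%x-%j.out

# If the workload needs more than 24 h in total, split it into several
# short jobs that each resume from the previous checkpoint, e.g.:
srun ./my_app --resume-from-latest-checkpoint
```

Chaining several such short jobs (each resuming from a checkpoint) avoids the reduced node pool for long-running jobs entirely.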
Slurm Update for Claix
We updated Slurm to a newer, more stable version: 25.05.5
This upgrade fixed issues we had with our scheduling system and internal rights-management database.
We improved how Slurm calculates the priority of pending jobs, based on user feedback and internal metrics. Here is a short summary.
In short:
- Job waiting times will be more predictable and intuitive.
- Longer waiting times will increase the priority of a pending job.
- Jobs will still be able to access resources quickly if recent resource usage quotas are low.
Details:
- After 24 hours of waiting, a pending job will no longer be delayed by newly submitted jobs (i.e. its expected start time will no longer move further into the future).
- Note that software or hardware malfunctions might still delay jobs, but newly submitted jobs will no longer cause delays after 24 hours.
- A pending job with a low fair-share factor might still be delayed by new jobs with higher fair-share factors during the first 24 hours of waiting.
- The project used for a job (default or otherwise) determines its fair-share priority factor based on recent resource usage.
- Projects that have already used their "fair share" of resources will have a lower fair-share priority factor than projects with lower recent resource usage.
- Priorities and fair-share priority factors only matter when comparing jobs waiting for the same resources (e.g. the same partition).
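To see how these factors apply to your own jobs, Slurm's standard inspection tools can be used. These are stock Slurm commands; the exact output columns depend on the site configuration:

```shell
# Show the priority factors (age, fair-share, etc.) of your pending jobs.
sprio -u "$USER"

# Show the fair-share standing of your own accounts/projects.
sshare -U

# Show Slurm's current estimate of when your pending jobs will start.
squeue -u "$USER" --start
```

Comparing the age and fair-share columns of `sprio` output over time makes the behavior described above directly visible.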
Slurm GPU resource allocation changing on 01.11.2025
The CLAIX HPC systems will change the way GPU resources are requested and allocated starting on 01.11.2025.
Users submitting Slurm jobs will no longer be able to request arbitrary amounts of CPU and memory resources when using GPUs on GPU nodes.
Requesting an entire GPU node's memory or all of its CPUs together with only a single GPU will no longer be possible.
Each GPU within a GPU node will have a corresponding strict maximum of CPUs and memory that can be requested.
To obtain more CPUs or memory than this per-GPU maximum, more GPUs will need to be requested as well.
The specific per-GPU limits on GPU nodes will eventually be documented separately.
Users are expected to modify their submission scripts or methods accordingly.
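As a sketch of what an adapted submission script might look like, the per-GPU options below scale CPUs and memory with the number of requested GPUs. The limits shown (24 CPUs and 120 GB per GPU) are purely illustrative placeholders, since the actual per-GPU maxima have not yet been documented:

```shell
#!/usr/bin/env bash
# Hypothetical example: request 2 GPUs plus CPUs/memory within an
# illustrative per-GPU maximum, instead of a whole node's resources.
#SBATCH --partition=c23g
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-gpu=24      # placeholder per-GPU CPU limit
#SBATCH --mem-per-gpu=120G     # placeholder per-GPU memory limit
#SBATCH --time=12:00:00

srun ./my_gpu_app
```

Using `--cpus-per-gpu` and `--mem-per-gpu` instead of node-wide `--ntasks`/`--mem` requests keeps a script valid regardless of the final documented limits, as long as the per-GPU values stay within them.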
This change is driven by our efforts to update the HPC resource billing mechanism to comply with NHR HPC directives.
NHR requires that computing projects apply for CPU and GPU resources independently.
NHR also requires that HPC Centers track the use of these CPU and GPU resources.
Slurm then accounts for these resources independently for each job on our CLAIX nodes.
Therefore, CPU nodes will only track CPU usage (and the equivalent memory), and GPU nodes will only track GPU usage.
The quota tools will eventually reflect this too.

