High performance computing (HPC) is largely concerned with parallelizing programs, especially complex simulations such as weather forecasting or climate models. To ensure that computations finish quickly, the work is divided among processes that run in parallel. Depending on the use case, these processes must exchange intermediate results with one another at program runtime.
The Message Passing Interface (MPI) is a standard for exchanging messages between processes, possibly running on different computers, in programs that use distributed memory. If MPI is used incorrectly, errors can occur, some of which are very difficult to detect. The Marmot Umpire Scalable Tool (MUST) is a runtime correctness analysis tool that automatically detects usage of MPI that does not conform to the standard. MUST is being developed by the HPC group at the IT Center in cooperation with Lawrence Livermore National Laboratory and TU Darmstadt, and it is available as open source software.
Application of MUST
MUST is placed in front of the target application, where it intercepts the application's MPI calls and checks them for correctness while the program runs.
A typical error when using MPI is a deadlock. In computer science, a deadlock is a condition in which at least two processes are each waiting for resources that are held by another of those processes.
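To make the idea of interception concrete, here is a minimal Python sketch of a checking layer that sits in front of a call, inspects its arguments, and then forwards it. It is an analogy only: it uses neither MPI nor MUST, and the names `intercept`, `check_send`, `toy_send`, and `REPORTS` are hypothetical.

```python
import functools

REPORTS = []  # hypothetical report log filled by the checking layer


def intercept(check):
    """Return a decorator that runs `check` before forwarding the call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Record every problem the check finds, then forward the call.
            for problem in check(*args, **kwargs):
                REPORTS.append(f"{func.__name__}: {problem}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def check_send(buffer, count, dest, world_size):
    """Toy argument checks for a send-like call (assumed rules, not MPI's)."""
    problems = []
    if count < 0:
        problems.append(f"negative count {count}")
    if count > len(buffer):
        problems.append(f"count {count} exceeds buffer length {len(buffer)}")
    if not 0 <= dest < world_size:
        problems.append(f"destination rank {dest} outside 0..{world_size - 1}")
    return problems


@intercept(check_send)
def toy_send(buffer, count, dest, world_size):
    """Stand-in for a real send; it only returns the elements it would send."""
    return buffer[:max(count, 0)]
```

The application code still calls `toy_send` as before; the checks run transparently in the wrapper, which is the property that lets a tool of this kind analyze a program without changes to its source.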
Example: Deadlock
Four cars approach an intersection at the same time, each wanting to be the first to proceed. The right-before-left rule applies, so every car must yield to the car on its right. Since each car has another car to its right, no car can proceed without further communication: a deadlock has occurred. It can only be resolved when someone decides which car is allowed to drive first.
In MPI, such a deadlock means that processes wait indefinitely for a message that would resolve the situation. Because MUST is placed in front of the application, it can detect the deadlock and report it to the developer, who can then correct the program so that the deadlock does not occur again. MUST detects such errors during execution, so even for calculations that run over several days, the developer learns of the error promptly.
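One common way to reason about such a situation is a wait-for graph: every blocked process points to the process it is waiting on, and a cycle in that graph is a deadlock. The following Python sketch illustrates the idea under simplifying assumptions (each blocked process waits on exactly one other process). It is not a description of how MUST works internally, and `find_deadlock` is a hypothetical helper.

```python
def find_deadlock(waits_for):
    """Return the ranks that form a wait cycle, or None if there is none.

    `waits_for` maps each blocked rank to the single rank it is waiting on.
    Ranks that are not blocked do not appear as keys.
    """
    for start in waits_for:
        path = []
        seen = set()
        rank = start
        # Follow the chain of waits until it ends or revisits a rank.
        while rank in waits_for and rank not in seen:
            seen.add(rank)
            path.append(rank)
            rank = waits_for[rank]
        if rank in seen:
            # The chain came back to a rank already on the path: a cycle.
            return path[path.index(rank):]
    return None


# Two processes that both wait to receive from each other before sending.
two_ranks = {0: 1, 1: 0}

# The four cars at the intersection: each waits for the car on its right.
four_cars = {0: 1, 1: 2, 2: 3, 3: 0}

# Rank 0 waits on rank 1, but rank 1 is not blocked, so nothing is stuck.
no_cycle = {0: 1}
```

Calling `find_deadlock(two_ranks)` and `find_deadlock(four_cars)` returns the ranks on the cycle, while `find_deadlock(no_cycle)` returns `None`.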
Further Use Cases of MUST
A deadlock is only one of numerous potential errors in MPI programming. MUST also detects:
– Illegal or incorrect arguments
– Resource usage errors and leaks
– Data type mismatches
– Overlaps in communication buffers
– Data races involving MPI function calls
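As an illustration of one item from this list, overlapping communication buffers, the sketch below treats each pending operation's buffer as a byte range and reports the pairs that share memory. The representation as `(start_address, length_in_bytes)` tuples and the names `overlaps` and `find_overlaps` are assumptions made for the example, not part of MUST. Whether a given overlap is actually an error depends on the operations involved; the sketch simply flags every overlap.

```python
def overlaps(a, b):
    """True if two (start_address, length_in_bytes) ranges share a byte."""
    a_start, a_len = a
    b_start, b_len = b
    return (
        a_len > 0
        and b_len > 0
        and a_start < b_start + b_len
        and b_start < a_start + a_len
    )


def find_overlaps(pending):
    """Return pairs of request names whose buffers overlap.

    `pending` maps a request name to its (start_address, length_in_bytes).
    """
    names = sorted(pending)
    return [
        (first, second)
        for i, first in enumerate(names)
        for second in names[i + 1:]
        if overlaps(pending[first], pending[second])
    ]


# Hypothetical pending operations: two receives share bytes 32..63.
pending = {
    "recv_a": (0, 64),
    "recv_b": (32, 64),
    "send_c": (128, 16),
}
```

Here `find_overlaps(pending)` reports only the pair `("recv_a", "recv_b")`, since `send_c` uses a disjoint range.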
MUST currently provides a comprehensive set of correctness checks for MPI programs, including programs that run with many parallel processes.
Source code and documentation of MUST are available on our website and more information on MUST can be found on the HPC wiki.
Simon Schwitanski and Janin Vreydal are responsible for the content of this article.