ARMPerfReports
Arm Performance Reports is a tool for characterizing and understanding the performance of both scalar and MPI applications. Results are presented in a single-page HTML file and can be used to identify performance-affecting problems such as optimization and scalability issues, as well as I/O or network bottlenecks. A major advantage of the tool is its low overhead: it uses Arm MAP's adaptive sampling technology, which results in an overhead of 5% even for large-scale applications with thousands of MPI processes.
Supported Platforms
Supported hardware architectures are:
- Intel and AMD (x86_64)
- Armv8-A (AArch64)
- Intel Xeon Phi (KNL)
- IBM Power (ppc64 and ppc64le)
Moreover, the following MPI implementations are supported:
- Open MPI
- MPICH
- MVAPICH
- Intel MPI
- Cray MPT
- SGI MPI
- HPE MPI
- IBM Platform MPI
- Bullx MPI
- Spectrum MPI
Many compilers are also supported, including:
- GNU C/C++/Fortran
- LLVM Clang
- Intel Parallel Studio XE
- PGI Compiler
- Arm C/C++/Fortran Compiler
- Cray Compiling Environment
- NVIDIA CUDA Compiler
- IBM XL C/C++/Fortran Compiler
On Intel and AMD (x86_64) architectures, NVIDIA CUDA applications are also supported. Detailed information about specific version numbers of the supported platforms can be found here.
Generating a performance report
To generate a performance report, wrap the provided perf-report command around your normal (MPI) program startup, as in the following example:
$ perf-report mpiexec <mpi-options> a.out
Arm Performance Reports will then generate and link the appropriate wrapper libraries before the program starts. At the end of the program run, a performance report is created and saved to your current working directory in both text and HTML format.
Examining a performance report
The basic structure of the performance report is always the same, so that different reports can easily be compared with each other. The different sections of the performance report are explained below.
Report summary
In the report summary, the total wallclock time spent by the program is divided into three parts:
- Compute - time spent running application code
- MPI - time spent in MPI calls
- I/O - time spent in filesystem I/O
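To illustrate how the three parts relate to the total wallclock time, here is a minimal sketch that converts per-category times into percentages. The timings (42.0 s, 13.5 s and 4.5 s) and the function name breakdown are hypothetical and chosen only for this example; the report itself already shows the split, so this is not how the tool computes it.

```shell
#!/bin/sh
# Hypothetical timings in seconds (assumed values, not taken from a real report).
compute=42.0
mpi=13.5
io=4.5

# Print each category as a percentage of the total wallclock time,
# where the total is the sum of the three parts.
breakdown() {
    awk -v c="$compute" -v m="$mpi" -v i="$io" 'BEGIN {
        t = c + m + i
        printf "Compute: %.1f%%\n", 100 * c / t
        printf "MPI:     %.1f%%\n", 100 * m / t
        printf "I/O:     %.1f%%\n", 100 * i / t
    }'
}

breakdown
```

With these assumed numbers the total is 60 s, so the split is 70.0% Compute, 22.5% MPI and 7.5% I/O. A run dominated by the MPI or I/O share would point towards communication or filesystem bottlenecks rather than application code.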