ARMPerfReports

Arm Performance Reports (now part of Arm Forge) is a tool to characterize and understand the performance of both scalar and MPI applications. Results are provided in a single-page HTML file. These results can be used to identify performance problems such as optimization and scalability issues as well as I/O or network bottlenecks. A major advantage of the tool is its low overhead: it uses Arm MAP's adaptive sampling technology, which results in an overhead of about 5% even for large-scale applications with thousands of MPI processes.

Supported Platforms

Supported hardware architectures are:

Intel and AMD (x86_64)
Armv8-A (AArch64)
Intel Xeon Phi (KNL)
IBM Power (ppc64 and ppc64le)

Moreover, the following MPI implementations are supported:

Open MPI
MPICH
MVAPICH
Intel MPI
Cray MPT
SGI MPI
HPE MPI
IBM Platform MPI
Bullx MPI
Spectrum MPI

Many compilers are also supported, including:

GNU C/C++/Fortran
LLVM Clang
Intel Parallel Studio XE
PGI Compiler
Arm C/C++/Fortran Compiler
Cray Compiling Environment
NVIDIA CUDA Compiler
IBM XL C/C++/Fortran Compiler

On Intel and AMD (x86_64) architectures, NVIDIA CUDA applications are also supported. Detailed information about specific version numbers of the supported platforms can be found in the Arm Performance Reports User Guide (see References).

Generating a performance report

To generate a performance report, wrap the provided perf-report command around your normal (MPI) program startup, as in the following example:

$ perf-report mpiexec <mpi-options> a.out

Arm Performance Reports will then generate and link the appropriate wrapper libraries before the program starts. At the end of the program run, a performance report is created and saved to your current working directory in plain text as well as HTML format. Look for files with names like <BinaryName>_NNp_Mn_Tt_YYYY-MM-DD_HH-MM.[txt|html] where

<BinaryName> is (part of) the name of the executable,
NNp = number of MPI ranks,
Mn = number of hosts/nodes,
Tt = number of threads ('1' for non-hybrid runs),
YYYY-MM-DD_HH-MM = date and time of the test run.
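
As a rough illustration of the naming scheme above, the following shell sketch splits a report file name into its components. The helper name parse_report_name and the sample file name are invented for this example, and the pattern assumes that the name follows the scheme exactly as described.

```shell
# Sketch: split a report file name of the form
#   <BinaryName>_NNp_Mn_Tt_YYYY-MM-DD_HH-MM.[txt|html]
# into its parts. parse_report_name is a hypothetical helper, not part of the tool.
parse_report_name() {
  # Prints the components on one line; returns 1 if the name does not match.
  parsed=$(printf '%s\n' "$1" | sed -E -n \
    's/^(.+)_([0-9]+)p_([0-9]+)n_([0-9]+)t_([0-9]{4}-[0-9]{2}-[0-9]{2})_([0-9]{2}-[0-9]{2})\.(txt|html)$/binary=\1 ranks=\2 nodes=\3 threads=\4 date=\5 time=\6 format=\7/p')
  [ -n "$parsed" ] || return 1
  printf '%s\n' "$parsed"
}

# Example with an invented file name:
parse_report_name "a.out_96p_4n_1t_2021-03-15_10-42.html"
# prints: binary=a.out ranks=96 nodes=4 threads=1 date=2021-03-15 time=10-42 format=html
```

This can be handy for sorting or grouping many reports by rank or node count before comparing them.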

Examining a performance report

The basic structure of the performance report is always the same, so different reports can easily be compared with each other. The sections of the report are explained below.

Report summary

In the report summary, the total wall-clock time spent by the program is broken down into three parts:

Compute - time spent running application/library code
MPI - time spent in MPI calls like MPI_Send, MPI_Reduce, MPI_Barrier
I/O - time spent in filesystem I/O like read, write, close

Each contribution is also rated from negligible to very high to indicate which breakdown needs to be examined further. For each breakdown, potential performance problems are identified and optimization advice is given.

CPU breakdown

In the CPU breakdown, the time spent in application and library code is further broken down into time spent executing different kinds of instructions. The following metrics are used (only available on x86_64 architectures):

Single core code: Percentage of wall-clock time spent using only a single core per process. For multithreaded or OpenMP applications, a high value means that increasing the number of threads will not lead to a large performance gain, since the program's performance is bound by Amdahl's Law.
OpenMP code: Percentage of wall-clock time spent in OpenMP parallel regions (only shown if the program spent a measurable amount of time inside an OpenMP region).
Scalar numeric ops: Percentage of time spent executing scalar arithmetic instructions (e.g. add, mul, div).
Vector numeric ops: Percentage of time spent executing vectorized arithmetic instructions (e.g. Intel's SSE/AVX extensions). If possible most of the time should be spent here in order to fully exploit the capabilities of modern processors.
Memory accesses: Percentage of time spent in memory access operations (e.g. mov, load, store). High values indicate a memory-bound application. Analyzing the memory access patterns of compute-heavy loops and optimizing for them can increase the performance of the program significantly.
Waiting for accelerators: Percentage of time spent waiting for the accelerator.
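
To make the Amdahl's Law remark for the single core code metric more concrete, here is a small worked example. It assumes that the reported single-core percentage can be treated as the serial fraction s of the run, which is a simplification, and uses the standard bound speedup = 1 / (s + (1 - s) / n) for n threads. The function name amdahl_speedup and the sample values are illustrative only.

```shell
# amdahl_speedup SERIAL_PERCENT THREADS
# Upper bound on the speedup over a single thread according to Amdahl's Law.
# Hypothetical helper for illustration; not part of Arm Performance Reports.
amdahl_speedup() {
  awk -v s="$1" -v n="$2" \
    'BEGIN { f = s / 100; printf "%.2f\n", 1 / (f + (1 - f) / n) }'
}

amdahl_speedup 20 8     # 20% single-core time, 8 threads:  prints 3.33
amdahl_speedup 20 64    # same code, 64 threads:            prints 4.71
```

Under this assumption, a run with 20% single-core time cannot exceed a speedup of 1 / 0.2 = 5, no matter how many threads are added.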

CPU metrics breakdown

This breakdown contains information about key CPU performance measurements gathered using the Linux perf event subsystem. The following metrics are gathered (only available on Armv8 and IBM Power systems):

Cycles per instruction: Average number of CPU cycles elapsed for each retired instruction.
Stalled cycles: Percentage of CPU cycles that elapsed without an instruction being issued.
L2 cache misses: Percentage of L2 data cache accesses that were a miss.
L3 cache miss per instruction: Ratio of L3 data cache misses to instructions completed.
FLOPS scalar lower bound: A lower bound for the rate at which floating-point scalar operations are performed.
FLOPS vector lower bound: A lower bound for the rate at which floating-point vector operations are performed.
Memory accesses: Rate at which the processor's data cache was reloaded from local, remote or distant memory.

OpenMP breakdown

In this section, the time spent in OpenMP constructs is further broken down to identify performance problems related to OpenMP. If the code spent a measurable amount of time inside OpenMP regions, the following contributions are shown:

Computation: Percentage of time threads spent on actual computation rather than waiting or sleeping. This value should be as high as possible to ensure good scalability of the OpenMP code. If it is already high and there is still a performance issue, consult the CPU breakdown to find out whether the CPU cores are mostly performing floating-point operations or waiting for memory accesses.
Synchronization: Percentage of time threads in OpenMP regions spent waiting or sleeping. High values indicate load imbalances or threading that is too fine-grained.
Physical core utilization: Values greater than 100% indicate that the number of OpenMP threads is larger than the number of physical cores available. This may reduce performance because more time is spent inside OpenMP synchronization constructs.
System load: Ratio of active (running or runnable) threads to the number of physical CPU cores. Values above 100% indicate an oversubscription caused by using too many OpenMP threads or by other system processes taking away CPU time from your program. A value smaller than 100% may indicate that the program does not take full advantage of all available CPU resources.
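
The system load metric can be read as a simple ratio. The following sketch reproduces that arithmetic for two invented cases. The tool measures these values itself, so the helper names system_load_percent and describe_load are hypothetical and serve only to show how to interpret the number.

```shell
# system_load_percent ACTIVE_THREADS PHYSICAL_CORES
# Ratio of active threads to physical cores as an integer percentage.
system_load_percent() {
  echo $(( 100 * $1 / $2 ))
}

# describe_load ACTIVE_THREADS PHYSICAL_CORES
# Classify the load the same way the metric description does.
describe_load() {
  load=$(system_load_percent "$1" "$2")
  if [ "$load" -gt 100 ]; then
    echo "$load% oversubscribed"
  elif [ "$load" -lt 100 ]; then
    echo "$load% underutilized"
  else
    echo "$load% fully used"
  fi
}

describe_load 48 24   # prints: 200% oversubscribed
describe_load 12 24   # prints: 50% underutilized
```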

MPI breakdown

If the program spends a significant amount of time in MPI calls, the MPI breakdown shows in which kind of calls the time is spent. The rates below are measured from the process to the MPI API. If a multithreaded program makes MPI calls from multiple threads (i.e. the MPI thread environment was initialized with MPI_THREAD_SERIALIZED or MPI_THREAD_MULTIPLE), only MPI calls made on the main thread are considered in the computation of the metrics below.

Time in collective calls: Percentage of time spent in collective MPI operations (e.g. MPI_Scatter, MPI_Reduce, MPI_Barrier).
Time in point-to-point calls: Percentage of time spent in point-to-point MPI operations (e.g. MPI_Send, MPI_Recv).
Effective process collective rate: Average per-process transfer rate during collective operations.
Effective process point-to-point rate: Average per-process transfer rate during point-to-point operations. Overlapping communication and computation using asynchronous calls (e.g. MPI_Isend) may achieve high effective transfer rates.

I/O breakdown

The amount of time spent in I/O-related calls is further broken down into the following parts:

Time in reads: Percentage of time spent on average in read operations.
Time in writes: Percentage of time spent on average in write and sync operations as well as opening and closing files.
Effective process read rate: Average transfer rate during read operations. Cached reads have a much higher read rate than reads from a physical disk. Optimizing the program for cached reads can improve performance significantly.
Effective process write rate: Average transfer rate during write and sync operations.

Moreover, if a Lustre filesystem is mounted, additional I/O metrics are gathered by a Lustre client process running on each node. This means these metrics cover not only the I/O operations performed by the profiled application but also those of other applications. However, for an I/O-intensive HPC application that reads and writes a large amount of data, the contributions of background processes are negligible and the Lustre data gives a good estimate. The Lustre metrics include:

Lustre read transfer: Number of bytes read per second from Lustre.
Lustre write transfer: Number of bytes written per second to Lustre.
Lustre file opens: Number of file open operations per second on a Lustre filesystem.
Lustre metadata operations: Number of metadata operations per second on a Lustre filesystem.

Lustre stores metadata separately from the usual data. This metadata is updated whenever files are opened, closed or resized. Frequent metadata operations may slow down I/O to Lustre since they increase the latency of data accesses.

Memory breakdown

The memory breakdown summarizes memory usage across all processes and nodes over the entire duration.

Mean process memory usage: Average amount of memory used per process across the entire length of the job.
Peak process memory usage: Peak memory usage seen by one process at any moment during the job. Significant differences between this and the mean process memory usage may indicate a load imbalance or a memory leak on one process.
Peak node memory usage: Peak percentage of memory used on any single node during the entire run. Values close to 100% may indicate performance loss caused by swapping between main memory and disk. Low values indicate that it may be more efficient to run the job with a smaller number of nodes but a larger workload per node.

Accelerator breakdown

If your program runs on an accelerator using NVIDIA CUDA, this section summarizes how the accelerator was utilized during the program run. The following metrics are reported:

GPU utilization: Average percentage of the GPUs that were being used per node.
Global memory accesses: Average percentage of time spent reading or writing to global (device) memory.
Mean GPU memory usage: Average amount of memory used on the GPU cards.
Peak GPU memory usage: Maximum amount of memory used on the GPU cards.

Energy breakdown

The energy consumption of your program is shown in the energy breakdown. It is further broken down into the following metrics:

CPU: Percentage of the total energy used by the CPUs.
Accelerator: Percentage of the total energy used by the accelerators.
System: Percentage of the total energy used by components of the system other than the CPUs and the accelerators.
Mean node power: Average of the mean power consumption of all nodes in Watts.
Peak node power: Highest power consumption measured on any one of the nodes.

Limited number of licenses

The number of license tokens is usually limited; you need one token per MPI rank. Some tokens could be in use by other users.

To analyze an application run with more processes than there are license tokens available, you can start as many processes as you want and analyze only some of them. In this case, start the Allinea client via a shell script on the processes that are to be analyzed. A wrapper script 'start-with-allinea.sh' that analyzes the first 64 MPI ranks when using Open MPI can look like this:

#!/bin/bash
# Start the first 64 ranks through the Allinea client, all other ranks directly.
if test "$OMPI_COMM_WORLD_RANK" -lt 64
then
  /rwthfs/rz/SW/ddt/forge-20.1.2-RHEL7/bin/allinea-client "$@"
else
  "$@"
fi

With Intel MPI, the corresponding environment variable for the process rank is $PMI_RANK (instead of $OMPI_COMM_WORLD_RANK).
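
The two rank variables mentioned above could be combined so that one wrapper works with both Open MPI and Intel MPI. The following is only a sketch: ALLINEA_CLIENT and MAX_PROFILED_RANKS are hypothetical variable names introduced here, the fallback assumes that allinea-client can be found on the PATH, and the script assumes that the MPI launcher sets one of the two rank variables.

```shell
#!/bin/bash
# Sketch of a wrapper that works with Open MPI and Intel MPI.
# ALLINEA_CLIENT and MAX_PROFILED_RANKS are hypothetical names for this example.

# Print the MPI rank of this process, or nothing if no known variable is set.
mpi_rank() {
  echo "${OMPI_COMM_WORLD_RANK:-${PMI_RANK:-}}"
}

# Succeed if the given rank is known and lower than the given limit.
should_profile() {
  [ -n "$1" ] && [ "$1" -lt "$2" ]
}

# Run the given command, through the client for the first ranks only.
run_rank() {
  if should_profile "$(mpi_rank)" "${MAX_PROFILED_RANKS:-64}"; then
    "${ALLINEA_CLIENT:-allinea-client}" "$@"
  else
    "$@"
  fi
}

# Only act when a command was passed to the wrapper.
if [ "$#" -gt 0 ]; then
  run_rank "$@"
fi
```

If neither rank variable is set, the command is started without the client, so the wrapper degrades to a plain launch instead of failing.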

To use the Allinea client, the application should be linked with the Allinea sampler:

-L/rwthfs/rz/SW/ddt/forge-20.1.2-RHEL7/lib/64 -lmap-sampler -Wl,--eh-frame-hdr

To run the analysis of the linked application with allinea-client:

$ export LD_LIBRARY_PATH=/rwthfs/rz/SW/ddt/forge-20.1.2-RHEL7/lib/64:$LD_LIBRARY_PATH

$ perf-report -manual & sleep 2;  $MPI_BINDIR/mpirun -np 96 bash ./start-with-allinea.sh ./a_allinea.out

With this command and the bash script above, 96 MPI processes are started and only 64 of them are analyzed with Arm Performance Reports.

You can use the same command in your batch script.

Site-specific notes

RWTH

To use the Arm Performance Reports tool on the RWTH Cluster, the corresponding module needs to be loaded. The tool is part of the Arm Forge toolset in the DEVELOP module group, which needs to be loaded first using the following command:

$ module load DEVELOP

Then the installed versions of the tool can be shown by:

$ module avail forge

Finally, Arm Forge can be loaded using the command:

$ module load forge/<version>

If you omit the <version>, the module system will load a default version of Arm Forge.

Note about interactive measurements of MPI applications: the MPIEXEC wrapper used on the RWTH Cluster would add another wrapping level. Avoid this by calling mpiexec from $MPI_BINDIR directly:

$ perf-report $MPI_BINDIR/mpiexec <mpi-options> a.out

Note about measurements in the (SLURM) batch system: here the batch-system-specific way of starting MPI jobs must be used, as the MPI vendor's 'mpiexec' does not necessarily understand the batch environment. Start the MPI application the same way as usual and add 'perf-report' in front:

$ perf-report $MPIEXEC $FLAGS_MPI_BATCH a.out

Otherwise your measurement over multiple nodes will not be started properly.

References

Arm Performance Reports User Guide