Difference between revisions of "ARMPerfReports"

Revision as of 13:13, 22 August 2019

Arm Performance Reports is a tool for characterizing and understanding the performance of both scalar and MPI applications. Results are provided in a single-page HTML file and can be used to identify performance problems such as optimization and scalability issues as well as I/O or network bottlenecks. A major advantage of the tool is its low overhead: it uses Arm MAP's adaptive sampling technology, which keeps the overhead at about 5% even for large-scale applications with thousands of MPI processes.

Supported Platforms

Supported hardware architectures are:

Intel and AMD (x86_64)
Armv8-A (AArch64)
Intel Xeon Phi (KNL)
IBM Power (ppc64 and ppc64le)

Moreover, the following MPI implementations are supported:

Open MPI
MPICH
MVAPICH
Intel MPI
Cray MPT
SGI MPI
HPE MPI
IBM Platform MPI
Bullx MPI
Spectrum MPI

Many different compilers are also supported, including:

GNU C/C++/Fortran
LLVM Clang
Intel Parallel Studio XE
PGI Compiler
Arm C/C++/Fortran Compiler
Cray Compiling Environment
NVIDIA CUDA Compiler
IBM XL C/C++/Fortran Compiler

On Intel and AMD (x86_64) architectures, NVIDIA CUDA applications are also supported. Detailed information about the specific version numbers of the supported platforms can be found in the Arm documentation.

Generating a performance report

To generate a performance report, wrap the provided perf-report command around your normal (MPI) program startup, as in the following example:

$ perf-report mpiexec <mpi-options> a.out

Arm Performance Reports then generates and links the appropriate wrapper libraries before the program starts. At the end of the program run, a performance report is created and saved to your current working directory in both text and HTML format.

Examining a performance report

The basic structure of the performance report is always the same, so different reports can easily be compared with each other. The sections of the report are explained below.

Report summary

In the report summary, the total wall-clock time spent by the program is broken down into three parts:

Compute - time spent running application/library code
MPI - time spent in MPI calls like MPI_Send, MPI_Reduce, MPI_Barrier
I/O - time spent in filesystem I/O like read, write, close

Each contribution is also rated on a scale from negligible and very low up to very high, indicating which breakdown should be examined further.
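The arithmetic behind such a three-way breakdown can be illustrated with a minimal Python sketch. The function name `breakdown` and the timings are hypothetical and chosen only for illustration. The sketch does not reproduce the tool's own measurement method, output format, or rating thresholds.

```python
def breakdown(compute_s, mpi_s, io_s):
    """Return each category's share of the total wall-clock time in percent."""
    total = compute_s + mpi_s + io_s
    if total <= 0:
        raise ValueError("total time must be positive")
    return {
        "Compute": 100.0 * compute_s / total,
        "MPI": 100.0 * mpi_s / total,
        "I/O": 100.0 * io_s / total,
    }


if __name__ == "__main__":
    # Hypothetical timings in seconds, not taken from a real report.
    shares = breakdown(compute_s=78.0, mpi_s=30.0, io_s=12.0)
    for name, pct in shares.items():
        print(f"{name:8s}{pct:5.1f}%")
```

With these assumed timings the shares are 65% Compute, 25% MPI and 10% I/O, so the CPU breakdown would be the natural place to look first.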

CPU breakdown

In the CPU breakdown, the time spent in application and library code is further broken down into time spent executing different kinds of instructions. These are:

single core code: Percentage of wall-clock time spent using only a single core per process. For multithreaded or OpenMP applications, a high value means that increasing the number of threads will not yield a large performance gain, since the program's performance is bound by Amdahl's Law.
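To make the Amdahl's Law bound concrete, the following Python sketch computes the maximum speedup for a given serial fraction and thread count. The function name `amdahl_speedup` and the 40% figure are assumptions for illustration. Treating the reported single-core percentage as the serial fraction is a simplification, so the results are rough estimates only.

```python
def amdahl_speedup(serial_fraction, threads):
    """Upper bound on speedup if `serial_fraction` of the run time cannot be parallelized."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    # Amdahl's Law: S(n) = 1 / (s + (1 - s) / n)
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads)


if __name__ == "__main__":
    # Hypothetical case: 40% of the time is spent in single-core code.
    serial = 0.40
    for n in (2, 4, 8, 16, 64):
        print(f"{n:3d} threads: speedup <= {amdahl_speedup(serial, n):.2f}")
    # As the thread count grows, the bound approaches 1 / serial.
    print(f"limit for unlimited threads: {1.0 / serial:.2f}")
```

Under this assumption the speedup can never exceed 2.5x, no matter how many threads are added. A high single-core percentage therefore calls for parallelizing more of the code, not for more threads.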
