Runtime profiling

From HPC Wiki
Latest revision as of 10:27, 5 September 2019

Introduction

The initial task in any performance analysis is to find out where the runtime is spent, so that optimisation effort can focus on the regions of the code that consume significant runtime. Runtime profilers measure how the runtime is distributed across the application code. Two profiler variants exist: instrumentation-based and sampling-based. Instrumentation-based profilers insert function calls that measure the time at specific points in the program; they may also perform additional tasks such as determining the function call stack. While it is possible to insert instrumentation calls at the binary level, the instrumentation calls are usually added during compilation. The standard tool is gprof, and almost every compiler supports instrumenting code for gprof. Statistical sampling-based profiling, on the other hand, probes the program call stack at regular intervals, triggered by operating system interrupts. A widespread tool for sampling-based profiling is perf, which builds on the built-in profiling infrastructure of recent Linux kernels. Both approaches have advantages and disadvantages: instrumentation produces more accurate results but introduces more overhead, while sampling has less overhead but produces less accurate results. Special care is necessary when profiling parallel (OpenMP or MPI) applications and C++ programs (due to name-mangling resolution issues).

How to use gprof

The first step is to compile and link the program with profiling enabled. For most compilers this is done by setting the -pg flag. For Intel compilers it is important to specify the optimisation flag after -pg, because enabling profiling sets the default optimisation level to -O0.

Example for build options with Intel tool chain:

icc -pg -O3 -c  myfile1.c
icc -pg -O3 -c  myfile2.c
icc -o a.out -pg  myfile1.o myfile2.o

When the generated application is executed, it writes a gmon.out file containing the profiling data. To analyse the profile, run the gprof tool with the executable as argument:

gprof ./a.out

Refer to the gprof man page for additional command line options. The result is printed on stdout. It is recommended to either redirect the output to a file or pipe it to the less pager:

gprof ./a.out | less

The default output consists of three parts: a flat profile, the call graph, and an alphabetical index of routines. For most purposes the flat profile is what you are looking for.

Example output of the flat profile for the Mantevo miniMD proxy app:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 66.86     26.14    26.14      502     0.05     0.05  ForceLJ::compute(Atom&, Neighbor&, Comm&, int)
 30.77     38.17    12.03       26     0.46     0.46  Neighbor::build(Atom&)
  1.43     38.73     0.56        1     0.56    38.46  Integrate::run(Atom&, Force*, Neighbor&, Comm&, Thermo&, Timer&)
  0.36     38.87     0.14     2850     0.00     0.00  Atom::pack_comm(int, int*, double*, int*)
  0.15     38.93     0.06     2850     0.00     0.00  Atom::unpack_comm(int, int, double*)
  0.13     38.98     0.05       26     0.00     0.00  Atom::pbc()
  0.10     39.02     0.04                             __intel_ssse3_rep_memcpy
  0.08     39.05     0.03       25     0.00     0.00  Atom::sort(Neighbor&)
  0.08     39.08     0.03        1     0.03     0.03  create_atoms(Atom&, int, int, int, double)
  0.05     39.10     0.02       26     0.00     0.00  Comm::borders(Atom&)
  0.00     39.10     0.00  1221559     0.00     0.00  Atom::pack_border(int, double*, int*)
  0.00     39.10     0.00  1221559     0.00     0.00  Atom::unpack_border(int, double*)
  0.00     39.10     0.00   131072     0.00     0.00  Atom::addatom(double, double, double, double, double, double)
  0.00     39.10     0.00     1025     0.00     0.00  Timer::stamp(int)
  0.00     39.10     0.00      502     0.00     0.00  Thermo::compute(int, Atom&, Neighbor&, Force*, Timer&, Comm&)
  0.00     39.10     0.00      500     0.00     0.00  Timer::stamp()
  0.00     39.10     0.00      475     0.00     0.00  Comm::communicate(Atom&)
  0.00     39.10     0.00       26     0.00     0.00  Comm::exchange(Atom&)
  0.00     39.10     0.00       25     0.00     0.00  Timer::stamp_extra_stop(int)
  0.00     39.10     0.00       25     0.00     0.00  Timer::stamp_extra_start()
  0.00     39.10     0.00       25     0.00     0.00  Neighbor::binatoms(Atom&, int)
  0.00     39.10     0.00        7     0.00     0.00  Timer::barrier_stop(int)
  0.00     39.10     0.00        1     0.00     0.00  create_box(Atom&, int, int, int, double)
  0.00     39.10     0.00        1     0.00     0.00  create_velocity(double, Atom&, Thermo&)

The output is sorted by the time spent in each routine. The interesting columns are self seconds (the time spent in the routine itself), calls (how often the routine was called) and self s/call (how much time was spent per call).
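The flat profile can also be post-processed with standard text tools. A small sketch, assuming the profile was saved with gprof -p -b ./a.out > flat.txt (-p selects only the flat profile, -b suppresses the explanatory text); the sample lines below are trimmed from the output above, with the routine name reduced to a single field:

```shell
# flat.txt is an assumed file name; keep only routines that account
# for more than 1% of the runtime (column 1 is "% time").
cat > flat.txt <<'EOF'
 66.86     26.14    26.14      502     0.05     0.05  ForceLJ::compute
 30.77     38.17    12.03       26     0.46     0.46  Neighbor::build
  0.36     38.87     0.14     2850     0.00     0.00  Atom::pack_comm
EOF
awk '$1 + 0 > 1.0 { print $1 "%", $NF }' flat.txt
```

Note that real gprof output contains full C++ signatures with spaces, so $NF only works on simplified name columns like the sample above.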

How to use perf

Runtime profiling with perf is very simple: you run your executable wrapped in a perf record call:

perf record ./miniMD

After the application has finished, the results can be analysed with

perf report

which opens an ncurses-based presentation of the results:

Samples: 30K of event 'cycles:uppp', Event count (approx.): 20629160088                                                                                                                         
Overhead  Command          Shared Object         Symbol                                                                                                                                         
  64.19%  miniMD-ICC       miniMD-ICC            [.] ForceLJ::compute
  31.54%  miniMD-ICC       miniMD-ICC            [.] Neighbor::build
   1.47%  miniMD-ICC       miniMD-ICC            [.] Integrate::run
   0.67%  miniMD-ICC       [kernel]              [k] irq_return
   0.40%  miniMD-ICC       miniMD-ICC            [.] Atom::pack_comm
   0.35%  mpiexec          [kernel]              [k] sysret_check
   0.21%  miniMD-ICC       miniMD-ICC            [.] create_atoms
   0.18%  miniMD-ICC       miniMD-ICC            [.] Atom::unpack_comm
   0.15%  miniMD-ICC       [kernel]              [k] sysret_check
   0.15%  miniMD-ICC       miniMD-ICC            [.] Comm::borders
   0.10%  miniMD-ICC       miniMD-ICC            [.] __intel_ssse3_rep_memcpy
   0.09%  miniMD-ICC       miniMD-ICC            [.] Atom::sort
   0.07%  miniMD-ICC       miniMD-ICC            [.] Neighbor::binatoms
   0.05%  mpiexec          [kernel]              [k] irq_return
   0.04%  miniMD-ICC       miniMD-ICC            [.] Atom::pbc
   0.03%  miniMD-ICC       miniMD-ICC            [.] Atom::unpack_border
   0.03%  miniMD-ICC       miniMD-ICC            [.] Atom::addatom
   0.02%  miniMD-ICC       miniMD-ICC            [.] Atom::pack_border
   0.02%  hydra_pmi_proxy  [kernel]              [k] sysret_check
   0.01%  miniMD-ICC       miniMD-ICC            [.] create_velocity
   0.01%  mpiexec          libc-2.17.so          [.] vfprintf
   0.01%  miniMD-ICC       ld-2.17.so            [.] _dl_lookup_symbol_x
   0.01%  miniMD-ICC       ld-2.17.so            [.] do_lookup_x
   0.01%  hydra_bstrap_pr  [kernel]              [k] irq_return
   0.01%  hydra_pmi_proxy  [kernel]              [k] irq_return
   0.01%  hydra_bstrap_pr  [kernel]              [k] sysret_check
   0.01%  miniMD-ICC       libmpi.so.12.0.0      [.] MPIR_T_CVAR_REGISTER_impl
   0.01%  miniMD-ICC       libc-2.17.so          [.] getenv
   0.01%  miniMD-ICC       libmpi.so.12.0.0      [.] MPL_wtime

As can be seen, the result is similar but not identical to gprof's. perf also reports time spent in external entities such as the kernel, mpiexec and system libraries.
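Because perf mixes in kernel and launcher samples, it can help to filter a saved text report down to the application binary itself. A sketch under assumed names (report.txt and the binary name miniMD-ICC; a real report would come from perf report --stdio > report.txt):

```shell
# Sample lines trimmed from the output above; keep only samples whose
# shared object (third column) is the application binary.
cat > report.txt <<'EOF'
  64.19%  miniMD-ICC       miniMD-ICC            [.] ForceLJ::compute
   0.67%  miniMD-ICC       [kernel]              [k] irq_return
   0.35%  mpiexec          [kernel]              [k] sysret_check
EOF
awk '$3 == "miniMD-ICC" { print $1, $NF }' report.txt
```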

Runtime profiling for parallel applications

Both gprof and perf are designed for single-process applications. While there are ways to use gprof in an MPI setting (see the LRZ documentation at https://www.lrz.de/services/compute/supermuc/tuning/gprof/), a lightweight alternative to complex full-trace tools is the command line version of Intel Amplifier, which also uses the built-in perf infrastructure of the Linux kernel.

Example usage with MPI:

mpirun -np 1 amplxe-cl -collect hotspots -result-dir myresults --  ../miniMD-ICC  --half_neigh 0

The analysis is printed on stdout:

Elapsed Time: 8.650s
    CPU Time: 8.190s
        Effective Time: 8.190s
            Idle: 0.020s
            Poor: 8.170s
            Ok: 0s
            Ideal: 0s
            Over: 0s
        Spin Time: 0s
        Overhead Time: 0s
    Total Thread Count: 2
    Paused Time: 0s

Top Hotspots
Function                     Module      CPU Time
---------------------------  ----------  --------
ForceLJ::compute_fullneigh   miniMD-ICC    4.940s
Neighbor::build              miniMD-ICC    2.820s
Integrate::finalIntegrate    miniMD-ICC    0.100s
Integrate::initialIntegrate  miniMD-ICC    0.060s
__intel_ssse3_rep_memcpy     miniMD-ICC    0.040s
[Others]                     N/A           0.230s
Effective Physical Core Utilization: 4.9% (0.976 out of 20)
 | The metric value is low, which may signal a poor physical CPU cores
 | utilization caused by:
 |     - load imbalance
 |     - threading runtime overhead
 |     - contended synchronization
 |     - thread/process underutilization
 |     - incorrect affinity that utilizes logical cores instead of physical
 |       cores
 | Explore sub-metrics to estimate the efficiency of MPI and OpenMP parallelism
 | or run the Locks and Waits analysis to identify parallel bottlenecks for
 | other parallel runtimes.
 |  
    Effective Logical Core Utilization: 2.4% (0.976 out of 40)
     | The metric value is low, which may signal a poor logical CPU cores
     | utilization. Consider improving physical core utilization as the first
     | step and then look at opportunities to utilize logical cores, which in
     | some cases can improve processor throughput and overall performance of
     | multi-threaded applications.
     |
Collection and Platform Info
    Application Command Line: ../miniMD-ICC "--half_neigh" "0"
    Operating System: 3.10.0-957.10.1.el7.x86_64 NAME="CentOS Linux" VERSION="7 (Core)" ID="centos" ID_LIKE="rhel fedora" VERSION_ID="7" PRETTY_NAME="CentOS Linux 7 (Core)" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:centos:centos:7" HOME_URL="https://www.centos.org/" BUG_REPORT_URL="https://bugs.centos.org/"  CENTOS_MANTISBT_PROJECT="CentOS-7" CENTOS_MANTISBT_PROJECT_VERSION="7" REDHAT_SUPPORT_PRODUCT="centos" REDHAT_SUPPORT_PRODUCT_VERSION="7"
    Computer Name: e1125
    Result Size: 3 MB
    Collection start time: 13:11:20 02/04/2019 UTC
    Collection stop time: 13:11:29 02/04/2019 UTC
    Collector Type: Driverless Perf per-process counting,User-mode sampling and tracing
    CPU 
        Name: Intel(R) Xeon(R) E5/E7 v2 Processor code named Ivytown
        Frequency: 2.200 GHz
        Logical CPU Count: 40

Of the tools shown here, Intel Amplifier was also the only one that resolved the inlined C++ routine. There are many more collection modes available, including hardware performance monitoring metrics. Please refer to the Intel documentation for more detailed information.

Links and further information

* gprof man page: https://linux.die.net/man/1/gprof
* perf Wiki pages: https://perf.wiki.kernel.org/index.php/Main_Page
* Intel Amplifier command line tool documentation: https://software.intel.com/en-us/vtune-amplifier-help-command-line-interface