Application benchmarking

Overview

Application benchmarking is an elementary skill for any performance engineering effort. Because it is the basis for any other activity, it is crucial to measure results in an accurate, deterministic and reproducible way. The following components are required for meaningful application benchmarking:

  • Timing: How to accurately measure time in software.
  • Documentation: Because many factors influence performance, it is essential to document everything that is potentially performance-relevant.
  • System configuration: Modern systems allow adjusting many performance-relevant settings such as clock speed, memory settings, cache organisation, and OS settings.
  • Resource allocation and affinity control: Which resources are used and how work is mapped onto them.

Because so many things can go wrong while benchmarking, it is important to maintain a sceptical attitude towards good results. Especially for very good results, one has to check whether the result is reasonable. Furthermore, results must be deterministic and reproducible; if required, the statistical distribution over multiple runs has to be documented.

A prerequisite for any benchmarking activity is to get a quiet, EXCLUSIVE SYSTEM!

In the following, all examples use the Likwid Performance Tools for tool support.

Preparation

At the beginning, it must be defined which configuration and/or test case is examined. Especially with larger codes offering a wide range of functionality, this is essential. Application benchmarking requires running the code under observation many times with different settings or variants. A test case should therefore have a runtime that is long enough to be measured reliably but short enough to allow a quick turnaround cycle. Ideally, a benchmark runs from several seconds to a few minutes.

For really large complex codes, one can extract performance-critical parts into a so-called proxy app which is easier to handle and benchmark, but still resembles the behaviour of the real application code.

After deciding on a test case, a performance metric must be specified. A performance metric is usually useful work per time unit and allows comparing the performance of different test cases or setups. If it is difficult to define an application-specific work unit, the inverse runtime (one over time) or MFlops/s might be a fallback solution. Examples of useful work units are requests answered, lattice site updates, voxel updates, or frames rendered.
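
As an illustration (the problem size, iteration count and runtime below are made up, and MLUP/s, million lattice site updates per second, is just one possible work-based metric), such a rate can be computed directly from the work done and the measured wallclock time:

#include <stdio.h>

/* Hypothetical example: nx*ny*nz lattice sites are updated per iteration. */
double mlups(int nx, int ny, int nz, int iterations, double runtime)
{
    double updates = (double)nx * ny * nz * iterations;
    return updates / runtime / 1.0e6;   /* million lattice site updates per second */
}

int main(void)
{
    /* Made-up numbers, purely for illustration */
    printf("Performance: %.1f MLUP/s\n", mlups(200, 200, 200, 100, 12.5));
    return 0;
}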

Timing

For benchmarking, an accurate so-called wallclock timer (end-to-end stop watch) is required. Every timer has a minimal time resolution that it can measure. Therefore, if the code region to be measured runs for a shorter time than this, the measurement must be extended (e.g. by repeating the region) until it reaches a duration that the timer can resolve. There are OS-specific routines (POSIX and Windows) as well as programming-model- and programming-language-specific solutions available. The latter have the advantage of being portable across operating systems. In any case, one has to read the documentation of the implementation used to understand the exact properties of the routine.
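
To check which resolution the timer in use actually provides, POSIX offers clock_getres(); a minimal sketch for a Linux system (older glibc versions may require linking with -lrt) looks like this:

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec res;

    /* Query the resolution of the monotonic clock used for wallclock timing */
    if (clock_getres(CLOCK_MONOTONIC, &res) != 0) {
        perror("clock_getres");
        return 1;
    }
    printf("Timer resolution: %ld s %ld ns\n", (long)res.tv_sec, res.tv_nsec);
    return 0;
}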

Recommended timing routines are

  • clock_gettime(), a POSIX-compliant timing function (man page: https://linux.die.net/man/3/clock_gettime), which is recommended as a replacement for the widespread gettimeofday()
  • MPI_Wtime and omp_get_wtime, the standardized programming-model-specific timing routines for MPI and OpenMP (see the sketch after this list)
  • Timing in instrumented Likwid regions based on cycle counters for very short measurements
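
A minimal sketch for omp_get_wtime() (the loop is only a placeholder workload; compile with OpenMP enabled, e.g. -fopenmp). An MPI code would use MPI_Wtime() in exactly the same way:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    double start, end, sum = 0.0;

    start = omp_get_wtime();            /* wallclock time in seconds */

    #pragma omp parallel for reduction(+:sum)
    for (long i = 1; i <= 100000000L; i++) {
        sum += 1.0 / (double)i;         /* placeholder workload */
    }

    end = omp_get_wtime();
    printf("sum = %f, time = %f s\n", sum, end - start);
    return 0;
}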

While there are also programming-language-specific solutions (e.g. in C++ and Fortran), it is recommended to use the OS solution. In the case of Fortran, this requires providing a wrapper function for the C call (see the example below).

Examples

Calling clock_gettime

Put the following code in a C module.

#include <time.h>

/* Return the current wallclock time in seconds as a double.
   CLOCK_MONOTONIC is not affected by system clock adjustments. */
double mysecond()
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1.e-9;
}

You can use it in your code like this:

double S, E;

S = mysecond();
/* Your code to measure */
E = mysecond();

printf("Time: %f s\n",E-S);

Fortran example

For Fortran, just add the following wrapper to the above C module. You may have to adjust the name mangling to your Fortran compiler (here a single trailing underscore is assumed). Then you can link your Fortran application against the object file.

/* Fortran wrapper: most Fortran compilers append a trailing
   underscore to external symbol names. */
double mysecond_()
{
    return mysecond();
}

Use in your Fortran code as follows:

! Declare mysecond explicitly; with implicit typing it would be REAL
DOUBLE PRECISION s, e, mysecond

 s = mysecond()
! Your code
 e = mysecond()

print *, "Time: ",e-s,"s"

Example code

This example code contains a ready-to-use timing routine with C and F90 examples as well as a more advanced timer C module based on the RDTSC instruction.
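
The archive is the reference for that module; as a rough sketch of the underlying idea only (assuming GCC or Clang on an x86 CPU with an invariant TSC, and a made-up nominal frequency that would have to be calibrated, e.g. against clock_gettime(), on a real system), the time stamp counter can be read with a compiler intrinsic:

#include <stdio.h>
#include <x86intrin.h>

/* Assumed nominal TSC frequency in Hz; must be determined for the actual CPU. */
#define TSC_FREQ_HZ 2.4e9

int main(void)
{
    unsigned long long start, end;

    start = __rdtsc();                  /* read time stamp counter */
    /* Your (very short) code region to measure */
    end = __rdtsc();

    printf("Cycles: %llu, time: %e s\n",
           end - start, (double)(end - start) / TSC_FREQ_HZ);
    return 0;
}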

You can download an archive containing working timing routines with examples here: https://github.com/RRZE-HPC/Code-teaching/releases/download/v1.0-demos/timing-demo-1.0.zip

Documentation

Without proper documentation of code generation, system state and runtime modalities, it can be difficult to reproduce performance results. Best practice is to automate the logging of build settings, system state and runtime settings using benchmark scripts. Still, too much automation might also introduce errors or hinder a fast workflow due to inflexibility in benchmarking or a lack of transparency about what actually happens. Therefore it is recommended to also execute steps by hand in addition to automated benchmark execution.

System configuration

Affinity control

Best practices