Application benchmarking

Overview

Application benchmarking is an elementary skill for any performance engineering effort. Because it is the basis for every other activity, it is crucial to measure results in an accurate, deterministic and reproducible way. The following components are required for meaningful application benchmarking:

  • Timing: How to accurately measure time in software.
  • Documentation: Because there are many influences, it is essential to document all possible performance-relevant influences.
  • System configuration: Modern systems allow adjusting many performance-relevant settings like clock speed, memory settings, cache organisation as well as OS settings.
  • Resource allocation and affinity control: What resources are used and how work is mapped onto resources.

Because so many things can go wrong while benchmarking, it is important to maintain a sceptical attitude towards good results. Especially for very good results one has to check whether the result is plausible. Furthermore, results must be deterministic and reproducible; if required, the statistical distribution over multiple runs has to be documented.

A prerequisite for any benchmarking activity is to get an EXCLUSIVE SYSTEM!

In the following, all examples use the Likwid Performance Tools for tool support.

Preparation

At the beginning, the configuration and/or test case to be examined must be defined. Especially for larger codes with a wide range of functionality this is essential. Application benchmarking requires running the code under observation many times with different settings or variants. A test case therefore should have a runtime that is long enough to be measured reliably but short enough for a quick turnaround cycle. Ideally a benchmark runs from several seconds to a few minutes.

For really large complex codes, one can extract performance-critical parts into a so-called proxy app which is easier to handle and benchmark, but still resembles the behaviour of the real application code.

After deciding on a test case, a performance metric must be specified. A performance metric is usually useful work per time unit and allows comparing the performance of different test cases or setups. If it is difficult to define an application-specific work unit, inverse runtime or MFlop/s may serve as a fallback. Examples of useful work are requests answered, lattice site updates, voxel updates, or frames per second.
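
As an illustration, a stencil or lattice code could compute such a metric like this (a minimal sketch; the function and parameter names are hypothetical, not from a specific code):

/* Performance in MLUP/s (million lattice site updates per second)
 * for nx*ny lattice sites updated over iter sweeps in time seconds. */
double mlups(long nx, long ny, long iter, double time)
{
    return (double)nx * (double)ny * (double)iter / time * 1.0e-6;
}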

Timing

For benchmarking, an accurate so-called wallclock timer (end-to-end stop watch) is required. Every timer has a minimal time resolution it can measure. If the code region to be measured runs shorter than this, the measurement must be extended until it reaches a duration the timer can resolve. There are OS-specific routines (POSIX and Windows) as well as programming-model- or programming-language-specific solutions available. The latter have the advantage of being portable across operating systems. In any case, one has to read the documentation of the implementation used to understand the exact properties of the routine.

Recommended timing routines are

  • clock_gettime(), the POSIX-compliant timing function (man page), which is recommended as a replacement for the widespread gettimeofday()
  • MPI_Wtime and omp_get_wtime, the standardized programming-model-specific timing routines for MPI and OpenMP (see the sketch after this list)
  • Timing in instrumented Likwid regions based on cycle counters for very short measurements
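
Both programming-model routines return seconds as a double-precision value and are used the same way. A minimal OpenMP sketch follows (MPI_Wtime is called identically between MPI_Init and MPI_Finalize; compile with OpenMP enabled, e.g. -fopenmp):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    double s = omp_get_wtime();
    /* code region to measure */
    double e = omp_get_wtime();
    printf("Time: %f s\n", e - s);
    return 0;
}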

While there are also programming-language-specific solutions (e.g. in C++ and Fortran), it is recommended to use the OS solution. In the case of Fortran this requires providing a wrapper function for the C call (see example below).

Examples

Calling clock_gettime

Put the following code in a C module.

#include <time.h>

double mysecond()
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1.e-9;
}

You can use it in your code like this:

double S, E;

S = mysecond();
/* Your code to measure */
E = mysecond();

printf("Time: %f s\n",E-S);

Fortran example

In Fortran, just add the following wrapper to the above C module. You may have to adjust the name mangling for your Fortran compiler. Then link your Fortran application against the object file.

double mysecond_()
{
    return mysecond();
}

Use in your Fortran code as follows:

DOUBLE PRECISION s, e, mysecond

 s = mysecond()
! Your code
 e = mysecond()

print *, "Time: ",e-s,"s"

Example code

This example code contains a ready-to-use timing routine with C and F90 examples as well as a more advanced timer C module based on the RDTSC instruction.

You can download an archive containing working timing routines with examples here.

Documentation

Without proper documentation of code generation, system state and runtime modalities, it can be difficult to reproduce performance results. Best practice is to automate the logging of build settings, system state and runtime settings using benchmark scripts. Still, too much automation may introduce errors or hinder a fast workflow due to inflexibility in benchmarking or opacity of what actually happens. Therefore it is recommended to also execute steps by hand in addition to automated benchmark execution.

Node topology

Knowledge about node topology and properties is essential for planning benchmarks and interpreting results. Important questions to ask are (also see the extended list below):

  • What is the topology and size of all memory hierarchy levels?
  • Which cache levels are private to cores and which are shared?
  • How many and which processors share memory hierarchy levels?
  • What is the NUMA topology? This means how many memory interfaces are there and how many and which processors share each of them?
  • Is SMT available? How many and which processors share a core?
  • On Intel processors: Is Cluster-on-Die (COD) mode enabled? With COD, the memory interfaces within one socket are split into two NUMA domains.

The Likwid tools provide a single tool, likwid-topology, which reports all required topology and memory hierarchy information from a single source.

Example usage:

$ likwid-topology -g
--------------------------------------------------------------------------------
CPU name:	Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz
CPU type:	Intel Xeon Haswell EN/EP/EX processor
CPU stepping:	2
********************************************************************************
Hardware Thread Topology
********************************************************************************
Sockets:		2
Cores per socket:	14
Threads per core:	2
--------------------------------------------------------------------------------
HWThread	Thread		Core		Socket		Available
0		0		0		0		*
1		0		1		0		*
2		0		2		0		*
shortened
53		1		25		1		*
54		1		26		1		*
55		1		27		1		*
--------------------------------------------------------------------------------
Socket 0:		( 0 28 1 29 2 30 3 31 4 32 5 33 6 34 7 35 8 36 9 37 10 38 11 39 12 40 13 41 )
Socket 1:		( 14 42 15 43 16 44 17 45 18 46 19 47 20 48 21 49 22 50 23 51 24 52 25 53 26 54 27 55 )
--------------------------------------------------------------------------------
********************************************************************************
Cache Topology
********************************************************************************
Level:			1
Size:			32 kB
Cache groups:		( 0 28 ) ( 1 29 ) ( 2 30 ) ( 3 31 ) ( 4 32 ) ( 5 33 ) ( 6 34 ) ( 7 35 ) ( 8 36 ) ( 9 37 ) ( 10 38 ) ( 11 39 ) ( 12 40 ) ( 13 41 ) ( 14 42 ) ( 15 43 ) ( 16 44 ) ( 17 45 ) ( 18 46 ) ( 19 47 ) ( 20 48 ) ( 21 49 ) ( 22 50 ) ( 23 51 ) ( 24 52 ) ( 25 53 ) ( 26 54 ) ( 27 55 )
--------------------------------------------------------------------------------
Level:			2
Size:			256 kB
Cache groups:		( 0 28 ) ( 1 29 ) ( 2 30 ) ( 3 31 ) ( 4 32 ) ( 5 33 ) ( 6 34 ) ( 7 35 ) ( 8 36 ) ( 9 37 ) ( 10 38 ) ( 11 39 ) ( 12 40 ) ( 13 41 ) ( 14 42 ) ( 15 43 ) ( 16 44 ) ( 17 45 ) ( 18 46 ) ( 19 47 ) ( 20 48 ) ( 21 49 ) ( 22 50 ) ( 23 51 ) ( 24 52 ) ( 25 53 ) ( 26 54 ) ( 27 55 )
--------------------------------------------------------------------------------
Level:			3
Size:			18 MB
Cache groups:		( 0 28 1 29 2 30 3 31 4 32 5 33 6 34 ) ( 7 35 8 36 9 37 10 38 11 39 12 40 13 41 ) ( 14 42 15 43 16 44 17 45 18 46 19 47 20 48 ) ( 21 49 22 50 23 51 24 52 25 53 26 54 27 55 )
--------------------------------------------------------------------------------
********************************************************************************
NUMA Topology
********************************************************************************
NUMA domains:		4
--------------------------------------------------------------------------------
Domain:			0
Processors:		( 0 28 1 29 2 30 3 31 4 32 5 33 6 34 )
Distances:		10 21 31 31
Free memory:		15409.4 MB
Total memory:		15932.8 MB
--------------------------------------------------------------------------------
Domain:			1
Processors:		( 7 35 8 36 9 37 10 38 11 39 12 40 13 41 )
Distances:		21 10 31 31
Free memory:		15298.2 MB
Total memory:		16125.3 MB
--------------------------------------------------------------------------------
Domain:			2
Processors:		( 14 42 15 43 16 44 17 45 18 46 19 47 20 48 )
Distances:		31 31 10 21
Free memory:		15869.4 MB
Total memory:		16125.3 MB
--------------------------------------------------------------------------------
Domain:			3
Processors:		( 21 49 22 50 23 51 24 52 25 53 26 54 27 55 )
Distances:		31 31 21 10
Free memory:		15876.6 MB
Total memory:		16124.4 MB
--------------------------------------------------------------------------------


********************************************************************************
Graphical Topology
********************************************************************************
Socket 0:
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| |  0 28  | |  1 29  | |  2 30  | |  3 31  | |  4 32  | |  5 33  | |  6 34  | |  7 35  | |  8 36  | |  9 37  | | 10 38  | | 11 39  | | 12 40  | | 13 41  | |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| +--------------------------------------------------------------------------+ +--------------------------------------------------------------------------+ |
| |                                   18 MB                                  | |                                   18 MB                                  | |
| +--------------------------------------------------------------------------+ +--------------------------------------------------------------------------+ |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
Socket 1:
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| | 14 42  | | 15 43  | | 16 44  | | 17 45  | | 18 46  | | 19 47  | | 20 48  | | 21 49  | | 22 50  | | 23 51  | | 24 52  | | 25 53  | | 26 54  | | 27 55  | |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |  32 kB | |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | | 256 kB | |
| +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ +--------+ |
| +--------------------------------------------------------------------------+ +--------------------------------------------------------------------------+ |
| |                                   18 MB                                  | |                                   18 MB                                  | |
| +--------------------------------------------------------------------------+ +--------------------------------------------------------------------------+ |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+

Dynamic processor clock

Older processors could dynamically decrease the clock speed to save power. Newer multicore processors are additionally capable of dynamically overclocking beyond the nominal frequency. On Intel processors this feature is called "turbo mode".

Useful tools in this context are likwid-powermeter and likwid-setFrequencies. The former reports the turbo mode steps, the latter the current frequency settings.

$ likwid-powermeter -i
--------------------------------------------------------------------------------
CPU name:       Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz
CPU type:       Intel Xeon Haswell EN/EP/EX processor
CPU clock:      2.30 GHz
--------------------------------------------------------------------------------
Base clock:     2300.00 MHz
Minimal clock:  1200.00 MHz
Turbo Boost Steps:
C0 3300.00 MHz
C1 3300.00 MHz
C2 3100.00 MHz
C3 3000.00 MHz
C4 2900.00 MHz
C5 2800.00 MHz
C6 2800.00 MHz
C7 2800.00 MHz
C8 2800.00 MHz
C9 2800.00 MHz
C10 2800.00 MHz
C11 2800.00 MHz
C12 2800.00 MHz
C13 2800.00 MHz
--------------------------------------------------------------------------------
Info for RAPL domain PKG:
Thermal Spec Power: 120 Watt
Minimum Power: 70 Watt
Maximum Power: 120 Watt
Maximum Time Window: 46848 micro sec

Info for RAPL domain DRAM:
Thermal Spec Power: 21.5 Watt
Minimum Power: 5.75 Watt
Maximum Power: 21.5 Watt
Maximum Time Window: 44896 micro sec

Info about Uncore:
Minimal Uncore frequency: 2300 MHz
Maximal Uncore frequency: 2300 MHz

Performance energy bias: 7 (0=highest performance, 15 = lowest energy)

--------------------------------------------------------------------------------

System configuration

The following node-level settings can influence performance results.

CPU related

CPU clock

Influence on everything.

Recommended setting: Make sure the acpi_cpufreq driver is used, fix the frequency, and make sure the CPU's power management unit doesn't interfere (check, e.g., with likwid-perfctr)

Turbo mode on/off

Influence on CPU clock.

Recommended setting: for benchmarking deactivate

SMT on/off

Influence on resource sharing within a core.

Recommended setting: Can be left on with modern processors without penalty

Frequency governor (performance,...)

Influence on clock speed ramp-up.

Recommended setting: For benchmarking set so that clock speed is always fixed

Turbo steps

Influence on freq vs. # cores.

Recommended setting: For benchmarking switch off turbo
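
Assuming the likwid tools are installed and frequency control is permitted on the system, the clock can be fixed and turbo disabled from the command line (check likwid-setFrequencies -h for the exact options of your version):

$ likwid-setFrequencies -f 2.3   # fix the core clock to the nominal 2.3 GHz
$ likwid-setFrequencies -t 0     # disable turbo mode
$ likwid-setFrequencies -p       # print the current settings to verify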

Memory related

Transparent huge pages

Influence on (memory) bandwidth.

Recommended setting: /sys/kernel/mm/transparent_hugepage/enabled should be set to ‘always’
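
The active setting is shown in brackets when querying the same file:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never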

Cluster on die (COD) / Sub NUMA clustering (SNC) mode

Influence on L3 and memory latency (memory bandwidth via snoop mode on HSW/BDW).

Recommended setting: Set in BIOS, check using numactl -H or likwid-topology (MSR would be better)

LLC prefetcher

Influence on single-core memory bandwidth.

Recommended setting: Set in BIOS, no way to check without MSR

"Known" prefetchers

Influence on latency and bandwidth of various levels in the cache/memory hierarchy.

Recommended setting: Set in BIOS or likwid-features, query status using likwid-features

NUMA balancing

Influence on (memory) data volume and performance.

Recommended setting: Check /proc/sys/kernel/numa_balancing; if it is 1, page migration is on, otherwise it is off
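
A quick check, and disabling if desired (requires root), looks like this:

$ cat /proc/sys/kernel/numa_balancing
1
$ sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'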

Memory configuration (channels, DIMM frequency, Single Rank/Dual Rank)

Influence on Memory performance.

Recommended setting: Check with dmidecode or look at DIMMs

NUMA interleaving (BIOS setting)

Influence on Memory BW.

Recommended setting: set in BIOS, switch off

Chip/package/node related

Uncore clock

Influence on L3 and memory bandwidth.

Recommended setting: Set it to the maximum supported frequency (e.g., using likwid-setFrequencies), and make sure the CPU's power management unit doesn't interfere (check, e.g., with likwid-perfctr)


QPI Snoop mode

Influence on memory bandwidth.

Recommended setting: Set in BIOS, no way to check without MSR.


Power cap

Influence on frequency throttling.

Recommended setting: Don't use.

Affinity control

Affinity control allows specifying on which execution resources (cores or SMT threads) the workers of an application run. Affinity control is crucial to

  • eliminate performance variation
  • make use of architectural features
  • avoid resource contention

Almost every runtime environment comes with some kind of affinity control. With OpenMP 4 a standardized pinning interface was introduced. Most solutions are based on environment variables. A command-line wrapper alternative is available in the Likwid tools: likwid-pin and likwid-mpirun.
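
As an example, a threaded binary can be pinned either via the standardized OpenMP environment variables or with likwid-pin; the pin expression M0:0-6 matches the 7-core memory domains of the example machine above:

$ OMP_NUM_THREADS=7 OMP_PLACES=cores OMP_PROC_BIND=close ./app
$ likwid-pin -c M0:0-6 ./app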

Best practices

There are two main variation dimensions for application benchmarking: Core count and data set size.

Scaling core count

Scaling the number of workers (and therefore processor cores) tests the parallel scalability of an application and also reveals scaling bottlenecks in the node architecture. To separate influences, good practice is to first scale out inside one memory domain; main memory bandwidth within a memory domain is currently the most important performance-limiting shared resource on compute nodes. After scaling from 1 to n cores within one memory domain, the next step is to scale across memory domains and then across sockets, with the previous case as the baseline for speedup. Finally one scales across nodes, now with the single-node result as the baseline. This helps to separate the different influences on scaling.

One must be aware that there is no way to separate the pure parallel scalability influenced by, e.g., serial fraction and load imbalance. For all scalability measurements the machine should be operated with a fixed clock, which means turbo mode has to be disabled. With turbo mode turned on, the result is additionally influenced by how sensitive the code is to frequency changes. For finding the optimal operating point for production it still might be important to also measure with turbo mode enabled.

When plotting performance results, larger should be better: use either useful work per time or simply inverse runtime as the performance metric.

Besides the performance scaling you should also plot the results as parallel speedup and efficiency. The parallel speedup is defined as

S(N) = T(1) / T(N)

where N is the number of parallel workers and T(N) is the runtime with N workers. Ideal speedup is S(N) = N. The parallel efficiency is defined as

E(N) = S(N) / N.

For example, T(1) = 100 s and T(8) = 25 s gives S(8) = 4 and E(8) = 0.5. A reasonable threshold for acceptable parallel efficiency could be 0.5.

To wrap it up, here is what needs to be done (a sketch of such a run follows the list):

  • Set a fixed frequency
  • Measure the sequential baseline
  • Scale within a memory domain, with the sequential result as baseline
  • Scale across memory domains, with one memory domain as baseline
  • (if applicable) Scale across sockets, with one socket as baseline
  • Scale across nodes, with one node as baseline
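
A minimal sketch of the first step, scaling within one memory domain, assuming a likwid installation and an OpenMP binary ./app (all names are placeholders):

#!/bin/bash
# Scale from 1 to 7 cores within memory domain 0
# (7 cores per domain on the example machine above);
# the n=1 run is the sequential baseline.
for n in 1 2 3 4 5 6 7; do
    OMP_NUM_THREADS=$n likwid-pin -q -c M0:0-$((n-1)) ./app
done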

Scaling data set size

The target of this scaling variation is to ensure that all (or most) data is loaded from a specific memory hierarchy level (e.g. L1 cache, last level cache or main memory). In some cases it is not possible to vary the data set size in fine steps; then the data set size should be varied such that the data is located in each memory hierarchy level in turn. This experiment should initially be performed with one worker and reveals whether runtime contributions from data transfers add to the critical path. If performance is insensitive to where the data is loaded from, the code is likely not limited by data access costs. It is important to verify with hardware performance profiling that measured data volumes are in line with the desired target memory hierarchy level.
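
A minimal sketch of such an experiment for a simple streaming kernel, linked against the mysecond() timer module from the Timing section; the array sizes are assumptions chosen to target L1 (32 kB), L2 (256 kB), L3 (18 MB) and main memory on the example machine above:

#include <stdio.h>
#include <stdlib.h>

extern double mysecond(void);  /* timer from the Timing section */

int main(void)
{
    /* Working set is 2 arrays * 8 byte * N. */
    long sizes[] = { 1000L, 10000L, 1000000L, 100000000L };

    for (int s = 0; s < 4; s++) {
        long N = sizes[s];
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

        /* For small N the repetition technique from the Timing
         * section must be applied to get a resolvable duration. */
        double S = mysecond();
        for (long i = 0; i < N; i++)
            a[i] = a[i] + b[i];
        double E = mysecond();

        printf("N=%ld: %.2f MUP/s\n", N, (double)N / (E - S) * 1.0e-6);
        free(a);
        free(b);
    }
    return 0;
}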

SMT feature

Many processors today support simultaneous multithreading (SMT) as a technology to increase the utilization of instruction-level parallelism (ILP). The processor runs multiple threads (commonly 2, 4 or 8) simultaneously on one core, which gives the instruction scheduler more independent instructions to feed the execution pipelines. SMT may increase the exploitation of ILP but comes at the cost of synchronization penalties. For application benchmarking, good practice is to measure each topological entity once with and once without SMT to quantify the effect.
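
On the example machine above this could look as follows (hardware threads 0-13 are the physical cores of socket 0 and 28-41 their SMT siblings, cf. the likwid-topology output):

$ likwid-pin -c 0-13 ./app        # one thread per physical core
$ likwid-pin -c 0-13,28-41 ./app  # two threads per core (SMT in use)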

Interpretation of results