Performance profiling
In a performance engineering context, performance profiling means relating measurements of performance metrics to the execution of source code. The data sources are typically the operating system, a runtime system, or measurement facilities in the hardware. The following explanations focus on metrics based on hardware performance monitoring.
Introduction
Every modern processor provides so-called hardware performance monitoring (HPM) units that can measure events or metrics. This article focuses on HPM-related performance metrics. Using HPM units requires dedicated profiling tools. HPM metrics give a very detailed view of software-hardware interaction and introduce little or no overhead. Every serious performance engineering effort should use an HPM tool for profiling. A very good overview of the HPM capabilities of many X86 processor architectures can be found in the Likwid Wiki.
HPM units consist of programmable counters in different parts of the chip. On Intel processors, every core has at least 4 general-purpose counters, plus many more counters in different parts of the Uncore (the part of the chip that is shared by the cores).
There are two basic ways to use HPM units:
- End-to-end measurements: A counter is programmed and started, and it counts everything executed on its part of the hardware. The counter can be read while running or after being stopped. The advantage is that the measurement itself introduces no overhead and is very accurate. The drawback is that only averages over regions of code can be obtained. To measure specific regions, an instrumentation API usually must be used, and the code must be pinned to specific processors. In addition, only one fixed event set can be measured per run. The Likwid tool likwid-perfctr is based on this approach.
- Sampling-based measurements: Events are related to source code by statistical sampling. Counters are configured and started, and when a counter exceeds a configured value an interrupt is triggered that reads out the program counter. This information is stored and analysed later. Sampling-based tools introduce overhead through the interrupts and the additional bookkeeping during the measurement. Measurement errors are also possible because the result rests on statistical evaluation. The advantages are that the code needs to be neither pinned nor instrumented, the complete application can be measured and analysed in one run, and measuring multiple events is no problem. Most advanced tools employ sampling. Sampling requires extensive kernel support, which on Linux is accessible through the perf interface.
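The statistical nature of sampling can be illustrated with a toy simulation. The sketch below is not how any real tool is implemented: it assumes a made-up event trace in which each entry names the code region that caused one counted event, and it pretends that the counter overflows after a fixed number of events. The region names, the trace, and the function estimate_counts are hypothetical.

```python
from collections import Counter


def estimate_counts(event_trace, period):
    """Estimate per-region event counts from simulated overflow samples.

    event_trace: one region name per counted event, in execution order.
    period: number of events after which the simulated counter overflows.
    """
    # Every `period`-th event triggers the "interrupt"; only the region that
    # is executing at that moment is recorded.
    samples = Counter(event_trace[period - 1::period])
    # Each recorded sample stands for `period` events.
    return {region: n * period for region, n in samples.items()}


# Hypothetical trace: 10000 events spread over three made-up regions.
trace = ["kernel_a"] * 7000 + ["kernel_b"] * 2500 + ["kernel_c"] * 500
exact = dict(Counter(trace))
estimate = estimate_counts(trace, period=300)

for region in exact:
    print(region, "exact:", exact[region], "estimated:", estimate.get(region, 0))
```

In this toy setup the estimates deviate from the exact counts because each sample represents a whole block of events. A shorter period reduces the deviation but means more interrupts, which mirrors the overhead trade-off described above.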
A major difficulty for software developers is choosing the right raw events to accurately measure the metrics they are interested in. Most of these metrics cannot be measured directly but require a set of events from which so-called derived metrics are computed. In addition, processor vendors usually take no responsibility for events that count incorrectly. Users have to rely on their tool choosing the right event sets and on its results having been validated.
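As a minimal sketch of what a derived metric is, the following example computes CPI and memory bandwidth from a handful of raw counts. The dictionary keys are hypothetical placeholders, not real event names of any processor, the numbers are invented, and the 64-byte cache line size is an assumption that should be checked for the target architecture.

```python
CACHE_LINE_BYTES = 64  # assumption: cache line size, typical for X86


def derived_metrics(raw, runtime_s):
    """Compute derived metrics from raw event counts and the wall-clock time.

    raw: dict with the hypothetical keys "cycles", "instructions",
         "mem_lines_read" and "mem_lines_written".
    runtime_s: runtime of the measured region in seconds.
    """
    # Data volume follows from the number of cache lines moved to and from memory.
    volume_bytes = (raw["mem_lines_read"] + raw["mem_lines_written"]) * CACHE_LINE_BYTES
    return {
        "CPI": raw["cycles"] / raw["instructions"],
        "memory_volume_GB": volume_bytes / 1e9,
        "memory_bandwidth_GB/s": volume_bytes / 1e9 / runtime_s,
    }


# Invented example counts for a region that ran for 2 seconds.
raw_counts = {
    "cycles": 4.0e9,
    "instructions": 2.0e9,
    "mem_lines_read": 1.5e8,
    "mem_lines_written": 0.5e8,
}
metrics = derived_metrics(raw_counts, runtime_s=2.0)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Which raw events actually feed such formulas differs between architectures, which is why validated event sets shipped with a tool are valuable.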
Strategy
HPM makes it possible to measure resource utilisation and the decomposition of executed instructions, and to run a diagnostic analysis of software-hardware interaction. We recommend measuring resource utilisation and instruction decomposition first, for all regions at the top of the runtime profile.
Metrics to measure (typical metric in parentheses):
- Operation throughput (Flops/s)
- Overall instruction throughput (CPI)
- Instruction counts broken down by instruction type (FP instructions, loads and stores, branch instructions, other instructions)
- Instruction counts broken down by SIMD width (scalar, SSE, AVX, AVX512 for X86). On most architectures this is restricted to arithmetic instructions.
- Data volumes and bandwidth to main memory (GB and GB/s)
- Data volumes and bandwidth to different cache levels (GB and GB/s)
Useful diagnostic metrics are:
- Clock (GHz)
- Power (W)
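The SIMD-width breakdown and the operation throughput are closely linked. The sketch below derives a double-precision Flops/s figure and a vectorization ratio from hypothetical instruction counts per SIMD width. It assumes that each counted instruction performs one operation per 64-bit lane (1, 2, 4, 8 for scalar, SSE, AVX, AVX512); how fused multiply-add instructions are counted depends on the actual event and is not modelled here. All names and numbers are illustrative.

```python
# Assumption: double-precision operations per instruction for each SIMD width.
FLOPS_PER_INSTRUCTION_DP = {"scalar": 1, "sse": 2, "avx": 4, "avx512": 8}


def flop_metrics(fp_instructions, runtime_s):
    """Derive throughput and vectorization ratio from per-width FP counts.

    fp_instructions: dict mapping a SIMD width name to an instruction count.
    runtime_s: runtime of the measured region in seconds.
    """
    flops = sum(FLOPS_PER_INSTRUCTION_DP[width] * count
                for width, count in fp_instructions.items())
    total = sum(fp_instructions.values())
    packed = total - fp_instructions.get("scalar", 0)
    return {
        "GFlops/s": flops / 1e9 / runtime_s,
        # Share of FP instructions that are SIMD (packed) instructions.
        "vectorization_ratio": packed / total if total else 0.0,
    }


# Invented counts for a region that ran for 1.5 seconds.
counts = {"scalar": 1.0e9, "sse": 0.0, "avx": 2.0e9, "avx512": 0.0}
result = flop_metrics(counts, runtime_s=1.5)
print(result)
```

A low vectorization ratio combined with a low Flops/s value is a typical hint that a loop was not vectorized as intended.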
Tools
likwid-perfctr
The Likwid tools provide the command line tool likwid-perfctr for measuring HPM events as well as other data sources such as RAPL counters. likwid-perfctr supports all modern X86 architectures and has early support for Power and ARM processors. It is available for the Linux operating system. Because it performs end-to-end measurements only, the application must be pinned to cores; affinity control is already built into the tool, though. Some notable features are:
- Lightweight tool that is easy to learn
- Event support for core and uncore counters that is as complete as possible
- Uses a flexible thread group syntax for specifying which cores to measure
- Portable performance groups with preconfigured event sets and validated derived metrics
- Offers its own user-space implementation based on the low-level MSR kernel interface, as well as a perf backend
- Functionality is also available as part of the Likwid library API
- Marker API can also be used as a very accurate runtime profiler
- Multiple modes:
  - Wrapper mode (end-to-end measurement of an application run)
  - Stethoscope mode (measure events on a set of cores for a specified duration)
  - Timeline mode (time-resolved measurement that outputs performance metrics at a specified interval, given in ms or s)
  - Marker API (lightweight C and F90 API with region markers; this is what is usually used in full-scale production codes)
All recommended metrics in the strategy section can be measured using the MEM_DP/MEM_SP, BRANCH, DATA, L2 and L3 performance groups.
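Because only one event set can be measured per run, covering several performance groups means running the application once per group. The sketch below only builds the command lines and does not execute anything. It assumes the likwid-perfctr options -C (pin and measure on the given cores), -g (performance group) and -m (enable the Marker API) as commonly documented; the options and the groups available on a given machine should be checked with the tool's own help output. The executable name ./a.out, the core list, and the helper function are placeholders.

```python
def perfctr_commands(executable, cores, groups, marker=False):
    """Build one likwid-perfctr command line per performance group.

    executable: path of the application to measure (placeholder here).
    cores: core list in likwid syntax, e.g. "0-3".
    groups: performance group names, one application run per group.
    marker: add -m so that only Marker API regions are reported.
    """
    commands = []
    for group in groups:
        cmd = ["likwid-perfctr", "-C", cores, "-g", group]
        if marker:
            cmd.append("-m")
        cmd.append(executable)
        commands.append(cmd)
    return commands


# Groups named in the text above; availability depends on the architecture.
commands = perfctr_commands("./a.out", "0-3",
                            ["MEM_DP", "BRANCH", "DATA", "L2", "L3"],
                            marker=True)
for cmd in commands:
    print(" ".join(cmd))
```

Each printed line can be passed to a shell or to subprocess.run on a system where Likwid is installed.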