Performance profiling

From HPC Wiki
Jump to navigation Jump to search

In a performance engineering context performance profiling means to relate performance metric measurements to source code execution. Data sources are typically either operating system, execution environments or measurement facilities in the hardware. The following explanations focus on hardware performance monitoring (HPM) based metrics.

Introduction

Every modern processor has support for so called hardware performance monitoring (HPM) units. To use HPM units dedicated profiling tools need to be used. HPM metrics give a very detailed view on software-hardware interaction and add almost no overhead. Every serious performance engineering effort should use a HPM tool for profiling. A very good overview about HPM capabilities of many X86 processor architectures is given in the Likwid Wiki.

HPM units consist of programmable counters in different parts of the chip. Every Intel processor for example has at least 4 general purpose counters per core plus many more counters in different parts of the uncore (the part of the chip that is shared by multiple cores).

There are two basic ways to use HPM units:

  • End-to-end measurements: A counter is programmed and started. It measures everything executed on its part of the hardware. The counter can be read while running or after being stopped. The advantage is that no overhead is introduced during the measurement. The measurement is very accurate but only provides averages. To measure code regions an instrumentation API has to be used and the code has to be pinned to specific processors. Also only one fixed event set can be measured at a time. The Likwid tool likwid-perfctr is based on this approach.
  • Sampling based measurements: Events are related to source code by statistical sampling. Counters are configured and started and when they exceed a configured threshold an interrupt is triggered reading out the program counter. This information is stored and later analysed. Sampling based tools introduce overhead by triggering interrupts and additional book keeping during the measurement. There is also the possibility of measurement errors due to statistical errors. Advantages are that a code does not need to be pinned nor instrumented. The complete application can be measured and analysed in one run. Also measuring multiple event sets is no problem. Most advanced tools employ sampling. Sampling functionality requires extensive kernel support but is accessible using the Linux Perf interface.

It is often difficult for software developers to choose the correct set of raw events to accurately measure the interesting high-level metrics. Most of those metrics cannot be directly measured, but require a set of raw events, from which derived metrics can be computed. Also processor vendors usually take no responsibility for wrong event counts. One has to trust the tool to choose the right event sets and employ validation of the results.

Recommended usage

HPM allows to measure resource utilization, executed instruction decomposition, as well as diagnostic analysis of software-hardware interaction. We recommend to measure resource utilisation and instruction decomposition first for all regions at the top of the runtime profiling list.

Metrics to measure (Typical metric units in parentheses):

  • Operation throughput (Flops/s)
  • Overall instruction throughput (CPI)
  • Instruction counts broken down to instruction types (FP instructions, loads and stores, branch instructions, other instructions)
  • Instruction counts broken down to SIMD width (scalar, SSE, AVX, AVX512 for X86). This is restricted to arithmetic instruction on most architectures.
  • Data volumes and bandwidth to main memory (GB and GB/s)
  • Data volumes and bandwidth to different cache levels (GB and GB/s)

Useful diagnostic metrics are:

  • Clock (GHz)
  • Power (W)

Tools

likwid-perfctr

The Likwid tools provide the command line tool likwid-perfctr for measuring HPM events as well as other data sources as e.g. RAPL counters. likwid-perfctr supports all modern X86 architectures as well as early support for Power and ARM processors. It is only available for the Linux operating system. Because it performs end-to-end measurements it requires to pin the application to cores, affinity control is already built into the tool through. Some notable features are:

  • Lightweight tool with low learning curve
  • As far as possible full event support for core and uncore counters
  • Uses flexible thread group syntax for specifying which cores to measure.
  • Portable performance groups with preconfigured event sets and validated derived metrics
  • Offers own user space implementation using low level MSR kernel interface as well as perf backend
  • Functionality is also available as part of the Likwid library API
  • Marker API can also be used as very accurate runtime profiler
  • Multiple modes:
    • Wrapper mode (end to end measurement of application run)
    • Stethoscope mode (measure for specified duration events on set of cores)
    • Timeline mode (Time resolved measurement outputs performance metric in specified frequency (can be ms or s))
    • Marker API (Lightweight C and F90 API with region markers. This what is usually used in full scale production codes)

All recommended metrics can be measured using the MEM_DP/MEM_SP, BRANCH, DATA, L2 and L3 performance groups.

Intel Amplifier

Example