Perf

From HPC Wiki

Overview and Installation

The Linux Perf tool offers many ways to measure, monitor, and present performance data. It builds on the perf_event_open system call [1], which the Linux kernel has provided since version 2.6.32.

To install Perf, use the linux-tools-common package on Debian-based systems and the perf package on SuSE.

Some features require special permissions to be granted to users. This can be done by changing the pseudo file /proc/sys/kernel/perf_event_paranoid. According to the perf help text (Linux 4.18), this file can hold the following values with the respective meanings.

  • -1 : Allow use of (almost) all events by all users. Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
  • >= 0 : Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN. Disallow raw tracepoint access by users without CAP_SYS_ADMIN
  • >= 1 : Disallow CPU event access by users without CAP_SYS_ADMIN
  • >= 2 : Disallow kernel profiling by users without CAP_SYS_ADMIN
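
The levels listed above can be checked with a few lines of Python. This is only a sketch: the function names describe_paranoid and read_paranoid are made up for illustration, and the descriptions are shortened from the list above. Some distributions use additional, higher levels, which are treated here like level 2.

```python
from pathlib import Path

# Restrictions as listed above; each one is active when the level is >= its key.
RESTRICTIONS = {
    0: "no ftrace function tracepoints or raw tracepoint access without CAP_SYS_ADMIN",
    1: "no CPU event access without CAP_SYS_ADMIN",
    2: "no kernel profiling without CAP_SYS_ADMIN",
}


def describe_paranoid(level):
    """Return the restrictions that are active for a given paranoid level."""
    if level < 0:
        return ["(almost) all events allowed for all users"]
    return [text for threshold, text in sorted(RESTRICTIONS.items())
            if level >= threshold]


def read_paranoid(path="/proc/sys/kernel/perf_event_paranoid"):
    """Read the current level; return None if the pseudo file does not exist."""
    p = Path(path)
    return int(p.read_text().strip()) if p.exists() else None


if __name__ == "__main__":
    level = read_paranoid()
    if level is None:
        print("perf_event_paranoid not found (not a Linux system?)")
    else:
        print(f"perf_event_paranoid = {level}")
        for line in describe_paranoid(level):
            print(" -", line)
```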

Introduction

This section describes some basics.

Hardware Performance Monitoring Counters (PMCs)

Hardware PMCs are small extensions of processors that usually consist of at least two registers. In the first register, the software (the operating system) selects the processor-internal event to be counted and sets control bits, for example to start and stop the measurement. The second register is incremented each time the event occurs. In addition, the operating system can set up a threshold after which the PMC generates an interrupt, for example after every 1 million instructions or every 300,000 cache misses. This interrupt mechanism is the basis for event-based sampling with perf record.
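
The interplay of count register and threshold can be illustrated with a toy model. This is not how any real PMC is programmed; the class ModelPMC and its attributes are hypothetical and only mimic the behaviour described above in software.

```python
class ModelPMC:
    """Toy model of one PMC: an incrementing register plus an overflow threshold."""

    def __init__(self, period):
        self.period = period      # number of events between two interrupts
        self.count = 0            # the incrementing register
        self.interrupts = []      # counter values at which an interrupt fired

    def event(self, n=1):
        """Register n occurrences of the selected event."""
        for _ in range(n):
            self.count += 1
            if self.count % self.period == 0:
                self.interrupts.append(self.count)


pmc = ModelPMC(period=300_000)    # e.g. one interrupt per 300,000 cache misses
pmc.event(1_000_000)
print(pmc.count, pmc.interrupts)  # 1000000 [300000, 600000, 900000]
```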

Available Events

Use the perf list command to list available events. They fall into several types, which are described below. (The availability of events depends on the kernel version and the permissions.)


Hardware Events

A small list of hardware performance monitoring counter events. Intel defined this set as architectural performance events (Intel Manual 3, Section 19.1 [2]). While these are more or less valid on Intel platforms, they are not necessarily reliable on AMD. For example, some Linux versions counted level-1 instruction cache misses as cache-misses, whereas on Intel this event usually refers to last-level cache misses. Use them carefully: the mapping to the underlying hardware events is chosen by the Linux developers (mostly those of the processor vendor).

Software Events

Software events are not counted via Hardware PMCs, but are generated, monitored, and handled by the operating system.

Hardware Cache Events

These events relate to processor events for different caches, TLBs, and branch prediction. As with Hardware Events, the mapping has to be specified by kernel developers. Some of the events might not be available on a given system, for example if there is no hardware PMC event that corresponds to a given Hardware Cache Event.

Kernel PMU Events

In addition to Hardware PMCs, which are provided per hardware thread, other components of the processor (or devices) can provide their own performance monitoring counters. These include incrementing registers such as RAPL, TSC, MPERF, and APERF, as well as counters of uncore components or iGPUs.

Raw Events

You can also specify hardware PMC events by their actual ID. Refer to the processor manual to find the ID of the event that you want to monitor. Usually, the umask is defined in bits 8-15 and the event is specified in bits 0-7. For the event LD_BLOCKS.STORE_FORWARD on 4th Generation Intel Core processors, the umask and event are 0x02 and 0x03, respectively (Intel Manual 3, Table 19-7 [3]). Hence, the raw event encoding is r203 (umask 0x02 in the upper byte, event 0x03 in the lower byte).
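
With the bit layout given above (umask in bits 8-15, event in bits 0-7), the raw event string can be computed as follows. The helper raw_event is a hypothetical name for this sketch, and it ignores any further fields (such as cmask or inversion bits) that a processor may define in higher bits.

```python
def raw_event(event, umask):
    """Encode an event select and a umask as a perf raw event string."""
    if not (0 <= event <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event and umask must each fit into 8 bits")
    # umask goes to bits 8-15, the event select to bits 0-7
    return f"r{(umask << 8) | event:x}"


# LD_BLOCKS.STORE_FORWARD: event 0x03, umask 0x02
print(raw_event(event=0x03, umask=0x02))  # r203
```

The resulting string can be passed to the -e option of perf stat or perf record like any other event name.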


Tracepoint Events

In addition to incrementing counters, the kernel is instrumented with tracepoints, which let you capture specific events within the kernel. Most of these also provide access to some arguments, which are highly event-specific. For a definition of these events, check your tracefs mountpoint, usually under /sys/kernel/debug/tracing.

Hint: You can define new tracepoints with perf probe.

Measuring Events with perf stat

Use the perf stat command to measure available events. It sets up the hardware and software counters and collects the information either for the given process or for the requested CPU(s) (meaning hardware threads). perf stat provides various command line arguments (see the perf-stat man page [4]). Some of the important ones are:

Selected Arguments

  • -d / --detailed : collect more events, can be given multiple times
  • -I / --interval-print <ms> : print a measurement every <ms> milliseconds
  • -e / --event= : specify the event(s) to be measured
  • -x / --field-separator SEP : print statistics in a CSV-like format, using SEP as the separator
  • -C / --cpu=<cpu-list> : measure the events on the given list of CPUs
  • -A / --no-aggr : do not aggregate counts across all monitored CPUs
  • -a / --all-cpus : monitor all CPUs
  • -o / --output <file> : specify the output file, default: stderr

Examples

perf stat make -j : provides a general overview of how well make -j performed.

perf stat -d -d make -j : provides a more detailed overview of how well make -j performed.

perf stat -a -I 1000 : provides statistics for the whole system every second.

perf stat -a -e instructions -I 1000 -x , -o stat.csv : provides instruction counts for the whole system every second and saves them in stat.csv. This can be used to monitor instructions per second (IPS) over time.
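
A file such as stat.csv can be post-processed to obtain IPS per interval. The sketch below assumes the field order timestamp, count, unit, event name, followed by further fields, and it assumes that each line holds the count for one interval rather than a running total; check both against the output of your perf version. The function name events_per_second and the sample data are made up for illustration.

```python
import csv
import io


def events_per_second(lines, event="instructions"):
    """Yield (timestamp, events per second) for every interval of one event."""
    previous = 0.0
    for row in csv.reader(lines):
        # skip blank lines, comment lines and anything too short to be data
        if len(row) < 4 or row[0].lstrip().startswith("#"):
            continue
        if row[3] != event:
            continue
        timestamp = float(row[0])
        try:
            count = int(row[1])
        except ValueError:          # e.g. "<not counted>"
            previous = timestamp
            continue
        yield timestamp, count / (timestamp - previous)
        previous = timestamp


# Made-up sample data in the assumed layout
sample = """# started on Mon Jan  1 00:00:00 2024

1.000000000,2000000000,,instructions,8000000000,100.00,,
2.000000000,3000000000,,instructions,8000000000,100.00,,
"""

for t, ips in events_per_second(io.StringIO(sample)):
    print(f"{t:.1f} s: {ips:.3e} instructions/s")
```

For a real file, pass an open file object instead of the io.StringIO sample, for example events_per_second(open("stat.csv")).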

Overprovision of Events

If you specify more events than the hardware can count at the same time, the operating system multiplexes them: it measures in time slices and counts a different set of events in each slice. perf reports this to the user. Watch out for percentages in parentheses after the results; they give the share of the time during which the event was actually counted.
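
When events are multiplexed, the reported value is an estimate extrapolated from the time the event was actually counted. The kernel reports two times per event, here called time_enabled and time_running as in the perf_event_open documentation. The following sketch shows the usual scaling; the function names are hypothetical, and the exact rounding done by perf is not reproduced.

```python
def scaled_count(raw_count, time_enabled, time_running):
    """Extrapolate a multiplexed counter value to the full enabled time."""
    if time_running == 0:
        return None               # event was never scheduled: nothing to scale
    return raw_count * time_enabled / time_running


def running_percentage(time_enabled, time_running):
    """Share of the enabled time during which the event was really counted."""
    return 100.0 * time_running / time_enabled


# An event that was counted during half of the measurement
print(scaled_count(1000, time_enabled=100, time_running=50))   # 2000.0
print(running_percentage(time_enabled=100, time_running=50))   # 50.0
```

The lower the percentage, the more the result relies on the assumption that the event rate was the same while the counter was not scheduled.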

Child-processes

When monitoring a process, all child processes and threads will also be monitored. This can be avoided by providing the -i, --no-inherit flag.

Event-based Sampling with perf record and other Tools

With perf record, you can set up event-based sampling. The samples are written to a log file, which is typically processed later with perf report to create a profile.

Caution with File Size: The sample files can get quite large. Do not write them to a slow file system.

Recording with perf record

When recording, the execution of processes is interrupted at intervals. The current state (PID, instruction pointer, time, ...) is captured and stored in the log file. As with perf stat, hardware threads (CPUs) or processes/threads can be monitored. Based on the log file, either a trace or a profile can be shown, or the individual entries of the file can be processed using Python or Perl.

Interpret with Caution! The interrupt interval is not necessarily equidistant in time! The default event, cycles, depends on processor idle states as well as on frequency. Other events that can be used as a base for the interrupt are not equidistant either (e.g., instructions, cache misses). Moreover, the captured instruction pointer can be slightly offset from the instruction that actually caused the event.
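
A small calculation shows why samples are not equidistant in time. It assumes a fixed event period (as set with -c) and a constant event rate within each case; the function name sample_interval_seconds and the chosen frequencies are only illustrative.

```python
def sample_interval_seconds(period, events_per_second):
    """Time between two samples for a fixed event period and event rate."""
    return period / events_per_second


# One sample every 1,000,000 cycles at different core frequencies
for ghz in (1.0, 2.0, 3.0):
    dt = sample_interval_seconds(1_000_000, ghz * 1e9)
    print(f"{ghz:.1f} GHz: one sample every {dt * 1e3:.3f} ms")
```

If the frequency changes during the run, or the core idles and no cycles are counted, the time between two samples changes accordingly.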

Selected Arguments

  • -c / --count= : event period to sample, i.e. one sample is taken every <count> events. Lower -> more samples and more perturbation, higher -> fewer samples
  • -e / --event= : as for perf stat, you can specify events here. For example, if you are interested in functions where cache misses happen, use -e cache-misses
  • -g / --call-graph <mode> : add call path profiling. perf will not only sample the current function, but also the call path; --call-graph additionally selects the unwinding method (fp, dwarf, or lbr). This works better on x86 if the frame pointer is available, so please compile your code with -fno-omit-frame-pointer. If you do not do so, perf has to use other means, which can also influence the monitoring overhead.
  • --user-regs : also capture user registers
  • --switch-events : also capture switch events (for statistics on when exactly processes/threads are switched by the OS)
  • -b / --branch-any : record any branch instruction that is taken. The type of branch instruction can be filtered with -j / --branch-filter. This enables you to "replay" the execution of a program. Please note that there can be an offset for the branch instruction (i.e., the first instruction after the branch could be sampled)
  • -G name,... / --cgroup name,... : monitor a cgroup. Can also be used with SLURM, which can create cgroups for jobs/job steps.
  • -R / --raw-samples : collect the raw samples. Will provide arguments for tracepoint events

Examples

  • perf record ./foo : record the execution of program foo
  • perf record -g ./foo : record the execution of program foo, including the call stack
  • perf record -a sleep 60 : record the whole system for 60 seconds
  • perf record -b ./foo : record the branches taken during the execution of program foo
  • perf record -e instructions -c 1000000 ./foo : take a sample at every 1 millionth instruction during the execution of foo


Profile with perf report

You can show the profile of the created log file with perf report. Depending on the installation, perf report will open a TUI or a GTK interface. If you pipe the output to another program, it falls back to plain stdio output. Within the profile (TUI), you can zoom in and look at the executed code at function level or at instruction level (assembler). Depending on the availability of debug information and the provided flags, the source code is interleaved with the assembler.

Trace and Processing with perf script

In addition to viewing the profile, you can also process the log file with a script. By default, perf script prints all entries of the log file, and you can use command line tools to process them further. However, you usually want to use the Perl or Python interface (depending on your perf installation) to process the individual samples.

Example

  • perf script -g python : generate the stub perf-script.py for processing the log file in Python
  • perf script -s perf-script.py : run perf-script.py on the default log file (perf.data)
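
A handler script for the Python interface could look like the following sketch, which counts samples per command and symbol. It assumes that perf calls trace_begin, process_event, and trace_end as in the generated stub, that process_event receives a dictionary with the keys "comm" and "symbol" for hardware and software events, and that perf is linked against Python 3. The variable samples_per_symbol is a name chosen for this example.

```python
from collections import Counter

# (command name, symbol) -> number of samples
samples_per_symbol = Counter()


def trace_begin():
    """Called once before the first sample is processed."""
    samples_per_symbol.clear()


def process_event(param_dict):
    """Called for every sample that is not handled as a tracepoint."""
    comm = param_dict.get("comm", "?")
    symbol = param_dict.get("symbol", "[unknown]")   # missing if unresolved
    samples_per_symbol[(comm, symbol)] += 1


def trace_end():
    """Called once after the last sample; print the ten hottest symbols."""
    for (comm, symbol), n in samples_per_symbol.most_common(10):
        print(f"{n:8d}  {comm:16s}  {symbol}")
```

Saved as a file, the script is passed to perf script with the -s option; tracepoint events are dispatched to separate, event-specific handler functions and are not covered by this sketch.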

Archiving of Scientific Data

By default, you will not be able to report or script a log file on a different computer. Even on the system under test, the shared objects will change over time (recompilation, updates), and the information within the recorded log files then becomes incomplete, if not useless. Use the perf archive function [5] to create an archive that contains all information on the shared objects used. These archives can become quite large, but with them the data can be processed on other systems.

Sources

  • Brendan Gregg, Linux perf Examples [6]
  • Perf Wiki [7]
  • Man pages 😉