Perf
Overview and Installation
The Linux Perf tool provides a variety of ways to measure, monitor, and present performance data. It builds on the Linux perf_event_open system call [1], which has been available since Linux 2.6.32.
To install Perf, use the `linux-tools-common` package on Debian-based systems and the `perf` package on SUSE.
Some features require additional permissions for unprivileged users. These can be granted by adjusting the pseudo file `/proc/sys/kernel/perf_event_paranoid`. According to the perf help text (Linux 4.18), the value in this file has the following meanings:
-1
: Allow use of (almost) all events by all users. Ignore mlock limit after `perf_event_mlock_kb` without `CAP_IPC_LOCK`.
>= 0
: Disallow ftrace function tracepoint by users without `CAP_SYS_ADMIN`. Disallow raw tracepoint access by users without `CAP_SYS_ADMIN`.
>= 1
: Disallow CPU event access by users without `CAP_SYS_ADMIN`.
>= 2
: Disallow kernel profiling by users without `CAP_SYS_ADMIN`.
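Because the thresholds are cumulative, a higher value keeps every restriction of the lower ones. The following Python sketch reads the current value and lists the restrictions that apply according to the table above. The helper names and restriction strings are hypothetical; the sketch models only the restrictions (not the mlock relaxation at -1) and treats values above 2 like 2.

```python
from __future__ import annotations

from pathlib import Path

# Hypothetical table built from the perf help text (Linux 4.18). Each entry is
# (minimum perf_event_paranoid value, restriction for users lacking the capability).
RESTRICTIONS = [
    (0, "ftrace function tracepoint requires CAP_SYS_ADMIN"),
    (0, "raw tracepoint access requires CAP_SYS_ADMIN"),
    (1, "CPU event access requires CAP_SYS_ADMIN"),
    (2, "kernel profiling requires CAP_SYS_ADMIN"),
]

PARANOID_FILE = Path("/proc/sys/kernel/perf_event_paranoid")


def read_paranoid_level(path: Path = PARANOID_FILE) -> int:
    """Return the current perf_event_paranoid value as an integer."""
    return int(path.read_text().strip())


def active_restrictions(level: int) -> list[str]:
    """Restrictions in effect at `level`; thresholds are cumulative (>= 0, >= 1, >= 2)."""
    return [text for threshold, text in RESTRICTIONS if level >= threshold]


if __name__ == "__main__":
    if PARANOID_FILE.exists():
        level = read_paranoid_level()
        print(f"perf_event_paranoid = {level}")
        for restriction in active_restrictions(level) or ["none"]:
            print(" -", restriction)
```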
Introduction
This section describes some basics.
Hardware Performance Monitoring Counters (PMCs)
Hardware PMCs are small extensions of processors. Each counter usually consists of at least two registers. In the first (control) register, the software (usually the operating system) selects the processor-internal event to be counted, starts and stops the measurement, and sets further control options. The second register is incremented each time the event occurs. In addition, the operating system can set up a threshold after which the PMC generates an interrupt, for example, once every 1 million instructions or every 300,000 cache misses. These interrupts are used for several purposes, among them event-based sampling with perf record (see below).
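The threshold mechanism can be illustrated with a toy model of a single counter, sketched below in Python. It assumes a common implementation strategy (used, for example, by Linux on x86): the counter register is preloaded with the negative sample period, so it wraps around to zero, and thus raises the interrupt, after exactly that many events. The class name `ToyPMC` and the default 48-bit width are assumptions for illustration; real counters are programmed by the operating system through model-specific registers.

```python
class ToyPMC:
    """Toy model of one hardware PMC (hypothetical, for illustration only)."""

    def __init__(self, event: str, sample_period: int, width: int = 48):
        self.event = event              # control register: which event to count
        self.period = sample_period     # interrupt after this many events
        self.mask = (1 << width) - 1    # counter register is `width` bits wide
        self.enabled = False            # control register: start/stop bit
        self.interrupts = 0             # overflow interrupts raised so far
        self._reload()

    def _reload(self) -> None:
        # Preload the counter with -period (two's complement within `width`
        # bits), so it wraps to zero after exactly `period` increments.
        self.counter = (-self.period) & self.mask

    def start(self) -> None:
        self.enabled = True

    def stop(self) -> None:
        self.enabled = False

    def tick(self, occurrences: int = 1) -> None:
        """Signal that the selected event occurred `occurrences` times."""
        if not self.enabled:
            return
        for _ in range(occurrences):
            self.counter = (self.counter + 1) & self.mask
            if self.counter == 0:           # overflow: raise an interrupt
                self.interrupts += 1        # the OS would take a sample here
                self._reload()              # and re-arm the counter


if __name__ == "__main__":
    pmc = ToyPMC("instructions", sample_period=1_000)
    pmc.start()
    pmc.tick(3_500)
    print(pmc.interrupts)  # 3
```

With a period of 1,000 events, 3,500 occurrences produce three interrupts; the remaining 500 events stay in the counter until the next overflow.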
Available Events
Use the `perf list` command to list the available events. They fall into several types. The availability of the events depends on the kernel version and the permissions.
Hardware Events
A small, generic set of hardware PMC events. This set is based on what Intel defined as architectural performance events (Intel Manual 3, Section 19.1 [2]). While these events are more or less valid on Intel platforms, they are not necessarily reliable on AMD. For example, some Linux versions counted level-1 instruction cache misses as `cache-misses`, while on Intel this event usually denotes last-level cache misses. Use these events carefully, since the mapping to the underlying hardware events is specified by the Linux developers (mostly those of the processor vendor).
Software Events
Software events are not counted via hardware PMCs; they are generated, monitored, and handled by the operating system. Examples are `context-switches`, `page-faults`, and `task-clock`.
Hardware Cache Events
These events relate to processor events for different caches, TLBs, and branch prediction. As with hardware events, their mapping to actual PMC events is defined by the kernel developers. Some of these events might not be available on a given system, for example, if the processor has no hardware PMC event that corresponds to a given hardware cache event.
Kernel PMU Events
In addition to the hardware PMCs, which are provided per hardware thread, other components of the processor (or other devices) can provide their own performance monitoring counters. Examples include free-running counter registers such as RAPL, TSC, MPERF, and APERF, as well as uncore components and integrated GPUs (iGPUs).
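On Linux, each of these PMUs is usually exposed under `/sys/bus/event_source/devices/`, and many of them list named events in an `events/` subdirectory (RAPL counters, for example, typically appear under the `power` PMU). The Python sketch below collects these names in the `pmu/event/` notation that `perf stat -e` accepts. The function name is hypothetical, and which PMUs and events show up depends on the processor and kernel.

```python
from __future__ import annotations

from pathlib import Path

SYSFS_PMUS = Path("/sys/bus/event_source/devices")
# Auxiliary files that describe an event (scaling, unit, ...) rather than define one.
_META_SUFFIXES = (".scale", ".unit", ".per-pkg", ".snapshot")


def list_pmu_events(root: Path = SYSFS_PMUS) -> dict[str, list[str]]:
    """Map each PMU name to its named events, written as "pmu/event/"."""
    result: dict[str, list[str]] = {}
    if not root.is_dir():
        return result
    for pmu in sorted(root.iterdir()):
        events_dir = pmu / "events"
        if not events_dir.is_dir():
            continue  # PMU without named events
        result[pmu.name] = sorted(
            f"{pmu.name}/{entry.name}/"
            for entry in events_dir.iterdir()
            if not entry.name.endswith(_META_SUFFIXES)
        )
    return result


if __name__ == "__main__":
    for pmu, events in list_pmu_events().items():
        print(pmu, ", ".join(events))
```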
Raw Events
You can also specify hardware PMC events by their raw encoding. Refer to the processor manual to find the encoding of the event that you want to monitor. Usually, the umask is placed in bits 8-15 and the event select in bits 0-7. For the event `LD_BLOCKS.STORE_FORWARD` on 4th Generation Intel Core processors, the umask and event are 0x02 and 0x03, respectively (Intel Manual 3, Table 19-7 [3]). Hence, the raw event encoding is `r203`.
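The encoding rule can be expressed as a small Python function. It is a sketch that assumes the layout described above (event select in bits 0-7, umask in bits 8-15) and leaves all other configuration bits at zero; `raw_event` is a hypothetical helper name. For `LD_BLOCKS.STORE_FORWARD`, shifting the umask 0x02 left by 8 bits and combining it with the event 0x03 gives 0x203.

```python
def raw_event(event: int, umask: int) -> str:
    """Build a perf raw event string ("rNNN", hexadecimal) from event and umask.

    Assumes event select in bits 0-7 and umask in bits 8-15; further
    configuration bits (edge, inv, cmask, ...) are left at zero.
    """
    if not (0 <= event <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event and umask must each fit into 8 bits")
    config = (umask << 8) | event
    return f"r{config:x}"


if __name__ == "__main__":
    # LD_BLOCKS.STORE_FORWARD on 4th Generation Intel Core: event 0x03, umask 0x02
    print(raw_event(0x03, 0x02))  # r203
```

The resulting string is what you pass to perf, for example `perf stat -e r203 make -j`.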
Tracepoint Events
In addition to counters, the kernel is instrumented with tracepoints, which allow you to capture specific events within the kernel. Most tracepoints also provide access to some arguments, which are highly event-specific. For the definition of these events, check your `tracefs` mount point, usually `/sys/kernel/debug/tracing`.
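The following Python sketch lists the tracepoints found below a tracefs mount point in the `subsystem:event` notation that perf uses (for example, `sched:sched_switch`), and extracts the argument names from an event's `format` file. It assumes the usual tracefs layout (`events/<subsystem>/<event>/format`); the function names are hypothetical, and reading tracefs typically requires root privileges.

```python
from __future__ import annotations

from pathlib import Path

TRACEFS = Path("/sys/kernel/debug/tracing")  # usual mount point, see above


def list_tracepoints(tracefs: Path = TRACEFS) -> list[str]:
    """Return tracepoints as "subsystem:event", the notation perf accepts for -e."""
    events = tracefs / "events"
    found: list[str] = []
    if not events.is_dir():
        return found
    for subsystem in events.iterdir():
        if not subsystem.is_dir():
            continue  # skips plain files such as events/enable
        for event in subsystem.iterdir():
            if (event / "format").is_file():  # each tracepoint has a format file
                found.append(f"{subsystem.name}:{event.name}")
    return sorted(found)


def event_fields(format_text: str) -> list[str]:
    """Extract argument (field) names from the text of an event's format file.

    Field lines look like: field:char prev_comm[16]; offset:8; size:16; ...
    """
    names = []
    for line in format_text.splitlines():
        for part in line.split(";"):
            part = part.strip()
            if part.startswith("field:"):
                declaration = part[len("field:"):].strip()
                name = declaration.split()[-1]    # the last token is the name
                names.append(name.split("[")[0])  # drop an array suffix
    return names


if __name__ == "__main__":
    try:
        for name in list_tracepoints():
            print(name)
    except OSError as err:
        print("tracefs not readable:", err)
```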
Measuring Events with perf stat
Use the `perf stat` command to measure available events. It sets up the hardware and software counters and collects the counts either for the given process or for the requested CPU(s) (meaning hardware threads). `perf stat` provides various command-line arguments (see the perf-stat man page [4]). Some of the important ones are:
-d / --detailed
: collects more events; can be given multiple times
-I / --interval-print <ms>
: prints the measurements every <ms> milliseconds
-e / --event=<event>
: specifies the event(s) to be measured
-x / --field-separator <SEP>
: prints the statistics in CSV-like format, using SEP as separator
-C / --cpu=<cpu-list>
: measures the events only on the given list of CPUs
-A / --no-aggr
: does not aggregate counts across all monitored CPUs
-a / --all-cpus
: monitors all CPUs
-o / --output <file>
: writes the output to the given file (default: stderr)
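When perf stat is driven from a script, the options above can be assembled into an argument vector for Python's `subprocess.run`. The helper below is a hypothetical sketch that only emits the flags listed above; it assumes that the command to measure is appended after all options, as in the examples.

```python
from __future__ import annotations


def perf_stat_cmd(
    command: list[str] | None = None,
    events: list[str] | None = None,
    interval_ms: int | None = None,
    separator: str | None = None,
    cpus: str | None = None,
    all_cpus: bool = False,
    no_aggr: bool = False,
    detailed: int = 0,
    output: str | None = None,
) -> list[str]:
    """Assemble an argument vector for `perf stat` (hypothetical helper)."""
    argv = ["perf", "stat"]
    argv += ["-d"] * detailed                 # -d may be given multiple times
    if events:
        argv += ["-e", ",".join(events)]      # comma-separated event list
    if interval_ms is not None:
        argv += ["-I", str(interval_ms)]      # print every interval_ms milliseconds
    if separator is not None:
        argv += ["-x", separator]             # CSV-like output
    if cpus is not None:
        argv += ["-C", cpus]                  # e.g. "0-3,8"
    if all_cpus:
        argv.append("-a")
    if no_aggr:
        argv.append("-A")
    if output is not None:
        argv += ["-o", output]
    return argv + list(command or [])         # command to measure comes last


if __name__ == "__main__":
    print(perf_stat_cmd(events=["instructions"], interval_ms=1000,
                        separator=",", output="stat.csv"))
```

For instance, `subprocess.run(perf_stat_cmd(["make", "-j"], detailed=2))` would run `perf stat -d -d make -j`.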
Examples
perf stat make -j
: Gives a general overview of how well `make -j` performed.
perf stat -d -d make -j
: Gives a more detailed overview of how well `make -j` performed.
perf stat -a -I 1000
: Prints statistics for the whole system every second.
perf stat -e instructions -I 1000 -x , -o stat.csv
: Counts instructions for the whole system every second and writes the results in CSV format to `stat.csv`. This can be used to monitor instructions per second (IPS) over time.
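The IPS values for the last example can be computed from `stat.csv` with a short Python sketch. It assumes the column order of perf's CSV output (`-x`) in interval mode with default aggregation (no `-A`): timestamp, counter value, unit, event name, run time, percentage, and optional metric columns. In interval mode each value is the count for that interval only, so dividing by the interval length yields events per second. The function name is hypothetical.

```python
from __future__ import annotations

import csv
import io
import os


def ips_from_perf_csv(text: str, event: str = "instructions") -> list[tuple[float, float]]:
    """Return (timestamp, events per second) pairs from `perf stat -I ... -x ,` output.

    Assumed columns: timestamp, value, unit, event name, run time, percentage, ...
    """
    result = []
    previous_ts = 0.0
    for row in csv.reader(io.StringIO(text)):
        if not row or row[0].lstrip().startswith("#"):
            continue  # skip blank lines and the "# started on ..." header
        if len(row) < 4:
            continue  # not a counter line
        timestamp, value, event_name = row[0].strip(), row[1].strip(), row[3].strip()
        if event_name.split(":")[0] != event:
            continue  # keep e.g. "instructions" and "instructions:u"
        ts = float(timestamp)
        if value.startswith("<"):  # e.g. "<not counted>"
            previous_ts = ts
            continue
        result.append((ts, float(value) / (ts - previous_ts)))
        previous_ts = ts
    return result


if __name__ == "__main__":
    if os.path.exists("stat.csv"):
        with open("stat.csv") as f:
            for ts, ips in ips_from_perf_csv(f.read()):
                print(f"{ts:10.3f} s  {ips:,.0f} instructions/s")
```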
Overprovision of Events
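If you specify more events than the hardware can count at the same time, the operating system multiplexes them: it measures them in time slices and counts a different subset of the events in each slice. perf reports this to the user. Watch out for percentages in parentheses after the results: they give the fraction of the measurement time during which the event was actually counted, and the reported value is extrapolated from that fraction.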
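When an event was multiplexed, perf extrapolates its count from two times that the kernel reports for each event through perf_event_open [1]: the time the event was enabled and the time it was actually running on a hardware counter. The percentage in parentheses is the ratio of the two. A minimal Python sketch of this arithmetic, with a hypothetical function name:

```python
from __future__ import annotations


def scale_multiplexed(raw_count: int, time_enabled: int, time_running: int) -> tuple[float, float]:
    """Extrapolate a multiplexed count; return (estimate, percent of time running).

    raw_count:    events actually counted while the event was on a PMC
    time_enabled: time the event was enabled (requested to be measured)
    time_running: time the event was actually scheduled on a hardware counter
    """
    if time_running == 0:
        return 0.0, 0.0  # never scheduled: nothing to extrapolate from
    estimate = raw_count * time_enabled / time_running
    percent = 100.0 * time_running / time_enabled
    return estimate, percent
```

For example, an event that counted 500,000 occurrences while running for half of the enabled time is reported as about 1,000,000 with (50.00%). The lower the percentage, the less reliable the extrapolated value.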
Child Processes
When monitoring a process, all of its child processes and threads are monitored as well. This can be avoided with the `-i` / `--no-inherit` flag.
Event-based Sampling with perf record
With `perf record`, you can set up event-based sampling. The samples are written to a data file (`perf.data` by default), which is typically processed later with `perf report` to create a profile.
(TODO)