Perf
Overview and Installation
The Linux Perf tool provides a variety of ways to measure, monitor, and present performance data. It builds on top of the Linux perf_event_open system call [1], provided since kernel version 2.6.32.
To install Perf, use the linux-tools-common package on Debian-based systems and perf on SuSE.
Some of the features might need special permissions to be granted to users. This can be done by changing the value in the pseudo file /proc/sys/kernel/perf_event_paranoid. According to the perf help text (Linux 4.18), this file can contain the following values with the respective meanings:
-1: Allow use of (almost) all events by all users. Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK.
>= 0: Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN. Disallow raw tracepoint access by users without CAP_SYS_ADMIN.
>= 1: Disallow CPU event access by users without CAP_SYS_ADMIN.
>= 2: Disallow kernel profiling by users without CAP_SYS_ADMIN.
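The levels are cumulative: a higher value includes all restrictions of the lower ones. The following is a minimal Python sketch of that reading, not part of perf itself. The function names and the wording of the restriction texts are chosen for illustration only.

```python
from pathlib import Path

PARANOID_FILE = Path("/proc/sys/kernel/perf_event_paranoid")

# Restrictions for users without CAP_SYS_ADMIN, keyed by the lowest
# level at which they apply (paraphrased from the list above).
RESTRICTIONS = [
    (0, "no ftrace function tracepoint, no raw tracepoint access"),
    (1, "no CPU event access"),
    (2, "no kernel profiling"),
]


def restrictions_for(level):
    """Return all restrictions that are active at the given level."""
    return [text for threshold, text in RESTRICTIONS if level >= threshold]


def read_paranoid_level(path=PARANOID_FILE):
    """Return the current level, or None if the file cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    level = read_paranoid_level()
    if level is None:
        print("perf_event_paranoid is not readable on this system")
    else:
        print("level", level)
        for text in restrictions_for(level):
            print(" -", text)
```

Changing the value itself requires root permissions and is not covered by this sketch.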
Introduction
In this section, some basics are described.
Hardware Performance Monitoring Counters (PMCs)
Hardware PMCs are small extensions of processors, which usually consist of a set of at least two registers. In the first register, the software (operating system) specifies the processor-internal event that shall be counted and controls the measurement, for example by starting and stopping it. The second register is incremented each time the event occurs. In addition, the operating system can set up a threshold after which the PMC generates an interrupt, for example after every 1 million instructions or every 300,000 cache misses. This mechanism serves multiple purposes, among them the event-based sampling described for perf record.
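The threshold mechanism can be pictured with a toy model. This Python sketch only simulates the idea of "interrupt every N events" and does not access any real hardware register. The class and method names are made up for illustration.

```python
class SamplingCounter:
    """Toy model of one PMC: counts events, signals an overflow every `period` events."""

    def __init__(self, period):
        if period <= 0:
            raise ValueError("period must be positive")
        self.period = period
        self.count = 0        # total number of events seen so far
        self.interrupts = 0   # number of threshold crossings so far

    def on_event(self, n=1):
        """Account for n events and return how many interrupts they triggered."""
        before = self.count // self.period
        self.count += n
        fired = self.count // self.period - before
        self.interrupts += fired
        return fired


counter = SamplingCounter(period=1_000_000)
counter.on_event(2_500_000)   # e.g. 2.5 million retired instructions
print(counter.interrupts)     # prints 2
```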
Available Events
Use the perf list command to list available events. These can be divided into multiple types. The availability of the events depends on the kernel version and the permissions.
Hardware Events
A small list of Hardware Performance Monitoring Counter events. This set was defined by Intel as architectural performance events (Intel Manual 3, Section 19.1 [2]). While these seem to be more or less valid on Intel platforms, they are not necessarily reliable on AMD. For example, some Linux versions counted Level-1 instruction cache misses as cache-misses, while on Intel the event usually refers to last-level cache misses. Use them carefully, since the mapping to the underlying hardware events is specified by the Linux developers (mostly those of the processor vendor).
Software Events
Software events are not counted via Hardware PMCs, but are generated, monitored, and handled by the operating system.
Hardware Cache Events
These events relate to processor events for different caches, TLBs, and branch prediction. As for Hardware Events, these have to be specified by kernel developers. Some of the events might not be available on the system, for example if there is no hardware PMC event that relates to a given Hardware Cache Event.
Kernel PMU Events
In addition to Hardware PMCs, which are provided per hardware thread, other components of the processor (or devices) can provide their own performance monitoring counters. This can relate to incremental registers, like RAPL, TSC, MPERF, APERF, uncore components, or iGPUs.
Raw Events
You can also specify hardware PMC events by their actual ID. Refer to the processor manual to find the ID for the event that you want to monitor. Usually, the umask is defined in bits 8-15 and the event is specified in bits 0-7. For the event LD_BLOCKS.STORE_FORWARD on 4th Generation Intel Core Processors, the umask and event are 0x02 and 0x03, respectively (Intel Manual 3, Table 19-7 [3]). Hence, the raw event encoding would be r203 (umask 0x02 in bits 8-15, event 0x03 in bits 0-7).
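The bit layout can be checked with a few lines of Python. This is a minimal sketch that assumes only the layout stated above (event in bits 0-7, umask in bits 8-15). The function name is hypothetical, and other fields that a processor may define in higher bits are ignored.

```python
def raw_event(event, umask):
    """Build a perf raw event string from an 8-bit event code and an 8-bit umask."""
    if not 0 <= event <= 0xFF or not 0 <= umask <= 0xFF:
        raise ValueError("event and umask must each fit into 8 bits")
    # umask occupies bits 8-15, the event code bits 0-7
    return "r{:x}".format((umask << 8) | event)


# LD_BLOCKS.STORE_FORWARD with the values given above
print(raw_event(event=0x03, umask=0x02))  # prints r203
```

The resulting string can be passed wherever an event name is expected, for example with the -e argument of perf stat.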
Tracepoint Events
In addition to incrementing counters, the kernel is instrumented with tracepoints, which let you capture specified events within the kernel. Most of these also provide access to some arguments, which are highly event-specific. For a definition of these events, check your tracefs mountpoint, usually under /sys/kernel/debug/tracing.
Hint: You can define new tracepoints with perf probe.
Measuring Events with perf stat
Use the perf stat command to measure available events. This will set up the hardware and software counters and collect the information either for the given process or for the requested CPU(s) (meaning hardware threads). perf stat provides various command line arguments (see perf-stat man page [4]). Some of the important ones are:
Selected Arguments
-d / --detailed: Collects more events. Can be provided multiple times.
-I / --interval-print <ms>: Provides a measurement every ms milliseconds.
-e / --event=: Specifies the event(s) to be measured.
-x / --field-separator SEP: Prints statistics in a CSV-like format, with SEP used as separator.
-C / --cpu=<cpu-list>: Measures the events on the given list of CPUs.
-A / --no-aggr: Does not aggregate counts across all monitored CPUs.
-a / --all-cpus: Monitors all CPUs.
-o / --output <file>: Specifies the output file. Default: stderr.
Examples
perf stat make -j
Will provide a general overview of how well make -j performed.
perf stat -d -d make -j
Will provide a more detailed overview of how well make -j performed.
perf stat -a -I 1000
Will provide statistics for the whole system every second.
perf stat -a -e instructions -I 1000 -x , -o stat.csv
Will provide instruction counts for the whole system every second and save them in stat.csv. Can be used to monitor instructions per second (IPS) over time.
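A file such as stat.csv can be turned into an IPS series with a short script. The following Python sketch rests on assumptions about the column layout: the first field is the timestamp in seconds, the second the counter value, and the fourth the event name. Check a line of your own file first, since the layout can differ between perf versions. The sample lines and the function name are invented for illustration.

```python
import csv


def event_rate(lines, event="instructions"):
    """Return a list of (timestamp, events per second) for the given event."""
    rates = []
    previous_time = 0.0
    for row in csv.reader(lines):
        # skip empty lines, comment lines, and lines with too few fields
        if not row or row[0].lstrip().startswith("#") or len(row) < 4:
            continue
        if row[3] != event:
            continue
        try:
            timestamp = float(row[0])
        except ValueError:
            continue
        interval = timestamp - previous_time
        previous_time = timestamp
        try:
            count = float(row[1])
        except ValueError:
            # non-numeric placeholder, e.g. for an event that was not counted
            continue
        if interval > 0:
            rates.append((timestamp, count / interval))
    return rates


# Invented sample data in the assumed layout
sample = [
    "# hypothetical header line",
    "1.000100,2000200,,instructions,1000000,100.00,,",
    "2.000200,3000300,,instructions,1000000,100.00,,",
]
for timestamp, rate in event_rate(sample):
    print("%.1f s: %.0f instructions/s" % (timestamp, rate))
```

For a real file, pass an open file object instead of the list, for example event_rate(open("stat.csv")).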
Overprovision of Events
If you specify more events than the hardware can count at the same time, the operating system will measure them in time slices, where a different set of events is chosen for each time slice. perf reports this to the user: watch out for percentages in parentheses after the results, which show the share of time during which an event was actually counted.
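When an event was only counted for part of the time, the reported value is an extrapolation. A minimal Python sketch of that arithmetic follows, assuming the usual scaling of the raw count by the ratio of the time the event was enabled to the time it was actually running on a counter. The function and parameter names are chosen for illustration.

```python
def scale_count(raw_count, time_enabled, time_running):
    """Extrapolate a raw count to the full enabled time; None if never counted."""
    if time_running == 0:
        return None
    return raw_count * time_enabled / time_running


def counted_share(time_enabled, time_running):
    """Percentage of the enabled time during which the event was counted."""
    if time_enabled == 0:
        return 0.0
    return 100.0 * time_running / time_enabled


# An event that occupied a counter for half of a 10 s measurement
print(scale_count(500_000, time_enabled=10.0, time_running=5.0))  # prints 1000000.0
print(counted_share(time_enabled=10.0, time_running=5.0))         # prints 50.0
```

The lower the percentage, the more the result depends on the assumption that the program behaved the same while the event was not being counted.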
Child-processes
When monitoring a process, all child processes and threads will also be monitored. This can be avoided by providing the -i / --no-inherit flag.
Event-based Sampling with perf record and other Tools
With perf record, you can set up event-based sampling. The samples are written to a log file, which is typically processed later with perf report to create a profile.
Caution with File Size: The sample files can get quite large. Do not write them to a slow file system.
Recording with perf record
When recording, the execution of processes is interrupted at intervals. The current state (PID, instruction pointer, time, ...) is taken and stored in the log file. As with perf stat, hardware threads (CPUs) or processes/threads can be monitored. Based on the log file, either a trace or a profile can be shown, or the individual entries of the file can be processed using Python or Perl.
Interpret with Caution! The interrupt interval is not necessarily equidistant in time! The default event cycles depends on processor idle states as well as on frequency. Other events that can be used as a base for the interrupt are not equidistant either (e.g., instructions, cache misses). Moreover, the captured instruction pointer can be slightly offset from the instruction that actually triggered the sample.
Selected Arguments
-c / --count=: Event period to sample. A lower period means more samples and more perturbation, a higher period means fewer samples.
-e / --event=: As for perf stat, you can specify events here. For example, if you are interested in functions where cache misses happen, use -e cache-misses.
-g (or --call-graph <mode> to select the unwinding method): Add call path profiling. If available, perf will try to not only sample the current function, but also the call path. This works better on x86 if the frame pointer is available. Please compile your code with -fno-omit-frame-pointer. If you do not do so, perf will try other means, which can also influence the monitoring overhead.
--user-regs: Also capture user registers.
--switch-events: Also capture switch events (for statistics on when exactly processes/threads are switched by the OS).
-b / --branch-any: Record any branch instruction that is taken. The type of branch instruction can be filtered with -j / --branch-filter. This enables you to "replay" the execution of a program. Please note that there can be an offset for the branch instruction (i.e., the first instruction after the branch could be sampled).
-G name,... / --cgroup name,...: Monitor a CGroup. Can also be used with SLURM, which can create CGroups for jobs/job steps.
-R / --raw-samples: Collect the raw samples. Will provide arguments for tracepoint events.
Examples
perf record ./foo
record the execution of program foo
perf record -g ./foo
record the execution of program foo, including call stack
perf record -a sleep 60
record the whole system for 60 seconds
perf record -b ./foo
record the branches taken during the execution of program foo
perf record -e instructions -c 1000000 ./foo
take a sample every 1 million instructions during the execution of foo
Profile with perf report
You can show the profile of the log file created by perf record with perf report. Depending on the installation, perf report will open a TUI or a GTK interface. If you use pipes to process the output, it will provide stdio output.
Within the profile (TUI), you can zoom and look at the executed code (function level or instruction level, i.e., assembler). Depending on the availability of debug information and the provided flags, the source code is interleaved with the assembler.
Trace and Processing with perf script
In addition to viewing the profile, you can also process the log file with a script. By default, perf script prints all entries of the log file, so you can use command line tools to process them further. However, you usually want to make use of the Perl or Python interface (depending on your perf installation) to process the individual samples.
Example
perf script -g python
generate the stub perf-script.py for processing the log file in Python
perf script report perf-script.py
run perf-script.py on the default log file (perf.data)
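A handler script for sampled events can stay very small. The following Python sketch assumes the generic handler interface, in which perf calls trace_begin() once, process_event(param_dict) for every sample, and trace_end() once, and it assumes that param_dict provides the keys "comm" and "symbol". The generated stub on your system shows which handlers and fields are actually available, so treat the names here as assumptions to be checked against it.

```python
from collections import Counter

# (command name, symbol) -> number of samples
samples_per_symbol = Counter()


def trace_begin():
    """Called once before the first sample is processed."""
    samples_per_symbol.clear()


def process_event(param_dict):
    """Called for every sample; counts samples per command and symbol."""
    comm = param_dict.get("comm", "[unknown]")
    symbol = param_dict.get("symbol", "[unknown]")
    samples_per_symbol[(comm, symbol)] += 1


def trace_end():
    """Called once after the last sample; prints the ten most sampled symbols."""
    for (comm, symbol), count in samples_per_symbol.most_common(10):
        print("%8d  %-16s %s" % (count, comm, symbol))
```

Saved under a hypothetical name such as handler.py, the script can be passed to perf script with the -s argument. The result is a flat profile similar to what perf report shows, which is a useful starting point before adding custom processing.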
Archiving of Scientific Data
By default, you will not be able to report or script a log file on a different computer. Even on the system under test, the shared objects will change (recompilation, updates), and the information within the recorded log files will then be incomplete, if not useless. Use the perf archive function [5] to create an archive that contains all information on the used shared objects. These archives can become quite large. Still, the data can then be processed on other systems.