Overview and Installation
The Linux perf tool offers several ways to measure, monitor, and present performance data. It builds on the Linux perf_event_open system call, which has been available since kernel 2.6.32.
To install Perf, use the
linux-tools-common package on Debian-based systems and
perf on SuSE.
Some of the features might need special permissions to be granted to users. This can be done by tweaking the pseudo file
/proc/sys/kernel/perf_event_paranoid. According to the perf help text (Linux 4.18), this file can contain the following values with the respective meanings:
-1: Allow use of (almost) all events by all users. Ignore mlock limit after
>=0: Disallow ftrace function tracepoint by users without
CAP_SYS_ADMIN. Disallow raw tracepoint access by users without
>= 1: Disallow CPU event access by users without
>= 2: Disallow kernel profiling by users without
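The levels are cumulative: a higher value includes all restrictions of the lower ones. The following minimal sketch reads the pseudo file and lists the restrictions that apply to unprivileged users. The function names are hypothetical, and the short descriptions only paraphrase the help text quoted above.

```python
from pathlib import Path

# Pseudo file described above; it may be missing on non-Linux systems.
PARANOID_FILE = Path("/proc/sys/kernel/perf_event_paranoid")


def describe_paranoid(level):
    """Return the restrictions for unprivileged users at the given level."""
    restrictions = []
    if level >= 0:
        restrictions.append("no ftrace function tracepoint or raw tracepoint access")
    if level >= 1:
        restrictions.append("no CPU event access")
    if level >= 2:
        restrictions.append("no kernel profiling")
    return restrictions


def read_paranoid(path=PARANOID_FILE):
    """Return the current level, or None if the file cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    level = read_paranoid()
    if level is None:
        print("perf_event_paranoid is not readable on this system")
    else:
        print("perf_event_paranoid =", level)
        for item in describe_paranoid(level) or ["(almost) all events allowed"]:
            print(" -", item)
```

An empty list (level -1) means that (almost) all events may be used by all users.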
In this section, some basics are described.
Hardware Performance Monitoring Counters (PMCs)
Hardware PMCs are small extensions of processors, which usually consist of a set of at least two registers. In the first register, the software (operating system) selects the processor-internal event to be counted and sets control bits, for example to start and stop the measurement. The second register is incremented each time the event occurs. In addition, the operating system can set up a threshold after which the PMC generates an interrupt, for example after every 1 million instructions or every 300,000 cache misses. This mechanism is the basis for event-based sampling.
Use the perf list command to list available events. These fall into multiple types. (The availability of the events depends on the kernel version and the permissions.)
Hardware Events
A small list of Hardware Performance Monitoring Counter events. Intel defined this set as Architectural performance events (Intel Manual 3, Section 19.1). While these seem to be more or less valid on Intel platforms, they are not necessarily reliable on AMD. For example, some Linux versions counted Level-1 instruction cache misses as
cache-misses, while on Intel platforms this event usually refers to last-level cache misses. Use these events carefully, since the underlying hardware events are chosen by the Linux developers (mostly those of the processor vendor).
Software events are not counted via Hardware PMCs, but are generated, monitored, and handled by the operating system.
Hardware Cache Events
These events relate to processor events for different caches, TLBs, and branch prediction. As for Hardware Events, the mapping has to be specified by kernel developers. Some of the events might not be available on the system, for example if there is no Hardware PMC event that relates to a given Hardware Cache Event.
Kernel PMU Events
In addition to Hardware PMCs, which are provided per hardware thread, other components of the processor (or devices) can provide their own performance monitoring counters. This can relate to incremental registers, like RAPL, TSC, MPERF, APERF, uncore components, or iGPUs.
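On Linux, the kernel exposes the registered PMUs as directories in sysfs under /sys/bus/event_source/devices. A minimal sketch for listing them follows; the function name list_pmus is hypothetical, and which PMUs appear depends on the hardware and kernel.

```python
import os


def list_pmus(root="/sys/bus/event_source/devices"):
    """Return the names of the PMUs registered with the kernel.

    Returns an empty list if the directory does not exist,
    e.g. on non-Linux systems or without sysfs mounted.
    """
    try:
        return sorted(os.listdir(root))
    except OSError:
        return []


if __name__ == "__main__":
    for name in list_pmus():
        print(name)
```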
You can specify hardware PMC events also by their actual ID. Refer to the processor manual to find the ID for the event that you want to monitor. Usually, the Umask is defined in Bits 8-15 and the event is specified in Bits 0-7. For the event
LD_BLOCKS.STORE_FORWARD on 4th Generation Intel Core Processors, the umask and event are 0x02 and 0x03, respectively (Intel Manual 3, Table 19-7). Hence, the raw event encoding would be
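The bit layout described above (event in bits 0-7, umask in bits 8-15) can be sketched as a small helper. The function names are hypothetical. The result is formatted in the rNNN notation (hexadecimal digits) that perf accepts for raw events. The sketch covers only these two fields and ignores other bits that a processor manual may define.

```python
def raw_event_code(event, umask):
    """Combine event select (bits 0-7) and unit mask (bits 8-15)."""
    if not 0 <= event <= 0xFF or not 0 <= umask <= 0xFF:
        raise ValueError("event and umask must each fit into 8 bits")
    return (umask << 8) | event


def perf_raw_event(event, umask):
    """Format the code in perf's rNNN raw-event notation."""
    return "r{:04x}".format(raw_event_code(event, umask))


# Values for LD_BLOCKS.STORE_FORWARD as given in the text above.
print(perf_raw_event(event=0x03, umask=0x02))  # prints r0203
```

The resulting string can be passed to the -e option, for example perf stat -e r0203.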
In addition to incrementing counters, the kernel is instrumented with tracepoints, which allow you to capture specified events within the kernel. Most of these also provide access to some arguments, which are highly event-specific. For a definition of these events, check your
tracefs mountpoint, usually under
Hint You can define new tracepoints with
Measuring Events with perf stat
Use the perf stat command to measure available events. It sets up the hardware and software counters and collects the information either for the given process or for the requested CPU(s) (meaning hardware threads).
perf stat provides various command line arguments (see perf-stat man page  ). Some of the important ones are:
-d / --detailed: collects more events; can be provided multiple times
-I / --interval-print <ms>: prints a measurement every <ms> milliseconds
-e / --event=: specifies the event(s) to be measured
-x / --field-separator SEP: prints statistics in a CSV-like format, using SEP as separator
-C / --cpu=<cpu-list>: measures the events on the given list of CPUs
-A / --no-aggr: does not aggregate counts across all monitored CPUs
-a / --all-cpus: monitors all CPUs
-o / --output <file>: specifies the output file (default: stderr)
perf stat make -j
Will provide a general overview on how well
make -j performed.
perf stat -d -d make -j
Will provide a more detailed overview on how well
make -j performed.
perf stat -a -I 1000
Will provide statistics for the whole system every second.
perf stat -a -e instructions -I 1000 -x , -o stat.csv
Will provide instruction counts for the whole system every second and save them in stat.csv. Can be used to monitor instructions per second (IPS) over time.
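The resulting stat.csv can be post-processed with a short script. The following sketch assumes the field order that the perf-stat man page describes for interval output with -x (timestamp, counter value, unit, event name, ...). The sample lines are hypothetical, and the exact columns can differ between perf versions, so check your own file first.

```python
import csv
import io

# Hypothetical sample in the assumed layout:
# timestamp, counter value, unit, event name, further fields
SAMPLE = """\
# hypothetical header line
1.000000000,2000000,,instructions,4000000000,100.00,,
2.000000000,3000000,,instructions,4000000000,100.00,,
3.500000000,<not counted>,,instructions,0,0.00,,
"""


def rates_per_second(text, event="instructions", sep=","):
    """Return (timestamp, events per second) for each interval."""
    rates = []
    prev_time = 0.0
    lines = (line for line in io.StringIO(text)
             if line.strip() and not line.startswith("#"))
    for fields in csv.reader(lines, delimiter=sep):
        if len(fields) < 4 or fields[3] != event:
            continue
        timestamp = float(fields[0])
        elapsed = timestamp - prev_time
        prev_time = timestamp
        try:
            count = int(fields[1])
        except ValueError:  # e.g. "<not counted>" or "<not supported>"
            continue
        if elapsed > 0:
            rates.append((timestamp, count / elapsed))
    return rates


for timestamp, rate in rates_per_second(SAMPLE):
    print("{:.1f} s: {:.0f} instructions/s".format(timestamp, rate))
```

To process a real measurement, pass the contents of stat.csv instead of SAMPLE.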
Overprovision of Events
If you specify more events than the hardware can count at the same time, the operating system multiplexes them: it measures in time slices and selects a different set of events for each slice. perf reports this to the user. Watch out for percentage values in parentheses after the results; they state for which share of the measurement time the event was actually counted.
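When an event was counted only part of the time, the reported value is an extrapolation. The perf_event_open interface provides two times for this purpose: the time an event was enabled and the time it was actually running on a counter. The following sketch shows the usual linear scaling; the function names are hypothetical, and the extrapolation is only an estimate if the event rate varies between time slices.

```python
def scale_count(raw_count, time_enabled, time_running):
    """Extrapolate a multiplexed count to the full enabled time.

    Returns None if the event never ran on a hardware counter.
    """
    if time_running == 0:
        return None
    return raw_count * time_enabled / time_running


def running_percentage(time_enabled, time_running):
    """Share of the enabled time during which the event was counted."""
    if time_enabled == 0:
        return 0.0
    return 100.0 * time_running / time_enabled


# An event that was counted during one quarter of the measurement:
print(scale_count(1000, time_enabled=4000, time_running=1000))  # 4000.0
print(running_percentage(time_enabled=4000, time_running=1000))  # 25.0
```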
When monitoring a process, all child processes and threads will also be monitored. This can be avoided by providing the
-i, --no-inherit flag.
Event-based Sampling with
perf record and other Tools
With perf record, you can set up event-based sampling. The samples are written to a log file, which is typically processed later with
perf report to create a profile.
Caution with File Size: The sample files can get quite large, so avoid writing them to a slow file system.
When recording, the execution of processes is interrupted at intervals. The current state (PID, instruction pointer, time, ...) is taken and stored to the log file. As for
perf stat, hardware threads (CPUs) or processes/threads can be monitored. Based on the log file, either a trace or a profile can be shown, or the individual entries of the file can be processed using Python or Perl.
Interpret with Caution! The interrupt interval is not necessarily equidistant in time! The default event
cycles depends on processor idle states as much as frequency. Other events that can be used as a base for the interrupt are not equidistant either (e.g., instructions, cache misses). Moreover, the captured instruction pointer can have a small offset different from the actual instruction.
-c / --count=: event period to sample, i.e., a sample is taken every <count> events. A lower value yields more samples and more perturbation; a higher value yields fewer samples.
-e / --event=: as for
perf stat, you can specify events here. For example, if you are interested in functions where cache misses happen, use
-g / --call-graph <mode>: add call path profiling. If available,
perf will try to sample not only the current function, but also the call path. This works better on x86 if the frame pointer is available. Please compile your code with
-fno-omit-frame-pointer. If you do not do so, perf will try other means, which can also influence the monitoring overhead.
--user-regs: also capture user registers
--switch-events: also capture switch events (for statistics on when exactly processes/threads are switched by the OS)
-b / --branch-any: record any branch instruction that is taken. The type of branch instruction can be filtered with
-j / --branch-filter. This enables you to "replay" the execution of a program. Please note that there can be an offset for the branch instruction (i.e., the first instruction after the branch could be sampled).
-G name,... / --cgroup name,...: monitor a CGroup. Can also be used with SLURM, which can create CGroups for jobs/job steps.
-R / --raw-samples: collect the raw samples. Will provide arguments for tracepoint events.
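The effect of the event period (-c) on the number of samples is simple arithmetic: one sample is taken per <count> events. The following sketch illustrates this; the function names and the example numbers are hypothetical.

```python
def expected_samples(total_events, period):
    """Number of samples when one sample is taken every `period` events."""
    if period <= 0:
        raise ValueError("period must be positive")
    return total_events // period


def period_for_rate(events_per_second, samples_per_second):
    """Event period that gives roughly the requested sampling rate."""
    return max(1, round(events_per_second / samples_per_second))


# A run that executes 3 billion instructions, sampled with -c 1000000:
print(expected_samples(3_000_000_000, 1_000_000))  # 3000
# About 1000 samples per second at 2 billion instructions per second:
print(period_for_rate(2_000_000_000, 1000))  # 2000000
```

A smaller period therefore gives a finer profile, but also interrupts the program more often and produces a larger log file.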
perf record ./foo: record the execution of program foo
perf record -g ./foo: record the execution of program foo, including call stack
perf record -a sleep 60: record the whole system for 60 seconds
perf record -b ./foo: record the branches taken during the execution of program foo
perf record -e instructions -c 1000000 ./foo: record every 1 millionth instruction during execution of foo
You can show the profile of the created log file with
perf report. Depending on the installation, perf report will open a TUI or a GTK interface. If you pipe the output into another program, it will print to stdio instead.
Within the profile (TUI), you can zoom in and look at the executed code at function level or at instruction level (assembler). Depending on the availability of debug information and the provided flags, the source code is shown interleaved with the assembler code.
Trace and Processing with perf script
In addition to viewing the profile, you can also process the log file with a script. By default, all entries of the log file are printed, and you can use command line tools to process them further. However, you usually want to use the Perl or Python interface (depending on your perf installation) to process the individual samples.
perf script -g python: generate the stub perf-script.py for processing the log file in Python
perf script report perf-script.py: run perf-script.py on the default log file (
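A generated stub is typically extended by filling in its handler functions. The following sketch counts samples per command and symbol. It assumes the handler names trace_begin, process_event, and trace_end and the dictionary keys "comm" and "symbol" as found in stubs generated by common perf versions; check your own generated perf-script.py, since the available keys depend on the perf version and the recorded events.

```python
from collections import Counter

# Number of samples per (command, symbol) pair.
samples_per_symbol = Counter()


def trace_begin():
    """Called once before the first sample is processed."""
    samples_per_symbol.clear()


def process_event(param_dict):
    """Called for every sample; param_dict holds the sample's fields."""
    comm = param_dict.get("comm", "[unknown]")
    symbol = param_dict.get("symbol", "[unknown]")
    samples_per_symbol[(comm, symbol)] += 1


def trace_end():
    """Called once after the last sample; prints the ten hottest symbols."""
    for (comm, symbol), count in samples_per_symbol.most_common(10):
        print("{:>8d}  {:<16s} {}".format(count, comm, symbol))
```

Samples without a resolved symbol are grouped under "[unknown]", which often indicates missing symbol information.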
Archiving of Scientific Data
You will not be able to report or script a log file on a different computer. Even on the system-under-test, the shared objects will change (recompilation, updates) and the information within the recorded log-files will be incomplete, if not useless. Use the
perf archive function to create an archive that contains all information on the used shared objects. These archives can become quite large, but the data can then be processed on other systems.