Performance metrics

This post lists common performance metrics used in Performance profiling and Performance Monitoring. The metrics can be measured for single resources (e.g. a single CPU core) or for groups of resources (e.g. all CPU cores of a node). They touch on every hardware component that can move or process data (CPU, memory, accelerators, disks, network, ...).

CPU, GPU, and memory utilization metrics are the most useful for getting a quick impression of the components' workload. Power metrics are important for assessing the energy requirements of the computations. Derived metrics combine several metrics into a clearer representation.

Keep in mind that 100% utilization of the available hardware is only efficient if the computation is necessary and not redundant, i.e. if it provides new insight.

Metrics are recorded at a certain hardware level. For example, CPU time can be sampled at core level or aggregated to a single value per node. The level at which a metric can be sampled may also depend on the capabilities of the hardware; the CPU power consumption metric, for instance, might only be available per socket. Possible levels are: logical core (thread), physical core, NUMA domain, socket, node, accelerator, job. The order of the levels in this list is not a strict hierarchy. If a system has two sockets but only one NUMA domain, for example, the order of the socket and NUMA domain levels is reversed.

Performance metrics
Metric	Unit	Short description
CPU load	count	Number of threads ready to be executed
CPU utilization	%	Percentage of time the CPU was busy
CPU time	s	Time the CPU spends in an activity (user, system, idle, I/O, ...)
Memory usage	GB	Used memory
Flops	GFlop/s	Executed floating point operations per second (single or double precision)
Clock	GHz	CPU clock
IPC	count	Executed instructions per CPU cycle
GPU utilization	%	Percentage of time the GPU was busy
GPU memory usage	GB	Used memory of the GPU
Power	W	Power consumption (Package, DRAM, GPU)
PCIe bandwidth	GB/s	Used bandwidth to/from PCIe devices (e.g. accelerators)
Memory bandwidth	GB/s	Used bandwidth of memory subsystem
Network transfers	packets/s	Packets read/written over the network
Network transfers	GB/s	Data read/written over the network
Filesystem operations	requests/s	Open/close/statfs requests to the filesystem
Filesystem transfers	GB/s	Data read/written from/to the filesystem

CPU load

The CPU load metric is based on the number of CPU threads in a certain state. In a typical modern system the operating system handles a number of threads belonging to processes. The operating system manages the available CPU cores in a time-sliced scheme and decides which thread is executed in each time slice on each CPU core. A thread can be in different states, e.g.

runnable: if assigned to a CPU core, the thread can be executed
waiting: if assigned to a CPU core, the thread would be waiting for resources (e.g. when reading a file from disk)

The operating system load metric is a moving average that encompasses the system state of the past minutes. See Load and Scheduler Statistics for an in-depth discussion.

The CPU load metric is defined by the number of runnable threads. If no threads are runnable, then the CPU cores are idling. If too many threads are in a runnable state, then the system is overloaded, i.e. there is more work than the cores can process. Therefore, a load of about 1 runnable thread per core is preferable.

On the node-level the total number of threads on the system is considered and has to be compared to the number of available CPU cores. On the core-level the number of threads assigned to the respective core is considered.
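The node-level comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names load_per_core and current_load_per_core are made up for this example, while os.getloadavg() and os.cpu_count() are standard-library calls on Unix-like systems. What exactly the operating system counts in its load average is system-specific.

```python
import os


def load_per_core(load: float, n_cores: int) -> float:
    """Normalize a load value (number of threads) by the number of cores."""
    if n_cores <= 0:
        raise ValueError("n_cores must be positive")
    return load / n_cores


def current_load_per_core() -> float:
    """Node-level 1-minute load average divided by the logical core count."""
    one_minute, _, _ = os.getloadavg()  # Unix-only; raises OSError if unavailable
    n_cores = os.cpu_count() or 1       # cpu_count() may return None
    return load_per_core(one_minute, n_cores)
```

A result close to 1 matches the preferred value discussed above; a result well above 1 suggests more runnable threads than cores, and a result near 0 suggests idle cores.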

CPU time

In contrast to the CPU load metric, the CPU time metric measures how long a CPU core was busy executing a thread, often reported as a percentage of the elapsed time. CPU time can be attributed to user space (counting towards the application) or kernel space (counting towards system operations); otherwise the CPU core is idling. Note that a busy CPU core does not necessarily advance the application state; it may be busy-waiting in a loop for a resource.

On the node-level the metric can be defined by the sum of the core-level percentages.
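As a sketch of how such shares can be derived, the following Python snippet computes user, system and idle shares from two samples of tick counters. It assumes the field order of the "cpu" lines in Linux's /proc/stat (user, nice, system, idle, iowait, irq, softirq, steal); the function names and the sample lines are hypothetical, and how iowait or steal time should be attributed is a choice made here for illustration.

```python
CPU_FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line: str) -> dict:
    """Parse one 'cpu...' line of /proc/stat into named tick counters."""
    parts = line.split()
    values = (int(v) for v in parts[1:1 + len(CPU_FIELDS)])
    return dict(zip(CPU_FIELDS, values))


def cpu_time_shares(before: dict, after: dict) -> dict:
    """Share of elapsed ticks spent in user, system and idle states.

    Steal time is part of the total but not of any reported share.
    """
    delta = {name: after[name] - before[name] for name in CPU_FIELDS}
    total = sum(delta.values())
    if total == 0:
        return {"user": 0.0, "system": 0.0, "idle": 0.0}
    user = delta["user"] + delta["nice"]
    system = delta["system"] + delta["irq"] + delta["softirq"]
    idle = delta["idle"] + delta["iowait"]  # iowait counted as idle here
    return {"user": user / total, "system": system / total, "idle": idle / total}


# Two hypothetical samples of the same core, taken some time apart.
t0 = parse_cpu_line("cpu0 100 0 50 800 50 0 0 0")
t1 = parse_cpu_line("cpu0 150 0 75 825 50 0 0 0")
shares = cpu_time_shares(t0, t1)
```

With these sample values, half of the elapsed ticks are user time and a quarter each are system and idle time.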

Memory usage

The memory usage metric is defined by the system memory utilization. A more specific definition depends on the concrete accounting of the memory system. In general, the memory system works at the granularity of memory pages. Pages can reside in RAM or in swap (memory space on disk) and can have different roles and states, such as resident memory or shared memory, and may belong to operating system caches and buffers. See Virtual memory for details.

Node-level memory usage may account for the usage of the complete system, while job-level memory usage only counts memory belonging to a job's processes.
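To illustrate that "used memory" depends on the accounting, here is a minimal Python sketch for the node level. It assumes the key/value layout of Linux's /proc/meminfo and uses MemTotal minus MemAvailable as the definition of used memory, which is only one of several possible conventions. The function names and the sample text are hypothetical.

```python
def parse_meminfo(text: str) -> dict:
    """Map each /proc/meminfo key to its integer value (KiB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[key.strip()] = int(fields[0])
    return values


def used_memory_gb(meminfo: dict) -> float:
    """Used memory as MemTotal - MemAvailable, converted from KiB to GB."""
    used_kib = meminfo["MemTotal"] - meminfo["MemAvailable"]
    return used_kib * 1024 / 1e9


# Hypothetical excerpt in the format of /proc/meminfo.
SAMPLE = """\
MemTotal:       16000000 kB
MemFree:         2000000 kB
MemAvailable:   10000000 kB
Buffers:          500000 kB
Cached:          7000000 kB
"""

used = used_memory_gb(parse_meminfo(SAMPLE))
```

Using MemFree instead of MemAvailable would report a much higher usage for the same sample, because caches and buffers would then count as used.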

Flops

The Flops metric accounts for floating point operations per second (GFlop/s) executed on the CPU. The metric may account for single-precision operations, double-precision operations or the (possibly weighted) sum of both. The maximum achievable value depends on the CPU model, the clock and the type of instructions used. Real applications usually do not reach this theoretical peak.

Core-level Flops account for instructions per CPU core, while node-level Flops are a sum of the core-level metric.
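The dependence of the maximum value on core count, clock and instruction type can be made concrete with a small calculation. The following Python sketch assumes a hypothetical CPU with 16 cores at 2.5 GHz that can execute 16 double-precision operations per cycle and core; these numbers and the function names are illustrative and do not describe a specific CPU model.

```python
def peak_gflops(n_cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical node-level peak: cores x clock (GHz) x operations per cycle."""
    return n_cores * clock_ghz * flops_per_cycle


def measured_gflops(flop_count: int, seconds: float) -> float:
    """Achieved rate from a counted number of operations and a time span."""
    return flop_count / seconds / 1e9


# Hypothetical values, see the assumptions above.
peak = peak_gflops(n_cores=16, clock_ghz=2.5, flops_per_cycle=16)   # 640.0
achieved = measured_gflops(flop_count=64_000_000_000, seconds=1.0)  # 64.0
fraction_of_peak = achieved / peak
```

In this example the application reaches one tenth of the theoretical peak.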

Clock

The clock metric is defined by the CPU clock frequency. The CPU clock is adjusted dynamically depending on the current workload and the temperature of the CPU cores. On a multi-core architecture, a single busy CPU core may be clocked higher than when all CPU cores are busy in parallel.

On a node-level the metric might be the average of the core-level values.

IPC

The IPC metric is defined by the instructions per cycle executed by the CPU. Modern CPUs support concurrent execution of multiple instructions. The maximum achievable IPC depends on the number of instructions that can be issued, executed and retired per cycle and the type of instructions.
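A minimal Python sketch of the calculation, assuming that instruction and cycle counts for the same time interval are already available (e.g. from hardware performance counters). The function name, the counter values and the maximum of 4 instructions per cycle are hypothetical.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle from two counter deltas of the same interval."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


# Hypothetical counter deltas for one core.
value = ipc(instructions=3_000_000_000, cycles=2_000_000_000)  # 1.5

# Relative to an assumed maximum of 4 instructions per cycle.
fraction_of_max = value / 4
```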

Power

The Power metric accounts for the power usage of the measured hardware system. The power can be measured at several places, e.g. the CPU core, the CPU package, the mainboard, specific accelerator cards or the power supply of the complete system.
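Power is often not read directly but derived from cumulative energy counters, for example counters in microjoules such as those exposed by the Linux powercap interface on some systems. The following Python sketch shows this derivation under the assumption that the counter wraps around at a known maximum range; the function name and all values are hypothetical, and at most one wraparound between the two samples is assumed.

```python
def average_power_watts(e0_uj: int, e1_uj: int, seconds: float, max_range_uj: int) -> float:
    """Average power between two energy counter samples given in microjoules."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta_uj = e1_uj - e0_uj
    if delta_uj < 0:
        # The counter wrapped around once between the two samples.
        delta_uj += max_range_uj
    return delta_uj / 1e6 / seconds


# 50 J consumed within one second corresponds to 50 W.
package_power = average_power_watts(1_000_000, 51_000_000, 1.0, 2**32)
```

The sampling interval has to be short enough that the counter cannot wrap more than once, otherwise the result is too low.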

Memory bandwidth

The Memory bandwidth metric accounts for the transferred data between the CPU socket and the main memory or other CPU sockets. This metric does not account for all memory accesses, because ideally many memory accesses are handled by the cache hierarchy.

Network transfers

The Network transfers metric accounts for data received or transmitted over the network interfaces. The metric may count packets or the amount of transferred data. A system may have multiple network interfaces, e.g. Ethernet, InfiniBand or Omni-Path.
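Both variants of the metric (packets and bytes) can be derived from cumulative per-interface counters. The Python sketch below assumes the layout of Linux's /proc/net/dev, i.e. two header lines followed by one line per interface with 8 receive and 8 transmit fields. The function names, the interface name eth0 and the counter values are hypothetical.

```python
def parse_net_dev(text: str) -> dict:
    """Extract byte and packet counters per interface from /proc/net/dev text."""
    counters = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or len(fields) < 16:
            continue
        counters[name.strip()] = {
            "rx_bytes": int(fields[0]),
            "rx_packets": int(fields[1]),
            "tx_bytes": int(fields[8]),
            "tx_packets": int(fields[9]),
        }
    return counters


def rates(before: dict, after: dict, seconds: float) -> dict:
    """Per-second rates from two counter samples of one interface."""
    return {key: (after[key] - before[key]) / seconds for key in before}


HEADER = (
    "Inter-|   Receive                            |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast"
    "|bytes packets errs drop fifo colls carrier compressed\n"
)
# Two hypothetical samples taken 2 seconds apart.
T0 = HEADER + "  eth0: 1000000 1000 0 0 0 0 0 0 500000 800 0 0 0 0 0 0\n"
T1 = HEADER + "  eth0: 3000000 2500 0 0 0 0 0 0 1500000 1300 0 0 0 0 0 0\n"

eth0_rates = rates(parse_net_dev(T0)["eth0"], parse_net_dev(T1)["eth0"], 2.0)
```

Here eth0_rates holds bytes per second and packets per second for both directions; dividing the byte rates by 1e9 gives GB/s.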

Filesystem access

The Filesystem access metric accounts for filesystem operations. The metric may count the number of read or write operations or the amount of data read or written. A system may have multiple filesystems, e.g. a local filesystem or network filesystems such as NFS or Lustre.

Accelerator metrics

A number of metrics can be measured for accelerator cards, such as graphics cards. These metrics may be similar to CPU metrics, e.g. utilization, memory usage or power usage.

Derived metrics

Performance metrics are often combined into derived metrics, e.g. to show accelerator utilization vs. power draw, which can be a better indicator of computational efficiency.

A roofline plot is a popular representation of application performance, since it shows compute-bound and memory-bound regions in a single graph.
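The roofline itself is the minimum of two limits: the peak floating point rate and the memory bandwidth multiplied by the arithmetic intensity (operations per byte moved) of the application. A minimal Python sketch, assuming a hypothetical node with a peak of 640 GFlop/s and a memory bandwidth of 100 GB/s; the function names are illustrative.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Roofline limit for a given arithmetic intensity (Flop per byte)."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def ridge_point(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Intensity at which the memory-bound region meets the compute-bound region."""
    return peak_gflops / bandwidth_gbs


PEAK = 640.0       # GFlop/s, hypothetical
BANDWIDTH = 100.0  # GB/s, hypothetical

memory_bound = attainable_gflops(PEAK, BANDWIDTH, intensity=2.0)    # limited by bandwidth
compute_bound = attainable_gflops(PEAK, BANDWIDTH, intensity=10.0)  # limited by peak
ridge = ridge_point(PEAK, BANDWIDTH)
```

An application with an intensity below the ridge point lies in the memory-bound region; above it, the peak floating point rate is the limit.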