Performance metrics

From HPC Wiki

This page discusses common performance metrics used in Performance profiling and Performance Monitoring. The metrics can be measured for single resources (e.g. a single CPU core) or for groups of resources (e.g. all CPU cores of a node).

Performance metrics
Metric                | Unit       | Short description
CPU load              | count      | Number of threads ready to be executed
CPU time              | %          | Percentage of time the CPU was busy
Memory usage          | GB         | Used memory
Flops                 | GFlops/s   | Executed floating point operations per second
Clock                 | GHz        | CPU clock
IPC                   | count      | Executed instructions per CPU cycle
Power                 | W          | Power consumption
Memory bandwidth      | GB/s       | Used bandwidth of memory subsystem
Network transfers     | packets/s  | Packets read/written over the network
Network transfers     | GB/s       | Data read/written over the network
Filesystem operations | requests/s | Open/close/statfs requests to the filesystem
Filesystem transfers  | GB/s       | Data read/written from/to the filesystem

CPU load

The CPU load metric is based on the number of threads in a certain state. In a typical modern system, the operating system handles a number of threads belonging to the running processes. It manages the available CPU cores in a time-sliced scheme and decides which thread is executed in each time slice on each CPU core. A thread can be in different states, e.g.

  • runnable: if assigned to a CPU core, the thread can be executed
  • waiting: if assigned to a CPU core, the thread would be waiting for resources (e.g. when reading a file from disk)

The operating system load metric is a moving average that covers the system state of the past few minutes. On Linux, for example, it is reported as averages over the past 1, 5 and 15 minutes. See Load and Scheduler Statistics for an in-depth discussion.

The CPU load metric is defined by the number of runnable threads. If no threads are runnable, the CPU cores are idling. If too many threads are runnable, the system is overloaded, i.e. there is more work than the cores can execute at once. A preferred load is therefore about one runnable thread per core.

At the node level, the total number of runnable threads on the system is considered and has to be compared to the number of available CPU cores. At the core level, the number of threads assigned to the respective core is considered.
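The comparison of node-level load with the number of cores can be sketched as follows. This is a minimal illustration, not a definitive tool: the function names, the labels and the 25 % tolerance are assumptions made for this example, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os


def load_per_core(load_average, num_cores):
    """Return the load average normalised by the number of CPU cores."""
    if num_cores <= 0:
        raise ValueError("num_cores must be positive")
    return load_average / num_cores


def classify_load(load_average, num_cores, tolerance=0.25):
    """Label the load relative to the preferred value of 1 per core.

    The tolerance is a hypothetical threshold chosen for this sketch.
    """
    ratio = load_per_core(load_average, num_cores)
    if ratio < 1.0 - tolerance:
        return "underutilized"
    if ratio > 1.0 + tolerance:
        return "overloaded"
    return "balanced"


def classify_local_node():
    """Classify the local node using its 1-minute load average (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return classify_load(one_minute, os.cpu_count() or 1)


# Synthetic example: a load of 6.5 on a node with 8 cores.
print(classify_load(6.5, 8))
```

Calling `classify_local_node()` on a compute node applies the same rule to the live 1-minute load average.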

CPU time

In contrast to the CPU load metric, the CPU time metric simply measures the percentage of time a CPU core was busy executing a thread. Busy time is split into time spent in user-space (counting towards the application) and in kernel-space (counting towards system operations); during the remaining time the CPU core is idle. Note that a busy CPU core does not necessarily advance the application state, because it may be busy-waiting in a loop for a resource.

At the node level, the metric can be defined as the sum of the core-level percentages, so a node with N fully busy cores reports N × 100 %.
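As an illustration, the percentages can be derived from two samples of cumulative time counters. The sketch below assumes the field order of the `cpu` lines in Linux `/proc/stat` and uses synthetic sample lines. The helper names are hypothetical, and counting everything that is neither user nor idle time as kernel-side time is a simplification.

```python
# Field order of a "cpu" line in Linux /proc/stat (first eight counters).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn a line such as 'cpu0 100 0 50 850 0 0 0 0' into a counter dict."""
    parts = line.split()
    return dict(zip(FIELDS, (int(value) for value in parts[1:1 + len(FIELDS)])))


def cpu_time_percent(before, after):
    """Return user, system and idle time as percentages of the interval."""
    delta = {name: after[name] - before[name] for name in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    user = delta["user"] + delta["nice"]
    idle = delta["idle"] + delta["iowait"]
    # Simplification: system, irq, softirq and steal all count as kernel-side.
    system = total - user - idle
    return {
        "user": 100.0 * user / total,
        "system": 100.0 * system / total,
        "idle": 100.0 * idle / total,
    }


def node_busy_percent(per_core):
    """Node-level CPU time as the sum of the core-level busy percentages."""
    return sum(core["user"] + core["system"] for core in per_core)


# Synthetic samples for one core, taken some time apart.
first = parse_cpu_line("cpu0 100 0 50 850 0 0 0 0")
second = parse_cpu_line("cpu0 400 0 150 1450 0 0 0 0")
core0 = cpu_time_percent(first, second)
print(core0)
print(node_busy_percent([core0, core0]))
```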

Memory usage

The memory usage metric is defined by the system memory utilization. A more specific definition depends on how the memory system accounts for its usage. In general, the memory system works at the granularity of memory pages. Pages can reside in RAM or in swap (memory space on disk) and can have different roles and states, such as resident memory or shared memory. They may also belong to operating system caches and buffers. See Virtual memory for details.

Node-level memory usage may account for the usage of the complete system, while job-level memory usage only counts memory belonging to a job's processes.
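One possible node-level accounting is sketched below. It assumes text in the format of Linux `/proc/meminfo` and defines used memory as `MemTotal` minus `MemAvailable`, which is only one of several reasonable definitions. The sample values and function names are made up for this example.

```python
SAMPLE_MEMINFO = """\
MemTotal:       16777216 kB
MemFree:         1048576 kB
MemAvailable:    4194304 kB
Buffers:          524288 kB
Cached:          2097152 kB
SwapTotal:       8388608 kB
SwapFree:        8388608 kB
"""

KIB_PER_GIB = 1024 ** 2


def parse_meminfo(text):
    """Parse 'Key:   value kB' lines into a dict of integer values."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    return values


def used_memory_gib(meminfo):
    """Used RAM in GiB, assuming used = MemTotal - MemAvailable."""
    return (meminfo["MemTotal"] - meminfo["MemAvailable"]) / KIB_PER_GIB


def used_swap_gib(meminfo):
    """Used swap space in GiB."""
    return (meminfo["SwapTotal"] - meminfo["SwapFree"]) / KIB_PER_GIB


info = parse_meminfo(SAMPLE_MEMINFO)
print(used_memory_gib(info))  # 12.0
print(used_swap_gib(info))    # 0.0
```

On a live Linux system the same functions could be fed with the contents of `/proc/meminfo` instead of the sample string.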


Flops

The Flops metric accounts for floating point operations per second (GFlops/s) executed on the CPU. The metric may account for single-precision operations, double-precision operations or the (possibly weighted) sum of both. The maximum achievable value depends on the CPU model, the clock and the type of instructions used. Real applications usually do not reach this peak value.

Core-level Flops account for the floating point operations executed per CPU core, while node-level Flops are the sum of the core-level values.
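How the peak value depends on the clock and the instruction type can be sketched with a simple product. All hardware parameters below (8 cores, 2.0 GHz, 4 SIMD lanes, 2 FMA units) are hypothetical and do not describe any particular CPU model. The function names are likewise assumptions for this example.

```python
def peak_gflops(cores, clock_ghz, simd_lanes, fma_units, flops_per_fma=2):
    """Theoretical peak in GFlops/s.

    A fused multiply-add counts as two floating point operations per lane.
    """
    return cores * clock_ghz * simd_lanes * fma_units * flops_per_fma


def measured_gflops(flop_count, seconds):
    """Measured rate in GFlops/s from an operation count and a runtime."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return flop_count / seconds / 1e9


def node_gflops(core_values):
    """Node-level Flops as the sum of the core-level values."""
    return sum(core_values)


# Hypothetical node: 8 cores at 2.0 GHz, 4 double-precision lanes, 2 FMA units.
peak = peak_gflops(cores=8, clock_ghz=2.0, simd_lanes=4, fma_units=2)
achieved = node_gflops([measured_gflops(2e9, 0.5)] * 8)
print(peak)             # 256.0
print(achieved)         # 32.0
print(achieved / peak)  # 0.125
```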


Clock

The clock metric is defined by the CPU clock frequency. The CPU clock changes dynamically depending on the current workload and the temperature of the CPU cores. On a multi-core architecture, a single busy CPU core may be clocked higher than when all CPU cores are busy in parallel.

At the node level, the metric might be the average of the core-level values.


IPC

The IPC metric is defined by the number of instructions executed per CPU cycle. Modern CPUs support concurrent execution of multiple instructions. The maximum achievable IPC depends on the number of instructions that can be issued, executed and retired per cycle and on the type of instructions.
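Both the IPC and an effective clock can be derived from hardware counter readings over a measurement interval. The sketch below uses synthetic counter values and hypothetical function names. The clock estimate additionally assumes that the core was busy for the whole interval.

```python
def ipc(instructions, cycles):
    """Instructions executed per CPU cycle."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def clock_ghz(cycles, seconds):
    """Effective clock in GHz, assuming the core was busy the whole time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return cycles / seconds / 1e9


def node_clock_ghz(core_clocks):
    """Node-level clock as the average of the core-level values."""
    return sum(core_clocks) / len(core_clocks)


# Synthetic counters for one core over a one-second interval.
print(ipc(3_000_000_000, 2_000_000_000))    # 1.5
print(clock_ghz(2_000_000_000, 1.0))        # 2.0
print(node_clock_ghz([3.0, 2.0, 2.0, 2.0]))  # 2.25
```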


Power

The Power metric accounts for the power consumption of the measured hardware system. The power can be measured at several places, e.g. the CPU core, the CPU package, the mainboard, specific accelerator cards or the power supply of the complete system.
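Many measurement interfaces expose a cumulative energy counter rather than the power itself, for example the `energy_uj` files of the Linux powercap interface. The average power over an interval is then the energy difference divided by the elapsed time. The sketch below uses synthetic readings and assumes that the counter restarts from zero once it reaches a known maximum range. The function name and the exact wrap-around behaviour are assumptions for this example.

```python
def average_power_watts(energy_before_uj, energy_after_uj, seconds, max_range_uj):
    """Average power in W from two cumulative energy readings in microjoules.

    Assumes at most one wrap-around of the counter between the two readings.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta_uj = energy_after_uj - energy_before_uj
    if delta_uj < 0:
        # The counter wrapped around between the two readings.
        delta_uj += max_range_uj
    return delta_uj / 1e6 / seconds


# Synthetic readings: 150 J consumed within 2 s.
print(average_power_watts(1_000_000, 151_000_000, 2.0, 1_000_000_000))    # 75.0

# Synthetic readings with a wrap-around: 50 J consumed within 1 s.
print(average_power_watts(990_000_000, 40_000_000, 1.0, 1_000_000_000))  # 50.0
```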

Memory bandwidth

The Memory bandwidth metric accounts for the data transferred between the CPU socket and the main memory or other CPU sockets. This metric does not account for all memory accesses, because ideally many of them are served by the cache hierarchy.
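Memory bandwidth is often derived from counts of cache-line transfers between the memory controller and main memory. The sketch below assumes a cache-line size of 64 bytes, which is common but not universal, and uses synthetic transfer counts. The function name is hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumption: common cache-line size, not universal


def memory_bandwidth_gb_per_s(line_reads, line_writes, seconds,
                              line_bytes=CACHE_LINE_BYTES):
    """Bandwidth in GB/s from cache-line read and write counts."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    transferred_bytes = (line_reads + line_writes) * line_bytes
    return transferred_bytes / 1e9 / seconds


# Synthetic counts: 156.25 million cache lines transferred within one second.
print(memory_bandwidth_gb_per_s(100_000_000, 56_250_000, 1.0))  # 10.0
```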

Network transfers

The Network transfers metric accounts for data received or transmitted over the network interfaces. The metric may count packets or the amount of transferred data. A system may have multiple network interfaces, e.g. Ethernet, InfiniBand or Omni-Path.
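Both variants of the metric can be computed from cumulative per-interface counters. The sketch below assumes text in the format of Linux `/proc/net/dev`, where each interface line carries eight receive counters followed by eight transmit counters. The sample data, the interface name and the function names are made up for this example.

```python
HEADER = (
    "Inter-|   Receive                            |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast"
    "|bytes packets errs drop fifo colls carrier compressed\n"
)
SAMPLE_1 = HEADER + "  eth0: 1000000000 1000000 0 0 0 0 0 0 500000000 500000 0 0 0 0 0 0\n"
SAMPLE_2 = HEADER + "  eth0: 3000000000 3000000 0 0 0 0 0 0 1500000000 900000 0 0 0 0 0 0\n"


def parse_net_dev(text):
    """Parse /proc/net/dev style text into {interface: counters}."""
    counters = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        name, separator, rest = line.partition(":")
        values = rest.split()
        if not separator or len(values) < 16:
            continue
        counters[name.strip()] = {
            "rx_bytes": int(values[0]), "rx_packets": int(values[1]),
            "tx_bytes": int(values[8]), "tx_packets": int(values[9]),
        }
    return counters


def transfer_rates(before, after, seconds):
    """Return GB/s and packets/s for one interface between two samples."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return {
        "rx_gb_per_s": (after["rx_bytes"] - before["rx_bytes"]) / 1e9 / seconds,
        "tx_gb_per_s": (after["tx_bytes"] - before["tx_bytes"]) / 1e9 / seconds,
        "rx_packets_per_s": (after["rx_packets"] - before["rx_packets"]) / seconds,
        "tx_packets_per_s": (after["tx_packets"] - before["tx_packets"]) / seconds,
    }


# The two synthetic samples are assumed to be taken two seconds apart.
rates = transfer_rates(parse_net_dev(SAMPLE_1)["eth0"],
                       parse_net_dev(SAMPLE_2)["eth0"], 2.0)
print(rates)
```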

Filesystem access

The Filesystem access metric accounts for filesystem operations. The metric may count the number of read or write operations or the amount of data read or written. A system may have multiple filesystems, e.g. a local filesystem or network filesystems such as NFS or Lustre.

Accelerator metrics

A number of metrics can also be measured for accelerator cards, such as graphics cards. These metrics may be similar to the CPU metrics, e.g. utilization, memory usage and power usage.