Job efficiency

From HPC Wiki
Revision as of 14:31, 27 February 2024 by Alex-wiens-5f0c@uni-paderborn.de (talk | contribs) (Add basic measurement considerations)
Jump to navigation Jump to search


Job efficiency describes how well a job makes use of the available resources. Efficiency can be viewed from the perspective of the time, that the resource is allocated, or the energy, that is consumed by the hardware during the job runtime, which usually depends on the first. Therefore, job efficiency is tied to the application performance in that an increase in performance (and decrease in runtime) usually leads to a better efficiency (lower runtime and/or lower energy consumption).

This guideline discusses basics about efficiency measurement, how to spot and mitigate common job efficiency pitfalls and Performance Engineering.

Efficiency measurement

For job efficiency assessment, the utilization of the allocated hardware resources has to be measured. This can be done by performing Performance profiling or by accessing the cluster's Performance Monitoring. The measured Performance metrics give insight into the utilization of the allocated resources and possible deficiencies.

The runtime (or walltime) is easily measured as the time a job was executed from start to finish.

For the energy consumption measurement is more complicated. For one, the energy consumption of the involved hardware has to be measured precisely, which the hardware has to support. On the other hand, it has to be decided which hardware (or percentage thereof) is involved in the execution of the job, which might be difficult for compute systems running shared jobs or resources, such as network hardware, used by many jobs.

For the interpretation of Performance metrics, it is necessary to understand the source of the measured value. Sampled values capture only a very reduced view on the complete system and may be subject to measurement artifacts. Therefore, one has to question the validity of the values and examine how they were produced by the system. Basic knowledge about the measured values includes the minimal and maximal possible values and what system behavior can produce such measurements. For example, how do measurements look like for an idling system and how do they look like for different kinds of synthetic benchmarks?


Common job efficiency pitfalls

The following pitfalls can be spotted using Performance Monitoring or Performance profiling. If Performance Monitoring is available, one can check the measured resource utilization for unexpected characteristics.

Resource oversubscription

Oversubscription of resources happens when the application's assignment of work to resources is flawed. For example, when there are more compute threads than CPU cores, then this leads to thread scheduling overhead and inefficient core utilization. Usually, this indicates a misconfiguration. Either, the application accidentally spawns more worker threads than intended or the job allocation includes too few CPU cores.


Resource underutilization

Load imbalance

Filesystem access

Scaling

Performance expectation and reality