Job efficiency

From HPC Wiki
Revision as of 18:14, 27 February 2024 by Alex-wiens-5f0c@uni-paderborn.de (talk | contribs) (Add oversubscription example)


Job efficiency describes how well a job makes use of the available resources. Efficiency can be viewed in terms of the time for which a resource is allocated, or in terms of the energy consumed by the hardware during the job runtime; the latter usually depends on the former. Job efficiency is therefore tied to application performance: an increase in performance (and the corresponding decrease in runtime) usually leads to better efficiency (lower runtime and/or lower energy consumption).

This guideline covers the basics of efficiency measurement, how to spot and mitigate common job efficiency pitfalls, and Performance Engineering.

Efficiency measurement

To assess job efficiency, the utilization of the allocated hardware resources has to be measured. This can be done by performing Performance profiling or by accessing the cluster's Performance Monitoring. The measured Performance metrics give insight into how well the allocated resources are utilized and where deficiencies may lie.

The runtime (or walltime) is easily measured as the time a job was executed from start to finish.
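
As a minimal sketch of such a measurement, the following Python snippet times a function call and relates the consumed CPU time to the allocated core time. The helper names `measure` and `cpu_efficiency` are hypothetical, and the ratio of CPU time to walltime multiplied by the number of cores is only one common way to express efficiency, assumed here for illustration.

```python
import time


def measure(func, *args, **kwargs):
    """Run func and return (result, walltime_s, cpu_time_s).

    Hypothetical helper: walltime is elapsed real time, CPU time is the
    time the current process spent executing on CPU cores.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args, **kwargs)
    cpu_time_s = time.process_time() - cpu_start
    walltime_s = time.perf_counter() - wall_start
    return result, walltime_s, cpu_time_s


def cpu_efficiency(cpu_time_s, walltime_s, allocated_cores):
    """Fraction of the allocated core time that was spent computing.

    Assumed definition: cpu_time / (walltime * allocated_cores).
    """
    if walltime_s <= 0 or allocated_cores <= 0:
        raise ValueError("walltime and core count must be positive")
    return cpu_time_s / (walltime_s * allocated_cores)
```

For example, a job that accumulates 80 s of CPU time during 10 s of walltime on 16 allocated cores would have an efficiency of 0.5 under this definition.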

Measuring energy consumption is more complicated. First, the energy consumption of the involved hardware has to be measured precisely, which the hardware has to support. Second, it has to be decided which hardware (or what fraction of it) is involved in the execution of the job. This can be difficult for compute systems that run shared jobs and for shared resources, such as network hardware, that are used by many jobs.
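
The two steps can be sketched in Python as follows. This is an illustration under stated assumptions, not a description of how any particular cluster accounts for energy: the power samples are assumed to arrive at a fixed interval, and the node energy is attributed to a job in proportion to its share of the cores, which is only one of several possible attribution policies. Both function names are hypothetical.

```python
def energy_joules(power_samples_w, interval_s):
    """Approximate energy from power samples taken at a fixed interval.

    Each sample (in watts) is assumed to hold for one full interval
    (in seconds), so the energy is the sum of samples times the interval.
    """
    return sum(power_samples_w) * interval_s


def job_energy_share(node_energy_j, job_cores, node_cores):
    """Attribute node energy to a job proportionally to its core share.

    Assumption for illustration only: a real system may weight by actual
    utilization or include shared hardware such as the network.
    """
    if node_cores <= 0 or not 0 <= job_cores <= node_cores:
        raise ValueError("invalid core counts")
    return node_energy_j * job_cores / node_cores
```

With three samples of 200 W, 300 W and 250 W taken every 10 s, the node consumed roughly 7500 J; a job holding 16 of 64 cores would be attributed 1875 J under the proportional assumption.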

To interpret Performance metrics, it is necessary to understand the source of the measured value. Sampled values capture only a very limited view of the complete system and may be subject to measurement artifacts. Therefore, one has to question the validity of the values and examine how they were produced by the system. Basic knowledge about the measured values includes the minimum and maximum possible values and what system behavior can produce such measurements. For example, what do measurements look like for an idle system, and what do they look like for different kinds of synthetic benchmarks?
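
A simple first check along these lines is to compare sampled values against the known minimum and maximum of the metric. The sketch below assumes the bounds are known beforehand (for instance 0 to 100 for a percentage metric); the function name `find_implausible` is hypothetical.

```python
def find_implausible(samples, lower, upper):
    """Return (index, value) pairs for samples outside [lower, upper].

    NaN values fail the range comparison and are therefore reported too.
    """
    return [
        (index, value)
        for index, value in enumerate(samples)
        if not (lower <= value <= upper)
    ]
```

Samples flagged this way are not necessarily wrong, but they are a prompt to examine how the value was produced before drawing conclusions from it.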


Common job efficiency pitfalls

The following pitfalls can be spotted using Performance Monitoring or Performance profiling. If Performance Monitoring is available, one can check the measured resource utilization for unexpected characteristics.

Resource oversubscription

Oversubscription example: two jobs, each executed on 16 cores. The top job executes 2 threads per core; the bottom job executes 16 threads per core.

Oversubscription of resources happens when the application's assignment of work to resources is flawed. For example, running more compute threads than CPU cores leads to thread scheduling overhead and inefficient core utilization. This usually indicates a misconfiguration: either the application accidentally spawns more worker threads than intended, or the job allocation includes too few CPU cores.
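
An application can guard against this kind of misconfiguration by comparing its intended thread count with the cores it is actually allowed to use. The following Python sketch shows one possible check; the helper names are hypothetical. It relies on `os.sched_getaffinity`, which is only available on some platforms such as Linux, and falls back to `os.cpu_count`, which reports all cores of the machine rather than those allocated to the job.

```python
import os


def available_cores():
    """Number of CPU cores the current process may run on."""
    try:
        # Respects CPU affinity, which job schedulers commonly set.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Not available on this platform: fall back to the total count.
        return os.cpu_count() or 1


def is_oversubscribed(n_threads, n_cores):
    """True if there are more compute threads than cores."""
    return n_threads > n_cores


def capped_worker_count(requested_threads, n_cores):
    """Limit the worker count to the number of usable cores."""
    return max(1, min(requested_threads, n_cores))
```

For instance, `capped_worker_count(256, available_cores())` would keep a request for 256 workers from exceeding the cores of the allocation.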

The oversubscription example shows two jobs that each run more threads than cores: the first with 2 threads per core, the second with 16 threads per core. Three metrics are shown: node-level CPU load, core-level CPU load and core-level CPU time. Note the ramp-up phase for the metrics. For the first job, the CPU time measurement shows 100% utilization, even though two threads are competing for each CPU core. For the second job, the CPU time measurement shows degraded utilization because of the thread scheduling overhead.


Resource underutilization

Load imbalance

Filesystem access

Scaling

Performance expectation and reality