Performance Monitoring
Latest revision as of 14:20, 20 November 2024
Performance monitoring can be done in the form of performance profiling by the user or developer of an application, or as a background service by the operator of an HPC cluster.
Monitoring performance at the cluster level gives the operator insight into the stable operation, utilization and possible defects of the cluster.
If the monitored data is available to the user, it can also give insight into [[Job efficiency]].
Which [[Performance metrics]] are available is a trade-off between their usefulness and the interference their collection causes with job execution.
Take a look at the [[Site-specific documentation]] to find out whether performance monitoring is available on your cluster.
The following list shows some performance monitoring solutions:
* [https://www.clustercockpit.org/ ClusterCockpit]
* [https://compendium.hpc.tu-dresden.de/software/pika/ PIKA]
HPC admins who want to set up background monitoring can take a look at [[Background Performance Monitoring Considerations]].
Individual metrics are typically collected with low-level tools like:
* [[Likwid]]
* [[Perf]]
* RAPL
* ibstat
* nvidia-smi, nvml
* lctl, beegfs-sctl, mmpmon, nfsstat
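As a rough illustration of how a single low-level metric might be sampled, the following sketch estimates CPU package power from the RAPL energy counter. It assumes a Linux system that exposes RAPL through the powercap sysfs interface at <code>/sys/class/powercap/intel-rapl:0</code> and that the counter is readable by the current user (often root only). The function names and the one-second interval are illustrative choices, not part of any monitoring stack named on this page.

```python
import time
from pathlib import Path

# Assumption: package-0 RAPL domain as exposed by the Linux powercap driver.
# The path differs or is absent on systems without this driver.
RAPL_DOMAIN = Path("/sys/class/powercap/intel-rapl:0")


def read_counter(path: Path) -> int:
    """Read a single integer value from a sysfs file."""
    return int(path.read_text().strip())


def average_power_watts(e_start_uj: int, e_end_uj: int,
                        max_range_uj: int, seconds: float) -> float:
    """Convert two energy readings (microjoules) into average power (watts).

    Assumes the counter wrapped at most once between the two readings
    and that it wraps at max_range_uj.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta_uj = e_end_uj - e_start_uj
    if delta_uj < 0:
        # The counter overflowed between the two samples.
        delta_uj += max_range_uj
    return delta_uj / 1e6 / seconds


def sample_package_power(interval_s: float = 1.0,
                         domain: Path = RAPL_DOMAIN) -> float:
    """Sample the energy counter twice and return the average power."""
    max_range = read_counter(domain / "max_energy_range_uj")
    e0 = read_counter(domain / "energy_uj")
    t0 = time.monotonic()
    time.sleep(interval_s)
    e1 = read_counter(domain / "energy_uj")
    t1 = time.monotonic()
    return average_power_watts(e0, e1, max_range, t1 - t0)


if __name__ == "__main__":
    try:
        print(f"package power: {sample_package_power():.1f} W")
    except OSError as exc:
        # Missing driver or insufficient permissions.
        print(f"RAPL counters not readable here: {exc}")
```

A background monitoring service would typically run such a sampler at a fixed, fairly coarse interval, since every read adds a small amount of interference with the running job.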