Regression Testing (Admin Guide)
(Semi-)Automatic Regression Testing of Cluster Systems
Node checks after vendor repair
When the vendor of the system services individual nodes (e.g. after replacing defective hardware, renewing thermal paste, or updating BIOS/firmware), it is good practice to check the health of the released compute node before putting it back into the batch system. Such a health check should include both a functional test and a stress test.
For this purpose, in Aachen we run a (single-node) HPL and a Stream benchmark in a loop for multiple hours and check whether the acceptance criteria were reached in every individual run. The acceptance criteria are the same as those defined for the procurement: for instance, an HPL (Linpack) performance of more than 1.72 TFLOPS and a Stream memory bandwidth of about 208.6 GB/s on our two-socket Intel Xeon Platinum 8160 nodes. If these benchmarks were not used during the acceptance test after the procurement, any other reasonable threshold can of course be defined. To further check the functionality of the compute node, the hybrid version of HPL is used, where we place one MPI process per socket and fill up the node with OpenMP threads. Here, it is important to use a significant amount of the available memory. The vendor is informed (i.e., the node is reported as “still defective”) in any of the following cases:
- The average Stream memory bandwidth is below the defined threshold
- The average HPL performance is below the defined threshold
- The compute node crashes or has some other unexpected behavior
- The compute node is getting too hot
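The criteria above can be sketched as a small check over the collected per-run results. This is only a minimal illustration, not the script used in Aachen: the `RunResult` record, the `failure_reasons` function, and the temperature limit `MAX_TEMP_C` are hypothetical names and values introduced here (the text gives no temperature threshold), and collecting the numbers from the HPL and Stream output is left out.

```python
from dataclasses import dataclass
from statistics import mean

# Thresholds quoted in the text for two-socket Intel Xeon Platinum 8160 nodes.
HPL_MIN_TFLOPS = 1.72
STREAM_MIN_GBS = 208.6
# Assumption: the text names no temperature limit; this is a placeholder value.
MAX_TEMP_C = 90.0


@dataclass
class RunResult:
    """Result of one loop iteration (hypothetical record layout)."""
    hpl_tflops: float      # HPL performance of this run
    stream_gbs: float      # Stream memory bandwidth of this run
    max_temp_c: float      # highest temperature observed during the run
    crashed: bool = False  # True if the node crashed or misbehaved


def failure_reasons(runs):
    """Return the reasons for reporting the node as still defective.

    An empty list means that none of the four criteria was triggered.
    """
    if not runs:
        return ["no benchmark results collected"]

    reasons = []
    # Averages are taken over the runs that completed normally.
    completed = [r for r in runs if not r.crashed]
    if completed:
        if mean([r.stream_gbs for r in completed]) < STREAM_MIN_GBS:
            reasons.append("average Stream bandwidth below threshold")
        if mean([r.hpl_tflops for r in completed]) < HPL_MIN_TFLOPS:
            reasons.append("average HPL performance below threshold")
    if any(r.crashed for r in runs):
        reasons.append("node crashed or behaved unexpectedly")
    if any(r.max_temp_c > MAX_TEMP_C for r in runs):
        reasons.append("node temperature above limit")
    return reasons


if __name__ == "__main__":
    runs = [RunResult(1.80, 210.0, 70.0), RunResult(1.75, 209.0, 72.0)]
    problems = failure_reasons(runs)
    print("node OK" if not problems else "still defective: " + "; ".join(problems))
```

Returning the list of reasons, rather than a single pass/fail flag, makes it easy to include the concrete findings in the report to the vendor.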
This process avoids putting compute nodes that are still defective back into the production system. Furthermore, HPL and Stream are well-known benchmarks, so a vendor can easily reproduce the results and analyze a possible problem with the node.