Pathologic Job Classes

This is a list of job classification rules created as part of the NHR PathoJobs project. Its purpose is to automatically detect jobs that require attention or offer an optimization opportunity. The rules use the metric names specified in the ClusterCockpit job data JSON schema.

Categories

A - Resource Allocation

Job classes related to faulty resource allocation and process placement.

A.1 Low CPU utilisation

Pattern description: Job uses fewer resources than requested
Pattern detection: mean(cpu_load) < normal(cpu_load)
Applicable situations: Node exclusive jobs only, because cpu_load is a node scope metric
Message to user: Your job uses significantly fewer resources than requested.
Pattern mitigation: Check job script for allocation and placement errors

"parameter":["cpuload_threshold"],
"rule_terms":[
{"load_mean":"cpu_load.mean('all')"},
{"lowload":"load_mean < job.numHwthreads * cpuload_threshold"}
],
"output":"lowload"

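The rule terms above can be read as the following minimal Python sketch. It is only an illustration: the function name `low_cpu_load` and the data layout (one list of cpu_load samples per node) are assumptions and not part of ClusterCockpit, and `mean('all')` is interpreted here as the mean over all samples of all nodes.

```python
from statistics import mean

def low_cpu_load(cpu_load, num_hwthreads, cpuload_threshold):
    """Mirror the rule: load_mean < job.numHwthreads * cpuload_threshold.

    cpu_load: dict mapping node name -> list of load samples (assumed layout).
    """
    # Assumption: mean('all') averages every sample of every node.
    samples = [s for series in cpu_load.values() for s in series]
    load_mean = mean(samples)
    return load_mean < num_hwthreads * cpuload_threshold
```

For example, a job with 64 hardware threads and a mean load of 3 is flagged when cpuload_threshold is 0.5.
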
A.2 Low GPU utilisation

Pattern description: Job uses fewer GPUs than requested
Pattern detection: mean(acc_utilization) < alert(acc_utilization) on any of the requested GPUs
Applicable situations: *
Message to user: Your job does not use some of the requested GPUs.
Pattern mitigation: Check job script for allocation and placement errors. Check application code.

"parameter":["gpuload_threshold"],                                                    
"rule_terms":[
{"load_mean":"acc_utilization.mean('all')"},
{"load_thres":"load_mean < gpuload_threshold"},
{"lowload":"load_thres.any('all')"}
],
"output":"lowload"

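A minimal Python sketch of the three rule terms above, assuming acc_utilization is available as one list of samples per accelerator. The function name `low_gpu_load` and the dict layout are hypothetical and only serve to illustrate the "mean per GPU, then any" logic.

```python
from statistics import mean

def low_gpu_load(acc_utilization, gpuload_threshold):
    """Return True if any requested GPU has a mean utilization below the threshold.

    acc_utilization: dict mapping accelerator id -> list of samples (assumed layout).
    """
    # load_mean: one mean value per accelerator
    load_mean = {gpu: mean(series) for gpu, series in acc_utilization.items()}
    # load_thres: per-accelerator flag, True if that GPU is underused
    load_thres = {gpu: m < gpuload_threshold for gpu, m in load_mean.items()}
    # lowload: flag the job if at least one GPU is underused
    return any(load_thres.values())
```
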
A.3 Resource oversubscription

Pattern description: Job has too high a CPU load
Pattern detection: avg(cpu_load) > peak(cpu_load)
Applicable situations: peak value needs to be scaled for shared jobs.
Message to user: Job overloads available resources.
Pattern mitigation: Check job script and parallelisation for errors.

"parameter":["oversubscription_threshold"],
"rule_terms":[
{"load_mean":"cpu_load.mean('all')"},
{"load_thres":"(job.numHwthreads / job.numNodes) * oversubscription_threshold"},
{"oversubscription":"load_mean > load_thres"}
],
"output":"oversubscription"

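A minimal Python sketch of the oversubscription check. It assumes that the load threshold is the number of hardware threads per node multiplied by oversubscription_threshold; this is an interpretation based on cpu_load being a node scope metric. The function name `oversubscribed` and the data layout are hypothetical.

```python
from statistics import mean

def oversubscribed(cpu_load, num_hwthreads, num_nodes, oversubscription_threshold):
    """Return True if the mean node load exceeds the scaled per-node thread count.

    cpu_load: dict mapping node name -> list of load samples (assumed layout).
    """
    samples = [s for series in cpu_load.values() for s in series]
    load_mean = mean(samples)
    # Assumption: hardware threads per node, scaled by the threshold factor.
    load_thres = (num_hwthreads / num_nodes) * oversubscription_threshold
    return load_mean > load_thres
```
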
A.4 Short job

Pattern description: Job is very short
Pattern detection: duration < threshold
Applicable situations: *
Message to user: Your job's duration is very short.
Pattern mitigation: Try to combine jobs. Implement job scheduling within one job.

"parameter":["duration_threshold"],
"rule_terms":[
{"short_job":"job.duration < duration_threshold"}
],
"output":"short_job"

A.5 Failing Chain-/Array-Jobs

Pattern description: Jobs fail in rapid succession
Pattern detection: job_state == failed && duration < threshold within some time. Needs state?
Applicable situations: *
Message to user: Many of your jobs fail.
Pattern mitigation: Fix error in job script and/or application setup.
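
Since this pattern spans several jobs, a detector has to keep state across jobs. One possible approach is a sliding window over the start times of short failed jobs, sketched below. The `Job` record and the parameters `window` and `min_failures` are hypothetical and not taken from the ClusterCockpit schema.

```python
from dataclasses import dataclass

@dataclass
class Job:
    start_time: int  # seconds since epoch
    duration: int    # seconds
    state: str       # e.g. "failed" or "completed"

def failing_in_succession(jobs, duration_threshold, window, min_failures):
    """Return True if at least min_failures short failed jobs start within `window` seconds."""
    starts = sorted(
        j.start_time for j in jobs
        if j.state == "failed" and j.duration < duration_threshold
    )
    left = 0
    for right, t in enumerate(starts):
        # Shrink the window until it spans at most `window` seconds.
        while t - starts[left] > window:
            left += 1
        if right - left + 1 >= min_failures:
            return True
    return False
```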

B - Resource Utilisation

Job classes related to low hardware utilization.

B.1 Idle job

Pattern description: Job runs but does not show any activity
Pattern detection: mean(flops_any) < alert(flops_any) && mean(mem_bw) < alert(mem_bw) && mean(net_bw) < alert(net_bw)
Applicable situations: Set of metrics to check has to be adapted for node exclusive or shared jobs. In case of shared jobs thresholds have to be scaled to the applicable value.
Message to user: Your job does not use any resources.
Pattern mitigation: Check application code or log for potential errors or deadlocks.
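
The detection condition can be sketched in Python as follows, assuming each metric is given as a list of samples and the alert thresholds are already scaled for the job. The function name `idle_job` and both dict layouts are hypothetical.

```python
from statistics import mean

def idle_job(metrics, alert, names=("flops_any", "mem_bw", "net_bw")):
    """Return True if the mean of every listed metric is below its alert threshold.

    metrics: dict metric name -> list of samples (assumed layout).
    alert:   dict metric name -> alert threshold, already scaled for the job.
    """
    return all(mean(metrics[name]) < alert[name] for name in names)
```

The `names` argument allows the set of checked metrics to be adapted for node exclusive or shared jobs.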

B.2 Low CPU resource utilisation

Pattern description: Job runs but has very low resource utilisation
Pattern detection: mean(flops_any) < caution(flops_any) && mean(mem_bw) < caution(mem_bw)
Applicable situations: Set of metrics to check has to be adapted for node exclusive or shared jobs. In case of shared jobs thresholds have to be scaled to the applicable value.
Message to user: Your job has a very low resource utilisation.
Pattern mitigation: Check compile options and/or configuration for better settings. Consider using a more efficient application. Contact support for help with optimising the code.

B.3 Low GPU resource utilisation

Pattern description: Job runs but has very low GPU utilisation
Pattern detection: mean(acc_utilization) < caution(acc_utilization)
Applicable situations: *
Message to user: Your job has a very low GPU utilisation.
Pattern mitigation: Check compile options and/or configuration for better settings. Consider using a more efficient application. Contact support for help with optimising the code.

B.4 Memory leak

Pattern description: Usage of memory capacity monotonically increases
Pattern detection: avgSlope(mem_used) >> 0
Applicable situations: *
Message to user: Your job might have a memory leak.
Pattern mitigation: Check application code, try a different MPI implementation.
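
One way to compute an average slope is a least-squares fit of mem_used over time, sketched below. This is an assumption about what avgSlope means; the function names and the parameter `slope_threshold` (standing in for ">> 0") are hypothetical. The fit needs at least two samples with distinct timestamps.

```python
def avg_slope(times, values):
    """Least-squares slope of values over times (units of value per time unit)."""
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    cov = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    var = sum((t - t_mean) ** 2 for t in times)
    return cov / var

def possible_memory_leak(times, mem_used, slope_threshold):
    """Return True if memory usage grows faster than slope_threshold."""
    return avg_slope(times, mem_used) > slope_threshold
```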

B.5 Exceed available memory capacity

Pattern description: Main memory usage exceeds the available or requested memory capacity
Pattern detection: max(mem_used) > alert(mem_used)
Applicable situations: Metrics have to be scaled for shared jobs. For shared jobs the (not yet available) mem_requested has to be used.
Message to user: Your job almost exceeds the available memory capacity.
Pattern mitigation: Check application code to reduce memory allocation, or use another system with more memory capacity.

B.6 Access temporary files on network filesystem

B.7 Small file IO on parallel filesystem

B.8 Excessive CPU Load

Pattern description: Job uses more threads / compute entities than allocated
Pattern detection: mean(cpu_load) > peak(cpu_load)*0.95 && threads > cores
Applicable situations: TODO
Message to user: Your job can use more execution units.
Pattern mitigation: Request more cores/processes in your job script.

C - Resource Contention

Job classes related to overloading cluster components.

C.1 Excessive File IO

Pattern description: Job is limited by file IO to a network file system
Pattern detection: mean(FS[*].read_bw) > critical(FS[*].read_bw)*0.95 && mean(FS[*].write_bw) > critical(FS[*].write_bw)
Applicable situations: *
Message to user: Your job is doing too much file IO.
Pattern mitigation: Change source code or app configuration to reduce file IO. Use local scratch file system if available.

C.2 Excessive File Metadata operations

Pattern description: Job does too many metadata operations
Pattern detection: mean(cpu_load) > peak(cpu_load)*0.95 && threads > cores
Applicable situations: File access on shared network file system
Message to user: Your job can use more execution units.
Pattern mitigation: Request more cores/processes in your job script.

C.3 Excessive Network IO

C.4 Multi-process GPU utilisation

D - Load Balancing

Job classes related to load balancing problems.

D.1 Load imbalance

Pattern description: Unequal distribution of compute load
Pattern detection: max(coreavg(cpu_load)) - min(coreavg(cpu_load)) > threshold || max(nodeavg(cpu_load)) - min(nodeavg(cpu_load)) > threshold
Rule:

"parameter":["balance_threshold"],
"rule_terms":[
{"max_core":"max(cpu_load, 'core')"},
{"min_core":"min(cpu_load, 'core')"},
{"core_balance":"(max_core - min_core) > balance_threshold"}
],
"output":"core_balance"

"parameter":["balance_threshold"],
"rule_terms":[
{"max_node":"max(cpu_load, 'node')"},
{"min_node":"min(cpu_load, 'node')"},
{"node_balance":"(max_node - min_node) > balance_threshold"}
],
"output":"node_balance"

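The same check applies at core and at node scope, so a single helper can cover both. The Python sketch below assumes cpu_load is given as one list of samples per unit (core or node); the function name `load_imbalance` and the data layout are hypothetical.

```python
from statistics import mean

def load_imbalance(cpu_load, balance_threshold):
    """Return True if the spread between the busiest and idlest unit exceeds the threshold.

    cpu_load: dict mapping unit id (core or node) -> list of samples (assumed layout).
    """
    # Average each unit over time first, then compare the extremes.
    averages = [mean(series) for series in cpu_load.values()]
    return (max(averages) - min(averages)) > balance_threshold
```
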
E - Abuse

Job classes related to unintended or forbidden usage of clusters.

E.1 Cryptomining

E.2 Network Backdoors

E.3 Excess Batch System use

E.4 Short Job

Pattern description: The job is too short to qualify for using a batch system.
Pattern detection: Peak(Runtime) < T_Short
Applicable situations: *
Message to user: TODO
Pattern mitigation: TODO
