Job Classification
Pathologic Job Classes
This is a list of job classification rules created as part of the NHR PathoJobs project. Its purpose is to automatically detect jobs that require attention or offer an optimisation opportunity. The rules use the metric names specified in the ClusterCockpit job data JSON schema.
Categories
A - Resource Allocation
Job classes related to faulty resource allocation and process placement.
A.1 Low CPU utilisation
- Pattern description: Job uses fewer resources than requested
- Pattern detection:
mean(cpu_load) < normal(cpu_load)
- Applicable situations: Node-exclusive jobs only, because cpu_load is a node-scope metric
- Message to user: Your job uses significantly fewer resources than requested.
- Pattern mitigation: Check job script for allocation and placement errors
- Rule:
"parameter": ["cpuload_threshold"],
"rule_terms": [
  {"load_mean": "cpu_load.mean('all')"},
  {"lowload": "load_mean < job.numHwthreads * cpuload_threshold"}
],
"output": "lowload"
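The rule for A.1 is a chain of named terms, each of which may refer to earlier terms, to parameters and to job data. The following Python sketch shows one way such a chain could be evaluated. It is not the ClusterCockpit implementation: the Metric class, the evaluate_rule function, the data layout (one list of samples per node) and the use of eval() are all assumptions made for illustration.

```python
from statistics import mean
from types import SimpleNamespace


class Metric:
    """Hypothetical wrapper around one metric's samples, one list per node."""

    def __init__(self, series_per_node):
        self.series_per_node = series_per_node

    def mean(self, scope):
        # Only the 'all' scope used by the A.1 rule is sketched here.
        if scope != "all":
            raise ValueError(f"unsupported scope: {scope}")
        return mean(s for node in self.series_per_node for s in node)


def evaluate_rule(rule, parameters, context):
    """Evaluate rule_terms top to bottom and return the value named by 'output'."""
    names = dict(context)
    names.update({p: parameters[p] for p in rule["parameter"]})
    for term in rule["rule_terms"]:
        for name, expression in term.items():
            # eval() keeps the sketch short; never use it on untrusted rules.
            names[name] = eval(expression, {"__builtins__": {}}, names)
    return names[rule["output"]]


rule = {
    "parameter": ["cpuload_threshold"],
    "rule_terms": [
        {"load_mean": "cpu_load.mean('all')"},
        {"lowload": "load_mean < job.numHwthreads * cpuload_threshold"},
    ],
    "output": "lowload",
}

context = {
    "cpu_load": Metric([[4.0, 6.0, 5.0]]),  # single node, three samples
    "job": SimpleNamespace(numHwthreads=64),
}
lowload = evaluate_rule(rule, {"cpuload_threshold": 0.5}, context)
print(lowload)  # prints True: mean load 5.0 is below 64 * 0.5
```

Evaluating the terms in order lets later terms reuse earlier results, which is why the rule names an explicit output term.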
A.2 Low GPU utilisation
- Pattern description: Job uses fewer GPUs than requested
- Pattern detection:
mean(acc_utilization) < alert(acc_utilization) on any of the requested GPUs
- Applicable situations: *
- Message to user: Your job does not use some of the requested GPUs.
- Pattern mitigation: Check job script for allocation and placement errors. Check application code.
- Rule:
"parameter": ["gpuload_threshold"],
"rule_terms": [
  {"load_mean": "acc_utilization.mean('all')"},
  {"load_thres": "load_mean < gpuload_threshold"},
  {"lowload": "load_thres.any('all')"}
],
"output": "lowload"
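The "any of the requested GPUs" condition can be illustrated with a short Python sketch. The function name low_gpu_load and the input layout (a mapping from GPU identifier to utilisation samples) are assumptions for this example and are not taken from ClusterCockpit.

```python
from statistics import mean


def low_gpu_load(acc_utilization, gpuload_threshold):
    """Return (flag, idle_gpus) for a job.

    acc_utilization maps a GPU identifier to its utilisation samples;
    this layout is an assumption made for the example.
    """
    idle_gpus = sorted(
        gpu for gpu, samples in acc_utilization.items()
        if mean(samples) < gpuload_threshold
    )
    # Mirrors load_thres.any('all'): one underused GPU is enough to flag the job.
    return bool(idle_gpus), idle_gpus
```

Returning the list of underused GPUs alongside the flag would allow the user message to name the affected devices.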
A.3 Resource oversubscription
- Pattern description: Job causes too high a CPU load
- Pattern detection:
avg(cpu_load) > peak(cpu_load)
- Applicable situations: The peak value needs to be scaled for shared jobs.
- Message to user: Job overloads available resources.
- Pattern mitigation: Check job script and parallelisation for errors.
- Rule:
"parameter": ["oversubscription_threshold"],
"rule_terms": [
  {"load_mean": "cpu_load.mean('all')"},
  {"load_thres": "(job.numHwthreads/job.numNodes) * oversubscription_threshold"},
  {"oversubscription": "load_mean > load_thres"}
],
"output": "oversubscription"
A.4 Short job
- Pattern description: Job is very short
- Pattern detection:
duration < threshold
- Applicable situations: *
- Message to user: Your job's duration is very short.
- Pattern mitigation: Try to combine jobs. Implement job scheduling within one job.
- Rule:
"parameter": ["duration_threshold"],
"rule_terms": [
  {"short_job": "job.duration < duration_threshold"}
],
"output": "short_job"
A.5 Failing Chain-/Array-Jobs
- Pattern description: Jobs fail in rapid succession
- Pattern detection:
job_state == failed && duration < threshold for several jobs within some time window. Open question: does this rule need state across jobs?
- Applicable situations: *
- Message to user: Many of your jobs fail.
- Pattern mitigation: Fix error in job script and/or application setup.
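Unlike the other rules, A.5 looks at several jobs at once. A minimal Python sketch of one possible approach is a sliding time window over a user's jobs. The function name, the tuple layout and the parameters window_seconds and min_failures are hypothetical and not part of the rule definition above.

```python
from collections import deque


def failing_chain(jobs, duration_threshold, window_seconds, min_failures):
    """Return True if enough short, failed jobs start within one time window.

    jobs: iterable of (start_time, duration, state) tuples sorted by start_time,
          with times in seconds (assumed layout).
    """
    recent = deque()  # start times of recent short failures
    for start, duration, state in jobs:
        if state == "failed" and duration < duration_threshold:
            recent.append(start)
            # Drop failures that fell out of the window.
            while start - recent[0] > window_seconds:
                recent.popleft()
            if len(recent) >= min_failures:
                return True
    return False
```

The deque is the state that the detection has to carry from one job to the next.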
B - Resource Utilisation
Job classes related to low hardware utilization.
B.1 Idle job
- Pattern description: Job runs but does not show any activity
- Pattern detection:
mean(flops_any) < alert(flops_any) && mean(mem_bw) < alert(mem_bw) && mean(net_bw) < alert(net_bw)
- Applicable situations: Set of metrics to check has to be adapted for node exclusive or shared jobs. In case of shared jobs thresholds have to be scaled to the applicable value.
- Message to user: Your job does not use any resources.
- Pattern mitigation: Check application code or log for potential errors or deadlocks.
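The idle-job check is a conjunction over several metrics, with thresholds that may need scaling for shared jobs. The Python sketch below illustrates this. The linear scaling by the job's share of the node is an assumption, as are the function name and the threshold values.

```python
from statistics import mean


def below_all_thresholds(samples, thresholds, share=1.0):
    """True if the mean of every listed metric is below its scaled threshold.

    samples:    metric name -> list of samples for the job
    thresholds: metric name -> node-level threshold (e.g. the alert value)
    share:      fraction of the node used by the job; 1.0 for node-exclusive jobs
    """
    return all(
        mean(samples[name]) < limit * share
        for name, limit in thresholds.items()
    )


alert = {"flops_any": 1.0, "mem_bw": 2.0, "net_bw": 0.5}  # made-up values
job = {"flops_any": [0.1, 0.2], "mem_bw": [0.5, 0.4], "net_bw": [0.0, 0.1]}
print(below_all_thresholds(job, alert))  # prints True
```

Passing the set of metrics through the thresholds mapping makes it straightforward to adapt the check, for example by dropping net_bw where it is not available at job scope.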
B.2 Low CPU resource utilisation
- Pattern description: Job runs but has very low resource utilisation
- Pattern detection:
mean(flops_any) < caution(flops_any) && mean(mem_bw) < caution(mem_bw)
- Applicable situations: Set of metrics to check has to be adapted for node exclusive or shared jobs. In case of shared jobs thresholds have to be scaled to the applicable value.
- Message to user: Your job has a very low resource utilisation.
- Pattern mitigation: Check compile options and/or configuration for better settings. Consider using a more efficient application. Contact support for help with optimising the code.
B.3 Low GPU resource utilisation
- Pattern description: Job runs but has very low GPU utilisation
- Pattern detection:
mean(acc_utilization) < caution(acc_utilization)
- Applicable situations: *
- Message to user: Your job has a very low GPU utilisation.
- Pattern mitigation: Check compile options and/or configuration for better settings. Consider using a more efficient application. Contact support for help with optimising the code.
B.4 Memory leak
- Pattern description: Usage of memory capacity monotonically increases
- Pattern detection:
avgSlope(mem_used) >> 0
- Applicable situations: *
- Message to user: Your job might have a memory leak.
- Pattern mitigation: Check the application code; try a different MPI implementation.
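The page does not define avgSlope. One plausible reading is the slope of a least-squares line fitted to mem_used over time, as in the Python sketch below. The function names and the explicit slope_threshold (standing in for ">> 0") are assumptions.

```python
def avg_slope(times, values):
    """Least-squares slope of values over times (needs at least two distinct times)."""
    n = len(times)
    if n < 2:
        raise ValueError("need at least two samples")
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    numerator = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    denominator = sum((t - t_mean) ** 2 for t in times)
    return numerator / denominator


def possible_memory_leak(times, mem_used, slope_threshold):
    """Flag the job if memory usage grows faster than slope_threshold per time unit."""
    return avg_slope(times, mem_used) > slope_threshold
```

A fitted slope is less sensitive to short spikes than comparing the first and last sample, but it also flags applications that legitimately grow their working set, so the message to the user stays tentative.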
B.5 Exceed available memory capacity
- Pattern description: Main memory usage exceeds the available or requested memory capacity
- Pattern detection:
max(mem_used) > alert(mem_used)
- Applicable situations: Metrics have to be scaled for shared jobs. For shared jobs the (not yet available) mem_requested has to be used.
- Message to user: Your job almost exceeds the available memory capacity.
- Pattern mitigation: Check the application code to reduce memory allocation, or use a system with more memory capacity.
B.6 Access temporary files on network filesystem
B.7 Small file IO on parallel filesystem
B.8 Excessive CPU Load
- Pattern description: Job uses more threads / compute entities than allocated
- Pattern detection:
mean(cpu_load) > peak(cpu_load)*0.95 && threads > cores
- Applicable situations: TODO
- Message to user: Your job can use more execution units.
- Pattern mitigation: Request more cores/processes in your job script.
C - Resource Contention
Job classes related to overloading cluster components.
C.1 Excessive File IO
- Pattern description: Job is limited by file IO to a network file system
- Pattern detection:
mean(FS[*].read_bw) > critical(FS[*].read_bw)*0.95 && mean(FS[*].write_bw) > critical(FS[*].write_bw)
- Applicable situations: *
- Message to user: Your job is doing too much file IO.
- Pattern mitigation: Change source code or app configuration to reduce file IO. Use local scratch file system if available.
C.2 Excessive File Metadata operations
- Pattern description: Job does too many metadata operations
- Pattern detection:
mean(cpu_load) > peak(cpu_load)*0.95 && threads > cores
- Applicable situations: File access on shared network file system
- Message to user: Your job can use more execution units.
- Pattern mitigation: Request more cores/processes in your job script.
C.3 Excessive Network IO
C.4 Multi-process GPU utilisation
D - Load Balancing
Job classes related to load balancing problems.
D.1 Load imbalance
- Pattern description: Unequal distribution of compute load
- Pattern detection:
max(coreavg(cpu_load)) - min(coreavg(cpu_load)) > threshold || max(nodeavg(cpu_load)) - min(nodeavg(cpu_load)) > threshold
- Rule:
Core scope:
"parameter": ["balance_threshold"],
"rule_terms": [
  {"max_core": "max(cpu_load, 'core')"},
  {"min_core": "min(cpu_load, 'core')"},
  {"core_balance": "(max_core - min_core) > balance_threshold"}
],
"output": "core_balance"
Node scope:
"parameter": ["balance_threshold"],
"rule_terms": [
  {"max_node": "max(cpu_load, 'node')"},
  {"min_node": "min(cpu_load, 'node')"},
  {"node_balance": "(max_node - min_node) > balance_threshold"}
],
"output": "node_balance"
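The imbalance check compares the most and least loaded unit (core or node) after averaging each unit's load over time. A minimal Python sketch, assuming the samples are available as a mapping from unit identifier to cpu_load samples (a hypothetical layout):

```python
from statistics import mean


def load_imbalance(cpu_load, balance_threshold):
    """True if the spread between the busiest and idlest unit exceeds the threshold.

    cpu_load maps a core or node identifier to its cpu_load samples.
    """
    averages = [mean(samples) for samples in cpu_load.values()]
    return (max(averages) - min(averages)) > balance_threshold
```

The same function serves both scopes: call it once with per-core data and once with per-node data, and combine the two results with a logical OR to match the pattern detection.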
E - Abuse
Job classes related to unintended or forbidden usage of clusters.
E.1 Cryptomining
E.2 Network Backdoors
E.3 Excess Batch System use
E.4 Short Job
- Pattern description: The job is too short to qualify for using a batch system.
- Pattern detection:
Peak(Runtime) < T_Short
- Applicable situations: *
- Message to user: TODO
- Pattern mitigation: TODO