Job Classification

From HPC Wiki


Pathologic Job Classes

Categories

A - Resource Allocation

A.1 Low CPU utilisation

  • Pattern description: Job uses fewer resources than requested
  • Pattern detection: mean(cpu_load) < normal(cpu_load)
  • Applicable situations: Node-exclusive jobs only, because cpu_load is a node-scope metric
  • Message to user: Your job uses significantly fewer resources than requested.
  • Pattern mitigation: Check job script for allocation and placement errors
"parameter":["cpuload_threshold"],
"rule_terms":[
{"load_mean":"cpu_load.mean('all')"},
{"lowload":"load_mean < job.numHwthreads * cpuload_threshold"}
],
"output":"lowload"

A.2 Low GPU utilisation

  • Pattern description: Job uses fewer GPUs than requested
  • Pattern detection: mean(acc_utilization) < alert(acc_utilization) on any of the requested GPUs
  • Applicable situations: *
  • Message to user: Your job does not use some of the requested GPUs.
  • Pattern mitigation: Check job script for allocation and placement errors. Check application code.
"parameter":["gpuload_threshold"],                                                    
"rule_terms":[
{"load_mean":"acc_utilization.mean('all')"},
{"load_thres":"load_mean < gpuload_threshold"},
{"lowload":"load_thres.any('all')"}
],
"output":"lowload"

A.3 Resource oversubscription

  • Pattern description: Job has too high a CPU load
  • Pattern detection: mean(cpu_load) > peak(cpu_load)
  • Applicable situations: Peak value needs to be scaled for shared jobs.
  • Message to user: Job overloads available resources.
  • Pattern mitigation: Check job script and parallelisation for errors.
"parameter":["oversubscription_threshold"],
"rule_terms":[
{"load_mean":"cpu_load.mean('all')"},
{"load_thres":"(job.numHwthreadsjob.numNodes) * oversubscription_threshold"},
{"oversubscription":"load_mean > load_thres"}
],
"output":"oversubscription"

A.4 Short job

  • Pattern description: Job is very short
  • Pattern detection: duration < threshold
  • Applicable situations: *
  • Message to user: Your job's duration is very short.
  • Pattern mitigation: Try to combine jobs. Implement job scheduling within one job.
"parameter":["duration_threshold"],
"rule_terms":[
{"short_job":"job.duration < duration_threshold"}
],
"output":"short_job"

A.5 Failing Chain-/Array-Jobs

  • Pattern description: Jobs fail in rapid succession
  • Pattern detection: job_state == failed && duration < threshold for several jobs within a given time window. Open question: does detection need state kept across jobs?
  • Applicable situations: *
  • Message to user: Many of your jobs fail.
  • Pattern mitigation: Fix error in job script and/or application setup.
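
Unlike the other patterns, this one looks at several jobs at once, so some state across jobs is needed. One possible way to keep that state is a sliding time window over short failed jobs, sketched below in Python. The tuple layout, the function name `failing_in_succession`, and all default values are assumptions for illustration, not settings taken from this page.

```python
from collections import deque


def failing_in_succession(jobs, duration_threshold=60, window=3600, min_failures=3):
    """Return True if at least min_failures short failed jobs end within one window.

    jobs: iterable of (end_time, state, duration) tuples in seconds, sorted by
    end_time (assumed layout). All defaults are hypothetical.
    """
    recent = deque()  # end times of short failed jobs inside the current window
    for end_time, state, duration in jobs:
        if state == "failed" and duration < duration_threshold:
            recent.append(end_time)
            # Drop failures that are older than the window.
            while recent and end_time - recent[0] > window:
                recent.popleft()
            if len(recent) >= min_failures:
                return True
    return False


if __name__ == "__main__":
    history = [(100, "failed", 5), (200, "failed", 4), (300, "failed", 6)]
    print(failing_in_succession(history))
```

The deque only holds failures inside the current window, so memory use stays bounded even for long job histories.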

B - Resource Utilisation

B.1 Idle job

  • Pattern description: Job runs but does not show any activity
  • Pattern detection: mean(flops_any) < alert(flops_any) && mean(mem_bw) < alert(mem_bw) && mean(net_bw) < alert(net_bw)
  • Applicable situations: Set of metrics to check has to be adapted for node exclusive or shared jobs. In case of shared jobs thresholds have to be scaled to the applicable value.
  • Message to user: Your job does not use any resources.
  • Pattern mitigation: Check application code or log for potential errors or deadlocks.
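
The detection expression is a conjunction: every checked metric has to stay below its alert threshold. A minimal Python sketch follows, with the set of metrics passed in so that it can be adapted for node-exclusive or shared jobs. The function name `is_idle` and the dictionary layouts are assumptions, and the threshold values in the example are made up.

```python
from statistics import fmean


def is_idle(metrics, alerts, names=("flops_any", "mem_bw", "net_bw")):
    """Return True if the mean of every listed metric is below its alert threshold.

    metrics: dict mapping a metric name to its samples (assumed layout).
    alerts: dict mapping a metric name to its alert threshold, already scaled
    for shared jobs where applicable.
    """
    return all(fmean(metrics[name]) < alerts[name] for name in names)


if __name__ == "__main__":
    samples = {"flops_any": [0.1, 0.0], "mem_bw": [0.5, 0.4], "net_bw": [0.0, 0.0]}
    thresholds = {"flops_any": 1.0, "mem_bw": 2.0, "net_bw": 0.1}
    print(is_idle(samples, thresholds))
```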

B.2 Low CPU resource utilisation

  • Pattern description: Job runs but has very low resource utilisation
  • Pattern detection: mean(flops_any) < caution(flops_any) && mean(mem_bw) < caution(mem_bw)
  • Applicable situations: Set of metrics to check has to be adapted for node exclusive or shared jobs. In case of shared jobs thresholds have to be scaled to the applicable value.
  • Message to user: Your job has a very low resource utilisation.
  • Pattern mitigation: Check compile options and/or configuration for better settings. Consider using a more efficient application. Contact support for help with optimising the code.

B.3 Low GPU resource utilisation

  • Pattern description: Job runs but has very low GPU utilisation
  • Pattern detection: mean(acc_utilization) < caution(acc_utilization)
  • Applicable situations: *
  • Message to user: Your job has a very low GPU utilisation.
  • Pattern mitigation: Check compile options and/or configuration for better settings. Consider using a more efficient application. Contact support for help with optimising the code.

B.4 Memory leak

  • Pattern description: Usage of memory capacity monotonically increases
  • Pattern detection: avgSlope(mem_used) >> 0
  • Applicable situations: *
  • Message to user: Your job might have a memory leak.
  • Pattern mitigation: Check application code, try a different MPI implementation.
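
One way to turn "avgSlope(mem_used) >> 0" into a concrete check is a least-squares slope over the mem_used time series, compared against a positive threshold. The Python sketch below takes that approach. It is an assumption about how avgSlope could be computed, and the function names and the `slope_threshold` parameter are hypothetical.

```python
def avg_slope(times, values):
    """Least-squares slope of values over times (needs at least two distinct times)."""
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    cov = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    var = sum((t - t_mean) ** 2 for t in times)
    return cov / var


def looks_like_leak(times, mem_used, slope_threshold):
    """Return True if memory usage grows faster than slope_threshold per time unit."""
    return avg_slope(times, mem_used) > slope_threshold


if __name__ == "__main__":
    # Memory grows by 10 units every 60 seconds.
    print(looks_like_leak([0, 60, 120, 180], [10, 20, 30, 40], slope_threshold=0.05))
```

A fitted slope is less sensitive to short spikes than a comparison of the first and last sample, but it does not by itself prove that the growth is monotonic.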

B.5 Exceed available memory capacity

  • Pattern description: Main memory usage exceeds the available or requested memory capacity
  • Pattern detection: max(mem_used) > alert(mem_used)
  • Applicable situations: Metrics have to be scaled for shared jobs. For shared jobs the (not yet available) mem_requested has to be used.
  • Message to user: Your job almost exceeds the available memory capacity.
  • Pattern mitigation: Check application code to reduce memory allocation, or use a different system with more memory capacity.

B.6 Access temporary files on network filesystem

B.7 Small file IO on parallel filesystem

B.8 Excessive CPU Load

  • Pattern description: Job uses more threads / compute entities than allocated
  • Pattern detection: mean(cpu_load) > peak(cpu_load)*0.95 && threads > cores
  • Applicable situations: TODO
  • Message to user: Your job can use more execution units.
  • Pattern mitigation: Request more cores/processes in your job script.

C - Resource Contention

C.1 Excessive File IO

C.2 Excessive Network IO

C.3 Multi-process GPU utilisation

D - Load Balancing

D.1 Load imbalance

  • Pattern description: Unequal distribution of compute load
  • Pattern detection: max(coreavg(cpu_load)) - min(coreavg(cpu_load)) > threshold || max(nodeavg(cpu_load)) - min(nodeavg(cpu_load)) > threshold
  • Rule:
"parameter": ["balance_threshold"]
"rule_terms":[
{"max_core": "max(cpu_load, 'core')"}
{"min_core": "min(cpu_load, 'core')"}
{"core_balance": "(max_core - min_core) > balance_threshold"}]
"output": "core_balance"

"parameter": ["balance_threshold"]
"rule_terms":[
{"max_node": "max(cpu_load, 'node')"}
{"min_node": "min(cpu_load, 'node')"}
{"node_balance": "(max_node - min_node) > balance_threshold"}]
"output": "node_balance"

E - Abuse

E.1 Cryptomining

E.2 Network Backdoors

E.3 Excess Batch System use

E.4 Short job (TuDA)

  • Pattern description: The job is too short to qualify for using a batch system.
  • Pattern detection: Peak(Runtime) < T_Short
  • Applicable situations: *
  • Message to user: TODO
  • Pattern mitigation: TODO