Torque

From HPC Wiki
Latest revision as of 10:30, 5 September 2019

General

Torque is a job scheduler. For an overview of what a scheduler does, see the Scheduler page or Scheduling Basics.

Job Submission

This command submits the job you defined in your jobscript to the batch system:

$ qsub jobscript.sh

Like any other incoming job, your job will first be queued. The scheduler then decides when it will run. The more resources your job requests, the longer it may have to wait before it starts.

You can check the current status of your submitted jobs and their job IDs with the following shell command. The most common job states are queued Q (the job waits for free nodes), running R (the jobscript is currently being executed) and on hold H (the job is stopped, but does not wait for resources). The command also shows the time elapsed since your job started running and the time limit.

$ qstat -u <user_id>
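The state letters shown by qstat can be translated back into the descriptions above. The following is only a minimal sketch: the function name describe_state is an illustrative assumption, and it covers just the three states named here.

```shell
# Hypothetical helper: map a Torque job state letter to the description
# used in the text above. Only Q, R and H are covered explicitly.
describe_state() {
    case "$1" in
        Q) echo "queued (waiting for free nodes)" ;;
        R) echo "running (jobscript is being executed)" ;;
        H) echo "on hold (stopped, not waiting for resources)" ;;
        *) echo "other state: $1" ;;
    esac
}

describe_state Q
describe_state R
```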

If you submitted a job by accident or realised that it is not running correctly, you can remove it from the queue, or terminate it if it is already running, by typing:

$ qdel <job_id>

#PBS Usage

In a jobscript for a Torque batch system, options for the scheduler are marked by the magic cookie "#PBS". Start a new line in your script with "#PBS", followed by one of the parameters shown below. Replace each word written in <...> with an actual value.
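To check which options a jobscript passes to the batch system, you can list its "#PBS" lines with standard tools. This is a minimal sketch, assuming a hypothetical helper name list_pbs_options and a throwaway demo script written to a temporary file:

```shell
# Hypothetical helper: print the options of every "#PBS" line in a script.
list_pbs_options() {
    # Keep lines that start with "#PBS" followed by whitespace, then drop the prefix.
    grep '^#PBS[[:space:]]' "$1" | sed 's/^#PBS[[:space:]]*//'
}

# Demo: write a small jobscript to a temporary file and inspect it.
demo_script=$(mktemp)
cat > "$demo_script" <<'EOF'
#!/usr/bin/bash
### Job name
#PBS -N MYJOB
#PBS -l mem=1000MB
echo "hello"
EOF

list_pbs_options "$demo_script"
rm -f "$demo_script"
```

Lines starting with "###" are ordinary comments and are not picked up, because only the exact prefix "#PBS" is matched.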

Basic settings:

{| class="wikitable" style="width: 40%;"
| Parameter || Function
|-
| -N <name> || job name
|-
| -o <path> || file to write stdout to
|-
| -e <path> || file to write error output to
|-
| -j oe || both the output and error log will be written to the same log file called <job_name>.o<job_id>
|}

Requesting resources:

{| class="wikitable" style="width: 60%;"
| Parameter || Function
|-
| -l walltime=<total_time_limit> || wall-clock time limit for the running job in the format hours:minutes:seconds (time spent waiting in the queue does not count); once the limit is reached, the job is killed by the scheduler
|-
| -l cput=<runlimit> || maximum CPU time, summed over all processes of the job; same format as above
|-
| -l mem=<memlimit> || memory limit for the job as an integer followed by a unit, e.g. 400MB
|-
| -l nodes=1:ppn=1 || ask for a single processor for a sequential application
|}
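Time limits are given as hours:minutes:seconds. If you know the duration in seconds, a small helper can convert it into that form. The function name to_hms is an illustrative assumption and not part of Torque.

```shell
# Hypothetical helper: convert seconds to the hours:minutes:seconds form
# used by "-l walltime=" and "-l cput=".
to_hms() {
    total=$1
    printf '%02d:%02d:%02d\n' \
        $(( total / 3600 )) $(( (total % 3600) / 60 )) $(( total % 60 ))
}

to_hms 4800   # 1 h 20 min
to_hms 2700   # 45 min
```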

Email notifications:

{| class="wikitable" style="width: 60%;"
| Parameter || Function
|-
| -M <address> || set email address
|-
| -m a || receive a mail if your job gets aborted
|-
| -m b || get notified when your job starts running
|-
| -m e || receive a mail when your job has finished
|-
| -m abe || enable all mail options above
|}

Parallel programming (read more here):

{| class="wikitable" style="width: 60%;"
| Parameter || Function
|-
| -l nodes=1:ppn=<threads> || specify the number of threads to use for an OpenMP application; set OMP_NUM_THREADS accordingly
|-
| -l nodes=<num_nodes>:ppn=<num_cores> || specify the number of processes to start (one per core), which is num_nodes*num_cores; check your system's architecture and do not request more nodes or cores than are available on the machine
|}
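The number of processes follows directly from the resource request. The sketch below builds the nodes/ppn string and the matching process count. Both function names are hypothetical and only illustrate the arithmetic.

```shell
# Hypothetical helpers for a "-l nodes=<num_nodes>:ppn=<num_cores>" request.
nodes_request() {
    # $1 = number of nodes, $2 = cores per node
    echo "nodes=$1:ppn=$2"
}

total_processes() {
    # One process per core: num_nodes * num_cores
    echo $(( $1 * $2 ))
}

echo "#PBS -l $(nodes_request 2 4)"
echo "mpiexec -np $(total_processes 2 4) ./myapp.exe"
```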

For hybrid programs (MPI combined with OpenMP) that use the MVAPICH2 MPI library, make sure to disable its processor affinity by adding export MV2_ENABLE_AFFINITY=0 to your script. Otherwise, all threads of an MPI process may be pinned to the same core.

Array and Chain Jobs

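A chain job only starts once another job has finished. To make a job depend on an earlier one, submit it with:

$ qsub -W depend=afterok:<Job-ID> <SCRIPT>

This requires the job <Job-ID> to terminate without errors before the submitted script may run. The available conditions for chain jobs are: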

{| class="wikitable" style="width: 60%;"
| Condition || Function
|-
| afterok:<jobID> || job can start once job <jobID> has terminated without errors
|-
| afternotok:<jobID> || job can start once job <jobID> has terminated with errors
|-
| afterany:<jobID> || job can start once job <jobID> has terminated
|}
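Chains are typically built by capturing the job ID that qsub prints on submission and passing it to the next submission. The sketch below only assembles the -W argument, so it can be tried without a batch system. The helper depend_option, the job ID 12345 and the script names in the comments are hypothetical.

```shell
# Hypothetical helper: build the "-W depend=..." argument for a chain job.
# $1 = condition (afterok, afternotok or afterany), $2 = job ID to wait for
depend_option() {
    case "$1" in
        afterok|afternotok|afterany)
            echo "-W depend=$1:$2" ;;
        *)
            echo "unknown condition: $1" >&2
            return 1 ;;
    esac
}

# Intended use on a Torque system (not executed here):
#   first_id=$(qsub step1.sh)
#   qsub $(depend_option afterok "$first_id") step2.sh
depend_option afterok 12345
```

Rejecting unknown conditions early avoids submitting a job whose dependency can never be satisfied because of a typo.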

Jobscript Examples

This serial job will run a given executable, in this case "myapp.exe".

#!/usr/bin/bash

### Job name
#PBS -N MYJOB

### File where the output should be written
#PBS -o MYJOB_OUTPUT.txt

### Time your job needs to execute, e. g. 1 h 20 min
#PBS -l cput=1:20:00

### Memory your job needs, e. g. 1000 MB 
#PBS -l mem=1000MB

### The last part consists of regular shell commands:
### Change to working directory
cd /home/user/mywork

### Execute your application
./myapp.exe

This job runs the executable "myapp.exe" that has been parallelized with OpenMP using 12 threads.

#!/usr/bin/bash

### Job name
#PBS -N MY_OMP_JOB

### Redirect stdout and stderr to the same file
#PBS -j oe

### Wall-clock time limit for the running job, e. g. 45 min
#PBS -l walltime=00:45:00

### Request 1 node and number of threads, e. g. 12
#PBS -l nodes=1:ppn=12

### The last part consists of regular shell commands:
### Change to working directory
cd /home/user/mywork

### Execute your application and set OMP_NUM_THREADS for the application run
OMP_NUM_THREADS=12 ./myapp.exe
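Instead of repeating the thread count, the jobscript can derive it from the resource request. Recent Torque versions export PBS_NUM_PPN (the ppn value) inside a job. Treat its availability on your system as an assumption and check the local documentation. The helper name is hypothetical, and the fallback to 1 keeps the sketch usable outside a batch job.

```shell
# Hypothetical helper: use the job's ppn value as the OpenMP thread count.
# PBS_NUM_PPN is assumed to be set by Torque inside a job; default to 1 otherwise.
omp_threads_from_ppn() {
    echo "${PBS_NUM_PPN:-1}"
}

OMP_NUM_THREADS=$(omp_threads_from_ppn)
export OMP_NUM_THREADS
echo "Running with $OMP_NUM_THREADS OpenMP thread(s)"
```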

This MPI job will start 8 processes of "myapp.exe" on 2 nodes.

#!/usr/bin/bash

### Job name
#PBS -N MY_MPI_JOB

### Output file
#PBS -o mpi_job_output.txt

### CPU time limit (summed over all processes), e. g. 30 min
#PBS -l cput=0:30:00

### Request two nodes and 4 processes each
#PBS -l nodes=2:ppn=4

### The last part consists of regular shell commands:
### Change to working directory
cd /home/user/mywork

### Execute your application and set the "-np" option for the application run
mpiexec -np 8 ./myapp.exe
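Inside a job, Torque lists the allocated cores in the file named by PBS_NODEFILE, one line per core, so the value for -np can be computed rather than hard-coded. This is a minimal sketch. The helper names and the node names in the demo file are hypothetical.

```shell
# Hypothetical helpers that read a node file in the PBS_NODEFILE format
# (one host name per allocated core).
count_processes() {
    # Number of lines = number of allocated cores
    wc -l < "$1" | tr -d ' '
}

count_nodes() {
    # Number of distinct host names
    sort -u "$1" | wc -l | tr -d ' '
}

# Demo with a stand-in file for nodes=2:ppn=4.
# Inside a real job, pass "$PBS_NODEFILE" instead.
nodefile=$(mktemp)
for host in node01 node02; do
    for core in 1 2 3 4; do echo "$host"; done
done > "$nodefile"

echo "mpiexec -np $(count_processes "$nodefile") ./myapp.exe"
echo "nodes: $(count_nodes "$nodefile")"
rm -f "$nodefile"
```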

References

Overview of how to write a jobscript for Torque

Job submission on Torque

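Guide to the Torque scheduler (http://www.arc.ox.ac.uk/content/torque-job-scheduler)

More jobscript examples and tips (https://www.osc.edu/supercomputing/batch-processing-at-osc/job-scripts)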