Batch-Scheduler

From HPC Wiki
Revision as of 14:19, 3 September 2019 by Daniel-schurhoff-de23@rwth-aachen.de

This page gives an overview of how to use a batch scheduler and which pitfalls may exist. A more general description of why batch schedulers are needed can be found here. Several schedulers are in use, e.g. SLURM, LSF and Torque; click here to figure out which one you need.

Usage

If you want to execute a program through the batch system, you need to submit a jobscript tailored to the scheduler in use. This script tells the scheduler:

  • How many resources your program needs (e.g. time and memory)
  • Which parallelization you are using for your program

While the specifics of how to provide this information depend on the scheduler in use, some general rules apply to most of them. How to apply these rules should be answered in the example scripts (or referenced there). Not all pitfalls apply to all batch systems.
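As a concrete illustration, a minimal jobscript for SLURM (one of the schedulers mentioned above) might look as follows. This is only a sketch: the directive values and the program name `./my_program` are placeholders, and other schedulers (LSF, Torque) use a different directive syntax.

```shell
#!/bin/bash
#SBATCH --job-name=example        # a name for the job
#SBATCH --time=01:00:00           # requested wall-clock time (hh:mm:ss)
#SBATCH --mem-per-cpu=2G          # requested memory per core
#SBATCH --output=example.%j.log   # file for stdout/stderr (%j = job id)

./my_program                      # the program to run (placeholder)
```

Such a script would be submitted with `sbatch jobscript.sh`; the `#SBATCH` lines are comments to the shell but are read by the scheduler as resource requests.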

Pitfalls

There are some general problems one needs to keep in mind:

  • If you request more resources than the hardware can offer, the scheduler might not reject the job at submission; the job will then be stuck in the queue forever.
  • Be careful about whether the memory limit is per process or in total.
  • The scheduler might not support pinning, so you may want to do this manually.
  • There might be per-user quotas for the usage of the cluster.

Serial Jobs

Serial jobs execute programs that do not use any kind of parallelism (i.e., they use only one core at a time). Thus, you typically only need to specify the time and memory resources your job needs. However, some batch systems allow both exclusive and non-exclusive usage of nodes. Make sure you do not block a whole node for a program that needs just one core!
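A serial job under SLURM could be sketched as follows (directive values and the program name are placeholders). Note the absence of an exclusive flag, so the node can be shared with other jobs:

```shell
#!/bin/bash
#SBATCH --ntasks=1          # a serial job is a single task ...
#SBATCH --cpus-per-task=1   # ... running on a single core
#SBATCH --time=02:00:00     # requested wall-clock time
#SBATCH --mem-per-cpu=4G    # requested memory
# no exclusive flag: other jobs may use the remaining cores of the node

./my_serial_program         # placeholder for the actual program
```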

Shared Memory Jobs

This category of parallelization includes OpenMP as well as language-built-in threading. In hardware terms, shared-memory parallelization means using multiple cores on the same node (which therefore share their memory). You thus need to tell the scheduler that the requested cores must actually be on the same node. In addition, keep the number of threads spawned in sync with the number of cores you requested; with OpenMP, this can be done by setting OMP_NUM_THREADS. Furthermore, the scheduler may not support pinning, in which case you may want to set OMP_PLACES and OMP_PROC_BIND (see here).
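The points above can be sketched in a SLURM jobscript for an OpenMP program (core count and program name are placeholders; SLURM exports the requested core count as SLURM_CPUS_PER_TASK, which keeps the thread count in sync automatically):

```shell
#!/bin/bash
#SBATCH --nodes=1               # all cores must be on the same node
#SBATCH --ntasks=1              # one process ...
#SBATCH --cpus-per-task=8       # ... with 8 cores for its threads
#SBATCH --time=01:00:00

# spawn exactly as many threads as cores were requested
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# pin the threads manually in case the scheduler does not
export OMP_PLACES=cores
export OMP_PROC_BIND=close

./my_openmp_program             # placeholder for the actual program
```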

It is important to note that a typical cluster node has multiple sockets (see NUMA). For optimal performance, avoid unnecessary memory traffic across the socket interconnect: threads that work on a certain part of the data should also initialise that data. Due to the first-touch policy, the memory is then allocated close to the cores the initialising threads are pinned to.

Distributed Memory Jobs

Distributed-memory parallelization is usually done via MPI, since it handles the correct start-up of the program across nodes. Again, make sure the MPI library and the resource requests match, i.e. that you have loaded the correct module (usually Open MPI or Intel MPI, but never both at once). If you want to know how to perform binding with MPI, see this page. Depending on the scheduler, it might be possible to request nodes in the same chassis or rack in order to reduce network latency.
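A distributed-memory job under SLURM could be sketched as follows; the node and task counts, the module name `openmpi`, and the program name are placeholders and depend on the cluster and on the library the program was built with:

```shell
#!/bin/bash
#SBATCH --nodes=4               # number of nodes
#SBATCH --ntasks-per-node=24    # MPI ranks per node: match the cores per node
#SBATCH --time=04:00:00

module load openmpi             # must match the MPI library the program was built with
srun ./my_mpi_program           # srun starts one MPI rank per requested task
```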

Hybrid Jobs

Hybrid parallelization means running a job across several nodes (e.g. using MPI) while using shared-memory parallelization (e.g. OpenMP) within each node. You therefore need to specify at least the number of nodes as well as that you want to use more than one core per node. Distributing the job across the nodes is usually handled by the scheduler; the parallelization within each node typically is not, and is controlled instead by setting the OpenMP environment variables. For running hybrid jobs, you basically need to pay attention to all points mentioned for shared-memory jobs as well as for distributed-memory jobs. Moreover, pay attention to the multi-threading support of your MPI library: calling MPI from multiple threads may not be fully supported, or only at your own risk.
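Combining the two previous sketches, a hybrid MPI+OpenMP jobscript for SLURM might look like this (all counts, the module name, and the program name are placeholders):

```shell
#!/bin/bash
#SBATCH --nodes=2               # number of nodes
#SBATCH --ntasks-per-node=2     # MPI ranks per node (e.g. one per socket)
#SBATCH --cpus-per-task=12      # cores (OpenMP threads) per rank
#SBATCH --time=04:00:00

module load openmpi             # must match the MPI library used at build time

# per-node parallelism is handled via the OpenMP environment, not the scheduler
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PLACES=cores
export OMP_PROC_BIND=close

srun ./my_hybrid_program        # one rank per task; each rank spawns its own threads
```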

Array and Chain Jobs

It is highly recommended to divide long-running computations (e.g. several days) into smaller parts, since there is usually no fault recovery: if the application crashes or a hardware failure occurs, all intermediate results are lost. Splitting may also reduce the pending time, as most schedulers give shorter jobs a higher priority. All parts can be submitted together as so-called array or chain jobs. The two differ in the order in which the parts are executed: the parts of an array job run in no guaranteed order, while the parts of a chain job run one after another in an individually specified order. Details can be found with the individual schedulers.
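With SLURM, for example, an array job can be requested with a single additional directive; each part receives its index in SLURM_ARRAY_TASK_ID (the range and the program and input names are placeholders):

```shell
#!/bin/bash
#SBATCH --array=1-10            # ten parts; execution order is not guaranteed
#SBATCH --time=01:00:00

# each part processes its own input file, selected by the array index
./my_program input.$SLURM_ARRAY_TASK_ID
```

A chain job, by contrast, is typically built from job dependencies, e.g. `sbatch --dependency=afterok:<jobid> part2.sh`, so that each part starts only after the previous one has finished successfully.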

Advanced Usage

Apart from the aforementioned types of jobs, the scheduler might offer even more types:

  • Jobs across multiple nodes (distributed or hybrid jobs) can also be parallelized without MPI; this is beyond the scope of this page.
  • Jobs running for several days should be split into smaller packages. Among the advantages are reduced queueing times and higher robustness (e.g. against node failure). The splitting can be done either by submitting the parts manually or by using chain jobs.
  • Sometimes it may be necessary to run the same program with different arguments (e.g. for a hyperparameter search). In this case an array job may be used.