Snapshotting Jobs

Snapshotting jobs (also known as checkpoint/restart) is a way to save the current state of a running job and continue its execution at a later point in time. This is useful for working around the time limits that clusters commonly impose, or for bridging maintenance periods during which a cluster is not available. It can also be used to create regular snapshots of very long jobs so that the current progress is not lost if something goes wrong. Some programs have this functionality built into their code. Check whether yours does, because this is usually a much safer way to create snapshots. If it does not, keep on reading.

DMTCP

DMTCP (Distributed MultiThreaded CheckPointing) is a tool to create snapshots of currently running jobs. This is achieved by saving the current memory state to a checkpoint file. This file can then be used to continue the job.

Current limitations

At present, DMTCP works reliably only for non-MPI codes, i.e. multi-threaded codes using OpenMP, pthreads, etc., even though MPI codes are officially supported. There is an ongoing effort to re-implement the support for MPI codes in a project called MANA, which is in active development. At this point, however, the MANA code is highly tailored to usage at NERSC. It is planned to be merged into the official DMTCP code base at a later point.

Installation

If DMTCP is not available at your site, it should be installed by an administrator. One viable option is EasyBuild, for which an easyconfig file for DMTCP is available.

Manual usage

Here we assume that you have direct access to a compute node via SSH (e.g. within an interactive job session) and that all DMTCP commands are available in your environment.

1. Navigate to the location where you want to run your program and launch the DMTCP coordinator:

dmtcp_coordinator

2. In a separate shell (on the same node) launch your application:

dmtcp_launch ./a.out

3. To create a new checkpoint, return to the shell where you started the coordinator. Type c and press [RETURN].

4.1 Restart: Creating a checkpoint causes the dmtcp_coordinator to write a checkpoint file (file type: .dmtcp) for each client process. If all processes ran on the same host and there were no .dmtcp files prior to this checkpoint, you can run:

dmtcp_restart ckpt_*.dmtcp

4.2 Alongside the checkpoint files, DMTCP also writes a restart script. This script checks whether a resource manager (e.g. SLURM, PBS) was used and sets the environment variable RES_MANAGER accordingly, for example when you are running in an interactive SLURM session. You can try to use this script to restart your program, but you might have to adjust it to your needs:

./dmtcp_restart_script.sh
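
Because the restart command in step 4.1 uses the glob ckpt_*.dmtcp, checkpoint files left over from an earlier run in the same directory would be picked up as well. The following is a minimal sketch of one way to avoid this by moving old files aside before starting a new run. The function name archive_old_checkpoints and the old_ckpt_<timestamp> directory name are hypothetical and not part of DMTCP.

```shell
#!/usr/bin/env bash
# Sketch only: archive_old_checkpoints is a hypothetical helper, not a DMTCP tool.
# It moves existing ckpt_*.dmtcp files from a directory (default: current one)
# into a time-stamped subdirectory and prints how many files were moved.
archive_old_checkpoints() {
    local dir="${1:-.}"
    local backup f
    local moved=0
    backup="$dir/old_ckpt_$(date +%Y%m%d_%H%M%S)"
    for f in "$dir"/ckpt_*.dmtcp; do
        # If nothing matches, the pattern stays literal; skip that case.
        [ -e "$f" ] || continue
        mkdir -p "$backup"
        mv -- "$f" "$backup/"
        moved=$((moved + 1))
    done
    echo "$moved"
}
```

Calling archive_old_checkpoints . before dmtcp_launch ./a.out leaves only the checkpoint files of the new run in the working directory.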

Automatically creating checkpoints

You can also instruct the DMTCP coordinator to periodically create checkpoints. Use

dmtcp_coordinator -i 300 &

to start the coordinator in the background and have it create a new checkpoint every 300 seconds (5 min). Then start your code with

dmtcp_launch ./a.out

Restarting from a checkpoint file works the same way as before:

dmtcp_restart ckpt_*.dmtcp
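
For use inside scripts, the coordinator start and the application launch can be wrapped in one small helper. This is a sketch under assumptions: run_with_checkpoints is a hypothetical name, and instead of the trailing & it uses the coordinator options --daemon and --exit-on-last (detach into the background, quit once the last process has disconnected), which should be checked against dmtcp_coordinator --help for your installed version.

```shell
#!/usr/bin/env bash
# Sketch only: run_with_checkpoints is a hypothetical helper, not part of DMTCP.
# Usage: run_with_checkpoints INTERVAL_SECONDS COMMAND [ARGS...]
run_with_checkpoints() {
    local interval="${1:-}"
    # Accept only a whole number of seconds (digits only).
    case "$interval" in
        ''|*[!0-9]*)
            echo "interval must be a positive integer (seconds)" >&2
            return 1
            ;;
    esac
    if [ "$interval" -le 0 ]; then
        echo "interval must be greater than zero" >&2
        return 1
    fi
    if [ "$#" -lt 2 ]; then
        echo "usage: run_with_checkpoints INTERVAL_SECONDS COMMAND [ARGS...]" >&2
        return 1
    fi
    shift
    # Assumed options: --daemon detaches the coordinator, --exit-on-last makes
    # it quit after the last process disconnects. Verify with --help.
    dmtcp_coordinator --daemon --exit-on-last -i "$interval" || return 1
    # Start the application under checkpoint control.
    dmtcp_launch "$@"
}
```

For example, run_with_checkpoints 300 ./a.out corresponds to the two commands shown above with a 5-minute interval.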

Integration into SLURM

DMTCP provides a plugin for use with the SLURM job scheduler. It enables users to submit jobs through SLURM, write out checkpoint files automatically, and restart those jobs from a previously taken checkpoint. The plugin is compiled automatically when using the EasyBuild installation. To use the plugin, the developers provide two SLURM submission scripts, one to start a job and one to restart it:

Launching a new job

The slurm_launch.job script is used to submit a new job with snapshotting enabled. The frequency of the snapshots can be changed in the script via

start_coordinator -i 120  # in seconds

The executable to be launched is set in the same script:

dmtcp_launch --rm ./a.out

In addition, the requested resources have to be adjusted via the usual #SBATCH directives. After adjusting the script to your needs, you can submit it via

sbatch slurm_launch.job

Restarting a job

To restart a job that was previously submitted via the launch script, simply submit the restart script via sbatch. Here, you only have to adjust the #SBATCH directives and the DMTCP coordinator settings. The script then uses the generated dmtcp_restart_script.sh to restart from the last checkpoint file:

sbatch slurm_rstr.job
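
To get past a time limit without manual intervention, the launch job and a number of restart jobs can be queued as a chain. The sketch below relies on the standard sbatch options --parsable and --dependency=afterany. The function submit_chain is a hypothetical helper, and it assumes that both job scripts are in the current directory and have already been adjusted.

```shell
#!/usr/bin/env bash
# Sketch only: submit_chain is a hypothetical helper, not part of DMTCP or SLURM.
# Usage: submit_chain NUMBER_OF_RESTART_JOBS
# Prints the job ID of the launch job, then one ID per queued restart job.
submit_chain() {
    local restarts="${1:-1}"
    local jobid
    local i=0
    # --parsable prints "jobid" or "jobid;cluster"; keep only the job ID.
    jobid=$(sbatch --parsable slurm_launch.job) || return 1
    jobid=${jobid%%;*}
    echo "$jobid"
    while [ "$i" -lt "$restarts" ]; do
        # afterany: start once the previous job has ended, in whatever state.
        jobid=$(sbatch --parsable --dependency=afterany:"$jobid" slurm_rstr.job) || return 1
        jobid=${jobid%%;*}
        echo "$jobid"
        i=$((i + 1))
    done
}
```

With submit_chain 3, one launch job and three restart jobs are queued. Restart jobs that are still pending after the program has finished should be cancelled with scancel, since they would otherwise try to restart from the last checkpoint again.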

Tips and known issues

When compiling code with the Intel compiler and OpenMP, use the flag -qopenmp-link=static. Otherwise parallel regions are silently not entered, without any error message.
For Java applications set the following environment variables:

export DMTCP_DL_PLUGIN=0
export DMTCP_SIGCKPT=10