Snapshotting Jobs

From HPC Wiki


Snapshotting a job means saving the current state of a running job so that its execution can be continued at a later point in time. This is useful for working around the time limits that many clusters impose and for bridging maintenance periods during which a cluster is not available. It can also be used to create regular snapshots of very long jobs so that the current progress is not lost if something goes wrong. Some programs have this functionality built into their code. Check whether your program does, as this is usually a much safer way to create snapshots. If it does not, keep on reading.

DMTCP

DMTCP (Distributed MultiThreaded CheckPointing) is a tool to create snapshots of currently running jobs. This is achieved by saving the current memory state to a checkpoint file. This file can then be used to continue the job.


Installation

If DMTCP is not available at your site, it should be installed by an administrator. A viable option is EasyBuild, for which an easyconfig file for DMTCP is available.
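Whether DMTCP is already available can be checked from a shell before contacting an administrator. The following is a minimal sketch, assuming a bash shell; have_dmtcp is a hypothetical helper name and not part of DMTCP.

```shell
#!/usr/bin/env bash
# Sketch: check that the DMTCP commands used on this page are on the PATH.
# have_dmtcp is a hypothetical helper name, not part of DMTCP.
have_dmtcp() {
    local cmd
    for cmd in dmtcp_coordinator dmtcp_launch dmtcp_restart; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            # Report the first command that cannot be found and stop.
            echo "missing: $cmd"
            return 1
        fi
    done
    echo "DMTCP found"
}
```

If the check fails on a system where EasyBuild is installed, have_dmtcp || eb --search dmtcp should list the DMTCP easyconfig files known to that installation.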

Manual

Here we assume that you have direct access to a compute node via SSH (e.g. through an interactive job session) and that all DMTCP commands are available in your environment.

1. Navigate to the location where you want to run your program and launch the DMTCP coordinator:

dmtcp_coordinator

2. In a separate shell (on the same node) launch your application:

dmtcp_launch ./a.out

3. To create a new checkpoint, return to the shell where you started the coordinator, type c and press [RETURN].


4.1 Restart: Creating a checkpoint causes the dmtcp_coordinator to write a checkpoint file (file type: .dmtcp) for each client process. If all processes were on the same processor and there were no .dmtcp files prior to this checkpoint, you can run:

dmtcp_restart ckpt_*.dmtcp


4.2 Next to the checkpoint files, DMTCP also writes a restart script. This script checks whether a resource manager (e.g. SLURM, PBS) was used and sets the environment variable RES_MANAGER accordingly, for example when you are running in an interactive SLURM session. You can try to use this script to restart your program, but you might have to adjust it to your needs:

./dmtcp_restart_script.sh
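The manual steps above can also be driven from a single shell instead of two. The following is a minimal sketch, assuming the option names of DMTCP 2.x (check the --help output of each command for your version). COORD_PORT and the function names are hypothetical and only serve as illustration.

```shell
#!/usr/bin/env bash
# Sketch: the manual steps, wrapped in shell functions.
# COORD_PORT and all function names are assumptions for illustration.
COORD_PORT=${COORD_PORT:-7779}   # 7779 is DMTCP's default coordinator port

start_coordinator() {
    # Step 1: run the coordinator in the background instead of a second shell.
    dmtcp_coordinator --daemon --exit-on-last -p "$COORD_PORT"
}

launch_app() {
    # Step 2: start the application under DMTCP control, in the background.
    dmtcp_launch -p "$COORD_PORT" "$@" &
}

request_checkpoint() {
    # Step 3: replaces typing "c" in the coordinator shell.
    # --bcheckpoint blocks until the checkpoint has been written.
    dmtcp_command -p "$COORD_PORT" --bcheckpoint
}

restart_from_checkpoint() {
    # Step 4.1: resume from the checkpoint files in the current directory.
    dmtcp_restart -p "$COORD_PORT" ckpt_*.dmtcp
}
```

A session could then consist of start_coordinator, launch_app ./a.out and, some time later, request_checkpoint. After the job has ended, restart_from_checkpoint is called from the same directory.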


Automatically creating checkpoints
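
One way to have checkpoints created automatically, rather than typing c in the coordinator shell, is to set a checkpoint interval. The following is a minimal sketch, assuming the -i (--interval) option of dmtcp_launch as found in DMTCP 2.x. launch_with_interval is a hypothetical helper name and the one-hour interval is an arbitrary example value.

```shell
#!/usr/bin/env bash
# Sketch: ask DMTCP to create checkpoints periodically.
# launch_with_interval is a hypothetical helper name, not part of DMTCP.
launch_with_interval() {
    local interval_s=$1   # seconds between automatic checkpoints
    shift
    # -i / --interval sets the checkpoint interval in seconds.
    dmtcp_launch -i "$interval_s" "$@"
}

# Example: checkpoint ./a.out once per hour.
# launch_with_interval 3600 ./a.out
```

The interval should be chosen with the job's time limit and the time needed to write a checkpoint in mind, since writing the memory state of a large job to disk can take a while.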

Integration into SLURM