Snapshotting Jobs
Latest revision as of 15:24, 29 November 2021
Snapshotting jobs (also known as Checkpoint/Restart) is a way to save the current state of a running job and continue its execution at a later point in time. This can be useful to work around the time limits that clusters often impose, or periods of maintenance during which a cluster is not available. It can also be used to create regular snapshots of very long jobs so that the current progress is not lost in case something goes wrong. Some programs have this functionality built into their code. Please check whether your program does, as this is usually a much safer way to create snapshots. If it does not, keep on reading.
DMTCP
DMTCP (Distributed MultiThreaded CheckPointing) is a tool to create snapshots of currently running jobs. This is achieved by saving the current memory state to a checkpoint file. This file can then be used to continue the job.
Current limitations
At this point DMTCP works reliably only for non-MPI codes, i.e. multi-threaded codes using OpenMP, pthreads etc., even though MPI codes are officially supported. There are ongoing efforts to re-implement the support for MPI codes. The project is called MANA and is in active development. However, at this point the code is highly tailored to the usage at NERSC. MANA is expected to be merged into the official DMTCP code base at a future point.
Installation
If DMTCP is not available at your site, installation should be done by an administrator. A viable option is to use EasyBuild, as an easyconfig file for DMTCP is available.
Manual usage
Here we assume that you have direct access to a compute node via SSH (e.g. through an interactive job session) and that all DMTCP commands are available in your environment.
1. Navigate to the location where you want to run your program and launch the DMTCP coordinator:
dmtcp_coordinator
2. In a separate shell (on the same node) launch your application:
dmtcp_launch ./a.out
3. To create a new checkpoint, return to the shell where you started the coordinator. Type c and press [RETURN].
4.1 Restart: Creating a checkpoint causes the dmtcp_coordinator to write a checkpoint file (file type: .dmtcp) for each client process. If all processes were on the same processor and there were no .dmtcp files prior to this checkpoint, you can run:
dmtcp_restart ckpt_*.dmtcp
4.2 Alongside the checkpoint file, DMTCP also writes a restart script. Note that this script checks whether a resource manager (e.g. SLURM, PBS) was used and sets the environment variable RES_MANAGER accordingly, for example if you are running in an interactive SLURM session. You can try to use this script to restart your program, but you might have to adjust it to your needs:
./dmtcp_restart_script.sh
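The manual steps above can also be driven from a single shell, without typing into the coordinator prompt. The following is a minimal sketch, not a definitive recipe: it assumes a DMTCP installation that provides dmtcp_command and the coordinator's --daemon option, it uses ./a.out as a placeholder as in the steps above, and the 10 second wait before the checkpoint is an arbitrary, hypothetical value.

```shell
#!/bin/sh
# Sketch: run the manual checkpoint workflow from a single shell.
# Assumption: ./a.out stands in for your program, as in the steps above.

# Return success only if every DMTCP command used below is on PATH.
have_dmtcp() {
    for tool in dmtcp_coordinator dmtcp_launch dmtcp_command; do
        command -v "$tool" >/dev/null 2>&1 || return 1
    done
}

run_with_one_checkpoint() {
    app=$1
    dmtcp_coordinator --daemon      # coordinator detaches into the background
    dmtcp_launch "$app" &           # application connects to the coordinator
    sleep 10                        # hypothetical wait before checkpointing
    dmtcp_command --checkpoint      # same effect as typing "c" in the coordinator
    wait                            # let the application run to completion
}

if have_dmtcp; then
    run_with_one_checkpoint ./a.out
else
    echo "DMTCP not found in PATH; load it first" >&2
fi
```

The coordinator started this way keeps running after the application ends, so it may have to be stopped separately.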
Automatically creating checkpoints
You can also instruct the DMTCP coordinator to periodically create checkpoints. Use
dmtcp_coordinator -i 300 &
to start the coordinator in the background and have it create a new checkpoint file every 300 seconds (5 min). Then start your code with
dmtcp_launch ./a.out
Restarting from a checkpoint file works the same way as before:
dmtcp_restart ckpt_*.dmtcp
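With periodic checkpoints, the same script can be used both for the first run and for every later continuation by checking whether checkpoint files already exist. The sketch below illustrates that idea under several assumptions: checkpoint files are written to the current directory, ./a.out is a placeholder, the helper names start_mode and start_or_resume are hypothetical, and the coordinator supports the --daemon option. As noted for the manual restart, stale .dmtcp files from earlier runs would also match the pattern and should be removed first.

```shell
#!/bin/sh
# Sketch: resume from checkpoint files if any exist, otherwise start fresh.
# Assumptions: checkpoints live in the current directory; ./a.out is a placeholder.

# Print "restart" if at least one ckpt_*.dmtcp file exists in directory $1,
# otherwise print "launch".
start_mode() {
    for f in "$1"/ckpt_*.dmtcp; do
        if [ -e "$f" ]; then
            echo restart
            return 0
        fi
    done
    echo launch
}

start_or_resume() {
    # Background coordinator that checkpoints every 300 seconds.
    dmtcp_coordinator --daemon -i 300
    if [ "$(start_mode .)" = restart ]; then
        dmtcp_restart ckpt_*.dmtcp
    else
        dmtcp_launch ./a.out
    fi
}

# Only act when DMTCP is actually available in this environment.
if command -v dmtcp_launch >/dev/null 2>&1; then
    start_or_resume
fi
```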
Integration into SLURM
There is a plugin for DMTCP to use it in combination with the SLURM job scheduler. This enables users to submit jobs using SLURM, automatically write out checkpoint files and restart those jobs from a previously taken checkpoint. The plugin is compiled automatically when using the EasyBuild installation. To use the plugin, the developers provide two SLURM submission scripts, one to start a job and one to restart it.
Launching a new job
The slurm_launch.job script is used to submit a new job with snapshotting enabled. The frequency of the snapshots can be changed in the script via
start_coordinator -i 120 # in seconds
as well as the executable which will be launched
dmtcp_launch --rm ./a.out
In addition, the requested resources have to be adjusted via the usual #SBATCH directives. After adjusting the script to your needs, you can submit it via
sbatch slurm_launch.job
Restarting a job
To restart a job that was previously submitted via the launch script, simply submit the restart script via sbatch. Here, you only have to adjust the #SBATCH directives and the DMTCP coordinator settings. The script then uses the generated dmtcp_restart_script.sh to restart from the last checkpoint file.
sbatch slurm_rstr.job
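If the launch job is expected to run into its time limit, the restart job can be queued right away so that it starts once the launch job has ended. The following sketch relies on two standard sbatch options, --parsable and --dependency=afterany; the helper name dependent_args is hypothetical, and whether chaining the two provided scripts like this suits your site is an assumption you should verify.

```shell
#!/bin/sh
# Sketch: submit the launch job, then queue the restart job to run once it ends.

# Print the sbatch arguments for a job that waits for job ID $1 and runs script $2.
# "afterany" releases the dependent job however the first job ended,
# including termination at the time limit.
dependent_args() {
    printf '%s\n' "--dependency=afterany:$1 $2"
}

if command -v sbatch >/dev/null 2>&1; then
    # --parsable prints the job ID (followed by ";cluster" on multi-cluster setups)
    jobid=$(sbatch --parsable slurm_launch.job) || exit 1
    jobid=${jobid%%;*}
    # Word splitting of the unquoted expansion is intentional here.
    sbatch $(dependent_args "$jobid" slurm_rstr.job)
fi
```

The restart job only has something to resume from if the launch job wrote at least one checkpoint before it ended, so the checkpoint interval should be well below the requested wall time.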
Tips and known issues
- When compiling code with the Intel compiler and OpenMP, use the flag
-qopenmp-link=static
Otherwise parallel regions are not entered, and no error message is shown.
- For Java applications, set the following environment variables:
export DMTCP_DL_PLUGIN=0
export DMTCP_SIGCKPT=10