https://hpc-wiki.info/hpc/index.php?title=FAQ_Batch_Jobs&feed=atom&action=history
FAQ Batch Jobs - Revision history
2024-03-28T17:55:42Z
Revision history for this page on the wiki
MediaWiki 1.35.9
https://hpc-wiki.info/hpc/index.php?title=FAQ_Batch_Jobs&diff=2211&oldid=prev
Christian-griebel-aab2@tu-darmstadt.de: /* Common Problems and Pitfalls of Batch Jobs */
2020-01-20T00:46:50Z
<p><span dir="auto"><span class="autocomment">Common Problems and Pitfalls of Batch Jobs</span></span></p>
<table class="diff diff-contentalign-left diff-editfont-monospace" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="en">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 00:46, 20 January 2020</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l41" >Line 41:</td>
<td colspan="2" class="diff-lineno">Line 41:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>;I get "<tt>srun: Job step creation temporarily disabled</tt>", have no results and my job <del class="diffchange diffchange-inline">seem </del>to have idled until it times out?</div></td><td class='diff-marker'>+</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>;I get "<tt>srun: Job step creation temporarily disabled</tt>", have no results and my job <ins class="diffchange diffchange-inline">seems </ins>to have idled until it times out?</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>This is usually caused by "''nested calls''" to either <tt>srun</tt> or <tt>mpirun</tt> within the same job. The second or "inner" instance of srun/mpirun tries to allocate the same resources as the "outer" one already did, and thus cannot complete.</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>This is usually caused by "''nested calls''" to either <tt>srun</tt> or <tt>mpirun</tt> within the same job. The second or "inner" instance of srun/mpirun tries to allocate the same resources as the "outer" one already did, and thus cannot complete.</div></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l48" >Line 48:</td>
<td colspan="2" class="diff-lineno">Line 48:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div> srun myScientificProgram …</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div> srun myScientificProgram …</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>Check whether <del class="diffchange diffchange-inline">your application's main command <tt></del>myScientificProgram<del class="diffchange diffchange-inline"></tt> </del>in fact is <del class="diffchange diffchange-inline">a </del>MPI-capable binary. Then, the above syntax is correct.</div></td><td class='diff-marker'>+</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Check whether myScientificProgram in fact is <ins class="diffchange diffchange-inline">an </ins>MPI-capable binary. Then, the above syntax is correct.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>But if <del class="diffchange diffchange-inline"><tt></del>myScientificProgram<del class="diffchange diffchange-inline"></tt> </del>turns out to be a script, calling <tt>srun</tt> or <tt>mpirun</tt> by itself, then remove the <tt>srun</tt> in front of <del class="diffchange diffchange-inline"><tt></del>myScientificProgram<del class="diffchange diffchange-inline"></tt> </del>and run it directly.</div></td><td class='diff-marker'>+</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>But if myScientificProgram turns out to be a script, calling <tt>srun</tt> or <tt>mpirun</tt> by itself, then remove the <tt>srun</tt> in front of myScientificProgram and run it directly.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l86" >Line 86:</td>
<td colspan="2" class="diff-lineno">Line 86:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div> exit $EXITCODE</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div> exit $EXITCODE</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>Immediately after <del class="diffchange diffchange-inline">executing </del>myScientificProgram, its exit code is saved to $EXITCODE, and as a last line now, your job script can re-set <del class="diffchange diffchange-inline">this </del>exit code <del class="diffchange diffchange-inline">(</del>the one of the real payload<del class="diffchange diffchange-inline">)</del>.</div></td><td class='diff-marker'>+</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Immediately after <ins class="diffchange diffchange-inline">execution of </ins>myScientificProgram, its exit code is saved to $EXITCODE, and as a last line now, your job script can re-set <ins class="diffchange diffchange-inline">its own </ins>exit code <ins class="diffchange diffchange-inline">to </ins>the one of the real payload.</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>That way, Slurm <del class="diffchange diffchange-inline">get </del>the <del class="diffchange diffchange-inline">“real” </del>exit code of “myScientificProgram”, not just the one of the command which happens to be the last one in your job script, and will set COMPLETED or FAILED appropriately.</div></td><td class='diff-marker'>+</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>That way, Slurm <ins class="diffchange diffchange-inline">grasps </ins>the exit code of “myScientificProgram”, not just the one of the command which happens to be the last one in your job script, and will set COMPLETED or FAILED appropriately.</div></td></tr>
</table>
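<p>The exit-code handling described above can be tried out without Slurm. The following is a minimal sketch, not the wiki's original job script: the function <tt>myScientificProgram</tt> is a hypothetical stand-in that always fails with code 3, and <tt>job_script</tt> is an assumed name for a subshell that plays the role of the batch script.</p>

```shell
#!/bin/bash
# Hypothetical payload standing in for the real program; it always fails with code 3.
myScientificProgram() { return 3; }

# The parenthesised body runs in a subshell, so "exit" ends only this
# stand-in for the job script, not the surrounding shell.
job_script() (
    myScientificProgram
    EXITCODE=$?                  # save the payload's exit code right away
    echo "copying results ..."   # any later command overwrites $? (here with 0)
    exit $EXITCODE               # make the payload's code the script's own exit code
)

status=0
job_script || status=$?
echo "job script exit code: $status"
```

<p>In a real job script the lines inside <tt>job_script</tt> would sit at the top level, with <tt>exit $EXITCODE</tt> as the last line. Without the saved value, the script would end with the exit code of the <tt>echo</tt>, which is 0.</p>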
Christian-griebel-aab2@tu-darmstadt.de
https://hpc-wiki.info/hpc/index.php?title=FAQ_Batch_Jobs&diff=2210&oldid=prev
Christian-griebel-aab2@tu-darmstadt.de: /* Common Problems and Pitfalls of Batch Jobs */
2020-01-20T00:18:16Z
<p><span dir="auto"><span class="autocomment">Common Problems and Pitfalls of Batch Jobs</span></span></p>
<table class="diff diff-contentalign-left diff-editfont-monospace" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="en">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 00:18, 20 January 2020</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l39" >Line 39:</td>
<td colspan="2" class="diff-lineno">Line 39:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>This also makes sure your job is reproducible later on, independently of what modules were loaded in your login session at submit time.</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>This also makes sure your job is reproducible later on, independently of what modules were loaded in your login session at submit time.</div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>;I get "<tt>srun: Job step creation temporarily disabled</tt>", have no results and my job seem to have idled until it times out?</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>;I get "<tt>srun: Job step creation temporarily disabled</tt>", have no results and my job seem to have idled until it times out?</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>This is usually caused by "''nested calls''" to either <tt>srun</tt> or <tt>mpirun</tt> within the same job.</div></td><td class='diff-marker'>+</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>This is usually caused by "''nested calls''" to either <tt>srun</tt> or <tt>mpirun</tt> within the same job<ins class="diffchange diffchange-inline">. The second or "inner" instance of srun/mpirun tries to allocate the same resources as the "outer" one already did, and thus cannot complete</ins>.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div> #!/bin/bash</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div> #!/bin/bash</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div> #SBATCH …</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div> #SBATCH …</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div> srun myScientificProgram …</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div> srun myScientificProgram …</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;">or</del></div></td><td colspan="2"> </td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;"> mpirun myScientificProgram …</del></div></td><td colspan="2"> </td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Check whether your application's main command <tt>myScientificProgram</tt> in fact is a MPI-capable binary. Then, the above syntax is correct.</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Check whether your application's main command <tt>myScientificProgram</tt> in fact is a MPI-capable binary. Then, the above syntax is correct.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>But if <tt>myScientificProgram</tt> <del class="diffchange diffchange-inline">is </del>a script, calling <tt>srun</tt> or <tt>mpirun</tt> by itself, then remove the srun/<del class="diffchange diffchange-inline">mpirun </del>in front of <del class="diffchange diffchange-inline">it </del>and run it directly.</div></td><td class='diff-marker'>+</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>But if <tt>myScientificProgram</tt> <ins class="diffchange diffchange-inline">turns out to be </ins>a script, calling <tt>srun</tt> or <tt>mpirun</tt> by itself, then remove the <ins class="diffchange diffchange-inline"><tt></ins>srun<ins class="diffchange diffchange-inline"><</ins>/<ins class="diffchange diffchange-inline">tt> </ins>in front of <ins class="diffchange diffchange-inline"><tt>myScientificProgram</tt> </ins>and run it directly.</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
</table>
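<p>One way to carry out the check described above is to look inside <tt>myScientificProgram</tt> before deciding whether to put <tt>srun</tt> in front of it. The sketch below is only a heuristic under stated assumptions: <tt>calls_launcher</tt> is a hypothetical helper name, the two demo files are throwaway stand-ins, and a match inside a comment counts as a hit too.</p>

```shell
#!/bin/bash
# Heuristic only: succeeds if "$1" is a script (first two bytes are "#!")
# that mentions srun or mpirun as a whole word anywhere in the file.
calls_launcher() {
    [ "$(head -c 2 "$1")" = '#!' ] && grep -Eqw 'srun|mpirun' "$1"
}

# Demo with two throwaway files standing in for myScientificProgram.
demo_dir=$(mktemp -d)
printf '#!/bin/bash\nmpirun ./solver "$@"\n' > "$demo_dir/wrapper"
printf '#!/bin/bash\n./solver "$@"\n' > "$demo_dir/plain"

for candidate in wrapper plain; do
    if calls_launcher "$demo_dir/$candidate"; then
        echo "$candidate: calls srun/mpirun itself, run it without a leading srun"
    else
        echo "$candidate: no srun/mpirun call found inside"
    fi
done
rm -rf "$demo_dir"
```

<p>A file that does not start with <tt>#!</tt> is treated as a binary here; whether such a binary is MPI-capable still has to be checked separately.</p>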
Christian-griebel-aab2@tu-darmstadt.de
https://hpc-wiki.info/hpc/index.php?title=FAQ_Batch_Jobs&diff=2209&oldid=prev
Christian-griebel-aab2@tu-darmstadt.de: /* Common Problems and Pitfalls of Batch Jobs */
2020-01-20T00:14:07Z
<p><span dir="auto"><span class="autocomment">Common Problems and Pitfalls of Batch Jobs</span></span></p>
<table class="diff diff-contentalign-left diff-editfont-monospace" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="en">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 00:14, 20 January 2020</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l39" >Line 39:</td>
<td colspan="2" class="diff-lineno">Line 39:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>This also makes sure your job is reproducible later on, independently of what modules were loaded in your login session at submit time.</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>This also makes sure your job is reproducible later on, independently of what modules were loaded in your login session at submit time.</div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">;I get "<tt>srun: Job step creation temporarily disabled</tt>", have no results and my job seem to have idled until it times out?</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">This is usually caused by "''nested calls''" to either <tt>srun</tt> or <tt>mpirun</tt> within the same job.</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"> #!/bin/bash</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"> #SBATCH …</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"> srun myScientificProgram …</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">or</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"> mpirun myScientificProgram …</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">Check whether your application's main command <tt>myScientificProgram</tt> in fact is a MPI-capable binary. Then, the above syntax is correct.</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">But if <tt>myScientificProgram</tt> is a script, calling <tt>srun</tt> or <tt>mpirun</tt> by itself, then remove the srun/mpirun in front of it and run it directly.</ins></div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
</table>
Christian-griebel-aab2@tu-darmstadt.de
https://hpc-wiki.info/hpc/index.php?title=FAQ_Batch_Jobs&diff=2201&oldid=prev
Christian-griebel-aab2@tu-darmstadt.de at 13:48, 3 November 2019
2019-11-03T13:48:14Z
<p></p>
<table class="diff diff-contentalign-left diff-editfont-monospace" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="en">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 13:48, 3 November 2019</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l18" >Line 18:</td>
<td colspan="2" class="diff-lineno">Line 18:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In particular, the directory specified with</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In particular, the directory specified with</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div> #SBATCH -e /path/to/error/directory/%j.err</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div> #SBATCH -e /path/to/error/directory/%j.err</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>for the STDERR of your jobs needs to exist beforehand and must be writable for you.</div></td><td class='diff-marker'>+</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>for the STDERR of your jobs needs to exist beforehand and must be writable for you. SLURM ends the job immediately if it is unable to write, for example, the error file (due to a missing target directory).</div></td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div> </div></td><td colspan="2"> </td></tr>
<tr><td class='diff-marker'>−</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>SLURM ends the job immediately if it is unable to write, for example, the error file (due to a missing target directory).</div></td><td colspan="2"> </td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Due to being a “chicken and egg” problem, a construct ''inside'' the job script like</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Due to being a “chicken and egg” problem, a construct ''inside'' the job script like</div></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l26" >Line 26:</td>
<td colspan="2" class="diff-lineno">Line 24:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div> mkdir -p /path/to/error/directory/</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div> mkdir -p /path/to/error/directory/</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>cannot work either, since for Slurm, the “mkdir” command is already part of the job. Thus, any potential output of “mkdir” (STDOUT or STDERR) would have to be written to a directory which does not yet exist at the beginning of the job.</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>cannot work either, since for Slurm, the “mkdir” command is already part of the job. Thus, any potential output of “mkdir” (STDOUT or STDERR) would have to be written to a directory which does not yet exist at the beginning of the job.</div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>;Sometimes, my job runs successfully, sometimes it does not. Why is that?</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>;Sometimes, my job runs successfully, sometimes it does not. Why is that?</div></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l40" >Line 40:</td>
<td colspan="2" class="diff-lineno">Line 39:</td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>This also makes sure your job is reproducible later on, independently of what modules were loaded in your login session at submit time.</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>This also makes sure your job is reproducible later on, independently of what modules were loaded in your login session at submit time.</div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>;My jobs are reported as “COMPLETED”, even though my scientific program in fact failed miserably. Why is that?</div></td><td class='diff-marker'> </td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>;My jobs are reported as “COMPLETED”, even though my scientific program in fact failed miserably. Why is that?</div></td></tr>
</table>
Christian-griebel-aab2@tu-darmstadt.de
https://hpc-wiki.info/hpc/index.php?title=FAQ_Batch_Jobs&diff=2200&oldid=prev
Christian-griebel-aab2@tu-darmstadt.de: job scripts common pitfalls or problems
2019-11-03T13:45:46Z
<p>job scripts common pitfalls or problems</p>
<p><b>New page</b></p><div>[[Category:HPC-User]]<br />
<br />
= Common Problems and Pitfalls of Batch Jobs =<br />
<small>(explained using SLURM as the example)</small><br />
;Why are there certain ''mandatory'' resource requirements?<br />
The batch scheduler needs to know some minimal properties of a job to decide which nodes it should be started on.<br />
<br />
If, for example, you did not specify <code>--mem-per-cpu=</code> or <code>--mem=</code> (memory per node), a task requiring very large main memory might be scheduled to a node with too little RAM and would thus crash.<br />
<br />
To put it another way: with the resource requirements of all user jobs, the scheduler has to play a kind of “multidimensional Tetris”. At least along the dimensions runtime, memory size and number of CPU cores, the scheduler places your jobs as efficiently and as gap-free as possible into the cluster. (In the background, many more parameters are used.)<br />
<br />
These three properties of a job are thus the bare minimum to give the scheduler something to schedule with.<br />
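<br />
As a minimal sketch, the header of a job script stating these three properties could look as follows. All values are assumptions chosen for illustration only; which options are actually mandatory, and which limits apply, depends on your site.<br />

```shell
#!/bin/bash
#SBATCH --time=00:30:00        # runtime limit: 30 minutes (assumed value)
#SBATCH --mem-per-cpu=2000     # main memory per CPU core in MB (assumed value)
#SBATCH --ntasks=4             # number of tasks, one CPU core each by default (assumed value)

# Placeholder payload; replace with the call of your actual program.
echo "job started"
```

Since bash treats the <code>#SBATCH</code> lines as comments, the same file also runs as an ordinary shell script outside the scheduler.<br />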
<br />
<br />
;After submission of my job, it seems to start but exits immediately, without creating any output or error. What's wrong?<br />
<br />
Check whether all directories mentioned in your job script actually exist and are writable for you.<br />
In particular, the directory specified with<br />
#SBATCH -e /path/to/error/directory/%j.err<br />
for the STDERR of your jobs needs to exist beforehand and must be writable for you.<br />
<br />
SLURM ends the job immediately if it is unable to write, for example, the error file (due to a missing target directory).<br />
<br />
Due to being a “chicken and egg” problem, a construct ''inside'' the job script like<br />
#SBATCH -e /path/to/error/directory/%j.err<br />
mkdir -p /path/to/error/directory/<br />
cannot work either, since for Slurm, the “mkdir” command is already part of the job. Thus, any potential output of “mkdir” (STDOUT or STDERR) would have to be written to a directory which does not yet exist at the beginning of the job.<br />
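<br />
One way around this is to create the directory ''before'' submitting the job, for example on the login node. The following is a minimal sketch; the helper name <code>ensure_log_dir</code> and the script name <code>myJobScript.sh</code> are hypothetical and not part of Slurm.<br />

```shell
#!/bin/bash
# Hypothetical helper: create the log directory (including parents)
# and succeed only if it exists and is writable afterwards.
ensure_log_dir() {
    mkdir -p "$1" && [ -d "$1" ] && [ -w "$1" ]
}

# Usage on the login node (path and script name are placeholders):
#   ensure_log_dir /path/to/error/directory && sbatch myJobScript.sh
```

Because of the <code>&&</code>, the job is only submitted if the directory check succeeded.<br />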
<br />
;Sometimes, my job runs successfully, sometimes it does not. Why is that?<br />
<br />
Make sure the relevant [[modules]] are loaded in your job script.<br />
<br />
While you can load those modules when logging in on the login node (the batch job inherits the environment of the submitting session), this is not reliable: it makes your jobs dependent on whichever modules happen to be loaded in your login session at submit time.<br />
<br />
We thus recommend beginning each job script with<br />
module purge<br />
module load <each and every relevant module><br />
myScientificProgram …<br />
to have exactly those modules loaded which are needed, and not more.<br />
<br />
This also makes sure your job is reproducible later on, independently of what modules were loaded in your login session at submit time.<br />
<br />
;My jobs are reported as “COMPLETED”, even though my scientific program in fact failed miserably. Why is that?<br />
<br />
There is no magic by which the scheduler could know the really important part of your job script. The only way for Slurm to detect success or failure is the exit code of your job script, not the real success or failure of any program or command ''within'' it.<br />
<br />
The exit code of well-written programs is zero if everything went well, and non-zero if an error has occurred.<br />
<br />
Imagine the following job script:<br />
<br />
#!/bin/bash<br />
#SBATCH …<br />
myScientificProgram …<br />
<br />
Here, the last command executed is in fact your scientific program, so the whole job script exits with the exit code of “myScientificProgram” as desired. Thus, Slurm will assign COMPLETED if “myScientificProgram” exits with code 0, and FAILED otherwise.<br />
<br />
If you issue just one simple command after “myScientificProgram”, this will overwrite the exit code of “myScientificProgram” with its own:<br />
<br />
#!/bin/bash<br />
#SBATCH …<br />
myScientificProgram …<br />
cp resultfile $HOME/jobresults/<br />
<br />
Now, the exit code of the “cp” command will be the whole job's exit code, since “cp” is the last command of the job script. If the “cp” command succeeds, Slurm will assign COMPLETED even though “myScientificProgram” might have failed: the success of “cp” ''covers'' the failure of “myScientificProgram”.<br />
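<br />
This masking effect can be reproduced without a scheduler. In the following sketch, <code>failing_payload</code> is a hypothetical stand-in for a failing “myScientificProgram”, and <code>true</code> stands in for a successful “cp”.<br />

```shell
#!/bin/bash
# Hypothetical stand-in for a "myScientificProgram" that fails.
failing_payload() { return 1; }

# Mimics a job script in which a successful command follows the failed payload.
masked_job() {
    failing_payload
    true    # stand-in for a successful "cp resultfile ..."
}

rc=0
masked_job || rc=$?
echo "exit code seen by the scheduler: $rc"    # prints 0: the failure is hidden
```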
<br />
To avoid that, save the exit code of your important program before executing any additional commands:<br />
<br />
#!/bin/bash<br />
#SBATCH …<br />
myScientificProgram …<br />
EXITCODE=$?<br />
cp resultfile $HOME/jobresults/<br />
/any/other/job/closure/cleanup/commands …<br />
exit $EXITCODE<br />
<br />
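The pattern can be checked locally without Slurm. In this sketch, <code>failing_payload</code> is again a hypothetical stand-in for a failing “myScientificProgram”; inside a function, <code>return</code> plays the role that <code>exit</code> has in a real job script.<br />

```shell
#!/bin/bash
# Hypothetical stand-in for a "myScientificProgram" that fails with code 3.
failing_payload() { return 3; }

# Mimics a job script that saves the payload's exit code before cleanup.
preserving_job() {
    failing_payload
    EXITCODE=$?          # save the payload's exit code immediately
    true                 # stand-in for "cp" and other cleanup commands
    return "$EXITCODE"   # a real job script would use: exit $EXITCODE
}

rc=0
preserving_job || rc=$?
echo "exit code seen by the scheduler: $rc"    # prints 3: the failure is preserved
```

For a real job, the recorded state and exit code can be inspected afterwards with <code>sacct -j JOBID --format=JobID,State,ExitCode</code>, where JOBID is a placeholder for your job's ID and job accounting is assumed to be enabled on your cluster.<br />
<br />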
Immediately after myScientificProgram has finished, its exit code is saved to $EXITCODE, and the last line of your job script then re-sets this exit code (the one of the real payload).<br />
That way, Slurm gets the “real” exit code of “myScientificProgram”, not just that of whichever command happens to be the last one in your job script, and will set COMPLETED or FAILED appropriately.</div>
Christian-griebel-aab2@tu-darmstadt.de