[[Category:HPC-Admin]]
 
= General tips & tricks in administrating HPC clusters =

== Mutual dependencies of services ==

;Problem

''After reboot or power cycle/failure, the local compute nodes' scheduler daemon is started too early: the [[HPC-Dictionary#Shared filesystems|global filesystem]] is not ready yet and the first job fails on those nodes.''


When the local scheduler daemon starts, it will most likely report the node as "ready to receive jobs" to its master daemon. If the mounts of the remote filesystems have been initiated but are not yet finished, the first job(s) will fail due to missing directories and files.

You could now write node-local checker scripts that try to read or write on the mount points, with all bells and whistles such as <code>timeout ... touch /mount/point/tmp/$(uname -n).checker</code>. Or you could write fine-grained ''systemd'' dependencies (with <code>PathExists=</code> or <code>DirectoryNotEmpty=</code>), as sketched below.
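
A minimal sketch of the ''systemd'' path-unit variant, assuming the shared filesystem is mounted at the hypothetical path <code>/gpfs/cluster</code> and the scheduler daemon is <code>slurmd</code>:

 # /etc/systemd/system/slurmd-watch.path (hypothetical unit name)
 [Unit]
 Description=Start slurmd once the shared filesystem mount point is populated
 
 [Path]
 # Trigger as soon as the watched directory contains any entry
 DirectoryNotEmpty=/gpfs/cluster
 # The unit to activate once the condition above fires
 Unit=slurmd.service
 
 [Install]
 WantedBy=paths.target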

All of these will inevitably fail if the [[HPC-Dictionary#Shared filesystems|shared filesystem]] takes longer than expected to become operational.

;Suggestion

Try to "turn the tables" and check whether your [[HPC-Dictionary#Shared filesystems|shared filesystem]] supports any kind of callback or signalling to announce "Now, I am really ready and operational". Then have your [[HPC-Dictionary#Shared filesystems|shared filesystem]] start up your local scheduler daemon once everything is ready.

In the case of Spectrum Scale (GPFS), you can define "user callbacks" which are triggered ''locally on each node'' at certain events. Creating such a callback (using <code>systemd</code> and <code>slurmd</code> as an example):

 mmaddcallback NameOfCB --command '''/bin/systemctl''' --parms "'''start slurmd'''" --event ''startup'' -N all,your,compute,node,classes,as,defined,in,GPFS

The event ''startup'' is in fact GPFS's "local full readiness" state. The callback will thus be invoked on each node ''only after'' it has completed all of the GPFS arbitration, joining and mounting, in effect then running

 '''/bin/systemctl start slurmd'''

which is exactly what you want. The name "NameOfCB" is not important; it is only used to list or delete such a callback later on.
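
For that later housekeeping, GPFS provides matching commands (a short sketch; the callback name is the one chosen above):

 # list all currently defined callbacks
 mmlscallback
 # remove the callback registered above, by its name
 mmdelcallback NameOfCB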

Now, you need to exempt the systemd unit of your local scheduler daemon from being auto-started during bootup. Simply disable it (which in effect only removes a symbolic link):

 systemctl disable slurmd

and watch the next reboot for the orderly coming up of GPFS, duly followed by <code>slurmd</code>.
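
To double-check the handover after the next boot, the following should report "disabled" (systemd itself will not start the unit) and yet "active" (the GPFS callback did), assuming the unit is named <code>slurmd</code>:

 systemctl is-enabled slurmd   # expected output: disabled
 systemctl is-active slurmd    # expected output: active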