Administration tips and tricks

From HPC Wiki
Revision as of 16:55, 1 October 2019 by Christian-griebel-aab2@tu-darmstadt.de


General tips & tricks in administrating HPC clusters

Mutual dependencies of services

Problem

After reboot or power cycle/failure, the local compute nodes' scheduler daemon is started too early: the global filesystem is not ready yet and the first job fails on those nodes.


When the local scheduler daemon starts, it will most likely report the node as "ready to receive jobs". If the mounts of the remote filesystems have been initiated but have not finished yet, the first jobs will fail due to missing directories or files.

You could write local checker scripts that try to read or write on the node's mount points. However, if the shared filesystem takes longer than expected to become operational, these checkers will still fail, even when they use timeouts or other sophisticated checks.
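To illustrate the limitation, here is a minimal sketch of such a polling checker. The function name `wait_for_mount` and its default values are hypothetical and chosen only for illustration; the `is_mounted`, `clock` and `sleep` parameters exist so the behaviour can be exercised without a real filesystem.

```python
import os
import time


def wait_for_mount(path, timeout_s=300.0, poll_s=5.0,
                   is_mounted=os.path.ismount,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll until `path` is a mount point or `timeout_s` has elapsed.

    Returns True if the mount appeared in time, False otherwise.
    """
    deadline = clock() + timeout_s
    while True:
        if is_mounted(path):
            return True
        if clock() >= deadline:
            # The checker gives up here, although the filesystem may
            # become ready only a few seconds later.
            return False
        sleep(poll_s)
```

Whatever value is picked for `timeout_s`, a filesystem that needs longer than that makes the checker report failure, which is exactly the weakness described above.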

Suggestion

Try to "turn the tables": check whether your shared filesystem supports any kind of "Now, I am really ready and operational" callback or signal. Then have the shared filesystem start your local scheduler daemon once everything is ready.
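The node-side half of this idea can be sketched as a small handler that the filesystem's "ready" event would invoke. This is only a sketch under assumptions: the mount points and the service name `scheduler-daemon.service` are hypothetical placeholders, and how the handler gets registered depends entirely on your filesystem.

```python
import os
import subprocess

# Hypothetical values; replace with your site's mount points and the
# systemd unit of your scheduler's node daemon.
MOUNT_POINTS = ["/shared/home", "/shared/scratch"]
START_CMD = ["systemctl", "start", "scheduler-daemon.service"]


def start_scheduler_if_ready(mount_points, start_cmd,
                             is_mounted=os.path.ismount,
                             run=subprocess.run):
    """Start the scheduler daemon only if all mount points are mounted.

    Returns the list of mount points that are still missing; an empty
    list means the start command was issued.
    """
    missing = [p for p in mount_points if not is_mounted(p)]
    if missing:
        return missing
    # check=True raises CalledProcessError if the start command fails.
    run(start_cmd, check=True)
    return []
```

The handler would call `start_scheduler_if_ready(MOUNT_POINTS, START_CMD)` and exit non-zero if the returned list is not empty. For this to help, the scheduler daemon must not be enabled to start on its own at boot, so that the filesystem event is the only thing that starts it.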

In the case of GPFS, you can define "user callbacks" that are triggered by certain events. Creating such a callback: