SSH to Compute Nodes (Admin Guide)

User Access on Compute Nodes

Compute nodes are somewhat fragile if users have full access to them. Consider the case where a user has unlimited SSH access to a compute node while a workload manager such as Slurm is running on the cluster. If the user starts a job directly on the node from the command line, Slurm does not know that the job is running, so the resources are still marked as idle. Another user who submits a job via Slurm may then be assigned the same resources, which puts the node in an undefined state.

To avoid this kind of behaviour, some methods have been collected on the ap3 mailing list.

Slurm

Users have SSH login access to a login node, from which jobs can be started using sbatch, srun, etc. On the compute nodes, the Slurm PAM module pam_slurm_adopt is used. The module's purpose is to prevent users from SSHing into any (non-login) node on which they do not own resources; owning resources requires either a running job or a job allocation on that node.
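
Below is a minimal configuration sketch of how pam_slurm_adopt is typically wired in. File locations, the surrounding PAM stack, and the available module options differ between distributions and Slurm versions, so treat it as an illustration rather than a drop-in setup.

 # /etc/pam.d/sshd on a compute node (illustrative; the exact stack is distro-specific)
 # pam_slurm_adopt should be the last "account" rule, so an SSH session is only
 # accepted if the user owns a job or allocation on this node.
 account    required    pam_slurm_adopt.so action_no_jobs=deny

 # slurm.conf (cluster-wide): create an "extern" step at allocation time so that
 # incoming SSH sessions can be adopted into the job's cgroup.
 PrologFlags=contain

Other PAM modules on the node (pam_systemd in particular) can interfere with adopting the session into the job's cgroup; the pam_slurm_adopt documentation describes the recommended stack.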

This or a similar system is used in Paderborn and Aachen; Bielefeld plans to introduce it in the near future.

HTCondor

Users have login access via LDAP/Kerberos (using sssd) on a desktop computer. To obtain resources, a job has to be submitted through the HTCondor system. The compute nodes sit behind a NAT gateway, and SSH connections to jobs are tunnelled through HTCondor's CCB service (out-of-the-box functionality); authentication for these connections also uses Kerberos. Furthermore, this solution allows X forwarding confined to a container: combined with SINGULARITY_JOB = true, all batch and interactive jobs are containerized, and users can connect to a running job's container via SSH.
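
The sketch below shows the kind of execute-node configuration knobs and the user-facing command this relies on. The container image path and the security setting are illustrative assumptions, not Bonn's actual configuration.

 # HTCondor execute-node configuration (illustrative values)
 # Run every job inside a Singularity/Apptainer container:
 SINGULARITY_JOB = true
 # Hypothetical container image; the real image is site-specific:
 SINGULARITY_IMAGE_EXPR = "/cvmfs/example.org/containers/default.sif"

 # Nodes behind the NAT gateway register with the CCB broker (typically the
 # central collector), so connections can be tunnelled back to them:
 CCB_ADDRESS = $(COLLECTOR_HOST)

 # Authenticate connections with Kerberos (assumed setting for illustration):
 SEC_DEFAULT_AUTHENTICATION_METHODS = KERBEROS

From the submit side, a user then attaches to a running job, and hence to its container, with condor_ssh_to_job <cluster>.<proc>; because the connection is brokered by CCB, this works even though the execute node is not directly reachable.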

This uses off-the-shelf functionality of HTCondor. Integration with Slurm is not yet feasible due to the containerization.

This system is used in Bonn (BAF2).

Resources