Admin Guide Parameters and Limits

From HPC Wiki


This article covers Linux kernel parameters (sysctl values) and user limits that should be adjusted on clusters.

Sysctl-Parameters

Linux normally has very sensible default values for sysctl parameters (historically found in subdirectories of /proc, but also in /sys these days). Some parameters should be adapted for an HPC cluster though, for example anything related to retaining MAC addresses (the ARP cache) or the maximum number of open TCP/IP connections. Since a lot of software, e.g. monitoring, needs an open connection to all N nodes, these values should be set to at least double that number to be safe.

For N=1000 (or N=1024) we use:

# Increase the arp cache size of the kernel
net.ipv4.neigh.default.gc_thresh1 = 4096
net.ipv4.neigh.default.gc_thresh2 = 8192
net.ipv4.neigh.default.gc_thresh3 = 8192

# Keep entries longer in the arp cache (values in seconds, 86400 s = 1 day)
net.ipv4.neigh.default.base_reachable_time = 86400
net.ipv4.neigh.default.gc_stale_time = 86400

# The maximum length of the listen backlog (pending connections) of a socket. Value of 2*N recommended for GPFS
net.core.somaxconn = 2000
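The values above can be derived from the node count instead of being typed by hand. The following Python sketch is only an illustration: the function names are hypothetical, and the rule for the ARP thresholds (4*N rounded up to a power of two, with the upper thresholds at twice that) is an assumption chosen because it reproduces the numbers listed for N=1000 and N=1024. The only rule stated in this article is "at least 2*N".

```python
def next_power_of_two(value: int) -> int:
    """Return the smallest power of two that is >= value (value >= 1)."""
    return 1 << (value - 1).bit_length()


def render_sysctl(n_nodes: int) -> str:
    """Render sysctl settings for a cluster with n_nodes nodes.

    Assumption (not from the article): gc_thresh1 is 4*N rounded up to a
    power of two, gc_thresh2 and gc_thresh3 are twice that value.
    """
    thresh1 = next_power_of_two(4 * n_nodes)
    settings = {
        "net.ipv4.neigh.default.gc_thresh1": thresh1,
        "net.ipv4.neigh.default.gc_thresh2": 2 * thresh1,
        "net.ipv4.neigh.default.gc_thresh3": 2 * thresh1,
        # One day, in seconds, as in the listing above
        "net.ipv4.neigh.default.base_reachable_time": 86400,
        "net.ipv4.neigh.default.gc_stale_time": 86400,
        # 2*N, the value recommended for GPFS
        "net.core.somaxconn": 2 * n_nodes,
    }
    return "".join(f"{key} = {value}\n" for key, value in settings.items())


print(render_sysctl(1000), end="")
```

For a different cluster size only the argument changes, e.g. render_sysctl(4000). The result should still be reviewed by hand before it is deployed.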

Settings such as these should be saved to a file in /etc/sysctl.d/ or, historically, to /etc/sysctl.conf. They are loaded at boot and can be applied to a running system with sysctl --system.
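To confirm that a node is actually running with the configured values, the file contents can be compared with the kernel's current state under /proc/sys. The sketch below is a minimal example with hypothetical function names. It assumes plain "key = value" lines, does not handle the "-" prefix that sysctl.conf allows, and maps keys to paths by replacing dots with slashes, which would be wrong for keys that contain an interface name with a dot in it.

```python
from pathlib import Path
from typing import Dict, Optional, Tuple


def parse_sysctl_conf(text: str) -> Dict[str, str]:
    """Parse 'key = value' lines, skipping blank lines and comments."""
    wanted = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            # Normalise whitespace so multi-value settings compare cleanly
            wanted[key.strip()] = " ".join(value.split())
    return wanted


def read_running_value(key: str, proc_sys: Path = Path("/proc/sys")) -> Optional[str]:
    """Read the running value of a sysctl key, or None if it is unreadable."""
    path = proc_sys / key.replace(".", "/")
    try:
        return " ".join(path.read_text().split())
    except OSError:
        return None


def find_mismatches(
    text: str, proc_sys: Path = Path("/proc/sys")
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {key: (wanted, running)} for every key that differs."""
    mismatches = {}
    for key, wanted in parse_sysctl_conf(text).items():
        running = read_running_value(key, proc_sys)
        if running != wanted:
            mismatches[key] = (wanted, running)
    return mismatches


example = "# backlog\nnet.core.somaxconn = 2000\n"
for key, (wanted, running) in find_mismatches(example).items():
    print(f"{key}: wanted {wanted}, running {running}")
```

On a node, the text would typically come from the deployed file, e.g. Path("/etc/sysctl.d/99-cluster.conf").read_text(), where the file name is only an example.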

User Limits

Some user limits (number of processes, open files, etc.) should be tuned according to the rules of the cluster. For example, should core files be generated on compute nodes? The first place to check is the /etc/security/limits.conf file. To override its values, a new file such as /etc/security/limits.d/99-cluster.conf can be created, for example:

*       soft    memlock 4086160 #Allow more locked memory for MPI (value in KB)
*       hard    memlock 4086160 #Allow more locked memory for MPI (value in KB)
*       soft    nofile  1048576 #Increase the number of file descriptors
*       hard    nofile  1048576 #Increase the number of file descriptors
*       soft    nproc   1024    #Limit the number of processes
*       hard    nproc   1024    #Limit the number of processes
*       soft    stack   unlimited       #Raise the soft stack limit up to the hard limit
*       soft    core    4194304 #Allow core files up to 4 GiB (value in KB)
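
Because these limits are applied by PAM when a session is opened, they only become visible after a new login. One way to check what a process actually received is Python's standard resource module, as in the following sketch (the helper names are hypothetical). Note that getrlimit reports memlock, stack and core in bytes, while limits.conf takes them in KB.

```python
import resource

# limits.conf item name -> name of the matching constant in the resource module
LIMIT_NAMES = {
    "memlock": "RLIMIT_MEMLOCK",
    "nofile": "RLIMIT_NOFILE",
    "nproc": "RLIMIT_NPROC",
    "stack": "RLIMIT_STACK",
    "core": "RLIMIT_CORE",
}


def format_limit(value: int) -> str:
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits() -> dict:
    """Return {item: (soft, hard)} for the current process."""
    limits = {}
    for item, constant in LIMIT_NAMES.items():
        # Some constants are missing on non-Linux platforms, so skip those
        res = getattr(resource, constant, None)
        if res is None:
            continue
        soft, hard = resource.getrlimit(res)
        limits[item] = (format_limit(soft), format_limit(hard))
    return limits


for item, (soft, hard) in current_limits().items():
    print(f"{item:8} soft={soft:>12} hard={hard:>12}")
```

Running this inside a batch job as well as in a login shell shows whether both environments end up with the same limits.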