Unix describes a family of operating systems. Popular Unix-like representatives include Ubuntu, CentOS and even macOS, although the latter is not common on HPC systems. Its key features are the shell and the file system.
The file system describes the directory structure of an operating system. On Unix-based systems the topmost directory is /, which is called the root directory. As the name suggests, the file system is organized hierarchically (like a tree) from there on. Most of the time you will be working in directories starting with /home/<username>, which represents the user's home directory. Everything below /home/<username> can be modified freely by the user.
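The tree structure can be made concrete with a small sketch using Python's pathlib. The user name alice and the directory project/run01 are purely illustrative assumptions, not paths from any real system.

```python
from pathlib import PurePosixPath

# Hypothetical path: "alice" and "project/run01" are illustrative names only.
path = PurePosixPath("/home/alice/project/run01")

# Every directory has exactly one parent; following the parents
# upwards always ends at the root directory "/".
ancestors = [str(p) for p in path.parents]
print(ancestors)

# A path lies inside a home directory if that home directory
# is one of its ancestors.
home = PurePosixPath("/home/alice")
inside_home = home in path.parents
print(inside_home)
```

PurePosixPath only manipulates path strings and never touches the disk, so the sketch behaves the same on any machine.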
An environment variable is a named, dynamic object on a computer that stores a value. On Unix-based operating systems you can:
- set the value of a variable (in a Bourne-style shell such as bash) with: export VARIABLE=value
- read the value of a variable with: echo $VARIABLE
Environment variables can be referenced by software (or the user) to get or set information about the system. Below are a few examples of environment variables that illustrate how they are used.
|$USER|your current username|
|$PWD|the directory you are currently in|
|$HOSTNAME|hostname of the computer you are on|
|$HOME|your home directory|
|$PATH|list of directories searched when a command is executed|
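As a rough sketch of how software can reference environment variables, the following Python snippet sets and reads a variable and then looks up the conventional Unix names that match the descriptions above (USER, PWD, HOSTNAME, HOME, PATH). The variable name MY_EXAMPLE_VARIABLE and the helper describe_environment are hypothetical and exist only for illustration.

```python
import os


def describe_environment(names=("USER", "PWD", "HOSTNAME", "HOME", "PATH")):
    """Return a dict mapping each variable name to its value (None if unset)."""
    return {name: os.environ.get(name) for name in names}


# Set a variable for this process and any child processes it starts.
# MY_EXAMPLE_VARIABLE is a made-up name used only in this sketch.
os.environ["MY_EXAMPLE_VARIABLE"] = "42"

# Read the value back; environment values are always strings.
value = os.environ.get("MY_EXAMPLE_VARIABLE")
print(value)

# Some variables may be unset in a given environment. HOSTNAME, for example,
# is often a shell variable that is not exported to child processes.
for name, current in describe_environment().items():
    print(f"{name}={current}")

# PATH holds several directories joined by a separator (":" on Unix).
search_dirs = os.environ.get("PATH", "").split(os.pathsep)
print(search_dirs)
```

Changes made through os.environ affect only the running process and its children, not the shell that started it.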
A cluster refers to a collection of multiple nodes connected via a network that offers high-bandwidth, low-latency communication. A cluster is accessed by connecting to its login nodes.
A node is an individual computer consisting of one or more sockets.
Backend nodes are reserved for executing memory-demanding and long-running applications. They are the most powerful, but also the most power-consuming part of a cluster, as they make up around 98% of it. Since these nodes are not directly accessible by the user, a scheduler manages access to them. To run on these nodes, a batch job needs to be submitted to the batch system via a scheduler-specific command.
Copy nodes are reserved for transferring data to or from a cluster. They usually offer a better connection than other nodes and minimize the disturbance of other users on the system. Depending on the facility, the software installed on these nodes may differ from that on other nodes due to their restricted use case, and not every facility provides a designated copy node at all. As an alternative, login nodes may be used to move data between systems.
Synonym for login node.
Login nodes are reserved for connecting to the cluster of a facility. Most of the time they can also be used for testing and performing interactive tasks (e.g. the analysis of previously collected application profiles). Such test runs should generally not take longer than a few minutes and should only be used to verify that your software runs correctly on the system and in its environment before you submit batch jobs to the batch system.
A socket is the physical package that encloses multiple cores sharing the same memory.
A core has one or more hardware threads and is responsible for executing instructions.
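To see how this hierarchy surfaces in practice, the sketch below asks the operating system how many logical CPUs it reports. Python's standard library counts hardware threads and does not distinguish sockets from cores, so the numbers are only a coarse view of a node.

```python
import os

# Number of logical CPUs (hardware threads) reported by the operating system.
# os.cpu_count() may return None if the number cannot be determined.
logical_cpus = os.cpu_count()

# The set of logical CPUs this process is allowed to run on can be smaller,
# for example when a scheduler restricts a job to part of a node.
# os.sched_getaffinity is not available on every platform, so fall back
# to the reported count (or 1 if even that is unknown).
if hasattr(os, "sched_getaffinity"):
    usable_cpus = len(os.sched_getaffinity(0))
else:
    usable_cpus = logical_cpus or 1

print(f"logical CPUs reported: {logical_cpus}")
print(f"logical CPUs usable by this process: {usable_cpus}")
```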
Central Processing Unit (CPU)
The word "CPU" is widely used in the field of HPC, though it is not precisely defined. It is mostly used to describe the concrete hardware architecture of a node, but should generally be avoided due to possible misunderstandings and ambiguities.
Random Access Memory (RAM)
The RAM is used as working memory for the cores. It is volatile memory, meaning that the data in the RAM is no longer available after a process ends. The RAM is shared between all sockets on a node, though it is physically separated for each socket.
A cache is a relatively small amount of fast memory (compared to RAM) on the CPU chip. A modern CPU typically has three cache levels: L1 and L2 are specific to each core, while L3 (or Last Level Cache (LLC)) is shared among all cores of a CPU.
Scalability is a property of software that describes how well an application can use an increasing number of hardware resources. Good scalability means that the runtime decreases as more and more cores are used to solve the problem. Typically, applications reach an upper bound on the number of cores beyond which they stop scaling, which means that the execution time stops going down (or might even increase again).
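A minimal sketch of how such a scaling limit might be spotted from measurements follows. It uses the commonly used metrics speedup (runtime on one core divided by runtime on p cores) and parallel efficiency (speedup divided by p). The runtimes are invented for illustration and do not come from any real application.

```python
# Hypothetical wall-clock times in seconds, keyed by number of cores.
# These values are made up purely to illustrate the calculation.
runtimes = {1: 100.0, 2: 52.0, 4: 28.0, 8: 17.0, 16: 15.0, 32: 16.0}


def speedup(t_serial, t_parallel):
    """How many times faster the parallel run is than the one-core run."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, cores):
    """Speedup per core; 1.0 would be perfect scaling."""
    return speedup(t_serial, t_parallel) / cores


def best_core_count(times):
    """Core count with the shortest measured runtime."""
    return min(times, key=times.get)


t1 = runtimes[1]
for cores in sorted(runtimes):
    t = runtimes[cores]
    print(f"{cores:>3} cores: {t:6.1f} s  "
          f"speedup {speedup(t1, t):5.2f}  "
          f"efficiency {efficiency(t1, t, cores):4.2f}")

# Beyond this core count the runtime no longer decreases in the sample data.
print("stops scaling beyond", best_core_count(runtimes), "cores")
```

In the sample data the runtime rises again from 16 to 32 cores, which is the "stops scaling" situation described above.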