HPC-Dictionary

From HPC Wiki

Unix

Unix describes a family of operating systems. Popular Unix-like representatives include Linux distributions such as Ubuntu and CentOS as well as macOS, although the latter is not common on HPC systems. Its key features are its shell and its file system.

File System

The file system describes the directory structure of an operating system. On Unix-based systems the topmost directory is /, which is called the root directory. As the name suggests, the file system is organized hierarchically (like a tree) from there. Most of the time you will be working in directories below /home/<username>, which is the user's home directory. The user can freely modify everything below /home/<username>.
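The hierarchy described above can be explored from the shell. The following is a minimal sketch; the variable names root_dir and home_dir are chosen for illustration only, and the location of the home directory may differ between facilities.

```shell
# Change to the root directory, the top of the hierarchy.
cd /
root_dir=$(pwd)
echo "$root_dir"    # prints /

# cd without arguments changes to the home directory stored in $HOME,
# which usually looks like /home/<username>.
cd
home_dir=$(pwd)
echo "$home_dir"
```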

Environment Variable

An environment variable is a named, dynamic value stored in the environment of a running process. On Unix-based operating systems you can:

  • set the value of a variable with: export <variable-name>=<value>
  • read the value of a variable with: echo $<variable-name>

Environment variables can be referenced by software (or the user) to get or set information about the system. The following examples give an idea of their use.

Common Environment Variables on Unix Systems

Environment Variable    Content
$USER                   your current username
$PWD                    the directory you are currently in
$HOSTNAME               hostname of the computer you are on
$HOME                   your home directory
$PATH                   list of directories searched for a program when a command is executed
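As a short illustration of setting and reading variables, consider the sketch below. EXAMPLE_DIR and the directory $HOME/bin are hypothetical names chosen for this example; they are not defined by the system.

```shell
# Set a variable; EXAMPLE_DIR is a hypothetical name used only here.
export EXAMPLE_DIR="/tmp/example"

# Read it back.
echo "$EXAMPLE_DIR"    # prints /tmp/example

# Read one of the common variables: the current working directory.
echo "$PWD"

# PATH is a colon-separated list; print one directory per line.
echo "$PATH" | tr ':' '\n'

# Prepend a hypothetical directory so that programs in it are found first.
export PATH="$HOME/bin:$PATH"
```

A variable set with export is visible to the current shell and to the programs started from it, but not to other, already running shells.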

Cluster

A cluster is a collection of multiple nodes connected via a network that offers high-bandwidth, low-latency communication. A cluster is accessed by connecting to one of its login nodes.

Node

Visualization of a typical hardware hierarchy on a cluster

A node is an individual computer consisting of one or more sockets.

Backend Node

Backend nodes are reserved for executing memory-demanding and long-running applications. They are the most powerful, but also the most power-consuming part of a cluster, as they make up around 98% of it. Since users cannot access these nodes directly, a scheduler manages access to them. To run on these nodes, a batch job needs to be submitted to the batch system via a scheduler-specific command.
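What a batch job looks like depends on the scheduler. As a rough sketch, assuming the facility uses the Slurm scheduler (an assumption, since the text above does not name one), a job script could look like the following. The job name, the resource values and the output file name are hypothetical and would need to be adapted.

```shell
#!/bin/bash
# Lines starting with #SBATCH are read by Slurm (assumed scheduler);
# to a plain shell they are ordinary comments.
# Hypothetical job name:
#SBATCH --job-name=example
# Number of backend nodes requested:
#SBATCH --nodes=1
# Number of processes (tasks) to start:
#SBATCH --ntasks=1
# Wall-clock time limit (hh:mm:ss):
#SBATCH --time=00:10:00
# File for the job's output; %j is replaced by the job ID:
#SBATCH --output=example.%j.out

# The commands below run on the allocated backend node, not on the login node.
echo "Job running on node: $(uname -n)"
```

With Slurm, such a script would be submitted from a login node with sbatch followed by the script's file name. Other schedulers use different directives and submission commands.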

Copy Node

Copy nodes are reserved for transferring data to or from a cluster. They usually offer a better connection than other nodes and minimize the disturbance of other users on the system. Depending on the facility, the software installed on these nodes may differ from that on other nodes because of their restricted use case, and not every facility provides a designated copy node at all. As an alternative, login nodes may be used to move data between systems.

Frontend Node

Synonym for login node.

Login Node

Login nodes are reserved for connecting to the cluster of a facility. Most of the time they can also be used for testing and for interactive tasks (e.g. analyzing previously collected application profiles). Such test runs should generally take no more than a few minutes and should only verify that your software runs correctly on the system and in its environment before you submit batch jobs to the batch system.

Socket

A socket is the physical package that encloses multiple cores, which share the same memory.

Core

A core has one or more hardware threads and is responsible for executing instructions.
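On a Linux node, the hierarchy of sockets, cores and hardware threads can be inspected from the shell. The following is a minimal sketch that assumes the nproc utility is installed; lscpu is used only if it is present, and the variable name logical_cpus is chosen for illustration.

```shell
# Number of processing units (hardware threads) available to the current process.
logical_cpus=$(nproc)
echo "processing units: $logical_cpus"

# If lscpu is installed, show sockets, cores per socket and threads per core.
# The trailing "|| true" keeps the sketch from failing if no line matches.
if command -v lscpu >/dev/null 2>&1; then
    lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core' || true
fi
```

Multiplying sockets, cores per socket and threads per core gives the number of hardware threads of the node, which may be larger than the value nproc reports if the current process is restricted to a subset of them.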

Thread

A thread is a sequence of instructions executed within a process. Several threads can belong to a single process; they share its address space, but each thread has its own stack.

Central Processing Unit (CPU)

The word "CPU" is widely used in the field of HPC though not precisely defined. It is mostly used to describe the concrete hardware architecture of a node, but should generally be avoided due to possible misunderstandings and ambiguities.

Random Access Memory (RAM)

Visualization of the memory hierarchy a.k.a. the memory pyramid

The RAM is used as working memory for the cores. It is volatile memory, meaning that its contents are lost when the power is switched off; likewise, the data a process keeps in RAM is no longer available after the process ends. The RAM is shared between all sockets on a node, though it is physically separated for each socket.

Cache

A cache is a relatively small amount of fast memory (compared to RAM) located on the CPU chip. A modern CPU typically has three cache levels: L1 and L2 are private to each core, while L3 (also called the Last Level Cache, LLC) is shared among the cores of a CPU.

Scalability

Scalability describes how well an application can use an increasing amount of hardware resources.

Good scalability in general means reduced runtimes when more and more cores are used to solve the same or larger problems.

Typically, applications hit an upper limit of cores beyond which they do not scale further, i.e. more cores do not lead to lower runtimes (or even increase them again).

However, good scalability can also mean that the execution time remains the same when the problem size grows in proportion to the hardware resources.

Strong Scalability

An application scales strongly if it keeps achieving shorter runtimes as more and more processor cores work on a problem of the same size.

Weak Scalability

An application scales weakly if, given more and more processor cores, it can tackle proportionally larger problems in about the same runtime, even though runtimes for a problem of fixed size no longer decrease.
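Both notions are commonly quantified through speedup and parallel efficiency. The definitions below are a standard sketch and are not part of the original entry; the symbols are introduced here for illustration. T(N) denotes the measured runtime on N cores for a fixed problem size, and T_w(N) denotes the runtime on N cores when the problem is N times as large as on one core.

```latex
\begin{align*}
S(N) &= \frac{T(1)}{T(N)} \\
E_{\mathrm{strong}}(N) &= \frac{S(N)}{N} = \frac{T(1)}{N\,T(N)} \\
E_{\mathrm{weak}}(N) &= \frac{T_w(1)}{T_w(N)}
\end{align*}
```

Ideal strong scaling corresponds to S(N) = N, ideal weak scaling to a constant runtime, i.e. E_weak(N) = 1. As a worked example with hypothetical measurements T(1) = 100 s and T(4) = 30 s for a fixed problem size, the speedup is S(4) ≈ 3.33 and the strong-scaling efficiency is about 0.83.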