Binding/Pinning
Basics
Pinning threads (for shared-memory parallelism) or binding processes (for distributed-memory parallelism) is an advanced way of controlling how your system distributes the threads or processes across the available cores. It can improve the performance of your application by avoiding costly remote memory accesses and by keeping the threads or processes close to each other. Threads are "pinned" by setting certain OpenMP-related environment variables, which you can do with this command:
$ export <env_variable_name>=<value>
The terms "thread pinning" and "thread affinity" as well as "process binding" and "process affinity" are used interchangeably. You can bind processes by specifying additional options when executing your Open MPI application.
How to Pin Threads in OpenMP
OMP_PLACES is used to specify the places on the machine where the threads can be put. However, this variable on its own does not determine thread pinning completely, because the system still does not know in which pattern the threads should be assigned to the given places. Therefore, you also need to set OMP_PROC_BIND.
OMP_PROC_BIND specifies the binding policy, i.e. the criteria by which the threads are distributed over the places.
If you want to get a schematic overview of your cluster's hardware, e.g. to figure out how many hardware threads there are, type: $ lstopo
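A quick sketch of the typical workflow (the placement values are explained in the next two sections; OMP_NUM_THREADS sets the thread count and ./omp_app is a placeholder for your OpenMP executable):
$ lstopo                       # inspect the hardware topology first
$ export OMP_PLACES=cores      # one place per physical core
$ export OMP_PROC_BIND=close   # keep threads on neighbouring places
$ export OMP_NUM_THREADS=4     # assumption: run 4 threads, adjust to your machine
$ ./omp_app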
OMP_PLACES
This variable can hold two kinds of values: an abstract name that describes a class of hardware places, or an explicit list of places.
Abstract name | Meaning
threads | a place is a single hardware thread, i.e. hyperthreading is ignored
cores | a place is a single core with its corresponding number of hardware threads
sockets | a place is a single socket
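For example, to make every single hardware thread its own place (cf. the table above), you could set:
$ export OMP_PLACES=threads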
In order to define specific places by an interval, OMP_PLACES can be set to <lowerbound>:<length>:<stride>. All three values are non-negative integers and must not exceed your system's bounds; <length> is the number of places and <stride> is the offset between consecutive places, counted in hardware threads. The <lowerbound> describes the first place as a list of hardware threads: it can be a single hardware thread, written as {<starting_point>}, or a place holding several consecutive hardware threads, written as {<starting_point>:<length>}, where this inner <length> gives the number of hardware threads in the place.
Example hardware | OMP_PLACES | Places
24 cores with one hardware thread each, starting at core 0 and using every 2nd core | {0}:12:2 or {0:1}:12:2 | {0}, {2}, {4}, {6}, {8}, {10}, {12}, {14}, {16}, {18}, {20}, {22}
12 cores with two hardware threads each, starting at the first two hardware threads on the first core ({0,1}) and using every 2nd core | {0,1}:6:4 or {0:2}:6:4 | {0,1}, {4,5}, {8,9}, {12,13}, {16,17}, {20,21}
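For example, to request the first layout from the table above (12 places, every 2nd core, starting at core 0), you could set:
$ export OMP_PLACES="{0}:12:2"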
You can also define these places explicitly with a comma-separated list. Say there are 8 cores available with one hardware thread each and you would like to execute your application on the first four cores; then you could set: $ export OMP_PLACES="{0,1,2,3}"
OMP_PROC_BIND
Now that you have set OMP_PLACES, you can use OMP_PROC_BIND to define how the threads are assigned to those places. This is especially important for NUMA systems (see references below), because otherwise some threads may have to access remote memory, which slows your application down significantly. If OMP_PROC_BIND is not set, the system distributes the threads across the NUMA nodes and cores arbitrarily and may migrate them at runtime.
Value | Function
true | the threads should not be moved
false | the threads can be moved
master | worker threads are placed in the same partition as the master thread
close | worker threads are placed close to the master thread in contiguous partitions, e.g. if the master thread occupies hardware thread 0, worker 1 is placed on hardware thread 1, worker 2 on hardware thread 2, and so on
spread | worker threads are spread across the available places to maximize the space between two neighbouring threads
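Putting both variables together, a minimal sketch for the 24-core example machine from above (the thread count is an assumption, ./omp_app is a placeholder for your OpenMP executable):
$ export OMP_PLACES=cores        # one place per core
$ export OMP_PROC_BIND=spread    # maximize the distance between neighbouring threads
$ export OMP_NUM_THREADS=12      # assumption: 12 threads on 24 cores
$ ./omp_app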
Options for Binding in Open MPI
Binding processes to certain processors can be done by specifying the options below when executing a program. This is a more advanced way of running an application and requires knowledge of your system's architecture, e.g. how many cores there are (for an overview of your hardware topology, use $ lstopo). If none of these options are given, default values are used.
By overriding the default values with the ones specified, you may be able to improve the performance of your application, in case your system distributes the processes in a suboptimal way by default.
Option | Function | Explanation
--bind-to <arg> | bind processes to the processors associated with the given hardware component; <arg> can be one of: none, hwthread, core, l1cache, l2cache, l3cache, socket, numa, board; default values: core for np <= 2, socket otherwise | e.g. in case of l3cache, the processes are bound to those processors that share the same L3 cache
--bind-to-core | bind processes to cores | bind each process to a core
--bind-to-socket | bind processes to sockets | put each process onto a processor socket
--bind-to-none | bind no processes | do not bind any processes, but distribute them freely
--cpus-per-proc <num_cpus> | bind each process to the given number of CPUs | if set to 3, each process will take up 3 CPUs
--report-bindings | report bindings | print the bindings of all launched processes to the console
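For example, a hedged sketch of launching an MPI application with explicit binding (the process count and ./mpi_app are placeholders):
$ mpirun -np 8 --bind-to socket --report-bindings ./mpi_app   # bind each process to a socket and print the resulting bindings to the console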
References
More information on OMP_PLACES and OMP_PROC_BIND
Introduction to OpenMP from PPCES (@RWTH Aachen) Part 3: NUMA & SIMD