Difference between revisions of "How to Use OpenMP"

Latest revision as of 09:17, 4 May 2020

Basics

This page gives a general overview of how to compile and execute a program that has been parallelized with OpenMP. Unlike MPI, OpenMP does not require you to load any modules, but your compiler must support OpenMP (most compilers do).

How to Compile OpenMP Code

Additional vendor-specific (and sometimes version-specific) compiler flags tell the compiler to enable OpenMP. Otherwise, the OpenMP pragmas in the code will be ignored by the compiler.

Depending on which compiler you have loaded, use one of the flags below to compile your code.

Compiler	Flag
GNU	`-fopenmp`
Intel	`-qopenmp`
Clang	`-fopenmp`
Oracle	`-xopenmp`
NAG Fortran	`-openmp`

For example: if you plan to use an Intel compiler for your OpenMP code written in C, you have to type this to create an application called omp_code.exe:

$ icc -qopenmp omp_code.c -o omp_code.exe
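In a build script, the table above can be turned into a small lookup so the flag does not have to be remembered per compiler. The following is only a sketch for a POSIX shell: the function name `omp_flag_for` and the variable `intel_flag` are hypothetical, and the keys are simply the vendor labels from the table.

```shell
# Map a compiler family (labels as in the table above) to its OpenMP flag.
# omp_flag_for is a hypothetical helper name, not part of any toolchain.
omp_flag_for() {
    case "$1" in
        GNU|Clang)     echo "-fopenmp" ;;
        Intel)         echo "-qopenmp" ;;
        Oracle)        echo "-xopenmp" ;;
        "NAG Fortran") echo "-openmp" ;;
        *)             echo "unknown compiler family: $1" >&2; return 1 ;;
    esac
}

intel_flag=$(omp_flag_for Intel)
echo "Intel flag: $intel_flag"

# Example use in a build line (not executed here):
#   icc "$intel_flag" omp_code.c -o omp_code.exe
```

Because the flags are sometimes version-specific, a lookup like this should be checked against the documentation of the compiler version you actually have loaded.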

How to Run an OpenMP Application

Setting `OMP_NUM_THREADS`

If you do not set OMP_NUM_THREADS yourself, the default value of your cluster environment is used. In many cases this default is 1, so your program runs serially. If the variable is not set at all, the OpenMP runtime may instead decide to use all cores of your computer, which is not always the expected outcome, so it is a good idea to always set a meaningful value.

One way to specify the number of threads is to set the variable on the command line when you launch the executable. To start the parallel regions of the example program above with 12 threads, type:

$ OMP_NUM_THREADS=12 ./omp_code.exe

This sets the environment variable OMP_NUM_THREADS to 12 for the execution of omp_code.exe only. Once omp_code.exe has finished, the variable has its previous value again.
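This scoping can be checked without an OpenMP program at all. The sketch below assumes a POSIX shell and uses a child `sh -c` process as a stand-in for omp_code.exe; the variable names `one_off` and `after` are illustrative only.

```shell
# Start from a known state: the variable is not set in this shell.
unset OMP_NUM_THREADS

# Prefix form: only the child process sees the value 12.
one_off=$(OMP_NUM_THREADS=12 sh -c 'echo "${OMP_NUM_THREADS:-unset}"')
echo "inside the command: $one_off"

# The calling shell itself is unchanged afterwards.
after="${OMP_NUM_THREADS:-unset}"
echo "after the command:  $after"
```

The first line printed reports 12 and the second reports unset, which shows that the prefix assignment affects only the launched command.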

Another way to set the number of threads is to export the environment variable in your shell. This example sets it to 24 threads and overrides the default value:

$ export OMP_NUM_THREADS=24

If you simply run your application with $ ./omp_code.exe next, this value will be used automatically.
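A minimal sketch of how an exported value behaves, assuming a POSIX shell and again using a child `sh -c` process as a stand-in for omp_code.exe (the names `inherited` and `overridden` are illustrative): every later command inherits the exported value, and a prefix assignment can still override it for a single run.

```shell
# Exported once, the value is inherited by every command started afterwards.
export OMP_NUM_THREADS=24
inherited=$(sh -c 'echo "$OMP_NUM_THREADS"')
echo "inherited by a later command: $inherited"

# A prefix assignment overrides the exported value for one run only.
overridden=$(OMP_NUM_THREADS=4 sh -c 'echo "$OMP_NUM_THREADS"')
echo "overridden for one run:       $overridden"

# The exported value is still in place in the calling shell.
echo "still set in this shell:      $OMP_NUM_THREADS"
```

The exported value lasts for the current shell session (or job script) only; a new login shell starts from the cluster default again unless the export is placed in a startup or job script.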

Thread Pinning

The performance of your application may depend on how its threads are distributed across the available cores. See the Binding/Pinning page to learn more about thread pinning in order to minimize the execution time.

@@ Line 1: / Line 1: @@
+[[Category:HPC-User]]
 == Basics ==
-This will give you a general overview of how to compile and execute a program that has been [[Parallel_Programming|parallelized]] with [[OpenMP]].
+This page will give you a general overview of how to compile and execute a program that has been [[Parallel_Programming|parallelized]] with [[OpenMP]].
-As opposed to [[How_to_Use_MPI|MPI]], you do not have to load any modules to use OpenMP.
+Unlike [[How_to_Use_MPI|MPI]], OpenMP does not require you to load any modules, but your compiler must support OpenMP (most compilers do).
+__TOC__
 == How to Compile OpenMP Code ==
-Additional compiler flags tell the compiler to enable OpenMP. Otherwise, the OpenMP pragmas in the code will be ignored by the compiler.
+Additional vendor-specific (and sometimes version-specific) compiler flags tell the compiler to enable OpenMP. Otherwise, the OpenMP pragmas in the code will be ignored by the compiler.
 Depending on which compiler you have loaded, use one of the flags below to compile your code.
@@ Line 13: / Line 17: @@
 | Compiler || Flag
 |-
-| GNU || -fopenmp
+| GNU || <code>-fopenmp</code>
 |-
-| Intel || -openmp
+| Intel || <code>-qopenmp</code>
 |-
-| Oracle || -xopenmp
+| Clang  || <code>-fopenmp</code>
+|-
+| Oracle || <code>-xopenmp</code>
+|-
+| NAG Fortran || <code>-openmp</code>
 |}
-For example: if you plan to use an Intel compiler for your OpenMP code written in C, you have to type this to create an application called "omp_code.exe":
+For example: if you plan to use an Intel compiler for your OpenMP code written in C, you have to type this to create an application called <code>omp_code.exe</code>:
-  $ icc -fopenmp omp_code.c -o omp_code.exe
+  $ icc -qopenmp omp_code.c -o omp_code.exe
 == How to Run an OpenMP Application ==
-=== Setting OMP_NUM_THREADS ===
+=== Setting <code>OMP_NUM_THREADS</code> ===
-If you forget to set OMP_NUM_THREADS to any value, the default value of your cluster environment will be used. In most cases, the default is 1, so that your program is executed serially.
+If you do not set <code>OMP_NUM_THREADS</code> yourself, the default value of your cluster environment is used. In many cases this default is ''1'', so your program runs serially. If the variable is not set at all, the OpenMP runtime may instead decide to use ''all'' cores of your computer, which is not always the expected outcome, so it is a good idea to always set a meaningful value.
 One way to specify the number of threads is to set the variable on the command line when you launch the executable. To start the parallel regions of the example program above with 12 threads, type:
   $ OMP_NUM_THREADS=12 ./omp_code.exe
-This automatically sets the environment variable OMP_NUM_THREADS to 12, but it is reset to its default value after the execution of "omp_code.exe" finished.
+This sets the environment variable <code>OMP_NUM_THREADS</code> to ''12'' for the execution of <code>omp_code.exe</code> only. Once <code>omp_code.exe</code> has finished, the variable has its previous value again.
 Another way to set the number of threads is to export the environment variable in your shell. This example sets it to 24 threads and overrides the default value:
@@ Line 38: / Line 45: @@
 If you simply run your application with <code>$ ./omp_code.exe</code> next, this value will be used automatically.
+=== [[Binding/Pinning#How_to_Pin_Threads_in_OpenMP|Thread Pinning]] ===
-== Thread Pinning ==
+The performance of your application may be improved depending on the distribution of threads. Go [[Binding/Pinning|here]] to learn more about thread pinning in order to minimize the execution time.
-[[File:Omp_places.png|thumb|350px|Schematic of how <code>OMP_PLACES={0}:8:2</code> would be interpreted]]
-[[File:Proc_bind_close.PNG|thumb|350px|Schematic of how <code>OMP_PROC_BIND=close</code> would be interpreted on a system comprising 2 nodes with 4 hardware threads each]]
-[[File:Proc_bind_spread.PNG|thumb|350px|Schematic of <code>OMP_PROC_BIND=spread</code> and an remote memory access, if thread 0 and 1 work on the same data]]
-Threads are "pinned" by setting certain OpenMP-related environment variables. It is an advanced way to control how your system distributes the threads across the available cores, with the purpose of improving the performance of your application or avoiding costly memory accesses by keeping the threads close to each other.
-OMP_PLACES is employed to specify places on the machine where the threads are put. However, this variable on its own does not determine thread pinning completely, because your system still won't know in what pattern to assign the threads to the given places. Therefore, you also need to set OMP_PROC_BIND.
-OMP_PROC_BIND specifies a binding policy which basically sets criteria by which the threads are distributed.
-If you want to get a schematic overview of your cluster's hardware, e. g. to figure out how many hardware threads there are, type: <code>$ lstopo</code>
-=== OMP_PLACES ===
-This variable can hold two kinds of values: a name specifying (hardware) places, or a list that marks places.
-{| class="wikitable" style="width:40%;"
-| Abstract name || Meaning
-|-
-| <code>threads</code> || a place is a single hardware thread, i. e. the hyperthreading will be ignored
-|-
-| <code>cores</code> || a place is a single core with its corresponding amount of hardware threads
-|-
-| <code>sockets</code> || a place is a single socket
-|}
-In order to define specific places by an interval, OMP_PLACES can be set to <code><lowerbound>:<length>:<stride></code>.
-All of these three values are non-negative integers and must not exceed your system's bounds. The value of <code><lowerbound></code> can be defined as a list of hardware threads. As an interval, <code><lowerbound></code> has this format: <code>{<starting_point>:<length>}</code> that can be a single place, or a place that holds several hardware threads, which is indicated by <code><length></code>.
-{| class="wikitable" style="width:80%;"
-| Example hardware || OMP_PLACES || Places
-|-
-| 24 cores with one hardware thread each, starting at core 0 and using every 2nd core || <code>{0}:24:2</code> or <code>{0:1}:24:2</code> || <code>{0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22}</code>
-|-
-| 12 cores with two hardware threads each, starting at the first two hardware threads on the first core ({0,1}) and using every 4th core || <code>{0,1}:12:4</code> or <code>{0:2}:12:1</code> || <code>{{0,1}, {4,5}, {8,9}, {12,13}, {16,17}, {20,21}}</code>
-|}
-You can also determine these places with a comma-separated list. Say there are 8 cores available with one hardware thread each, and you would like to execute your application on the first four cores, you could define this:
- $ export OMP_PLACES={0, 1, 2, 3}
-=== OMP_PROC_BIND ===
-Now that you have set OMP_PROC_BIND, you can now define the order in which the places should be assigned. This is especially useful for NUMA systems because some threads may have to access remote memory, which will slow your application down significantly. If OMP_PROC_BIND is not set, your system will distribute the threads across the nodes and cores randomly.
-{| class="wikitable" style="width:60%;"
-| Value || Function
-|-
-| <code>true</code> || the threads should not be moved
-|-
-| <code>false</code> || the threads can be moved
-|-
-| <code>master</code> || worker threads are in the same partition as the master
-|-
-| <code>close</code> || worker threads are close to the master in contiguous partitions, e. g. if the master is occupying hardware thread 0, worker 1 will be placed on hw thread 1, worker 2 on hw thread 2 and so on
-|-
-| <code>spread</code> || workers are spread across the available places to maximize the space inbetween two neighbouring threads
-|}