Building LLVM/Clang with OpenMP Offloading to NVIDIA GPUs

From HPC Wiki
 
[[Category:HPC-Developer]]
 
 
This guide describes how to build the Clang compiler with OpenMP support for offloading computational tasks to NVIDIA GPUs. A working Linux environment with GCC (8.3.0) and CMake (3.15.6) is assumed for the build process. LLVM/Clang ([https://github.com/llvm/llvm-project/releases 10.0.0] or later) is recommended, because [https://github.com/pc2/OMP-Offloading our tests] uncovered several bugs relevant to OpenMP GPU-offloading in earlier versions of LLVM/Clang.
 
 
 
== Determine GPU(s) on Compute Node ==
 
 
First of all, we need to check whether the GPU(s) on a compute node are correctly identified, using the command <code>nvidia-smi</code>. As an example, the output below shows two NVIDIA GeForce RTX 2080 Ti GPUs on one compute node of the OCuLUS system at the [https://pc2.uni-paderborn.de/ Paderborn Center for Parallel Computing], Paderborn University, Germany.
 
 
<pre>
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:03:00.0 Off |                  N/A |
| 31%   35C    P0    64W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:84:00.0 Off |                  N/A |
| 35%   34C    P0    35W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
</pre>
 
  
As can be seen, the NVIDIA driver version is 440.33.01 and the CUDA version is 10.2. We are now ready to build LLVM/Clang with OpenMP support for GPU-offloading.
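If only a quick overview is needed, <code>nvidia-smi -L</code> prints one line per GPU. A minimal sketch (the fallback message is merely for machines without the NVIDIA driver installed):

```shell
# One line per detected GPU, e.g. "GPU 0: GeForce RTX 2080 Ti (UUID: ...)".
# Falls back to a notice when the NVIDIA driver is not installed.
nvidia-smi -L 2>/dev/null || echo "nvidia-smi not available on this machine"
```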
 
 
  
== Download LLVM/Clang (10.0.0 or later) ==
 
 
LLVM/Clang (10.0.0) can be obtained by running:
 
 
 
<syntaxhighlight lang="bash">
curl -Ls https://github.com/llvm/llvm-project/archive/llvmorg-10.0.0.tar.gz | tar zxf -
</syntaxhighlight>
 
 
 
Alternatively, the latest source code can be cloned from GitHub:
 
  
<syntaxhighlight lang="bash">
git clone https://github.com/llvm/llvm-project.git
</syntaxhighlight>
 
 
== Build the Compiler ==
 
 
To support OpenMP GPU-offloading, LLVM/Clang is built in two steps: first compile LLVM/Clang with GCC, then use the resulting compiler to bootstrap LLVM/Clang itself.
 
 
=== Build LLVM/Clang with GCC ===
 
 
The following commands compile and install the Clang compiler along with some supporting libraries. See https://llvm.org/docs/ for an explanation of the CMake options.
 
<pre>
cmake                                                                          \
  -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra;libcxx;libcxxabi;lld;openmp" \
  -DCMAKE_BUILD_TYPE=Release                                                   \
  -DLLVM_TARGETS_TO_BUILD="X86;NVPTX"                                          \
  -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_61                                      \
  -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=35,37,50,52,60,61,70,75            \
  -DCMAKE_C_COMPILER=gcc                                                       \
  -DCMAKE_CXX_COMPILER=g++                                                     \
  -G "Unix Makefiles" the-llvm-project-directory/llvm
make -j 16
make install
</pre>
 
 
=== Bootstrap LLVM/Clang ===
 
 
The following commands bootstrap Clang with itself. Note that GNU's libstdc++ (rather than LLVM's libc++) is used during linking.
 
<pre>
cmake                                                                          \
  -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra;libcxx;libcxxabi;lld;openmp" \
  -DCMAKE_BUILD_TYPE=Release                                                   \
  -DLLVM_TARGETS_TO_BUILD="X86;NVPTX"                                          \
  -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_61                                      \
  -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=35,37,50,52,60,61,70,75            \
  -DCMAKE_C_COMPILER=clang                                                     \
  -DCMAKE_CXX_COMPILER=clang++                                                 \
  -G "Unix Makefiles" the-llvm-project-directory/llvm
make -j 16
make install
</pre>
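After the bootstrap's <code>make install</code>, it is worth checking that the compiler and the per-architecture device runtime bitcode libraries were actually installed. A minimal sketch, assuming the default <code>/usr/local</code> install prefix (adjust <code>PREFIX</code> if you passed <code>CMAKE_INSTALL_PREFIX</code> to CMake):

```shell
# Check the installed compiler and the device runtime bitcode libraries
# (libomptarget-nvptx-sm_*.bc). PREFIX is an assumption; override as needed.
PREFIX=${PREFIX:-/usr/local}
"$PREFIX/bin/clang" --version 2>/dev/null || echo "clang not found under $PREFIX/bin"
ls "$PREFIX"/lib/libomptarget-nvptx-sm_*.bc 2>/dev/null \
    || echo "no libomptarget bitcode libraries under $PREFIX/lib"
```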
 
  
 
== Done ==
 
  
We now have a working Clang installation with OpenMP GPU-offloading support. Code samples of OpenMP GPU-offloading and more information can be found at https://github.com/pc2/OMP-Offloading.
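For a quick smoke test, something along the following lines can be used (a sketch, not one of the official samples; the file name is hypothetical, and the compilation step is skipped when no suitable Clang is on the PATH):

```shell
# Write a tiny OpenMP target-offload test program (hypothetical file name).
cat > omp_offload_test.c <<'EOF'
#include <stdio.h>
int main(void) {
    int sum = 0;
    /* Executed on the GPU when offloading works; OpenMP falls back
       to the host if no device is available. */
    #pragma omp target teams distribute parallel for reduction(+:sum) map(tofrom:sum)
    for (int i = 0; i < 1000; ++i)
        sum += 1;
    printf("sum = %d\n", sum);  /* expect sum = 1000 */
    return sum == 1000 ? 0 : 1;
}
EOF
# Compile with the freshly built Clang, if it is available:
if command -v clang >/dev/null 2>&1; then
    clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda \
          omp_offload_test.c -o omp_offload_test \
        || echo "compilation failed (offloading support missing?)"
else
    echo "clang not on PATH; skipping compilation"
fi
```

If the program prints <code>sum = 1000</code> when run on a GPU node, offloading works end to end.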
 
 
 

Revision as of 09:16, 6 May 2020

Clang 7.0, released in September 2018, has support for offloading to NVIDIA GPUs. These instructions will guide you through the process of building the Clang compiler on Linux. While this page refers to version 7.0, it should be applicable (with possibly minor adaptations) to later versions. It's recommended to get the latest release from https://releases.llvm.org/!

Determine GPU Architectures

As of this writing, Clang's OpenMP implementation for NVIDIA GPUs doesn't support multiple GPU architectures in a single binary. This means that you have to know the target GPU when compiling an OpenMP application. Additionally, Clang needs compatible runtime libraries for every architecture that you'll want to use in the future.

So first of all you need to gather a list of GPU models that you are going to run on and map them to a list of architectures. A clearly structured table can be found on Wikipedia or in NVIDIA's developer documentation. As an example, the "Tesla P100" has compute capability 6.0 while the more recent Volta GPU "Tesla V100" is listed with 7.0.
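If the machine is already accessible, newer NVIDIA drivers can also report the compute capability directly through nvidia-smi. The `compute_cap` query field is an assumption about the driver version (it does not exist in drivers contemporary with Clang 7.0), so the sketch below falls back to pointing at the tables:

```shell
# Print "name, compute_cap" per GPU; the compute_cap field only exists in
# newer drivers, and nvidia-smi itself may be absent, hence the fallback.
nvidia-smi --query-gpu=name,compute_cap --format=csv 2>/dev/null \
    || echo "query not supported here; look the model up in the tables above"
```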

Install Prerequisites

Building LLVM requires some software:

  • First you'll need some standard tools like make, tar, and xz. If you don't have them installed, please consult your distribution's instructions on how to get them.
  • For the build process a compiler already needs to be installed. Most Linux systems default to the GNU Compiler Collection (gcc). Please ensure that you have at least version 4.8 or refer to some online tutorials on how to install one for your system. If you happen to have an older installation of Clang, any version greater than version 3.1 should be fine.
  • Additionally LLVM requires a (more or less) recent CMake, at least version 3.4.3. If your distribution doesn't provide an adequate version, see https://cmake.org/ on how to get it.
  • For the runtime libraries the system needs both libelf and its development headers.
  • Last but not least, you'll need the CUDA toolkit by NVIDIA. However the latest CUDA 10.0 is not yet compatible with Clang 7.0. For that release it's recommended to use version 9.2. This release also has support for Volta GPUs which may already be found in some HPC systems.

Download and Extract Sources

The LLVM project consists of multiple components. For the purpose of this guide, you need at least the LLVM Core libraries, Clang and the OpenMP project. Download their tarballs from https://releases.llvm.org/:

 $ wget https://releases.llvm.org/7.0.0/llvm-7.0.0.src.tar.xz
 $ wget https://releases.llvm.org/7.0.0/cfe-7.0.0.src.tar.xz
 $ wget https://releases.llvm.org/7.0.0/openmp-7.0.0.src.tar.xz

You might also want to download and build compiler-rt:

 $ wget https://releases.llvm.org/7.0.0/compiler-rt-7.0.0.src.tar.xz

This will give you some runtime libraries that are required to use Clang's sanitizers. A detailed explanation would go beyond the scope of this page, but you can take a look at the documentation of ASan, LSan, MSan, and TSan. (Please keep in mind that these links document the current development, so not all features might be available in a released version!)

It's highly recommended to verify the integrity of the downloaded archives. Each file has been signed by the release manager and you can find both the public key and .sig files next to the files you have just downloaded.

The next step is to unpack the tarballs: (the last step may be skipped if you don't want to build compiler-rt)

 $ tar xf llvm-7.0.0.src.tar.xz
 $ tar xf cfe-7.0.0.src.tar.xz
 $ tar xf openmp-7.0.0.src.tar.xz
 $ tar xf compiler-rt-7.0.0.src.tar.xz

This should leave you with three or four directories named llvm-7.0.0.src, cfe-7.0.0.src, openmp-7.0.0.src, and (optionally) compiler-rt-7.0.0.src. All these components can be built together if the directories are correctly nested:

 $ mv cfe-7.0.0.src llvm-7.0.0.src/tools/clang
 $ mv openmp-7.0.0.src llvm-7.0.0.src/projects/openmp
 $ mv compiler-rt-7.0.0.src llvm-7.0.0.src/projects/compiler-rt

Again the last step is optional if you are skipping compiler-rt.

Build the Compiler

With the sources in place let's proceed to configure and build the compiler. Projects using CMake are usually built in a separate directory:

 $ mkdir build
 $ cd build

The next steps will be pretty IO-intensive, so it might be a good idea to put the build directory on a locally attached disk (or even an SSD).

Next CMake needs to generate Makefiles which will eventually be used for compilation:

 $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$(pwd)/../install \
	-DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_60 \
	-DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=35,60,70 ../llvm-7.0.0.src

Of course you can use any other Generator that CMake supports.

The first two flags are standard for CMake projects: CMAKE_BUILD_TYPE=Release turns on optimizations and disables debug information. CMAKE_INSTALL_PREFIX specifies where the final binaries and libraries will be installed. Be sure to choose a permanent location if you are building in a temporary directory.

The other two options are related to the GPU architectures mentioned above. CLANG_OPENMP_NVPTX_DEFAULT_ARCH sets the default architecture used when none is passed during compilation. You should adjust the default to match the environment you'll be using most of the time. The architecture must be prefixed with sm_, so Clang configured with the above command will build for the Tesla P100 by default.
LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES applies to the runtime libraries: it specifies the list of architectures that the libraries will be built for. As you cannot run on GPUs without a compatible runtime, you should pass all architectures you care about. Also note that the values are passed without the dot, so compute capability 7.0 becomes 70.
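To compile for an architecture other than the configured default, Clang accepts -Xopenmp-target -march=sm_XX on the command line. A sketch for a Tesla V100 (sm_70); the source file saxpy.c is a placeholder for your own OpenMP program, and the command is skipped when Clang or the file is not present:

```shell
# Override the default sm_60 and generate device code for sm_70 (Tesla V100).
# saxpy.c is a hypothetical placeholder for your own OpenMP source file.
if command -v clang >/dev/null 2>&1 && [ -f saxpy.c ]; then
    clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda \
          -Xopenmp-target -march=sm_70 saxpy.c -o saxpy \
        || echo "compilation failed (offloading support missing?)"
else
    echo "skipping: clang or saxpy.c not present"
fi
```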

If everything went right you should see something like the following towards the end of the output:

-- Found LIBOMPTARGET_DEP_LIBELF: /usr/lib64/libelf.so
-- Found PkgConfig: /usr/bin/pkg-config (found version "0.27.1") 
-- Found LIBOMPTARGET_DEP_LIBFFI: /usr/lib64/libffi.so
-- Found LIBOMPTARGET_DEP_CUDA_DRIVER: <<<REDACTED>>>/libcuda.so
-- LIBOMPTARGET: Building offloading runtime library libomptarget.
-- LIBOMPTARGET: Not building aarch64 offloading plugin: machine not found in the system.
-- LIBOMPTARGET: Building CUDA offloading plugin.
-- LIBOMPTARGET: Not building PPC64 offloading plugin: machine not found in the system.
-- LIBOMPTARGET: Not building PPC64le offloading plugin: machine not found in the system.
-- LIBOMPTARGET: Building x86_64 offloading plugin.
-- LIBOMPTARGET: Building CUDA offloading device RTL.

In this case the system also has libffi installed which allows building a plugin that offloads to the host (here: x86_64). This is mostly used for testing and not required for offloading to GPUs.

Now comes the time-consuming part:

 $ make -j8

Using the -j parameter (short for --jobs) you can allow make to run multiple commands concurrently. Usually the number of cores in your server is a reasonable choice which can speed up the compilation by a good deal.

Afterwards the built libraries and binaries need to be installed:

 $ make -j8 install

Rebuild the OpenMP Runtime Libraries with Clang

If you tried to compile an application with OpenMP offloading right now, Clang would print the following message:

clang-7: warning: No library 'libomptarget-nvptx-sm_60.bc' found in the default clang lib directory or in LIBRARY_PATH. Expect degraded performance due to no inlining of runtime functions on target devices. [-Wopenmp-target]

As you'd expect from a warning, you can run perfectly fine without these "bitcode libraries". However, GPUs are meant as accelerators, so you want your application to run as fast as possible. To get the missing libraries you'll need to recompile the OpenMP project using the Clang built in the previous step.

Instead of only rebuilding the OpenMP project, it's also possible to repeat step 3 entirely. That's referred to as "bootstrapping" because Clang compiles its own source code, and it is usually preferred when installing a released version of a compiler. The following, however, explains building only the OpenMP runtime libraries, which gets you the required files much faster.

To do so, first create a new build directory:

 $ cd ..
 $ mkdir build-openmp
 $ cd build-openmp

Now configure the project with CMake using the Clang compiler built in the previous step:

 $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$(pwd)/../install \
	-DCMAKE_C_COMPILER=$(pwd)/../install/bin/clang \
	-DCMAKE_CXX_COMPILER=$(pwd)/../install/bin/clang++ \
	-DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=35,60,70 \
	../llvm-7.0.0.src/projects/openmp

The flags are the same as above except that we want to use a different compiler. With CMake this can be adjusted with CMAKE_C_COMPILER and CMAKE_CXX_COMPILER. If you installed the binaries to a different location, you need to adapt their values accordingly.

Build and install the OpenMP runtime libraries:

 $ make -j8
 $ make -j8 install

This should give you some libomptarget-nvptx-sm_??.bc libraries as mentioned in the warning message.

Done

Following the instructions up to this point you should now have a fully working Clang compiler with support for OpenMP offloading!

This guide was originally published as a blog post: https://www.hahnjo.de/blog/2018/10/08/clang-7.0-openmp-offloading-nvidia.html