Building LLVM/Clang with OpenMP Offloading to NVIDIA GPUs
Clang 7.0, released in September 2018, has support for offloading to NVIDIA GPUs. These instructions will guide you through the process of building the Clang compiler on Linux. While this page refers to version 7.0, it should be applicable (with possibly minor adaptations) to later versions. It's recommended to get the latest release from https://releases.llvm.org/!
Determine GPU Architectures
As of this writing, Clang's OpenMP implementation for NVIDIA GPUs doesn't support multiple GPU architectures in a single binary. This means that you have to know the target GPU when compiling an OpenMP application. Additionally, Clang needs compatible runtime libraries for every architecture that you'll want to use in the future.
So first of all you need to gather a list of GPU models that you are going to run on and map them to a list of architectures. A clearly structured table can be found on Wikipedia or in NVIDIA's developer documentation. As an example, the "Tesla P100" has compute capability 6.0 while the more recent Volta GPU "Tesla V100" is listed with 7.0.
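If you are unsure which architectures your GPUs have, you can also query them directly. The following is only a sketch that assumes the CUDA toolkit (nvcc) and a working driver are already installed; the file name check_cc.cu is just a placeholder:
$ cat > check_cc.cu <<'EOF'
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        // Print the compute capability (major.minor) of every GPU in the system.
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU %d: %s, compute capability %d.%d\n", i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
EOF
$ nvcc check_cc.cu -o check_cc
$ ./check_cc
A "Tesla P100", for example, should report 6.0 here, matching the table mentioned above.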
Install Prerequisites
Building LLVM requires some software:
- First you'll need some standard tools like make, tar, and xz. If you don't have them installed, please consult your distribution's instructions on how to get them.
- For the build process, a compiler already needs to be installed. Most Linux systems default to the GNU Compiler Collection (gcc). Please ensure that you have at least version 4.8, or refer to some online tutorials on how to install one for your system. If you happen to have an older installation of Clang, any version newer than 3.1 should be fine.
- Additionally, LLVM requires a (more or less) recent CMake, at least version 3.4.3. If your distribution doesn't provide an adequate version, see https://cmake.org/ on how to get it.
- For the runtime libraries the system needs both libelf and its development headers.
- Last but not least, you'll need the CUDA toolkit by NVIDIA. However, the latest CUDA 10.0 is not yet compatible with Clang 7.0, so for that release it's recommended to use version 9.2. This release also has support for Volta GPUs which may already be found in some HPC systems.
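To quickly check that the prerequisites are in place, you can query the installed versions from the command line. This is just a sketch; how the libelf development headers are detected depends on your distribution, and the header path shown is only an example:
$ make --version | head -n1
$ gcc --version | head -n1
$ cmake --version | head -n1
$ nvcc --version
# libelf development headers, e.g. via pkg-config or the header itself:
$ pkg-config --modversion libelf
$ ls /usr/include/libelf.h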
Download and Extract Sources
The LLVM project consists of multiple components. For the purpose of this guide, you need at least the LLVM Core libraries, Clang and the OpenMP project. Download their tarballs from https://releases.llvm.org/:
$ wget https://releases.llvm.org/7.0.0/llvm-7.0.0.src.tar.xz
$ wget https://releases.llvm.org/7.0.0/cfe-7.0.0.src.tar.xz
$ wget https://releases.llvm.org/7.0.0/openmp-7.0.0.src.tar.xz
You might also want to download and build compiler-rt:
$ wget https://releases.llvm.org/7.0.0/compiler-rt-7.0.0.src.tar.xz
This will give you some runtime libraries that are required to use Clang's sanitizers. A detailed explanation would go beyond the scope of this page, but you can take a look at the documentation of ASan, LSan, MSan, and TSan. (Please keep in mind that these links document the current development, so not all features might be available in a released version!)
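With compiler-rt in place, a sanitizer is enabled with a single flag at compile time, for example AddressSanitizer (test.c is just a placeholder for your own source file):
$ clang -fsanitize=address -g test.c -o test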
It's highly recommended to verify the integrity of the downloaded archives.
Each file has been signed by the release manager and you can find both the public key and .sig files next to the files you have just downloaded.
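A sketch of the verification using GnuPG follows; the key file name (here assumed to be hans-gpg-key.asc) depends on the release manager, so check the release page for the actual one:
$ wget https://releases.llvm.org/7.0.0/llvm-7.0.0.src.tar.xz.sig
# Import the release manager's public key downloaded from the release page:
$ gpg --import hans-gpg-key.asc
$ gpg --verify llvm-7.0.0.src.tar.xz.sig llvm-7.0.0.src.tar.xz
Repeat this for each of the downloaded tarballs.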
The next step is to unpack the tarballs (the last step may be skipped if you don't want to build compiler-rt):
$ tar xf llvm-7.0.0.src.tar.xz
$ tar xf cfe-7.0.0.src.tar.xz
$ tar xf openmp-7.0.0.src.tar.xz
$ tar xf compiler-rt-7.0.0.src.tar.xz
This should leave you with three or four directories named llvm-7.0.0.src, cfe-7.0.0.src, openmp-7.0.0.src, and (optionally) compiler-rt-7.0.0.src.
All these components can be built together if the directories are correctly nested:
$ mv cfe-7.0.0.src llvm-7.0.0.src/tools/clang
$ mv openmp-7.0.0.src llvm-7.0.0.src/projects/openmp
$ mv compiler-rt-7.0.0.src llvm-7.0.0.src/projects/compiler-rt
Again, the last step is optional if you are skipping compiler-rt.
Build the Compiler
With the sources in place, let's proceed to configure and build the compiler. Projects using CMake are usually built in a separate directory:
$ mkdir build
$ cd build
The next steps will be pretty IO-intensive, so it might be a good idea to put the build directory on a locally attached disk (or even an SSD).
Next, CMake needs to generate Makefiles which will eventually be used for compilation:
$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$(pwd)/../install \
-DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_60 \
-DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=35,60,70 ../llvm-7.0.0.src
Of course, you can use any other generator that CMake supports.
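For example, a configuration using the Ninja generator might look like this (assuming ninja is installed):
$ cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$(pwd)/../install \
    -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_60 \
    -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=35,60,70 ../llvm-7.0.0.src
In that case, replace the make invocations further down with ninja.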
The first two flags are standard for CMake projects: CMAKE_BUILD_TYPE=Release turns on optimizations and disables debug information. CMAKE_INSTALL_PREFIX specifies where the final binaries and libraries will be installed. Be sure to choose a permanent location if you are building in a temporary directory.
The other two options are related to the GPU architectures as mentioned above. CLANG_OPENMP_NVPTX_DEFAULT_ARCH sets the default architecture that is used when no value is passed during compilation. You should adjust the default to match the environment you'll be using most of the time. The architecture must be prefixed with sm_, so Clang configured with the above command will build for the Tesla P100 by default. LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES applies to the runtime libraries: it specifies the list of architectures that the libraries will be built for. As you cannot run on GPUs without a compatible runtime, you should pass all architectures you care about. Also, please note that the values are passed without the dot, so compute capability 7.0 becomes 70.
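When compiling an application later on, the default architecture can be overridden on the command line. A sketch (app.c is only a placeholder):
$ clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda \
    -Xopenmp-target -march=sm_70 app.c -o app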
If everything went right you should see something like the following towards the end of the output:
-- Found LIBOMPTARGET_DEP_LIBELF: /usr/lib64/libelf.so
-- Found PkgConfig: /usr/bin/pkg-config (found version "0.27.1")
-- Found LIBOMPTARGET_DEP_LIBFFI: /usr/lib64/libffi.so
-- Found LIBOMPTARGET_DEP_CUDA_DRIVER: <<<REDACTED>>>/libcuda.so
-- LIBOMPTARGET: Building offloading runtime library libomptarget.
-- LIBOMPTARGET: Not building aarch64 offloading plugin: machine not found in the system.
-- LIBOMPTARGET: Building CUDA offloading plugin.
-- LIBOMPTARGET: Not building PPC64 offloading plugin: machine not found in the system.
-- LIBOMPTARGET: Not building PPC64le offloading plugin: machine not found in the system.
-- LIBOMPTARGET: Building x86_64 offloading plugin.
-- LIBOMPTARGET: Building CUDA offloading device RTL.
In this case the system also has libffi installed which allows building a plugin that offloads to the host (here: x86_64).
This is mostly used for testing and not required for offloading to GPUs.
Now comes the time-consuming part:
$ make -j8
Using the -j parameter (short for --jobs) you can allow make to run multiple commands concurrently. Usually the number of cores in your server is a reasonable choice, which can speed up the compilation considerably.
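If you don't know the number of cores offhand, nproc (part of GNU coreutils) can provide it:
$ make -j$(nproc)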
Afterwards the built libraries and binaries need to be installed:
$ make -j8 install
Rebuild the OpenMP Runtime Libraries with Clang
If you tried to compile an application with OpenMP offloading right now, Clang would print the following message:
clang-7: warning: No library 'libomptarget-nvptx-sm_60.bc' found in the default clang lib directory or in LIBRARY_PATH. Expect degraded performance due to no inlining of runtime functions on target devices. [-Wopenmp-target]
As you'd expect from a warning, you can run perfectly fine without these "bitcode libraries". However, GPUs are meant as accelerators, so you want your application to run as fast as possible. To get the missing libraries, you'll need to recompile the OpenMP project using the Clang compiler built in the previous step.
Instead of only rebuilding the OpenMP project, it's also possible to repeat the entire compiler build from the previous section.
That's usually referred to as "bootstrapping" because Clang is compiling its own source code.
This is usually preferred when installing a released version of a compiler.
Anyway, the following will explain building only the OpenMP runtime libraries which will get you the required files much faster.
To do so, first create a new build directory:
$ cd ..
$ mkdir build-openmp
$ cd build-openmp
Now configure the project with CMake using the Clang compiler built in the previous step:
$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$(pwd)/../install \
-DCMAKE_C_COMPILER=$(pwd)/../install/bin/clang \
-DCMAKE_CXX_COMPILER=$(pwd)/../install/bin/clang++ \
-DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=35,60,70 \
../llvm-7.0.0.src/projects/openmp
The flags are the same as above except that we want to use a different compiler.
With CMake this can be adjusted with CMAKE_C_COMPILER and CMAKE_CXX_COMPILER.
If you installed the binaries to a different location, you need to adapt their values accordingly.
Build and install the OpenMP runtime libraries:
$ make -j8
$ make -j8 install
This should give you some libomptarget-nvptx-sm_??.bc libraries as mentioned in the warning message.
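You can verify that they were installed; with the configuration used here they should end up in the lib directory of the install prefix (the relative path assumes you are still in build-openmp):
$ ls $(pwd)/../install/lib/libomptarget-nvptx-sm_*.bc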
Done
Following the instructions up to this point you should now have a fully working Clang compiler with support for OpenMP offloading!
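To confirm that offloading really works end to end, you can compile and run a small test program with the freshly built compiler. The following is only a sketch; the file name offload_test.c is a placeholder, and the PATH line assumes you are still in the build-openmp directory, so adjust it to wherever you installed the compiler:
$ export PATH=$(pwd)/../install/bin:$PATH
$ cat > offload_test.c <<'EOF'
#include <stdio.h>
#include <omp.h>

int main(void) {
    int on_device = 0;
    // Run this region on the GPU and record whether we actually left the host.
    #pragma omp target map(from: on_device)
    {
        on_device = !omp_is_initial_device();
    }
    printf("Offloading %s\n", on_device ? "works!" : "does NOT work.");
    return 0;
}
EOF
$ clang -O2 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda offload_test.c -o offload_test
$ ./offload_test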
This guide was originally published as a blog post: https://www.hahnjo.de/blog/2018/10/08/clang-7.0-openmp-offloading-nvidia.html