GPU Tutorial/SAXPY CUDA C

Tutorial
Title: Introduction to GPU Computing
Provider: HPC.NRW
Contact: tutorials@hpc.nrw
Type: Multi-part video
Topic Area: GPU computing
License: CC-BY-SA

Syllabus
1. Introduction
2. Several Ways to SAXPY: CUDA C/C++
3. Several Ways to SAXPY: OpenMP
4. Several Ways to SAXPY: Julia
5. Several Ways to SAXPY: NUMBA

This video discusses SAXPY via NVIDIA CUDA C/C++. CUDA is an application programming interface (API) for NVIDIA GPUs. In general, CUDA works with many programming languages, but this tutorial focuses on C/C++. CUDA gives access to a GPU's instruction set, which means we have to go through everything step by step, since many things do not happen automatically.
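
For orientation, here is a minimal SAXPY in CUDA C/C++ of the kind the video walks through. This is a sketch, not the video's exact code; the names (saxpy, d_x, d_y) and the sizes are illustrative.

<syntaxhighlight lang="cuda">
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: runs on the GPU, one thread per vector element.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against surplus threads
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x = new float[n], *y = new float[n];
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // Nothing happens automatically: allocate device memory ...
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));

    // ... copy the inputs to the device ...
    cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, n * sizeof(float), cudaMemcpyHostToDevice);

    // ... launch the kernel with enough blocks to cover all n elements ...
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, 2.0f, d_x, d_y);

    // ... and copy the result back to the host.
    cudaMemcpy(y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", y[0]);  // expect 4.0

    cudaFree(d_x); cudaFree(d_y);
    delete[] x; delete[] y;
    return 0;
}
</syntaxhighlight>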
  
 
=== Video === <!--T:5-->

(Slides as pdf)


=== Quiz ===

{{hidden begin
|title = 1. Which features does CUDA add to C/C++?}}
<quiz display=simple>
{
|type="()"}
- new functions
|| CUDA does not only add new functions, but all of these features.
- new syntax
|| CUDA does not only add new syntax, but all of these features.
- GPU support
|| CUDA does not only add GPU support, but all of these features.
+ All of the above
|| Correct
</quiz>
{{hidden end}}

{{hidden begin
|title = 2. What is a kernel?}}
<quiz display=simple>
{
|type="()"}
- It's a flag you can set to automatically parallelize any function.
|| Wrong
+ It's the part of your code that is run on the GPU.
|| Correct
- It's a new CUDA function that activates the GPU.
|| Wrong
</quiz>
{{hidden end}}

{{hidden begin
|title = 3. How do you flag a function to be a kernel?}}
<quiz display=simple>
{
|type="()"}
- __host__
|| Wrong. This specifies a function that runs on the CPU.
- __device__
|| Wrong. This does specify a function that runs on the GPU, but such a function must also be called from the GPU, while we want a kernel to be launched by the CPU.
+ __global__
|| Correct
- __GPU__
|| Wrong. This modifier doesn't exist.
</quiz>
{{hidden end}}
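
For illustration, the three real qualifiers side by side (a minimal sketch; the function names are made up for this example):

<syntaxhighlight lang="cuda">
// __host__ : runs on the CPU and is called from the CPU (the default).
__host__ float scale_cpu(float a) { return 2.0f * a; }

// __device__ : runs on the GPU, but may only be called from GPU code.
__device__ float scale_gpu(float a) { return 2.0f * a; }

// __global__ : a kernel - runs on the GPU, launched from the CPU.
__global__ void my_kernel(float *y) {
    y[threadIdx.x] = scale_gpu(y[threadIdx.x]);  // kernels may call __device__ functions
}
</syntaxhighlight>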

{{hidden begin
|title = 4. Let's say you coded your kernel function called "MyKernel". How do you run it?}}
<quiz display=simple>
{
|type="()"}
- MyKernel();
|| Wrong. This would just execute an ordinary function.
- CUDA.run(NoBlocks, NoThreads, MyKernel());
|| Wrong. There is no CUDA.run().
+ MyKernel<<<NoBlocks, NoThreads>>>();
|| Correct
- __global(NoBlocks, NoThreads)__ MyKernel();
|| Wrong. __global__ and the other modifiers can't take arguments; they belong to the function definition, not to the launch.
</quiz>
{{hidden end}}
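
In context, a launch looks like this (a minimal sketch; the block and thread counts are just example values):

<syntaxhighlight lang="cuda">
#include <cuda_runtime.h>

__global__ void MyKernel() { /* ... GPU work ... */ }

int main() {
    int NoThreads = 256;  // threads per block
    int NoBlocks  = 4;    // number of blocks
    MyKernel<<<NoBlocks, NoThreads>>>();  // execution configuration goes between <<< >>>
    cudaDeviceSynchronize();              // launches are asynchronous; wait for the kernel
    return 0;
}
</syntaxhighlight>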

{{hidden begin
|title = 5. Inside your kernel function, how do you distribute your data over the GPU threads?}}
<quiz display=simple>
{
|type="()"}
- You don't have to, CUDA does that automatically for you.
|| Wrong
+ Each thread has an index attached to it, which is addressed via threadIdx.x
|| Correct
- If you use array-element-wise operations, e.g. y.=a.*x.+b, this is managed by the NVIDIA preprocessor.
|| Wrong. There are no element-wise operators in C/C++.
- You flag a line to be parallelized via keywords, e.g. __device__ y=a*x+b
|| Wrong. These modifiers are used in function definitions.
</quiz>
{{hidden end}}
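
Concretely, each thread picks "its" element from the built-in index variables (the SAXPY kernel from the sketch above):

<syntaxhighlight lang="cuda">
__global__ void saxpy(int n, float a, const float *x, float *y) {
    // threadIdx.x : this thread's index within its block
    // blockIdx.x  : the block's index; blockDim.x : threads per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                   // the last block may have spare threads
        y[i] = a * x[i] + y[i];  // thread i handles element i
}
</syntaxhighlight>
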
=== Introduction Quiz === <!--T:5-->

{{hidden begin
|title = 1. For which kind of program can we expect improvements with GPUs?}}
<quiz display=simple>
{
|type="()"}
- serial programs
|| Wrong: a CPU is optimized for low latency (strong single thread), which suits serial programs; a GPU is optimized for throughput (massive parallelism).
+ parallel programs
|| Correct: a GPU is optimized for throughput (massive parallelism), while a CPU is optimized for low latency (strong single thread).
</quiz>
{{hidden end}}

{{hidden begin
|title = 2. What does GPU stand for?}}
<quiz display=simple>
{
|type="()"}
+ graphics processing unit
|| Correct
- grand powerful unit
|| Wrong
</quiz>
{{hidden end}}

{{hidden begin
|title = 3. Why do we expect an overhead in the GPU timings?}}
<quiz display=simple>
{
|type="()"}
- The data must be copied to an extra device first and has to be transferred back later
|| Correct, but this is not the whole answer.
- A GPU core is "weaker" than a CPU core
|| Correct, but this is not the whole answer.
- For "small" problems like the SAXPY, the whole power of a GPU is rarely used
|| Correct, but this is not the whole answer.
+ All of the above
|| Correct!
</quiz>
{{hidden end}}
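
The first of these overheads is easy to measure yourself. Here is a minimal sketch using CUDA events to time the host-to-device copies separately from the kernel; the kernel and sizes are the illustrative ones from above:

<syntaxhighlight lang="cuda">
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x = new float[n], *y = new float[n];
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));

    // CUDA events time activity on the GPU itself.
    cudaEvent_t start, mid, stop;
    cudaEventCreate(&start); cudaEventCreate(&mid); cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);  // transfer overhead
    cudaMemcpy(d_y, y, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(mid);
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);             // actual compute
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float t_copy = 0.0f, t_kernel = 0.0f;
    cudaEventElapsedTime(&t_copy, start, mid);
    cudaEventElapsedTime(&t_kernel, mid, stop);
    printf("copy-in: %.3f ms, kernel: %.3f ms\n", t_copy, t_kernel);
    return 0;
}
</syntaxhighlight>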
