Difference between revisions of "GPU Tutorial/SAXPY CUDA C"

Revision as of 12:44, 11 November 2021

Tutorial
Title:	Introduction to GPU Computing
Provider:	HPC.NRW
Contact:	tutorials@hpc.nrw
Type:	Multi-part video
Topic Area:	GPU computing
License:	CC-BY-SA
Syllabus
1. Introduction
2. Several Ways to SAXPY: CUDA C/C++
3. Several Ways to SAXPY: OpenMP
4. Several Ways to SAXPY: Julia
5. Several Ways to SAXPY: NUMBA

This video discusses how to implement SAXPY with NVIDIA CUDA C/C++.

1. Which features does CUDA add to C/C++?

2. What is a kernel?

3. How do you flag a function to be a kernel?

4. Suppose you have written a kernel function called "MyKernel". How do you run it?

5. Inside your kernel function, how do you distribute your data over the GPU threads?

1. For which kind of program can we expect improvements with GPUs?

2. What does GPU stand for?

3. Why do we expect an overhead in the GPU timings?

@@ Line 68: / Line 68: @@
 {
 |type="()"}
-- MyKernel()
+- MyKernel();
 || Wrong. This would just execute an ordinary function.
-- CUDA.run(NoBlocks, NoThreads, MyKernel())
+- CUDA.run(NoBlocks, NoThreads, MyKernel());
 || Wrong. There is no CUDA.run()
-+ <<<NoBlocks, NoThreads>>>MyKernel()
++ MyKernel<<<NoBlocks, NoThreads>>>();
 || Correct
-- __global(NoBlocks, NoThreads)__ MyKernel()
+- __global(NoBlocks, NoThreads)__ MyKernel();
 || Wrong. __global__ and other specifiers can't take arguments; they are part of the function declaration, not the launch.
 </quiz>
@@ Line 89: / Line 89: @@
 + Each thread has an index attached to it, which is accessed via threadIdx.x
 || Correct
-- If you use array-element-wise operations (like y.=a.*x.+b ), this is managed by the NVIDIA preprocessor.
+- If you use element-wise array operations, e.g. y.=a.*x.+b, this is managed by the NVIDIA preprocessor.
 || Wrong. There are no element-wise operators in C/C++
 - You flag a line to be parallelized via keywords, e.g.: __device__ y=a*x+b

	MyKernel();
	CUDA.run(NoBlocks, NoThreads, MyKernel());
	MyKernel<<<NoBlocks, NoThreads>>>();
	__global(NoBlocks, NoThreads)__ MyKernel();
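
For reference, a minimal sketch of how a kernel is declared and launched in CUDA C/C++. The names MyKernel, NoBlocks, and NoThreads follow the quiz; the kernel body, the flag buffer, the helper launch_demo, and the chosen sizes are assumptions made purely for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __global__ marks a function as a kernel: it runs on the GPU
// and is launched from host code.
__global__ void MyKernel(int *flag)
{
    // Hypothetical body: every thread writes the same value.
    *flag = 1;
}

// Hypothetical helper: launches the kernel and returns the flag (1 on success).
int launch_demo()
{
    const int NoBlocks = 2;    // assumed number of thread blocks
    const int NoThreads = 32;  // assumed number of threads per block

    int *d_flag = nullptr;
    if (cudaMalloc(&d_flag, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d_flag, 0, sizeof(int));

    // The execution configuration goes between the kernel name
    // and its argument list.
    MyKernel<<<NoBlocks, NoThreads>>>(d_flag);

    int h_flag = 0;
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_flag);
    return h_flag;
}

int main()
{
    printf("flag = %d\n", launch_demo());
    return 0;
}
```

A file like this would typically be compiled with nvcc, for example `nvcc launch_demo.cu -o launch_demo` (the file name is an assumption).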

	You don't have to, CUDA does that automatically for you.
	Each thread has an index attached to it, which is accessed via threadIdx.x
	If you use element-wise array operations, e.g. y.=a.*x.+b, this is managed by the NVIDIA preprocessor.
	You flag a line to be parallelized via keywords, e.g.: __device__ y=a*x+b
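
To illustrate how data can be distributed over threads by index, here is a hedged sketch of a SAXPY kernel. Combining threadIdx.x with blockIdx.x and blockDim.x into a global index is a common pattern when more than one block is used; the helper saxpy_max_error, the block size of 256, and the test values are assumptions, not taken from the video.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread handles exactly one array element.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    // Global index: block offset plus the thread's index inside its block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // guard: the last block may contain surplus threads
        y[i] = a * x[i] + y[i];
}

// Hypothetical helper: runs SAXPY with a = 2, x = 1, y = 2 and
// returns the largest deviation from the expected result 4.
float saxpy_max_error(int n)
{
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    const size_t bytes = n * sizeof(float);

    float *d_x = nullptr, *d_y = nullptr;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;                          // assumed block size
    const int blocks = (n + threads - 1) / threads;   // round up to cover n
    saxpy<<<blocks, threads>>>(n, 2.0f, d_x, d_y);

    cudaMemcpy(y.data(), d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);

    float max_err = 0.0f;
    for (int i = 0; i < n; ++i)
        max_err = std::fmax(max_err, std::fabs(y[i] - 4.0f));
    return max_err;
}

int main()
{
    printf("max error: %f\n", saxpy_max_error(1000));
    return 0;
}
```

The bounds check matters whenever n is not a multiple of the block size, as with n = 1000 here.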

	It's a flag you can set to automatically parallelize any function.
	It's the part of your code that is run on the GPU.
	It's a new CUDA function that activates the GPU.

	The data must be copied to an extra device first and has to be transferred back later
	A GPU core is "weaker" than a CPU core
	For "small" problems like the SAXPY, the whole power of a GPU is rarely used
	All of the above
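
The transfer overhead can be made visible by timing the copies and the kernel separately. The following is a rough sketch using CUDA events; the helper names, the problem size, and the block size are assumptions, and the measured numbers will depend on the hardware.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Hypothetical helper: milliseconds elapsed between two recorded events.
static float elapsed_ms(cudaEvent_t start, cudaEvent_t stop)
{
    float ms = 0.0f;
    cudaEventSynchronize(stop);  // wait until 'stop' has been reached
    cudaEventElapsedTime(&ms, start, stop);
    return ms;
}

// Hypothetical helper: runs SAXPY, prints the time of each phase,
// and returns y[0] so the result can be checked.
float timed_saxpy_result(int n)
{
    const size_t bytes = n * sizeof(float);
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    float *d_x = nullptr, *d_y = nullptr;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    cudaEventRecord(t0);
    cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
    cudaEventRecord(t2);
    cudaMemcpy(y.data(), d_y, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(t3);

    printf("host -> device: %.3f ms\n", elapsed_ms(t0, t1));
    printf("kernel:         %.3f ms\n", elapsed_ms(t1, t2));
    printf("device -> host: %.3f ms\n", elapsed_ms(t2, t3));

    cudaEventDestroy(t0); cudaEventDestroy(t1);
    cudaEventDestroy(t2); cudaEventDestroy(t3);
    cudaFree(d_x);
    cudaFree(d_y);
    return y[0];
}

int main()
{
    timed_saxpy_result(1 << 20);
    return 0;
}
```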

	serial programs
	parallel programs

	graphics processing unit
	grand powerful unit