diff --git a/README.md b/README.md index cb10e3a01ca109993c32a4c840aa9a65e5028848..4509dafa5b4bc8aa6422a754d10a685c1b462bb1 100644 --- a/README.md +++ b/README.md @@ -1,13 +1,15 @@ -# Material for CUDA lab exercises +# PDC Summer School 2021: CUDA Lab Exercises ## Sources -Source code for the exercises can be found under `lab_1/C`, `lab_1/Fortran`, +The source code for the exercises can be found under `lab_1/C`, `lab_1/Fortran`, `lab_2/C` and `lab_2/Fortran`. -## Instructions for exercises +## Instructions for the exercises -- [Lab 1 C](lab_1/README.md) -- [Lab 1 Fortran](lab_1/README.md) and [specific Fortran guidelines](lab_1/Fortran/README.md) -- [Lab 2 C](lab_2/README.md) -- [Lab 2 Fortran](lab_2/README.md) and [specific Fortran guidelines](lab_1/Fortran/README.md) +The instructions for the different exercises are here: + +- Lab 1 + - C programmers: [Lab 1](lab_1/README.md) + - Fortran programmers: [Lab 1](lab_1/README.md) and [specific Fortran guidelines](lab_1/Fortran/README.md) +- Lab 2: [Lab 2](lab_2/README.md) \ No newline at end of file diff --git a/lab_1/C/lab01_ex1.cu b/lab_1/C/lab01_ex1.cu index bae1d240dbc1ff56ef2bb3646a3475cd64359aad..1f5384d1daf912c5ce7d6b430d089394732a47bb 100644 --- a/lab_1/C/lab01_ex1.cu +++ b/lab_1/C/lab01_ex1.cu @@ -30,7 +30,7 @@ int main(int argc, char **argv) gpu_helloworld<<<grid, block>>>(); //////////////// - // TO-DO #1.2 //////////////////// + // TO-DO A1.2 //////////////////// // Introduce your changes here! // ////////////////////////////////// diff --git a/lab_1/C/lab01_ex2.cu b/lab_1/C/lab01_ex2.cu index 7b31e461dd1293941603a301c632997c6e1cec4e..e059f571040362c1e71367670bfde12d5952d5dc 100644 --- a/lab_1/C/lab01_ex2.cu +++ b/lab_1/C/lab01_ex2.cu @@ -42,7 +42,7 @@ void cpu_saxpy(int n, float a, float *x, float *y) } //////////////// -// TO-DO #2.6 ///////////////////////////////////////////////////////////// +// TO-DO A2.6 ///////////////////////////////////////////////////////////// // Declare the kernel gpu_saxpy() with the same interface as cpu_saxpy() // /////////////////////////////////////////////////////////////////////////// @@ -54,12 +54,12 @@ int main(int argc, char **argv) float error = 0.0f; //////////////// - // TO-DO #2.2 /////////////////////////////// + // TO-DO A2.2 /////////////////////////////// // Introduce the grid and block definition // ///////////////////////////////////////////// ////////////////// - // TO-DO #2.3.1 ///////////////////////////// + // TO-DO A2.3.1 ///////////////////////////// // Declare the device pointers d_x and d_y // ///////////////////////////////////////////// @@ -83,7 +83,7 @@ int main(int argc, char **argv) } ////////////////// - // TO-DO #2.3.2 //////////////////////////////////////////////////////// + // TO-DO A2.3.2 //////////////////////////////////////////////////////// // Allocate d_x and d_y on the GPU, and copy the content from the CPU // //////////////////////////////////////////////////////////////////////// @@ -94,12 +94,12 @@ int main(int argc, char **argv) error = generate_hash(ARRAY_SIZE, y); //////////////// - // TO-DO #2.4 //////////////////////////////////////// + // TO-DO A2.4 //////////////////////////////////////// // Call the GPU kernel gpu_saxpy() with d_x and d_y // ////////////////////////////////////////////////////// ////////////////// - // TO-DO #2.5.1 //////////////////////////////////////////////////// + // TO-DO A2.5.1 //////////////////////////////////////////////////// // Copy the content of d_y from the GPU to the array y on the CPU //
//////////////////////////////////////////////////////////////////// @@ -119,7 +119,7 @@ int main(int argc, char **argv) free(y); ////////////////// - // TO-DO #2.5.2 ///////// + // TO-DO A2.5.2 ///////// // Release d_x and d_y // ///////////////////////// diff --git a/lab_1/CUDA_Lab01.pdf b/lab_1/CUDA_Lab01.pdf deleted file mode 100644 index 66a1ab79c30b95cf8182afc1f96c4409b8795bfc..0000000000000000000000000000000000000000 Binary files a/lab_1/CUDA_Lab01.pdf and /dev/null differ diff --git a/lab_1/CUDA_Lab01_Fortran_Guidelines.pdf b/lab_1/CUDA_Lab01_Fortran_Guidelines.pdf deleted file mode 100644 index 179f1f8828a267bd3c6bb881b3f9b98c739197b6..0000000000000000000000000000000000000000 Binary files a/lab_1/CUDA_Lab01_Fortran_Guidelines.pdf and /dev/null differ diff --git a/lab_1/Fortran/README.md b/lab_1/Fortran/README.md index b8d489541fbeafe14c4fa2ebad7e47f071d7942b..579dd8b8c09d6f2d9ed0b4907457a862095ddf6f 100644 --- a/lab_1/Fortran/README.md +++ b/lab_1/Fortran/README.md @@ -1,27 +1,22 @@ -# Guidelines for CUDA Fortran +# PDC Summer School 2021: CUDA Laboratory 1 / Guidelines for CUDA Fortran -In this document, we are going to cover the very basic CUDA Fortran concepts in -comparison with C. We ask you to follow the CUDA Laboratory 1 description in C -and to use this document to understand what would be your changes in Fortran. +In this document, we are going to cover the very basic CUDA Fortran concepts in comparison with C. We ask you to follow the CUDA Laboratory 1 description in C and to use this document to understand what would be your changes in Fortran. ## Compiling a CUDA Fortran Program -The compilation is very similar to CUDA C, but with slight variations. First, -you need to load not only the CUDA module, but also the PGI compiler: +The compilation is very similar to CUDA C, but with slight variations. First, you need to load not only the CUDA module, but also the PGI compiler: ``` -module load cuda/7.0 pgi +module load cuda pgi ``` -To compile a CUDA Fortran program, use ``pgfortran`` and include the architecture -(i.e., ``cc3x``): +To compile a CUDA Fortran program, use ``pgfortran`` and include the architecture (i.e., ``cc3x``): ``` pgfortran -Mcuda=cc3x your_cuda_file.cuf -o your_cuda_file.out ``` -You can run a program as in the CUDA C version, allocating a node first with -``salloc`` and then running the code with ``srun``: +You can run a program as in the CUDA C version, allocating a node first with ``salloc`` and then running the code with ``srun``: ``` srun -n 1 ./your_cuda_file.out @@ -29,9 +24,7 @@ srun -n 1 ./your_cuda_file.out ## Kernel Management -The concept of ``grid`` and ``block`` is the same as in CUDA C. In this case, you need -to declare both variables as ``type(dim3)``. This is an example with a grid of 1 -block of 32 threads in X: +The concept of ``grid`` and ``block`` is the same as in CUDA C. In this case, you need to declare both variables as ``type(dim3)``. This is an example with a grid of 1 block of 32 threads in X: ``` type(dim3) :: grid @@ -48,31 +41,22 @@ call your_kernel<<<grid, block>>>( ... ) ## Memory Management -To allocate memory on the GPU and release it afterwards, use the ``cudaMalloc()`` -and the ``cudaFree()`` functions. You need to declare the variable with the ``device`` -attribute: +To allocate memory on the GPU and release it afterwards, use the ``cudaMalloc()`` and the ``cudaFree()`` functions. 
You need to declare the variable with the ``device`` attribute: ``` real, allocatable, device :: d_x(:) hr = cudaMalloc(d_x, 256) ``` -Here, we declare an array ``d_x`` of type ``real, allocatable`` to be used on the -GPU. Then, we use ``cudaMalloc()`` to define the size with 256 elements. We have -captured the status result of the operation in an integer ``hr``, in case we would -like to check if there were any errors. +Here, we declare an array ``d_x`` of type ``real, allocatable`` to be used on the GPU. Then, we use ``cudaMalloc()`` to define the size with 256 elements. We have captured the status result of the operation in an integer ``hr``, in case we would like to check if there were any errors. -To copy memory from the host to the GPU (or viceversa), use the ``cudaMemcpy()`` -function: +To copy memory from the host to the GPU (or viceversa), use the ``cudaMemcpy()`` function: ``` hr = cudaMemcpy(d_x, x, ARRAY_SIZE) ! Copy the content from x to d_x ``` -Compared to the CUDA C version, the main difference is that we no longer have -to specify the direction of the copy. In this case, we are copying from ``x`` (on -the CPU) to ``d_x`` (on the GPU). But we could revert the direction by simply -swapping the variables: +Compared to the CUDA C version, the main difference is that we no longer have to specify the direction of the copy. In this case, we are copying from ``x`` (on the CPU) to ``d_x`` (on the GPU). But we could revert the direction by simply swapping the variables: ``` hr = cudaMemcpy(x, d_x, ARRAY_SIZE) ! Copy the content from d_x to x @@ -80,9 +64,7 @@ hr = cudaMemcpy(x, d_x, ARRAY_SIZE) ! Copy the content from d_x to x ## Kernel Implementation -The CUDA Fortran kernels are once again very similar to their CUDA C -counterpart. In this case, you need to declare a new subroutine with the ``global`` -attribute, such as: +The CUDA Fortran kernels are once again very similar to their CUDA C counterpart. In this case, you need to declare a new subroutine with the ``global`` attribute, such as: ``` attributes(global) subroutine your_kernel(n, d_x) @@ -90,26 +72,21 @@ attributes(global) subroutine your_kernel(n, d_x) end subroutine your_kernel ``` -The type definition of the constant arguments, such as "n" in the -previous example, must contain the attribute ``value``: +The type definition of the constant arguments, such as "n" in the previous example, must contain the attribute ``value``: ``` integer, value :: n ``` -We also recommend you to specify the intent of the input parameter. In the case -of ``d_x``, we could declare it as (note that, inside the kernel, we do not specify -the ``device`` attribute): +We also recommend you to specify the intent of the input parameter. 
In the case of ``d_x``, we could declare it as (note that, inside the kernel, we do not specify the ``device`` attribute): ``` real, intent(inout) :: d_x(:) ``` -Finally, the predefined constants ``gridDim``, ``blockDim``, ``blockIdx`` and -``threadIdx``, are all available inside the CUDA Fortran kernel: +Finally, the predefined constants ``gridDim``, ``blockDim``, ``blockIdx`` and ``threadIdx``, are all available inside the CUDA Fortran kernel: ``` integer :: tid tid = (blockIdx%x - 1) * blockDim%x + threadIdx%x ``` - diff --git a/lab_1/Fortran/lab01_ex2.cuf b/lab_1/Fortran/lab01_ex2.cuf index f1dde0ef2c790017df537b70fdda32382d672a53..516d6814898d563a04cbc368f96ea7b08efdb298 100644 --- a/lab_1/Fortran/lab01_ex2.cuf +++ b/lab_1/Fortran/lab01_ex2.cuf @@ -32,7 +32,7 @@ end module HelperSubroutines module CUDAKernels contains !//////////////// -!// TO-DO #2.6 ///////////////////////////////////////////////////////////// +!// TO-DO A2.6 ///////////////////////////////////////////////////////////// !// Declare the kernel gpu_saxpy() with the same interface as cpu_saxpy() // !/////////////////////////////////////////////////////////////////////////// end module CUDAKernels @@ -54,12 +54,12 @@ program lab01_ex2 integer :: hr !//////////////// - !// TO-DO #2.2 /////////////////////////////// + !// TO-DO A2.2 /////////////////////////////// !// Introduce the grid and block definition // !///////////////////////////////////////////// !////////////////// - !// TO-DO #2.3.1 ///////////////////////////// + !// TO-DO A2.3.1 ///////////////////////////// !// Declare the device pointers d_x and d_y // !///////////////////////////////////////////// @@ -81,7 +81,7 @@ program lab01_ex2 y = 0.2 !////////////////// - !// TO-DO #2.3.2 //////////////////////////////////////////////////////// + !// TO-DO A2.3.2 //////////////////////////////////////////////////////// !// Allocate d_x and d_y on the GPU, and copy the content from the CPU // !//////////////////////////////////////////////////////////////////////// @@ -92,12 +92,12 @@ program lab01_ex2 call generate_hash(ARRAY_SIZE, y, error); !//////////////// - !// TO-DO #2.4 //////////////////////////////////////// + !// TO-DO A2.4 //////////////////////////////////////// !// Call the GPU kernel gpu_saxpy() with d_x and d_y // !////////////////////////////////////////////////////// !////////////////// - !// TO-DO #2.5.1 //////////////////////////////////////////////////// + !// TO-DO A2.5.1 //////////////////////////////////////////////////// !// Copy the content of d_y from the GPU to the array y on the CPU // !//////////////////////////////////////////////////////////////////// @@ -118,7 +118,7 @@ program lab01_ex2 deallocate(y) !////////////////// - !// TO-DO #2.5.2 ///////// + !// TO-DO A2.5.2 ///////// !// Release d_x and d_y // !///////////////////////// end program lab01_ex2 diff --git a/lab_1/README.md b/lab_1/README.md index 324c16b905f85dd7dc0e51fd59092ca0e82a1691..59bd2901b23590ca75e80a2cacee199583590d6f 100644 --- a/lab_1/README.md +++ b/lab_1/README.md @@ -1,203 +1,73 @@ -# CUDA Laboratory 1 - -_Introduction to High-Performance Computing_ +# PDC Summer School 2021: CUDA Laboratory 1 ## Introduction -In this first laboratory about GPU programming in CUDA, we are going to -introduce you to the very basic concepts you need to start programming your -own GPU-accelerated applications. 
The laboratory is divided into two different -blocks and exercises, for your own convenience: - -+ **Block A (Friday / 23<sup>th</sup> of August)**: The first block aims to provide you with - a first notion on how to use CUDA. This includes how to compile a CUDA - program, how to launch a CUDA kernel, how to index 1D arrays, and more. - -+ **Block B (Monday / 26<sup>th</sup> of August)**: The second block presents a use-case - for using GPUs with images. The goal is to make you understand how to index - 2D matrices, while doing something practical. - -A step-by-step guide is always provided on each exercise of both blocks. **You -will find one or more TO-DO steps (with questions) that you need to -address in order for the exercise to be considered complete.** When you find one -of these subsections, our advice is to stop reading the following paragraph and -address as much as you can. The exercises are relatively simple, do not worry. -Our objective is to help you understand the concepts being explained. - -For all the exercises, we are going to use Tegner. - -> Even though you may have a laptop with a CUDA-supported GPU, we encourage you -> to use Tegner during the sessions of today and Monday. - -Tegner is a cluster at KTH -Royal Institute of Technology with 65 heterogeneous nodes. Every node has one -or two GPUs, based on the NVIDIA Quadro K420 or the NVIDIA Tesla K80. For the -purposes of this laboratory, we will use the "Thin Nodes" of Tegner. -These nodes contain 2×12-core Intel E5-2690v3 CPUs at 2.60GHz, 512GB DRAM and -NVIDIA Quadro K420 GPU per node. More information can be found -[here](https://www.pdc.kth.se/hpc-services/computing-systems/tegner-1.737437). - -Everyone should try to finish the first block of the laboratory, otherwise it -would be difficult to complete the second block in the next laboratory session -of Monday. Ideally, the first block is designed to be completed during the -first session of CUDA (today), while the second block targets the second part -(after the weekend). Take your time, it is important for you to try to -understand the key-concepts presented on this session. - -<span style="color:red">**Please note!**</span> In case you need help or you -do not understand some of the concepts of CUDA, do not worry, we are here to -help you out! - -## Block A - -In this first block of the laboratory, we are going to introduce you to the -world of GPU programming using CUDA. We will begin explaining how to connect to -Tegner, as well as how to compile and run a given program. Thereafter, we will -ask you to create your first CUDA program using a CPU implementation of a SAXPY -function as reference. - -### Exercise 1 - Experimental Setup +In this first laboratory about GPU programming in CUDA, we are going to introduce you to the very basic concepts you need to start programming your own GPU-accelerated applications. It aims to provide you with a first notion on how to use CUDA. This includes how to compile a CUDA program, how to launch a CUDA kernel, how to index 1D arrays, and more. -We will begin our laboratory learning how to connect to Tegner and ask for a -GPU resource on the cluster. First, ask for your Kerberos ticket using the -kinit command, as usual: +A step-by-step guide is provided and with each exercise you will find one or more steps that need to be performed with questions, which you need to address in order for the exercise to be considered complete. 
-``` -kinit --forwardable your_username@NADA.KTH.SE -``` - -Replace ``your_username`` with the username that was provided to you by PDC (i.e., -the one you have used for the previous sessions). After this, you can now -connect to Tegner using SSH: - -``` -ssh -Y your_username@tegner.pdc.kth.se -``` +For all the exercises, we are going to use Tegner. **Even though you may have a laptop with a CUDA-supported GPU, we encourage you to use Tegner during the labs.** For the purposes of this laboratory, we will use the "Thin Nodes" of Tegner. -You need to add the flag -Y to ask for X11 forwarding support. This is not -required for this first block of exercises, but it will be important for the -second block of the laboratory session on Monday. We will provide more details -afterwards. +In this first block A of the laboratory, we are going to introduce you to the world of GPU programming using CUDA. We will begin by explaining how to compile and run a given program. Thereafter, we will ask you to create your first CUDA program using a CPU implementation of a SAXPY function as reference. -Once you are connected, clone the cuda exercise repository in your Klemming folder: +## Exercise A1: Experimental Setup -``` -cd /cfs/klemming/nobackup/your_initial/your_username -module load git -git clone https://github.com/PDC-support/cuda-lab-exercises.git -cd cuda-lab-exercises/lab_1/C -``` - -Inside the extracted folder, you will find two files named -``lab01_ex1.cu`` and ``lab01_ex2.cu``. You must use these two files for solving the -exercises of this first block. As you might have noticed, the source files of -CUDA programs have the extension "cu". This is a mere naming -convention to identify your different GPU-based source code files. It is also a -good way for other programmers to understand that the file contains CUDA code. +Once you are connected to Tegner, copy the exercise folder ``lab_1`` including the sub-folders. This includes two files named ``lab01_ex1.cu`` and ``lab01_ex2.cu``. You must use these two files for solving the exercises of this first block. As you might have noticed, the source files of CUDA programs have the extension ``cu``. This is a mere naming convention to identify your different GPU-based source code files. It is also a good way for other programmers to understand that the file contains CUDA code. -**Let us now compile and run the lab01\_ex1.cu file.** This file -contains a very simple (yet very important) "Hello World!" CUDA program. We -will ask you to solve some issues with this exercise later below. +**Let us now compile and run the lab01\_ex1.cu file.** This file contains a very simple (yet very important) "Hello World!" CUDA program. We will ask you to solve some issues with this exercise later below. --- -**TO-DO [1.1]** +**TO-DO [A1.1]** -Open the file ``lab01_ex1.cu`` with your preferred text editor and examine its -content. Pay attention on how the the CUDA kernel ``gpu_helloworld()`` is launched -with the triple-bracket ``<<<>>>`` notation. Also, observe the declaration of the -kernel with ``__global__``. +Open the file ``lab01_ex1.cu`` with your preferred text editor and examine its content. Pay attention to how the CUDA kernel ``gpu_helloworld()`` is launched with the triple-bracket ``<<<>>>`` notation. Also, observe the declaration of the kernel with ``__global__``. --- -As you have observed, the source code of the mentioned file is easy to follow.
-Here, we just print a certain text using the ``printf()`` function, both in the CPU -version and the GPU version. However, for the GPU version, we also print the -thread identifier. The CUDA kernel is launched with 1 block of 32 threads, so -we simply have to use the predefined ``threadIdx`` constant on the X direction. In -other words, we are declaring the grid dimension as ``grid(1)`` and the block -dimension as ``block(32)``. Given the fact that the base type for each variable is -``dim3``, this means that we are implicitly creating a grid of blocks in the (X, Y, -Z) direction with a value 1 for Y and Z by default. The declaration of the -example is equivalent to the following: +As you have observed, the source code of the mentioned file is easy to follow. Here, we just print a certain text using the ``printf()`` function, both in the CPU version and the GPU version. However, for the GPU version, we also print the thread identifier. The CUDA kernel is launched with 1 block of 32 threads, so we simply have to use the predefined ``threadIdx`` constant on the X direction. In other words, we are declaring the grid dimension as ``grid(1)`` and the block dimension as ``block(32)``. Given the fact that the base type for each variable is ``dim3``, this means that we are implicitly creating a grid of blocks in the (X, Y, Z) direction with a value 1 for Y and Z by default. The declaration of the example is equivalent to the following: ``` dim3 grid(1,1,1); // 1 block in the grid dim3 block(32,1,1); // 32 threads per block ``` -This represents exactly what we had before (i.e., a grid of one block, whose -number of threads per block is 32 in the X direction). Understanding the way -indices work in CUDA is not trivial, so do not worry if you are a little bit -confused. The whole purpose of this first laboratory of CUDA is to make you -understand this concept. We will force you to practice throughout the document. +This represents exactly what we had before (i.e., a grid of one block, whose number of threads per block is 32 in the X direction). Understanding the way indices work in CUDA is not trivial, so do not worry if you are a little bit confused. The whole purpose of this first laboratory of CUDA is to make you understand this concept. We will force you to practice throughout the document. -To compile the ``lab01_ex1.cu`` example, you need to load the CUDA module. This -will also load all the necessary CUDA dependencies: +To compile the ``lab01_ex1.cu`` example, you need to load the CUDA module. This will also load all the necessary CUDA dependencies: ``` -module load cuda/7.0 +module load cuda ``` -To compile CUDA programs, we will use the ``nvcc`` command. This is a proprietary -NVIDIA compiler for CUDA that separates the host code (CPU code) from the GPU -code. The compiler will invoke GCC or ICC for the host code, as necessary. For -you, the only consideration is to use this compiler when you declare CUDA code -in your files, as follows: +To compile CUDA programs, we will use the ``nvcc`` command. This is a proprietary NVIDIA compiler for CUDA that separates the host code (CPU code) from the GPU code. The compiler will invoke GCC or ICC for the host code, as necessary. For you, the only consideration is to use this compiler when you declare CUDA code in your files, as follows: ``` nvcc -arch=sm_30 lab01_ex1.cu -o lab01_ex1.out ``` -The previous command asks ``nvcc`` to compile ``lab01_ex1.cu`` and to generate a binary -executable named ``lab01_ex1.out``. 
The example also requests support for the -feature version of CUDA architecture 3.0 (i.e., ``sm_30``). This is required to -distinguish old generations of graphic cards with new releases, that contain -extra features. This does not mean that your code will not run if you do not -provide this flag, but it is likely that many of the features that you plan to -use in CUDA today are only available on the latest architectures. +The previous command asks ``nvcc`` to compile ``lab01_ex1.cu`` and to generate a binary executable named ``lab01_ex1.out``. The example also requests support for the feature version of CUDA architecture 3.0 (i.e., ``sm_30``). This is required to distinguish old generations of graphic cards with new releases, that contain extra features. This does not mean that your code will not run if you do not provide this flag, but it is likely that many of the features that you plan to use in CUDA today are only available on the latest architectures. -Now, let us allocate one node on Tegner to run our program. We need to ask -``salloc`` for the type of nodes that we want, which will be the "Thin -Nodes" of Tegner (i.e., ``-C Haswell``) : +Now, let us allocate one node on Tegner to run our program. We need to ask ``salloc`` for the type of nodes that we want, which will be the "Thin Nodes" of Tegner (i.e., ``-C Haswell``) : ``` salloc --nodes=1 -C Haswell --gres=gpu:K420:1 -t 00:05:00 \ -A account_here --reservation=reservation_here ``` -Here, we are asking for 5 minutes of computation time on one single node of the -"Thin Nodes" of Tegner. In addition, we are specifying that we want to -get access to the GPU resource of the node with the ``--gres=gpu:K420:1`` option. +Here, we are asking for 5 minutes of computation time on one single node of the "Thin Nodes" of Tegner. In addition, we are specifying that we want to get access to the GPU resource of the node with the ``--gres=gpu:K420:1`` option. --- **NOTE** -Please, always ask for a node with ``salloc`` when your code compiles without -errors and you would like to run your program on Tegner. After you finish -executing and if you are not going to run anything for some time (e.g., between -each exercise), type exit to reclaim your allocation and allow other students -to get quick access to the cluster. This way, we will efficiently share the -resources and everyone will be able to run immediately. +Please, always ask for a node with ``salloc`` when your code compiles without errors and you would like to run your program on Tegner. After you finish executing and if you are not going to run anything for some time (e.g., between each exercise), type exit to reclaim your allocation and allow other students to get quick access to the cluster. This way, we will efficiently share the resources and everyone will be able to run immediately. --- -After you get the node, you must use srun to execute the program and ask for a -single process: +After you get the node, you must use srun to execute the program and ask for a single process: ``` srun -n 1 ./lab01_ex1.out ``` -You might be thinking: "Wait a second, only 1 process!? But we said 32 -threads before!". Yes, you are right, this is the confusing part. We are -going to parallelize the code using the local GPU of the node, and we are not -interested on using more nodes or more CPU processes. In the future, you could -use MPI with two or more processes, and create a massive parallel application -by using the GPU on each node. 
For now, we just want a single process that has -access to the GPU, but within the GPU we will request 32 threads. Check the -slides for an explanation on the differences between OpenMP, MPI and CUDA if -you feel this is still tricky to understand. - If everything went as expected, you should see the following output: ``` @@ -211,297 +81,155 @@ Hello from the GPU! My threadId is 30 Hello from the GPU! My threadId is 31 ``` -Were you able to see the suggested output? If not, can you guess why only the -CPU code generated the output? +**Questions** + +- Why should ``srun`` be called with the argument ``-n 1`` and not ``-n 24`` like in the MPI lab? +- Were you able to see the suggested output? If not, can you guess why only the CPU code generated the output? --- -**TO-DO [1.2]** +**TO-DO [A1.2]** -Open ``lab01_ex1.cu`` and find the commented "TO-DO #1.2" section in the code -to introduce the necessary fixes to make the code work as expected. +Open ``lab01_ex1.cu`` and find the commented "TO-DO A1.2" section in the code to introduce the necessary fixes to make the code work as expected. -+ Hint #1: Is the kernel execution synchronous or asynchronous? -+ Hint #2: Could this [https://goo.gl/J5j44F](https://goo.gl/J5j44F) be useful? +- Hint #1: Is the kernel execution synchronous or asynchronous? +- Hint #2: Could this [https://goo.gl/J5j44F](https://goo.gl/J5j44F) be useful? --- -One of the main concepts that you must understand while programming GPUs is -that, for the most part, the GPU code runs asynchronously with respect to the -CPU code. This is exactly what we observed in ``lab01_ex1.cu``, where even though -we set all the necessary elements correctly, the CUDA kernel always begins -execution while the CPU code is exiting the ``main()`` function. Without a proper -synchronization call, the CPU program finishes and no one will wait to see the -output coming from the GPU kernel. - -As suggested, ``cudaDeviceSynchronize()`` can fix the issue. However, you must keep -in mind that other functionality of CUDA also enforce synchronization points -implicitly, such as ``cudaMemcpy()``. If any of these functions are defined in your -code after you launch a CUDA kernel, then you do not need to set a -synchronization point with ``cudaDeviceSynchronize()``. - -### Exercise 2 - Your First CUDA Program - -Now that you understand how to compile and run a simple CUDA program, in this -exercise we ask you to create your very first CUDA kernel and to introduce some -of the functionality required for this kernel to work. This includes defining -the distribution of the threads or adding memory management operations to -transfer data from the host memory to the GPU, and back. We will use host or -CPU to refer to the CPU space. - -For this purpose, **we will implement a simple SAXPY program**. SAXPY is very -suitable to make you understand how to index 1D arrays inside a GPU kernel. The -term stands for "Single-Precision A\*X Plus Y", where A is a constant, -and X and Y are arrays. +One of the main concepts that you must understand while programming GPUs is that, for the most part, the GPU code runs asynchronously with respect to the CPU code. This is exactly what we observed in ``lab01_ex1.cu``, where even though we set all the necessary elements correctly, the CUDA kernel always begins execution while the CPU code is exiting the ``main()`` function. Without a proper synchronization call, the CPU program finishes and no one will wait to see the output coming from the GPU kernel.
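+To make the discussion that follows concrete, here is a minimal sketch of the situation (illustrative only; the exact contents of ``lab01_ex1.cu`` may differ slightly): the kernel launch returns control to the CPU immediately, so the host has to wait explicitly before exiting.
+
+```
+#include <cstdio>
+
+__global__ void gpu_helloworld()
+{
+    int tid = threadIdx.x;                 // thread index inside the block (X direction only)
+    printf("Hello from the GPU! My threadId is %d\n", tid);
+}
+
+int main()
+{
+    dim3 grid(1);                          // 1 block in the grid
+    dim3 block(32);                        // 32 threads per block
+
+    printf("Hello from the CPU!\n");
+    gpu_helloworld<<<grid, block>>>();     // asynchronous: control returns to the CPU immediately
+
+    cudaDeviceSynchronize();               // make the host wait until all GPU work has finished
+    return 0;
+}
+```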
+ +As suggested, ``cudaDeviceSynchronize()`` can fix the issue. However, you must keep in mind that other functionality of CUDA also enforce synchronization points implicitly, such as ``cudaMemcpy()``. If any of these functions are defined in your code after you launch a CUDA kernel, then you do not need to set a synchronization point with ``cudaDeviceSynchronize()``. + +## Exercise A2: Your first CUDA program + +Now that you understand how to compile and run a simple CUDA program, in this exercise we ask you to create your very first CUDA kernel and to introduce some of the functionality required for this kernel to work. This includes defining the distribution of the threads or adding memory management operations to transfer data from the host memory to the GPU, and back. We will use host or CPU to refer to the CPU space. + +For this purpose, **we will implement a simple SAXPY program**. SAXPY is very suitable to make you understand how to index 1D arrays inside a GPU kernel. The term stands for "Single-Precision A\*X Plus Y", where A is a constant, and X and Y are arrays. --- -**TO-DO [2.1]** +**TO-DO [A2.1]** -Open the file ``lab01_ex2.cu`` with your preferred text editor and examine its -content. In particular, make sure you understand the ``cpu_saxpy()`` function. We -will use this function as reference for the GPU version later in the exercise. +Open the file ``lab01_ex2.cu`` with your preferred text editor and examine its content. In particular, make sure you understand the ``cpu_saxpy()`` function. We will use this function as reference for the GPU version later in the exercise. --- -We will use the file ``lab01_ex2.cu`` for solving the exercise. The source code -contains a ``main()`` function that allocates two arrays, x and y, and initializes -each element of the array with 0.1 and 0.2, respectively. It is expected that -the user provides the value of the constant "a" as input for the -program. The size of each array is predefined with the constant ``ARRAY_SIZE``. -Right now, the program only calls ``cpu_saxpy()`` to compute the SAXPY result using -the CPU, but later you will introduce a call to the GPU version as well. -Finally, the code generates a naive hash of the result of both versions. This -will be used at the end of the execution to compare the solutions of the CPU -version and the GPU version. - -We are going to introduce small changes into the source code of ``lab01_ex2.cu`` to -allow for a SAXPY version on the GPU. As a rule of thumb, every CUDA program is -usually defined by following these simple five steps: - -1. **Declare and implement the CUDA kernel** that you would like to execute on the -GPU. -2. **Define the distribution of the threads**, in terms of the dimension of the -grid and the dimension of each block (of threads) inside the grid. -3. **Declare and define the GPU memory required** to execute the CUDA kernel. This -includes transferring the data from the host to the GPU, if needed. +We will use the file ``lab01_ex2.cu`` for solving the exercise. The source code contains a ``main()`` function that allocates two arrays, x and y, and initializes each element of the array with 0.1 and 0.2, respectively. It is expected that the user provides the value of the constant "a" as input for the program. The size of each array is predefined with the constant ``ARRAY_SIZE``. Right now, the program only calls ``cpu_saxpy()`` to compute the SAXPY result using the CPU, but later you will introduce a call to the GPU version as well. 
Finally, the code generates a naive hash of the result of both versions. This will be used at the end of the execution to compare the solutions of the CPU version and the GPU version. + +We are going to introduce small changes into the source code of ``lab01_ex2.cu`` to allow for a SAXPY version on the GPU. As a rule of thumb, every CUDA program is usually defined by following these simple five steps: + +1. **Declare and implement the CUDA kernel** that you would like to execute on the GPU. +2. **Define the distribution of the threads**, in terms of the dimension of the grid and the dimension of each block (of threads) inside the grid. +3. **Declare and define the GPU memory required** to execute the CUDA kernel. This includes transferring the data from the host to the GPU, if needed. 4. **Execute the CUDA kernel** with the correspondent parameters. -5. **Transfer the results from the GPU** to the host. Alternatively, use a -synchronization function to guarantee that the host waits for the GPU to -execute the kernel. - -In this exercise, we changed the order of some of these steps just so that we -force you to think on the main CUDA concepts. For instance, we will ask you to -implement the content of the kernel at the end of this exercise. The reason is -that we consider more relevant for you to initially understand how the -execution flow from the CPU to the GPU (and back) works. Thus, we are going to -focus inside the ``main()`` function to enable the execution of the CUDA kernel. - -First, let us define how many threads will be needed to compute the CUDA -kernel. We suggest you to use the constant ``ARRAY_SIZE`` to calculate the number -of blocks (of threads) to be used. For the block dimension, use only the -constant ``BLOCK_SIZE``. These two constants are defined on top of the file. Note -that, as we are using only 1D arrays, it is enough for you to define the -dimensions in terms of the X dimension only. You can check ``lab01_ex1.cu``, as -reference. - -For instance, imagine that we have an ``ARRAY_SIZE`` of 16 elements. By using a -``BLOCK_SIZE`` of 4, we can configure the execution to use 4 blocks in the grid and -on the X direction only: +5. **Transfer the results from the GPU** to the host. Alternatively, use a synchronization function to guarantee that the host waits for the GPU to execute the kernel. + +In this exercise, we changed the order of some of these steps just so that we force you to think on the main CUDA concepts. For instance, we will ask you to implement the content of the kernel at the end of this exercise. The reason is that we consider more relevant for you to initially understand how the execution flow from the CPU to the GPU (and back) works. Thus, we are going to focus inside the ``main()`` function to enable the execution of the CUDA kernel. + +First, let us define how many threads will be needed to compute the CUDA kernel. We suggest you to use the constant ``ARRAY_SIZE`` to calculate the number of blocks (of threads) to be used. For the block dimension, use only the constant ``BLOCK_SIZE``. These two constants are defined on top of the file. Note that, as we are using only 1D arrays, it is enough for you to define the dimensions in terms of the X dimension only. You can check ``lab01_ex1.cu``, as reference. + +For instance, imagine that we have an ``ARRAY_SIZE`` of 16 elements. 
By using a ``BLOCK_SIZE`` of 4, we can configure the execution to use 4 blocks in the grid and on the X direction only: <img src="image/array_size.png" alt="array_size" width="800px"/> -Inside the CUDA kernel, the GPU will provide us with chunks of ``BLOCK_SIZE`` -threads. We will need to use this dimension afterwards to determine which -elements we have to access for each array of SAXPY. As a side note, we -displayed in the figure a ``GRID_SIZE`` constant, but the purpose is to reflect the -size of the grid. +Inside the CUDA kernel, the GPU will provide us with chunks of ``BLOCK_SIZE`` threads. We will need to use this dimension afterwards to determine which elements we have to access for each array of SAXPY. As a side note, we displayed in the figure a ``GRID_SIZE`` constant, but the purpose is to reflect the size of the grid. -Developing a CUDA kernel requires a substantial mindset change from the -traditional CPU programming. The main one is the inherently absence of loops, -in favor of massively parallel number of threads that perform very small tasks -on specific elements of the data instead. In this case, we ideally want one -thread per element on each array. For now, try to concentrate on dividing the -workload assuming this fact. We will handle the specific details of thread -parallelism afterwards inside the SAXPY kernel. +Developing a CUDA kernel requires a substantial mindset change from the traditional CPU programming. The main one is the inherently absence of loops, in favor of massively parallel number of threads that perform very small tasks on specific elements of the data instead. In this case, we ideally want one thread per element on each array. For now, try to concentrate on dividing the workload assuming this fact. We will handle the specific details of thread parallelism afterwards inside the SAXPY kernel. --- -**TO-DO [2.2]** +**TO-DO [A2.2]** -Find the "TO-DO #2.2" section inside ``lab01_ex2.cu`` and declare the grid and -block dimensions that will be used to launch the CUDA kernel of SAXPY. We -suggest you to request enough threads for the GPU to cover all the elements of -each array at once. Do not worry if you request more threads than elements in -the array, but try to fit the value. +Find the "TO-DO A2.2" section inside ``lab01_ex2.cu`` and declare the grid and block dimensions that will be used to launch the CUDA kernel of SAXPY. We suggest you to request enough threads for the GPU to cover all the elements of each array at once. Do not worry if you request more threads than elements in the array, but try to fit the value. -+ Hint #1: Use the constant ``ARRAY_SIZE`` to determine how many blocks of - threads in the X direction you will need. -+ Hint #2: Use the constant ``BLOCK_SIZE`` to define the number of threads in - the X direction, per block. -+ Hint #3: For correctness, consider that the ``ARRAY_SIZE`` might not be - multiple of the ``BLOCK_SIZE``. +- Hint #1: Use the constant ``ARRAY_SIZE`` to determine how many blocks of threads in the X direction you will need. +- Hint #2: Use the constant ``BLOCK_SIZE`` to define the number of threads in the X direction, per block. +- Hint #3: For correctness, consider that the ``ARRAY_SIZE`` might not be multiple of the ``BLOCK_SIZE``. --- -After the grid and block dimensions are defined, the next step would be to -declare the device pointers that contain the elements of x and y, but on the -GPU side. The constant "a" can be passed by value to the CUDA kernel, -so no changes are required in this regard. 
We recommend you to always use the -prefix ``d_`` for the device-related pointers, as a good naming -convention. For instance, in this case, we should use ``d_x`` and ``d_y`` for the -device pointers that contain the elements of their equivalent x and y arrays in -the CPU: +After the grid and block dimensions are defined, the next step would be to declare the device pointers that contain the elements of x and y, but on the GPU side. The constant "a" can be passed by value to the CUDA kernel, so no changes are required in this regard. We recommend you to always use the prefix ``d_`` for the device-related pointers, as a good naming convention. For instance, in this case, we should use ``d_x`` and ``d_y`` for the device pointers that contain the elements of their equivalent x and y arrays in the CPU: ``` float *d_x = NULL; float *d_y = NULL; ``` -Declaring only the ``d_x`` and ``d_y`` pointers will not seem to provide that much -value to your SAXPY kernel afterwards. The main reason is that we also need to -explicitly allocate each array on the GPU side. Thus, even though the arrays x -and y are already allocated on the host side, we should account for the fact -that the memory visible to the CPU is not visible to the GPU. +Declaring only the ``d_x`` and ``d_y`` pointers will not seem to provide that much value to your SAXPY kernel afterwards. The main reason is that we also need to explicitly allocate each array on the GPU side. Thus, even though the arrays x and y are already allocated on the host side, we should account for the fact that the memory visible to the CPU is not visible to the GPU. -> The Unified Memory Model of CUDA is an exception. This model manages the data -> transfer from the host to the GPU, and back, in an automatic manner. In this -> laboratory session, we use the classical model, where the memory management is -> up to the programmer. This also has benefits for performance. +> The Unified Memory Model of CUDA is an exception. This model manages the data transfer from the host to the GPU, and back, in an automatic manner. In this laboratory session, we use the classical model, where the memory management is up to the programmer. This also has benefits for performance. -Moreover, the content of each array must be manually transferred from the host -side (i.e., we want to define the device arrays to contain the same elements as -in the CPU version). +Moreover, the content of each array must be manually transferred from the host side (i.e., we want to define the device arrays to contain the same elements as in the CPU version). --- -**TO-DO [2.3]** +**TO-DO [A2.3]** -Find the "TO-DO #2.3.1" section inside ``lab01_ex2.cu`` and declare the device -pointers ``d_x`` and ``d_y`` of type float. Thereafter, look for the -"TO-DO #2.3.2" section and allocate the arrays on the GPU. Do not forget to copy the -content of each array from the host side! +Find the "TO-DO A2.3.1" section inside ``lab01_ex2.cu`` and declare the device pointers ``d_x`` and ``d_y`` of type float. Thereafter, look for the "TO-DO A2.3.2" section and allocate the arrays on the GPU. Do not forget to copy the content of each array from the host side! -+ Hint #1: Use the constant ``ARRAY_SIZE`` to determine the size of each array - (in bytes). -+ Hint #2: Use the memory management functions of CUDA - [https://goo.gl/8VC46E](https://goo.gl/8VC46E) and - [https://goo.gl/vcxdKi](https://goo.gl/vcxdKi). +- Hint #1: Use the constant ``ARRAY_SIZE`` to determine the size of each array (in bytes). 
+- Hint #2: Use the memory management functions of CUDA [https://goo.gl/8VC46E](https://goo.gl/8VC46E) and [https://goo.gl/vcxdKi](https://goo.gl/vcxdKi). --- -After the arrays have been allocated on the GPU and its content transferred -from the CPU side, we are going to finish setting up the launch of the kernel -inside the ``main()`` function. The last steps would be to execute the kernel using -the grid and block dimensions that you defined earlier. We must also set the -device pointers ``d_x`` and ``d_y`` that you just allocated and filled with the content -from the host. Once again, check ``lab01_ex1.cu`` as reference. +After the arrays have been allocated on the GPU and their content transferred from the CPU side, we are going to finish setting up the launch of the kernel inside the ``main()`` function. The last steps would be to execute the kernel using the grid and block dimensions that you defined earlier. We must also set the device pointers ``d_x`` and ``d_y`` that you just allocated and filled with the content from the host. Once again, check ``lab01_ex1.cu`` as reference. --- **TO-DO [A2.4]** -Find the "TO-DO #2.4" section inside ``lab01_ex2.cu`` and introduce the -necessary changes to launch the SAXPY kernel. Assume that the name of the -kernel is ``gpu_saxpy()`` and that the input parameters follow the interface of -the CPU version, but using ``d_x`` and ``d_y`` instead. +Find the "TO-DO A2.4" section inside ``lab01_ex2.cu`` and introduce the necessary changes to launch the SAXPY kernel. Assume that the name of the kernel is ``gpu_saxpy()`` and that the input parameters follow the interface of the CPU version, but using ``d_x`` and ``d_y`` instead. -+ Hint #1: The triple-bracket ``<<<>>>`` notation is always required when calling a CUDA kernel. -+ Hint #2: Constants can be passed by value to a CUDA kernel, without any additional changes. +- Hint #1: The triple-bracket ``<<<>>>`` notation is always required when calling a CUDA kernel. +- Hint #2: Constants can be passed by value to a CUDA kernel, without any additional changes. --- -After we have defined the allocations on the GPU and established the launch of -the SAXPY kernel, we will introduce the last two changes to the ``main()`` -function. The first one is to copy the result of the kernel from the GPU to the -host. Following the CPU implementation, assume that the result will be stored -on the ``d_y`` array. You can use the same ``cudaMemcpy()`` memory copy function as in -the previous steps, but making sure in this case that the order of the copy is -reversed. For the second change, we ask you to release the memory of each -device pointer at the end of the ``main()`` function. The source code currently -only releases the x and y arrays. +After we have defined the allocations on the GPU and established the launch of the SAXPY kernel, we will introduce the last two changes to the ``main()`` function. The first one is to copy the result of the kernel from the GPU to the host. Following the CPU implementation, assume that the result will be stored on the ``d_y`` array. You can use the same ``cudaMemcpy()`` memory copy function as in the previous steps, but making sure in this case that the order of the copy is reversed. For the second change, we ask you to release the memory of each device pointer at the end of the ``main()`` function. The source code currently only releases the x and y arrays.
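+Before tackling these last steps, the following sketch shows how the pieces typically fit together in a CUDA C program. It uses generic placeholder names (``my_kernel``, ``h_buf``, ``d_buf``, ``N`` and the block size of 256 are illustrative assumptions, not the SAXPY solution of this exercise):
+
+```
+#include <cstdlib>
+
+#define N 1024
+
+// Placeholder kernel that doubles each element (for illustration only).
+__global__ void my_kernel(int n, float *d_buf)
+{
+    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index in the X direction
+    if (i < n)                                      // guard against out-of-bounds accesses
+        d_buf[i] *= 2.0f;
+}
+
+int main()
+{
+    float *h_buf = (float *)calloc(N, sizeof(float));                     // host array
+    float *d_buf = NULL;                                                  // device pointer
+
+    cudaMalloc((void **)&d_buf, N * sizeof(float));                       // allocate on the GPU
+    cudaMemcpy(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice);  // copy host -> device
+
+    dim3 grid((N + 255) / 256);                                           // enough blocks to cover all N elements
+    dim3 block(256);
+    my_kernel<<<grid, block>>>(N, d_buf);
+
+    cudaMemcpy(h_buf, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost);  // copy device -> host (order reversed)
+
+    cudaFree(d_buf);                                                      // release the device memory
+    free(h_buf);
+    return 0;
+}
+```
+
+Note how the second ``cudaMemcpy()`` simply swaps the destination and the source (and uses ``cudaMemcpyDeviceToHost``), and how its implicit synchronization means no extra ``cudaDeviceSynchronize()`` is needed after the kernel launch.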
--- -**TO-DO [2.5]** +**TO-DO [A2.5]** -Find the "TO-DO #2.5.1" section inside ``lab01_ex2.cu`` and copy the result -from ``d_y`` to ``y``. After this, find the "TO-DO #2.5.2" section and release -the device arrays ``d_x`` and ``d_y`` before the end of the ``main()`` function. +Find the "TO-DO A2.5.1" section inside ``lab01_ex2.cu`` and copy the result from ``d_y`` to ``y``. After this, find the "TO-DO A2.5.2" section and release the device arrays ``d_x`` and ``d_y`` before the end of the ``main()`` function. -+ Hint #1: The order of the copy with ``cudaMemcpy()`` is reversed. This also - applies to the input parameters! -+ Hint #2: You must use [https://goo.gl/zVjbeR](https://goo.gl/zVjbeR) to - release each array. +- Hint #1: The order of the copy with ``cudaMemcpy()`` is reversed. This also applies to the input parameters! +- Hint #2: You must use [https://goo.gl/zVjbeR](https://goo.gl/zVjbeR) to release each array. --- -**Now that everything is set-up, the last change is to declare and define the -CUDA kernel**. This represents 90% of the effort while developing GPU code, and -it is one of the main reason why we preferred to leave this big effort for the -last step, just so that you can consolidate your basic skills on CUDA. The rest -of the code that you added (i.e., setting up the kernel) is always going to be -very similar from application to application. +**Now that everything is set-up, the last change is to declare and define the CUDA kernel**. This represents 90% of the effort while developing GPU code, and it is one of the main reason why we preferred to leave this big effort for the last step, just so that you can consolidate your basic skills on CUDA. The rest of the code that you added (i.e., setting up the kernel) is always going to be very similar from application to application. -To make it simpler for you, let us split the implementation of the kernel in -two **TO-DO** steps. The first one is to declare the SAXPY kernel. We will -call it ``gpu_saxpy()``. +To make it simpler for you, let us split the implementation of the kernel in two **TO-DO** steps. The first one is to declare the SAXPY kernel. We will call it ``gpu_saxpy()``. --- -**TO-DO [2.6]** +**TO-DO [A2.6]** -Find the "TO-DO #2.6" section inside ``lab01_ex2.cu`` and declare an empty -``gpu_saxpy()`` kernel. Use the interface of the CPU version as reference. +Find the "TO-DO A2.6" section inside ``lab01_ex2.cu`` and declare an empty ``gpu_saxpy()`` kernel. Use the interface of the CPU version as reference. -+ Hint #1: Do not forget that CUDA kernels require a special keyword to - differentiate from CPU functions. -+ Hint #2: The primitive types of CUDA are equivalent to the primitive types of - plain C on the CPU (e.g., ``float``). +- Hint #1: Do not forget that CUDA kernels require a special keyword to differentiate from CPU functions. +- Hint #2: The primitive types of CUDA are equivalent to the primitive types of plain C on the CPU (e.g., ``float``). --- -With the CUDA kernel declared, let us now implement the equivalent version of -SAXPY on the GPU, based on the original CPU implementation. For this, consider -the following: - -1. You have to assume that thousands of independent threads will call this - kernel. In fact, the key for a GPU kernel is to define massively parallel - work. -2. You have to define a way to index the data being processed by the current - thread in the kernel. Remember, you are splitting up the data into a grid of - blocks of threads. -3. 
You have to guarantee that no thread accesses out-of-bounds data. If you - defined more threads than elements per array (you might!), make sure all the - accesses are correct. - -With these few key-points in mind, here it comes the hardest part of the -exercise: implementing the CUDA version of SAXPY. Starting from the CPU -version, we ask you to calculate the index of the thread and to operate on the -data following the same SAXPY model as before. The output must be stored on -``d_y``. You can use ``threadIdx`` to understand the ID of the thread inside the block, -``blockIdx`` to understand the ID of the block that the thread belongs to, and -``blockDim`` to obtain the number of threads per block. Remember, we are operating -__on the X direction only__. +With the CUDA kernel declared, let us now implement the equivalent version of SAXPY on the GPU, based on the original CPU implementation. For this, consider the following: + +1. You have to assume that thousands of independent threads will call this kernel. In fact, the key for a GPU kernel is to define massively parallel work. +2. You have to define a way to index the data being processed by the current thread in the kernel. Remember, you are splitting up the data into a grid of blocks of threads. +3. You have to guarantee that no thread accesses out-of-bounds data. If you defined more threads than elements per array (you might!), make sure all the accesses are correct. + +With these few key-points in mind, here it comes the hardest part of the exercise: implementing the CUDA version of SAXPY. Starting from the CPU version, we ask you to calculate the index of the thread and to operate on the data following the same SAXPY model as before. The output must be stored on ``d_y``. You can use ``threadIdx`` to understand the ID of the thread inside the block, ``blockIdx`` to understand the ID of the block that the thread belongs to, and ``blockDim`` to obtain the number of threads per block. Remember, we are operating __on the X direction only__. --- -**TO-DO [2.7]** +**TO-DO [A2.7]** -Inside the ``gpu_saxpy()`` kernel declared in ``lab01_ex2.cu``, implement the GPU -version of SAXPY by calculating the index of the thread and performing the -computations for the specific elements that "belong" to the thread. -Store the result on ``d_y``. It is expected that you introduce an out-of-bounds -check, based on the input parameter "n". +Inside the ``gpu_saxpy()`` kernel declared in ``lab01_ex2.cu``, implement the GPU version of SAXPY by calculating the index of the thread and performing the computations for the specific elements that "belong" to the thread. Store the result on ``d_y``. It is expected that you introduce an out-of-bounds check, based on the input parameter "n". -+ Hint #1: If you are considering to use a loop, think twice! -+ Hint #2: Even though branches are costly on the GPU, do not worry, you can - safely use an if-statement. +- Hint #1: If you are considering to use a loop, think twice! +- Hint #2: Even though branches are costly on the GPU, do not worry, you can safely use an if-statement. --- -At this point, you have now completed most of the complexity of this exercise. -The last part is to evaluate if your code really works as expected. For that, -we ask you to compile it with ``nvcc`` and solve any issues that the compiler -might report, if any. +At this point, you have now completed most of the complexity of this exercise. The last part is to evaluate if your code really works as expected. 
For that, we ask you to compile it with ``nvcc`` and solve any issues that the compiler might report, if any. > Do not forget the ``-arch=sm_30`` architecture flag when compiling. -Thereafter, request a compute node on Tegner with ``salloc`` -and run your code with ``srun``. Keep in mind that you also need to provide the -value of the constant "a" to the executable (e.g., 2.0 is fine): +Thereafter, request a compute node on Tegner with ``salloc`` and run your code with ``srun``. Keep in mind that you also need to provide the value of the constant "a" to the executable (e.g., 2.0 is fine): ``` srun -n 1 ./lab01_ex2.out 2.0 @@ -513,116 +241,52 @@ If everything worked as expected, you should only see the following output: Execution finished (error=0.000000). ``` -If you managed to define the kernel and get exactly this output, well done! -This is a great achievement, congratulations! If you get something different, -such as an error message reporting that the solution is incorrect, quickly -review all the "**TO-DO**" steps of the exercise from the beginning to make sure -that you did not miss anything. Feel free to ask us if you are lost, we are -here to help. - -In the next block of exercises, we are going to extend this basic notion to -perform some more advanced computation over images. For now, enjoy the weekend! - -## Bonus Exercises - -In this section, we provide you with additional exercises with the purpose of -getting deeper into CUDA optimizations. These exercises are optional, but we -consider that advanced users might be interested in understanding how they -could improve the performance of their applications. - -### Exercise B1 - Measuring Execution Time - -Inside ``lab01_ex2.cu``, we ask you to measure the execution time of each -implementation of SAXPY. The main purpose is to understand how the performance -varies between the CPU and the GPU version. This is important for you to -understand the main bottlenecks while developing your GPU-accelerated -applications. For instance, it is probable that a considerable amount of time -on the GPU version is dedicated to transfer the data from the host to the GPU, -and back. Moreover, it can also be feasible that the CPU version is faster if -the problem size is not big enough to compensate the previous fact. - -You can use ``gettimeofday()`` to obtain the current timestamp in microseconds (see -[https://goo.gl/xMv177](https://goo.gl/xMv177)). The source code already -contains the definition of a ``get_elapsed()`` function that calculates the elapsed -time between two given ``tval`` values, and converts the returned value to -milliseconds. The definition of ``tval`` is on the beginning of the file. In -addition, you can use ``printf()`` to output the measured time on each version. - -For the GPU case, you need to measure the time **independently** by dividing the -measurement in three steps. First, consider the time dedicated to transfer the -input arrays from the host the GPU. Thereafter, measure the execution of the -kernel. Lastly, measure the data transfer of the result from the GPU to the -host. This way, we consider the real overall execution time dedicated to run -SAXPY on the GPU (i.e., not just the kernel execution, which will be -incorrect). +If you managed to define the kernel and get exactly this output, well done! This is a great achievement, congratulations! 
If you get something different, such as an error message reporting that the solution is incorrect, quickly review all the "**TO-DO**" steps of the exercise from the beginning to make sure that you did not miss anything. Feel free to ask us if you are lost, we are here to help. ---- -**TO-DO [B1.1]** +In the next block of exercises, we are going to extend this basic notion to perform some more advanced computation over images. For now, enjoy the weekend! + +## Bonus exercises -Inside ``lab01_ex2.cu``, introduce the necessary changes to measure the -execution time of the CPU and GPU implementations of SAXPY. For the GPU -version, make sure that you consider the execution of the kernel, alongside any -data transfers performed. +In this section, we provide you with additional exercises with the purpose of getting deeper into CUDA optimizations. These exercises are optional, but we consider that advanced users might be interested in understanding how they could improve the performance of their applications. -+ Hint #1: Keep in mind that the kernel execution is asynchronous. Hence, you - have to guarantee that the kernel has already run on the GPU, before - measuring the transfer of the result back to the host. +### Exercise A3: Measuring execution time + +Inside ``lab01_ex2.cu``, we ask you to measure the execution time of each implementation of SAXPY. The main purpose is to understand how the performance varies between the CPU and the GPU version. This is important for you to understand the main bottlenecks while developing your GPU-accelerated applications. For instance, it is probable that a considerable amount of time on the GPU version is dedicated to transfer the data from the host to the GPU, and back. Moreover, it can also be feasible that the CPU version is faster if the problem size is not big enough to compensate the previous fact. + +You can use ``gettimeofday()`` to obtain the current timestamp in microseconds (see [https://goo.gl/xMv177](https://goo.gl/xMv177)). The source code already contains the definition of a ``get_elapsed()`` function that calculates the elapsed time between two given ``tval`` values, and converts the returned value to milliseconds. The definition of ``tval`` is on the beginning of the file. In addition, you can use ``printf()`` to output the measured time on each version. + +For the GPU case, you need to measure the time **independently** by dividing the measurement in three steps. First, consider the time dedicated to transfer the input arrays from the host the GPU. Thereafter, measure the execution of the kernel. Lastly, measure the data transfer of the result from the GPU to the host. This way, we consider the real overall execution time dedicated to run SAXPY on the GPU (i.e., not just the kernel execution, which will be incorrect). --- +**TO-DO [A3.1]** -Is the GPU implementation faster? Probably you might be surprised by now that -is not that much faster, at least from what we could have expected. The reason -is that the cost of transferring the data between the host and the GPU is -relatively high, as you have observed. - -One approach to overcome (or hide) this limitation, is to pipeline the data -transfers alongside the kernel execution. This means that, while the data -required for the next kernel is being transferred, we are keeping the GPU busy -by simultaneously allowing the execution of other kernels. This maximizes the -throughput and efficiently takes advantage of the GPU power. 
- -CUDA Streams can be used to effectively enqueue work that the GPU will -concurrently handle. Even though this is out-of-the-scope of this introductory -course, we encourage you to read the following article in the future if you are -interested: [https://goo.gl/pJn7cR](https://goo.gl/pJn7cR) - -### Exercise B2 - Comparing the Thread Block Performance - -Deciding the size or dimension of the thread block for the execution of a CUDA -kernel is not trivial. One of the main reasons is that the performance of the -different block sizes usually depend on the underlying hardware. In addition, -your application can also affect the different characteristics of the GPU. For -instance, modern GPUs execute kernels in groups of 32 threads, called warps. -Knowing this fact, it usually makes sense to always try to use multiples of -this value to optimize the occupancy of the GPU as much as possible. - -In this exercise, we ask you to evaluate the performance of your SAXPY -implementation by varying the block size from 1, 2, 4, 8, ..., up to 512 (i.e., -using multiples of 2). We also request you to avoid the ``BLOCK_SIZE`` constant and -define a mechanism that allows you to vary this parameter without re-compiling -the program. Use the code that already exists for the constant "a" of -SAXPY, as reference. - -> The only difference is that you are expecting an integer value instead. See -> [https://goo.gl/ek3boh](https://goo.gl/ek3boh). +Inside ``lab01_ex2.cu``, introduce the necessary changes to measure the execution time of the CPU and GPU implementations of SAXPY. For the GPU version, make sure that you consider the execution of the kernel, alongside any data transfers performed. + +- Hint #1: Keep in mind that the kernel execution is asynchronous. Hence, you have to guarantee that the kernel has already run on the GPU, before measuring the transfer of the result back to the host. --- -**TO-DO [B1.2]** -Inside ``lab01_ex2.cu``, introduce the necessary changes to allow the block -dimension to be defined by parameter. Thereafter, measure the execution time of -the SAXPY implementation on the GPU by varying the size from 1, 2, 4, 8, ..., -up to 512, using only multiples of 2. +Is the GPU implementation faster? Probably you might be surprised by now that is not that much faster, at least from what we could have expected. The reason is that the cost of transferring the data between the host and the GPU is relatively high, as you have observed. + +One approach to overcome (or hide) this limitation, is to pipeline the data transfers alongside the kernel execution. This means that, while the data required for the next kernel is being transferred, we are keeping the GPU busy by simultaneously allowing the execution of other kernels. This maximizes the throughput and efficiently takes advantage of the GPU power. + +CUDA Streams can be used to effectively enqueue work that the GPU will concurrently handle. Even though this is out-of-the-scope of this introductory course, we encourage you to read the following article in the future if you are interested: [https://goo.gl/pJn7cR](https://goo.gl/pJn7cR) -+ Hint #1: Can you guess what are the consequences if the block size is below - 32 or not a multiple? +### Exercise A4: Comparing the thread block performance + +Deciding the size or dimension of the thread block for the execution of a CUDA kernel is not trivial. One of the main reasons is that the performance of the different block sizes usually depend on the underlying hardware. 
In addition, your application can also affect the different characteristics of the GPU. For instance, modern GPUs execute kernels in groups of 32 threads, called warps. Knowing this fact, it usually makes sense to always try to use multiples of this value to optimize the occupancy of the GPU as much as possible. + +In this exercise, we ask you to evaluate the performance of your SAXPY implementation by varying the block size from 1, 2, 4, 8, ..., up to 512 (i.e., using multiples of 2). We also request you to avoid the ``BLOCK_SIZE`` constant and define a mechanism that allows you to vary this parameter without re-compiling the program. Use the code that already exists for the constant "a" of SAXPY, as reference. + +> The only difference is that you are expecting an integer value instead. See [https://goo.gl/ek3boh](https://goo.gl/ek3boh). --- +**TO-DO [A4.2]** -Most of the time, the block size is a combination of previous experience -working with GPUs and empirical evaluations. NVIDIA provides an Excel file -named the "CUDA Occupancy Calculator" that provides an overview of what -would be the optimal occupancy of your GPU based on the architecture, the block -size and other parameters. The file can be downloaded from the following link: -[https://goo.gl/mJm4B8](https://goo.gl/mJm4B8) +Inside ``lab01_ex2.cu``, introduce the necessary changes to allow the block dimension to be defined by parameter. Thereafter, measure the execution time of the SAXPY implementation on the GPU by varying the size from 1, 2, 4, 8, ..., up to 512, using only multiples of 2. + +- Hint #1: Can you guess what are the consequences if the block size is below 32 or not a multiple? + +--- +Most of the time, the block size is a combination of previous experience working with GPUs and empirical evaluations. NVIDIA provides an Excel file named the "CUDA Occupancy Calculator" that provides an overview of what would be the optimal occupancy of your GPU based on the architecture, the block size and other parameters. The file can be downloaded from the following link: [https://goo.gl/mJm4B8](https://goo.gl/mJm4B8) diff --git a/lab_2/C/lab02_ex3_6.cu b/lab_2/C/lab02_ex3_6.cu index ab78190085cbe4bcaa49b6add29c5a867b0401fa..60a5c676eb961d039ae28819aa08df050e4771dc 100644 --- a/lab_2/C/lab02_ex3_6.cu +++ b/lab_2/C/lab02_ex3_6.cu @@ -177,7 +177,7 @@ void cpu_grayscale(int width, int height, float *image, float *image_out) __global__ void gpu_grayscale(int width, int height, float *image, float *image_out) { //////////////// - // TO-DO #4.2 ///////////////////////////////////////////// + // TO-DO B2.2 ///////////////////////////////////////////// // Implement the GPU version of the grayscale conversion // /////////////////////////////////////////////////////////// } @@ -209,7 +209,7 @@ float cpu_applyFilter(float *image, int stride, float *matrix, int filter_dim) __device__ float gpu_applyFilter(float *image, int stride, float *matrix, int filter_dim) { //////////////// - // TO-DO #5.2 //////////////////////////////////////////////// + // TO-DO B3.2 //////////////////////////////////////////////// // Implement the GPU version of cpu_applyFilter() // // // // Does it make sense to have a separate gpu_applyFilter()? 
// @@ -297,7 +297,7 @@ void cpu_sobel(int width, int height, float *image, float *image_out) __global__ void gpu_sobel(int width, int height, float *image, float *image_out) { //////////////// - // TO-DO #6.1 ///////////////////////////////////// + // TO-DO B4.1 ///////////////////////////////////// // Implement the GPU version of the Sobel filter // /////////////////////////////////////////////////// } diff --git a/lab_2/CUDA_Lab02.pdf b/lab_2/CUDA_Lab02.pdf deleted file mode 100644 index 812169d48873fc232ad51de0c1fc4cf88ab50abb..0000000000000000000000000000000000000000 Binary files a/lab_2/CUDA_Lab02.pdf and /dev/null differ diff --git a/lab_2/Fortran/lab02_ex3_6.cuf b/lab_2/Fortran/lab02_ex3_6.cuf index c36a463749fe79aa177ea4714933d0b7229bfe39..fa8c0ec778b1d248503d1828110fca98269b28c9 100644 --- a/lab_2/Fortran/lab02_ex3_6.cuf +++ b/lab_2/Fortran/lab02_ex3_6.cuf @@ -7,7 +7,7 @@ contains !attributes(global) subroutine gpu_grayscale(width, height, image, image_out) ! implicit none !//////////////// - !// TO-DO #4.2 ///////////////////////////////////////////// + !// TO-DO B2.2 ///////////////////////////////////////////// !// Implement the GPU version of the grayscale conversion // !/////////////////////////////////////////////////////////// !end subroutine gpu_grayscale @@ -52,7 +52,7 @@ contains integer, value :: filter_dim real, intent(out) :: pixel_out !//////////////// - !// TO-DO #5.2 //////////////////////////////////////////////// + !// TO-DO B3.2 //////////////////////////////////////////////// !// Implement the GPU version of cpu_applyFilter() // !// // !// Does it make sense to have a separate gpu_applyFilter()? // @@ -95,7 +95,7 @@ contains !attributes(global) subroutine gpu_sobel(width, height, image, image_out) ! implicit none !//////////////// - !// TO-DO #6.1 ///////////////////////////////////// + !// TO-DO B4.1 ///////////////////////////////////// !// Implement the GPU version of the Sobel filter // !/////////////////////////////////////////////////// !end subroutine gpu_sobel diff --git a/lab_2/README.md b/lab_2/README.md index 0bd61e2445f105f6635f1686db83aae012ace292..c38623114734982c055070517746df704c9998eb 100644 --- a/lab_2/README.md +++ b/lab_2/README.md @@ -1,167 +1,59 @@ -# CUDA Laboratory 2 - -_Introduction to High-Performance Computing_ +# PDC Summer School 2021: CUDA Laboratory 2 <img src="html/cover.jpeg" alt="BGR" width="800px"/> ## Introduction -In this second laboratory about GPU programming in CUDA, we are going to -continue building your skills in order to develop more advanced GPU-accelerated -applications. As a friendly reminder, the laboratory is divided in two -different blocks and exercises: +In this second laboratory about GPU programming in CUDA, we are going to continue building your skills in order to develop more advanced GPU-accelerated applications. -+ **Block A (Friday / 18<sup>th</sup> of August)**: The first block provided - you with a first notion on how to use CUDA. This includes how to compile a - CUDA program, how to launch a CUDA kernel, how to index 1D arrays, and more. +**We assume that everyone have already finished the exercises of the first lab**, i.e. exercise A1 to A2 (and ideally also the bonus exercises A3 and A4). -+ **Block B (Monday / 20<sup>th</sup> of August)**:The second block presents a - use-case for using GPUs with images. The goal is to make you understand how - to index 2D matrices, while doing something practical. +For all the exercises, we are going to use Tegner. 
**Even though you may have a laptop with a CUDA-supported GPU, we encourage you to use Tegner during the labs.** For the purposes of this laboratory, we will use the "Thin Nodes" of Tegner.

-**We assume that everyone have already finished the exercises of the first block -of the laboratory session**. If you have not attempted these exercises, it will -be relatively difficult to complete the second block of exercises that we are -presenting today. The reason is that you should already understand most of the -basic concepts of CUDA presented in the first block. For instance, by now, you -must know how to compile a CUDA program, how to manage the memory of the GPU, -and (more importantly) how to implement and launch a CUDA kernel. If none of -these requirements seem familiar, please, try the first block of exercises -before proceeding!
+For the second lab, we are going to play with images and perform some basic image processing to create the base for an edge detector, such as [Canny](https://en.wikipedia.org/wiki/Canny_edge_detector). Our goal is to make you understand how to index 2D matrices, while doing something fun and practical. As a matter of fact, the results of the exercises below represent some of the main image processing techniques used in Computer Vision that allow for object and text recognition. If you would like to get a feeling for what the final output will look like, check the cover of this document!
+Once again, we encourage you to really understand the concepts explained within the first block of exercises. If you do, then this block will be easy for you to solve, as you will see.

-Once again, we are going to use Tegner for the set of exercises.
+## Exercise B1: Experimental setup

-> Even though you may have a laptop with a CUDA-supported GPU, we encourage you -> to use Tegner during the sessions of today and Monday.
+Once you are connected to Tegner, copy the exercise folder ``lab_2`` including the sub-folders. We will use the file ``lab02.bmp``, located inside the images folder. This file is stored using the Bitmap (BMP) image format, a popular uncompressed format widely used by the Windows operating system. Each BMP file contains an encoded header that specifies the ``{width, height}`` of the image, the number of bits per color plane, and more. After the header, a subsequent string of interleaved color values follows (e.g., in BGR). Here is a simplified example of what a 3x3 image looks like inside the file:

-Tegner is a cluster at KTH Royal Institute of Technology with 65 heterogeneous -nodes. Every node has one or two GPUs, based on the NVIDIA Quadro K420 or the -NVIDIA Tesla K80. For the purposes of this laboratory, we will use the "Thin -Nodes" of Tegner. These nodes contain 2×12-core Intel E5-2690v3 CPUs at -2.60GHz, 512GB DRAM and NVIDIA Quadro K420 GPU per node. More information can -be found -[here](https://www.pdc.kth.se/hpc-services/computing-systems/tegner-1.737437).

<img src="html/bgr.png" alt="BGR" width="800px"/>

-In case you need help or you do not understand some of the topics of CUDA -presented in this laboratory session, do not worry, we are here to help you -out!
+Each BGR component (Blue / Green / Red) is an 8-bit value that encodes the intensity of the corresponding channel. The values span from 0 to 255 in the case of BMP 24bpp, where 0 means no contribution from that color and 255 means full intensity.

-## Block B
+> Other Bitmap formats, such as BMP 32bpp, can contain an extra Alpha channel for transparency.
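As a small illustration of this interleaved layout, the pixel at column ``x`` and row ``y`` of a ``width x height`` BGR image stored as one flat buffer could be read as shown below. This is a hypothetical, simplified snippet (not code from the lab sources); it ignores details such as the BMP header and the padding of each row to a multiple of four bytes.

```
/* Hypothetical helper, not part of the lab sources: read one BGR pixel from an
 * interleaved width x height image buffer (three 8-bit values per pixel). */
void read_bgr_pixel(const unsigned char *data, int width, int x, int y,
                    unsigned char *b, unsigned char *g, unsigned char *r)
{
    int offset = (y * width + x) * 3;  /* three channel values per pixel */
    *b = data[offset + 0];
    *g = data[offset + 1];
    *r = data[offset + 2];
}
```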
- -For the second block of exercises, we are going to play with images and perform -some basic image processing to create the base for an edge detector, such as -[Canny](https://en.wikipedia.org/wiki/Canny_edge_detector). Our goal is to make -you understand how to index 2D matrices, while doing something fun and -practical. As a matter of fact, the results of the exercises below represent -some of the main image processing techniques used in Computer Vision that allow -for object and text recognition. If you would like to get a feeling on how the -final output would look like, check the cover of this document!
+This means that we could create a completely white image by setting all the pixel values to 255 or, at the opposite extreme, a completely black image by setting them all to 0. One peculiarity of BMP files is that the pixel values are commonly stored bottom-up (i.e., as if the image were flipped vertically).

-Once again, we encourage you to really understand the concepts explained within -the first block of exercises. If you do, then this block will be easy for you -to solve, as you will see.
+Inside ``lab02_ex3_6.cu``, we provide the functionality to read BMP 24bpp images and retrieve the pixel data in BGR format as single-precision floating-point values (suitable for the GPU). We also provide other functionality, such as a function to store BMP 8bpp files for Black & White images. **For this block of exercises, you will _not_ need to handle any of these aspects. We will only ask you to implement the kernels that perform the different image processing steps that will be applied to the reference image.** Everything else, including the set-up of CUDA in the ``main()`` function, is already defined inside ``lab02_ex3_6.cu``.

-### Exercise 3 - Experimental Setup
+One important aspect is to connect to Tegner using the ``-Y`` flag to enable X11 forwarding support. This flag is required because we are going to display the resultant images using the [ImageMagick](https://www.imagemagick.org/) suite. This is a free and open-source software suite for displaying, converting, and editing raster and vector image files. The suite contains multiple interfaces (e.g., a C API).

-We will use a different CUDA source code file to implement the exercises below. -This source code file can be cloned from the repository
+In this laboratory session, we will use the terminal interface of ImageMagick to render the images generated by each exercise. Thus, we effectively avoid copying the images with ``scp``. For instance, let us resize the reference ``lab02.bmp`` image and display its content remotely. First, connect to Tegner with the ``-Y`` flag and go to the folder that contains the exercises. Then, execute the following command:

```
-cd /cfs/klemming/nobackup/your_initial/your_username
-module load git
-git clone https://github.com/PDC-support/cuda-lab-exercises.git
-cd cuda-lab-exercises/lab_2/C
+display -resize 1280x720 images/lab02.bmp
```

-We also need a reference image file to process on each step. We will use the -file ``lab02.bmp``, located inside the images folder. This file is stored using the -Bitmap (BMP) image format, a popular uncompressed format widely used by the -Windows operating system. Each BMP file contains an encoded header that -specifies the ``{width, height}`` of the image, the number of bits per color plane, -and more. After the header, a subsequent string of interleaved color values -follow (e.g., in BGR).
Here is a simplified example of how a 3x3 image looks -like inside the file: +If everything worked as expected, you must see a new window after a few seconds (be patient): -<img src="html/bgr.png" alt="BGR" width="800px"/> +<img src="html/fig1.jpeg" alt="Figure 1" width="800px"/> -Each BGR, from Blue / Green / Red, represents an 8-bit pixel value in the image -that encodes the intensity of each channel. The values span from 0 to 255 in -the case of BMP 24bpp2, being 0 the absence of representation by this color and -255 the full representation. - -> Other Bitmap formats, such as BMP 32bpp, can contain an extra Alpha channel -> for transparency. - -This means that we could create a completely white image by setting all the -pixel values to 255, or the opposite, a completely black image setting them to 0. -One aspect of BMP files is that it is common to encounter that the pixel -values are stored bottom-up (i.e., as if the image pixels were flipped). - -We provide you inside ``lab02_ex3_6.cu`` the functionality to read BMP 24bpp images -and retrieve the pixel data in BGR format, as single-precision floating point -values (suitable for the GPU). We also provide other set of functionality, such -as a function to store BMP 8bpp files for Black & White images. **For this block -of exercises, you will _not_ need to handle any of these aspects. We will only -ask you to implement the kernels that perform the different image processing -steps that will be applied to the reference image.** Everything else, including -the set-up of CUDA in the ``main()`` function, is already defined inside -``lab02_ex3_6.cu``. - -One important aspect is to connect to Tegner using the ``-Y`` flag to enable X11 -forwarding support. This flag is required because we are going to display the -resultant images using the [ImageMagick](https://www.imagemagick.org/) suite. -This is a free and open-source software suite for displaying, converting, and -editing raster image and vector image files. The suite contains multiple -interfaces (e.g., C API). - -In this laboratory session, we will use the terminal interface of ImageMagick -to render the images generated by each exercise. Thus, we effectively avoid to -copy the images using ``scp``. For instance, let us resize the reference ``lab02.bmp`` -image and display its content remotely. First, connect to Tegner with the ``-Y`` -flag and access to the folder that contain the exercises. Then, execute the -following command: +**You are now ready to begin with the exercises!** If you did not get a new window, please, ask any of the laboratory assistants for help. -``` -display -resize 1280x720 images/lab02.bmp -``` -If everything worked as expected, you must see a new window after a few seconds -(be patient): +## Exercise B2: Black & white image conversion -<img src="html/fig1.jpeg" alt="Figure 1" width="800px"/> +One of the frequent initial steps in the development of an edge detector is to discard the color information and work directly in black & white. The idea is to keep only the intensity of the pixels. For instance, if we consider a [YUV](https://en.wikipedia.org/wiki/YUV) color space, which is very common in video streams, one can easily work only on the Y plane and discard the color information. The reason is that the Y plane (luminance) contains the intensity of the pixel values, which represents the main content of the image. 
The UV planes (chrominance) define the color or tint of the pixels, but they do not necessarily add value to the features that we want to extract from the image.
+
+The BMP image ``lab02.bmp`` is encoded using a BGR color space, where the combination of the individual intensities of each color value represents the final intensity of the specific pixel. Therefore, the first step for our base edge detector is to combine these values in order to generate a BMP 8bpp image in grayscale. In other words, we want only 8 bits per pixel.

-For the conversion to grayscale, we are going to use the -[Colorimetric](https://en.wikipedia.org/wiki/Grayscale) (luminance-preserving) -method. This conversion guarantees that both the original and converted image -maintains the same absolute luminance. In practice terms, what we are going to -do is to take each BGR value of the Bitmap file and apply the following -conversion using the weighted sum of the three values:
+For the conversion to grayscale, we are going to use the [Colorimetric](https://en.wikipedia.org/wiki/Grayscale) (luminance-preserving) method. This conversion guarantees that both the original and the converted image maintain the same absolute luminance. In practical terms, we are going to take each BGR value of the Bitmap file and apply the following conversion, a weighted sum of the three values:

<img src="html/yuv.png" alt="YUV" width="800px"/>

---
-**TO-DO [4.1]**
+**TO-DO [B2.1]**

Open the file ``lab02_ex3_6.cu`` with your preferred text editor and briefly examine the overall content of the file. Pay particular attention to the @@ -185,9 +77,9 @@ subsequent operations. We now ask you to implement the same ``cpu_greyscale()`` function, but using a GPU kernel in CUDA instead.

---
-**TO-DO [4.2]**
+**TO-DO [B2.2]**

-Find the declaration of ``gpu_greyscale()`` in ``lab02_ex3_6.cu`` and implement the GPU +Find the declaration of ``gpu_grayscale()`` in ``lab02_ex3_6.cu`` and implement the GPU version of the black & white color conversion filter. The source code is already set-up to call the kernel and generate the output, but you will need to uncomment the code inside ``main()``.
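If you are unsure how to map threads to pixels, the usual 2D indexing pattern looks roughly like the sketch below. This is a generic, hypothetical kernel operating on a single-channel image, not the expected ``gpu_grayscale()`` solution.

```
// Generic 2D indexing sketch (hypothetical kernel): one thread per pixel,
// with a bounds check for the partial blocks at the image borders.
__global__ void gpu_example_2d(int width, int height, const float *in, float *out)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row index

    if (x < width && y < height)
    {
        out[y * width + x] = in[y * width + x];     // plain copy, as a placeholder
    }
}
```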
@@ -222,7 +114,7 @@ You must get a new window that displays the converted image in black & white, su <img src="html/fig2.jpeg" alt="Figure 2" width="800px"/> -### Exercise 5 - Applying a Convolution Filter +### Exercise B3: Applying a convolution filter Converting the input image to black & white was a very good first step towards implementing our edge detector. For this exercise, we are going to apply a @@ -238,7 +130,7 @@ image noise. The reason is that the filter effectively reduces the high-frequency components of a given image. We need this filter as an intermediate step towards increasing the quality of -the result of [Exercise 6](#ex6), where we will apply a Sobel filter to define the +the result of [Exercise B4](#exb4), where we will apply a Sobel filter to define the edges of the image (i.e., the Sobel filter is very sensitive to noise). **For this exercise, we are going to apply a Gaussian filter using a 3×3 convolution matrix on all the pixels of the image.** The term convolution is the @@ -257,7 +149,7 @@ convolution matrix and apply the weights with the surrounding pixels. As we use symmetric filters, the order can be top-bottom as well. --- -**TO-DO [5.1]** +**TO-DO [B3.1]** Find the implementation of ``cpu_applyFilter()`` inside the ``lab02_ex3_6.cu`` file and try to understand how a given convolution matrix is applied to a certain pixel. @@ -280,7 +172,7 @@ is very similar to a plain C function, with the difference that now the GPU can see this code as well. --- -**TO-DO [5.2]** +**TO-DO [B3.2]** Implement the ``gpu_applyFilter()`` in ``lab02_ex3_6.cu`` that allows to apply any kind of convolution matrix to a certain pixel. Will the GPU code differ from the CPU @@ -318,7 +210,7 @@ correctly enabled the new version of the ``xxx_applyFilter()`` inside the ``gpu_gaussian()`` kernel. Ask for help to the laboratory assistants if you cannot make any progress. -### <a name="ex6"></a>Exercise 6 - Detecting Edges in the Image +### <a name="exb4"></a>Exercise B4: Detecting edges in the image The very last step of our base edge detector is to apply the Sobel filter. With this filter, we are going to compute an approximation of the gradient of the @@ -350,7 +242,7 @@ the fact that the we apply two different convolution filters to the same pixel and combine the result. --- -**TO-DO [6.1]** +**TO-DO [B4.1]** Implement ``gpu_sobel()`` in ``lab02_ex3_6.cu`` to enable the execution of the Sobel filter on the GPU. Pay special attention to the indices used on the CPU @@ -398,71 +290,38 @@ that you enjoyed the laboratory session. ## Bonus Exercises -In this section, we provide you with additional exercises with the purpose of -getting deeper into CUDA optimizations. These exercises are optional, but we -consider that advanced users might be interested in understanding how they -could improve the performance of their applications. +In this section, we provide you with additional exercises with the purpose of getting deeper into CUDA optimizations. These exercises are optional, but we consider that advanced users might be interested in understanding how they could improve the performance of their applications. -### Exercise 7 - Optimizing Memory Accesses +### Exercise B5: Optimizing memory accesses -During the lectures, we have seen that the memory hierarchy of the GPU is rich -and complex. We can encounter different layers that vary in speed and -capacity. 
For instance, the texture memory is a very limited and special memory -that allows you for efficient access to random locations inside a texture, -which is tremendously useful in video games. +During the lectures, we have seen that the memory hierarchy of the GPU is rich and complex. We can encounter different layers that vary in speed and capacity. For instance, the texture memory is a very limited and special memory that allows you for efficient access to random locations inside a texture, which is tremendously useful in video games. -This hierarchy is also visible from a CUDA program perspective, and effectively -selecting where to place our data can make a difference in some situations. Up -until now, we have been using the Global Memory space, which is provided by -default if nothing is specified. This Global Memory offers very high-capacity -and represents the first layer we access when copying data from the CPU to the -GPU. Unfortunately, this memory features high-latency to access the data. +This hierarchy is also visible from a CUDA program perspective, and effectively selecting where to place our data can make a difference in some situations. Up until now, we have been using the Global Memory space, which is provided by default if nothing is specified. This Global Memory offers very high-capacity and represents the first layer we access when copying data from the CPU to the GPU. Unfortunately, this memory features high-latency to access the data. -In this exercise, we are going to try to optimize the GPU versions of the -Gaussian and Sobel filter by using the Shared Memory instead. The idea is to -bring the content of the image from Global Memory to Shared Memory in blocks of -size ``BLOCK_SIZE_SH``. This constant is also the dimension of each ``block`` inside -the ``grid``, plus some additional values in X and Y. +In this exercise, we are going to try to optimize the GPU versions of the Gaussian and Sobel filter by using the Shared Memory instead. The idea is to bring the content of the image from Global Memory to Shared Memory in blocks of size ``BLOCK_SIZE_SH``. This constant is also the dimension of each ``block`` inside the ``grid``, plus some additional values in X and Y. -We ask you first to declare the ``BLOCK_SIZE_SH`` constant on top of the file, which -defines the dimension of the Shared Memory block. Use the following: +We ask you first to declare the ``BLOCK_SIZE_SH`` constant on top of the file, which defines the dimension of the Shared Memory block. Use the following: ``` #define BLOCK_SIZE_SH 18 ``` -> We will provide more details of why we use 18 here and not 16, as in the -> number of threads per block. +> We will provide more details of why we use 18 here and not 16, as in the number of threads per block. -We will use this constant for the declaration of the memory space inside -``gpu_gaussian()`` and ``gpu_sobel()``. The declaration is defined in the first -or one of the first lines of each kernel: +We will use this constant for the declaration of the memory space inside ``gpu_gaussian()`` and ``gpu_sobel()``. The declaration is defined in the first or one of the first lines of each kernel: ``` __shared__ float sh_block[BLOCK_SIZE_SH * BLOCK_SIZE_SH]; ``` -This will declare a 2D shared block in Shared Memory, using the 1D array -representation that we have already discussed in the previous exercises. 
The -``__shared__`` attribute is given in the declaration to suggest the compiler that -we want this variable to be located in Shared Memory and not in Local or Global -Memory. +This will declare a 2D shared block in Shared Memory, using the 1D array representation that we have already discussed in the previous exercises. The ``__shared__`` attribute is given in the declaration to suggest the compiler that we want this variable to be located in Shared Memory and not in Local or Global Memory. -Hence, the first exercise would be to declare the shared block inside -``gpu_gaussian()`` and ``gpu_sobel()``. Then, we ask you to make each thread copy a -pixel from the input image into the shared memory block. You have to call -``__syncthreads()`` to guarantee that each thread has finished retrieving its part -of the block before using the data. Thereafter, change the input of the -``applyFilter()`` function to use the shared block instead. +Hence, the first exercise would be to declare the shared block inside ``gpu_gaussian()`` and ``gpu_sobel()``. Then, we ask you to make each thread copy a pixel from the input image into the shared memory block. You have to call ``__syncthreads()`` to guarantee that each thread has finished retrieving its part of the block before using the data. Thereafter, change the input of the ``applyFilter()`` function to use the shared block instead. --- -**TO-DO [B2.1]** +**TO-DO [B5.1]** -In ``lab02_ex3_6.cu``, declare a Shared Memory block within ``gpu_gaussian()`` and -another one within ``gpu_sobel()``. Thereafter, introduce the necessary changes to -make each thread bring one pixel value to the shared block. Change the input -parameter of applyFilter() to use the shared block (i.e., instead of a -reference to the input image directly). +In ``lab02_ex3_6.cu``, declare a Shared Memory block within ``gpu_gaussian()`` and another one within ``gpu_sobel()``. Thereafter, introduce the necessary changes to make each thread bring one pixel value to the shared block. Change the input parameter of applyFilter() to use the shared block (i.e., instead of a reference to the input image directly). + Hint #1: Use ``__syncthreads()`` to guarantee that all the threads have copied their pixels to the Shared Memory. @@ -482,29 +341,20 @@ the fact that we also have to bring extra columns and rows on one of the sides of the block. Without this change, some of the threads are accessing uninitialized data. -This is the main reason why we declared the constant ``BLOCK_SIZE_SH`` with two -additional elements per dimension. This will make sure that all the threads -within the block access data that is available inside the Shared Memory space. -As such, **the final exercise for you would be to consider the boundaries of each -thread block**. We already gave you a hint in the declaration of the constant -``BLOCK_SIZE_SH`` (i.e., two extra columns and rows are needed). +This is the main reason why we declared the constant ``BLOCK_SIZE_SH`` with two additional elements per dimension. This will make sure that all the threads within the block access data that is available inside the Shared Memory space. As such, **the final exercise for you would be to consider the boundaries of each thread block**. We already gave you a hint in the declaration of the constant ``BLOCK_SIZE_SH`` (i.e., two extra columns and rows are needed). 
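To make the loading step more concrete, here is a rough sketch of the basic, interior-only pattern described above. It uses hypothetical indexing, assumes the ``BLOCK_SIZE_SH`` constant suggested earlier, and deliberately leaves out the extra halo rows and columns requested in the next TO-DO.

```
#define BLOCK_SIZE_SH 18   // as suggested above

// Sketch only: each thread copies its own pixel into the shared tile and then
// waits for the rest of the block. Halo rows/columns are not handled here.
__global__ void gpu_example_shared(int width, int height, const float *image)
{
    __shared__ float sh_block[BLOCK_SIZE_SH * BLOCK_SIZE_SH];

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height)
    {
        sh_block[threadIdx.y * BLOCK_SIZE_SH + threadIdx.x] = image[y * width + x];
    }

    __syncthreads();  // the whole tile must be loaded before anyone reads it

    // ... the filter would now read its neighbourhood from sh_block ...
}
```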
---
-**TO-DO [B2.2]**
+**TO-DO [B5.2]**

Extend the Shared Memory version of ``gpu_gaussian()`` and ``gpu_sobel()`` to transfer part of the surrounding pixels of the thread block to Shared Memory. Make sure that you do not exceed the boundaries of the image.

-+ Hint #1: Once again, use ``__syncthreads()`` to guarantee that all the threads - have copied their pixels to the Shared Memory. You will need more than one call - to this function.
+- Hint #1: Once again, use ``__syncthreads()`` to guarantee that all the threads have copied their pixels to the Shared Memory. You will need more than one call to this function.

---

-After your implementation is completed, you will see that the execution time -has been reduced around 5-10ms, compared to the original implementation. The -output should state something as:
+After your implementation is completed, you will see that the execution time has been reduced by around 5-10ms compared to the original implementation. The output should state something like:

```
Step #1 Completed - Result stored in "images/lab02_result_1.bmp".
Step #2 Completed - Result stored in "images/lab02_result_2.bmp".
@@ -515,17 +365,6 @@ Step #3 Completed - Result stored in "images/lab02_result_3.bmp".
Elapsed CPU: 570ms / Elapsed GPU: 20ms
```

-Despite this might not seem as a major achievement, this change represents -between 15% to 30% performance improvement. In fact, in real-time rendering -such as in games, saving 5ms could make a huge difference in performance. Here, -the limit per frame is usually around 16ms for 60FPS or 33ms for 30FPS. Hence, -game developers usually fight for any slight optimization of the code that -could make them achieve these rates. - -Nonetheless, **we must note that this was a very naive implementation**. We just -wanted you to try how you could define a Shared Memory space in the GPU, as a -fine grain performance optimization. However, we did not account for other -issues, such as memory bank conflicts, that could boost the performance -considerably. In fact, we could have combined the Gaussian and Sobel filters to -exploit data locality. Advanced users might be interested in reading the -following article from NVIDIA: [https://goo.gl/1WuZGy](https://goo.gl/1WuZGy).
+Although this might not seem like a major achievement, this change represents a 15% to 30% performance improvement. In fact, in real-time rendering such as in games, saving 5ms can make a huge difference in performance. Here, the limit per frame is usually around 16ms for 60FPS or 33ms for 30FPS. Hence, game developers usually fight for any slight optimization of the code that could help them achieve these rates.
+
+Nonetheless, **we must note that this was a very naive implementation**. We just wanted you to try out how you can define a Shared Memory space on the GPU as a fine-grained performance optimization. However, we did not account for other issues, such as memory bank conflicts, whose handling could boost the performance considerably. In fact, we could have combined the Gaussian and Sobel filters to exploit data locality. Advanced users might be interested in reading the following article from NVIDIA: [https://goo.gl/1WuZGy](https://goo.gl/1WuZGy).