diff --git a/mmul/README.md b/mmul/README.md
index 5c71d7c0f9f4f01b2f9429d966d10139ce9f46a4..4841930b180242292e2e57023b2a0e8384fefc6f 100644
--- a/mmul/README.md
+++ b/mmul/README.md
@@ -18,10 +18,10 @@ For this lab, answer the following questions and address the following tasks:
 1) Compute the arithmetic intensity for the numerical task $z \rightarrow x * y$ as a function of the matrix size *N*. Assume the elements of the square matrices to be double precision numbers of size 8 Byte. Ignore the existence of caches.
 2) Argue whether the application is compute performance or memory bandwidth limited.
 3) Repeat this analysis for the cache blocked implementations of the numerical tasks (see functions/subroutines ``mmul_block1`` and/or ``mmul_block2``) and compute the arithmetic intensity as a function of the matrix size *N* and block size *B* under the assumption that the caches are large enough to hold at least one full block of each of the 3 matrices.
-4) Use Scalasca to collect performance figures for each task implementation of the numerical task (``mmul_ref``, ``mmul_block1``, ``mmul_block2``) using the following performance counters: L1 data cache misses (``PAPI_L1_DCM``),total number of instructions (``PAPI_TOT_INS``), total number of cycles (``PAPI_TOT_CYC``). Select *N*=256 and choose different block sizes *B*. Compile the application using the options ``-O3 -march=native``. Explore the following questions:
+4) Use Scalasca to collect performance figures for each implementation of the numerical task (``mmul_ref``, ``mmul_block1``, ``mmul_block2``) using the following performance counters: L2 data cache misses (``PAPI_L2_DCM``), total number of instructions (``PAPI_TOT_INS``), total number of cycles (``PAPI_TOT_CYC``). Select *N*=256 and choose different block sizes *B*. Compile the application using the options ``-O3 -march=native``. Explore the following questions:
 - Which is the optimal block size?
 - What level of Instructions per Cycle (IPC) is reached? Is this value large for the given processor architecture?
 - What fraction of the peak performance of the processor core is achieved?
 5) How do the results obtained in task #4 compare to the theoretical results obtained in task #3? What could explain the deviations? (Hint: Review the assumptions that had been made for task #3.)
 6) For (*N*,*B*)=(256,16) compile the code using different optimisation levels, i.e. ``-O0`` (no optimisation), ``-O1``, ``-O3``, and compare the results in terms of time spent in each kernel and number of instructions.
-7) Bonus question: Did the compiler generate SIMD instructions? Hint: Use the compiler option ``-S`` to produce an assembler version of numerical task implementations.
\ No newline at end of file
+7) Bonus question: Did the compiler generate SIMD instructions? Hint: Use the compiler option ``-S`` to produce an assembler version of the numerical task implementations.
diff --git a/poisson2d/README.md b/poisson2d/README.md
index 6246fb734b6157207462247301d7eae70ee61013..d4595ff02a7d58f1701d4627255881d5da9efbbf 100644
--- a/poisson2d/README.md
+++ b/poisson2d/README.md
@@ -53,6 +53,6 @@ For this lab, answer the following questions and address the following tasks:
 3) Compute the expected execution time per solver iteration as a function of *NX*, *NY*, arithmetic intensity and the hardware parameter identified as relevant with the previous question, i.e. throughput of floating-point operations or memory bandwidth.
 4) For the performance critical code region, identify spatial and temporal data locality properties of the relevant data objects.
 5) Based on the data locality properties argue why the choice of the loop order is bad.
-6) Provide empirical evidence for your arguments by measuring time as well as the L1 data cache misses using Scalasca (via the PAPI counter ``PAPI_L1_DCM``) for (*NX*,*NY*)=(16,1024).
-7) After having optimised the code, what will run faster: (*NX*,*NY*)=(16,1024) or (*NX*,*NY*)=(1024,16)? Proof this by measuring time and L1 data cache misses and argue, why this meets your expectations based on data locality arguments.
+6) Provide empirical evidence for your arguments by measuring time as well as the L2 data cache misses using Scalasca (via the PAPI counter ``PAPI_L2_DCM``) for (*NX*,*NY*)=(16,1024).
+7) After having optimised the code, what will run faster: (*NX*,*NY*)=(16,1024) or (*NX*,*NY*)=(1024,16)? Prove this by measuring time and L2 data cache misses and argue why this meets your expectations based on data locality arguments.
 8) Compute the effective bandwidth, which is defined as the ratio of amount of loaded plus stored data (ignoring data locality) divided by the execution time. How does this compare to the hardware parameters of the node, which you have been using?