diff --git a/intro_lab/README.md b/intro_lab/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..99960ca2a91bbd5a6dde338d464ae6891acdc932
--- /dev/null
+++ b/intro_lab/README.md
@@ -0,0 +1,247 @@
+# OpenMP Lab Assignment
+
+## Overview
+
+The goal of these exercises is to familiarize you with the OpenMP environment
+and to write your first parallel codes with OpenMP. You will also measure code
+performance and learn about race conditions and false sharing. This laboratory
+contains four exercises, each with step-by-step instructions below.
+
+For your experiments, you are going to use a node of the
+[Beskow](https://www.pdc.kth.se/hpc-services/computing-systems/beskow-1.737436)
+supercomputer. To run your code on Beskow, you first need to generate an
+executable. It is very important that you include the compiler flag telling
+the compiler that you are going to use OpenMP. If you forget the flag, the
+compiler will happily ignore all the OpenMP directives and create an
+executable that runs in serial. Different compilers have different flags. When
+using the Cray compilers, the OpenMP flag is ``-openmp``.
+
+To compile your C OpenMP code using the default Cray compilers:
+
+```
+cc -O2 -openmp -lm name_source.c -o name_exec
+```
+
+In Fortran, you need to use the Intel compiler. On Beskow, you first need to
+switch from the Cray to the Intel compiler with:
+
+```
+module swap PrgEnv-cray PrgEnv-intel
+```
+
+and then compile with:
+
+```
+ftn -fpp -O2 -openmp -lm name_source.f90 -o name_exec
+```
+
+<!---
+**Very Important**: Do not copy and paste from this document to your terminal
+since the - symbol will create problems when compiling.
+-->
+
+To run your code on Beskow, you will need to have an interactive allocation:
+
+```
+salloc -N 1 -t 4:00:00 -A edu18.summer --reservation=summer-2018-08-15
+```
+
+To set the number of threads, you need to set the OpenMP environment variable:
+
+```
+export OMP_NUM_THREADS=<number-of-threads>
+```
+
+To run an OpenMP code on a compute node of Beskow (``-n`` sets the number of
+processes and ``-d`` the number of cores reserved per process, which should
+match the number of threads):
+
+```
+aprun -n 1 -d <number-of-threads> -cc none ./name_exec
+```
+
+## Exercise 1 - OpenMP Hello World: get familiar with the OpenMP environment
+
+_Concepts: Parallel regions, parallel, thread ID_
+
+Here we are going to implement our first OpenMP program. Expected knowledge
+includes a basic understanding of the OpenMP environment, how to compile an
+OpenMP program, how to set the number of OpenMP threads, and how to retrieve
+the thread ID at runtime.
+
+When run with 4 threads, your code should behave similarly to:
+
+Input:
+
+```
+aprun -n 1 -d 4 -cc none ./hello
+```
+
+Output:
+
+```
+Hello World from Thread 3
+Hello World from Thread 0
+Hello World from Thread 2
+Hello World from Thread 1
+```
+
+Instructions: Write a C/Fortran code that makes each OpenMP thread print
+"``Hello World from Thread X``", where ``X`` is the thread ID.
+
+Hints:
+
+- Remember to include the OpenMP header ``omp.h`` in C (or ``use omp_lib`` in Fortran)
+- Retrieve the ID of the thread with ``omp_get_thread_num()`` in C or in Fortran ``OMP_GET_THREAD_NUM()``.
+
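+For reference, a minimal C sketch along these lines (one possible solution; it
+assumes the number of threads is set through ``OMP_NUM_THREADS`` as described
+above):
+
+```
+#include <stdio.h>
+#include <omp.h>
+
+int main(void)
+{
+    /* Each thread in the team executes this block once. */
+    #pragma omp parallel
+    {
+        int id = omp_get_thread_num();  /* thread ID within the team */
+        printf("Hello World from Thread %d\n", id);
+    }
+    return 0;
+}
+```
+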
+Questions:
+
+- How do you change the number of threads?
+- How many different ways are there to change the number of threads? Which
+  ones are they?
+- How can you make the output ordered from thread 0 to thread 3?
+
+## Exercise 2 - Creating Threads: calculate pi in parallel using only the parallel pragma
+
+_Concepts: Parallel, default data environment, runtime library calls_
+
+Here we are going to implement a first parallel version of the pi.c / pi.f90
+code to calculate the value of pi using the parallel construct.
+
+The figure below shows the numerical technique we are going to use to
+calculate pi.
+
+![PI integral](image/pi_int.png "Figure 1. Numerical integration for calculating PI")
+
+Mathematically, we know that
+
+![PI Eq. 1](image/pi_eq_1.png "")
+
+We can approximate the integral as a sum of rectangles
+
+![PI Eq. 2](image/pi_eq_2.png "")
+
+where each rectangle has width Δx = 1/num_steps and height
+F(x<sub>i</sub>) = 4/(1+x<sub>i</sub><sup>2</sup>), evaluated at the middle of
+interval i.
+
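+In plain text, the relations in the two images can be written as follows
+(reconstructed here to be consistent with the serial code below):
+
+```
+\pi = \int_0^1 \frac{4}{1+x^2}\,dx
+    \approx \sum_{i=1}^{N} \frac{4}{1+x_i^2}\,\Delta x,
+\qquad \Delta x = \frac{1}{N},\quad x_i = (i - 0.5)\,\Delta x
+```
+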
+A simple serial C code to calculate pi is the following:
+
+```
+#include <stdio.h>
+
+static long num_steps = 100000000;
+double step;
+
+int main()
+{
+    int i;
+    double x, pi, sum = 0.0;
+
+    step = 1.0/(double) num_steps;
+    for (i = 1; i <= num_steps; i++) {
+        x = (i - 0.5) * step;          /* midpoint of interval i */
+        sum = sum + 4.0/(1.0 + x*x);   /* add the height F(x_i) */
+    }
+    pi = step * sum;
+    printf("pi = %.15f\n", pi);
+    return 0;
+}
+```
+
+Instructions: Create a parallel version of the pi.c / pi.f90 program using a
+parallel construct: ``#pragma omp parallel``. Run the parallel code and measure
+the execution time with 1, 2, 4, 8, 16, and 32 threads. Record the timings.
+
+Pay close attention to shared versus private variables.
+
+In addition to the parallel construct, you might need the following runtime
+library routines (a timing sketch follows below):
+
+- ``int omp_get_num_threads()`` to get the number of threads in a team
+- ``int omp_get_thread_num()`` to get the thread ID
+- ``double omp_get_wtime()`` to get the time in seconds since a fixed point in the past
+- ``omp_set_num_threads()`` to request a number of threads in a team
+
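+For example, ``omp_get_wtime()`` can be used to time a region as in this
+minimal sketch (the thread count of 4 is just an illustrative choice):
+
+```
+#include <stdio.h>
+#include <omp.h>
+
+int main(void)
+{
+    omp_set_num_threads(4);           /* request a team of 4 threads */
+    double start = omp_get_wtime();   /* wall-clock time in seconds */
+    #pragma omp parallel
+    {
+        printf("thread %d of %d\n",
+               omp_get_thread_num(), omp_get_num_threads());
+    }
+    double run_time = omp_get_wtime() - start;
+    printf("elapsed: %f s\n", run_time);
+    return 0;
+}
+```
+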
+Hints:
+
+- Use a parallel construct: ``#pragma omp parallel``.
+- Divide loop iterations between threads (use the thread ID and the number of threads).
+- Create an accumulator for each thread to hold the partial sums, which you
+  can later combine to generate the global sum, e.g.,
+
+```
+#define MAX_THREADS 4
+...
+int main ()
+{
+    double pi, full_sum = 0.0;
+    double sum[MAX_THREADS];
+    for (j = 1; j <= MAX_THREADS; j++) {
+        /* set the number of threads to j and start timing here */
+        ...
+        #pragma omp parallel
+        {
+            ...
+            sum[id] = 0.0;
+            for (i ...) {
+                x = (i + 0.5) * step;
+                sum[id] = sum[id] + 4.0/(1.0 + x*x);
+            }
+        }
+        /* after the parallel region: calculate full_sum by iterating
+           over sum[id], then pi = step * full_sum */
+        ...
+        /* stop and report timing here */
+        ...
+    }
+}
+```
+
+Questions:
+
+- How does the execution time change as you vary the number of threads? Is it
+  what you expected? If not, why do you think that is?
+- Is there any technique you heard about in class to improve the scalability
+  of this approach? How would you implement it?
+
+## Exercise 3 - calculate pi using critical and atomic directives
+
+_Concepts: parallel region, synchronization, critical, atomic_
+
+Here we are going to implement a second and a third parallel version of the
+pi.c / pi.f90 code to calculate the value of pi using the critical and atomic
+directives.
+
+Instructions: Create two new parallel versions of the pi.c / pi.f90 program
+using the parallel construct ``#pragma omp parallel`` together with 1) ``#pragma
+omp critical`` and 2) ``#pragma omp atomic``. Run the two new parallel codes and
+measure the execution time with 1, 2, 4, 8, 16, and 32 threads. Record the
+timings in a table.
+
+Hints:
+
+- You can use a shared variable ``pi`` that is updated concurrently by
+  different threads. However, this update needs to be protected with a
+  critical section or an atomic access (see the sketch below).
+- Place ``critical`` or ``atomic`` immediately before the update of the shared
+  variable, e.g. ``pi += step * partial_sum``, where ``partial_sum`` is the
+  thread's private accumulator.
+
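+A sketch of how the critical version might look (an illustration with
+hypothetical variable names, not the only valid solution; for the atomic
+version, replace the ``critical`` directive with ``#pragma omp atomic``):
+
+```
+#include <stdio.h>
+#include <omp.h>
+
+static long num_steps = 100000000;
+
+int main(void)
+{
+    double pi = 0.0;
+    double step = 1.0/(double) num_steps;
+
+    #pragma omp parallel
+    {
+        int id = omp_get_thread_num();
+        int nthreads = omp_get_num_threads();
+        double x, partial = 0.0;
+        /* cyclic distribution of the iterations over the threads */
+        for (long i = id; i < num_steps; i += nthreads) {
+            x = (i + 0.5) * step;
+            partial += 4.0/(1.0 + x*x);
+        }
+        /* only one thread at a time may update the shared variable */
+        #pragma omp critical
+        pi += step * partial;
+    }
+    printf("pi = %.15f\n", pi);
+    return 0;
+}
+```
+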
+Questions:
+
+- What would happen if you had not protected the shared variable with critical
+  or atomic?
+- How does the execution time change as you vary the number of threads? Is it
+  what you expected?
+- Do the two versions of the code differ in performance? If so, what do you
+  think is the reason?
+
+## Exercise 4 - calculate pi with a loop and a reduction
+
+_Concepts: worksharing, parallel loop, schedule, reduction_
+
+Here we are going to implement a fourth parallel version of the pi.c / pi.f90
+code to calculate the value of pi using ``omp for`` and a ``reduction`` clause.
+
+Instructions: Create a new parallel version of the pi.c / pi.f90 program using
+the worksharing construct ``#pragma omp for`` and the ``reduction`` clause. Run
+the new parallel code and measure the execution time for 1, 2, 4, 8, 16, and 32
+threads. Record the timings in a table. Then change the schedule to dynamic and
+guided and measure the execution time for 1, 2, 4, 8, 16, and 32 threads.
+
+Hints:
+
+To change the schedule, you can either set the environment variable with
+``export OMP_SCHEDULE=<type>``, where type can be static, dynamic, or guided
+(this takes effect only for loops declared with ``schedule(runtime)``), or set
+it in the source code with ``omp parallel for schedule(<type>)``. A sketch of
+the reduction version follows below.
+
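+A minimal sketch of the reduction version (one possible formulation; the
+``schedule`` clause is where you would try static, dynamic, or guided):
+
+```
+#include <stdio.h>
+
+static long num_steps = 100000000;
+
+int main(void)
+{
+    double step = 1.0/(double) num_steps;
+    double sum = 0.0;
+
+    /* each thread accumulates a private copy of sum; the copies are
+       combined with + at the end of the loop */
+    #pragma omp parallel for reduction(+:sum) schedule(static)
+    for (long i = 0; i < num_steps; i++) {
+        double x = (i + 0.5) * step;
+        sum += 4.0/(1.0 + x*x);
+    }
+    printf("pi = %.15f\n", step * sum);
+    return 0;
+}
+```
+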
+Questions:
+
+- Which schedule provides the best performance? What is the reason for that?
+- What is the fastest parallel implementation of the pi.c / pi.f90 program?
+  What is the reason for it being the fastest? What would be an even faster
+  implementation of the pi.c / pi.f90 program?
diff --git a/intro_lab/image/pi_eq_1.png b/intro_lab/image/pi_eq_1.png
new file mode 100644
index 0000000000000000000000000000000000000000..2a06f4079469db2ab7cffb0096b17f72e5840033
Binary files /dev/null and b/intro_lab/image/pi_eq_1.png differ
diff --git a/intro_lab/image/pi_eq_2.png b/intro_lab/image/pi_eq_2.png
new file mode 100644
index 0000000000000000000000000000000000000000..7a31ca0c5f6c33d537aedaa6e9f0f75f7f6dd6dc
Binary files /dev/null and b/intro_lab/image/pi_eq_2.png differ
diff --git a/intro_lab/image/pi_int.png b/intro_lab/image/pi_int.png
new file mode 100644
index 0000000000000000000000000000000000000000..6b3e22834d52e038db0eb3e46f4f1d4de2dc3141
Binary files /dev/null and b/intro_lab/image/pi_int.png differ