From 58053a3a36d4b15679dde9c329ac772446915c68 Mon Sep 17 00:00:00 2001
From: Stefan Kesselheim <s.kesselheim@fz-juelich.de>
Date: Thu, 12 Jan 2023 17:54:24 +0100
Subject: [PATCH] doc updated

---
 README.md | 37 +++++++++++++++++++++++++++++++++----
 1 file changed, 33 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index c12bb53..96389f1 100644
--- a/README.md
+++ b/README.md
@@ -2,10 +2,34 @@
 ## Interactive Spark Cluster
 Script `start_spark_cluster.sh`. Spin up a Spark cluster with the specified number of nodes.
-To start, simply execute
+
+### tl;dr
+To start your Spark cluster, simply run:
+```bash
+sbatch start_spark_cluster.sh
+```
+You only need to adjust the number of nodes beforehand (see Preparation).
+
+### Preparation
+1. Clone this repository:
+```bash
+git clone https://gitlab.jsc.fz-juelich.de/AI_Recipe_Book/recipes/spark-examples.git
+```
+2. Prepare the virtual environment to install the required Python dependencies:
+- `cd spark_env`
+- Edit `requirements.txt`
+- Create the virtual environment by calling `./setup.sh`
+- Create a kernel for Jupyter-JSC by calling `./create_kernel.sh`
+- To recreate the virtual environment, simply delete the folder `./venv`
+
+3. Pick the number of nodes by adjusting the line `#SBATCH --nodes=2` in `start_spark_cluster.sh`.
+
+### Execution
+To start your Spark cluster, simply run:
 ```bash
 sbatch start_spark_cluster.sh
 ```
+
 This will return information similar to
 ```
 Submitted batch job 6525353
@@ -16,18 +40,23 @@ In order to connect, you need to find out the hostname of your compute job.
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 6525353 develboos spark-cl kesselhe  R      30:40      2 jwb[0129,0149]
 ```
+In this case, the Spark cluster runs on the nodes jwb0129 and jwb0149. Note that to
+access the nodes from outside the job, you must append the letter `i`. A valid hostname
+is then `jwb0129i.juwels`. The scripts make the corresponding adjustments automatically.
+The Spark master always runs on the first node.
 Then you can run a Spark app with a command similar to
 ```bash
-module load Stages/2023 GCC OpenMPI Spark
+source ./spark_env/activate.sh
 export MASTER_URL=spark://jwb0129i.juwels:4124
 python pyspark_pi.py
 ```
-Note the `i` that that has been added to the master hostname.
+Note that you must replace the hostname with the first node of your own job, including the trailing `i`.
+### Monitoring
 To connect to the master and workers with a browser, you need a command of the following form:
 ```bash
-ssh -L 18080:localhost:18080 -L 8080:localhost:8080 kesselheim1@jwb0085i.juwels -J kesselheim1@juwels-booster.fz-juelich.de
+ssh -L 18080:localhost:18080 -L 8080:localhost:8080 kesselheim1@jwb0129i.juwels -J kesselheim1@juwels-booster.fz-juelich.de
 ```
 Then you can navigate to <http://localhost:8080> to see the output.
-- GitLab