

Spark-Examples

Interactive Spark Cluster

The script start_spark_cluster.sh spins up a Spark cluster with the specified number of nodes.
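
For orientation, here is a minimal sketch of what such a batch script can look like. This is an illustration only, not the repository's actual start_spark_cluster.sh; the time limit, the port, and the use of spark-class are assumptions.

#!/bin/bash
#SBATCH --job-name=spark-cluster
#SBATCH --nodes=2
#SBATCH --time=02:00:00

# The master runs on the first allocated node; the trailing "i" makes it
# reachable from outside the job (see Execution below).
MASTER_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_URL="spark://${MASTER_NODE}i.juwels:4124"

# Start the master on the first node and one worker per allocated node.
srun --nodes=1 --ntasks=1 -w "$MASTER_NODE" \
  "$SPARK_HOME/bin/spark-class" org.apache.spark.deploy.master.Master --port 4124 &
sleep 10  # give the master time to come up
srun --ntasks="$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 \
  "$SPARK_HOME/bin/spark-class" org.apache.spark.deploy.worker.Worker "$MASTER_URL" &
wait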

tl;dr

To start your Spark cluster on the compute nodes, simply run:

sbatch start_spark_cluster.sh

You only need to pick the number of nodes (see Preparation below).

Preparation

  1. Clone this repo:
git clone https://gitlab.jsc.fz-juelich.de/AI_Recipe_Book/recipes/spark-examples.git
  2. Prepare the virtual environment to install the required Python dependencies (the full sequence is sketched after this list):
  • cd spark_env
  • Edit requirements.txt
  • Create the virtual environment by calling ./setup.sh
  • Create a kernel for Jupyter-JSC by calling ./create_kernel.sh
  • To recreate the virtual environment, simply delete the folder ./venv
  3. Pick the number of nodes by adjusting the line #SBATCH --nodes=2 in start_spark_cluster.sh.
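
Taken together, the preparation steps look like this:

git clone https://gitlab.jsc.fz-juelich.de/AI_Recipe_Book/recipes/spark-examples.git
cd spark-examples/spark_env
# edit requirements.txt to list the Python packages your Spark apps need
./setup.sh           # creates the virtual environment in ./venv
./create_kernel.sh   # registers the venv as a Jupyter-JSC kernel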

Execution

To start your Spark cluster on the compute nodes, simply run:

sbatch start_spark_cluster.sh

This will print output similar to

Submitted batch job 6525353

In order to connect, you need to find out the hostname of your compute job.

[kesselheim1@jwlogin23 spark-examples]$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           6525353 develboos spark-cl kesselhe  R      30:40      2 jwb[0129,0149]

In this case, the Spark cluster runs on the nodes jwb0129 and jwb0149. Note that to reach the nodes from outside the job, you must append the letter i to the node name; a valid hostname is then jwb0129i.juwels. The corresponding adjustments are made automatically in the scripts. The Spark master always runs on the first node.
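
If you prefer not to copy the node name by hand, a small helper like the following can derive the master URL from your job ID (a sketch; the job ID and the port 4124 are taken from the examples in this README):

JOBID=6525353                                            # your job ID from sbatch/squeue
NODELIST=$(squeue -j "$JOBID" -h -o %N)                  # e.g. jwb[0129,0149]
MASTER_NODE=$(scontrol show hostnames "$NODELIST" | head -n 1)
export MASTER_URL="spark://${MASTER_NODE}i.juwels:4124"  # master runs on the first node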

Then you can run a Spark app with commands similar to

source ./spark_env/activate.sh
export MASTER_URL=spark://jwb0129i.juwels:4124
python pyspark_pi.py

Note the trailing i: when you substitute the hostname of your own job, you must include it.
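
For orientation, a Pi estimate in PySpark can look like the following sketch. The actual pyspark_pi.py in this repository may differ; here the master URL is read from the MASTER_URL environment variable set above.

import os
import random

from pyspark.sql import SparkSession

# Connect to the standalone cluster started by the batch job.
spark = (
    SparkSession.builder
    .master(os.environ["MASTER_URL"])
    .appName("PiEstimate")
    .getOrCreate()
)

def inside(_):
    # Sample a point in the unit square; count hits inside the quarter circle.
    x, y = random.random(), random.random()
    return x * x + y * y < 1.0

n = 10_000_000
count = spark.sparkContext.parallelize(range(n)).filter(inside).count()
print(f"Pi is roughly {4.0 * count / n}")
spark.stop()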

Monitoring

To connect to the master and workers with a browser, you need a command of the following form:

ssh -L 18080:localhost:18080 -L 8080:localhost:8080 kesselheim1@jwb0129i.juwels -J kesselheim1@juwels-booster.fz-juelich.de

Then you can navigate to http://localhost:8080 to see the Spark master web UI; port 18080 is forwarded for the history server.

Open Questions

  • The Scala example uses all worker instances as expected; the Python example uses only 2. Why?

ToDos:

  • Include a Python Virtual Environment
  • Create a notebook that illustrates how to run the Pi example in Jupyter
  • The history server does not work yet. It crashed with this error message:
Exception in thread "main" java.io.FileNotFoundException: Log directory specified does not exist: file:/tmp/spark-events Did you configure the correct one through spark.history.fs.logDirectory? 

The log directory is not yet configured correctly.
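
A likely fix (an assumption, not verified here) is to point both the event log and the history server at the same existing, shared directory, e.g. in $SPARK_HOME/conf/spark-defaults.conf; the path below is a placeholder and the directory must be created beforehand:

spark.eventLog.enabled           true
spark.eventLog.dir               file:///path/to/shared/spark-events
spark.history.fs.logDirectory    file:///path/to/shared/spark-events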

References

  • Pi Estimate (Python + Scala):
  • Simple Slurm Example (not completely compatible):