Spark-Examples
Interactive Spark Cluster
Script start_spark_cluster.sh
Spin up a Spark cluster with the specified number of nodes.
To start, simply execute
sbatch start_spark_cluster.sh
This will return information similar to
Submitted batch job 6525353
In order to connect, you need to find out the hostname of your compute job.
[kesselheim1@jwlogin23 spark-examples]$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
6525353 develboos spark-cl kesselhe R 30:40 2 jwb[0129,0149]
Then you can run a Spark App with a command similar to
module load Stages/2023 GCC OpenMPI Spark
export MASTER_URL=spark://jwb0129i.juwels:4124
python pyspark_pi.py
Note the i that has been added to the master hostname.
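The contents of pyspark_pi.py are not shown here, but Spark's classic Pi demo is a Monte Carlo estimate: sample random points in the unit square and count the fraction that land inside the quarter circle. A pure-Python sketch of the computation the Spark job would distribute across the workers (the function name, seed, and sample count are illustrative, not taken from the repository):

```python
import random

def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Monte Carlo estimate of pi: the fraction of uniform random points
    in the unit square that fall inside the quarter circle, times 4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples

print(estimate_pi(100_000))
```

In the Spark version, the sampling loop is split into partitions, each worker counts hits on its partition, and the counts are summed with a reduce before the final division.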
To connect to the master and workers with a browser, you need a command of the following form:
ssh -L 18080:localhost:18080 -L 8080:localhost:8080 kesselheim1@jwb0085i.juwels -J kesselheim1@juwels-booster.fz-juelich.de
Then you can navigate to http://localhost:8080 to see the output.
Open Questions
- The Scala example uses all worker instances as expected, but the Python example uses only 2. Why?
ToDos:
- Include a Python Virtual Environment
- Create a notebook that illustrates how to run the Pi example in Jupyter
- The history server does not work yet. It crashed with this error message:
Exception in thread "main" java.io.FileNotFoundException: Log directory specified does not exist: file:/tmp/spark-events Did you configure the correct one through spark.history.fs.logDirectory?
The log directory is not configured correctly.
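A likely fix, not yet verified on JUWELS: the error means the directory behind spark.history.fs.logDirectory does not exist. The config keys below are standard Spark settings, but the path is only an example. Create the event-log directory before the job starts and point both the event-log writer and the history server at it in conf/spark-defaults.conf:

```
# Sketch, assuming the default /tmp/spark-events location; both keys must
# point at the same existing directory (e.g. mkdir -p /tmp/spark-events first).
spark.eventLog.enabled           true
spark.eventLog.dir               file:///tmp/spark-events
spark.history.fs.logDirectory    file:///tmp/spark-events
```

On a multi-node job, a directory on a shared filesystem is probably preferable to node-local /tmp, since the history server and the workers run on different nodes.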