Spark-Examples

Interactive Spark Cluster

The script start_spark_cluster.sh spins up a Spark cluster with the specified number of nodes. To start it, simply execute

sbatch start_spark_cluster.sh

This will print output similar to

Submitted batch job 6525353
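
What does such a launcher script do? The following is a minimal sketch, assuming a two-node allocation; the partition name, time limit, and port handling are assumptions, and the actual start_spark_cluster.sh in this repository may differ:

#!/bin/bash
#SBATCH --job-name=spark-cluster
#SBATCH --nodes=2
#SBATCH --partition=develbooster
#SBATCH --time=02:00:00

module load Stages/2023 GCC OpenMPI Spark

# Start the standalone master on the first allocated node (assumes the
# Spark module puts the sbin scripts on the PATH).
start-master.sh --port 4124

# Run one worker per node in the foreground; each registers with the master.
srun --ntasks-per-node=1 spark-class org.apache.spark.deploy.worker.Worker spark://$(hostname -s)i.juwels:4124

Running the workers in the foreground keeps the srun step, and with it the whole cluster, alive until the job ends or is cancelled.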

In order to connect, you need to find out the hostname of your compute job.

[kesselheim1@jwlogin23 spark-examples]$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           6525353 develboos spark-cl kesselhe  R      30:40      2 jwb[0129,0149]

Then you can run a Spark app with commands similar to

module load Stages/2023  GCC  OpenMPI Spark
export MASTER_URL=spark://jwb0129i.juwels:4124
python pyspark_pi.py

Note the i that has been added to the master hostname.
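
For reference, here is a minimal sketch of what a Pi estimator such as pyspark_pi.py could look like. It is illustrative rather than the actual script from this repository, and it assumes MASTER_URL is exported as shown above:

import os
import random

from pyspark.sql import SparkSession

# Connect to the standalone master started by the batch job.
spark = (
    SparkSession.builder
    .master(os.environ["MASTER_URL"])
    .appName("PiEstimate")
    .getOrCreate()
)
sc = spark.sparkContext

# Spark runs one task per partition, so this number bounds the parallelism.
partitions = 100
n = 100000 * partitions

def inside(_):
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1.0 else 0

count = sc.parallelize(range(n), partitions).map(inside).reduce(lambda a, b: a + b)
print("Pi is roughly %f" % (4.0 * count / n))

spark.stop()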

To connect to the master and worker web UIs with a browser, you need an SSH tunnel of the following form:

ssh -L 18080:localhost:18080 -L 8080:localhost:8080 kesselheim1@jwb0085i.juwels -J kesselheim1@juwels-booster.fz-juelich.de

Then you can navigate to http://localhost:8080 to see the output.

Open Questions

  • The Scala example uses all worker instances as expected, but the Python example uses only 2. Why? One way to investigate is sketched below.
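
One way to investigate, sketched under the same assumptions as the example above: print the parallelism and partition counts each program actually sees, since Spark runs one task per partition, and a low partition count would keep most workers idle.

from pyspark.sql import SparkSession

# Attaches to the running cluster when the master is configured as above
# (e.g. via spark-submit --master $MASTER_URL).
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

print("default parallelism:", sc.defaultParallelism)
print("partitions used:", sc.parallelize(range(1000)).getNumPartitions())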

ToDos:

  • Include a Python Virtual Environment
  • Create a Notebook that illustrates how to run the Pi example in Jupyter
  • The history server does not work yet. It crashes with this error message:
Exception in thread "main" java.io.FileNotFoundException: Log directory specified does not exist: file:/tmp/spark-events Did you configure the correct one through spark.history.fs.logDirectory? 

The log directory is not configured correctly; a possible setup is sketched below.
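
A possible setup, sketched under assumptions (the event log path is illustrative, and start-history-server.sh must be on the PATH): create the event log directory, make applications write their event logs there, and point the history server at the same location.

# Create the event log directory (path is an assumption).
EVENTDIR=$HOME/spark-events
mkdir -p "$EVENTDIR"

# Make the application write event logs there, e.g. via spark-submit:
spark-submit --master $MASTER_URL \
    --conf spark.eventLog.enabled=true \
    --conf spark.eventLog.dir=file://$EVENTDIR \
    pyspark_pi.py

# Point the history server at the same directory and start it:
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=file://$EVENTDIR"
start-history-server.sh

After that, the history server UI should be reachable through the same SSH tunnel on port 18080.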

References

  • Pi Estimate (Python + Scala):
  • Simple Slurm Example (not completely compatible):