Commit 5585c3a3 authored by Ilya Zhukov

Address Ess's comments in the running-jobs section.

parent 40c6a555
Pipeline #188465 passed
@@ -36,10 +36,13 @@ The `srun` command is used to execute commands on a set of allocated resources.
If no resources are currently allocated, `srun` can infer from its command line arguments what resources are needed, request them from the resource manager and defer the execution of the associated commands until the resources are available.
After the associated commands have been run, the resources are relinquished and running further commands will have to ask for resources again.
This one-shot mode can be useful when you want to interactively run a few quick jobs with varying sets of resources allocated for them.
-Run the `hostname` command to see how `srun` will run commands on different nodes than the login nodes.
-On JURECA and JUSUF, use this command (Important: do not forget to replace `YYYYMMDD`, where `YYYY` and `MM` and `DD` are the current year and month and day in the Gregorian calendar, e.g. `20240522`):
-[ER: should there be an explanation of what the hostname command is? I know people have forgotten to remove it before, and we're aiming this at people who don't know how to use a terminal, sometimes...]
+Run the `hostname` command to see how `srun` will run commands on different nodes than the login nodes. The `hostname` command lets you see or change the name of your computer (e.g. name of the login or compute node), which is useful for recognising it on a network or setting it up for different tasks. On JURECA and JUSUF, use this command:
+:::warning
+Do not forget to replace `YYYYMMDD` with the current date, where `YYYY`, `MM`, and `DD` are the current year, month, and day in the Gregorian calendar, e.g. `20240522`.
+:::
```
$ hostname
@@ -179,8 +182,12 @@ Of particular interest are the development partitions on each system (look for `
These consist of a small number of nodes which are set aside to prioritise small and short jobs which are typically run as part of development work on your application rather than production use of the system.
Try running the previous two examples using `hostname` on the development partition of your system by specifying it through `srun`'s `-p` option.
+:::warning
Remove the `--reservation` option, because the reservation does not include nodes from the development partition.
-[ER: should this specifically note to remove the hands-on-cluster etc. bit? I think so, if they might not fully understand]
+:::
We will have a look at other partitions later.
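As a rough sketch of the exercise above, the command might look as follows. The partition name `devel` is an assumption for illustration (development partition names differ between systems; `sinfo` lists the real ones), and the `run` helper and `DRY_RUN` variable are hypothetical, added only so that the sketch prints the command instead of submitting it by default.

```shell
#!/bin/sh
# Sketch only: PARTITION is a placeholder, not a confirmed partition name.
PARTITION="${PARTITION:-devel}"

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# -N 1: one node; -p: partition. No --reservation option is passed.
run srun -N 1 -p "$PARTITION" hostname
```

Set `DRY_RUN=0` to actually submit the command once the partition name matches your system.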
@@ -233,11 +240,13 @@ jwc00n014.juwels
$ exit
```
+:::warning
When using `srun` in one-shot mode, your account is charged for the time it takes to run the associated command.
With `salloc` your account is charged for the time you spend in the shell launched by `salloc` (and for commands launched by that shell).
Once you are done with the allocated resources, do not forget to exit from the shell:
-[ER: I think this should have a warning or caution, that this is likely to spend your compute time just sitting around and we don't generally recommend it, but it can be useful for certain things.]
+:::
```
$ exit
@@ -294,8 +303,7 @@ $ sbatch testjob.sh
Submitted batch job 3476793
```
-After the first line (the shebang line) the script contains specially formatted comments that act like arguments to `sbatch`.
-[ER: will users know what a shebang is?]
+After the first line (the [shebang](https://de.wikipedia.org/wiki/Shebang) line) the script contains specially formatted comments that act like arguments to `sbatch`.
These arguments are written in their long form.
Previously, you used the short form (e.g. `-N` is the same as `--nodes`).
After the block of comments come regular shell commands.
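To make this structure concrete, here is a minimal sketch of such a script. It is not the tutorial's `testjob.sh`: the option values, the output file name, and the `job_id` variable are assumptions chosen for illustration.

```shell
#!/bin/bash
# Specially formatted comments: sbatch reads these as command line arguments.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:05:00
#SBATCH --output=testjob-%j.out

# Regular shell commands start here.
# Slurm sets SLURM_JOB_ID inside a job; fall back to "none" elsewhere.
job_id="${SLURM_JOB_ID:-none}"
echo "job id: $job_id"

if [ "$job_id" != "none" ]; then
    # Inside a job: launch one task on each of the two allocated nodes.
    srun hostname
else
    echo "not inside a Slurm job, skipping srun"
fi
```

The check on `job_id` is only there so that the sketch can also be run outside a batch job without trying to allocate resources.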
@@ -310,7 +318,15 @@ $ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3476793 batch testjob. steinbus PD 0:00 2 (Priority)
```
-[ER: Worth mentioning watch squeue?]
+:::tip
+You can use the `watch squeue` command to monitor the status of jobs in real time. It reruns `squeue` at regular intervals and redisplays the output.
+To exit `watch squeue`, press `Ctrl + C`. This interrupts `watch` and returns you to the regular command prompt.
+:::
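By default, `watch` reruns its command every two seconds. The sketch below restricts the view to your own jobs and uses a longer interval. The function name `watch_my_jobs`, the 10-second default, and the `DRY_RUN` switch are assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical helper: rerun `squeue -u $USER` every N seconds (default 10).
# With DRY_RUN=1 the command is only printed, not started.
watch_my_jobs() {
    interval="${1:-10}"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "watch -n $interval squeue -u $USER"
    else
        watch -n "$interval" squeue -u "$USER"
    fi
}
```

Call `watch_my_jobs` (or, for example, `watch_my_jobs 30` for a 30-second interval) and press `Ctrl + C` to stop.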
You might have to wait for a while, but eventually your job will run.
While your job is pending in the queue or already running you can execute another command to retrieve further information about your job:
```
@@ -475,8 +491,12 @@ hello from process 1 of 4, using 24 threads
Once more, Slurm fills the node with four processes having appropriate affinity masks.
The OpenMP runtime figures out that each process is allowed to use 24 CPU cores and creates a team of threads to fill those CPU cores.
-**IMPORTANT**: Do not forget to exit your salloc session at this point.
-[ER: should this be a warning or caution?]
+:::warning
+Do not forget to exit your `salloc` session at this point.
+:::
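If you are unsure whether your current shell still belongs to an allocation, one way to check is the `SLURM_JOB_ID` environment variable, which Slurm sets inside allocations and batch jobs. The following is a minimal sketch; the function name `allocation_status` and its messages are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: report whether this shell runs inside a Slurm job.
allocation_status() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "inside Slurm job $SLURM_JOB_ID, type 'exit' to release it"
    else
        echo "no Slurm allocation in this shell"
    fi
}

allocation_status
```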
### JSC Affinity Tools
Since we are using psslurm, we have implemented a few options that differ from the Slurm defaults.
@@ -484,8 +504,14 @@ For this reason we are offering two tools that can help you to understand the pr
1. The command line executable: `psslurmgetbind`
2. An online [pinning tool](https://apps.fz-juelich.de/jsc/llview/pinning/)
-[ER: should we mention this is all broken right now?]
-Further information can be found in the "Processor Affinity" chapter of the corresponding [System Documentation][System Documentation].
+:::warning
+After the update to Slurm version 22.05, the pinning scheme has changed. The pinning tool is still available, but it does not give accurate results at the moment.
+:::
+Further information can be found in the "Processor Affinity" chapter of the corresponding [System Documentation](./useful-links.md#system-documentation).
## Further reading
......