Commit 6055c2d3 authored by Jan Ebert

Document more potential problems

You can safely ignore warnings like the following. We have not
noticed performance degradations or errors once PyTorch started to
emit these warnings.
### File system problems
Due to file system limits on the number of inodes (i.e., the number
of files) and on the amount of storage available to us, we can run
into issues.
#### Cache directories
PyTorch and the libraries it uses like to save compiled code,
downloaded models, or downloaded datasets to cache directories. The
relevant environment variables are listed below with their
default values.
- Triton (PyTorch dependency): `TRITON_CACHE_DIR="$HOME"/.triton/cache`
- HuggingFace: `HF_HOME="$HOME"/.cache/huggingface`
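
To redirect these caches away from HOME, the variables above can be
pointed into SCRATCH. A minimal sketch, assuming `$SCRATCH` is set by
the cluster environment:

```shell
# Redirect caches into SCRATCH instead of HOME (assumes $SCRATCH is
# set by the cluster environment; adjust paths to your layout).
export TRITON_CACHE_DIR="$SCRATCH"/.triton/cache
export HF_HOME="$SCRATCH"/.cache/huggingface
```

Since caches are re-creatable, it is fine if SCRATCH cleanup removes
them; they will simply be rebuilt or re-downloaded on the next run.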
#### `venv` directories
The `venv`s we create can contain very many small files or very large
binary blobs of compiled code. Either can push us against file system
limits. To avoid these problems, set up your `venv`s in SCRATCH. For
simplicity, the example scripts here do not follow this practice, but
please consider it in your own projects. Be mindful
that files in SCRATCH are deleted after 90 days of not being touched,
so make sure that the environment is reproducible (e.g., by saving an
up-to-date `modules.sh` and `requirements.txt` in PROJECT).
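
A sketch of this setup with hypothetical paths, assuming `$SCRATCH`
and `$PROJECT` are provided by the cluster environment:

```shell
# Create the venv under SCRATCH; keep the files needed to rebuild it
# in PROJECT. Paths are illustrative; adapt them to your layout.
python3 -m venv "$SCRATCH"/venvs/my-project
source "$SCRATCH"/venvs/my-project/bin/activate
pip install -r "$PROJECT"/my-project/requirements.txt
# Snapshot the exact versions so the venv can be recreated after the
# 90-day SCRATCH cleanup removes it.
pip freeze > "$PROJECT"/my-project/requirements.txt
```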
### GPU kernel compilation
Sometimes, additional specifications are required to build GPU
kernels.
#### GPU architecture selection
Some libraries may require you to explicitly specify the compute
architecture of your GPU for them to successfully build.
NVIDIA H100 (compute capability 9.0) GPUs, we would set:
```
export TORCH_CUDA_ARCH_LIST="7.0;8.0;9.0"
```
#### Kernels not being compiled
You may also find that some Python packages do not build GPU kernels
by default even if `TORCH_CUDA_ARCH_LIST` is specified. This can
happen if kernels are only built when a GPU is actually found on the
system setting up the environment. Since we are building the
environment on a login node, no GPU is available. However, we can
still compile kernels: the kernel compiler _is_ available, and that
is all we require. Usually, libraries offer
an escape hatch via environment variables so you can still force GPU
kernel compilation manually. If they are not documented, you can try
to look for such escape hatches in the package's `setup.py`. Maybe an
AI chatbot can be helpful in finding these.
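
For example, some packages in the PyTorch ecosystem (such as
`torchvision`) honor a `FORCE_CUDA` variable; treat the exact
variable name as package-specific and verify it in the package's
documentation or `setup.py`:

```shell
# Illustrative only: force CUDA kernel compilation on a GPU-less
# login node. FORCE_CUDA is honored by some packages (e.g.,
# torchvision), but the exact variable differs per package.
export TORCH_CUDA_ARCH_LIST="7.0;8.0;9.0"
FORCE_CUDA=1 pip install --no-binary torchvision torchvision
```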
### PyTorch Lightning
If you are using PyTorch Lightning, you should launch jobs