diff --git a/README.md b/README.md
index 2073c67fbe1ca82e12e035881be7df7a8689aba6..e30c74cd81bc2af9e5b87929e15df6af8e834263 100644
--- a/README.md
+++ b/README.md
@@ -191,7 +191,12 @@ You can safely ignore warnings like the following:
 We have not noticed performance degradations or errors once PyTorch
 started to emit these warnings.
 
-### Cache directories
+### File system problems
+
+Due to file system limits on the number of inodes ("number of files")
+and on the storage space available to us, we can run into errors.
+
+#### Cache directories
 
 PyTorch and the libraries it uses like to save compiled code,
 downloaded models, or downloaded datasets to cache directories. By
@@ -213,8 +218,24 @@ default values.
 - Triton (PyTorch dependency): `TRITON_CACHE_DIR="$HOME"/.triton/cache`
 - HuggingFace: `HF_HOME="$HOME"/.cache/huggingface`
 
+#### `venv` directories
+
+The `venv`s we create can contain a very large number of small files
+as well as large binary blobs of compiled code; both can lead to us
+reaching file system limits. To avoid this, set up your `venv`s in
+SCRATCH. The example scripts here do not follow this practice for
+simplicity, but please consider it in your own projects. Be mindful
+that files in SCRATCH are deleted after 90 days of not being touched,
+so make sure that the environment is reproducible (e.g., by saving an
+up-to-date `modules.sh` and `requirements.txt` in PROJECT).
+
 ### GPU kernel compilation
 
+Building GPU kernels can sometimes require additional configuration
+beyond installing the package.
+
+#### GPU architecture selection
+
 Some libraries may require you to explicitly specify the compute
 architecture of your GPU for them to successfully build.
 
@@ -233,7 +254,21 @@ NVIDIA H100 (compute capability 9.0) GPUs, we would set:
 ```
 export TORCH_CUDA_ARCH_LIST="7.0;8.0;9.0"
 ```
 
+#### Kernels not being compiled
+
+You may also find that some Python packages do not build GPU kernels
+by default even if `TORCH_CUDA_ARCH_LIST` is specified. This can
+happen if a package only builds kernels when a GPU is actually found
+on the system setting up the environment. Since we are building the
+environment on a login node, no GPU is available. However, we can
+still compile kernels, because the kernel compiler _is_ available,
+and that is all we need. Usually, libraries offer an escape hatch via
+environment variables so that you can force GPU kernel compilation
+manually. If these are not documented, you can look for such escape
+hatches in the package's `setup.py`. An AI chatbot may also be
+helpful in finding them.
+
 ### PyTorch Lightning
 
 If you are using PyTorch Lightning, you should launch jobs
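
The escape-hatch pattern described under "Kernels not being compiled" can be sketched as follows. This is an illustration, not any specific package's code; `FORCE_CUDA` is a conventional override name (torchvision's `setup.py` checks it, for example), and other packages may use a different variable:

```python
import os

def should_build_cuda_kernels(cuda_available: bool) -> bool:
    """Common setup.py pattern: build CUDA kernels if a GPU is
    visible OR an override environment variable is set."""
    # Sketch only; the variable name differs between packages.
    return cuda_available or os.environ.get("FORCE_CUDA", "0") == "1"

# On a GPU-less login node, exporting the override before
# `pip install` still triggers kernel compilation.
os.environ["FORCE_CUDA"] = "1"
print(should_build_cuda_kernels(cuda_available=False))  # True
```

Grepping a package's `setup.py` for `os.environ` or `os.getenv` is usually the quickest way to find which variables it honors.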