Commit 6055c2d3 authored by Jan Ebert

Document more potential problems

You can safely ignore warnings like the following. We have not
noticed performance degradations or errors once PyTorch started to
emit these warnings.
### File system problems
Due to file system limits on the number of inodes (i.e., the number
of files) and on the amount of storage available to us, we can run
into issues.
#### Cache directories
PyTorch and the libraries it uses like to save compiled code,
downloaded models, or downloaded datasets to cache directories. The
relevant environment variables are listed below with their
default values.
- Triton (PyTorch dependency): `TRITON_CACHE_DIR="$HOME"/.triton/cache`
- HuggingFace: `HF_HOME="$HOME"/.cache/huggingface`
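
To redirect these caches away from HOME, the variables above can be
pointed into SCRATCH. A minimal sketch, assuming `$SCRATCH` is set by
the cluster environment:

```shell
# Redirect caches into SCRATCH instead of HOME (assumes $SCRATCH is
# set by the cluster environment; adjust paths to your layout).
export TRITON_CACHE_DIR="$SCRATCH"/.triton/cache
export HF_HOME="$SCRATCH"/.cache/huggingface
```

Since caches are re-creatable, it is fine if SCRATCH cleanup removes
them; they will simply be rebuilt or re-downloaded on the next run.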
#### `venv` directories
The `venv`s we create can contain very many small files or very large
binary blobs of compiled code. Either can push us against file system
limits. To avoid these problems, set up your `venv`s in SCRATCH. For
simplicity, the example scripts here do not follow this practice, but
please consider it in your own projects. Be mindful
that files in SCRATCH are deleted after 90 days of not being touched,
so make sure that the environment is reproducible (e.g., by saving an
up-to-date `modules.sh` and `requirements.txt` in PROJECT).
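
A sketch of this setup with hypothetical paths, assuming `$SCRATCH`
and `$PROJECT` are provided by the cluster environment:

```shell
# Create the venv under SCRATCH; keep the files needed to rebuild it
# in PROJECT. Paths are illustrative; adapt them to your layout.
python3 -m venv "$SCRATCH"/venvs/my-project
source "$SCRATCH"/venvs/my-project/bin/activate
pip install -r "$PROJECT"/my-project/requirements.txt
# Snapshot the exact versions so the venv can be recreated after the
# 90-day SCRATCH cleanup removes it.
pip freeze > "$PROJECT"/my-project/requirements.txt
```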
### GPU kernel compilation
Sometimes, additional specifications are required to build GPU
kernels.
#### GPU architecture selection
Some libraries may require you to explicitly specify the compute
architecture of your GPU for them to successfully build.
NVIDIA H100 (compute capability 9.0) GPUs, we would set:
```
export TORCH_CUDA_ARCH_LIST="7.0;8.0;9.0"
```
#### Kernels not being compiled
You may also find that some Python packages do not build GPU kernels
by default even if `TORCH_CUDA_ARCH_LIST` is specified. This can
happen if kernels are only built when a GPU is actually found on the
system setting up the environment. Since we are building the
environment on a login node, no GPU is available. However, we can
still compile kernels: the kernel compiler _is_ available, and that
is all we require. Usually, libraries offer
an escape hatch via environment variables so you can still force GPU
kernel compilation manually. If they are not documented, you can try
to look for such escape hatches in the package's `setup.py`. Maybe an
AI chatbot can be helpful in finding these.
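
For example, some packages in the PyTorch ecosystem (such as
`torchvision`) honor a `FORCE_CUDA` variable; treat the exact
variable name as package-specific and verify it in the package's
documentation or `setup.py`:

```shell
# Illustrative only: force CUDA kernel compilation on a GPU-less
# login node. FORCE_CUDA is honored by some packages (e.g.,
# torchvision), but the exact variable differs per package.
export TORCH_CUDA_ARCH_LIST="7.0;8.0;9.0"
FORCE_CUDA=1 pip install --no-binary torchvision torchvision
```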
### PyTorch Lightning
If you are using PyTorch Lightning, you should launch jobs