Simulation and Data Lab Applied Machine Learning
PyTorch at JSC
Commit 6055c2d3, authored 11 months ago by Jan Ebert

Document more potential problems

Parent: bc196472

Changes: 1 changed file, README.md (36 additions, 1 deletion)
@@ -191,7 +191,12 @@ You can safely ignore warnings like the following:
We have not noticed performance degradations or errors once PyTorch
started to emit these warnings.
### File system problems

Due to file system limits on the number of inodes ("number of files")
and the amount of memory available to us, we can run into issues.

#### Cache directories

PyTorch and the libraries it uses like to save compiled code,
downloaded models, or downloaded datasets to cache directories. By
@@ -213,8 +218,24 @@ default values.
- Triton (PyTorch dependency):
  `TRITON_CACHE_DIR="$HOME"/.triton/cache`
- HuggingFace:
  `HF_HOME="$HOME"/.cache/huggingface`
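Relocating these caches away from `$HOME` can be sketched as follows (a minimal sketch, not the repository's official setup; the target layout under SCRATCH and the fallback value are assumptions for illustration):

```shell
# Point the Triton and HuggingFace caches at SCRATCH instead of $HOME.
# $SCRATCH is assumed to be set by the system; the fallback value here
# is purely for illustration.
SCRATCH="${SCRATCH:-/tmp/scratch-example}"
export TRITON_CACHE_DIR="$SCRATCH"/.triton/cache
export HF_HOME="$SCRATCH"/.cache/huggingface
mkdir -p "$TRITON_CACHE_DIR" "$HF_HOME"
```

Since cached files are regenerable, losing them to SCRATCH's cleanup policy only costs recomputation or re-download time.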
#### `venv` directories
The
`venv`
s we create can contain very many small files, or very large
binary blobs of compiled code. Both of these can lead to us reaching
file system limits. To avoid these problems, set up your
`venv`
s in
SCRATCH. The example scripts here do not follow this practice out of
simplicity, but please consider it in your own projects. Be mindful
that files in SCRATCH are deleted after 90 days of not being touched,
so make sure that the environment is reproducible (e.g., by saving an
up-to-date
`modules.sh`
and
`requirements.txt`
in PROJECT).
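Setting up a `venv` in SCRATCH while keeping the reproducibility files in PROJECT can be sketched like this (the directory layout, the `myproject` name, and the fallback value for `$SCRATCH` are assumptions for illustration):

```shell
# Create the venv under SCRATCH; only small, reproducible files are
# meant to stay in PROJECT. $SCRATCH and "myproject" are placeholders.
SCRATCH="${SCRATCH:-/tmp/scratch-example}"
python3 -m venv "$SCRATCH"/venvs/myproject
. "$SCRATCH"/venvs/myproject/bin/activate
# Save an up-to-date requirements.txt (to be kept in PROJECT).
pip freeze > requirements.txt
```

If SCRATCH's cleanup deletes the environment, it can then be rebuilt from the saved `requirements.txt` (plus the matching `modules.sh`).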
### GPU kernel compilation
Sometimes, additional specifications are required to build GPU
kernels.
#### GPU architecture selection
Some libraries may require you to explicitly specify the compute
architecture of your GPU for them to successfully build.
@@ -233,6 +254,20 @@ NVIDIA H100 (compute capability 9.0) GPUs, we would set:
```shell
export TORCH_CUDA_ARCH_LIST='7.0;8.0;9.0'
```
#### Kernels not being compiled

You may also find that some Python packages do not build GPU kernels
by default even if `TORCH_CUDA_ARCH_LIST` is specified. This can
happen if kernels are only built when a GPU is actually found on the
system setting up the environment. Since we are building the
environment on a login node, we won't have a GPU available. However,
we are still able to compile kernels, as the kernel compiler _is_
available, and that is all we require. Usually, libraries offer an
escape hatch via environment variables so you can still force GPU
kernel compilation manually. If these are not documented, you can try
to look for such escape hatches in the package's `setup.py`. Maybe an
AI chatbot can be helpful in finding these.
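As one example of such an escape hatch: some packages (torchvision among them) honor a `FORCE_CUDA` environment variable; other packages define their own. The commented install line uses a placeholder package name:

```shell
# Force GPU kernel compilation even though no GPU is visible on the
# login node. FORCE_CUDA is honored by some packages (e.g.,
# torchvision); check each package's setup.py for its own variable.
export FORCE_CUDA=1
export TORCH_CUDA_ARCH_LIST='7.0;8.0;9.0'
# pip install <package-with-gpu-kernels>  # placeholder package name
```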
### PyTorch Lightning
If you are using PyTorch Lightning, you should launch jobs