Simulation and Data Lab Applied Machine Learning / PyTorch at JSC

Commit a64941f6, authored 11 months ago by Jan Ebert

    Avoid MPI terminology

parent 7274481b
Showing 1 changed file: README.md (+3 additions, -4 deletions)
@@ -467,9 +467,8 @@ For initialization, FSDP first defines a hierarchy of distinct, but
 possibly nested, submodules ("units") for the model. This process is
 also called "wrapping" in FSDP terminology and can be controlled using
 the `auto_wrap_policy` argument to `FullyShardedDataParallel`. The
-parameters in each unit are then split and distributed ("sharded", or
-scattered) to all GPUs. In the end, each GPU contains its own,
-distinct model shard.
+parameters in each unit are then split and distributed ("sharded") to
+all GPUs. In the end, each GPU contains its own, distinct model shard.
 
 Whenever we do a forward pass with the model, we sequentially pass
 through units in the following way: FSDP automatically collects the
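The sharding step this hunk describes (each unit's parameters split evenly across all GPUs, each rank keeping only its own shard) can be sketched in plain Python. This is an illustration of the idea only, not FSDP's actual implementation: the `shard` helper and the flat parameter list are hypothetical, and FSDP operates on flattened parameter tensors, not Python lists.

```python
def shard(flat_params, world_size):
    """Split a flat list of parameter values into `world_size`
    nearly-equal shards, one per rank (GPU). Illustrative only."""
    base, extra = divmod(len(flat_params), world_size)
    shards, start = [], 0
    for rank in range(world_size):
        # The first `extra` ranks take one element more, so the
        # split stays balanced when sizes don't divide evenly.
        size = base + (1 if rank < extra else 0)
        shards.append(flat_params[start:start + size])
        start += size
    return shards

# Each rank ends up with its own, distinct shard:
print(shard(list(range(10)), 4))
# [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```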
@@ -511,7 +510,7 @@ shard. This also means that we have to execute saving and loading on
 every process, since the data is fully distinct.
 
 The example also contains an unused `save_model_singular` function
-that gathers the full model on the CPU and then saves it in a single
+that collects the full model on the CPU and then saves it in a single
 checkpoint file which can then be loaded in a single process. Keep in
 mind that this way of checkpointing is slower and limited by CPU
 memory.
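Conceptually, a `save_model_singular`-style routine collects every rank's shard onto one process, reconstructing the full model there before writing a single checkpoint file; that is why it is bounded by one process's CPU memory. A toy sketch of the collection step (names here are illustrative, not the README's actual code):

```python
def collect_full_params(per_rank_shards):
    """Concatenate the distinct per-rank shards back into one
    full, flat parameter list on a single process. The whole
    model must fit into that process's (CPU) memory at once."""
    full = []
    for shard in per_rank_shards:
        full.extend(shard)
    return full

# Shards held by ranks 0..2; collecting yields the full model:
print(collect_full_params([[0.1, 0.2], [0.3, 0.4], [0.5]]))
# [0.1, 0.2, 0.3, 0.4, 0.5]
```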