From a64941f6bdca4e37884f74033e43502798640b36 Mon Sep 17 00:00:00 2001
From: janEbert <janpublicebert@posteo.net>
Date: Tue, 9 Jul 2024 17:26:04 +0200
Subject: [PATCH] Avoid MPI terminology

---
 README.md | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 74a9ccf..a11c5ed 100644
--- a/README.md
+++ b/README.md
@@ -467,9 +467,8 @@ For initialization, FSDP first defines a hierarchy of distinct, but
 possibly nested, submodules ("units") for the model. This process is
 also called "wrapping" in FSDP terminology and can be controlled using
 the `auto_wrap_policy` argument to `FullyShardedDataParallel`. The
-parameters in each unit are then split and distributed ("sharded", or
-scattered) to all GPUs. In the end, each GPU contains its own,
-distinct model shard.
+parameters in each unit are then split and distributed ("sharded") to
+all GPUs. In the end, each GPU contains its own, distinct model shard.
 
 Whenever we do a forward pass with the model, we sequentially pass
 through units in the following way: FSDP automatically collects the
@@ -511,7 +510,7 @@ shard. This also means that we have to execute saving and loading on
 every process, since the data is fully distinct.
 
 The example also contains an unused `save_model_singular` function
-that gathers the full model on the CPU and then saves it in a single
+that collects the full model on the CPU and then saves it in a single
 checkpoint file which can then be loaded in a single process. Keep in
 mind that this way of checkpointing is slower and limited by CPU
 memory.
-- 
GitLab