diff --git a/README.md b/README.md
index 74a9ccf6f7b1b5110989fce4a309badc32f0a6a7..a11c5ede7204afdf0b0d8ffa66b46a5d8f6bc45e 100644
--- a/README.md
+++ b/README.md
@@ -467,9 +467,8 @@
 For initialization, FSDP first defines a hierarchy of distinct, but
 possibly nested, submodules ("units") for the model. This process is
 also called "wrapping" in FSDP terminology and can be controlled using
 the `auto_wrap_policy` argument to `FullyShardedDataParallel`. The
-parameters in each unit are then split and distributed ("sharded", or
-scattered) to all GPUs. In the end, each GPU contains its own,
-distinct model shard.
+parameters in each unit are then split and distributed ("sharded") to
+all GPUs. In the end, each GPU contains its own, distinct model shard.
 Whenever we do a forward pass with the model, we sequentially pass
 through units in the following way: FSDP automatically collects the
@@ -511,7 +510,7 @@
 shard. This also means that we have to execute saving and loading on
 every process, since the data is fully distinct.
 
 The example also contains an unused `save_model_singular` function
-that gathers the full model on the CPU and then saves it in a single
+that collects the full model on the CPU and then saves it in a single
 checkpoint file which can then be loaded in a single process. Keep in
 mind that this way of checkpointing is slower and limited by CPU memory.