Enable training with Nvidia's TF1.15 container
Since the latest maintenance on Juwels (October 2021), the CUDA versions from older Stages no longer work with the Nvidia driver installed on the system. As a result, TF1 from previous Stages can no longer be used on any HPC-system, which prevents us from exploiting the GPUs (e.g. Yan was still able to submit jobs, but training became very slow because the CPUs were selected as devices).
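The CPU fallback described above is easy to miss because jobs still start. A minimal sketch of a pre-training check could look as follows. It assumes that TensorFlow 1.15's `device_lib.list_local_devices()` reports the devices the job actually sees; the helper names `has_gpu` and `visible_device_types` are hypothetical and not part of this repository.

```python
def has_gpu(device_types):
    """Return True if any reported device type is exactly 'GPU'."""
    return any(t == "GPU" for t in device_types)


def visible_device_types():
    """Ask TensorFlow which local devices it sees (requires TensorFlow)."""
    # Imported lazily so that the pure helper above works without TensorFlow.
    from tensorflow.python.client import device_lib
    return [d.device_type for d in device_lib.list_local_devices()]


if __name__ == "__main__":
    try:
        types = visible_device_types()
    except ImportError:
        print("TensorFlow is not installed; skipping the device check.")
    else:
        if has_gpu(types):
            print("GPU visible to TensorFlow:", types)
        else:
            # Assumption: a missing GPU entry means training would run on CPU.
            print("WARNING: no GPU visible, training would fall back to CPU:", types)
```

Running such a check at the start of a batch job would make the slow CPU-only case visible in the job log right away.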
While the Nvidia Singularity containers, as used for parallelizing training (on Juwels Booster), thus become a mandatory workaround, the previously deployed container (version 2021-02) no longer works since the OFED update on the HPC-systems. This causes, for instance, problems with UCX, since the host's UCX version for intra-node communication does not match the UCX version used in the container. To restore full functionality, a newer Singularity container must be deployed whose UCX version matches the host version (version 1.11). The container version 2021-09 is used for this purpose.
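The version requirement can be illustrated with a small comparison helper. This is only a sketch: it assumes that agreement on major and minor version is what "matches" means here, and that the version strings have already been obtained on the host and inside the container (for example from the output of `ucx_info -v`). The function names and the example version strings are hypothetical.

```python
def major_minor(version):
    """Extract (major, minor) as integers from a string such as '1.11.2'."""
    parts = version.strip().split(".")
    return int(parts[0]), int(parts[1])


def ucx_versions_match(host_version, container_version):
    """Compare host and container UCX versions on major.minor only."""
    # Assumption: patch-level differences do not break communication.
    return major_minor(host_version) == major_minor(container_version)


if __name__ == "__main__":
    host = "1.11.2"  # hypothetical host version string
    for container in ("1.11.0", "1.9.0"):  # hypothetical container versions
        status = "ok" if ucx_versions_match(host, container) else "MISMATCH"
        print("host %s vs container %s: %s" % (host, container, status))
```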
The e-mails from the support team regarding the two obstacles mentioned above are added in the comments below.