Implement stochastic gradient descent (SGD) for parallelized training of ConvLSTM
To tackle the parallelization issue, try using stochastic gradient descent as the optimizer within a distributed training framework; a sketch of one possible setup follows the link below.
Useful link for SGD with Horovod: https://github.com/horovod/horovod/blob/master/examples/ray/pytorch_ray_elastic.py
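A minimal sketch of what this could look like with Horovod's PyTorch `DistributedOptimizer` (non-elastic, unlike the Ray example linked above). The model here is a placeholder convolutional stand-in for the real ConvLSTM, the data is random dummy tensors, and scaling the learning rate by `hvd.size()` is a common heuristic for synchronous data-parallel SGD, not something taken from the linked example:

```python
# Sketch: plain SGD wrapped by Horovod for synchronous data-parallel training.
# Placeholders: the model stands in for the ConvLSTM; the loop uses dummy data.
import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()  # one Horovod process per worker/GPU
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())  # pin this process to its local GPU

# Placeholder network; swap in the actual ConvLSTM module here.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 16 * 16, 10),
)
if torch.cuda.is_available():
    model.cuda()

# Plain SGD; lr scaled by worker count (common heuristic, tune as needed).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are allreduce-averaged across workers
# on every step, which keeps the replicas in sync.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)

# Ensure all workers start from identical model and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

loss_fn = nn.CrossEntropyLoss()
for step in range(100):  # replace with a real DataLoader + DistributedSampler
    x = torch.randn(4, 1, 16, 16)          # dummy input batch
    y = torch.randint(0, 10, (4,))         # dummy targets
    if torch.cuda.is_available():
        x, y = x.cuda(), y.cuda()
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()  # gradients averaged across workers before the update
```

Launched with Horovod's runner, e.g. `horovodrun -np 4 python train.py` for 4 workers. For the elastic variant in the linked example, the training loop would instead be wrapped with `hvd.elastic` and run under Ray, but the optimizer wrapping shown here stays the same.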