diff --git a/README.md b/README.md index 9938dad4859ce2d3b44b22362946913dc71488cf..78c4874c8d6e133bccf5953cc894144a491deb9f 100644 --- a/README.md +++ b/README.md @@ -1,23 +1,26 @@ # Getting started with Deep Learning on Supercomputers This repository is intended to serve as a tutorial for anyone interested in utilizing the supercomputers -available at the JSC for deep learning based projects. It is assumed that the reader is proficient in one or -more of the following frameworks: +available at the Jülich Supercomputing Centre (JSC) for deep learning-based projects. It is assumed that +the reader is proficient in one or more of the following frameworks: * [Tensorflow](https://www.tensorflow.org/) * [Keras](https://keras.io/) -* [PyTorch](https://pytorch.org/) -* [Caffe](http://caffe.berkeleyvision.org/) * [Horovod](https://github.com/horovod/horovod) +* [Caffe](http://caffe.berkeleyvision.org/) (limited support) **Note:** This tutorial is by no means intended as an introduction to deep learning, or to any of the above mentioned frameworks. If you are interested in educational resources for beginners, please visit [this](https://gitlab.version.fz-juelich.de/MLDL_FZJ/MLDL_FZJ_Wiki/wikis/Education) page. -## Announcements +### Announcements 1. Tensorflow and Keras examples (with and without Horovod) are now fully functional on JUWELS as well. 2. Python 2 support has been removed from the tutorial for all frameworks except Caffe. +3. Even though PyTorch is available as a system-wide module on the JSC supercomputers, all PyTorch +examples have been removed from this tutorial. This is because the tutorial +developers are not currently working with PyTorch, and are therefore not in a position to provide +support for PyTorch-related issues. # Table of contents <!-- TOC --> @@ -93,20 +96,23 @@ Otherwise please join the `PADC` and `CPADC` projects. ### 4.1 JURECA and JUWELS -Following are the steps required to login (more information -[here](http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JURECA/UserInfo/QuickIntroduction.html?nn=1803700)). +Following are the steps required to log in (more information: +[JURECA](https://apps.fz-juelich.de/jsc/hps/jureca/access.html#access), +[JUWELS](https://apps.fz-juelich.de/jsc/hps/juwels/access.html#access)). -1. Use SSH to login: +1. Use SSH to log in with one of the following commands, depending on your target system: - `ssh <username>@jureca.fz-juelich.de` + `ssh <username>@jureca.fz-juelich.de` or `ssh <username>@juwels.fz-juelich.de` 2. Upon successful login, activate your project environment: `jutil env activate -p <name of compute project> -A <name of budget>` - **Note:** To view a list of all project and budget names available to you, please use the following command: `jutil user projects -o columns`. - Under the column titled "project", all names that start with the prefix "c" are compute projects, and + **Note:** To view a list of all project and budget names available to you, please use the following command: + `jutil user projects -o columns`. Each name under the column titled "project" has a corresponding type under the + column titled "project-type". All projects with "project-type" "C" are compute projects, and can be used in the `<name of compute project>` field for the command above. The `<name of budget>` field should then - contain the corresponding name under the "budgets" column. 
Please click [here](http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/NewUsageModel/NewUsageModel_node.html) + contain the corresponding name under the "budgets" column. Please click [here]( + http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/NewUsageModel/NewUsageModel_node.html) for more information. 3. Change to the project directory: @@ -207,7 +213,7 @@ configuration. `bsub < submit_job_juron_python3.sh` -Please note that unlike JURECA, JURON uses LSF for job submission, which is why a different +Please note that unlike JURECA and JUWELS, JURON uses LSF for job submission, which is why a different syntax is required for job configuration and submission. Moreover, email notifications are not supported on JURON. For more information on how to use LSF on JURON, use the following command: @@ -218,18 +224,20 @@ configuration. ## 7. Python 2 support -All the code samples are compatible with both Python 2 and Python 3. However, not all frameworks on all -machines are available for Python 2 (yet); in certain cases these are only available for Python 3. We have -included separate job submission scripts for Python 2 and Python 3. In cases where Python 2 is not -supported, only the job submission script for Python 3 is available. We will try our best to make -all frameworks available with Python 2 as well, but this will not be a priority as the official support -for Python 2 will be discontinued in the year 2020. +As the official support for Python 2 will be discontinued in 2020, we decided to encourage our +users to make the switch to Python 3 now. This also enables us to provide better support for +Python 3-based modules, as we no longer have to spend time maintaining Python 2 modules. + +The only exception is Caffe, as on JURECA it is available with Python 2 only. Please note, however, that +unlike on JURON, on JURECA Caffe is only available in Stage 2018b, i.e., one of the previous stages. +We do not intend to provide support for Caffe from Stage 2019a onward, as +Caffe is no longer being developed. ## 8. Distributed training [Horovod](https://github.com/horovod/horovod) provides a simple and efficient solution for training artificial neural networks on multiple GPUs across multiple nodes in a cluster. It can -be used with Tensorflow, Keras, and PyTorch (some other frameworks are supported as well, but +be used with Tensorflow and Keras (some other frameworks are supported as well, but not Caffe). In this repository, the `horovod` directory contains further sub-directories; one for each compatible framework that has been tested. E.g., there is a `keras` sub-directory that contains samples that utilize distributed training with Keras and Horovod (more information is available @@ -251,4 +259,5 @@ directory-local `README.md` for further information. 
* **Created by:** Fahad Khalid (SLNS/HPCNS, JSC) * **Installation of modules on JURON:** Andreas Herten (HPCNS, JSC) * **Installation of modules on JURECA:** Damian Alvarez (JSC), Rajalekshmi Deepu (SLNS/HPCNS, JSC) -* **Initial review/suggestions/testing:** Kai Krajsek (SLNS/HPCNS, JSC), Tabea Kirchner (SLNS/HPCNS, JSC) +* **Review/suggestions/testing:** Kai Krajsek (SLNS/HPCNS, JSC), Tabea Kirchner (SLNS/HPCNS, JSC), +Susanne Wenzel (INM-1) diff --git a/caffe/README.md b/caffe/README.md index dffc7afd8fbea6eb57ed880e680e10e6fa99e591..941c3d6d9b813f26d6b9ba7098c912d57b478d20 100644 --- a/caffe/README.md +++ b/caffe/README.md @@ -38,6 +38,6 @@ results in the generation of a learning curve plot in the current directory. Working with custom C++ layers requires recompiling Caffe with the custom code. As this is not possible with a system-wide installation, we have decided not to include an example of this use case. Nevertheless, if you must work with custom -C++ layers and require assistance, please send an email to the mailing list +C++ layers and require assistance, please send an email to the JULAIN mailing list (more information [here](https://lists.fz-juelich.de/mailman/listinfo/ml)). diff --git a/horovod/README.md b/horovod/README.md index a06e588efdf93526944189d3d9b7b9798f3c66c7..3d63a23deb70123b799da301b71b89cdafc7d649 100644 --- a/horovod/README.md +++ b/horovod/README.md @@ -2,7 +2,7 @@ All source code samples were taken from the Horovod examples repository [here](https://github.com/uber/horovod/tree/master/examples) -(last checked: February 19, 2019). The samples that work with MNIST data have been +(last checked: September 02, 2019). The samples that work with MNIST data have been slightly modified. Our changes are limited to, * The data loading mechanism @@ -22,23 +22,6 @@ for distributed training. 2. `mnist_advanced.py`: This sample is primarily the same as `mnist.py`. However, a few more advanced Horovod features are used. -## PyTorch samples - -**Note:** PyTorch samples currently DO NOT work on JURECA and JUWELS. These -do however work on JURON. - -The following PyTorch samples are included: - -1. `mnist.py`: Demonstrates distributed training using Horovod with PyTorch. A -simple convolutional neural network is trained on the MNIST dataset. -2. `synthetic_benchmark.py`: A benchmark that can be used to measure performance -of PyTorch with Horovod without using any external dataset. - -**Note:** The job scripts for JURECA and JUWELS are prefixed with `.` for these samples, so that -these scripts do not appear in the directory listing. The reason for doing this is -that our testing revealed issues with multi-node training. As soon as the issue has -been resolved, we'll make the scripts available. 
- ## Tensorflow samples The following Tensorflow samples are included: diff --git a/horovod/keras/mnist.py b/horovod/keras/mnist.py index 85dd94467ebaf1dcc44516fb13e66c596f4a4f9f..e31aa8a009e2fee923023dc18d5e5979c6d70203 100644 --- a/horovod/keras/mnist.py +++ b/horovod/keras/mnist.py @@ -104,7 +104,7 @@ model.fit(x_train, y_train, batch_size=batch_size, callbacks=callbacks, epochs=epochs, - verbose=1, + verbose=1 if hvd.rank() == 0 else 0, validation_data=(x_test, y_test)) score = model.evaluate(x_test, y_test, verbose=0) print('Test loss:', score[0]) diff --git a/horovod/pytorch/.submit_job_jureca.sh b/horovod/pytorch/.submit_job_jureca.sh deleted file mode 100755 index 1afd8012e9c0ebb3078c0972bb05ede9caadbf2d..0000000000000000000000000000000000000000 --- a/horovod/pytorch/.submit_job_jureca.sh +++ /dev/null @@ -1,22 +0,0 @@ -#!/usr/bin/env bash - -# Slurm job configuration -#SBATCH --nodes=2 -#SBATCH --ntasks=4 -#SBATCH --ntasks-per-node=2 -#SBATCH --output=output_%j.out -#SBATCH --error=error_%j.er -#SBATCH --time=00:10:00 -#SBATCH --job-name=HOROVOD_PYTORCH_MNIST -#SBATCH --gres=gpu:2 --partition=develgpus -#SBATCH --mail-type=ALL - -# Load the required modules -module load GCC/7.3.0 -module load MVAPICH2/2.3-GDR -module load PyTorch/1.0.0-GPU-Python-3.6.6 -module load torchvision/0.2.1-GPU-Python-3.6.6 -module load Horovod/0.15.2-GPU-Python-3.6.6 - -# Run the program -srun python -u mnist.py diff --git a/horovod/pytorch/.submit_job_juwels.sh b/horovod/pytorch/.submit_job_juwels.sh deleted file mode 100755 index 50700557fa9d4de04c861691d328a90a45580ee4..0000000000000000000000000000000000000000 --- a/horovod/pytorch/.submit_job_juwels.sh +++ /dev/null @@ -1,22 +0,0 @@ -#!/usr/bin/env bash - -# Slurm job configuration -#SBATCH --nodes=2 -#SBATCH --ntasks=8 -#SBATCH --ntasks-per-node=4 -#SBATCH --output=output_%j.out -#SBATCH --error=error_%j.er -#SBATCH --time=00:10:00 -#SBATCH --job-name=HOROVOD_PYTORCH_MNIST -#SBATCH --gres=gpu:4 --partition=develgpus -#SBATCH --mail-type=ALL - -# Load the required modules -module load GCC/8.3.0 -module load MVAPICH2/2.3.1-GDR -module load PyTorch/1.1.0-GPU-Python-3.6.8 -module load torchvision/0.3.0-GPU-Python-3.6.8 -module load Horovod/0.16.2-GPU-Python-3.6.8 - -# Run the program -srun python -u mnist.py diff --git a/horovod/pytorch/mnist.py b/horovod/pytorch/mnist.py deleted file mode 100644 index 4f431934cf4f221372c75fcf310c6086ad800d16..0000000000000000000000000000000000000000 --- a/horovod/pytorch/mnist.py +++ /dev/null @@ -1,200 +0,0 @@ -from __future__ import print_function -import os -import sys -import shutil -import argparse -import torch.nn as nn -import torch.nn.functional as F -import torch.optim as optim -from torchvision import datasets, transforms -import torch.utils.data.distributed -import horovod.torch as hvd - -# Training settings -parser = argparse.ArgumentParser(description='PyTorch MNIST Example') -parser.add_argument('--batch-size', type=int, default=64, metavar='N', - help='input batch size for training (default: 64)') -parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N', - help='input batch size for testing (default: 1000)') -parser.add_argument('--epochs', type=int, default=10, metavar='N', - help='number of epochs to train (default: 10)') -parser.add_argument('--lr', type=float, default=0.01, metavar='LR', - help='learning rate (default: 0.01)') -parser.add_argument('--momentum', type=float, default=0.5, metavar='M', - help='SGD momentum (default: 0.5)') -parser.add_argument('--no-cuda', 
action='store_true', default=False, - help='disables CUDA training') -parser.add_argument('--seed', type=int, default=42, metavar='S', - help='random seed (default: 42)') -parser.add_argument('--log-interval', type=int, default=10, metavar='N', - help='how many batches to wait before logging training status') -parser.add_argument('--fp16-allreduce', action='store_true', default=False, - help='use fp16 compression during allreduce') -args = parser.parse_args() -args.cuda = not args.no_cuda and torch.cuda.is_available() - -# [HPCNS] Import the DataValidator, which can then be used to -# validate and load the path to the already downloaded dataset. -sys.path.insert(0, '../../utils') -from data_utils import DataValidator - -# [HPCNS] Name of the dataset file -data_file = 'mnist/pytorch/data' - -# [HPCNS] Path to the directory containing the dataset file -data_dir = DataValidator.validated_data_dir(data_file) - -# Horovod: initialize library. -hvd.init() -torch.manual_seed(args.seed) - -if args.cuda: - # Horovod: pin GPU to local rank. - torch.cuda.set_device(hvd.local_rank()) - torch.cuda.manual_seed(args.seed) - -# Horovod: limit # of CPU threads to be used per worker. -torch.set_num_threads(1) - -# [HPCNS] Fully qualified dataset file name -dataset_file = os.path.join(data_dir, data_file) - -# [HPCNS] Dataset filename for this rank -dataset_root_for_rank = 'MNIST-data-{}'.format(hvd.rank()) -dataset_for_rank = dataset_root_for_rank + '/MNIST' - -# [HPCNS] If the path already exists, remove it -if os.path.exists(dataset_for_rank): - shutil.rmtree(dataset_for_rank) - -# [HPCNS] Make a copy of the dataset for this rank -shutil.copytree(dataset_file, dataset_for_rank) - -kwargs = {'num_workers': 1, 'pin_memory': True} if args.cuda else {} -train_dataset = \ - datasets.MNIST(dataset_root_for_rank, train=True, download=False, - transform=transforms.Compose([ - transforms.ToTensor(), - transforms.Normalize((0.1307,), (0.3081,)) - ])) -# Horovod: use DistributedSampler to partition the training data. -train_sampler = torch.utils.data.distributed.DistributedSampler( - train_dataset, num_replicas=hvd.size(), rank=hvd.rank()) -train_loader = torch.utils.data.DataLoader( - train_dataset, batch_size=args.batch_size, sampler=train_sampler, **kwargs) - -test_dataset = \ - datasets.MNIST(dataset_root_for_rank, train=False, download=False, transform=transforms.Compose([ - transforms.ToTensor(), - transforms.Normalize((0.1307,), (0.3081,)) - ])) -# Horovod: use DistributedSampler to partition the test data. -test_sampler = torch.utils.data.distributed.DistributedSampler( - test_dataset, num_replicas=hvd.size(), rank=hvd.rank()) -test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=args.test_batch_size, - sampler=test_sampler, **kwargs) - - -class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv1 = nn.Conv2d(1, 10, kernel_size=5) - self.conv2 = nn.Conv2d(10, 20, kernel_size=5) - self.conv2_drop = nn.Dropout2d() - self.fc1 = nn.Linear(320, 50) - self.fc2 = nn.Linear(50, 10) - - def forward(self, x): - x = F.relu(F.max_pool2d(self.conv1(x), 2)) - x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2)) - x = x.view(-1, 320) - x = F.relu(self.fc1(x)) - x = F.dropout(x, training=self.training) - x = self.fc2(x) - return F.log_softmax(x) - - -model = Net() - -if args.cuda: - # Move model to GPU. - model.cuda() - -# Horovod: scale learning rate by the number of GPUs. 
-optimizer = optim.SGD(model.parameters(), lr=args.lr * hvd.size(), - momentum=args.momentum) - -# Horovod: broadcast parameters & optimizer state. -hvd.broadcast_parameters(model.state_dict(), root_rank=0) -hvd.broadcast_optimizer_state(optimizer, root_rank=0) - -# Horovod: (optional) compression algorithm. -compression = hvd.Compression.fp16 if args.fp16_allreduce else hvd.Compression.none - -# Horovod: wrap optimizer with DistributedOptimizer. -optimizer = hvd.DistributedOptimizer(optimizer, - named_parameters=model.named_parameters(), - compression=compression) - - -def train(epoch): - model.train() - # Horovod: set epoch to sampler for shuffling. - train_sampler.set_epoch(epoch) - for batch_idx, (data, target) in enumerate(train_loader): - if args.cuda: - data, target = data.cuda(), target.cuda() - optimizer.zero_grad() - output = model(data) - loss = F.nll_loss(output, target) - loss.backward() - optimizer.step() - if batch_idx % args.log_interval == 0: - # Horovod: use train_sampler to determine the number of examples in - # this worker's partition. - print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format( - epoch, batch_idx * len(data), len(train_sampler), - 100. * batch_idx / len(train_loader), loss.item())) - - -def metric_average(val, name): - tensor = torch.tensor(val) - avg_tensor = hvd.allreduce(tensor, name=name) - return avg_tensor.item() - - -def test(): - model.eval() - test_loss = 0. - test_accuracy = 0. - for data, target in test_loader: - if args.cuda: - data, target = data.cuda(), target.cuda() - output = model(data) - # sum up batch loss - test_loss += F.nll_loss(output, target, size_average=False).item() - # get the index of the max log-probability - pred = output.data.max(1, keepdim=True)[1] - test_accuracy += pred.eq(target.data.view_as(pred)).cpu().float().sum() - - # Horovod: use test_sampler to determine the number of examples in - # this worker's partition. - test_loss /= len(test_sampler) - test_accuracy /= len(test_sampler) - - # Horovod: average metric values across workers. - test_loss = metric_average(test_loss, 'avg_loss') - test_accuracy = metric_average(test_accuracy, 'avg_accuracy') - - # Horovod: print output only on first rank. - if hvd.rank() == 0: - print('\nTest set: Average loss: {:.4f}, Accuracy: {:.2f}%\n'.format( - test_loss, 100. 
* test_accuracy)) - - -for epoch in range(1, args.epochs + 1): - train(epoch) - test() - -# [HPCNS] Remove the copied dataset -shutil.rmtree(dataset_root_for_rank) diff --git a/horovod/pytorch/run_on_localMachine.sh b/horovod/pytorch/run_on_localMachine.sh deleted file mode 100644 index 9c9afb4b58ee9f4a42480997dd298b6e33c71a35..0000000000000000000000000000000000000000 --- a/horovod/pytorch/run_on_localMachine.sh +++ /dev/null @@ -1,8 +0,0 @@ -#!/usr/bin/env bash - -# Run the program -mpirun -np 1 -H localhost:1 \ - -bind-to none -map-by slot \ - -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \ - -mca pml ob1 -mca btl ^openib \ - python -u mnist.py diff --git a/horovod/pytorch/submit_job_juron.sh b/horovod/pytorch/submit_job_juron.sh deleted file mode 100644 index 126c939b04c3f0cf8b3180e251b009c03ad69d0e..0000000000000000000000000000000000000000 --- a/horovod/pytorch/submit_job_juron.sh +++ /dev/null @@ -1,20 +0,0 @@ -#!/usr/bin/env bash - -#BSUB -q normal -#BSUB -W 10 -#BSUB -n 4 -#BSUB -R "span[ptile=2]" -#BSUB -gpu "num=2" -#BSUB -e "error.%J.er" -#BSUB -o "output_%J.out" -#BSUB -J PYTORCH_HOROVOD_MNIST - -# Load the required modules -module load python/3.6.1 -module load pytorch/1.0.1-gcc_5.4.0-cuda_10.0.130 -module load torchvision/0.2.1 -module load horovod/0.15.2 - -# Run the program -mpirun -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \ - -x PATH -mca pml ob1 -mca btl ^openib python -u mnist.py diff --git a/horovod/pytorch/synthetic_benchmark.py b/horovod/pytorch/synthetic_benchmark.py deleted file mode 100644 index e7a177f8b4e8583cb8169d308660fec8b7fc1664..0000000000000000000000000000000000000000 --- a/horovod/pytorch/synthetic_benchmark.py +++ /dev/null @@ -1,110 +0,0 @@ -from __future__ import print_function - -import argparse -import torch.backends.cudnn as cudnn -import torch.nn.functional as F -import torch.optim as optim -import torch.utils.data.distributed -from torchvision import models -import horovod.torch as hvd -import timeit -import numpy as np - -# Benchmark settings -parser = argparse.ArgumentParser(description='PyTorch Synthetic Benchmark', - formatter_class=argparse.ArgumentDefaultsHelpFormatter) -parser.add_argument('--fp16-allreduce', action='store_true', default=False, - help='use fp16 compression during allreduce') - -parser.add_argument('--model', type=str, default='resnet50', - help='model to benchmark') -parser.add_argument('--batch-size', type=int, default=32, - help='input batch size') - -parser.add_argument('--num-warmup-batches', type=int, default=10, - help='number of warm-up batches that don\'t count towards benchmark') -parser.add_argument('--num-batches-per-iter', type=int, default=10, - help='number of batches per benchmark iteration') -parser.add_argument('--num-iters', type=int, default=10, - help='number of benchmark iterations') - -parser.add_argument('--no-cuda', action='store_true', default=False, - help='disables CUDA training') - -args = parser.parse_args() -args.cuda = not args.no_cuda and torch.cuda.is_available() - -hvd.init() - -if args.cuda: - # Horovod: pin GPU to local rank. - torch.cuda.set_device(hvd.local_rank()) - -cudnn.benchmark = True - -# Set up standard model. -model = getattr(models, args.model)() - -if args.cuda: - # Move model to GPU. - model.cuda() - -optimizer = optim.SGD(model.parameters(), lr=0.01) - -# Horovod: (optional) compression algorithm. -compression = hvd.Compression.fp16 if args.fp16_allreduce else hvd.Compression.none - -# Horovod: wrap optimizer with DistributedOptimizer. 
-optimizer = hvd.DistributedOptimizer(optimizer, - named_parameters=model.named_parameters(), - compression=compression) - -# Horovod: broadcast parameters & optimizer state. -hvd.broadcast_parameters(model.state_dict(), root_rank=0) -hvd.broadcast_optimizer_state(optimizer, root_rank=0) - -# Set up fixed fake data -data = torch.randn(args.batch_size, 3, 224, 224) -target = torch.LongTensor(args.batch_size).random_() % 1000 -if args.cuda: - data, target = data.cuda(), target.cuda() - - -def benchmark_step(): - optimizer.zero_grad() - output = model(data) - loss = F.cross_entropy(output, target) - loss.backward() - optimizer.step() - - -def log(s, nl=True): - if hvd.rank() != 0: - return - print(s, end='\n' if nl else '') - - -log('Model: %s' % args.model) -log('Batch size: %d' % args.batch_size) -device = 'GPU' if args.cuda else 'CPU' -log('Number of %ss: %d' % (device, hvd.size())) - -# Warm-up -log('Running warmup...') -timeit.timeit(benchmark_step, number=args.num_warmup_batches) - -# Benchmark -log('Running benchmark...') -img_secs = [] -for x in range(args.num_iters): - time = timeit.timeit(benchmark_step, number=args.num_batches_per_iter) - img_sec = args.batch_size * args.num_batches_per_iter / time - log('Iter #%d: %.1f img/sec per %s' % (x, img_sec, device)) - img_secs.append(img_sec) - -# Results -img_sec_mean = np.mean(img_secs) -img_sec_conf = 1.96 * np.std(img_secs) -log('Img/sec per %s: %.1f +-%.1f' % (device, img_sec_mean, img_sec_conf)) -log('Total img/sec on %d %s(s): %.1f +-%.1f' % - (hvd.size(), device, hvd.size() * img_sec_mean, hvd.size() * img_sec_conf)) diff --git a/horovod/tensorflow/mnist.py b/horovod/tensorflow/mnist.py index da37944b01335cb3d78b20e5245d9518fae8779e..8099f1c22a3927c9b38adb7375a60f752b28acf2 100644 --- a/horovod/tensorflow/mnist.py +++ b/horovod/tensorflow/mnist.py @@ -118,7 +118,7 @@ def main(_): predict, loss = conv_model(image, label, tf.estimator.ModeKeys.TRAIN) # Horovod: adjust learning rate based on number of GPUs. - opt = tf.train.RMSPropOptimizer(0.001 * hvd.size()) + opt = tf.train.AdamOptimizer(0.001 * hvd.size()) # Horovod: add Horovod Distributed Optimizer. opt = hvd.DistributedOptimizer(opt) diff --git a/horovod/tensorflow/synthetic_benchmark.py b/horovod/tensorflow/synthetic_benchmark.py index abbdd20fdb933dbde47f7d92f644da2454dbd8e7..ee401a5cc8ca05def1a87a14c0d66608bab38b18 100644 --- a/horovod/tensorflow/synthetic_benchmark.py +++ b/horovod/tensorflow/synthetic_benchmark.py @@ -69,8 +69,8 @@ target = tf.random_uniform([args.batch_size, 1], minval=0, maxval=999, dtype=tf. def loss_function(): - logits = model(data, training=True) - return tf.losses.sparse_softmax_cross_entropy(target, logits) + probs = model(data, training=True) + return tf.losses.sparse_softmax_cross_entropy(target, probs) def log(s, nl=True): diff --git a/keras/README.md b/keras/README.md index 598f4e1f95aca48216c4d10b1e48c18ef7466363..4e8462ddc50a18a7219ef38e5aacca5283f02411 100644 --- a/keras/README.md +++ b/keras/README.md @@ -3,7 +3,7 @@ The `mnist.py` sample is a slightly modified version of `mnist_cnn.py` available in the Keras examples repository [here](https://github.com/keras-team/keras/tree/master/examples) -(last checked: February 19, 2019). Our changes are +(last checked: September 02, 2019). 
Our changes are limited to, * The data loading mechanism diff --git a/pytorch/.submit_job_juwels.sh b/pytorch/.submit_job_juwels.sh deleted file mode 100755 index 15f53ac1a55630cc5c628413738dacd4fab4429e..0000000000000000000000000000000000000000 --- a/pytorch/.submit_job_juwels.sh +++ /dev/null @@ -1,20 +0,0 @@ -#!/usr/bin/env bash - -# Slurm job configuration -#SBATCH --nodes=1 -#SBATCH --ntasks=1 -#SBATCH --ntasks-per-node=1 -#SBATCH --output=output_%j.out -#SBATCH --error=error_%j.er -#SBATCH --time=00:10:00 -#SBATCH --job-name=PYTORCH_MNIST -#SBATCH --gres=gpu:1 --partition=develgpus -#SBATCH --mail-type=ALL - -# Load the required modules -module load GCC/8.3.0 -module load PyTorch/1.1.0-GPU-Python-3.6.8 -module load torchvision/0.3.0-GPU-Python-3.6.8 - -# Run the program -srun python -u mnist.py diff --git a/pytorch/README.md b/pytorch/README.md deleted file mode 100644 index ac1ac2f2d168d7843d479c27a82d288faf10176a..0000000000000000000000000000000000000000 --- a/pytorch/README.md +++ /dev/null @@ -1,13 +0,0 @@ -# Notes - -The `mnist.py` sample is a slightly modified version of `main.py` -available in the PyTorch examples repository -[here](https://github.com/pytorch/examples/tree/master/mnist) -(last checked: February 19, 2019). Our changes are -limited to, - -* The data loading mechanism -* A bit of code cleanup -* A few additional comments pertaining to our custom data loading mechanism - -**Note:** All newly added statements follow a comment beginning with `[HPCNS]`. \ No newline at end of file diff --git a/pytorch/mnist.py b/pytorch/mnist.py deleted file mode 100644 index 19bcac053726b51c1cb8d1c393546f70d037d6fd..0000000000000000000000000000000000000000 --- a/pytorch/mnist.py +++ /dev/null @@ -1,151 +0,0 @@ -from __future__ import print_function - -import os -import sys -import shutil -import argparse -import torch -import torch.nn as nn -import torch.nn.functional as F -import torch.optim as optim -from torchvision import datasets, transforms - -# [HPCNS] Import the DataValidator, which can then be used to -# validate and load the path to the already downloaded dataset. -sys.path.insert(0, '../utils') -from data_utils import DataValidator - - -class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv1 = nn.Conv2d(1, 20, 5, 1) - self.conv2 = nn.Conv2d(20, 50, 5, 1) - self.fc1 = nn.Linear(4 * 4 * 50, 500) - self.fc2 = nn.Linear(500, 10) - - def forward(self, x): - x = F.relu(self.conv1(x)) - x = F.max_pool2d(x, 2, 2) - x = F.relu(self.conv2(x)) - x = F.max_pool2d(x, 2, 2) - x = x.view(-1, 4 * 4 * 50) - x = F.relu(self.fc1(x)) - x = self.fc2(x) - return F.log_softmax(x, dim=1) - - -def train(args, model, device, train_loader, optimizer, epoch): - model.train() - for batch_idx, (data, target) in enumerate(train_loader): - data, target = data.to(device), target.to(device) - optimizer.zero_grad() - output = model(data) - loss = F.nll_loss(output, target) - loss.backward() - optimizer.step() - if batch_idx % args.log_interval == 0: - print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format( - epoch, batch_idx * len(data), len(train_loader.dataset), - 100. 
* batch_idx / len(train_loader), loss.item())) - - -def test(args, model, device, test_loader): - model.eval() - test_loss = 0 - correct = 0 - with torch.no_grad(): - for data, target in test_loader: - data, target = data.to(device), target.to(device) - output = model(data) - test_loss += F.nll_loss(output, target, reduction='sum').item() # sum up batch loss - pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability - correct += pred.eq(target.view_as(pred)).sum().item() - - test_loss /= len(test_loader.dataset) - - print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format( - test_loss, correct, len(test_loader.dataset), - 100. * correct / len(test_loader.dataset))) - - -def main(): - # Training settings - parser = argparse.ArgumentParser(description='PyTorch MNIST Example') - parser.add_argument('--batch-size', type=int, default=64, metavar='N', - help='input batch size for training (default: 64)') - parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N', - help='input batch size for testing (default: 1000)') - parser.add_argument('--epochs', type=int, default=10, metavar='N', - help='number of epochs to train (default: 10)') - parser.add_argument('--lr', type=float, default=0.01, metavar='LR', - help='learning rate (default: 0.01)') - parser.add_argument('--momentum', type=float, default=0.5, metavar='M', - help='SGD momentum (default: 0.5)') - parser.add_argument('--no-cuda', action='store_true', default=False, - help='disables CUDA training') - parser.add_argument('--seed', type=int, default=1, metavar='S', - help='random seed (default: 1)') - parser.add_argument('--log-interval', type=int, default=10, metavar='N', - help='how many batches to wait before logging training status') - - parser.add_argument('--save-model', action='store_true', default=False, - help='For Saving the current Model') - args = parser.parse_args() - use_cuda = not args.no_cuda and torch.cuda.is_available() - - torch.manual_seed(args.seed) - - device = torch.device("cuda" if use_cuda else "cpu") - - # [HPCNS] Name of the dataset file - data_file = 'mnist/pytorch/data' - - # [HPCNS] Path to the directory containing the dataset file - data_dir = DataValidator.validated_data_dir(data_file) - - # [HPCNS] Fully qualified dataset file name - dataset_file = os.path.join(data_dir, data_file) - - # [HPCNS] A copy of the dataset in the current directory - dataset_copy = 'MNIST' - - # [HPCNS] If the path already exists, remove it - if os.path.exists(dataset_copy): - shutil.rmtree(dataset_copy) - - # [HPCNS] Make a copy of the dataset, as the torch data loader used - # below expects the dataset in the current directory - shutil.copytree(dataset_file, dataset_copy) - - kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {} - train_loader = torch.utils.data.DataLoader( - datasets.MNIST('', train=True, download=False, - transform=transforms.Compose([ - transforms.ToTensor(), - transforms.Normalize((0.1307,), (0.3081,)) - ])), - batch_size=args.batch_size, shuffle=True, **kwargs) - test_loader = torch.utils.data.DataLoader( - datasets.MNIST('', train=False, download=False, transform=transforms.Compose([ - transforms.ToTensor(), - transforms.Normalize((0.1307,), (0.3081,)) - ])), - batch_size=args.test_batch_size, shuffle=True, **kwargs) - - model = Net().to(device) - optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum) - - for epoch in range(1, args.epochs + 1): - train(args, model, device, train_loader, optimizer, epoch) - 
test(args, model, device, test_loader) - - if (args.save_model): - torch.save(model.state_dict(), "mnist_cnn.pt") - - # [HPCNS] Remove the copied dataset - shutil.rmtree(dataset_copy) - - -if __name__ == '__main__': - main() diff --git a/pytorch/run_on_localMachine.sh b/pytorch/run_on_localMachine.sh deleted file mode 100644 index 9c5737c9fc9d6bca93e25fca9f785e52320131fc..0000000000000000000000000000000000000000 --- a/pytorch/run_on_localMachine.sh +++ /dev/null @@ -1,4 +0,0 @@ -#!/usr/bin/env bash - -# Run the program -python -u mnist.py \ No newline at end of file diff --git a/pytorch/submit_job_jureca.sh b/pytorch/submit_job_jureca.sh deleted file mode 100755 index 15f53ac1a55630cc5c628413738dacd4fab4429e..0000000000000000000000000000000000000000 --- a/pytorch/submit_job_jureca.sh +++ /dev/null @@ -1,20 +0,0 @@ -#!/usr/bin/env bash - -# Slurm job configuration -#SBATCH --nodes=1 -#SBATCH --ntasks=1 -#SBATCH --ntasks-per-node=1 -#SBATCH --output=output_%j.out -#SBATCH --error=error_%j.er -#SBATCH --time=00:10:00 -#SBATCH --job-name=PYTORCH_MNIST -#SBATCH --gres=gpu:1 --partition=develgpus -#SBATCH --mail-type=ALL - -# Load the required modules -module load GCC/8.3.0 -module load PyTorch/1.1.0-GPU-Python-3.6.8 -module load torchvision/0.3.0-GPU-Python-3.6.8 - -# Run the program -srun python -u mnist.py diff --git a/pytorch/submit_job_juron.sh b/pytorch/submit_job_juron.sh deleted file mode 100644 index 061139f19cf8f9cdc03e8d4ced3d1c15f66ae49c..0000000000000000000000000000000000000000 --- a/pytorch/submit_job_juron.sh +++ /dev/null @@ -1,18 +0,0 @@ -#!/usr/bin/env bash - -#BSUB -q normal -#BSUB -W 10 -#BSUB -n 1 -#BSUB -R "span[ptile=1]" -#BSUB -gpu "num=1" -#BSUB -e "error.%J.er" -#BSUB -o "output_%J.out" -#BSUB -J PYTORCH_MNIST - -# Load the required modules -module load python/3.6.1 -module load pytorch/1.0.1-gcc_5.4.0-cuda_10.0.130 -module load torchvision/0.2.1 - -# Run the program -python -u mnist.py \ No newline at end of file diff --git a/tensorflow/README.md b/tensorflow/README.md index cbf485424ae35ac8a1e8fcdd4650ffa8a08114df..3bf439c0cf8aaa5020252c1099d6b55b2e6ff07a 100644 --- a/tensorflow/README.md +++ b/tensorflow/README.md @@ -3,7 +3,7 @@ The `mnist.py` sample is a slightly modified version of `convolutional.py` available in the Tensorflow models repository [here](https://github.com/tensorflow/models/blob/master/tutorials/image/mnist) -(last checked: February 19, 2019). Our changes are +(last checked: September 02, 2019). Our changes are limited to, * The data loading mechanism
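
For context, the distributed-training pattern that the `horovod/keras/mnist.py` change above builds on looks roughly like the sketch below. It assumes the TF1-era `horovod.keras` API used elsewhere in this tutorial; the tiny model and the random data are placeholders for illustration only and are not code from the repository.

```python
# Minimal Horovod + Keras sketch (TF1-era API assumed); model and data are placeholders.
import numpy as np
import tensorflow as tf
import keras
import horovod.keras as hvd

hvd.init()  # one process per GPU, started e.g. via srun (Slurm) or mpirun (LSF)

# Pin each process to a single GPU using its node-local rank
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
keras.backend.set_session(tf.Session(config=config))

# Placeholder data and model (stand-ins for the MNIST pipeline in horovod/keras/mnist.py)
x_train = np.random.rand(1024, 784).astype('float32')
y_train = keras.utils.to_categorical(np.random.randint(10, size=1024), 10)

model = keras.models.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    keras.layers.Dense(10, activation='softmax'),
])

# Scale the learning rate by the number of workers and wrap the optimizer so that
# gradients are averaged across all ranks before each update.
opt = keras.optimizers.Adadelta(1.0 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

callbacks = [
    # Broadcast the initial weights from rank 0 so every worker starts from the same state
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

model.fit(x_train, y_train,
          batch_size=64,
          epochs=2,
          callbacks=callbacks,
          # Report progress only on rank 0, mirroring the change to horovod/keras/mnist.py
          verbose=1 if hvd.rank() == 0 else 0)
```

Such a script would be launched with `srun python -u <script>.py` from a Slurm job script on JURECA/JUWELS, or with `mpirun ... python -u <script>.py` from an LSF job script on JURON, in the same way as the submission scripts shipped with this repository.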