Commit a6bd0826 authored by Fahad Khalid

First commit from the already tested collection.

parent fa39acde
Showing with 817 additions and 2 deletions
datasets/mnist/caffe/mnist_test_lmdb/data.mdb filter=lfs diff=lfs merge=lfs -text
datasets/mnist/caffe/mnist_test_lmdb/lock.mdb filter=lfs diff=lfs merge=lfs -text
datasets/mnist/caffe/mnist_train_lmdb/data.mdb filter=lfs diff=lfs merge=lfs -text
datasets/mnist/caffe/mnist_train_lmdb/lock.mdb filter=lfs diff=lfs merge=lfs -text
datasets/mnist/keras/mnist.npz filter=lfs diff=lfs merge=lfs -text
datasets/mnist/pytorch/data/processed/training.pt filter=lfs diff=lfs merge=lfs -text
datasets/mnist/pytorch/data/processed/test.pt filter=lfs diff=lfs merge=lfs -text
datasets/mnist/raw/t10k-images-idx3-ubyte.gz filter=lfs diff=lfs merge=lfs -text
datasets/mnist/raw/t10k-labels-idx1-ubyte.gz filter=lfs diff=lfs merge=lfs -text
datasets/mnist/raw/train-images-idx3-ubyte.gz filter=lfs diff=lfs merge=lfs -text
datasets/mnist/raw/train-labels-idx1-ubyte.gz filter=lfs diff=lfs merge=lfs -text
# Created by .ignore support plugin (hsz.mobi)
### Python template
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
# Translations
*.mo
*.pot
# Django stuff:
*.log
.static_storage/
.media/
local_settings.py
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# pyenv
.python-version
# celery beat schedule file
celerybeat-schedule
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
venv3/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
# PyCharm
.idea
keras.json
# Tensorflow/keras Checkpoints
mnist_convnet_model/
# Getting started with ML/DL on Supercomputers

Samples and documentation for the "Getting started with ML/DL on Supercomputers" tutorial. This
repository is intended to serve as a tutorial for anyone interested in utilizing the supercomputers
available at the JSC for ML/DL related projects. It is assumed that the reader is proficient in one or
more of the following frameworks:
* [Tensorflow](https://www.tensorflow.org/)
* [Keras](https://keras.io/)
* [PyTorch](https://pytorch.org/)
* [Caffe](http://caffe.berkeleyvision.org/)
* [Horovod](https://github.com/horovod/horovod)
**Note:** This tutorial is by no means intended as an introduction to ML/DL, or to any of the
above-mentioned frameworks. If you are interested in educational resources for beginners, please
visit [this](https://gitlab.version.fz-juelich.de/MLDL_FZJ/MLDL_FZJ_Wiki/wikis/Education) page.
### A word regarding the code samples
Samples for each framework are available in the correspondingly named directory. Each such
directory typically contains at least one code sample, which trains a simple artificial neural
network on the canonical MNIST hand-written digit classification task. Moreover, job submission
scripts are included for all the supercomputers on which this tutorial has been tested. The job
scripts will hopefully make it easier to figure out which modules to load. Finally,
a `README.md` file contains further information about the contents of the directory.
**Disclaimer:** The samples are neither intended to serve as examples of optimized code, nor do
they represent programming best practices.
### Changes made to support loading of pre-downloaded datasets
It is worth mentioning that all the code samples were taken from the corresponding framework's
official samples/tutorials repository, as practitioners are likely familiar with these (links
to the original code samples are included in the directory-local `README.md`). However, the
original examples are designed to automatically download the required dataset in a
framework-defined directory. This is not a feasible option as compute nodes on the supercomputers
do not have access to the Internet. Therefore, the samples have been slightly modified to load data from
the `datasets` directory included in this repository; specific code changes, at least for now,
have been marked by comments prefixed with the `[HPCNS]` tag. For more information see the `README.md`
available in the `datasets` directory.
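To make the nature of these changes concrete, here is a condensed sketch of the pattern (adapted
from `caffe/lenet_python/train_lenet.py` in this repository; the exact lines vary per sample):

```python
# [HPCNS] Import the DataValidator, which validates and resolves the path
# to the already downloaded dataset (the helper lives in the `utils`
# directory of this repository).
import sys
sys.path.insert(0, '../../utils')
from data_utils import DataValidator

# [HPCNS] Resolve the local copy of the data instead of letting the
# framework download it from the Internet.
data_dir = DataValidator.validated_data_dir('mnist/caffe/mnist_train_lmdb')
```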
## 1. Applying for user accounts on supercomputers
In case you do not already have an account on your supercomputer of interest, please take a look at the
instructions provided in the following sub-sections.
### 1.1 JURECA and JUWELS
For more information on getting accounts on JURECA and JUWELS, click
[here](http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/ComputingTime/computingTime_node.html).
### 1.2 JURON
To get a user account on JURON, please follow the steps below:
1. Write an email to [Dirk Pleiter](http://www.fz-juelich.de/SharedDocs/Personen/IAS/JSC/EN/staff/pleiter_d.html?nn=362224),
in which you introduce yourself and mention why you need the account.
2. Apply for the account via the [JuDoor](https://dspserv.zam.kfa-juelich.de/judoor/login) portal
(more information about JuDoor is available [here](http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/NewUsageModel/JuDoor.html?nn=945700)).
If your work is related to the Human Brain Project (HBP), please join the `PCP0` and `CPCP0` projects.
Otherwise please join the `PADC` and `CPADC` projects.
## 2. Logging on to the supercomputers
Assuming JURECA is the target supercomputer, the following steps are required to log in
(more information [here](http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JURECA/UserInfo/QuickIntroduction.html?nn=1803700)).
1. Use SSH to login:
`ssh <username>@jureca.fz-juelich.de`
2. Upon successful login, activate your project environment (more information [here](http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/NewUsageModel/NewUsageModel_node.html)):
`jutil env activate -p <project name> -A <accounting project name>`
3. Change to the project directory:
`cd $PROJECT`
You should be in your project directory at this point. If you'd like to clone this repository
elsewhere, please change to that directory.
**Note:** The same steps are valid for logging on to JURON, except that the server address in
step 1 should be: `juron.fz-juelich.de`
## 3. Cloning the repository
In order to store the datasets within the repository, we use Git LFS. This makes cloning the
repository a little bit different. Please find below the instructions on how to clone on different
systems. To learn more about Git LFS, click [here](http://gitlab.pages.jsc.fz-juelich.de/lfs/).
**Note:** During the cloning process you will most likely be prompted for your username and
password twice; this is expected.
### 3.1 JURECA
1. Load the Git LFS module:
`module load git-lfs/2.6.1`
2. Initialize Git LFS:
`git lfs install`
3. Clone the repository, including the datasets:
`git lfs clone https://gitlab.version.fz-juelich.de/khalid1/dl_framework_testing.git`
### 3.2 JURON
No additional setup is required on JURON. You can simply clone the repository along with the
datasets using the following command:

    git lfs clone https://gitlab.version.fz-juelich.de/khalid1/dl_framework_testing.git

## 4. Running a sample
Let us consider a scenario where you would like to run the `mnist.py` sample available in the `keras`
directory. This sample trains a CNN on MNIST using Keras on a single GPU. The following sub-sections list
the steps required for different supercomputers.
### 4.1 JURECA
1. Assuming you are in the repository root, change to the keras directory:
`cd keras`
2. Submit the job to run the sample:
`sbatch submit_job_jureca_python3.sh`
That's it; this is all you need for job submission. If you'd like to receive email notifications
regarding the status of the job, add the following statement to the "SLURM job configuration"
block in the `submit_job_jureca_python3.sh` script (replace `<your email address here>` with your
email address).

    #SBATCH --mail-user=<your email address here>

Output from the job is available in the `error` and `output` files, as specified in the job
configuration.
### 4.2 JURON
1. Assuming you are in the repository root, change to the keras directory:
`cd keras`
2. Submit the job to run the sample:
`bsub < submit_job_juron_python3.sh`
Please note that unlike JURECA, JURON uses LSF for job submission, which is why a different
syntax is required for job configuration and submission. Moreover, email notifications are not
supported on JURON. For more information on how to use LSF on JURON, use the following command:

    man 7 juron-lsf

Output from the job is available in the `error` and `output` files, as specified in the job
configuration.
## 5. Python 2 support
All the code samples are compatible with both Python 2 and Python 3. However, not all frameworks on all
machines are available for Python 2 (yet); in certain cases these are only available for Python 3. We have
included separate job submission scripts for Python 2 and Python 3. In cases where Python 2 is not
supported, only the job submission script for Python 3 is available. We will try our best to make
all frameworks available with Python 2 as well, but this will not be a priority as the official support
for Python 2 will be discontinued in the year 2020.
## 6. Distributed training
[Horovod](https://github.com/horovod/horovod) provides a simple and efficient solution for
training artificial neural networks on multiple GPUs across multiple nodes in a cluster. It can
be used with Tensorflow, Keras, and PyTorch (some other frameworks are supported as well, but
not Caffe). In this repository, the `horovod` directory contains a sub-directory for each
compatible framework that has been tested. For example, the `keras` sub-directory contains
samples that utilize distributed training with Keras and Horovod (more information is available
in the directory-local `README.md`).
Please note that Horovod currently only supports a distribution strategy where the entire model is
replicated on all GPUs. It is the data that is distributed across the GPUs. If you are interested
in model-parallel training, where the model itself can be split and distributed, a different
solution is required. We hope to add a sample for model-parallel training at a later time.
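For reference, below is a minimal sketch of what this data-parallel pattern looks like with Keras
(stand-in data and model, illustrative only; the tested samples in the `horovod` directory may
differ in detail):

```python
# Minimal, illustrative sketch of data-parallel training with Keras and
# Horovod. Uses stand-in data/model; NOT the repository's tested sample.
import numpy as np
import keras
import tensorflow as tf
import horovod.keras as hvd

# Initialize Horovod; one process is launched per GPU.
hvd.init()

# Pin each process to its own local GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
keras.backend.set_session(tf.Session(config=config))

# Stand-in data and model so the sketch is self-contained.
x_train = np.random.rand(1024, 784).astype('float32')
y_train = keras.utils.to_categorical(np.random.randint(10, size=1024), 10)
model = keras.models.Sequential(
    [keras.layers.Dense(10, activation='softmax', input_shape=(784,))])

# Scale the learning rate by the worker count and wrap the optimizer, so
# gradients are averaged across all GPUs after every batch.
opt = hvd.DistributedOptimizer(keras.optimizers.SGD(lr=0.01 * hvd.size()))
model.compile(loss='categorical_crossentropy', optimizer=opt)

# Broadcast initial weights from rank 0 so all replicas start identically.
model.fit(x_train, y_train, batch_size=32, epochs=1,
          callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
          verbose=1 if hvd.rank() == 0 else 0)
```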
Caffe does not support multi-node training. However, it has built-in support for [multi-GPU
training](https://github.com/BVLC/caffe/blob/master/docs/multigpu.md) on a single node (only
via the C/C++ interface). The `mnist_cmd` sample in the `caffe` directory contains the job
script that can be used to train the model on multiple GPUs. Please see the
directory-local `README.md` for further information.
## Credits
* **Created by:** Fahad Khalid (SLNS/HPCNS, JSC)
* **Installation of modules on JURON:** Andreas Herten (HPCNS, JSC)
* **Installation of modules on JURECA:** Damian Alvarez (JSC), Rajalekshmi Deepu (SLNS/HPCNS, JSC)
* **Review/suggestions/testing:** Kai Krajsek (SLNS/HPCNS, JSC)
# Notes
There are three ways in which Caffe can be used:
1. As a command line tool with only built-in layers
2. As a library from within a Python program. Either only built-in layers can be used,
or one or more custom layers can be written in Python.
3. As a command line tool with one or more custom C++ layers.
## Caffe as a command line tool
The `mnist_cmd` sub-directory contains configuration and job scripts for running
Caffe as a command line tool with only built-in layers. This example represents use
case 1 as described above. The `lenet_solver.prototxt` and `lenet_train_test.prototxt`
files were taken from the MNIST examples directory in the Caffe repository, available
[here](https://github.com/BVLC/caffe/tree/master/examples/mnist). Minor changes have
been made so that the path to the input dataset is correct. The `caffe` command
in the job submission scripts can be modified as follows to run training on
all available GPUs on the node (the value of the `-gpu` option is changed from `0` to `all`):

    caffe train --solver=lenet_solver.prototxt -gpu all

## Using Caffe within a Python program
The `lenet_python` sub-directory contains the required files for an example of
using Caffe as a library from within a Python program. This corresponds to use case
2 as described above. The `train_lenet.py` file contains source code adapted from
the IPython notebook `01-learning-lenet.ipynb` available in the Caffe examples
[here](https://github.com/BVLC/caffe/tree/master/examples). Running this example
results in the generation of a learning curve plot in the current directory.
## Caffe with custom C++ layers
Working with custom C++ layers requires recompiling Caffe with the custom code. As
this is not possible with a system-wide installation, we have decided not to
include an example of this use case. Nevertheless, if you must work with custom
C++ layers and require assistance, please send an email to the mailing list
(more information [here](https://lists.fz-juelich.de/mailman/listinfo/ml)).
# The train/test net protocol buffer definition
train_net: "lenet_auto_train.prototxt"
test_net: "lenet_auto_test.prototxt"
# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 100
# Carry out testing every 500 training iterations.
test_interval: 500
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75
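# (the "inv" policy computes lr = base_lr * (1 + gamma * iter) ^ (-power))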
# Display every 100 iterations
display: 100
# The maximum number of iterations
max_iter: 10000
# snapshot intermediate results
snapshot: 5000
snapshot_prefix: "snapshots/lenet"
#!/usr/bin/env bash
# Slurm job configuration
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --output=output_%j.out
#SBATCH --error=error_%j.er
#SBATCH --time=00:10:00
#SBATCH --job-name=CAFFE_LENET_PYTHON
#SBATCH --gres=gpu:1 --partition=develgpus
#SBATCH --mail-type=ALL
# Load the required modules
module use /usr/local/software/jureca/OtherStages
module load Stages/Devel-2018b
module load GCC/7.3.0
module load MVAPICH2/2.3-GDR
module load Caffe/1.0-Python-2.7.15
# Run the program
srun python -u train_lenet.py
#!/usr/bin/env bash
#BSUB -q normal
#BSUB -W 10
#BSUB -n 1
#BSUB -R "span[ptile=1]"
#BSUB -gpu "num=1"
#BSUB -e "error.%J.er"
#BSUB -o "output_%J.out"
#BSUB -J CAFFE_LENET_PYTHON
# Load the Python and Caffe modules
module load python/2.7.14
module load caffe/1.0-gcc_5.4.0-cuda_10.0.130
# Train LeNet
python -u train_lenet.py
#!/usr/bin/env bash
#BSUB -q normal
#BSUB -W 10
#BSUB -n 1
#BSUB -R "span[ptile=1]"
#BSUB -gpu "num=1"
#BSUB -e "error.%J.er"
#BSUB -o "output_%J.out"
#BSUB -J CAFFE_LENET_PYTHON
# Load the Python and Caffe modules
module load python/3.6.1
module load caffe/1.0-gcc_5.4.0-cuda_10.0.130
# Train LeNet
python -u train_lenet.py
import os
import sys
import matplotlib
# Force matplotlib to not use any Xwindows backend.
matplotlib.use('Agg')
import pylab
import caffe
from caffe import layers as L, params as P
# Import the DataValidator, which can then be used to
# validate and load the path to the already downloaded dataset.
sys.path.insert(0, '../../utils')
from data_utils import DataValidator
# Prepares network specification
def lenet(lmdb, batch_size):
    # Caffe's version of LeNet: a series of linear and simple nonlinear transformations
    n = caffe.NetSpec()

    n.data, n.label = L.Data(batch_size=batch_size, backend=P.Data.LMDB, source=lmdb,
                             transform_param=dict(scale=1. / 255), ntop=2)

    n.conv1 = L.Convolution(n.data, kernel_size=5, num_output=20, weight_filler=dict(type='xavier'))
    n.pool1 = L.Pooling(n.conv1, kernel_size=2, stride=2, pool=P.Pooling.MAX)
    n.conv2 = L.Convolution(n.pool1, kernel_size=5, num_output=50, weight_filler=dict(type='xavier'))
    n.pool2 = L.Pooling(n.conv2, kernel_size=2, stride=2, pool=P.Pooling.MAX)
    n.fc1 = L.InnerProduct(n.pool2, num_output=500, weight_filler=dict(type='xavier'))
    n.relu1 = L.ReLU(n.fc1, in_place=True)
    n.score = L.InnerProduct(n.relu1, num_output=10, weight_filler=dict(type='xavier'))
    n.loss = L.SoftmaxWithLoss(n.score, n.label)

    return n.to_proto()
# Names of the directories containing the LMDB files for TRAIN and TEST phases
test_dir = 'mnist/caffe/mnist_test_lmdb'
train_dir = 'mnist/caffe/mnist_train_lmdb'
# Validated path to the data root
DataValidator.validated_data_dir(train_dir)
data_dir = DataValidator.validated_data_dir(test_dir)
# Write the prototxt for TRAIN phase
with open('lenet_auto_train.prototxt', 'w') as f:
    f.write(str(lenet(os.path.join(data_dir, train_dir), 64)))

# Write the prototxt for TEST phase
with open('lenet_auto_test.prototxt', 'w') as f:
    f.write(str(lenet(os.path.join(data_dir, test_dir), 100)))
# Use the GPU for training
caffe.set_device(0)
caffe.set_mode_gpu()
# Load the solver and create train and test nets
solver = None # ignore this workaround for lmdb data (can't instantiate two solvers on the same data)
solver = caffe.SGDSolver('lenet_auto_solver.prototxt')
solver.net.forward() # train net
solver.test_nets[0].forward() # test net (there can be more than one)
niter = 200
test_interval = 25
# losses will also be stored in the log
train_loss = pylab.zeros(niter)
test_acc = pylab.zeros(int(pylab.ceil(niter / test_interval)))
output = pylab.zeros((niter, 8, 10))
# the main solver loop
for it in range(niter):
    solver.step(1)  # SGD by Caffe

    # store the train loss
    train_loss[it] = solver.net.blobs['loss'].data

    # store the output on the first test batch
    # (start the forward pass at conv1 to avoid loading new data)
    solver.test_nets[0].forward(start='conv1')
    output[it] = solver.test_nets[0].blobs['score'].data[:8]

    # run a full test every so often
    # (Caffe can also do this for us and write to a log, but we show here
    # how to do it directly in Python, where more complicated things are easier.)
    if it % test_interval == 0:
        print('Iteration', it, 'testing...')
        correct = 0
        for test_it in range(100):
            solver.test_nets[0].forward()
            correct += sum(solver.test_nets[0].blobs['score'].data.argmax(1)
                           == solver.test_nets[0].blobs['label'].data)
        test_acc[it // test_interval] = correct / 1e4
# Plot the training curve
_, ax1 = pylab.subplots()
ax2 = ax1.twinx()
ax1.plot(pylab.arange(niter), train_loss)
ax2.plot(test_interval * pylab.arange(len(test_acc)), test_acc, 'r')
ax1.set_xlabel('iteration')
ax1.set_ylabel('train loss')
ax2.set_ylabel('test accuracy')
ax2.set_title('Test Accuracy: {:.2f}'.format(test_acc[-1]))
# Save the plot to file. Use "bbox_inches='tight'" to remove surrounding whitespace
pylab.savefig('learning_curve.png', bbox_inches='tight')
# The train/test net protocol buffer definition
net: "lenet_train_test.prototxt"
# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 100
# Carry out testing every 500 training iterations.
test_interval: 500
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75
# Display every 100 iterations
display: 100
# The maximum number of iterations
max_iter: 10000
# snapshot intermediate results
snapshot: 5000
snapshot_prefix: "snapshots/lenet"
# solver mode: CPU or GPU
solver_mode: GPU
name: "LeNet"
layer {
name: "mnist"
type: "Data"
top: "data"
top: "label"
include {
phase: TRAIN
}
transform_param {
scale: 0.00390625
}
data_param {
source: "../../datasets/mnist/caffe/mnist_train_lmdb"
batch_size: 64
backend: LMDB
}
}
layer {
name: "mnist"
type: "Data"
top: "data"
top: "label"
include {
phase: TEST
}
transform_param {
scale: 0.00390625
}
data_param {
source: "../../datasets/mnist/caffe/mnist_test_lmdb"
batch_size: 100
backend: LMDB
}
}
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 20
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool1"
type: "Pooling"
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "conv2"
type: "Convolution"
bottom: "pool1"
top: "conv2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 50
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool2"
type: "Pooling"
bottom: "conv2"
top: "pool2"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "ip1"
type: "InnerProduct"
bottom: "pool2"
top: "ip1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 500
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu1"
type: "ReLU"
bottom: "ip1"
top: "ip1"
}
layer {
name: "ip2"
type: "InnerProduct"
bottom: "ip1"
top: "ip2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 10
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "accuracy"
type: "Accuracy"
bottom: "ip2"
bottom: "label"
top: "accuracy"
include {
phase: TEST
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip2"
bottom: "label"
top: "loss"
}
#!/usr/bin/env bash
# Slurm job configuration
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --output=output_%j.out
#SBATCH --error=error_%j.er
#SBATCH --time=00:10:00
#SBATCH --job-name=CAFFE_MNIST_CMD
#SBATCH --gres=gpu:1 --partition=develgpus
#SBATCH --mail-type=ALL
# Load the required modules
module use /usr/local/software/jureca/OtherStages
module load Stages/Devel-2018b
module load GCC/7.3.0
module load MVAPICH2/2.3-GDR
module load Caffe/1.0-Python-2.7.15
# Train the model using the 'caffe' binary
srun caffe train --solver=lenet_solver.prototxt -gpu 0
#!/usr/bin/env bash
#BSUB -q normal
#BSUB -W 10
#BSUB -n 1
#BSUB -R "span[ptile=1]"
#BSUB -gpu "num=1"
#BSUB -e "error.%J.er"
#BSUB -o "output_%J.out"
#BSUB -J CAFFE_MNIST_CMD
# Load the Python and Caffe modules
module load python/2.7.14
module load caffe/1.0-gcc_5.4.0-cuda_10.0.130
# Train a model for MNIST
caffe train --solver=lenet_solver.prototxt -gpu 0
#!/usr/bin/env bash
#BSUB -q normal
#BSUB -W 10
#BSUB -n 1
#BSUB -R "span[ptile=1]"
#BSUB -gpu "num=1"
#BSUB -e "error.%J.er"
#BSUB -o "output_%J.out"
#BSUB -J CAFFE_MNIST_CMD
# Load the Python and Caffe modules
module load python/3.6.1
module load caffe/1.0-gcc_5.4.0-cuda_10.0.130
# Train a model for MNIST
caffe train --solver=lenet_solver.prototxt -gpu 0
# Notes
To keep the code samples as simple as possible, all examples use the
[MNIST](http://yann.lecun.com/exdb/mnist/) dataset for training a Convolutional
Neural Network on the hand-written digit classification problem. Furthermore, we
decided to take code samples from the official models/examples repositories
maintained by the respective framework developers, as these are the samples
practitioners typically encounter when getting started with a framework.
However, the original examples are designed to automatically download the required
dataset in a framework-defined directory. This is not a feasible option as compute
nodes on the supercomputers do not have access to the Internet. Therefore, the samples
have been slightly modified to load data from this `datasets` directory. It contains
the MNIST dataset in different formats because samples for different frameworks expect
the dataset in a different format.
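For example, the Keras copy is a single `mnist.npz` archive that can be read directly with NumPy
(a minimal sketch, assuming the standard Keras archive keys):

```python
# Minimal sketch: load the Keras-format MNIST copy with NumPy.
# Assumes the standard `mnist.npz` keys (x_train, y_train, x_test, y_test).
import numpy as np

with np.load('mnist/keras/mnist.npz') as data:
    x_train, y_train = data['x_train'], data['y_train']
    x_test, y_test = data['x_test'], data['y_test']
```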
It is possible to set the `DL_TEST_DATA_HOME` environment variable to point to a
different directory; however, that directory must contain a recursive copy of the
`mnist` sub-directory as available here.
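A hypothetical sketch of this lookup (the repository's actual logic lives in the
`DataValidator` helper under `utils`):

```python
# Hypothetical sketch only; see utils/data_utils.py for the real helper.
import os

def data_root(default='datasets'):
    # If DL_TEST_DATA_HOME is set, it must mirror the `mnist` layout here.
    return os.environ.get('DL_TEST_DATA_HOME', default)
```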