%% Cell type:markdown id: tags:
# Neural Networks - Exercise #
Neural networks are a very versatile and popular class of algorithms that can be used for regression tasks and classification.
In this tutorial, we want to use one to again classify hand-written digits from the MNIST dataset. The MNIST dataset is often called the "Hello World" of machine learning.
![The mnist images](images/neural-network.png)
The idea is to train a neural network on a large data set of digit images and then use this network to predict the labels of unseen test images. The aims of this exercise are:
1. Implement and check the cost function of the Neural Network
2. Classify images from a test data set and compute the accuracy.
3. Figure out how test set accuracy and training set accuracy depend on the number of training samples.
4. Try to improve the accuracy by changing the number of neurons in the hidden layer or by changing the regularization.
%% Cell type:markdown id: tags:
## Loading the data ##
Let's download the MNIST dataset.
%% Cell type:code id: tags:
``` python
import mnist
imgs_train = mnist.train_images()
y_train = mnist.train_labels()
imgs_test = mnist.test_images()
y_test = mnist.test_labels()
print(imgs_train.shape)
print(imgs_test.shape)
```
%% Cell type:code id: tags:
``` python
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
fig, axes = plt.subplots(1, 8)
fig.set_size_inches(18, 8)
# show the first 8 images of the training set
for i in range(0, 8):
    ax = axes[i]
    ax.imshow(imgs_train[i, :, :], cmap='gray');
```
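%% Cell type:markdown id: tags:
Just to see what the ground truth looks like, let's also print the labels belonging to these first 8 training images:
%% Cell type:code id: tags:
``` python
# the labels are simply the digits 0-9
print(y_train[0:8])
```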
%% Cell type:markdown id: tags:
## Data normalization and preparation ##
As in the previous exercise, the data have to be normalized.
%% Cell type:code id: tags:
``` python
def normalize_and_prepare(imgs):
    # normalize between -0.5 ... 0.5
    imgs_norm = np.array(imgs, dtype=float) / 255. - 0.5
    # linearize the 2d image into a 1d feature vector
    return imgs_norm.reshape((imgs.shape[0], imgs.shape[1] * imgs.shape[2]))
# we don't want to use the full data set, as our memory could run out
n_train = 10000
n_test = 10000
X_train = normalize_and_prepare(imgs_train[0:n_train, :, :])
X_test = normalize_and_prepare(imgs_test[0:n_test, :, :])
y_train = y_train[0:n_train]
y_test = y_test[0:n_test]
X_train.shape
```
%% Cell type:markdown id: tags:
## The hypothesis of Neural Networks and the loss function ##
We want to implement the following network.
![](images/fully-connected-neural-network.png)
It has the following architecture:
- The number of inputs $n_1$ equals the number of pixels (i.e. $28 \times 28 = 784$).
- The number of hidden neurons $n_2$ can be chosen arbitrarily; for now we choose 25.
- The neural network has $K = 10$ output neurons, each of them representing one label.
The neural network can be interpreted as multiple chained logistic regressors. The output of one layer equals the input of the next layer. Let $a_l$ be the activation/output of layer $l$. Then
$$ a^{(l+1)}_i = \sigma\left(\sum_{j = 1}^{n_l}\theta^{(l)}_{i,j} a^{(l)}_j\right), \qquad\text{or in vector form}\qquad a^{(l+1)} = \sigma\left(\theta^{(l)} a^{(l)} \right) $$
The hypothesis is the output of the last layer, i.e.
$$ h_\theta(x) =a^{(3)} $$
The loss function is very similar to the logistic regression:
$$ J(\theta) := \frac 1 m \sum_{i=1}^m \sum_{k=1}^K \left[ -y_k^{(i)}\log\left(h_\theta(x^{(i)})_k\right) - \left(1-y_k^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})_k \right)\right] + \frac \lambda {2 m} \left[\sum_{k = 1}^{n_1}\sum_{j = 1}^{n_2}(\theta_{j,k}^{(1)})^2 + \sum_{k = 1}^{n_2}\sum_{j = 1}^{K}(\theta_{j,k}^{(2)})^2 \right]$$
Here, $\theta$ denotes the combined parameters $\theta^{(1)}$ and $\theta^{(2)}$ of the hidden and output layer. Again, $\lambda$ is the regularization parameter. The only difference to logistic regression is the sum over $K$, which accounts for the $K$ outputs compared to just one output in the logistic regression.
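Since the network has $K = 10$ outputs, the labels $y^{(i)}$ enter the loss function in one-hot encoded form, i.e. as a vector with a 1 at the position of the correct digit and 0 everywhere else. Below is a minimal sketch of what this encoding does (later we use `mytools.encode_one_hot` for it); the example labels are made up:
%% Cell type:code id: tags:
``` python
# one-hot encode the example labels 3, 1 and 4 into vectors of length 10;
# the rows of the identity matrix np.eye(10) are exactly the one-hot vectors
labels = np.array([3, 1, 4])
print(np.eye(10)[labels])
```
%% Cell type:markdown id: tags: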
### Propagation ###
To implement the forward propagation, perform the following steps:
- $ a^{(1)} = x $
- Add the row of 1s to $a^{(1)}$ to account for the constant bias
- $ z^{(2)} = \theta^{(1)} a^{(1)} $
- $ a^{(2)} = \sigma(z^{(2)})$
- Add the row of 1s to $a^{(2)}$ to account for the constant bias
- $ z^{(3)} = \theta^{(2)} a^{(2)} $
- $ a^{(3)} = \sigma(z^{(3)})$
- $h = a^{(3)}$
Then compute the loss function $J$ from $h$.
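Below is a minimal sketch of these steps on tiny random toy data (the names `X_toy`, `th1`, `th2` and the sizes are made up purely for illustration; the real implementation belongs in `nn_loss_function` further below). Note that in the code of this notebook the samples are stored as rows, so the bias becomes a *column* of ones and the products read $a\,\theta$ instead of $\theta\,a$. The regularization term is omitted here.
%% Cell type:code id: tags:
``` python
# toy sizes: 5 samples, 4 inputs, 3 hidden neurons, 10 outputs
m, n_in, n_hid, K = 5, 4, 3, 10
rng = np.random.RandomState(0)
X_toy = rng.rand(m, n_in) - 0.5           # toy inputs
y_toy = np.eye(K)[rng.randint(0, K, m)]   # toy one-hot labels
th1 = rng.randn(n_in + 1, n_hid) * 0.1    # theta^(1)
th2 = rng.randn(n_hid + 1, K) * 0.1       # theta^(2)

def sig(z):
    # local helper, identical to the sigmoid defined further below
    return 1. / (1. + np.exp(-z))

a1 = np.hstack((np.ones((m, 1)), X_toy))    # a^(1) with bias column
z2 = np.dot(a1, th1)                        # z^(2)
a2 = np.hstack((np.ones((m, 1)), sig(z2)))  # a^(2) with bias column
z3 = np.dot(a2, th2)                        # z^(3)
h = sig(z3)                                 # hypothesis a^(3)

# loss without regularization
J = np.sum(-y_toy * np.log(h) - (1. - y_toy) * np.log(1. - h)) / m
print("toy loss:", J)
```
%% Cell type:markdown id: tags: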
### Backpropagation ###
Here, we take the adjoint approach and differentiate each of the equations above in reverse order:
- $dz_\dagger^{(3)} = \frac 1 m (h - y) $ (Error of the output layer)
- $d\theta_\dagger^{(2)} = {a^{(2)}}^T dz_\dagger^{(3)}$ (gradient of $\theta^{(2)}$)
- $d{a_\dagger^{(2)}} = \theta^{(2)} {dz_\dagger^{(3)}}^T$
- remove first row from $d{a_\dagger^{(2)}}$
- $dz_\dagger^{(2)} = d{a_\dagger^{(2)}} \odot \sigma'(z^{(2)}) $ (elementwise product; error of the hidden layer)
- $d\theta_\dagger^{(1)} = {a^{(1)}}^T dz_\dagger^{(2)}$ (gradient of $\theta^{(1)}$)
Now, we also include the regularization terms in the gradients:
- $d\theta_\dagger^{(1)} = d\theta_\dagger^{(1)} + \frac \lambda m \theta^{(1)} $
- $d\theta_\dagger^{(2)} = d\theta_\dagger^{(2)} + \frac \lambda m \theta^{(2)} $
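Continuing with the toy variables from the forward sketch above, the backward pass could look like the following sketch (again without the regularization terms; since the samples are stored as rows, the bias is dropped as a column instead of a row):
%% Cell type:code id: tags:
``` python
dz3 = (h - y_toy) / m                 # error of the output layer
dth2 = np.dot(a2.T, dz3)              # gradient w.r.t. theta^(2)
da2 = np.dot(dz3, th2.T)              # propagate the error back through theta^(2)
da2 = da2[:, 1:]                      # drop the bias column
dz2 = da2 * sig(z2) * (1. - sig(z2))  # error of the hidden layer
dth1 = np.dot(a1.T, dz2)              # gradient w.r.t. theta^(1)
print(dth1.shape, dth2.shape)         # should match th1 and th2
```
%% Cell type:markdown id: tags: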
Again, we need to define the logistic function $\sigma(z)$:
%% Cell type:markdown id: tags:
### Implementation ###
__Exercise:__ Now it's your turn. Implement forward propagation and backpropagation:
%% Cell type:code id: tags:
``` python
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    return sigmoid(z) * (1. - sigmoid(z))

import copy

def nn_loss_function(theta, X, y, lam, n_hidden_layer, n_labels):
    """
    :param theta: Combined (flattened) parameters of both layers
    :param X: Input values (n_samples x n_features)
    :param y: One-hot encoded ground truth labels for each sample of X
    :param lam: Regularization parameter
    :param n_hidden_layer: Number of neurons in the hidden layer
    :param n_labels: Number of output labels K
    :return: Cost value and flattened gradient
    """
    # number of data items
    m_samples = X.shape[0]
    n_features = X.shape[1]

    # extract theta_1 and theta_2 from combined theta
    theta = copy.copy(theta)
    n_w_layer1 = (n_features + 1) * n_hidden_layer
    theta_1 = theta[0: n_w_layer1]
    theta_1 = theta_1.reshape((n_features + 1, n_hidden_layer))
    theta_2 = theta[n_w_layer1:]
    theta_2 = theta_2.reshape((n_hidden_layer + 1, n_labels))

    #### start your code ####

    # TODO: Compute the loss function of the neural network
    J = 0.

    # TODO: Compute the gradients of both layers.
    # The resulting gradients have the following shapes
    theta_1_grad = np.zeros((n_features + 1, n_hidden_layer))
    theta_2_grad = np.zeros((n_hidden_layer + 1, n_labels))

    ###### end your code #####

    theta_grad = np.hstack((theta_1_grad.flatten(), theta_2_grad.flatten()))
    return J, theta_grad
```
%% Cell type:markdown id: tags:
### Accuracy check ###
Let's check the correctness of the loss function. We load some pre-defined theta values and compare the resulting loss with a reference value.
%% Cell type:code id: tags:
``` python
import mytools

# load pre-defined theta values to check the loss function against a reference value
theta_check = np.load('data/nn_theta_check.npy', allow_pickle=False)

my_loss_function = lambda theta: nn_loss_function(theta, X_train[0:5000],
                                                  mytools.encode_one_hot(y_train[0:5000], 10),
                                                  0.1, 25, 10)
computed_loss = my_loss_function(theta_check)[0]

# this value must be roughly 6.730543
if np.abs(computed_loss - 6.730543) > 1e-4:
    print("Oooops... please check your loss function")
else:
    print("Hooray, your loss function looks good")
```
%% Cell type:markdown id: tags:
Now let's check the gradient. We do this again using the finite difference approximation
$$ \frac {dJ} {d\theta_j}(\theta) \approx \frac {J(\theta + e_j h) - J(\theta)} h$$
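For a single parameter index, the check boils down to the following sketch (the index `j` and the step size `h_step` are arbitrary choices for illustration; `mytools.check_gradient` in the next cell does essentially this for the listed components):
%% Cell type:code id: tags:
``` python
# compare the finite-difference approximation with the analytic gradient
# returned by nn_loss_function for a single parameter component
j = 42                                   # arbitrary parameter index
h_step = 1e-6                            # finite-difference step size
J0, grad = my_loss_function(theta_check)
theta_pert = theta_check.copy()
theta_pert[j] += h_step
J1 = my_loss_function(theta_pert)[0]
print("finite difference:", (J1 - J0) / h_step)
print("analytic gradient:", grad[j])
```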
%% Cell type:code id: tags:
``` python
import mytools
# perform checking in the first and second layer
mytools.check_gradient(my_loss_function, theta_check, [1, 2, 3, 4, 5, 6, 19700, 19701, 19702, 19703, 19704]);
```
%% Cell type:markdown id: tags:
## Training ##
If your loss function passes the checks above, it is time to do the training!
To break the symmetry, we initialize the parameters $\theta$ with small random numbers.
%% Cell type:code id: tags:
``` python
def initial_layer_weights(n_input, n_output):
    """
    Initialize theta randomly so that we break the symmetry while
    training the neural network.
    """
    eps = 0.12
    theta = np.random.rand(n_input + 1, n_output) * eps * 2. - eps
    return theta
```
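%% Cell type:markdown id: tags:
As a quick sanity check of the shape convention (the extra row holds the bias weights), using the layer sizes from above:
%% Cell type:code id: tags:
``` python
# one extra input row for the bias weights of each layer
print(initial_layer_weights(784, 25).shape)  # expected: (785, 25)
print(initial_layer_weights(25, 10).shape)   # expected: (26, 10)
```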
%% Cell type:markdown id: tags:
This defines our training procedure...
%% Cell type:code id: tags:
``` python
import scipy.optimize

def train(X, y, n_hidden_layers, num_labels, regularization, max_iter):
    n_features = X.shape[1]

    # initialize parameters
    theta_1 = initial_layer_weights(n_features, n_hidden_layers)
    theta_2 = initial_layer_weights(n_hidden_layers, num_labels)

    # we have to linearize them for the optimizer
    theta = np.hstack((theta_1.flatten(), theta_2.flatten()))

    def cost_function(t):
        return nn_loss_function(t, X, y, regularization, n_hidden_layers, num_labels)

    print("Training neural network... time to get a coffee")
    res = scipy.optimize.minimize(cost_function, theta, jac=True,
                                  options={'disp': True, 'maxiter': max_iter}, method='CG')

    # restore layer 1 and 2 parameters
    theta_res = res.x
    n_base = (n_features + 1) * n_hidden_layers
    theta_1_res = theta_res[0: n_base]
    theta_1_res = theta_1_res.reshape((n_features + 1, n_hidden_layers))
    theta_2_res = theta_res[n_base:]
    theta_2_res = theta_2_res.reshape((n_hidden_layers + 1, num_labels))
    return theta_1_res, theta_2_res
```
%% Cell type:markdown id: tags:
Now do the training!
%% Cell type:code id: tags:
``` python
n_hidden_layers = 25
regularization = 0.1
n_train_samples = 5000
max_iter = 1000
theta_1, theta_2 = train(X_train[0: n_train_samples, :],
mytools.encode_one_hot(y_train[0: n_train_samples], 10),
n_hidden_layers, 10, regularization, max_iter)
```
%% Cell type:markdown id: tags:
## Classification ##
First we implement our predictor...
%% Cell type:code id: tags:
``` python
def predict_label(theta_1, theta_2, X):
    """
    Predicts the labels of the data using the trained neural network
    :param theta_1: The parameters of the hidden layer of the neural network
    :param theta_2: The parameters of the output layer of the neural network
    :param X: The input data to be predicted
    :return: Predicted label and its probability for each sample
    """
    # number of data items
    m_samples = X.shape[0]

    # add the constant bias feature
    a_1 = np.hstack((np.ones((m_samples, 1)), X))
    a_2 = sigmoid(np.dot(a_1, theta_1))

    # add the constant bias feature
    a_2 = np.hstack((np.ones((m_samples, 1)), a_2))
    a_3 = sigmoid(np.dot(a_2, theta_2))

    # return index of maximum probability and the probability itself
    return np.argmax(a_3, axis=1), np.max(a_3, axis=1)
```
%% Cell type:markdown id: tags:
Let's predict the first 8 images of the test set.
%% Cell type:code id: tags:
``` python
fig, axes = plt.subplots(1, 8)
fig.set_size_inches(18, 8)
# show the first 8 images of the test set
for i in range(0, 8):
    ax = axes[i]
    ax.imshow(imgs_test[i, :, :], cmap='gray');
```
%% Cell type:code id: tags:
``` python
prediction, probability = predict_label(theta_1, theta_2, X_test[0:8, :])
print("Prediction: ", prediction)
```
%% Cell type:markdown id: tags:
Let's have a look at the probabilities:
%% Cell type:code id: tags:
``` python
print ("Probability: ", probabilty)
```
%% Cell type:markdown id: tags:
## Accuracy ##
Now, let's compute the accuracy of the classifier for the whole test set. A completely untrained classifier should score roughly 10%.
__Exercise:__
- Investigate how test accuracy and training accuracy depend on the training set size. Train the classifier with different `n_train_samples` and compute the accuracies (a sketch of such an experiment follows the accuracy cells below). What do you see?
- Play around with the number of hidden layer neurons. What effect does it have?
- How does the regularization parameter $\lambda$ affect the performance?
%% Cell type:code id: tags:
``` python
labels_predicted, probability = predict_label(theta_1, theta_2, X_test)
accuracy = np.mean(np.array(labels_predicted == y_test, dtype=float))
print ("Accuracy of the neural network on the test set: %g%%" % (accuracy*100.))
```
%% Cell type:code id: tags:
``` python
labels_predicted, probability = predict_label(theta_1, theta_2, X_train[0:n_train_samples, :])
accuracy = np.mean(np.array(labels_predicted == y_train[0:n_train_samples], dtype=float))
print ("Accuracy of the neural network on the training set: %g%%" % (accuracy*100.))
```
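%% Cell type:markdown id: tags:
A possible sketch for the first exercise item (train with different training set sizes and compare training vs. test accuracy; the chosen sizes and the reduced `max_iter` are arbitrary and only meant to keep the runtime short):
%% Cell type:code id: tags:
``` python
for n in [500, 1000, 2000, 5000]:
    t1, t2 = train(X_train[0:n, :], mytools.encode_one_hot(y_train[0:n], 10),
                   n_hidden_layers, 10, regularization, max_iter=100)
    acc_train = np.mean(predict_label(t1, t2, X_train[0:n, :])[0] == y_train[0:n])
    acc_test = np.mean(predict_label(t1, t2, X_test)[0] == y_test)
    print("n = %5d: train accuracy %.3f, test accuracy %.3f" % (n, acc_train, acc_test))
```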