%% Cell type:markdown id: tags:
# Neural Networks - Exercise #
Neural networks are a very versatile and popular class of algorithms that can be used for regression tasks and classification.
In this tutorial, we want to use one to again classify hand-written digits from the MNIST dataset. The MNIST dataset is often called the "Hello World" of machine learning.
![The mnist images](images/neural-network.png)
The idea is to train a neural network on a large data set of digit images and then use this network to predict the labels of unseen test images. The aims of this exercise are:
1. Implement and check the cost function of the Neural Network
2. Classify images from a test data set and compute the accuracy.
3. Figure out how test set accuracy and training set accuracy depend on the number of training samples.
4. Try to improve the accuracy by changing the number of neurons in the hidden layer or by changing the regularization.
%% Cell type:markdown id: tags:
## Loading the data ##
Let's download the MNIST dataset.
%% Cell type:code id: tags:
``` python
import mnist
imgs_train = mnist.train_images()
y_train = mnist.train_labels()
imgs_test = mnist.test_images()
y_test = mnist.test_labels()
print(imgs_train.shape)
print(imgs_test.shape)
```
%% Cell type:code id: tags:
``` python
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
fig, axes = plt.subplots(1, 8)
fig.set_size_inches(18, 8)
# show the first 8 images of the training set
for i in range(0, 8):
    ax = axes[i]
    ax.imshow(imgs_train[i, :, :], cmap='gray');
```
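%% Cell type:markdown id: tags:
Just to see what the ground truth looks like, let's also print the labels belonging to these first 8 training images:
%% Cell type:code id: tags:
``` python
# the labels are simply the digits 0-9
print(y_train[0:8])
```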
%% Cell type:markdown id: tags:
## Data normalization and preparation ##
As in the previous exercise, the data have to be normalized.
%% Cell type:code id: tags:
``` python
def normalize_and_prepare(imgs):
    # normalize between -0.5 ... 0.5
    imgs_norm = np.array(imgs, dtype=float) / 255. - 0.5
    # linearize the 2d image into a 1d feature vector
    return imgs_norm.reshape((imgs.shape[0], imgs.shape[1] * imgs.shape[2]))
# we don't want to use the full data set, as our memory could run out
n_train = 10000
n_test = 10000
X_train = normalize_and_prepare(imgs_train[0:n_train, :, :])
X_test = normalize_and_prepare(imgs_test[0:n_test, :, :])
y_train = y_train[0:n_train]
y_test = y_test[0:n_test]
X_train.shape
```
%% Cell type:markdown id: tags:
## The hypothesis of Neural Networks and the loss function ##
We want to implement the following network.
![](images/fully-connected-neural-network.png)
It has the following architecture:
- The number of inputs $n_1$ equals the number of pixels (i.e. $28 \times 28 = 784$).
- The number of hidden neurons $n_2$ can be chosen arbitrarily; for now we choose 25.
- The neural network has $K = 10$ output neurons, each of them representing one label.
The neural network can be interpreted as multiple chained logistic regressors. The output of one layer equals the input of the next layer. Let $a_l$ be the activation/output of layer $l$. Then
$$ a^{(l+1)}_i = \sigma\left(\sum_{j = 1}^{n_l}\theta^{(l)}_{i,j} a^{(l)}_j\right), \qquad\text{or in vector form}\qquad a^{(l+1)} = \sigma\left(\theta^{(l)} a^{(l)} \right) $$
The hypothesis is the output of the last layer, i.e.
$$ h_\theta(x) =a^{(3)} $$
The loss function is very similar to the logistic regression:
$$ J(\theta) := \frac 1 m \sum_{i=1}^m \sum_{k=1}^K \left[ -y_k^{(i)}\log\left(h_\theta(x^{(i)})_k\right) - \left(1-y_k^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})_k \right)\right] + \frac \lambda {2 m} \left[\sum_{k = 1}^{n_1}\sum_{j = 1}^{n_2}(\theta_{j,k}^{(1)})^2 + \sum_{k = 1}^{n_2}\sum_{j = 1}^{K}(\theta_{j,k}^{(2)})^2 \right]$$
Here, $\theta$ denotes the combined parameters $\theta^{(1)}$ and $\theta^{(2)}$ of the hidden and output layer. Again, $\lambda$ is the regularization parameter. The only difference to logistic regression is the sum over $K$, which accounts for the $K$ outputs compared to just one output in the logistic regression.
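Since the network has $K = 10$ outputs, the labels $y^{(i)}$ enter the loss function in one-hot encoded form, i.e. as a vector with a 1 at the position of the correct digit and 0 everywhere else. Below is a minimal sketch of what this encoding does (later we use `mytools.encode_one_hot` for it); the example labels are made up:
%% Cell type:code id: tags:
``` python
# one-hot encode the example labels 3, 1 and 4 into vectors of length 10;
# the rows of the identity matrix np.eye(10) are exactly the one-hot vectors
labels = np.array([3, 1, 4])
print(np.eye(10)[labels])
```
%% Cell type:markdown id: tags: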
### Propagation ###
To implement the forward propagation, perform the following steps:
- $ a^{(1)} = x $
- Add the row of 1s to $a^{(1)}$ to account for the constant bias
- $ z^{(2)} = \theta^{(1)} a^{(1)} $
- $ a^{(2)} = \sigma(z^{(2)})$
- Add the row of 1s to $a^{(2)}$ to account for the constant bias
- $ z^{(3)} = \theta^{(2)} a^{(2)} $
- $ a^{(3)} = \sigma(z^{(3)})$
- $h = a^{(3)}$
Then compute the loss function $J$ from $h$.
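Below is a minimal sketch of these steps on tiny random toy data (the names `X_toy`, `th1`, `th2` and the sizes are made up purely for illustration; the real implementation belongs in `nn_loss_function` further below). Note that in the code of this notebook the samples are stored as rows, so the bias becomes a *column* of ones and the products read $a\,\theta$ instead of $\theta\,a$. The regularization term is omitted here.
%% Cell type:code id: tags:
``` python
# toy sizes: 5 samples, 4 inputs, 3 hidden neurons, 10 outputs
m, n_in, n_hid, K = 5, 4, 3, 10
rng = np.random.RandomState(0)
X_toy = rng.rand(m, n_in) - 0.5           # toy inputs
y_toy = np.eye(K)[rng.randint(0, K, m)]   # toy one-hot labels
th1 = rng.randn(n_in + 1, n_hid) * 0.1    # theta^(1)
th2 = rng.randn(n_hid + 1, K) * 0.1       # theta^(2)

def sig(z):
    # local helper, identical to the sigmoid defined further below
    return 1. / (1. + np.exp(-z))

a1 = np.hstack((np.ones((m, 1)), X_toy))    # a^(1) with bias column
z2 = np.dot(a1, th1)                        # z^(2)
a2 = np.hstack((np.ones((m, 1)), sig(z2)))  # a^(2) with bias column
z3 = np.dot(a2, th2)                        # z^(3)
h = sig(z3)                                 # hypothesis a^(3)

# loss without regularization
J = np.sum(-y_toy * np.log(h) - (1. - y_toy) * np.log(1. - h)) / m
print("toy loss:", J)
```
%% Cell type:markdown id: tags: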
### Backpropagation ###
Here, we take the adjoint approach and differentiate each of the equations above in reverse order:
- $dz_\dagger^{(3)} = \frac 1 m (h - y) $ (Error of the output layer)
- $d\theta_\dagger^{(2)} = {a^{(2)}}^T dz_\dagger^{(3)}$ (gradient of $\theta^{(2)}$)
- $d{a_\dagger^{(2)}} = \theta^{(2)} {dz_\dagger^{(3)}}^T$
- remove first row from $d{a_\dagger^{(2)}}$
- $dz_\dagger^{(2)} = d{a_\dagger^{(2)}} \odot \sigma'(z^{(2)}) $ (elementwise product; error of the hidden layer)
- $d\theta_\dagger^{(1)} = {a^{(1)}}^T dz_\dagger^{(2)}$ (gradient of $\theta^{(1)}$)
Now, we also include the regularization terms in the gradients:
- $d\theta_\dagger^{(1)} = d\theta_\dagger^{(1)} + \frac \lambda m \theta^{(1)} $
- $d\theta_\dagger^{(2)} = d\theta_\dagger^{(2)} + \frac \lambda m \theta^{(2)} $
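Continuing with the toy variables from the forward sketch above, the backward pass could look like the following sketch (again without the regularization terms; since the samples are stored as rows, the bias is dropped as a column instead of a row):
%% Cell type:code id: tags:
``` python
dz3 = (h - y_toy) / m                 # error of the output layer
dth2 = np.dot(a2.T, dz3)              # gradient w.r.t. theta^(2)
da2 = np.dot(dz3, th2.T)              # propagate the error back through theta^(2)
da2 = da2[:, 1:]                      # drop the bias column
dz2 = da2 * sig(z2) * (1. - sig(z2))  # error of the hidden layer
dth1 = np.dot(a1.T, dz2)              # gradient w.r.t. theta^(1)
print(dth1.shape, dth2.shape)         # should match th1 and th2
```
%% Cell type:markdown id: tags: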
Again, we need to define the logistic function $\sigma(z)$:
%% Cell type:markdown id: tags:
### Implementation ###
__Exercise:__ Now it's your turn. Implement forward propagation and backpropagation:
%% Cell type:code id: tags:
``` python
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    return sigmoid(z) * (1. - sigmoid(z))

import copy

def nn_loss_function(theta, X, y, lam, n_hidden_layer, n_labels):
    """
    :param theta: Combined (flattened) parameters of both layers
    :param X: Input values (n_samples x n_features)
    :param y: One-hot encoded ground truth labels for each sample of X
    :param lam: Regularization parameter
    :param n_hidden_layer: Number of neurons in the hidden layer
    :param n_labels: Number of output labels K
    :return: Cost value and flattened gradient
    """
    # number of data items
    m_samples = X.shape[0]
    n_features = X.shape[1]

    # extract theta_1 and theta_2 from combined theta
    theta = copy.copy(theta)
    n_w_layer1 = (n_features + 1) * n_hidden_layer
    theta_1 = theta[0: n_w_layer1]
    theta_1 = theta_1.reshape((n_features + 1, n_hidden_layer))
    theta_2 = theta[n_w_layer1:]
    theta_2 = theta_2.reshape((n_hidden_layer + 1, n_labels))

    #### start your code ####

    # TODO: Compute the loss function of the neural network
    J = 0.

    # TODO: Compute the gradients of both layers.
    # The resulting gradients have the following shapes
    theta_1_grad = np.zeros((n_features + 1, n_hidden_layer))
    theta_2_grad = np.zeros((n_hidden_layer + 1, n_labels))

    ###### end your code #####

    theta_grad = np.hstack((theta_1_grad.flatten(), theta_2_grad.flatten()))
    return J, theta_grad
```
%% Cell type:markdown id: tags:
### Accuracy check ###
Let's check the correctness of the loss function. We load some pre-defined theta values and compare the resulting loss with a reference value.
%% Cell type:code id: tags:
``` python
import mytools

# load pre-defined theta values to check the loss function against a reference value
theta_check = np.load('data/nn_theta_check.npy', allow_pickle=False)

my_loss_function = lambda theta: nn_loss_function(theta, X_train[0:5000],
                                                  mytools.encode_one_hot(y_train[0:5000], 10),
                                                  0.1, 25, 10)
computed_loss = my_loss_function(theta_check)[0]

# this value must be roughly 6.730543
if np.abs(computed_loss - 6.730543) > 1e-4:
    print("Oooops... please check your loss function")
else:
    print("Hooray, your loss function looks good")
```
%% Cell type:markdown id: tags:
Now let's check the gradient. We do this again using the finite difference approximation
$$ \frac {dJ} {d\theta_j}(\theta) \approx \frac {J(\theta + e_j h) - J(\theta)} h$$
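For a single parameter index, the check boils down to the following sketch (the index `j` and the step size `h_step` are arbitrary choices for illustration; `mytools.check_gradient` in the next cell does essentially this for the listed components):
%% Cell type:code id: tags:
``` python
# compare the finite-difference approximation with the analytic gradient
# returned by nn_loss_function for a single parameter component
j = 42                                   # arbitrary parameter index
h_step = 1e-6                            # finite-difference step size
J0, grad = my_loss_function(theta_check)
theta_pert = theta_check.copy()
theta_pert[j] += h_step
J1 = my_loss_function(theta_pert)[0]
print("finite difference:", (J1 - J0) / h_step)
print("analytic gradient:", grad[j])
```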
%% Cell type:code id: tags:
``` python
import mytools
# perform checking in the first and second layer
mytools.check_gradient(my_loss_function, theta_check, [1, 2, 3, 4, 5, 6, 19700, 19701, 19702, 19703, 19704]);
```
%% Cell type:markdown id: tags:
## Training ##
If your loss function passes the checks above, it is time to do the training!
To break the symmetry, we initialize the parameters $\theta$ with small random numbers.
%% Cell type:code id: tags:
``` python
def initial_layer_weights(n_input, n_output):
    """
    Initialize theta randomly so that we break the symmetry while
    training the neural network.
    """
    eps = 0.12
    theta = np.random.rand(n_input + 1, n_output) * eps * 2. - eps
    return theta
```
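%% Cell type:markdown id: tags:
As a quick sanity check of the shape convention (the extra row holds the bias weights), using the layer sizes from above:
%% Cell type:code id: tags:
``` python
# one extra input row for the bias weights of each layer
print(initial_layer_weights(784, 25).shape)  # expected: (785, 25)
print(initial_layer_weights(25, 10).shape)   # expected: (26, 10)
```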
%% Cell type:markdown id: tags:
This defines our training procedure...
%% Cell type:code id: tags:
``` python
import scipy.optimize

def train(X, y, n_hidden_layers, num_labels, regularization, max_iter):
    n_features = X.shape[1]

    # initialize parameters
    theta_1 = initial_layer_weights(n_features, n_hidden_layers)
    theta_2 = initial_layer_weights(n_hidden_layers, num_labels)

    # we have to linearize them for the optimizer
    theta = np.hstack((theta_1.flatten(), theta_2.flatten()))

    def cost_function(t):
        return nn_loss_function(t, X, y, regularization, n_hidden_layers, num_labels)

    print("Training neural network... time to get a coffee")
    res = scipy.optimize.minimize(cost_function, theta, jac=True,
                                  options={'disp': True, 'maxiter': max_iter}, method='CG')

    # restore layer 1 and 2 parameters
    theta_res = res.x
    n_base = (n_features + 1) * n_hidden_layers
    theta_1_res = theta_res[0: n_base]
    theta_1_res = theta_1_res.reshape((n_features + 1, n_hidden_layers))
    theta_2_res = theta_res[n_base:]
    theta_2_res = theta_2_res.reshape((n_hidden_layers + 1, num_labels))
    return theta_1_res, theta_2_res
```
%% Cell type:markdown id: tags:
Now do the training!
%% Cell type:code id: tags:
``` python
n_hidden_layers = 25
regularization = 0.1
n_train_samples = 5000
max_iter = 1000
theta_1, theta_2 = train(X_train[0: n_train_samples, :],
mytools.encode_one_hot(y_train[0: n_train_samples], 10),
n_hidden_layers, 10, regularization, max_iter)
```
%% Cell type:markdown id: tags:
## Classification ##
First we implement our predictor...
%% Cell type:code id: tags:
``` python
def predict_label(theta_1, theta_2, X):
    """
    Predicts the labels of the data using the trained neural network
    :param theta_1: The parameters of the hidden layer of the neural network
    :param theta_2: The parameters of the output layer of the neural network
    :param X: The input data to be predicted
    :return: Predicted label and its probability for each sample
    """
    # number of data items
    m_samples = X.shape[0]

    # add the constant bias feature
    a_1 = np.hstack((np.ones((m_samples, 1)), X))
    a_2 = sigmoid(np.dot(a_1, theta_1))

    # add the constant bias feature
    a_2 = np.hstack((np.ones((m_samples, 1)), a_2))
    a_3 = sigmoid(np.dot(a_2, theta_2))

    # return index of maximum probability and the probability itself
    return np.argmax(a_3, axis=1), np.max(a_3, axis=1)
```
%% Cell type:markdown id: tags:
Let's predict the first 8 images of the test set.
%% Cell type:code id: tags:
``` python
fig, axes = plt.subplots(1, 8)
fig.set_size_inches(18, 8)
# show the first 8 images of the test set
for i in range(0, 8):
    ax = axes[i]
    ax.imshow(imgs_test[i, :, :], cmap='gray');
```
%% Cell type:code id: tags:
``` python
prediction, probability = predict_label(theta_1, theta_2, X_test[0:8, :])
print("Prediction: ", prediction)
```
%% Cell type:markdown id: tags:
Let's have a look at the probabilities:
%% Cell type:code id: tags:
``` python
print ("Probability: ", probabilty)
```
%% Cell type:markdown id: tags:
## Accuracy ##
Now, let's compute the accuracy of the classifier for the whole test set. A completely untrained classifier should score roughly 10%.
__Exercise:__
- Investigate how test accuracy and training accuracy depend on the training set size. Train the classifier with different `n_train_samples` and compute the accuracies (a sketch of such an experiment follows the accuracy cells below). What do you see?
- Play around with the number of hidden layer neurons. What effect does it have?
- How does the regularization parameter $\lambda$ affect the performance?
%% Cell type:code id: tags:
``` python
labels_predicted, probability = predict_label(theta_1, theta_2, X_test)
accuracy = np.mean(np.array(labels_predicted == y_test, dtype=float))
print ("Accuracy of the neural network on the test set: %g%%" % (accuracy*100.))
```
%% Cell type:code id: tags:
``` python
labels_predicted, probability = predict_label(theta_1, theta_2, X_train[0:n_train_samples, :])
accuracy = np.mean(np.array(labels_predicted == y_train[0:n_train_samples], dtype=float))
print ("Accuracy of the neural network on the training set: %g%%" % (accuracy*100.))
```
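%% Cell type:markdown id: tags:
A possible sketch for the first exercise item (train with different training set sizes and compare training vs. test accuracy; the chosen sizes and the reduced `max_iter` are arbitrary and only meant to keep the runtime short):
%% Cell type:code id: tags:
``` python
for n in [500, 1000, 2000, 5000]:
    t1, t2 = train(X_train[0:n, :], mytools.encode_one_hot(y_train[0:n], 10),
                   n_hidden_layers, 10, regularization, max_iter=100)
    acc_train = np.mean(predict_label(t1, t2, X_train[0:n, :])[0] == y_train[0:n])
    acc_test = np.mean(predict_label(t1, t2, X_test)[0] == y_test)
    print("n = %5d: train accuracy %.3f, test accuracy %.3f" % (n, acc_train, acc_test))
```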