{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Neural Networks - Exercise #\n",
    "\n",
    "Neural networks are a very versatile and popular class of algorithms that can be used for both regression and classification tasks.\n",
    "\n",
    "In this tutorial, we use them once more to classify hand-written digits from the MNIST dataset, which is often called the \"Hello World\" of machine learning.\n",
    "\n",
    "![The MNIST images](images/neural-network.png)\n",
    "\n",
    "The idea is to train a neural network on a large data set of digit images. The trained network is then used to predict the labels of unseen test images. The aims of this exercise are:\n",
    " 1. Implement and check the cost function of the neural network.\n",
    " 2. Classify images from a test data set and compute the accuracy.\n",
    " 3. Figure out how test set accuracy and training set accuracy depend on the number of training samples.\n",
    " 4. Try to improve the accuracy by changing the number of neurons in the hidden layer or by changing the regularization."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading the data ##\n",
    "\n",
    "Let's download the MNIST dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from __future__ import print_function\n",
    "import mnist\n",
    "\n",
    "imgs_train = mnist.train_images()\n",
    "y_train = mnist.train_labels()\n",
    "imgs_test = mnist.test_images()\n",
    "y_test = mnist.test_labels()\n",
    "\n",
    "print(imgs_train.shape)\n",
    "print(imgs_test.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "fig, axes = plt.subplots(1, 8)\n",
    "fig.set_size_inches(18, 8)\n",
    "\n",
    "# show the first 8 images of the training set\n",
    "for i in range(0, 8):\n",
    "    ax = axes[i]\n",
    "    ax.imshow(imgs_train[i, :, :], cmap='gray');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data normalization and preparation ##\n",
    "\n",
    "As in the previous exercise, the data have to be normalized."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def normalize_and_prepare(imgs):\n",
    "    # normalize between -0.5 ... 0.5\n",
    "    imgs_norm = np.array(imgs, dtype=float) / 255. - 0.5\n",
    "    # linearize the 2d image\n",
    "    return imgs_norm.reshape((imgs.shape[0], imgs.shape[1] * imgs.shape[2]))\n",
    "\n",
    "# we don't want to use the full data set, as our memory could run out\n",
    "n_train = 10000\n",
    "n_test = 10000\n",
    "\n",
    "X_train = normalize_and_prepare(imgs_train[0:n_train, :, :])\n",
    "X_test = normalize_and_prepare(imgs_test[0:n_test, :, :])\n",
    "\n",
    "y_train = y_train[0:n_train]\n",
    "y_test = y_test[0:n_test]\n",
    "\n",
    "X_train.shape"
   ]
  },
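  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick worked check of the scaling in `normalize_and_prepare`, assuming 8-bit pixel values between 0 and 255 (as the division by 255 suggests):\n",
    "\n",
    "$$ 0 \\mapsto \\frac{0}{255} - 0.5 = -0.5, \\qquad 255 \\mapsto \\frac{255}{255} - 0.5 = 0.5, \\qquad 128 \\mapsto \\frac{128}{255} - 0.5 \\approx 0.002 $$\n",
    "\n",
    "Each $28 \\times 28$ image is flattened into a row of $784$ values, so with `n_train = 10000` the array `X_train` should have the shape $(10000, 784)$."
   ]
  },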
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The hypothesis of Neural Networks and the loss function ##\n",
    "\n",
    "We want to implement the following network.\n",
    "\n",
    "![](images/fully-connected-neural-network.png)\n",
    "\n",
    "It has the following architecture:\n",
    " - The number of inputs $n_1$ equals the number of pixels (i.e. $28 \\times 28 = 784$).\n",
    " - The number of hidden neurons $n_2$ can be chosen arbitrarily, but for now we choose 25.\n",
    " - The neural network has $K = 10$ output neurons, each of them representing one label.\n",
    "\n",
    "The neural network can be interpreted as multiple chained logistic regressors: the output of one layer is the input of the next layer. Let $a^{(l)}$ be the activation/output of layer $l$. Then the activation of neuron $j$ in the next layer is (the bias is handled in the propagation steps below)\n",
    "\n",
    "$$ a^{(l+1)}_j = \\sigma\\left(\\sum_{k = 1}^{n_l}\\theta^{(l)}_{j,k} a^{(l)}_k\\right), \\quad \\text{i.e.} \\quad a^{(l+1)} = \\sigma\\left(\\theta^{(l)} a^{(l)} \\right) $$\n",
    "\n",
    "The hypothesis is the output of the last layer, i.e.\n",
    "\n",
    "$$ h_\\theta(x) =a^{(3)} $$\n",
    "\n",
    "The loss function is very similar to the logistic regression:\n",
    "\n",
    "$$ J(\\theta) := \\frac 1 m \\sum_{i=1}^m \\sum_{k=1}^K \\left[ -y^{(i)}_k\\log\\left(h_\\theta(x^{(i)})_k\\right) - \\left(1-y^{(i)}_k\\right)\\log\\left(1 - h_\\theta(x^{(i)})_k \\right)\\right] + \\frac \\lambda {2 m} \\left[\\sum_{k = 1}^{n_1}\\sum_{j = 1}^{n_2}(\\theta_{j,k}^{(1)})^2  + \\sum_{k = 1}^{n_2}\\sum_{j = 1}^{K}(\\theta_{j,k}^{(2)})^2 \\right]$$\n",
    "\n",
    "Here, $\\theta$ denotes the combined parameters $\\theta^{(1)}$ and $\\theta^{(2)}$ of the hidden and output layer, and $y^{(i)}_k$ is 1 if sample $i$ has label $k$ and 0 otherwise (one-hot encoding). Again, $\\lambda$ is the regularization parameter. The only difference is the sum over $K$, which accounts for the $K$ outputs compared to just one output in the logistic regression.\n",
    "\n",
    "### Propagation ###\n",
    "\n",
    "To implement the forward propagation, implement the following:\n",
    "\n",
    " - $ a^{(1)} = x $\n",
    " - Prepend a constant 1 to $a^{(1)}$ to account for the bias (a column of ones if the samples are stored as rows of $X$)\n",
    " - $ z^{(2)} = \\theta^{(1)} a^{(1)} $\n",
    " - $ a^{(2)} = \\sigma(z^{(2)})$\n",
    " - Prepend a constant 1 to $a^{(2)}$ to account for the bias\n",
    " - $ z^{(3)} = \\theta^{(2)} a^{(2)} $\n",
    " - $ a^{(3)} = \\sigma(z^{(3)})$\n",
    " - $h = a^{(3)}$\n",
    " \n",
    " Then compute the loss function $J$ from $h$.\n",
    "\n",
    "### Backpropagation ###\n",
    "\n",
    "Here, we take the adjoint approach and differentiate each of the equations above in reverse order:\n",
    "\n",
    "- $dz_\\dagger^{(3)} = \\frac 1 m (h - y) $ (Error of the output layer)\n",
    "- $d\\theta_\\dagger^{(2)} = {a^{(2)}}^T dz_\\dagger^{(3)}$ (Gradient of $\\theta^{(2)}$)\n",
    "- $d{a_\\dagger^{(2)}} = \\theta^{(2)} {dz_\\dagger^{(3)}}^T$\n",
    "- Remove the first row from $d{a_\\dagger^{(2)}}$ (it belongs to the constant bias)\n",
    "- $dz_\\dagger^{(2)} = d{a_\\dagger^{(2)}} \\odot \\frac d {dz}\\sigma(z^{(2)}) $ (Error of the hidden layer, $\\odot$ is the element-wise product)\n",
    "- $d\\theta_\\dagger^{(1)} = {a^{(1)}}^T dz_\\dagger^{(2)}$ (Gradient of $\\theta^{(1)}$)\n",
    "\n",
    "Finally, we add the regularization term to the gradients:\n",
    " \n",
    " - $d\\theta_\\dagger^{(1)} = d\\theta_\\dagger^{(1)} + \\frac \\lambda m \\theta^{(1)} $\n",
    " - $d\\theta_\\dagger^{(2)} = d\\theta_\\dagger^{(2)} + \\frac \\lambda m \\theta^{(2)} $\n",
    "\n",
    "\n",
    "Again, we need to define the logistic function $\\sigma(z)$:\n"
   ]
  },
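  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, a short derivation of the logistic function and its derivative, which is needed for the error of the hidden layer:\n",
    "\n",
    "$$ \\sigma(z) = \\frac{1}{1 + e^{-z}}, \\qquad \\frac{d}{dz}\\sigma(z) = \\frac{e^{-z}}{(1 + e^{-z})^2} = \\sigma(z)\\,\\bigl(1 - \\sigma(z)\\bigr) $$\n",
    "\n",
    "For example, $\\sigma(0) = 0.5$ and $\\frac{d}{dz}\\sigma(0) = 0.25$, which is the largest value the derivative can take."
   ]
  },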
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Implementation ###\n",
    "\n",
    "__Exercise:__ Now it's your turn. Implement propagation and back-propagation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def sigmoid(z):\n",
    "    return 1 / (1 + np.exp(-z))\n",
    "\n",
    "def sigmoid_grad(z):\n",
    "    return sigmoid(z) * (1. - sigmoid(z))\n",
    "\n",
    "import copy\n",
    "def nn_loss_function(theta, X, y, lam, n_hidden_layer, n_labels):\n",
    "    \"\"\"\n",
    "    :param theta: Combined (flattened) parameters of both layers\n",
    "    :param X: Input values (n_samples x n_features)\n",
    "    :param y: One-hot encoded ground truth labels (n_samples x n_labels)\n",
    "    :param lam: Regularization parameter\n",
    "    :param n_hidden_layer: Number of neurons in the hidden layer\n",
    "    :param n_labels: Number of output neurons (labels)\n",
    "    :return: Tuple of the cost value and the flattened gradient\n",
    "    \"\"\"\n",
    "\n",
    "    # number of data items\n",
    "    m_samples = X.shape[0]\n",
    "    n_features = X.shape[1]\n",
    "\n",
    "    # extract theta_1 and theta_2 from combined theta\n",
    "    theta = copy.copy(theta)\n",
    "    n_w_layer1 = (n_features + 1) * n_hidden_layer\n",
    "    theta_1 = theta[0: n_w_layer1]\n",
    "    theta_1 = theta_1.reshape((n_features + 1, n_hidden_layer))\n",
    "\n",
    "    theta_2 = theta[n_w_layer1:]\n",
    "    theta_2 = theta_2.reshape((n_hidden_layer + 1, n_labels))\n",
    "\n",
    "    #### start your code ####\n",
    "    # TODO: Compute the loss function of the neural network\n",
    "    J = 0.\n",
    "\n",
    "    # TODO: Compute the gradients of both layers.\n",
    "    # The resulting gradients have the following shapes\n",
    "    theta_1_grad = np.zeros((n_features + 1, n_hidden_layer))\n",
    "    theta_2_grad = np.zeros((n_hidden_layer + 1, n_labels))\n",
    "\n",
    "\n",
    "    ###### end your code #####\n",
    "\n",
    "    theta_grad = np.hstack((theta_1_grad.flatten(), theta_2_grad.flatten()))\n",
    "\n",
    "    return J, theta_grad"
   ]
  },
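  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a worked example of the parameter layout, assuming the values used in this notebook ($n_1 = 784$ inputs, $n_2 = 25$ hidden neurons and $K = 10$ labels), the flattened vector `theta` contains\n",
    "\n",
    "$$ \\underbrace{(784 + 1) \\cdot 25}_{19625 \\text{ entries of } \\theta^{(1)}} + \\underbrace{(25 + 1) \\cdot 10}_{260 \\text{ entries of } \\theta^{(2)}} = 19885 $$\n",
    "\n",
    "entries. The $+1$ accounts for the bias of each layer. With zero-based indexing, the entries $0 \\ldots 19624$ belong to $\\theta^{(1)}$ and the entries $19625 \\ldots 19884$ belong to $\\theta^{(2)}$."
   ]
  },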
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Accuracy check ###\n",
    "Let's check the accuracy of the loss function. We load some pre-defined theta values and compare the resulting loss with the reference value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import mytools\n",
    "\n",
    "# load pre-defined theta values; the array holds the flattened\n",
    "# parameters of both layers, as expected by nn_loss_function\n",
    "theta_check = np.load('data/nn_theta_check.npy')\n",
    "\n",
    "my_loss_function = lambda theta: nn_loss_function(theta, X_train[0:5000],\n",
    "                                                  mytools.encode_one_hot(y_train[0:5000], 10),\n",
    "                                                  0.1, 25, 10)\n",
    "\n",
    "computed_loss = my_loss_function(theta_check)[0]\n",
    "\n",
    "# this value must be roughly 6.730543\n",
    "if np.abs(computed_loss - 6.730543) > 1e-4:\n",
    "    print(\"Oooops... please check your loss function\")\n",
    "else:\n",
    "    print(\"Hooray, your loss function looks good\")"
   ]
  },
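  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A rough plausibility check for the magnitude of the loss (a sketch only, not the origin of the reference value): if the network were completely undecided and returned $0.5$ for every output neuron, each of the $K = 10$ terms would contribute $\\log 2$ regardless of the label, so the unregularized loss would be\n",
    "\n",
    "$$ J = \\frac 1 m \\sum_{i=1}^m \\sum_{k=1}^{K} \\log 2 = 10 \\log 2 \\approx 6.93 $$\n",
    "\n",
    "If your loss for small parameter values is off by orders of magnitude, a missing factor $\\frac 1 m$ or a missing sum is a likely cause."
   ]
  },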
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's check the gradient. We do this again using the finite difference approximation\n",
    "$$ \\frac {dJ} {d\\theta_j}(\\theta) \\approx \\frac {J(\\theta + e_j h) - J(\\theta)} h$$ "
   ]
  },
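  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small worked example of this approximation (using a toy function, not the network loss), take $J(\\theta) = \\theta^2$ at $\\theta = 1$ with $h = 10^{-4}$:\n",
    "\n",
    "$$ \\frac{J(1 + h) - J(1)}{h} = \\frac{(1 + h)^2 - 1}{h} = 2 + h = 2.0001 $$\n",
    "\n",
    "The exact derivative is $2$, so the error is of the order of $h$. The analytic gradient and the finite difference are therefore expected to agree only up to a small tolerance, not exactly."
   ]
  },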
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import mytools\n",
    "# perform checking in the first and second layer\n",
    "mytools.check_gradient(my_loss_function, theta_check, [1, 2, 3, 4, 5, 6, 19700, 19701, 19702, 19703, 19704]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training ##\n",
    "\n",
    "If your loss function passes the accuracy tests, it is time to do the training!\n",
    "\n",
    "To break the symmetry, we initialize the parameters $\\theta$ with small random numbers. If all weights started with the same value, all hidden neurons would receive identical gradients and would remain identical during training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def initial_layer_weights(n_input, n_output):\n",
    "    \"\"\"\n",
    "    Initialize theta randomly so that we break the symmetry while\n",
    "    training the neural network.\n",
    "    \"\"\"\n",
    "\n",
    "    eps = 0.12\n",
    "    theta = np.random.rand(n_input + 1, n_output) * eps * 2. - eps\n",
    "    return theta"
   ]
  },
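  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see the range of the initial weights: `np.random.rand` draws a value $u$ uniformly from $[0, 1)$, and the expression above maps it to\n",
    "\n",
    "$$ \\theta = 2 \\epsilon u - \\epsilon \\in [-\\epsilon, \\epsilon), \\qquad \\epsilon = 0.12 $$\n",
    "\n",
    "For example, $u = 0$ gives $\\theta = -0.12$, $u = 0.5$ gives $\\theta = 0$, and $u \\to 1$ approaches $\\theta = 0.12$."
   ]
  },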
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This defines our training procedure..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import scipy.optimize\n",
    "def train(X, y, n_hidden_layers, num_labels, regularization, max_iter):\n",
    "    \n",
    "    n_features = X.shape[1]\n",
    "\n",
    "    # initialize parameters\n",
    "    theta_1 = initial_layer_weights(n_features, n_hidden_layers)\n",
    "    theta_2 = initial_layer_weights(n_hidden_layers, num_labels)\n",
    "\n",
    "    # we have to flatten them for the optimizer\n",
    "    theta = np.hstack((theta_1.flatten(), theta_2.flatten()))\n",
    "\n",
    "    def cost_function(t):\n",
    "        return nn_loss_function(t, X, y, regularization, n_hidden_layers, num_labels)\n",
    "\n",
    "    print(\"Training neural network... time to get a coffee\")\n",
    "\n",
    "    res = scipy.optimize.minimize(cost_function,\n",
    "                                  theta, jac=True, options={'disp': True, 'maxiter': max_iter}, method='CG')\n",
    "\n",
    "    # restore layer 1 and 2 parameters\n",
    "    theta_res = res.x\n",
    "\n",
    "    theta_1_res = theta_res[0: (n_features + 1) * n_hidden_layers]\n",
    "    theta_1_res = theta_1_res.reshape((n_features + 1, n_hidden_layers))\n",
    "\n",
    "    n_base = (n_features + 1) * n_hidden_layers\n",
    "    theta_2_res = theta_res[n_base:]\n",
    "    theta_2_res = theta_2_res.reshape((n_hidden_layers + 1, num_labels))\n",
    "\n",
    "    return theta_1_res, theta_2_res"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now do the training!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_hidden_layers = 25  # number of neurons in the single hidden layer\n",
    "regularization = 0.1\n",
    "n_train_samples = 5000\n",
    "max_iter = 1000\n",
    "\n",
    "theta_1, theta_2 = train(X_train[0: n_train_samples, :],\n",
    "                         mytools.encode_one_hot(y_train[0: n_train_samples], 10),\n",
    "                         n_hidden_layers, 10, regularization, max_iter)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Classification ##\n",
    "\n",
    "First we implement our predictor..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def predict_label(theta_1, theta_2, X):\n",
    "    \"\"\"\n",
    "    Predicts the labels using the trained neural network\n",
    "    :param theta_1: The parameters of the hidden layer of the neural network\n",
    "    :param theta_2: The parameters of the output layer of the neural network\n",
    "    :param X: The input data to be predicted\n",
    "    :return: Tuple of the predicted labels and the corresponding output activations\n",
    "    \"\"\"\n",
    "    # number of data items\n",
    "    m_samples = X.shape[0]\n",
    "\n",
    "    # add the constant bias feature\n",
    "    a_1 = np.hstack((np.ones((m_samples, 1)), X))\n",
    "    a_2 = sigmoid(np.dot(a_1, theta_1))\n",
    "    # add the constant bias feature\n",
    "    a_2 = np.hstack((np.ones((m_samples, 1)), a_2))\n",
    "    a_3 = sigmoid(np.dot(a_2, theta_2))\n",
    "\n",
    "    # return index of maximum probability and probability\n",
    "    return np.argmax(a_3, axis=1), np.max(a_3, axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's predict the first 8 images of the test set:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(1, 8)\n",
    "fig.set_size_inches(18, 8)\n",
    "\n",
    "# show the first 8 images of the test set\n",
    "for i in range(0, 8):\n",
    "    ax = axes[i]\n",
    "    ax.imshow(imgs_test[i, :, :], cmap='gray');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prediction, probability = predict_label(theta_1, theta_2, X_test[0:8, :])\n",
    "print (\"Prediction: \", prediction)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's have a look at the probabilities:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print (\"Probability: \", probability)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Accuracy ##\n",
    "\n",
    "Now, let's compute the accuracy of the classifier for the whole test set. A completely untrained classifier should score roughly 10%, which corresponds to guessing one of the 10 labels at random.\n",
    "\n",
    "__Exercise:__\n",
    " - Investigate how test accuracy and training accuracy depend on the training set size. Train the classifier with different `n_train_samples` and compute the accuracies. What do you see?\n",
    " - Play around with the number of hidden layer neurons. What effect does it have?\n",
    " - How does the regularization parameter $\\lambda$ affect the performance?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "labels_predicted, probability = predict_label(theta_1, theta_2, X_test)\n",
    "accuracy = np.mean(np.array(labels_predicted == y_test, dtype=float))\n",
    "print (\"Accuracy of the neural network on the test set: %g%%\" % (accuracy*100.))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "labels_predicted, probability = predict_label(theta_1, theta_2, X_train[0:n_train_samples, :])\n",
    "accuracy = np.mean(np.array(labels_predicted == y_train[0:n_train_samples], dtype=float))\n",
    "print (\"Accuracy of the neural network on the training set: %g%%\" % (accuracy*100.))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}