"- remove first row from $d{a_\\dagger^{(2)}}$\n",
"- $dz_\\dagger^{(2)} = d{a_\\dagger^{(2)}} \\frac d {dz}\\sigma(z^{(2)}) $ (Error of the hidden layer)\n",
"- $d\\theta_\\dagger^{(1)} = {a^{(1)}}^T dz_\\dagger^{(2)}$ (Gradient of $\\theta_1$)\n",
...
...
%% Cell type:markdown id: tags:
# Neural Networks - Exercise #
Neural networks are a versatile and popular class of algorithms that can be used for both regression and classification tasks.
In this tutorial, we want to use one to classify hand-written digits from the MNIST dataset again. The MNIST dataset is often called the "Hello World" of machine learning.

The idea is to train a neural network on a large data set of digit images. The trained network is then used to predict the labels of unseen test images. The aims of this exercise are:
1. Implement and check the cost function of the Neural Network
2. Classify images from a test data set and compute the accuracy.
3. Figure out how test set accuracy and training set accuracy depend on the number of training samples.
4. Try to improve the accuracy by changing the number of neurons in the hidden layer or by changing the regularization.
%% Cell type:markdown id: tags:
## Loading the data ##
Let's download the MNIST dataset.
%% Cell type:code id: tags:
``` python
import mnist

# load the MNIST training and test images together with their labels
imgs_train = mnist.train_images()
y_train = mnist.train_labels()
imgs_test = mnist.test_images()
y_test = mnist.test_labels()
print(imgs_train.shape)
print(imgs_test.shape)
```
%% Cell type:code id: tags:
``` python
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

fig, axes = plt.subplots(1, 8)
fig.set_size_inches(18, 8)
# show the first 8 images of the training set
for i in range(0, 8):
    ax = axes[i]
    ax.imshow(imgs_train[i, :, :], cmap='gray');
```
%% Cell type:markdown id: tags:
## Data normalization and preparation ##
As in the previous exercise, the data have to be normalized.
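One possible preparation (a sketch only; the exact normalization of the previous exercise may differ, and the value of `n_train_samples` is just an example) is to flatten each 28×28 image into a vector of 784 pixels, scale the pixel values to $[0, 1]$, and one-hot encode the labels:

``` python
import numpy as np

n_train_samples = 5000  # assumed subset size; the full training set has 60000 images

# flatten the images to shape (m, 784) and scale the 0..255 pixel values to [0, 1]
X_train = imgs_train[:n_train_samples].reshape(n_train_samples, -1) / 255.0
X_test = imgs_test.reshape(imgs_test.shape[0], -1) / 255.0

# one-hot encode the labels: digit k becomes a vector with a 1 at position k
Y_train = np.eye(10)[y_train[:n_train_samples]]
```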
## The hypothesis of Neural Networks and the loss function ##
We want to implement the following network.

It has the following architecture:
- The number of inputs $n_1$ equals the number of pixels (i.e. 28x28 = 784).
- The number of hidden neurons $n_2$ can be chosen arbitrarily, but for now we choose 25.
- The neural network has $K = 10$ output neurons, each of them representing one label.
The neural network can be interpreted as multiple chained logistic regressors: the output of one layer is the input of the next layer. Let $a^{(l)}$ be the activation/output of layer $l$. Then

$$ a^{(l+1)} = \sigma\left(\theta^{(l)} a^{(l)}\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}} $$

and the regularized loss function reads

$$ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log h_k(x^{(i)}) + \left(1 - y_k^{(i)}\right) \log\left(1 - h_k(x^{(i)})\right) \right] + \frac{\lambda}{2m} \left( \sum_{j,k} \left(\theta^{(1)}_{j,k}\right)^2 + \sum_{j,k} \left(\theta^{(2)}_{j,k}\right)^2 \right) $$

Here, $\theta$ are the combined parameters $\theta_1$ and $\theta_2$ of the hidden and output layer. Again, $\lambda$ is the regularization parameter. The only difference to logistic regression is the sum over $K$, which accounts for the $K$ outputs instead of just one.
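As a rough illustration (not part of the original notebook), the two parameter matrices could be initialized with small random values. The shapes below assume that samples are stored as rows and that the bias is prepended as described in the propagation steps:

``` python
n1, n2, K = 28 * 28, 25, 10

rng = np.random.default_rng(0)
# the +1 accounts for the constant bias term of each layer
theta1 = rng.normal(scale=0.1, size=(n1 + 1, n2))  # hidden layer parameters
theta2 = rng.normal(scale=0.1, size=(n2 + 1, K))   # output layer parameters
```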
### Propagation ###
For the forward propagation, implement the following steps (a NumPy sketch follows after the list):
- $ a^{(1)} = x $
- Add the row of 1s to $a^{(1)}$ to account for the constant bias
- $ z^{(2)} = \theta^{(1)} a^{(1)} $
- $ a^{(2)} = \sigma(z^{(2)})$
- Add the row of 1s to $a^{(2)}$ to account for the constant bias
- $ z^{(3)} = \theta^{(2)} a^{(2)} $
- $ a^{(3)} = \sigma(z^{(3)})$
- $h = a^{(3)}$
Then compute the loss function $J$ from $h$.
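A minimal sketch of how these steps could look in NumPy, assuming the flattened design matrix `X` stores samples as rows (so the "row of 1s" becomes a column of ones) and `Y` is the one-hot encoded label matrix; the names `sigmoid`, `forward` and `cost` are chosen here and not prescribed by the exercise:

``` python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(theta1, theta2, X):
    m = X.shape[0]
    a1 = np.hstack([np.ones((m, 1)), X])             # a^(1) with bias, (m, n1+1)
    z2 = a1 @ theta1                                  # (m, n2)
    a2 = np.hstack([np.ones((m, 1)), sigmoid(z2)])    # a^(2) with bias, (m, n2+1)
    z3 = a2 @ theta2                                  # (m, K)
    h = sigmoid(z3)                                   # hypothesis a^(3), (m, K)
    return a1, z2, a2, h

def cost(h, Y, theta1, theta2, lam):
    m = Y.shape[0]
    J = -np.sum(Y * np.log(h) + (1 - Y) * np.log(1 - h)) / m
    # regularization term (the bias rows of theta are excluded here, as is common)
    J += lam / (2 * m) * (np.sum(theta1[1:] ** 2) + np.sum(theta2[1:] ** 2))
    return J
```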
### Backpropagation ###
Here, we take the adjoint approach, differentiating each of the equations above in reverse order:
- $dz_\dagger^{(3)} = \frac 1 m (h - y) $ (Error of the output layer)
- $d\theta_\dagger^{(2)} = {a^{(2)}}^T dz_\dagger^{(3)}$ (Gradient of $\theta_2$)
- $d{a_\dagger^{(2)}} = dz_\dagger^{(3)} {\theta^{(2)}}^T$ (Propagate the error back to the hidden layer)
- Remove the first row from $d{a_\dagger^{(2)}}$ (it belongs to the constant bias)
- $dz_\dagger^{(2)} = d{a_\dagger^{(2)}} \, \frac{d\sigma}{dz}\big(z^{(2)}\big)$ (Error of the hidden layer)
- $d\theta_\dagger^{(1)} = {a^{(1)}}^T dz_\dagger^{(2)}$ (Gradient of $\theta_1$)
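Again only a sketch, under the same row-wise conventions as in the forward-propagation example above (the regularization gradients at the end correspond to the $\lambda$-term of $J$; all names are assumptions):

``` python
def gradients(theta1, theta2, X, Y, lam):
    m = X.shape[0]
    a1, z2, a2, h = forward(theta1, theta2, X)

    dz3 = (h - Y) / m                               # error of the output layer, (m, K)
    dtheta2 = a2.T @ dz3                            # gradient of theta_2, (n2+1, K)

    da2 = dz3 @ theta2.T                            # propagate the error back, (m, n2+1)
    da2 = da2[:, 1:]                                # drop the bias entry ("remove first row")
    dz2 = da2 * sigmoid(z2) * (1 - sigmoid(z2))     # error of the hidden layer, (m, n2)
    dtheta1 = a1.T @ dz2                            # gradient of theta_1, (n1+1, n2)

    # gradients of the regularization term (bias rows excluded, matching the cost sketch)
    dtheta1[1:] += lam / m * theta1[1:]
    dtheta2[1:] += lam / m * theta2[1:]
    return dtheta1, dtheta2
```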
Now, let's compute the accuracy of the classifier on the whole test set. A completely untrained classifier should score roughly 10%.
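With the hypothetical helpers from the sketches above and the flattened test set `X_test`, this could look roughly as follows:

``` python
# predict the digit with the highest output activation for each test image
_, _, _, h_test = forward(theta1, theta2, X_test)
predictions = np.argmax(h_test, axis=1)
accuracy = np.mean(predictions == y_test)
print("test set accuracy: {:.1%}".format(accuracy))
```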
__Exercise:__
- Investigate how test accuracy and training accuracy depend on the training set size: train the classifier with different values of n_train_samples and compute both accuracies. What do you see?
- Play around with the number of hidden layer neurons. What effect does it have?
- How does the regularization parameter $\lambda$ affect the performance?