# Workshop Day 3

# Introduction to deep learning with `keras` (and `tensorflow` backend)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import tensorflow
from tensorflow import keras

from tensorflow.keras import layers
from tensorflow.keras import models

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras import activations

# 0. Introduction: about `tensorflow` and `keras`

The `numpy` library performs expensive operations outside Python using efficient compiled code (Fortran, C/C++). However, switching back to Python after each operation causes a significant overhead because of unnecessary copies of the data.

The `tensorflow` library does all the computations outside of Python: the Python API is used to define a graph of operations that runs entirely using C++ binaries. This architecture removes the overhead. Besides, knowing the computational graph beforehand makes it easier to parallelize and/or distribute the computation. As a result, `tensorflow` can run the computations on multiple CPUs or GPUs, and on multiple servers.

However, for quick and easy model prototyping, the `keras` library is simpler to use than `tensorflow`.
Deep learning models can be built with `keras` in a few lines of Python. So in this notebook, we won't see direct calls to `tensorflow`, but only to `keras`, even if the computations are done by `tensorflow`.

# 1. Handwritten digit recognition with MNIST

For the first part of this tutorial, we will use the [MNIST](http://yann.lecun.com/exdb/mnist) dataset.
This dataset contains images representing handwritten digits. 
Each image is 28 x 28 pixels, and each pixel is represented by a number (gray level). 
These arrays can be flattened into vectors of 28 x 28 = 784 numbers.
You can then see each image as a point in a 784-dimensional vector space. 
You can find interesting visualisations of this vector space at [http://colah.github.io/posts/2014-10-Visualizing-MNIST/](http://colah.github.io/posts/2014-10-Visualizing-MNIST/).

## 1.1. Introduction

The labels in $\{0, 1, 2, \ldots, 9\}$ giving the digit on the image are represented using one-hot encoding: labels in $\{0, 1, 2, \ldots, 9\}$ are replaced by labels in $\{ 0, 1\}^{10}$, namely $0$ is replaced by $(1, 0, \ldots, 0)$, $1$ is replaced by $(0, 1, 0, \ldots, 0)$, $2$ is replaced by $(0, 0, 1, 0, \ldots, 0)$, etc.

Also, MNIST pixels are gray levels in $\{0, \ldots, 255\}$. The pixels should be normalized to belong to $[0, 1]$.
Indeed, working with large input values can lead to significant numerical errors, in particular in deep learning models.
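The two preprocessing steps described above can be sketched with plain `numpy`. This is a minimal illustration on made-up values (the arrays `labels` and `pixels` are hypothetical, not taken from MNIST); the notebook itself relies on `keras.utils.to_categorical` later on.

```python
import numpy as np

# Hypothetical toy labels and gray levels, for illustration only
labels = np.array([5, 0, 4, 1])
num_classes = 10

# One-hot encoding: row i contains a single 1, in column labels[i]
one_hot = np.eye(num_classes, dtype=np.float32)[labels]
print(one_hot[0])  # the label 5 becomes a vector with a 1 in position 5

# Rescale gray levels from {0, ..., 255} to [0, 1]
pixels = np.array([0, 128, 255], dtype=np.float32)
pixels_scaled = pixels / 255.0
print(pixels_scaled.min(), pixels_scaled.max())
```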

## 1.2. Load the data

MNIST is a long-standing, standard benchmark dataset for image classification, so it is built in (ready to be downloaded) in most machine learning libraries (including `keras` and `tensorflow`).

In [None]:
import requests
import os

url = 'https://stephanegaiffas.github.io/files/formation_cnrs/MNIST.pickle.zip'
r = requests.get(url)
path_data = '../data/'
os.makedirs(path_data, exist_ok=True)  # create the data folder if it does not exist

with open(os.path.join(path_data, 'MNIST.pickle.zip'), 'wb') as f:
 f.write(r.content)

In [None]:
### load the data when the file 'MNIST.pickle' is zipped and stored locally
import pickle as pkl
import os
import zipfile

path_data = '../data/'
filename = 'MNIST.pickle.zip'
archive = zipfile.ZipFile(os.path.join(path_data, filename), 'r')

with archive.open('MNIST.pickle') as f:
 data = pkl.load(f)

In [None]:
from tensorflow.keras import backend as K

# Number of classes
num_labels = 10
# input image dimensions
img_rows, img_cols = 28, 28

# the data, shuffled and split between train and test sets
# loading from the web with keras
#(x_train, y_train), (x_test, y_test) = mnist.load_data()
# loading from the local file
(x_train, y_train), (x_test, y_test) = data


if K.image_data_format() == 'channels_first':
 x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
 x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
 input_shape = (1, img_rows, img_cols)
else:
 x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
 x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
 input_shape = (img_rows, img_cols, 1)

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')


In [None]:
input_shape

In [None]:
import pandas as pd
import seaborn as sns

y_counts = pd.DataFrame({
 'data': np.array(['train'] * num_labels + ['test'] * num_labels),
 'class': np.tile(np.arange(num_labels), 2),
 'prop': np.hstack([np.bincount(y_train) / y_train.shape[0], 
 np.bincount(y_test) / y_test.shape[0]])
})

fig, ax = plt.subplots(figsize=(8, 4))
sns.barplot(x='class', y='prop', hue='data', data=y_counts, ax=ax)

## 1.3. A first look at the data

In the next cell we display the first four elements of the training data:
the grayscale pixels of each digit and the corresponding label.

In [None]:
plt.figure(figsize=(8, 2))
for i in range(4):
 plt.subplot(1, 4, i+1)
 plt.imshow(x_train[i].reshape(28, 28), 
 interpolation="none", cmap="gray_r")
 plt.title('Label=%d' % y_train[i], fontsize=14)
 plt.axis("off")
plt.tight_layout()

In [None]:
n_rows = 4
n_cols = 8
plt.figure(figsize=(8, 4))
for i in range(n_rows * n_cols):
 plt.subplot(n_rows, n_cols, i+1)
 plt.imshow(x_train[i].reshape(28, 28),
 interpolation="none", cmap="gray_r")
 plt.axis("off")
plt.tight_layout()

The first image is the digit 5, encoded as a grayscale matrix as follows:

In [None]:
print(np.array2string(x_train[0].astype(int).reshape(28, 28), 
 max_line_width=150))

## 1.4. Normalization and preprocessing of the data

We need to normalize the images and one-hot encode the labels.

**Warning:** run the cells below only once (otherwise you would normalize the images and encode the labels several times), which might be problematic later on.

In [None]:
x_train /= x_train.max()
x_test /= x_test.max()
print(x_train.min(axis=None), x_train.max(axis=None))

In [None]:
# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_labels)
y_test = keras.utils.to_categorical(y_test, num_labels)
y_train[:10]


# 2. A first model: softmax (or multinomial logistic) regression 

Remember that each image $x$ is a $28 \times 28 \times 1$ array ($x = (x_{ij})$), or equivalently a vector with $28 \times 28 = 784$ entries ($x = (x_j)$).

We want to classify these pictures, or equivalently to predict the digit $k \in \{0, \ldots, 9\}$ they represent.
A simple model for this task is softmax regression.

## 2.1. Description of the model


The idea behind this model is to produce a score for each input image $x$ using a simple linear model.
To do so, we assume that belonging to a class $k$ (corresponding to digit $k$) can be expressed by a weighted sum of the pixel intensities, with weights $W_{k, 1}, \ldots, W_{k, 784}$, plus a bias $b_k$ capturing variability independent of the input:
$$
\text{score}_k(x) = \sum_{j=1}^{784} W_{k, j} x_j + b_k,
$$
These scores are sometimes called the "logits" in the deep learning community.
We then use the softmax function to convert the scores into predicted probabilities $p_k$:
$$
\forall k =0,\ldots,9,\quad p_k(x) = \text{softmax}(\text{score}_k(x)) = \frac{\exp(\text{score}_k(x))}{\sum_{\ell =0}^{9}\exp(\text{score}_{\ell}(x))}.
$$
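As a sanity check of the two formulas above, here is a minimal `numpy` sketch of the scores and of the softmax. The input `x` and the parameters `W` and `b` are random placeholders (hypothetical values, not trained weights). Subtracting the maximum score before exponentiating does not change the result, since the softmax is invariant to a shift of all scores, and it avoids overflow.

```python
import numpy as np

def softmax(scores):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.random(784)                          # hypothetical flattened image in [0, 1]
W = rng.normal(scale=0.01, size=(10, 784))   # W[k, j]: weight of pixel j for class k
b = np.zeros(10)                             # one bias per class

scores = W @ x + b         # the 10 scores (logits)
probas = softmax(scores)   # the 10 predicted probabilities

print(probas.sum())        # the probabilities sum to 1 (up to rounding)
print(probas.argmax())     # index of the most likely class
```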

## 2.2. The computational graph for training of softmax regression

To train the model parameters (the biases $b_k$ and the weights $W_{k, j}$ where $k=0, \ldots, 9$ and $j=1, \ldots, 784$), the goodness-of-fit criterion is the negative log-likelihood, given by the cross-entropy between the predicted probabilities $p_k(x)$ and the one-hot encoded label $y$:
$$
- \sum_{k=0}^{9} y_{k} \log(p_k(x))
$$
For this first model, we simply use a stochastic gradient method over small batches of data (mini-batches). It can be done easily with TensorFlow, as it (automatically and efficiently) computes the gradient from the graph, then applies an optimization algorithm of your choice to update the parameters.
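To make the training step concrete, here is a hand-written sketch of one mini-batch gradient step for softmax regression in `numpy`. It is an illustration under stated assumptions, not what `keras` does internally: the batch `X`, the labels `y` and the `learning_rate` are hypothetical, and the update is plain gradient descent rather than Adagrad. It uses the fact that the gradient of the cross-entropy with respect to the scores is $p - y$.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

def cross_entropy(probas, y_one_hot):
    # average over the batch of -sum_k y_k log(p_k)
    return -np.mean(np.sum(y_one_hot * np.log(probas + 1e-12), axis=1))

rng = np.random.default_rng(0)
X = rng.random((64, 784))                  # hypothetical mini-batch of 64 flat images
y = np.eye(10)[rng.integers(0, 10, 64)]    # hypothetical one-hot labels
W = np.zeros((784, 10))                    # weights, stored as (inputs, classes)
b = np.zeros(10)
learning_rate = 0.01                       # hypothetical step size

probas = softmax(X @ W + b)
loss_before = cross_entropy(probas, y)

# Gradient of the averaged cross-entropy with respect to the scores: (p - y) / batch size
grad_scores = (probas - y) / X.shape[0]
W -= learning_rate * (X.T @ grad_scores)
b -= learning_rate * grad_scores.sum(axis=0)

loss_after = cross_entropy(softmax(X @ W + b), y)
print(loss_before, loss_after)   # the loss decreases after the step
```

With all parameters at zero, every class gets probability 0.1, so the initial loss equals $\log(10)$.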

In [None]:
# We use a sequential model: we stack layers
model_softmax = Sequential()
# First we need to flatten the data: replace 28 * 28 matrices by flat vectors
# This is always necessary before feeding data to a fully-connected layer (Dense object)
# Since it's the first layer, we need to give the shape of the input data
model_softmax.add(Flatten(input_shape=input_shape, name='flatten'))
# We add one dense (fully connected) layer with softmax activation function
model_softmax.add(Dense(num_labels, activation='softmax', name='output'))

# We "compile" this model, 
model_softmax.compile(
 # specifying the loss as the cross-entropy
 loss=keras.losses.categorical_crossentropy,
 # We choose the Adagrad solver, but you can choose others
 optimizer=keras.optimizers.Adagrad(),
 # We will monitor the accuracy on a testing set along optimization
 metrics=['accuracy']
)
model_softmax.summary()

## 2.3. Run the training of the model

In [None]:
batch_size = 64

# number of epochs (passes over the training data)
epochs = 5

# Run the training
history_softmax = model_softmax.fit(x_train, y_train,
 batch_size=batch_size,
 epochs=epochs,
 verbose=1,
 validation_data=(x_test, y_test))
score_softmax = model_softmax.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score_softmax[0])
print('Test accuracy:', score_softmax[1])

In [None]:
def plot_history(history, title=''):
 plt.figure(figsize=(7, 5))
 plt.plot(history.epoch, history.history['accuracy'], lw=3, label='Training')
 plt.plot(history.epoch, history.history['val_accuracy'], lw=3, label='Testing')
 plt.legend(fontsize=14)
 plt.title(title, fontsize=16)
 plt.xlabel('Epoch', fontsize=14)
 plt.ylabel('Accuracy', fontsize=14)
 plt.tight_layout()

In [None]:
plot_history(history_softmax, title='Accuracy of softmax regression')

**QUESTION**

- Run 70 epochs and look at the training and testing accuracy curves

## 2.4. Visualisation of the model weights

Plots of the weight matrices show that the learned weights are consistent with the digits they should predict (see below).
You should be able to see rough shapes corresponding to the digits 0, 1, 2, 3, etc.

In [None]:
weights, biases = model_softmax.get_layer('output').get_weights()
imgs = weights.reshape(28, 28, 10)

fig = plt.figure(figsize=(10, 5))
vmin, vmax = imgs.min(), imgs.max()
for i in range(10):
 ax = plt.subplot(2, 5, i + 1)
 im = imgs[:, :, i]
 mappable = ax.imshow(im, interpolation="nearest", 
 vmin=vmin, vmax=vmax, cmap='gray')
 ax.axis('off')
 ax.set_title("%i" % i)
plt.tight_layout()

## 2.5. Prediction of the labels for new images

In [None]:
# predictions on the test set with the softmax (logistic) regression
pred_log = model_softmax.predict(x_test)
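
For this model, `predict` returns one row per image and one column per class, containing the predicted probabilities. A predicted label is obtained by taking the index of the largest probability in each row. The sketch below illustrates this on a small made-up array (`pred` and `true_labels` are hypothetical values, not model outputs).

```python
import numpy as np

# Hypothetical predicted probabilities for 3 images and 10 classes
pred = np.full((3, 10), 0.02)
pred[0, 7] = 0.82
pred[1, 2] = 0.82
pred[2, 1] = 0.82
true_labels = np.array([7, 2, 4])   # hypothetical true digits

# Predicted label = class with the highest probability
pred_labels = pred.argmax(axis=1)
accuracy = np.mean(pred_labels == true_labels)
print(pred_labels)   # [7 2 1]
print(accuracy)      # 2 correct predictions out of 3
```

With the notebook's variables, this would read `pred_log.argmax(axis=1)`, to be compared with `y_test.argmax(axis=1)` since `y_test` is one-hot encoded at this point.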

## 2.6. Saving the model

In the next cell we save the model in a file, so that it can be used later on.
This is particularly helpful when training is long: we can save the model every once in a while, and
possibly continue training it later on.

**Warning:** You need to create a `"models"` folder in the folder containing this notebook

In [None]:
!mkdir models

In [None]:
models_path = 'models/'
model_softmax.save(os.path.join(models_path, 'mnist_softmax.h5'))
with open(os.path.join(models_path, 'mnist_softmax_history.pkl'), 'wb') as f:
 pkl.dump(history_softmax.history, f)

In [None]:
!ls -al models

## 2.7. Conclusion with MNIST

You should have reached an accuracy better than 0.9 with this simple model.
**Too easy!** You almost solved the problem using a simple softmax regression.
As seen in Section 2.4, the learned weights are consistent with the digits they should predict: they show rough shapes corresponding to the digits 0, 1, 2, 3, etc.

# 3. Feed-Forward Neural Network (FFNN)

Now, let's build a better model for MNIST using more layers. 
Let's start with a feed-forward neural net (FFNN) with one hidden layer and relu activation.

## 3.1 Description

The softmax regression is a linear model, with $(784+1)\times 10 = 7850$ parameters. 
It is easy to fit, numerically stable, but might be too simple for some tasks. 
The aim of neural networks is to consider nonlinear models, while keeping the nice features of linear ones. 
The idea is to keep the parameters inside linear functions, and to link these small linear models using nonlinear operations.

A simple nonlinearity which is often used to do this is the **Rectified Linear Unit**: $\quad \text{ReLU}(x) = \max(0, x)$

The derivative of this function is very easy to compute, and it is parameter-free. If we stack models such as softmax regression and ReLUs, it is still very easy to compute the gradient using the chain rule, as the model is a combination of simple functions.

The backpropagation algorithm allows efficient computation of complex derivatives as long as the function is made of simple blocks with simple derivatives. 
The efficiency of this algorithm relies on data reuse: when working with parallel architectures such as GPUs, you want to minimize communication (data transfer), as it is very time consuming in comparison to the computing time.
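The chain rule computation can be written by hand for a tiny network. The following sketch is an illustration only (all sizes and values are hypothetical, and `keras` computes these gradients automatically): it performs the forward and backward passes of a one-hidden-layer ReLU network with a softmax output, then compares one gradient entry with a finite-difference approximation. Note how each backward line reuses the gradient computed just before it.

```python
import numpy as np

def forward(params, x):
    W1, b1, W2, b2 = params
    z1 = x @ W1 + b1              # hidden pre-activation
    h = np.maximum(0.0, z1)       # ReLU
    scores = h @ W2 + b2
    scores = scores - scores.max()
    p = np.exp(scores) / np.exp(scores).sum()   # softmax
    return z1, h, p

def loss_and_grads(params, x, y):
    W1, b1, W2, b2 = params
    z1, h, p = forward(params, x)
    loss = -np.sum(y * np.log(p))          # cross-entropy
    d_scores = p - y                       # gradient w.r.t. the scores
    dW2 = np.outer(h, d_scores)
    db2 = d_scores
    dh = W2 @ d_scores                     # propagate back to the hidden layer
    dz1 = dh * (z1 > 0)                    # derivative of ReLU is 0 or 1
    dW1 = np.outer(x, dz1)
    db1 = dz1
    return loss, (dW1, db1, dW2, db2)

rng = np.random.default_rng(0)
x = rng.random(6)                          # tiny hypothetical input
y = np.eye(3)[1]                           # hypothetical one-hot label
params = [rng.normal(size=(6, 4)), rng.normal(size=4),
          rng.normal(size=(4, 3)), rng.normal(size=3)]

loss, grads = loss_and_grads(params, x, y)

# Finite-difference check on one entry of W1
eps = 1e-6
params_shifted = [q.copy() for q in params]
params_shifted[0][2, 1] += eps
loss_shifted, _ = loss_and_grads(params_shifted, x, y)
numerical = (loss_shifted - loss) / eps
print(grads[0][2, 1], numerical)   # the two values should agree closely
```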

## 3.2. Computational graph for a single hidden layer FFNN

We create the graph for a fully connected feed-forward neural network with one hidden layer with 128 units and a ReLU activation function. We reuse what was done for softmax regression: we just need to add a **single line** to the code creating the softmax regression.
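The number of parameters of such a network can be computed by hand and compared with the output of `summary()`. The helper `dense_params` below is a hypothetical name introduced for this illustration; it only counts one weight per (input, unit) pair plus one bias per unit.

```python
def dense_params(n_inputs, n_units):
    """Number of parameters of a fully connected layer: weights plus biases."""
    return (n_inputs + 1) * n_units

# Softmax regression: 784 inputs -> 10 outputs
softmax_total = dense_params(784, 10)

# FFNN: 784 inputs -> 128 hidden units -> 10 outputs
ffnn_total = dense_params(784, 128) + dense_params(128, 10)

print(softmax_total)  # 7850
print(ffnn_total)     # 101770
```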

In [None]:
# Define the model
model_ffnn = Sequential()

model_ffnn.add(Flatten(input_shape=input_shape, name='flatten'))
# The next line is new: it adds the hidden layer
model_ffnn.add(Dense(128, activation='relu', name='dense'))
model_ffnn.add(Dense(num_labels, activation='softmax', name='output'))

model_ffnn.compile(
 loss=keras.losses.categorical_crossentropy,
 optimizer=keras.optimizers.Adagrad(),
 metrics=['accuracy']
)

model_ffnn.summary()

In [None]:
# Training settings
batch_size = 32
epochs = 5

# Run the training
history_ffnn = model_ffnn.fit(x_train, y_train,
 batch_size=batch_size,
 epochs=epochs,
 verbose=1,
 validation_data=(x_test, y_test))
score_ffnn = model_ffnn.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score_ffnn[0])
print('Test accuracy:', score_ffnn[1])

In [None]:
plot_history(history_ffnn, title='Accuracy of one hidden layer feed forward')

In [None]:
model_ffnn.save(os.path.join(models_path, 'mnist_ffnn.h5'))
with open(os.path.join(models_path, 'mnist_ffnn_history.pkl'), 'wb') as f:
 pkl.dump(history_ffnn.history, f)

## 3.3. Your job

Run 60 epochs and look at the training and testing accuracy curves.


# 4. Convolutional Neural Network (CNN)

In practice, increasing the size of hidden layers is not very effective. 
It is often a better idea to add more layers. 
Intuitively, if the phenomenon you try to learn has a hierarchical structure, adding more layers can be interpreted as a way to learn more levels of abstraction. 
For example, if you are trying to recognize objects, it is easier to express shapes from edges and objects from shapes, than to express objects from pixels. 
Thus, a good design should try to exploit this hierarchy.

In particular cases, such as grid-like data (time series, images), you might want to detect a pattern which can happen in different locations of the data. 
For example, you try to detect a cat, but the cat can be in the middle or the left of the picture. 
Thus you need to build a model which is translation invariant: it is easier to learn how to recognize an object independently of its location. 

## 4.1 Description

When two inputs might contain the same kind of information, then it is useful to share their weights and train the weights jointly for those inputs to learn statistical invariants (things that don't change much on average across time or space). 
Using this concept on images leads to convolutional neural networks (CNNs); on text, it leads to recurrent neural networks (RNNs).
When using CNNs, the shared weights form a small kernel that is used to perform a convolution across the image.

The image is represented as a 3-dimensional tensor: (width, height, depth). Width and height characterize the size of the image (e.g. 28 x 28 pixels), and depth the color space (e.g. 1 for gray levels, 3 for RGB pictures since each pixel is represented by a triplet $(R,G,B)$).

The convolution maps each patch of this image, combined with the convolution kernel, to an output, for example

$$
\text{output} = \text{ReLU}(\text{patch} \times W + b)
$$

Depending on the shape of the weights tensor $W$, the tensor resulting from the convolution can have a different depth. Note that in the context of a CNN, the "kernel" is also called a "filter".
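The formula above can be turned into a naive `numpy` implementation. This is a deliberately slow sketch for illustration, under the following assumptions: the image is stored as (height, width, depth), the kernel slides one pixel at a time, and it never leaves the image. The function name `conv2d_valid` and the random inputs are hypothetical.

```python
import numpy as np

def conv2d_valid(image, W, b):
    """Naive convolution followed by ReLU.

    image: (height, width, in_depth)
    W:     (kernel_height, kernel_width, in_depth, out_depth)
    b:     (out_depth,)
    """
    k_h, k_w, _, out_depth = W.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    output = np.zeros((out_h, out_w, out_depth))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + k_h, j:j + k_w, :]
            # Contract the patch with W over (height, width, in_depth)
            output[i, j] = np.tensordot(patch, W, axes=3) + b
    return np.maximum(0.0, output)   # ReLU

rng = np.random.default_rng(0)
image = rng.random((28, 28, 1))          # hypothetical gray-level image
W = rng.normal(size=(3, 3, 1, 32))       # 32 filters of size 3 x 3
b = np.zeros(32)

feature_map = conv2d_valid(image, W, b)
print(feature_map.shape)   # (26, 26, 32)
```

Like deep learning libraries, this sketch does not flip the kernel, so strictly speaking it computes a cross-correlation.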

Performing the convolution between the image and the kernel consists in moving the kernel across the image and producing an output for each patch.
The way you move across the image is defined by two parameters:

- **Stride:** the stride is the number of pixels you are shifting each time you move your kernel during the convolution.
- **Padding:** the padding defines what happens when the kernel reaches a border of the image during the convolution.
"Valid" padding means that you stop at the edge, while "Same" padding goes off the edge and pads with zeros so that the width and the height of the output and input tensors are the same.

For example, a convolution with a stride $> 1$ and valid padding results in a tensor of smaller width and height. 
You can compute the size of a tensor after convolution using the following formulas:

#### Valid padding
$$
\text{out}_{\text{height}} = \bigg\lceil \frac{\text{in}_{\text{height}} - \text{kernel}_{\text{height}} + 1}{\text{stride}_{\text{vertical}}} \bigg\rceil \quad \text{ and } \quad
\text{out}_{\text{width}} = \bigg\lceil \frac{\text{in}_{\text{width}} - \text{kernel}_{\text{width}} + 1}{\text{stride}_{\text{horizontal}}} \bigg\rceil
$$

#### Same padding
$$
\text{out}_{\text{height}} = \bigg\lceil \frac{\text{in}_{\text{height}}}{\text{stride}_{\text{vertical}}} 
\bigg\rceil \quad \text{ and } \quad 
\text{out}_{\text{width}} = \bigg\lceil \frac{\text{in}_{\text{width}}}{\text{stride}_{\text{horizontal}}} \bigg\rceil
$$

**Example.**
Assume the input tensor is 28x28x3 and the convolution kernel takes in 4x4x3 tensors and outputs 1x1x32 tensors (height x width x depth), i.e. the kernel takes in a patch of size 4x4 and depth 3, and outputs a patch of size 1x1 and depth 32. To do so, the weights tensor $W$ should be 4x4x3x32 (in-height, in-width, in-depth, out-depth).
If we are using a stride of 1, the output tensor will be 28x28x32 with same padding, and 25x25x32 with valid padding.
Using a stride of 2, the output tensor will be 14x14x32 with same padding, and 13x13x32 with valid padding.
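The sizes given in this example can be checked with a direct transcription of the two formulas. The function name `conv_output_size` is a hypothetical helper introduced for this sketch; it handles one dimension (height or width) at a time.

```python
import math

def conv_output_size(in_size, kernel_size, stride, padding):
    """Output height (or width) of a convolution, following the formulas above."""
    if padding == 'same':
        return math.ceil(in_size / stride)
    if padding == 'valid':
        return math.ceil((in_size - kernel_size + 1) / stride)
    raise ValueError("padding must be 'same' or 'valid'")

# 28 x 28 input, 4 x 4 kernel, as in the example
for stride in (1, 2):
    for padding in ('same', 'valid'):
        print(stride, padding, conv_output_size(28, 4, stride, padding))
# 1 same 28
# 1 valid 25
# 2 same 14
# 2 valid 13
```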

Striding is an aggressive method to reduce the image size.
Instead, it can be a better idea to use a stride of 1 and to combine the convolution outputs that lie in some neighborhood. Such an operation combining elements of a tensor is called **pooling**.
Neighborhoods are defined by the pooling window dimension (width x height) and the strides you use when moving this window across the image.

**Example.**
Max pooling aggregates several outputs in a neighborhood $N$ using a max operation:

$$
\text{output}'_i = \max_{j \in N}\text{output}_j, \quad i \in N.
$$
The formulas to compute the size of the output tensor are the same as for convolution padding and striding.
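As an illustration, here is a minimal `numpy` sketch of max pooling with a 2 x 2 window and a stride of 2. It assumes a (height, width, depth) tensor with even height and width; the function name `max_pool_2x2` and the toy input are hypothetical.

```python
import numpy as np

def max_pool_2x2(x):
    """2 x 2 max pooling with stride 2 on a (height, width, depth) tensor."""
    h, w, d = x.shape
    # Split height and width into (block index, position inside the block)
    blocks = x.reshape(h // 2, 2, w // 2, 2, d)
    # Take the max over the two "position inside the block" axes
    return blocks.max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4, 1)   # toy 4 x 4 image with depth 1
pooled = max_pool_2x2(x)
print(pooled.shape)      # (2, 2, 1)
print(pooled[:, :, 0])   # [[ 5.  7.] [13. 15.]]
```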

Many successful architectures stack convolution layers in a "pyramidal" way: each convolution layer results in a tensor with increased depth and decreased height and width.
Roughly, increasing the depth increases the semantic complexity of your representation, and allows you to keep the relevant information in a smaller space (height x width).

## 4.2. Computational graph 

We implement a CNN having the following structure:

- Convolutional layer with 32 filters, 3 x 3 kernels and 'relu' activation (use the `Conv2D` object)
- Convolutional layer with 64 filters, 3 x 3 kernels and 'relu' activation (use the `Conv2D` object)
- Max pooling with pool size 2 x 2 (use the `MaxPooling2D` object)
- Dropout with probability 0.25 (use the `Dropout` object)
- Flatten layer (use the `Flatten` object)
- Dense layer with 128 units with relu activation
- Dropout with probability 0.5
- Dense output layer with softmax activation
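The size of this network can be worked out by hand and compared with the output of `summary()`. The sketch below assumes the `keras` defaults for `Conv2D` (stride 1, valid padding) and for `MaxPooling2D` (stride equal to the pool size), and a 28 x 28 x 1 input; the helper names are hypothetical.

```python
def conv_params(k_h, k_w, in_depth, n_filters):
    """Parameters of a convolution layer: one kernel and one bias per filter."""
    return (k_h * k_w * in_depth + 1) * n_filters

def dense_params(n_inputs, n_units):
    """Parameters of a fully connected layer: weights plus biases."""
    return (n_inputs + 1) * n_units

size = 28
size = size - 3 + 1          # first 3 x 3 convolution, valid padding: 26
size = size - 3 + 1          # second 3 x 3 convolution: 24
size = size // 2             # 2 x 2 max pooling: 12
flat = size * size * 64      # flattened vector fed to the dense layer: 9216

total = (conv_params(3, 3, 1, 32)      # 320
         + conv_params(3, 3, 32, 64)   # 18496
         + dense_params(flat, 128)     # 1179776
         + dense_params(128, 10))      # 1290
print(flat, total)   # 9216 1199882
```

Dropout, pooling and flatten layers have no parameters, which is why they do not appear in the sum.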

In [None]:
model_cnn = Sequential()
model_cnn.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape, name='conv2d_1'))
model_cnn.add(Conv2D(64, kernel_size=(3, 3), activation='relu', name='conv2d_2'))
model_cnn.add(MaxPooling2D(pool_size=(2, 2), name='max_pool_1'))
model_cnn.add(Dropout(0.25, name='dropout_1'))
model_cnn.add(Flatten(name='flatten'))
model_cnn.add(Dense(128, activation='relu', name='dense'))
model_cnn.add(Dropout(0.5, name='dropout_2'))
model_cnn.add(Dense(num_labels, activation='softmax', name='output'))
 
model_cnn.compile(loss=keras.losses.categorical_crossentropy,
 optimizer=keras.optimizers.Adadelta(),
 metrics=['accuracy'])

model_cnn.summary()

In [None]:
print(x_train.shape)
print(input_shape)

In [None]:
# Training settings
batch_size = 32
epochs = 1

# Run the training
history_cnn = model_cnn.fit(x_train, y_train,
 batch_size=batch_size,
 epochs=epochs,
 verbose=1,
 validation_data=(x_test, y_test))
score_cnn = model_cnn.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score_cnn[0])
print('Test accuracy:', score_cnn[1])

In [None]:
plot_history(history_cnn, title='Accuracy of CNN')

In [None]:
# save the model and its history using the following cell to continue to train it later
model_cnn.save(os.path.join(models_path, 'mnist_cnn.h5'))
with open(os.path.join(models_path, 'mnist_cnn_history.pkl'), 'wb') as f:
 pkl.dump(history_cnn.history, f)

# 5. MNIST is too easy: let's classify weird letters now (notMNIST)

MNIST is a very **clean** dataset. Digits are rescaled, smoothed, centered, and pixel values are normalized beforehand. Let's switch to a slightly harder dataset: [notMNIST](http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html).

This time, labels are letters from 'A' to 'J' (10 classes). 
These letters are taken from digital fonts instead of pictures of handwriting.
We will use a reduced amount of data to ensure a reasonable training time. 
The training set you will use has 200K labelled examples, while the validation and test sets both contain 10K labelled examples.

**Note:** The notMNIST data that we'll load is already normalized in [-0.5, 0.5]; the labels are one-hot encoded in the loading cell below.

## 5.1. Load the notMNIST dataset

In [None]:
import requests
import os

url = 'https://stephanegaiffas.github.io/files/formation_cnrs/notMNIST_100.pkl.gz'
r = requests.get(url)
path_data = '../data/' 

with open(os.path.join(path_data, 'notMNIST_100.pkl.gz'), 'wb') as f:
 f.write(r.content)

In [None]:
import gzip

filename = 'notMNIST_100.pkl.gz'
with gzip.open(os.path.join(path_data, filename)) as f:
 data = pkl.load(f)

In [None]:
data.keys()

In [None]:
from tensorflow.keras import backend as K

def reshape(x, image_data_format, img_rows, img_cols):
    if image_data_format == 'channels_first':
        return x.astype(np.float32).reshape((-1, 1, img_rows, img_cols))
    else:
        return x.astype(np.float32).reshape((-1, img_rows, img_cols, 1))

img_rows, img_cols = 28, 28
num_labels = 10
image_data_format = K.image_data_format()

if image_data_format == 'channels_first':
 input_shape = (1, img_rows, img_cols)
else:
 input_shape = (img_rows, img_cols, 1)
 
x_train = reshape(data['train_dataset'], image_data_format, img_rows, img_cols)
x_valid = reshape(data['valid_dataset'], image_data_format, img_rows, img_cols)
x_test = reshape(data['test_dataset'], image_data_format, img_rows, img_cols)

y_train = keras.utils.to_categorical(data['train_labels'])
y_valid = keras.utils.to_categorical(data['valid_labels'])
y_test = keras.utils.to_categorical(data['test_labels'])

print('x_train shape:', x_train.shape)
print('x_valid shape:', x_valid.shape)
print('x_test shape:', x_test.shape)
print('y_train shape:', y_train.shape)
print('y_valid shape:', y_valid.shape)
print('y_test shape:', y_test.shape)

print(x_train.shape[0], 'training samples')
print(x_valid.shape[0], 'validation samples')
print(x_test.shape[0], 'testing samples')

In [None]:
# plt.figure(figsize=(8, 4))
n_rows = 10
n_cols = 8
plt.figure(figsize=(n_cols, n_rows))

letters = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
def get_label(y):
 return letters[y.argmax()]

for i in range(n_rows * n_cols):
 ax = plt.subplot(n_rows, n_cols, i+1)
 ax.imshow(x_train[i].reshape(28, 28),
 interpolation="none", cmap="gray_r")
 ax.set_title(get_label(y_train[i]), fontsize=14)
 ax.axis("off")
plt.tight_layout()

## 5.2 Training a softmax model on notMNIST

**Your job**

- train a softmax regression: start with a small number of epochs, and increase the number of epochs later
- visualize the weights
- plot the convergence curves
- save the model and its history

## 5.3. Training a one-layer FFNN on notMNIST

**Your job**

- train a FFNN with one hidden layer with 128 units
- visualize the weights
- plot the convergence curves
- save the model and its history

## 5.4 Training a deeper CNN for notMNIST


**Your job**


Train a CNN with the following structure:

- Convolutional layer with 32 filters and 5 * 5 kernel sizes and 'relu' activation
- Max pooling with pool size 2 * 2
- Convolutional layer with 64 filters and 5 * 5 kernel sizes and 'relu' activation
- Max pooling with pool size 2 * 2
- Dropout with probability 0.25
- Dense layer with 1024 units
- Dropout with probability 0.5
- Dense output layer with softmax activation

Use the Adam solver. Train for 20 epochs or more (this might take a long time).

You should achieve >= 97% accuracy on the test set.

- Save the model and visualize the last fully connected layers

In [None]:
# We use a sequential model: we stack layers
model_softmax = Sequential()
# First we need to flatten the data: replace 28 * 28 matrices by flat vectors
# This is always necessary before feeding data to a fully-connected layer (Dense object)
# Since it's the first layer, we need to give the shape of the input data
model_softmax.add(Flatten(input_shape=input_shape, name='flatten'))
# We add one dense (fully connected) layer with softmax activation function
model_softmax.add(Dense(num_labels, activation='softmax', name='output'))

# We "compile" this model, 
model_softmax.compile(
 # specifying the loss as the cross-entropy
 loss=keras.losses.categorical_crossentropy,
 # We choose the Adagrad solver, but you can choose others
 optimizer=keras.optimizers.Adagrad(),
 # We will monitor the accuracy on a testing set along optimization
 metrics=['accuracy']
)
model_softmax.summary()

In [None]:
batch_size = 64

# number of epochs (passes over the training data)
epochs = 2

# Run the training
history_softmax = model_softmax.fit(x_train, y_train,
 batch_size=batch_size,
 epochs=epochs,
 verbose=1,
 validation_data=(x_test, y_test))
score_softmax = model_softmax.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score_softmax[0])
print('Test accuracy:', score_softmax[1])

In [None]:
weights, biases = model_softmax.get_layer('output').get_weights()
imgs = weights.reshape(28, 28, 10)

fig = plt.figure(figsize=(10, 5))
vmin, vmax = imgs.min(), imgs.max()
for i in range(10):
 ax = plt.subplot(2, 5, i + 1)
 im = imgs[:, :, i]
 mappable = ax.imshow(im, interpolation="nearest", 
 vmin=vmin, vmax=vmax, cmap='gray')
 ax.axis('off')
 ax.set_title(letters[i])  # the classes are the letters 'A' to 'J'
plt.tight_layout()

In [None]:
model_cnn = Sequential()
model_cnn.add(Conv2D(32, kernel_size=(5,5), activation='relu', input_shape=input_shape, name='conv2d_1'))
model_cnn.add(MaxPooling2D(pool_size=(2, 2), name='max_pool_1'))
model_cnn.add(Conv2D(64, kernel_size=(5, 5), activation='relu', name='conv2d_2'))
model_cnn.add(MaxPooling2D(pool_size=(2, 2), name='max_pool_2'))
model_cnn.add(Dropout(0.25, name='dropout_1'))
model_cnn.add(Flatten(name='flatten'))
model_cnn.add(Dense(1024, activation='relu', name='dense'))
model_cnn.add(Dropout(0.5, name='dropout_2'))
model_cnn.add(Dense(num_labels, activation='softmax', name='output'))
 
model_cnn.compile(loss=keras.losses.categorical_crossentropy,
 optimizer=keras.optimizers.Adam(),  # Adam solver, as asked in Section 5.4
 metrics=['accuracy'])

model_cnn.summary()

In [None]:
# Training settings
batch_size = 64
epochs = 2

# Run the training
history_cnn = model_cnn.fit(x_train, y_train,
 batch_size=batch_size,
 epochs=epochs,
 verbose=1,
 validation_data=(x_test, y_test))
score_cnn = model_cnn.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score_cnn[0])
print('Test accuracy:', score_cnn[1])

In [None]:
# predictions on the test set with the CNN
pred_cnn = model_cnn.predict(x_test)
# predictions on the test set with the softmax (logistic) regression
pred_log = model_softmax.predict(x_test)

In [None]:
print("Some predictions with the CNN")

# plt.figure(figsize=(8, 4))
n_rows = 5
n_cols = 10
plt.figure(figsize=(n_cols, n_rows))

letters = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
def get_label(y):
 return letters[y.argmax()]

for i in range(n_rows * n_cols):
 ax = plt.subplot(n_rows, n_cols, i+1)
 ax.imshow(x_test[i].reshape(28, 28),
 interpolation="none", cmap="gray_r")
 #ax.set_title(get_label(y_test[i]), fontsize=14)
 ax.set_title(get_label(pred_cnn[i]), fontsize=14)
 #ax.set_title(get_label(pred_log[i]), fontsize=14)
 ax.axis("off")
plt.tight_layout()


In [None]:
print("Some predictions with the softmax regression")
# plt.figure(figsize=(8, 4))
n_rows = 5
n_cols = 10
plt.figure(figsize=(n_cols, n_rows))

letters = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
def get_label(y):
 return letters[y.argmax()]

for i in range(n_rows * n_cols):
 ax = plt.subplot(n_rows, n_cols, i+1)
 ax.imshow(x_test[i].reshape(28, 28),
 interpolation="none", cmap="gray_r")
 #ax.set_title(get_label(y_test[i]), fontsize=14)
 #ax.set_title(get_label(pred_cnn[i]), fontsize=14)
 ax.set_title(get_label(pred_log[i]), fontsize=14)
 ax.axis("off")
plt.tight_layout()

# 6. A fashionable use case: clothing classification with fashion-mnist

- create a new notebook
- load the data with `fashion_mnist.load_data()`: there is no validation set, only train and test sets.
- the label names are:

`LABEL_NAMES = ['t_shirt', 'trouser', 'pullover', 'dress', 'coat', 'sandal', 'shirt', 'sneaker', 'bag', 'ankle_boot']`

- copy *preliminaries* [1]
- copy and adapt *load* [2]: what are the shapes? the label distributions? what does the data look like?
- normalize the images [9] and one-hot encode the labels [10]
- create a model with 3 CONV+POOL+DROP blocks: take inspiration from [23] and add a block.
- run the model as [25] does
- plot the convergence curves as [26] does
- make predictions as [15] does
- study the errors: what kinds of clothes are difficult to classify?
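One possible way to study the errors is a confusion matrix, which counts how often each true class is predicted as each other class. The sketch below is an illustration on made-up labels (`true_labels` and `pred_labels` are hypothetical, as is the helper name `confusion_matrix`); with a trained model, the predicted labels would come from the argmax of the predicted probabilities.

```python
import numpy as np

def confusion_matrix(true_labels, pred_labels, num_classes):
    """cm[i, j] = number of samples of true class i predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (true_labels, pred_labels), 1)
    return cm

# Hypothetical labels, for illustration only
true_labels = np.array([0, 0, 6, 6, 6, 2])
pred_labels = np.array([0, 6, 6, 0, 6, 2])

cm = confusion_matrix(true_labels, pred_labels, num_classes=10)

# Accuracy per true class (classes without samples get 0)
per_class_accuracy = cm.diagonal() / np.maximum(cm.sum(axis=1), 1)
print(cm[0, 6], cm[6, 0])          # confusions between classes 0 and 6
print(per_class_accuracy[[0, 2, 6]])
```

Large off-diagonal entries point to the pairs of classes that the model confuses most often.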


***Take care with the kernel sizes***


For instance:

 - Convolutional layer with 64 filters and 5 * 5 kernel sizes and 'relu' activation
 - Max pooling with pool size 2 * 2
 - Dropout with probability 0.25

 - Convolutional layer with 128 filters and 5 * 5 kernel sizes and 'relu' activation
 - Max pooling with pool size 2 * 2
 - Dropout with probability 0.25

 - Convolutional layer with 256 filters and 3 * 3 kernel sizes and 'relu' activation
 - Max pooling with pool size 2 * 2
 - Dropout with probability 0.25

 - Dense layer with 256 units
 - Dropout with probability 0.5
 - Dense output layer with softmax activation
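
To see why the kernel sizes matter, one can trace the height (or width) of the tensor through the three CONV+POOL blocks. This sketch assumes 28 x 28 input images, convolutions with stride 1 and valid padding, and pooling with a stride equal to the pool size (the `keras` defaults); the helper `trace_spatial_size` is a hypothetical name.

```python
def trace_spatial_size(in_size, layers):
    """Follow the height (or width) through a list of ('conv', k) / ('pool', k) layers."""
    size = in_size
    sizes = [size]
    for kind, k in layers:
        if kind == 'conv':       # valid padding, stride 1
            size = size - k + 1
        elif kind == 'pool':     # window k, stride k
            size = size // k
        else:
            raise ValueError("unknown layer kind: %s" % kind)
        if size < 1:
            raise ValueError("the tensor became empty: kernel too large")
        sizes.append(size)
    return sizes

layers = [('conv', 5), ('pool', 2),
          ('conv', 5), ('pool', 2),
          ('conv', 3), ('pool', 2)]
print(trace_spatial_size(28, layers))   # [28, 24, 12, 8, 4, 2, 1]
```

With these assumptions, the third convolution receives a 4 x 4 input, so a 5 x 5 kernel would not fit there, whereas a 3 x 3 kernel does.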

In [None]:
import requests
import os

url = 'https://stephanegaiffas.github.io/files/formation_cnrs/fashionMNIST.pickle.zip'
r = requests.get(url)
path_data = '../data/' 

with open(os.path.join(path_data, 'fashionMNIST.pickle.zip'), 'wb') as f:
 f.write(r.content)

In [None]:
import zipfile

filename = 'fashionMNIST.pickle.zip'
archive = zipfile.ZipFile(os.path.join(path_data, filename), 'r')
with archive.open('fashionMNIST.pickle') as f:
 data = pkl.load(f)