View on GitHub

data-science

Notebooks and Python about data science

If you like this project please add your Star

Convolutional versus Dense layers in neural networks - Part 1

Design, optimization and performance of the two networks

Convolutional layers in deep neural networks are known to have a dense (perceptron) equivalent. However, the topology of the convolutional layers is enforcing a parameter sharing: instead of copying parameters on several neurons, neurons are reused. This property of CNN is leading to a massive reduction in the number of coefficients.

In this series of notebooks, we will show this property as well as some other interesting properties of CNN as compared to the dense equivalent.

In this first part, let's design the models to be compared and optimize the performance and size.

MNIST dataset of digits [6] is well known as the initial use case of CNN by Lecun in 1998 [1, 2]. Today, it remains one of the canonical datasets for machine learning training and algorithm comparisons.

Tensorflow's documentation has a Dense implementation of a classifier for the MNIST [2]. Some implementations of the LeNet in Keras are also available [3, 4].

Let's review these two architectures in the light of Tensorflow 2.0.

The classifiers studied in this notebook are quite different from the one of the Classification of a 2D problem (HTML / Jupyter) in which the goal is to partition the hyper-plan of the features to match the labels. In this notebook, the features are pixels of images, the classifier partition is on the set of images. However, there is a great simplification compared to general image classification: on the MNIST dataset there is only a single digit per image.

Learning goals:

  • Examplify the equivalency of the convolutional and dense networks and show the decrease in number of coefficients
  • Optimize the performance of the network through regularization using Dropout layers
  • Optimize the size of the networks to get a fair comparison between the Dense and Convolution DNNs
In [ ]:
#!pip install tensorview
In [24]:
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras import activations, datasets, layers, losses, metrics, models, optimizers, regularizers
import seaborn as sns
from sklearn import metrics
import pandas as pd
import tensorview as tv

Constants and Helpers

In [2]:
metricNames = ['Categorical cross-entropy', 'Accuracy']
metricNamesRaw = ['loss', 'accuracy']
batchSize = 128
In [32]:
def plotHeatMap(X, classes='auto', title=None, fmt='.2g', ax=None, xlabel=None, ylabel=None,
               vmin=None, vmax=None, cbar=True):
    """ Fix heatmap plot from Seaborn with pyplot 3.1.0, 3.1.1
        https://stackoverflow.com/questions/56942670/matplotlib-seaborn-first-and-last-row-cut-in-half-of-heatmap-plot
    """
    ax = sns.heatmap(X, xticklabels=classes, yticklabels=classes, annot=True, 
                     fmt=fmt, vmin=vmin, vmax=vmax, cbar=cbar, cmap=plt.cm.Reds, ax=ax) 
    bottom, top = ax.get_ylim()
    ax.set_ylim(bottom + 0.5, top - 0.5)
    if title:
        ax.set_title(title)
    if xlabel:
        ax.set_xlabel(xlabel)
    if ylabel:
        ax.set_ylabel(ylabel)
        
def plotHistory(hist, with_validation=False, sharey=False):
    """ Plot a classification history as outputted by Keras """
    metrics = list(filter(lambda s: not s.startswith('val_'), hist.history.keys()))
    fig, axes = plt.subplots(1, len(metrics), figsize=(15,6), sharey=sharey)
    for m, ax in zip(metrics, axes):
        ax.plot(hist.history[m])
        if with_validation:
            ax.plot(hist.history['val_'  + m])
            ax.legend(('train', 'valid'))
        ax.set_title(m)
        ax.set_xlabel('epoch')
        ax.grid()
        
def plotConfusionMatrix(yTrue, yEst, classes, title=None, fmt='.2g', ax=None):
    """ Compute and plot the confusion matrix """
    plotHeatMap(metrics.confusion_matrix(yTrue, yEst), classes, title, fmt, ax, \
                xlabel='Estimations', ylabel='True values');

MNIST Data

MNIST dataset is derived from the NIST dataset. It contains 60000 training and 10000 testing images of size 28x28 in grayscale. Digits have been recentered on the raster using the center of mass.

In [30]:
(xTrain, yTrain),(xTest, yTest) = datasets.mnist.load_data()
xTrain = xTrain / 255.
xTest  = xTest  / 255.

digitClasses = np.array([str(d) for d in range(10)])

xTrain.shape, xTest.shape
Out[30]:
((60000, 28, 28), (10000, 28, 28))
In [6]:
fig, axes = plt.subplots(5, 20, figsize=(16, 5))
for i, ax in enumerate(axes.ravel()):
    ax.imshow(xTrain[i], cmap='gray_r')
plt.setp(axes, xticks=[], yticks=[], frame_on=False)
plt.tight_layout(h_pad=0.5, w_pad=0.01)

Dense Model design and performance

This implementation is taken from the Tensorflow tutorial [3].

The initial 28x28 pixel image is first flattened into a a vector of 784 features. These features are evaluated by 512 neurons in parallel. Their outputs are activated by the ReLU before being combined into the 10 output neurons giving softmax probabilities for each of the 10 digits.

A dropout layer is inserted in between the two layers. This layer is performing regularization in order to avoid overfitting [7].

In [7]:
modelDense0 = models.Sequential([
        layers.Flatten(input_shape=(28, 28)),
        layers.Dense(512, activation=activations.relu, name='dense_0'),
        layers.Dropout(0.2),
        layers.Dense(10, activation=activations.softmax, name='dense_1')
    ])

modelDense0.compile(optimizer=optimizers.Adam(),
                   loss=losses.sparse_categorical_crossentropy,
                   metrics=['accuracy'])

modelDense0.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
dense_0 (Dense)              (None, 512)               401920    
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                5130      
=================================================================
Total params: 407,050
Trainable params: 407,050
Non-trainable params: 0
_________________________________________________________________
In [8]:
nEpochs = 100
tvPlot = tv.train.PlotMetricsOnEpoch(metrics_name=metricNames,
                                      cell_size=(6,4), columns=2, iter_num=nEpochs, wait_num=2)
historyDense0 = modelDense0.fit(xTrain, yTrain,
                epochs=nEpochs, batch_size=batchSize, 
                validation_split=0.1, 
                verbose=0,
                callbacks=[tvPlot]);

This model has 2 issues:

  • Even if there is regularization through the dropout layer there is still a quite large overfitting as the train and validation performance is quite different, and the validation loss is worsening from epoch 20 forward
  • The variance of the validation loss and metric is high, hinting also for overfitting
In [9]:
loss, acc = modelDense0.evaluate(xTest, yTest, verbose=0)
print("Model Dense #0, test accuracy = %.2f %%" % (acc*100))
Model Dense #0, test accuracy = 98.41 %
In [10]:
weightsDense0 = modelDense0.get_weights()

fig, axes = plt.subplots(1, 2, figsize=(15, 5))
for ax, layerIndex, weightIndex in zip(axes, [0, 1], [0, 2]):
    ax.hist(weightsDense0[weightIndex].ravel(), density=False, bins=40)
    ax.set_title("Dense layer #%d coefficients" % layerIndex)

Adding more regularization to the Dense network

The flatten layer has no training coefficients (weights, bias), but is transforming the topology of the feature space and is actually the wider feature space of the network (784 for this layer versus 512 at the next layer). Let's add a Dropout layer on this feature space. Also, as advised in [7], let's change the hidden layer dropout ratio from 20% to 50%.

In [0]:
modelDense1 = models.Sequential([
        layers.Flatten(input_shape=(28, 28), name='flatten'),
        layers.Dropout(0.2, name='dropout_0'),                             # <----
        layers.Dense(512, activation=activations.relu, name='dense_0'),
        layers.Dropout(0.5, name='dropout_1'),                             # <----
        layers.Dense(10, activation=activations.softmax, name='dense_1')
    ])

modelDense1.compile(optimizer=optimizers.Adam(),
       loss=losses.sparse_categorical_crossentropy,
       metrics=['accuracy']);
In [12]:
nEpochs = 100
tvPlot = tv.train.PlotMetricsOnEpoch(metrics_name=metricNames,
                                      cell_size=(6,4), columns=2, iter_num=nEpochs, wait_num=2)
historyDense1 = modelDense1.fit(xTrain, yTrain,
                epochs=nEpochs, batch_size=batchSize, 
                validation_split=0.1, 
                verbose=0,
                callbacks=[tvPlot]);

There is still overfitting on this network (delta between train and validation), but the variation on the validation curves are much lower and performance has increased by 0.2%.

The original paper of Dropout [7] is listing results better than 99% accuracy using a 2 layer dense network with 8096 units on each layer.

The validation metrics are better than the training during first epochs, this is a direct impact of the dropout: the training has fewer neurons active than the validation.

Moreover, on the histogram below for layer #0, many coefficients are close to 0. In the following section, we will attempt to decrease the model size while keeping similar performance.

In [13]:
loss, acc = modelDense1.evaluate(xTest, yTest, verbose=0)
print("Model Dense #1, test accuracy = %.2f %%" % (acc*100))
Model Dense #1, test accuracy = 98.54 %
In [14]:
weightsDense1 = modelDense1.get_weights()

fig, axes = plt.subplots(1, 2, figsize=(15, 5))
for ax, layerIndex, weightIndex in zip(axes, [0, 1], [0, 2]):
    ax.hist(weightsDense1[weightIndex].ravel(), density=False, bins=40)
    ax.set_title("Dense layer #%d coefficients" % layerIndex)

Size optimized Dense model

This model seems overdesigned, many neurons are not used in layer #1. In order to have a fair comparison to the CNN implementation, let's optimize the number of neurons in layer #0.

In [15]:
batchSize = 128
nEpochs = 100

numNeuronsRange = [32, 64, 128, 256]
perfDense2 = []
for numNeurons in numNeuronsRange:
    modelDense = models.Sequential([
            layers.Flatten(input_shape=(28, 28)),
            layers.Dropout(0.2),
            layers.Dense(numNeurons, activation=activations.relu, name='dense_0'),
            layers.Dropout(0.5),
            layers.Dense(10, activation=activations.softmax, name='dense_1')
        ])

    modelDense.compile(optimizer=optimizers.Adam(),
           loss=losses.sparse_categorical_crossentropy,
           metrics=['accuracy'])

    historyDense = modelDense.fit(xTrain, yTrain,
            epochs=nEpochs, batch_size=batchSize, 
            validation_split=0.1, 
            verbose=0)
    
    perfDense2.append(modelDense.evaluate(xTest, yTest, verbose=0))
    print("Accuracy for %d neurons in layer #0 : %.2f %%" % (numNeurons, perfDense2[-1][1] * 100))
Accuracy for 32 neurons in layer #0 : 95.38 %
Accuracy for 64 neurons in layer #0 : 97.17 %
Accuracy for 128 neurons in layer #0 : 98.17 %
Accuracy for 256 neurons in layer #0 : 98.35 %
In [16]:
plt.plot(numNeuronsRange, np.array(perfDense2)[:,1])
plt.title("Accuracy as function of number of neurons in layer #0")
plt.grid()

Based on above grid search, in order to get a trade-off on performance and size, we select the first layer to be 128 neurons wide.

In [5]:
modelDense2 = models.Sequential([
            layers.Flatten(input_shape=(28, 28), name='flatten'),
            layers.Dropout(0.2, name='dropout_0'),
            layers.Dense(128, activation=activations.relu, name='dense_0'),
            layers.Dropout(0.5, name='dropout_1'),
            layers.Dense(10, activation=activations.softmax, name='dense_1')
        ])

modelDense2.compile(optimizer=optimizers.Adam(),
                   loss=losses.sparse_categorical_crossentropy,
                   metrics=['accuracy'])

modelDense2.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
dropout_0 (Dropout)          (None, 784)               0         
_________________________________________________________________
dense_0 (Dense)              (None, 128)               100480    
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                1290      
=================================================================
Total params: 101,770
Trainable params: 101,770
Non-trainable params: 0
_________________________________________________________________
In [7]:
nEpochs = 100
tvPlot = tv.train.PlotMetricsOnEpoch(metrics_name=metricNames,
                                      cell_size=(6,4), columns=2, iter_num=nEpochs, wait_num=2)
historyDense2 = modelDense2.fit(xTrain, yTrain,
                epochs=nEpochs, batch_size=batchSize, 
                validation_split=0.1, 
                verbose=0,
                callbacks=[tvPlot]);

The performance of the optimized network is 0.4 to 0.5 % lower than the original but with a quarter of the coefficients.

Also, overfitting is no longer visible on the loss and metrics curves.

In [13]:
loss, acc = modelDense2.evaluate(xTest, yTest, verbose=0)
yEstDense2 = modelDense2.predict(xTest)
print("Model Dense #1, test accuracy = %.2f %%" % (acc*100))
Model Dense #1, test accuracy = 98.02 %
In [20]:
weightsDense2 = modelDense2.get_weights()

fig, axes = plt.subplots(1, 2, figsize=(15, 5))
for ax, layerIndex, weightIndex in zip(axes, [0, 1], [0, 2]):
    ax.hist(weightsDense2[weightIndex].ravel(), density=False, bins=40)
    ax.set_title("Dense layer #%d coefficients" % layerIndex)

Evaluation on the Test set

In [20]:
fig, ax = plt.subplots(figsize=(9, 8))
plotConfusionMatrix(yTest, np.argmax(yEstDense2, axis=1), digitClasses)

Overall, very few misclassifications. The top 2 confusions are 5s estimated as 3, and 7s estimated as 2.

Save Dense optimized model

Saving this model, to be reused in Part 2 and 3.

In [0]:
modelDense2.save('models/MNIST_dense128.h5')

LeNet model

Adapted from [4] in which the activation function is already the ReLU, original activation is Tanh. The network has two sections: convolutional and dense.

The convolutional section is made of two pairs of convolutional + average pooling. The first pair operates 6 neurons on the input (30x30 pixels without padding), the second has 16 neurons on the 15x15x6 outputs of previous.

The images are padded to reach 32x32 pixels at input of the classifier and the dimension is increased to cope with the convolution layers.

For an introduction to convolution filter/layer, see [8]

In [24]:
imagePadWidth  = 32
imagePadHeight = 32
xTrainPad  = np.pad(xTrain.reshape(-1, 28, 28, 1), ((0,0),(2,2),(2,2),(0,0)), 'constant')
xTestPad   = np.pad(xTest.reshape(-1, 28, 28, 1), ((0,0),(2,2),(2,2),(0,0)), 'constant')
    
print("Padded image shape: {}".format(xTrainPad[0].shape))
Padded image shape: (32, 32, 1)
In [25]:
modelLeNet0 = models.Sequential([
    layers.Conv2D(filters=6, kernel_size=(3, 3), activation=activations.relu, 
                  input_shape=(imagePadWidth, imagePadHeight,1), name='conv_0'),
    layers.AveragePooling2D(),
    layers.Conv2D(filters=16, kernel_size=(3, 3), activation='relu', name='conv_1'),
    layers.AveragePooling2D(),
    layers.Flatten(),
    layers.Dense(units=120, activation=activations.relu, name='dense_0'),
    layers.Dense(units=84, activation=activations.relu, name='dense_1'),
    layers.Dense(units=10, activation =activations.softmax, name='dense_2')])

modelLeNet0.compile(optimizer=optimizers.Adam(),
                   loss=losses.sparse_categorical_crossentropy,
                   metrics=['accuracy'])

modelLeNet0.summary()
Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv_0 (Conv2D)              (None, 30, 30, 6)         60        
_________________________________________________________________
average_pooling2d (AveragePo (None, 15, 15, 6)         0         
_________________________________________________________________
conv_1 (Conv2D)              (None, 13, 13, 16)        880       
_________________________________________________________________
average_pooling2d_1 (Average (None, 6, 6, 16)          0         
_________________________________________________________________
flatten_5 (Flatten)          (None, 576)               0         
_________________________________________________________________
dense_0 (Dense)              (None, 120)               69240     
_________________________________________________________________
dense_1 (Dense)              (None, 84)                10164     
_________________________________________________________________
dense_2 (Dense)              (None, 10)                850       
=================================================================
Total params: 81,194
Trainable params: 81,194
Non-trainable params: 0
_________________________________________________________________
In [26]:
nEpochs = 60
tvPlot = tv.train.PlotMetricsOnEpoch(metrics_name=metricNames,
                                      cell_size=(6,4), columns=2, iter_num=nEpochs, wait_num=2)
historyLeNet0 = modelLeNet0.fit(xTrainPad, yTrain,
                epochs=nEpochs, batch_size=batchSize, 
                validation_split=0.1, 
                verbose=0,
                callbacks=[tvPlot]);

Optimization of this model is very quick. But is also overfitting and the train accuracy is reaching 100% while loss is flooring at 0.

In [27]:
loss, acc = modelLeNet0.evaluate(xTestPad, yTest, verbose=0)
print("Model LeNet #0, test accuracy = %.2f %%" % (acc*100))
Model LeNet #0, test accuracy = 99.05 %
In [28]:
weightsLeNet0 = modelLeNet0.get_weights()

fig, axes = plt.subplots(2, 3, figsize=(16, 7), sharex=True, sharey=True)
for i, ax in enumerate(axes.ravel()):
    plotHeatMap(weightsLeNet0[0][:,:, 0, i], [0, 1, 2], ax=ax, vmin=-1., vmax=1., cbar=False)

axes[0,0].set_title("LeNet's convolution layer #0 weights");
In [29]:
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
for ax, layerIndex, weightIndex in zip(axes, [0, 1, 2], [4, 6, 8]):
    ax.hist(weightsLeNet0[weightIndex].ravel(), density=False, bins=40)
    ax.set_title("Dense layer #%d coefficients" % layerIndex)

LeNet with L1-L2 regularization

In the following model, a mix of norm L1 (absolute value) and L2 (euclidian distance) is applied on the first dense layer in order to push the coefficients toward 0. This combination of regularization is also known as Elastic Net [9]: $$ \hat{\theta} = \underset{\theta \in \mathbb{R}^p, \lambda_{L_1} \in \mathbb{R}^+, \lambda_{L_2} \in \mathbb{R}^+}{argmin}\left[ \mathcal{L}(X, y, \theta) + \lambda_{L_1} \Vert \theta \Vert_1 + \lambda_{L_2} \frac 12 \Vert \theta \Vert_2^2 \right] $$

With :

  • $X$, $y$ the features and labels
  • $\theta$ the weights to learn
  • $\mathcal{L}(X, y, \theta)$ the loss function
  • $\lambda_{L_1}$, $\lambda_{L_2}$ the regularization parameters

L1 and L2 regularizations can be interpreted with Lagrange multipliers. The L1 penalizations has iso-levels that are hyper-cubes. The L2 iso-levels are hyper-bowls.

L1-L2 regularization is added before the two large dense layers. The second regularization is less important and is set to a fixed mild penalization. The first's L1-L2 penalization coefficients are tuned through a square grid search.

In [30]:
nEpochs = 60
lambdaRange = [0, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2]
historyLeNet1 = []
accLeNet1 = []
lossLeNet1 = []
weightsLeNet1 = []

for lambdaL1 in lambdaRange:
    print("Lambda L1=%.0e... L2=" % (lambdaL1), end='')
    for lambdaL2 in lambdaRange:
        print('%.0e' % lambdaL2, end='', flush=True)
        modelLeNet = models.Sequential([
            layers.Conv2D(filters=6, kernel_size=(3, 3), activation=activations.relu, 
                          input_shape=(imagePadWidth, imagePadHeight,1), name='conv_0'),
            layers.AveragePooling2D(),
            layers.Conv2D(filters=16, kernel_size=(3, 3), activation=activations.relu, 
                          name='conv_1'),
            layers.AveragePooling2D(),
            layers.Flatten(),
            layers.Dense(units=120, activation=activations.relu, 
                         kernel_regularizer=regularizers.l1_l2(lambdaL1, lambdaL2), # <---
                         name='dense_0'),
            layers.Dense(units=84, activation=activations.relu, 
                         kernel_regularizer=regularizers.l1_l2(0.000001, 0.000001), # <---
                         name='dense_1'),
            layers.Dense(units=10, activation =activations.softmax,
                         name='dense_2')])

        modelLeNet.compile(optimizer=optimizers.Adam(),
                           loss=losses.sparse_categorical_crossentropy,
                           metrics=['accuracy'])
        hist = modelLeNet.fit(xTrainPad, yTrain,
                    epochs=nEpochs, batch_size=batchSize, 
                    validation_split=0.1, 
                    verbose=0)
        loss, acc = modelLeNet.evaluate(xTestPad, yTest, verbose=0)
       
        historyLeNet1.append(hist)
        accLeNet1.append(acc)
        lossLeNet1.append(loss)
        weightsLeNet1.append(modelLeNet.get_weights())
        print('(%.2f%%), ' % (acc*100), end='', flush=True)
    print('', flush=True)
Lambda L1=0e+00... L2=0e+00(99.12%), 1e-06(99.10%), 1e-05(99.15%), 1e-04(98.95%), 1e-03(98.96%), 1e-02(98.79%), 
Lambda L1=1e-06... L2=0e+00(98.92%), 1e-06(99.00%), 1e-05(98.81%), 1e-04(98.90%), 1e-03(98.85%), 1e-02(98.56%), 
Lambda L1=1e-05... L2=0e+00(99.06%), 1e-06(98.93%), 1e-05(98.67%), 1e-04(98.95%), 1e-03(98.73%), 1e-02(98.69%), 
Lambda L1=1e-04... L2=0e+00(98.85%), 1e-06(98.58%), 1e-05(98.68%), 1e-04(98.60%), 1e-03(98.87%), 1e-02(98.69%), 
Lambda L1=1e-03... L2=0e+00(98.78%), 1e-06(97.95%), 1e-05(98.34%), 1e-04(98.64%), 1e-03(98.35%), 1e-02(98.37%), 
Lambda L1=1e-02... L2=0e+00(97.27%), 1e-06(96.82%), 1e-05(97.75%), 1e-04(97.12%), 1e-03(97.43%), 1e-02(97.02%), 

Selected LeNet model with L1-L2 regularization

In [31]:
modelLeNet1Index = np.argmax(accLeNet1)
modelLeNet1IndexL1 = modelLeNet1Index // len(lambdaRange)
modelLeNet1IndexL2 = modelLeNet1Index % len(lambdaRange)
print("Selected model lambda L1=%.0e, L2=%.0e, accuracy=%.2f%%" % \
      (lambdaRange[modelLeNet1IndexL1], lambdaRange[modelLeNet1IndexL2], accLeNet1[modelLeNet1Index]*100))
Selected model lambda L1=0e+00, L2=1e-05, accuracy=99.15%
In [65]:
plotHistory(historyLeNet1[modelLeNet1Index], True)

The performance of the CNN with L1-L2 regularization is very close to the original one. As shown on the histograms below, there are numerous coefficients close to 0, that's room for network size optimization.

In [33]:
weightsLeNet1 = weightsLeNet1[modelLeNet1Index]

fig, axes = plt.subplots(2, 3, figsize=(16, 7), sharex=True, sharey=True)
for i, ax in enumerate(axes.ravel()):
    plotHeatMap(weightsLeNet1[0][:,:, 0, i], [0, 1, 2], ax=ax, vmin=-1., vmax=1., cbar=False)

axes[0,0].set_title("LeNet's convolution layer #0 weights");
In [34]:
fig, axes = plt.subplots(1, 3, figsize=(16, 4))
for ax, layerIndex, weightIndex in zip(axes, [0, 1, 2], [4, 6, 8]):
    ax.hist(weightsLeNet1[weightIndex].ravel(), density=False, bins=40)
    ax.set_title("Dense layer #%d coefficients" % layerIndex)

The effect of the L2 regularization is clearly visible on the left and center histograms in which there is a large share of coefficients clost to 0.

LeNet with Dropout regularization

Dropout [7] is another very effective regularization technic in which a share of the layer (here the convolution layer #1 with 16 neurons) is disabled at each iteration. The selection of neurons to disable is random.

The goal of dropout is to avoid strong links between units of consecutive layers. These links would probably mean that the down stream neuron is correcting a bias of the up stream counterpart.

Let's insert a dropout before each of the 2 large dense layers of the LeNet-5 model. Dropout rates are tuned through a grid search.

In [48]:
nEpochs = 60
dropoutRange = [0, 0.1, 0.2, 0.3, 0.4, 0.5]
historyLeNet2 = []
accLeNet2 = []
lossLeNet2 = []
weightsLeNet2 = []

for dropoutRate1 in dropoutRange:
    print("Dropout rate #1 = %.2f...#2 = " % (dropoutRate1), end='')
    for dropoutRate2 in dropoutRange:
        print('%.2f' % dropoutRate2, end='', flush=True)
        modelLeNet = models.Sequential([
            layers.Conv2D(filters=6, kernel_size=(3, 3), activation=activations.relu, 
                          input_shape=(imagePadWidth, imagePadHeight,1), name='conv_0'),
            layers.AveragePooling2D(),
            layers.Conv2D(filters=16, kernel_size=(3, 3), activation=activations.relu, 
                          name='conv_1'),
            layers.AveragePooling2D(),
            layers.Flatten(),
            layers.Dropout(dropoutRate1, name='dropout_0'),  
            layers.Dense(units=120, activation=activations.relu, name='dense_0'),
            layers.Dropout(dropoutRate2, name='dropout_1'),  
            layers.Dense(units=84, activation=activations.relu, name='dense_1'),
            layers.Dense(units=10, activation =activations.softmax, name='dense_2')])

        modelLeNet.compile(optimizer=optimizers.Adam(),
                           loss=losses.sparse_categorical_crossentropy,
                           metrics=['accuracy'])
        hist = modelLeNet.fit(xTrainPad, yTrain,
                    epochs=nEpochs, batch_size=batchSize, 
                    validation_split=0.1, 
                    verbose=0)
        loss, acc = modelLeNet.evaluate(xTestPad, yTest, verbose=0)
        historyLeNet2.append(hist)
        accLeNet2.append(acc)
        lossLeNet2.append(loss)
        weightsLeNet2.append(modelLeNet.get_weights())
        print('(%.2f%%), ' % (acc*100), end='', flush=True)
    print('', flush=True)
Dropout rate #1 = 0.00...#2 = 0.00(99.04%), 0.10(99.13%), 0.20(99.21%), 0.30(99.12%), 0.40(99.23%), 0.50(99.19%), 
Dropout rate #1 = 0.10...#2 = 0.00(99.14%), 0.10(99.18%), 0.20(99.21%), 0.30(99.19%), 0.40(99.17%), 0.50(99.31%), 
Dropout rate #1 = 0.20...#2 = 0.00(99.27%), 0.10(99.26%), 0.20(99.27%), 0.30(99.26%), 0.40(99.20%), 0.50(99.28%), 
Dropout rate #1 = 0.30...#2 = 0.00(99.19%), 0.10(99.34%), 0.20(99.36%), 0.30(99.34%), 0.40(99.38%), 0.50(99.30%), 
Dropout rate #1 = 0.40...#2 = 0.00(99.25%), 0.10(99.19%), 0.20(99.33%), 0.30(99.43%), 0.40(99.41%), 0.50(99.25%), 
Dropout rate #1 = 0.50...#2 = 0.00(99.29%), 0.10(99.21%), 0.20(99.23%), 0.30(99.43%), 0.40(99.26%), 0.50(99.19%), 

Selected model with Dropout regularization

In [49]:
modelLeNet2Index = np.argmax(accLeNet2)
modelLeNet2Dropout1 = dropoutRange[modelLeNet2Index // len(dropoutRange)]
modelLeNet2Dropout2 = dropoutRange[modelLeNet2Index  % len(dropoutRange)]
print("Selected model dropout rate #1=%.2f, #2=%.2f, accuracy=%.2f%%" % \
      (modelLeNet2Dropout1, modelLeNet2Dropout2, accLeNet2[modelLeNet2Index]*100))
Selected model dropout rate #1=0.40, #2=0.30, accuracy=99.43%
In [66]:
plotHistory(historyLeNet2[modelLeNet2Index], True)
In [51]:
weightsLeNet2 = weightsLeNet2[modelLeNet2Index]

fig, axes = plt.subplots(2, 3, figsize=(16, 7), sharex=True, sharey=True)
for i, ax in enumerate(axes.ravel()):
    plotHeatMap(weightsLeNet2[0][:,:, 0, i], [0, 1, 2], ax=ax, vmin=-1., vmax=1., cbar=False)

axes[0,0].set_title("LeNet's convolution layer #0 weights");
In [52]:
fig, axes = plt.subplots(1, 3, figsize=(16, 4))
for ax, layerIndex, weightIndex in zip(axes, [0, 1, 2], [4, 6, 8]):
    ax.hist(weightsLeNet2[weightIndex].ravel(), density=False, bins=40)
    ax.set_title("Dense layer #%d coefficients" % layerIndex)

While L1-L2 regularization is bringing very little improvement, Dropout is gaining 0.3% of performance. In the following the same Dropout is used with a smaller network.

Size optimized LeNet with Dropout

The two large dense layers are halved keeping the Dropout setup as optimized above with a grid.

In [53]:
modelLeNet3 = models.Sequential([
    layers.Conv2D(filters=6, kernel_size=(3, 3), activation=activations.relu, 
                  input_shape=(imagePadWidth, imagePadHeight, 1), 
                  name='conv_0'),
    layers.AveragePooling2D(),
    layers.Conv2D(filters=16, kernel_size=(3, 3), activation=activations.relu, 
                  name='conv_1'),         
    layers.AveragePooling2D(),    
    layers.Flatten(name='flatten'),
    layers.Dropout(modelLeNet2Dropout1, name='dropout_0'),  
    layers.Dense(units=60, activation=activations.relu,                   # <---
                 name='dense_0'),
    layers.Dropout(modelLeNet2Dropout2, name='dropout_1'),                                
    layers.Dense(units=42, activation=activations.relu,                   # <---
                  name='dense_1'),
    layers.Dense(units=10, activation=activations.softmax, 
                 name='dense_2')])

modelLeNet3.compile(optimizer=optimizers.Adam(),
                   loss=losses.sparse_categorical_crossentropy,
                   metrics=['accuracy'])

modelLeNet3.summary()
Model: "sequential_131"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv_0 (Conv2D)              (None, 30, 30, 6)         60        
_________________________________________________________________
average_pooling2d_248 (Avera (None, 15, 15, 6)         0         
_________________________________________________________________
conv_1 (Conv2D)              (None, 13, 13, 16)        880       
_________________________________________________________________
average_pooling2d_249 (Avera (None, 6, 6, 16)          0         
_________________________________________________________________
flatten (Flatten)            (None, 576)               0         
_________________________________________________________________
dropout_0 (Dropout)          (None, 576)               0         
_________________________________________________________________
dense_0 (Dense)              (None, 60)                34620     
_________________________________________________________________
dropout_1 (Dropout)          (None, 60)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 42)                2562      
_________________________________________________________________
dense_2 (Dense)              (None, 10)                430       
=================================================================
Total params: 38,552
Trainable params: 38,552
Non-trainable params: 0
_________________________________________________________________
In [54]:
nEpochs = 80
tvPlot = tv.train.PlotMetricsOnEpoch(metrics_name=metricNames,
                                      cell_size=(6,4), columns=2, iter_num=nEpochs, wait_num=2)
historyLeNet3 = modelLeNet3.fit(xTrainPad, yTrain,
                epochs=nEpochs, batch_size=batchSize, 
                validation_split=0.1, 
                verbose=0, callbacks=[tvPlot]);

This model has better performance than original LeNet-5 but with half of its coefficients.

In [55]:
loss, acc = modelLeNet3.evaluate(xTestPad, yTest, verbose=0)
print("Model LeNet #3, test accuracy = %.2f %%" % (acc*100))
Model LeNet #3, test accuracy = 99.22 %
In [56]:
weightsLeNet3 = modelLeNet3.get_weights()

fig, axes = plt.subplots(2, 3, figsize=(16, 7), sharex=True, sharey=True)
for i, ax in enumerate(axes.ravel()):
    plotHeatMap(weightsLeNet3[0][:,:, 0, i], [0, 1, 2], ax=ax, vmin=-1., vmax=1., cbar=False)

axes[0,0].set_title("LeNet's convolution layer #0 weights");
In [57]:
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
for ax, layerIndex, weightIndex in zip(axes, [0, 1, 2], [4, 6, 8]):
    ax.hist(weightsLeNet3[weightIndex].ravel(), density=False, bins=40)
    ax.set_title("Dense layer #%d coefficients" % layerIndex)

Save model

In [0]:
modelLeNet3.save('models/MNIST_LeNet60.h5')

Conclusion

In this first part, we have designed and optimized for performance and size the Dense and "LeNet" classifiers for the MNIST digits. Here are our findings :

Network Number of coefficients Regularization Accuracy on MNIST test
Dense "Tensor Tutorial" 407,050 Single Dropout ~98.4%
Dense with improved regularization 407,050 2 Dropout layers ~98.6%
Dense with 128 neurons 101,770 2 Dropout layers ~97.9%
LeNet with ReLU 81,194 No regularization ~99.1%
LeNet with ReLU and Elasticnet 81,194 L1-L2 regularization on 2 dense layers ~99.1%
LeNet with ReLU and Dropout 81,194 Dropout before 2 dense layers ~99.4%
LeNet with ReLU and Dropout, halved dense layers 38,552 Dropout before 2 dense layers ~99.2%

In this notebook, two regularization technics were used in order to tune generalization properties (i.e. fight against overfitting) and performance. Dropout has been shown to be more effective than classical L1-L2 penalizations. Dropout is also easier to tune than L1-L2 penalization, the hyper-parameter of Dropout seem to be less related to the model architecture than the hyperparameters of the L1-L2 penalization. Note that there are other regularizations technics like adding some (white) noise on the gradient, or early stopping.

It has been shown that decreasing the number of neurons is not always decreasing the performance. It might lead to a better tradeoff between model complexity (and execution cost in production), overfitting and performance.

Convolutional neural network is performing better and at a lower model complexity than the equivalent Dense (fully connnected) network.

It is worth noting that any model downloaded from the Internet should be evaluated and tuned as it might be sub-optimal in performance and complexity.

Where to go from here

  • Dense vs. Convolutional Part-2 Visualization (HTML / Jupyter) using UMAP to view the Dense layers of the two networks
  • Dense vs. Convolutional Part-3 coming soon, we will study the impact of geometry transformations (translation, rotation, scale...) on the two kinds of networks
  • Classification of a 2D problem based on the Czech and Norvegian flags (HTML / Jupyter)
  • Using convolutional layer as a linear filter to detect edges - Part 1 (HTML / Jupyter)

References:

  1. Gradient-Based Learning Applied to Document Recognition, Lecun et al. - http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf
  2. Demonstrations on the LeNet classification of digits, Lecun et al. - http://yann.lecun.com/exdb/lenet/index.html
  3. Dense implementation of the MNIST classifier - https://www.tensorflow.org/tensorboard/get_started
  4. LeNet implementation with Tensorflow Keras - https://colab.research.google.com/drive/1CVm50PGE4vhtB5I_a_yc4h5F-itKOVL9
  5. LeNet implementation with Tensorflow 1.0 - https://github.com/udacity/CarND-LeNet-Lab/blob/master/LeNet-Lab-Solution.ipynb
  6. MNIST dataset - http://yann.lecun.com/exdb/mnist/
  7. Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Nitish Srivastava et al. - http://jmlr.org/papers/v15/srivastava14a.html
  8. A Beginner’s Guide to Convolutional Neural Networks (CNNs), Suhyun Kim - https://towardsdatascience.com/a-beginners-guide-to-convolutional-neural-networks-cnns-14649dbddce8
  9. Regularization and variable selection via the elastic net, Hui Zou and Trevor Hastie - https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.124.4696