Convolutional versus Dense layers in neural networks - Part 1¶
Design, optimization and performance of the two networks¶
Convolutional layers in deep neural networks are known to have a dense (perceptron) equivalent. However, the topology of convolutional layers enforces parameter sharing: instead of copying parameters onto several neurons, the same neurons are reused across the input. This property of CNNs leads to a massive reduction in the number of coefficients.
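As a rough, hedged back-of-the-envelope illustration of this reduction (a hypothetical 3x3 convolution with 16 filters on a 28x28 grayscale image, not one of the models trained below), compare the coefficient counts of a convolution and of a dense layer producing an output of the same size:
# Parameter count of a 3x3 convolution with 16 filters on a 28x28x1 image
# ('valid' padding, output 26x26x16) versus a dense layer mapping the flattened
# 784 inputs to the same 26*26*16 outputs.
kernelH, kernelW, inChannels, nFilters = 3, 3, 1, 16
outH, outW = 28 - 2, 28 - 2                                                # 26x26 with 'valid' padding
convParams = kernelH * kernelW * inChannels * nFilters + nFilters          # 160 coefficients
denseParams = 28 * 28 * (outH * outW * nFilters) + outH * outW * nFilters  # ~8.5 million coefficients
print(convParams, denseParams)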
In this series of notebooks, we will illustrate this property as well as some other interesting properties of CNNs compared to their dense equivalent.
In this first part, let's design the models to be compared and optimize their performance and size.
The MNIST dataset of digits [6] is well known as the initial use case of CNNs by LeCun in 1998 [1, 2]. Today, it remains one of the canonical datasets for machine learning training and algorithm comparisons.
Tensorflow's documentation has a Dense implementation of a classifier for MNIST [3]. Some implementations of LeNet in Keras are also available [4, 5].
Let's review these two architectures in the light of Tensorflow 2.0.
The classifiers studied in this notebook are quite different from the one of the Classification of a 2D problem (HTML / Jupyter), in which the goal is to partition the feature hyperplane to match the labels. In this notebook, the features are the pixels of images, and the classifier partitions the set of images. There is, however, a great simplification compared to general image classification: in the MNIST dataset there is only a single digit per image.
Learning goals:¶
- Exemplify the equivalence of convolutional and dense networks and show the decrease in the number of coefficients
- Optimize the performance of the networks through regularization using Dropout layers
- Optimize the size of the networks to get a fair comparison between the Dense and Convolutional DNNs
#!pip install tensorview
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras import activations, datasets, layers, losses, models, optimizers, regularizers
import seaborn as sns
from sklearn import metrics
import pandas as pd
import tensorview as tv
Constants and Helpers¶
metricNames = ['Categorical cross-entropy', 'Accuracy']
metricNamesRaw = ['loss', 'accuracy']
batchSize = 128
def plotHeatMap(X, classes='auto', title=None, fmt='.2g', ax=None, xlabel=None, ylabel=None,
vmin=None, vmax=None, cbar=True):
""" Fix heatmap plot from Seaborn with pyplot 3.1.0, 3.1.1
https://stackoverflow.com/questions/56942670/matplotlib-seaborn-first-and-last-row-cut-in-half-of-heatmap-plot
"""
ax = sns.heatmap(X, xticklabels=classes, yticklabels=classes, annot=True,
fmt=fmt, vmin=vmin, vmax=vmax, cbar=cbar, cmap=plt.cm.Reds, ax=ax)
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
if title:
ax.set_title(title)
if xlabel:
ax.set_xlabel(xlabel)
if ylabel:
ax.set_ylabel(ylabel)
def plotHistory(hist, with_validation=False, sharey=False):
""" Plot a classification history as outputted by Keras """
metrics = list(filter(lambda s: not s.startswith('val_'), hist.history.keys()))
fig, axes = plt.subplots(1, len(metrics), figsize=(15,6), sharey=sharey)
for m, ax in zip(metrics, axes):
ax.plot(hist.history[m])
if with_validation:
ax.plot(hist.history['val_' + m])
ax.legend(('train', 'valid'))
ax.set_title(m)
ax.set_xlabel('epoch')
ax.grid()
def plotConfusionMatrix(yTrue, yEst, classes, title=None, fmt='.2g', ax=None):
""" Compute and plot the confusion matrix """
plotHeatMap(metrics.confusion_matrix(yTrue, yEst), classes, title, fmt, ax, \
xlabel='Estimations', ylabel='True values');
MNIST Data¶
The MNIST dataset is derived from the NIST dataset. It contains 60,000 training and 10,000 test grayscale images of size 28x28 pixels. The digits have been centered on the raster using their center of mass.
(xTrain, yTrain),(xTest, yTest) = datasets.mnist.load_data()
xTrain = xTrain / 255.
xTest = xTest / 255.
digitClasses = np.array([str(d) for d in range(10)])
xTrain.shape, xTest.shape
fig, axes = plt.subplots(5, 20, figsize=(16, 5))
for i, ax in enumerate(axes.ravel()):
ax.imshow(xTrain[i], cmap='gray_r')
plt.setp(axes, xticks=[], yticks=[], frame_on=False)
plt.tight_layout(h_pad=0.5, w_pad=0.01)
Dense Model design and performance¶
This implementation is taken from the Tensorflow tutorial [3].
The initial 28x28 pixel image is first flattened into a vector of 784 features. These features are evaluated by 512 neurons in parallel. Their outputs are activated by the ReLU before being combined into the 10 output neurons giving softmax probabilities for each of the 10 digits.
A dropout layer is inserted between the two dense layers. This layer performs regularization in order to avoid overfitting [7].
modelDense0 = models.Sequential([
layers.Flatten(input_shape=(28, 28)),
layers.Dense(512, activation=activations.relu, name='dense_0'),
layers.Dropout(0.2),
layers.Dense(10, activation=activations.softmax, name='dense_1')
])
modelDense0.compile(optimizer=optimizers.Adam(),
loss=losses.sparse_categorical_crossentropy,
metrics=['accuracy'])
modelDense0.summary()
nEpochs = 100
tvPlot = tv.train.PlotMetricsOnEpoch(metrics_name=metricNames,
cell_size=(6,4), columns=2, iter_num=nEpochs, wait_num=2)
historyDense0 = modelDense0.fit(xTrain, yTrain,
epochs=nEpochs, batch_size=batchSize,
validation_split=0.1,
verbose=0,
callbacks=[tvPlot]);
This model has 2 issues:
- Even with the regularization provided by the dropout layer, there is still quite a lot of overfitting: the train and validation performance differ noticeably, and the validation loss worsens from epoch 20 onward
- The variance of the validation loss and metric is high, also hinting at overfitting
loss, acc = modelDense0.evaluate(xTest, yTest, verbose=0)
print("Model Dense #0, test accuracy = %.2f %%" % (acc*100))
weightsDense0 = modelDense0.get_weights()
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
for ax, layerIndex, weightIndex in zip(axes, [0, 1], [0, 2]):
ax.hist(weightsDense0[weightIndex].ravel(), density=False, bins=40)
ax.set_title("Dense layer #%d coefficients" % layerIndex)
Adding more regularization to the Dense network¶
The flatten layer has no trainable coefficients (weights, biases), but it reshapes the topology of the feature space and is actually the widest feature space of the network (784 features at this layer versus 512 at the next). Let's add a Dropout layer on this feature space. Also, as advised in [7], let's raise the hidden layer dropout rate from 20% to 50%.
modelDense1 = models.Sequential([
layers.Flatten(input_shape=(28, 28), name='flatten'),
layers.Dropout(0.2, name='dropout_0'), # <----
layers.Dense(512, activation=activations.relu, name='dense_0'),
layers.Dropout(0.5, name='dropout_1'), # <----
layers.Dense(10, activation=activations.softmax, name='dense_1')
])
modelDense1.compile(optimizer=optimizers.Adam(),
loss=losses.sparse_categorical_crossentropy,
metrics=['accuracy']);
nEpochs = 100
tvPlot = tv.train.PlotMetricsOnEpoch(metrics_name=metricNames,
cell_size=(6,4), columns=2, iter_num=nEpochs, wait_num=2)
historyDense1 = modelDense1.fit(xTrain, yTrain,
epochs=nEpochs, batch_size=batchSize,
validation_split=0.1,
verbose=0,
callbacks=[tvPlot]);
There is still overfitting on this network (delta between train and validation), but the variation of the validation curves is much lower and performance has increased by 0.2%.
The original Dropout paper [7] reports results better than 99% accuracy using a 2-layer dense network with 8192 units per layer.
The validation metrics are better than the training metrics during the first epochs; this is a direct effect of dropout: fewer neurons are active during training than during validation.
Moreover, on the histogram below for layer #0, many coefficients are close to 0. In the following section, we will attempt to decrease the model size while keeping similar performance.
loss, acc = modelDense1.evaluate(xTest, yTest, verbose=0)
print("Model Dense #1, test accuracy = %.2f %%" % (acc*100))
weightsDense1 = modelDense1.get_weights()
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
for ax, layerIndex, weightIndex in zip(axes, [0, 1], [0, 2]):
ax.hist(weightsDense1[weightIndex].ravel(), density=False, bins=40)
ax.set_title("Dense layer #%d coefficients" % layerIndex)
Size optimized Dense model¶
This model seems overdesigned: many neurons of the hidden layer (dense_0) contribute very little. In order to have a fair comparison with the CNN implementation, let's optimize the number of neurons in layer #0.
batchSize = 128
nEpochs = 100
numNeuronsRange = [32, 64, 128, 256]
perfDense2 = []
for numNeurons in numNeuronsRange:
modelDense = models.Sequential([
layers.Flatten(input_shape=(28, 28)),
layers.Dropout(0.2),
layers.Dense(numNeurons, activation=activations.relu, name='dense_0'),
layers.Dropout(0.5),
layers.Dense(10, activation=activations.softmax, name='dense_1')
])
modelDense.compile(optimizer=optimizers.Adam(),
loss=losses.sparse_categorical_crossentropy,
metrics=['accuracy'])
historyDense = modelDense.fit(xTrain, yTrain,
epochs=nEpochs, batch_size=batchSize,
validation_split=0.1,
verbose=0)
perfDense2.append(modelDense.evaluate(xTest, yTest, verbose=0))
print("Accuracy for %d neurons in layer #0 : %.2f %%" % (numNeurons, perfDense2[-1][1] * 100))
plt.plot(numNeuronsRange, np.array(perfDense2)[:,1])
plt.title("Accuracy as function of number of neurons in layer #0")
plt.grid()
Based on the grid search above, as a trade-off between performance and size, we select a first layer that is 128 neurons wide.
modelDense2 = models.Sequential([
layers.Flatten(input_shape=(28, 28), name='flatten'),
layers.Dropout(0.2, name='dropout_0'),
layers.Dense(128, activation=activations.relu, name='dense_0'),
layers.Dropout(0.5, name='dropout_1'),
layers.Dense(10, activation=activations.softmax, name='dense_1')
])
modelDense2.compile(optimizer=optimizers.Adam(),
loss=losses.sparse_categorical_crossentropy,
metrics=['accuracy'])
modelDense2.summary()
nEpochs = 100
tvPlot = tv.train.PlotMetricsOnEpoch(metrics_name=metricNames,
cell_size=(6,4), columns=2, iter_num=nEpochs, wait_num=2)
historyDense2 = modelDense2.fit(xTrain, yTrain,
epochs=nEpochs, batch_size=batchSize,
validation_split=0.1,
verbose=0,
callbacks=[tvPlot]);
The performance of the size-optimized network is 0.4 to 0.5% lower than the original's, but with a quarter of the coefficients.
Also, overfitting is no longer visible on the loss and metric curves.
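As a quick check of the size reduction (a hedged aside; the exact counts are the ones reported by the model summaries), Keras' count_params() can be compared for the two models:
# Total number of coefficients of the original and size-optimized dense models.
print("Dense #0:", modelDense0.count_params())   # about 407k coefficients
print("Dense #2:", modelDense2.count_params())   # about 102k coefficients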
loss, acc = modelDense2.evaluate(xTest, yTest, verbose=0)
yEstDense2 = modelDense2.predict(xTest)
print("Model Dense #1, test accuracy = %.2f %%" % (acc*100))
weightsDense2 = modelDense2.get_weights()
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
for ax, layerIndex, weightIndex in zip(axes, [0, 1], [0, 2]):
ax.hist(weightsDense2[weightIndex].ravel(), density=False, bins=40)
ax.set_title("Dense layer #%d coefficients" % layerIndex)
Evaluation on the Test set¶
fig, ax = plt.subplots(figsize=(9, 8))
plotConfusionMatrix(yTest, np.argmax(yEstDense2, axis=1), digitClasses)
Overall, there are very few misclassifications. The top 2 confusions are 5s estimated as 3s, and 7s estimated as 2s.
Save Dense optimized model¶
Saving this model, to be reused in Part 2 and 3.
modelDense2.save('models/MNIST_dense128.h5')
LeNet model¶
Adapted from [4], in which the activation function is already the ReLU (the original LeNet-5 activation is tanh). The network has two sections: convolutional and dense.
The convolutional section is made of two convolution + average pooling pairs. The first pair applies 6 filters to the input (producing 30x30 feature maps, since the convolution uses no padding); the second applies 16 filters to the 15x15x6 output of the first.
The images are zero-padded to 32x32 pixels at the input of the classifier, and a channel dimension is added to match the input shape expected by the convolution layers.
For an introduction to convolution filters/layers, see [8].
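As a hedged sanity check of these shapes (plain arithmetic; the authoritative figures are in the model summary below):
# Shape walk-through of the convolutional section: a 'valid' 3x3 convolution removes
# 2 pixels per spatial dimension, and each 2x2 average pooling halves the spatial size.
shape = (32, 32, 1)                           # padded input
shape = (shape[0] - 2, shape[1] - 2, 6)       # conv_0, 3x3, 6 filters  -> (30, 30, 6)
shape = (shape[0] // 2, shape[1] // 2, 6)     # average pooling 2x2     -> (15, 15, 6)
shape = (shape[0] - 2, shape[1] - 2, 16)      # conv_1, 3x3, 16 filters -> (13, 13, 16)
shape = (shape[0] // 2, shape[1] // 2, 16)    # average pooling 2x2     -> (6, 6, 16)
print(shape, '->', shape[0] * shape[1] * shape[2], 'features after Flatten')   # 576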
imagePadWidth = 32
imagePadHeight = 32
xTrainPad = np.pad(xTrain.reshape(-1, 28, 28, 1), ((0,0),(2,2),(2,2),(0,0)), 'constant')
xTestPad = np.pad(xTest.reshape(-1, 28, 28, 1), ((0,0),(2,2),(2,2),(0,0)), 'constant')
print("Padded image shape: {}".format(xTrainPad[0].shape))
modelLeNet0 = models.Sequential([
layers.Conv2D(filters=6, kernel_size=(3, 3), activation=activations.relu,
input_shape=(imagePadWidth, imagePadHeight,1), name='conv_0'),
layers.AveragePooling2D(),
layers.Conv2D(filters=16, kernel_size=(3, 3), activation='relu', name='conv_1'),
layers.AveragePooling2D(),
layers.Flatten(),
layers.Dense(units=120, activation=activations.relu, name='dense_0'),
layers.Dense(units=84, activation=activations.relu, name='dense_1'),
layers.Dense(units=10, activation =activations.softmax, name='dense_2')])
modelLeNet0.compile(optimizer=optimizers.Adam(),
loss=losses.sparse_categorical_crossentropy,
metrics=['accuracy'])
modelLeNet0.summary()
nEpochs = 60
tvPlot = tv.train.PlotMetricsOnEpoch(metrics_name=metricNames,
cell_size=(6,4), columns=2, iter_num=nEpochs, wait_num=2)
historyLeNet0 = modelLeNet0.fit(xTrainPad, yTrain,
epochs=nEpochs, batch_size=batchSize,
validation_split=0.1,
verbose=0,
callbacks=[tvPlot]);
Training of this model converges very quickly, but it is also overfitting: the train accuracy reaches 100% while the train loss floors at 0.
loss, acc = modelLeNet0.evaluate(xTestPad, yTest, verbose=0)
print("Model LeNet #0, test accuracy = %.2f %%" % (acc*100))
weightsLeNet0 = modelLeNet0.get_weights()
fig, axes = plt.subplots(2, 3, figsize=(16, 7), sharex=True, sharey=True)
for i, ax in enumerate(axes.ravel()):
plotHeatMap(weightsLeNet0[0][:,:, 0, i], [0, 1, 2], ax=ax, vmin=-1., vmax=1., cbar=False)
axes[0,0].set_title("LeNet's convolution layer #0 weights");
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
for ax, layerIndex, weightIndex in zip(axes, [0, 1, 2], [4, 6, 8]):
ax.hist(weightsLeNet0[weightIndex].ravel(), density=False, bins=40)
ax.set_title("Dense layer #%d coefficients" % layerIndex)
LeNet with L1-L2 regularization¶
In the following model, a mix of the L1 norm (absolute value) and the L2 norm (Euclidean distance) is applied to the first dense layer in order to push the coefficients toward 0. This combination of regularizations is also known as Elastic Net [9]: $$ \hat{\theta} = \underset{\theta \in \mathbb{R}^p}{\operatorname{argmin}}\left[ \mathcal{L}(X, y, \theta) + \lambda_{L_1} \Vert \theta \Vert_1 + \lambda_{L_2} \frac 12 \Vert \theta \Vert_2^2 \right], \quad \lambda_{L_1}, \lambda_{L_2} \in \mathbb{R}^+ $$
With :
- $X$, $y$ the features and labels
- $\theta$ the weights to learn
- $\mathcal{L}(X, y, \theta)$ the loss function
- $\lambda_{L_1}$, $\lambda_{L_2}$ the regularization parameters
L1 and L2 regularizations can be interpreted with Lagrange multipliers as constraints on the norm of the weights. The L1 penalization has iso-levels that are cross-polytopes (diamonds in 2D), whose corners on the axes favor sparse solutions, while the L2 iso-levels are hyperspheres.
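To make the penalty concrete, here is a minimal sketch of the same penalty written as a custom Keras regularizer next to the built-in regularizers.l1_l2 (elastic_net is a hypothetical helper shown for illustration only; the models below use the built-in regularizer):
import tensorflow as tf

def elastic_net(lambdaL1=1e-4, lambdaL2=1e-4):
    """ Hypothetical helper: the Elastic Net penalty of the formula above """
    def penalty(theta):
        return (lambdaL1 * tf.reduce_sum(tf.abs(theta))
                + 0.5 * lambdaL2 * tf.reduce_sum(tf.square(theta)))
    return penalty

# Keras' built-in l1_l2 applies l2 * sum(theta^2) without the 1/2 factor,
# so the factor can be folded into the coefficient:
builtinEquivalent = regularizers.l1_l2(l1=1e-4, l2=0.5 * 1e-4)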
L1-L2 regularization is added to the kernels of the two large dense layers. The regularization of the second layer is less critical and is set to a fixed mild penalization. The L1 and L2 penalization coefficients of the first layer are tuned through a square grid search.
nEpochs = 60
lambdaRange = [0, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2]
historyLeNet1 = []
accLeNet1 = []
lossLeNet1 = []
weightsLeNet1 = []
for lambdaL1 in lambdaRange:
print("Lambda L1=%.0e... L2=" % (lambdaL1), end='')
for lambdaL2 in lambdaRange:
print('%.0e' % lambdaL2, end='', flush=True)
modelLeNet = models.Sequential([
layers.Conv2D(filters=6, kernel_size=(3, 3), activation=activations.relu,
input_shape=(imagePadWidth, imagePadHeight,1), name='conv_0'),
layers.AveragePooling2D(),
layers.Conv2D(filters=16, kernel_size=(3, 3), activation=activations.relu,
name='conv_1'),
layers.AveragePooling2D(),
layers.Flatten(),
layers.Dense(units=120, activation=activations.relu,
kernel_regularizer=regularizers.l1_l2(lambdaL1, lambdaL2), # <---
name='dense_0'),
layers.Dense(units=84, activation=activations.relu,
kernel_regularizer=regularizers.l1_l2(0.000001, 0.000001), # <---
name='dense_1'),
layers.Dense(units=10, activation =activations.softmax,
name='dense_2')])
modelLeNet.compile(optimizer=optimizers.Adam(),
loss=losses.sparse_categorical_crossentropy,
metrics=['accuracy'])
hist = modelLeNet.fit(xTrainPad, yTrain,
epochs=nEpochs, batch_size=batchSize,
validation_split=0.1,
verbose=0)
loss, acc = modelLeNet.evaluate(xTestPad, yTest, verbose=0)
historyLeNet1.append(hist)
accLeNet1.append(acc)
lossLeNet1.append(loss)
weightsLeNet1.append(modelLeNet.get_weights())
print('(%.2f%%), ' % (acc*100), end='', flush=True)
print('', flush=True)
Selected LeNet model with L1-L2 regularization¶
modelLeNet1Index = np.argmax(accLeNet1)
modelLeNet1IndexL1 = modelLeNet1Index // len(lambdaRange)
modelLeNet1IndexL2 = modelLeNet1Index % len(lambdaRange)
print("Selected model lambda L1=%.0e, L2=%.0e, accuracy=%.2f%%" % \
(lambdaRange[modelLeNet1IndexL1], lambdaRange[modelLeNet1IndexL2], accLeNet1[modelLeNet1Index]*100))
plotHistory(historyLeNet1[modelLeNet1Index], True)
The performance of the CNN with L1-L2 regularization is very close to that of the original. As shown in the histograms below, numerous coefficients are close to 0, which leaves room for network size optimization.
weightsLeNet1 = weightsLeNet1[modelLeNet1Index]
fig, axes = plt.subplots(2, 3, figsize=(16, 7), sharex=True, sharey=True)
for i, ax in enumerate(axes.ravel()):
plotHeatMap(weightsLeNet1[0][:,:, 0, i], [0, 1, 2], ax=ax, vmin=-1., vmax=1., cbar=False)
axes[0,0].set_title("LeNet's convolution layer #0 weights");
fig, axes = plt.subplots(1, 3, figsize=(16, 4))
for ax, layerIndex, weightIndex in zip(axes, [0, 1, 2], [4, 6, 8]):
ax.hist(weightsLeNet1[weightIndex].ravel(), density=False, bins=40)
ax.set_title("Dense layer #%d coefficients" % layerIndex)
The effect of the L1-L2 regularization is clearly visible on the left and center histograms, in which a large share of the coefficients is close to 0.
LeNet with Dropout regularization¶
Dropout [7] is another very effective regularization technique in which a share of a layer's units is randomly disabled at each training iteration (here, the pooled and flattened output of convolution layer #1 with its 16 filters, and the output of the first dense layer).
The goal of dropout is to avoid strong links between units of consecutive layers; such links would probably mean that the downstream neuron is merely correcting a bias of its upstream counterpart.
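As a small, standalone illustration (not part of the model), Keras' Dropout layer only disables units when called in training mode and rescales the surviving activations by 1/(1 - rate), so that the expected activation is unchanged at inference time:
import tensorflow as tf

x = tf.ones((1, 8))
drop = layers.Dropout(0.5)
print(drop(x, training=False).numpy())   # inference: all ones, dropout inactive
print(drop(x, training=True).numpy())    # training: about half zeros, survivors scaled to 2.0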
Let's insert a dropout before each of the 2 large dense layers of the LeNet-5 model. Dropout rates are tuned through a grid search.
nEpochs = 60
dropoutRange = [0, 0.1, 0.2, 0.3, 0.4, 0.5]
historyLeNet2 = []
accLeNet2 = []
lossLeNet2 = []
weightsLeNet2 = []
for dropoutRate1 in dropoutRange:
print("Dropout rate #1 = %.2f...#2 = " % (dropoutRate1), end='')
for dropoutRate2 in dropoutRange:
print('%.2f' % dropoutRate2, end='', flush=True)
modelLeNet = models.Sequential([
layers.Conv2D(filters=6, kernel_size=(3, 3), activation=activations.relu,
input_shape=(imagePadWidth, imagePadHeight,1), name='conv_0'),
layers.AveragePooling2D(),
layers.Conv2D(filters=16, kernel_size=(3, 3), activation=activations.relu,
name='conv_1'),
layers.AveragePooling2D(),
layers.Flatten(),
layers.Dropout(dropoutRate1, name='dropout_0'),
layers.Dense(units=120, activation=activations.relu, name='dense_0'),
layers.Dropout(dropoutRate2, name='dropout_1'),
layers.Dense(units=84, activation=activations.relu, name='dense_1'),
layers.Dense(units=10, activation =activations.softmax, name='dense_2')])
modelLeNet.compile(optimizer=optimizers.Adam(),
loss=losses.sparse_categorical_crossentropy,
metrics=['accuracy'])
hist = modelLeNet.fit(xTrainPad, yTrain,
epochs=nEpochs, batch_size=batchSize,
validation_split=0.1,
verbose=0)
loss, acc = modelLeNet.evaluate(xTestPad, yTest, verbose=0)
historyLeNet2.append(hist)
accLeNet2.append(acc)
lossLeNet2.append(loss)
weightsLeNet2.append(modelLeNet.get_weights())
print('(%.2f%%), ' % (acc*100), end='', flush=True)
print('', flush=True)
Selected model with Dropout regularization¶
modelLeNet2Index = np.argmax(accLeNet2)
modelLeNet2Dropout1 = dropoutRange[modelLeNet2Index // len(dropoutRange)]
modelLeNet2Dropout2 = dropoutRange[modelLeNet2Index % len(dropoutRange)]
print("Selected model dropout rate #1=%.2f, #2=%.2f, accuracy=%.2f%%" % \
(modelLeNet2Dropout1, modelLeNet2Dropout2, accLeNet2[modelLeNet2Index]*100))
plotHistory(historyLeNet2[modelLeNet2Index], True)
weightsLeNet2 = weightsLeNet2[modelLeNet2Index]
fig, axes = plt.subplots(2, 3, figsize=(16, 7), sharex=True, sharey=True)
for i, ax in enumerate(axes.ravel()):
plotHeatMap(weightsLeNet2[0][:,:, 0, i], [0, 1, 2], ax=ax, vmin=-1., vmax=1., cbar=False)
axes[0,0].set_title("LeNet's convolution layer #0 weights");
fig, axes = plt.subplots(1, 3, figsize=(16, 4))
for ax, layerIndex, weightIndex in zip(axes, [0, 1, 2], [4, 6, 8]):
ax.hist(weightsLeNet2[weightIndex].ravel(), density=False, bins=40)
ax.set_title("Dense layer #%d coefficients" % layerIndex)
While L1-L2 regularization brings very little improvement, Dropout gains about 0.3% of accuracy. In the following, the same Dropout setup is used with a smaller network.
Size optimized LeNet with Dropout¶
The two large dense layers are halved (120 → 60 and 84 → 42 units), keeping the Dropout rates selected by the grid search above.
modelLeNet3 = models.Sequential([
layers.Conv2D(filters=6, kernel_size=(3, 3), activation=activations.relu,
input_shape=(imagePadWidth, imagePadHeight, 1),
name='conv_0'),
layers.AveragePooling2D(),
layers.Conv2D(filters=16, kernel_size=(3, 3), activation=activations.relu,
name='conv_1'),
layers.AveragePooling2D(),
layers.Flatten(name='flatten'),
layers.Dropout(modelLeNet2Dropout1, name='dropout_0'),
layers.Dense(units=60, activation=activations.relu, # <---
name='dense_0'),
layers.Dropout(modelLeNet2Dropout2, name='dropout_1'),
layers.Dense(units=42, activation=activations.relu, # <---
name='dense_1'),
layers.Dense(units=10, activation=activations.softmax,
name='dense_2')])
modelLeNet3.compile(optimizer=optimizers.Adam(),
loss=losses.sparse_categorical_crossentropy,
metrics=['accuracy'])
modelLeNet3.summary()
nEpochs = 80
tvPlot = tv.train.PlotMetricsOnEpoch(metrics_name=metricNames,
cell_size=(6,4), columns=2, iter_num=nEpochs, wait_num=2)
historyLeNet3 = modelLeNet3.fit(xTrainPad, yTrain,
epochs=nEpochs, batch_size=batchSize,
validation_split=0.1,
verbose=0, callbacks=[tvPlot]);
This model has better performance than the original LeNet-5, but with half of its coefficients.
loss, acc = modelLeNet3.evaluate(xTestPad, yTest, verbose=0)
print("Model LeNet #3, test accuracy = %.2f %%" % (acc*100))
weightsLeNet3 = modelLeNet3.get_weights()
fig, axes = plt.subplots(2, 3, figsize=(16, 7), sharex=True, sharey=True)
for i, ax in enumerate(axes.ravel()):
plotHeatMap(weightsLeNet3[0][:,:, 0, i], [0, 1, 2], ax=ax, vmin=-1., vmax=1., cbar=False)
axes[0,0].set_title("LeNet's convolution layer #0 weights");
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
for ax, layerIndex, weightIndex in zip(axes, [0, 1, 2], [4, 6, 8]):
ax.hist(weightsLeNet3[weightIndex].ravel(), density=False, bins=40)
ax.set_title("Dense layer #%d coefficients" % layerIndex)
Save model¶
modelLeNet3.save('models/MNIST_LeNet60.h5')
Conclusion¶
In this first part, we have designed Dense and "LeNet" classifiers for the MNIST digits and optimized them for performance and size. Here are our findings:
Network | Number of coefficients | Regularization | Accuracy on MNIST test
---|---|---|---
Dense "Tensorflow tutorial" | 407,050 | Single Dropout | ~98.4%
Dense with improved regularization | 407,050 | 2 Dropout layers | ~98.6%
Dense with 128 neurons | 101,770 | 2 Dropout layers | ~97.9%
LeNet with ReLU | 81,194 | No regularization | ~99.1%
LeNet with ReLU and Elastic Net | 81,194 | L1-L2 regularization on 2 dense layers | ~99.1%
LeNet with ReLU and Dropout | 81,194 | Dropout before 2 dense layers | ~99.4%
LeNet with ReLU and Dropout, halved dense layers | 38,552 | Dropout before 2 dense layers | ~99.2%
In this notebook, two regularization techniques were used in order to tune generalization (i.e. fight overfitting) and performance. Dropout has been shown to be more effective than the classical L1-L2 penalization. Dropout is also easier to tune: its hyper-parameter seems less tied to the model architecture than the hyper-parameters of the L1-L2 penalization. Note that there are other regularization techniques, such as adding (white) noise to the gradient, or early stopping.
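For reference, a minimal sketch of early stopping with a Keras callback (illustrative parameters; this callback is not used in the experiments above):
from tensorflow.keras import callbacks

# Stop training when the validation loss has not improved for 5 epochs and restore
# the best weights seen so far.
earlyStop = callbacks.EarlyStopping(monitor='val_loss', patience=5,
                                    restore_best_weights=True)
# model.fit(xTrain, yTrain, validation_split=0.1, callbacks=[earlyStop], ...)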
It has also been shown that decreasing the number of neurons does not always decrease performance. It might lead to a better trade-off between model complexity (and execution cost in production), overfitting and performance.
The convolutional neural network performs better, and at a lower model complexity, than the equivalent Dense (fully connected) network.
It is worth noting that any model downloaded from the Internet should be evaluated and tuned as it might be sub-optimal in performance and complexity.
Where to go from here¶
- Dense vs. Convolutional Part-2 Visualization (HTML / Jupyter) using UMAP to view the Dense layers of the two networks
- Dense vs. Convolutional Part-3 coming soon, we will study the impact of geometry transformations (translation, rotation, scale...) on the two kinds of networks
- Classification of a 2D problem based on the Czech and Norwegian flags (HTML / Jupyter)
- Using convolutional layer as a linear filter to detect edges - Part 1 (HTML / Jupyter)
References:¶
- Gradient-Based Learning Applied to Document Recognition, LeCun et al. - http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf
- Demonstrations on the LeNet classification of digits, LeCun et al. - http://yann.lecun.com/exdb/lenet/index.html
- Dense implementation of the MNIST classifier - https://www.tensorflow.org/tensorboard/get_started
- LeNet implementation with Tensorflow Keras - https://colab.research.google.com/drive/1CVm50PGE4vhtB5I_a_yc4h5F-itKOVL9
- LeNet implementation with Tensorflow 1.0 - https://github.com/udacity/CarND-LeNet-Lab/blob/master/LeNet-Lab-Solution.ipynb
- MNIST dataset - http://yann.lecun.com/exdb/mnist/
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Nitish Srivastava et al. - http://jmlr.org/papers/v15/srivastava14a.html
- A Beginner’s Guide to Convolutional Neural Networks (CNNs), Suhyun Kim - https://towardsdatascience.com/a-beginners-guide-to-convolutional-neural-networks-cnns-14649dbddce8
- Regularization and variable selection via the elastic net, Hui Zou and Trevor Hastie - https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.124.4696