Local approximation of a bivariate polynomial expression using linear regression¶
Given two features, extract y (the label) and perform a linear regression. Compared to the single-feature model, two features add complexity, not only to the maths but also to the display through charts.
Learning goals:¶
- Analytically solve the linear regression and implement it with simple "raw" Python
- Use powers of the features to increase the model capacity
- Use some off-the-shelf methods of NumPy
- Switch to iterative methods with gradient descent (GD) and stochastic gradient descent (SGD)
- SGD with powers of the features
- Add L2 regularizer to the SGD with powers to improve convergence and fit
import numpy as np
from numpy import random
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.preprocessing import normalize as skNormalize
import pandas
In 2D, many points would be required to uniformly cover the x plane
It is therefore preferable to draw the x points from a 2D uniform distribution
Plotting y as a function of $x_0, x_1$ is then more challenging, as $x_0$ and $x_1$ are not continuous monotonic vectors
# f(x) as a bivariate polynomial: row i holds the degree-(i+1) coefficients of [x0, x1]
fPoly = np.array([[-0.002, 0], [-0.06, 0.1], [0.2, 0], [1, 0]])
# Generator of noisy samples of the bivariate polynomial
def generateBatch(N):
    # Feature bounds for the uniform draw
    xMin = np.array([0, -0.5])
    xMax = np.array([0.5, 0.5])
    # Offset and noise standard deviation
    b = 0.35
    std = 0.01
    # Draw x from a 2D uniform distribution
    x = random.uniform(xMin, xMax, (N, 2))
    yClean = (x[:,0]-0.2)**4 + (x[:,0]-0.1)**3 + 0.1*x[:,1]**2 + b
    y = yClean + random.normal(0, std, N)
    return (x, y, yClean)
N = 100000
xTrain, yTrain, yTrainClean = generateBatch(N);
print('x', xTrain.shape, ', yClean', yTrainClean.shape, ', y', yTrain.shape);
xTest, yTest, yTestClean = generateBatch(N);
fig = plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
plt.scatter(xTrain[:,0], yTrainClean, marker='.');
plt.xlabel('x0')
plt.subplot(1,2,2)
plt.scatter(xTrain[:,1], yTrainClean, marker='.');
plt.xlabel('x1')
plt.title('Bivariate model without noise');
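Since $x_0$ and $x_1$ are scattered over the plane, the two scatter plots above only show marginal projections. As an illustration, a 3D scatter or a triangulated contour view gives a fuller picture of the surface; below is a minimal sketch reusing xTrain and yTrainClean (the subsampling step of 100 is only there to keep the 3D scatter light)
from mpl_toolkits.mplot3d import Axes3D  # enables the '3d' projection on older Matplotlib versions
# Optional fuller view of the surface: 3D scatter and triangulated contour
fig = plt.figure(figsize=(10,4))
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.scatter(xTrain[::100,0], xTrain[::100,1], yTrainClean[::100], marker='.')
ax.set_xlabel('x0'); ax.set_ylabel('x1'); ax.set_zlabel('y')
plt.subplot(1,2,2)
plt.tricontourf(xTrain[:,0], xTrain[:,1], yTrainClean, levels=20)
plt.colorbar(); plt.xlabel('x0'); plt.ylabel('x1');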
It looks like a toboggan from the side ($x_0$) and from the front ($x_1$)
Remove the biases (center the data)
xUnB = xTrain - np.mean(xTrain, axis=0)
yUnB = yTrain - np.mean(yTrain)
Analytically / closed form¶
Linear, 1st degree approximation of Y: \begin{align} Y = X W + b \end{align}
- $X$ is a $N_{sample} \times N_{feature}$ matrix
- $W$ is a $N_{feature}$ vector
- $Y$ is a $N_{sample}$ vector
- $b$ is a scalar
To simplify the calculation, let's stack $b$ on top of $W$ and prepend a $\mathbb{1}_n$ column to $X$: \begin{align} \Theta &= \begin{bmatrix} b \\ W \\ \end{bmatrix} \\ X_m &= \begin{bmatrix} \mathbb{1}_n & X \\ \end{bmatrix} \\ Y &= X_m \Theta \\ \end{align}
Using the mean squared error (Euclidean norm), we are looking for $\Theta_{est}$ such that:
\begin{align} \Theta_{est} \in argmin_\Theta \lVert X_m \Theta - Y \rVert_2^2 \end{align}Let's define f, the function to minimize: \begin{align} f(\Theta) & = \lVert X_m \Theta - Y \rVert_2^2 \\ & = (X_m \Theta - Y)^T(X_m \Theta - Y) \\ & = \Theta^T X_m^T X_m \Theta + Y^T Y - 2 \Theta^T X_m^T Y \end{align}
Find the minimum through the gradient:
$$\nabla_\Theta f(\Theta) = 2 X_m^T X_m \Theta - 2 X_m^T Y$$If $X_m^T X_m$ can be inverted: $$\nabla_\Theta f(\Theta) = 0 \iff \Theta = (X_m^T X_m)^{-1} X_m^T Y$$
With NumPy "primitive" operations¶
xTrainW1 = np.concatenate((np.ones(N).reshape(-1, 1), xTrain), axis=1)
xTxInv = np.linalg.inv(np.matmul(xTrainW1.T, xTrainW1))
wEstWb = np.matmul(xTxInv, np.matmul(xTrainW1.T, yTrain))
bEst = wEstWb[0]
wEst = wEstWb[1:]
print("Intercept: %.4f, coefficients: " % bEst, wEst)
Test model¶
yEst1 = np.matmul(xTest, wEst) + bEst
fig = plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
plt.scatter(xTest[:,0], yTest, marker='.');
plt.scatter(xTest[:,0], yEst1, marker='.', alpha=0.05);
plt.subplot(1,2,2)
plt.scatter(xTest[:,1], yTest, marker='.');
plt.scatter(xTest[:,1], yEst1, marker='.', alpha=0.05);
# Mean squared error computed by hand
mse1 = np.dot(yTest-yEst1, yTest-yEst1) / N
print('Linear regression MSE = {:.3e}'.format(mse1));
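As a cross-check, the hand-computed MSE can be compared with the scikit-learn helper already imported above (and used for the later models); this is only a verification sketch
# Should match the hand-computed value up to rounding
print('sklearn MSE = {:.3e}'.format(metrics.mean_squared_error(yTest, yEst1)));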
The toboggan has been approximated by a plane
Home-made with powers¶
Adding some flexibility to the model by using powers of the features
xTrain2 = np.concatenate((np.ones(N).reshape(-1, 1), xTrain, xTrain**2, xTrain**3, xTrain**4), axis=1)
xTrain2.shape
x2Tx2Inv = np.linalg.inv(np.matmul(xTrain2.T, xTrain2))
w2Wb = np.matmul(x2Tx2Inv,np.matmul(xTrain2.T, yTrain))
b2 = w2Wb[0]
w2 = w2Wb[1:]
print("Intercept: %.4f, coefficients: " % b2, w2)
Test model¶
xTest2 = np.concatenate((xTest, xTest**2, xTest**3, xTest**4), axis=1)
yEst2 = np.matmul(xTest2, w2) + b2
fig = plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
plt.scatter(xTest2[:,0], yTest, marker='.');
plt.scatter(xTest2[:,0], yEst2, marker='.', alpha=0.05);
plt.subplot(1,2,2)
plt.scatter(xTest2[:,1], yTest, marker='.');
plt.scatter(xTest2[:,1], yEst2, marker='.', alpha=0.05);
mse2 = np.dot(yTest-yEst2, yTest-yEst2) / N
mse2Clean = np.dot(yTestClean-yEst2, yTestClean-yEst2) / N
print('Linear regression with powers, MSE = {:.3e}, MSE to clean test data = {:.3e}'.format(mse2, mse2Clean));
The polynomial model of y has been well approximated.
The MSE on the noisy sample should not be lower than the variance of the noise, which is $(10^{-2})^2=10^{-4}$
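As a sanity check, the gap between the MSE on the noisy labels and the MSE on the clean labels should be close to that noise variance, since the noise is independent of the prediction; a minimal check with the quantities computed above
# mse2 - mse2Clean should be close to the noise variance (0.01**2 = 1e-4)
print('mse2 - mse2Clean = {:.3e}, noise variance = {:.1e}'.format(mse2 - mse2Clean, 0.01**2));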
NumPy least squares¶
fit4, residues, rank, s = np.linalg.lstsq(xUnB, yUnB, rcond=None)
fit4, rank
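Since lstsq was applied to the centered data, it returns only the slopes; the corresponding intercept can be recovered as $\bar{y} - \bar{x} \cdot w$ and should match bEst from the closed form, which is reused in the test below. A minimal sketch; b4 is just an illustrative name
# Recover the intercept of the un-centered model from the centered fit
b4 = np.mean(yTrain) - np.dot(np.mean(xTrain, axis=0), fit4)
print('b4 = {:.4f}, bEst = {:.4f}'.format(b4, bEst))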
Test model¶
# Reuse the intercept from the closed form (the slopes were fit on centered data)
yEst4 = np.matmul(xTest, fit4) + bEst
fig = plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
plt.scatter(xTest[:,0], yTest, marker='.');
plt.scatter(xTest[:,0], yEst4, marker='.', alpha=0.05);
plt.subplot(1,2,2)
plt.scatter(xTest[:,1], yTest, marker='.');
plt.scatter(xTest[:,1], yEst4, marker='.', alpha=0.05);
mse4 = metrics.mean_squared_error(yTest, yEst4)
print('Numpy least square, MSE = {:.3e}'.format(mse4) );
Gradient descent¶
Attempt to find a value of $w$ for which the gradient is null (indicating a minimum, or a saddle point)
Loop over the training data until the gradient norm is below a threshold
The gradient on $w$ for a linear regression is (dropping the constant factor of 2, which is absorbed into the learning rate): \begin{align} \nabla_{w} = X^T (X w - y) \end{align}
# Gradient computation
def calcGradient(x, y, w):
    # Return the gradient and its squared L2 norm (called 'gradient norm' below)
    gradient = np.matmul(x.T, np.matmul(x, w) - y)
    return gradient, np.sum(gradient**2)
# Plot helper to show the value trajectory against a target reference
def plotToRef(label, nRows, nCols, index, ref, values):
    nIter = len(values)
    r = range(nIter)
    plt.subplot(nRows, nCols, index)
    plt.title(label)
    # Plot the values together with a horizontal line at the reference
    plt.plot(r, values, r, np.ones((nIter))*ref, alpha=0.5)
    plt.grid();
Standard gradient descent¶
Loop over the same training data (full batch)
w6 = np.ones(3)
threshold = 1e-6
learningRate = 1e-6
gradient, gradientNorm = calcGradient(xTrainW1, yTrain, w6)
print('w start =', w6, ', Gradient norm =', gradientNorm)
w6Learn = [np.concatenate((w6, [gradientNorm]))]
while gradientNorm > threshold:
    # Step in the direction opposite to the gradient
    w6 = w6 - learningRate * gradient
    gradient, gradientNorm = calcGradient(xTrainW1, yTrain, w6)
    w6Learn.append(np.concatenate((w6, [gradientNorm])))
print('w end =', w6, ', Gradient norm =', gradientNorm, ', num iteration =', len(w6Learn))
df6 = pandas.DataFrame(w6Learn, columns = ('b', 'w0', 'w1', 'Gradient norm'))
w6.shape
fig = plt.figure(figsize=(16,10))
plotToRef('$b$', 2,3,1, bEst, df6['b'])
plotToRef('$w_0$', 2,3,2, wEst[0], df6['w0'])
plotToRef('$w_1$', 2,3,3, wEst[1], df6['w1'])
plt.subplot(2,3,4)
plt.semilogy(df6["Gradient norm"], label='Gradient norm')
plt.grid()
plt.legend();
fig = plt.figure(figsize=(16,4))
plt.subplot(1,2,1)
plt.semilogy(df6['b'], df6['Gradient norm'])
plt.xlabel('b')
plt.ylabel('Gradient norm')
plt.grid();
plt.title('Trajectory of the Gradient norm during convergence')
plt.subplot(1,2,2)
plt.semilogy(df6['w0'], df6['Gradient norm'])
plt.semilogy(df6['w1'], df6['Gradient norm'])
plt.grid();
plt.xlabel('w')
plt.ylabel('Gradient norm')
plt.legend(("$w_0$", "$w_1$"));
Test model¶
yEst6 = np.matmul(xTest, w6[1:]) + w6[0]
fig = plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
plt.scatter(xTest[:,0], yTestClean, marker='.')
plt.scatter(xTest[:,0], yEst6, marker='.', alpha=0.05)
plt.title('x0')
plt.legend(('ori clean', 'estimated'))
plt.subplot(1,2,2)
plt.scatter(xTest[:,1], yTestClean, marker='.')
plt.scatter(xTest[:,1], yEst6, marker='.', alpha=0.05)
plt.title('x1')
plt.legend(('ori clean', 'estimated'))
mse6 = metrics.mean_squared_error(yTest, yEst6)
print('Gradient descent MSE = {:.3e}'.format(mse6));
Stochastic gradient descent¶
Loop over newly generated data, using smaller batches
Nbatch = 1000
w7 = np.ones(3)
threshold = 1e-4
learningRate = 1e-4
nIterMax = 1e5
onesBatch = np.ones((Nbatch, 1))
# Initial batch
xBatch, yBatch, yBC = generateBatch(Nbatch)
gradient, gradientNorm = calcGradient(np.concatenate((onesBatch, xBatch), axis=1), yBatch, w7)
print('w start =', w7, ', Gradient norm =', gradientNorm)
w7Learn = [np.concatenate((w7, [gradientNorm]))]
# Continue
nIter = 0
while (gradientNorm > threshold) and (nIter <= nIterMax):
    w7 = w7 - learningRate * gradient
    # Draw a fresh batch at each iteration
    xBatch, yBatch, yBC = generateBatch(Nbatch)
    gradient, gradientNorm = calcGradient(np.concatenate((onesBatch, xBatch), axis=1), yBatch, w7)
    w7Learn.append(np.concatenate((w7, [gradientNorm])))
    # Slowly decay the learning rate
    learningRate = learningRate * 0.9999
    nIter += 1
print('w end =', w7, ', Gradient norm =', gradientNorm, ', num iteration =', len(w7Learn))
df7 = pandas.DataFrame(w7Learn, columns = ('b', 'w0', 'w1', 'Gradient norm'))
fig = plt.figure(figsize=(16,10))
plotToRef('$b$', 2,3,1, bEst, df7['b'])
plotToRef('$w_0$', 2,3,2, wEst[0], df7['w0'])
plotToRef('$w_1$', 2,3,3, wEst[1], df7['w1'])
plt.subplot(2,3,4)
plt.semilogy(df7['Gradient norm'], label='Gradient norm')
plt.grid()
plt.legend();
fig = plt.figure(figsize=(16,4))
plt.subplot(1,2,1)
plt.semilogy(df7['b'], df7['Gradient norm'])
plt.xlabel('b')
plt.ylabel('Gradient norm')
plt.grid();
plt.title('Trajectory of the Gradient norm during convergence')
plt.subplot(1,2,2)
plt.semilogy(df7['w0'], df7['Gradient norm'])
plt.semilogy(df7['w1'], df7['Gradient norm'])
plt.grid();
plt.xlabel('w')
plt.ylabel('Gradient norm')
plt.legend(("$w_0$", "$w_1$"));
Test model¶
yEst7 = np.matmul(xTest, w7[1:]) + w7[0]
fig = plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
plt.scatter(xTest[:,0], yTestClean, marker='.')
plt.scatter(xTest[:,0], yEst7, marker='.', alpha=0.05)
plt.title('$x_0$')
plt.legend(('ori clean', 'estimated'))
plt.subplot(1,2,2)
plt.scatter(xTest[:,1], yTestClean, marker='.')
plt.scatter(xTest[:,1], yEst7, marker='.', alpha=0.05)
plt.title('$x_1$')
plt.legend(('ori clean', 'estimated'))
mse7 = metrics.mean_squared_error(yTest, yEst7)
print('Stochastic gradient descent MSE = {:.3e}'.format(mse7));
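Note that each step above draws a fresh batch from the generator; the more common SGD setting samples mini-batches from a fixed training set. A minimal sketch of that variant on the first-degree model, assuming the helper name sampleBatch and reusing calcGradient (the iteration count and learning rate are illustrative, not tuned)
# Mini-batch SGD drawing batches from the fixed training set
def sampleBatch(x, y, batchSize):
    # Sample batchSize rows (with replacement) and prepend the ones column
    idx = random.randint(0, len(y), batchSize)
    return np.concatenate((np.ones((batchSize, 1)), x[idx]), axis=1), y[idx]

wMb = np.ones(3)
lr = 1e-4
for _ in range(20000):
    xB, yB = sampleBatch(xTrain, yTrain, 1000)
    grad, _ = calcGradient(xB, yB, wMb)
    wMb = wMb - lr * grad
print('Mini-batch SGD w =', wMb, ', closed-form reference =', bEst, wEst)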
SGD with powers¶
Adding powers of x to the stochastic gradient descent
# Batch size
nBatch = 1000
# Initial coefficients (fixed and tuned manually for this trial)
w8 = np.ones(9)* 1.5
# Gradient descent end threshold (on the gradient)
threshold = 1e-2
# Initial learning rate
learningRate = 1e-4
# Maximum number of iterations
nIterMax = 1e5
onesBatch = np.ones((nBatch, 1))
# Initial batch
xBatch, yBatch, yBC = generateBatch(nBatch)
xBatch2 = np.concatenate((onesBatch, xBatch, xBatch**2, xBatch**3, xBatch**4), axis = 1)
gradient, gradientNorm = calcGradient(xBatch2, yBatch, w8)
print('w start =', w8, ', Gradient norm = {:.3e}'.format(gradientNorm))
w8Learn = [np.concatenate((w8, [gradientNorm]))]
# Continue
nIter = 0
while (gradientNorm > threshold) and (nIter < nIterMax):
    w8 = w8 - learningRate * gradient
    # Draw a fresh batch and rebuild the power features
    xBatch, yBatch, yBC = generateBatch(nBatch)
    xBatch2 = np.concatenate((onesBatch, xBatch, xBatch**2, xBatch**3, xBatch**4), axis = 1)
    gradient, gradientNorm = calcGradient(xBatch2, yBatch, w8)
    w8Learn.append(np.concatenate((w8, [gradientNorm])))
    # Slowly decay the learning rate
    learningRate = learningRate * 0.9999
    nIter += 1
print('w end =', w8, ', Gradient norm = {:.3e}'.format(gradientNorm), ', num iteration =', len(w8Learn))
df8 = pandas.DataFrame(w8Learn, columns = ('b', 'w0_1', 'w1_1', 'w0_2', 'w1_2', 'w0_3', 'w1_3', 'w0_4', 'w1_4', 'Gradient norm'))
fig = plt.figure(figsize=(12,20))
plotToRef('$w_{0,1}$', 5,2,1, fPoly[0,0], df8['w0_1'])
plotToRef('$w_{1,1}$', 5,2,2, fPoly[0,1], df8['w1_1'])
plotToRef('$w_{0,2}$', 5,2,3, fPoly[1,0], df8['w0_2'])
plotToRef('$w_{1,2}$', 5,2,4, fPoly[1,1], df8['w1_2'])
plotToRef('$w_{0,3}$', 5,2,5, fPoly[2,0], df8['w0_3'])
plotToRef('$w_{1,3}$', 5,2,6, fPoly[2,1], df8['w1_3'])
plotToRef('$w_{0,4}$', 5,2,7, fPoly[3,0], df8['w0_4'])
plotToRef('$w_{1,4}$', 5,2,8, fPoly[3,1], df8['w1_4'])
plotToRef('$b$', 5,2,9, bEst, df8['b'])
plt.subplot(5,2,10)
plt.semilogy(df8['Gradient norm'], label='Gradient norm')
plt.grid()
plt.legend();
Convergence is not correct on the $x_1$ coefficients, which should only have a small value on the 2nd-order term and no coefficient on the 4th-order term $w_{1,4}$
Test model¶
yEst8 = np.matmul(xTest2, w8[1:]) + w8[0]
fig = plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
plt.scatter(xTest[:,0], yTestClean, marker='.')
plt.scatter(xTest[:,0], yEst8, marker='.', alpha=0.05)
plt.title('$x_0$')
plt.legend(('ori clean', 'estimated'))
plt.subplot(1,2,2)
plt.scatter(xTest[:,1], yTestClean, marker='.')
plt.scatter(xTest[:,1], yEst8, marker='.', alpha=0.05)
plt.title('$x_1$')
plt.legend(('ori clean', 'estimated'))
mse8 = metrics.mean_squared_error(yTest, yEst8)
print('SGD with powers, MSE = {:.3e}'.format(mse8));
Conclusion on SGD with powers¶
Fitting the 3rd- and 4th-degree terms takes more time, leading to an initial "w" shape on the right-hand graph ($x_1$).
The MSE is improved, but the optimizer is harder to tune
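One common way to ease the tuning is to rescale the features: the higher powers of $x$ are much smaller than the first-degree terms, which makes $X^T X$ ill-conditioned and slows gradient descent. A minimal illustration of the issue on the last batch (xBatch2Std is an illustrative name and is not used in the rest of the notebook)
# Column scales of the power features are very uneven
print('Feature std by column:', np.round(xBatch2.std(axis=0), 4))
# Standardizing the feature columns (keeping the ones column) evens them out
xBatch2Std = xBatch2.copy()
xBatch2Std[:, 1:] = (xBatch2[:, 1:] - xBatch2[:, 1:].mean(axis=0)) / xBatch2[:, 1:].std(axis=0)
print('Condition number of X^T X: raw = {:.1e}, standardized = {:.1e}'.format(
    np.linalg.cond(np.matmul(xBatch2.T, xBatch2)),
    np.linalg.cond(np.matmul(xBatch2Std.T, xBatch2Std))))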
SGD with powers and regularizer¶
Let's add a regularizer to the SGD with powers, using the Euclidean norm (L2)
The gradient on $w$ for a linear regression with an L2 regularizer is (same convention on the factor of 2): \begin{align} \nabla_{w} = X^T (X w - y) + \lambda w \end{align}
# Gradient computation with L2 regularizer
def calcGradientRegularizeL2(x, y, w, lam):
    # Same gradient as before, plus the L2 penalty term lam * w
    gradient = np.matmul(x.T, np.matmul(x, w) - y) + lam * w
    return gradient, np.sum(gradient**2)
nBatch = 1000
w9 = np.ones(9)* 1.5
threshold = 1e-2
learningRate = 1e-4
nIterMax = 1e5
# Regularizer param
lambdaReg = 0.01
# Initial batch
xBatch, yBatch, yBC = generateBatch(nBatch)
xBatch2 = np.concatenate((onesBatch, xBatch, xBatch**2, xBatch**3, xBatch**4), axis = 1)
gradient, gradientNorm = calcGradientRegularizeL2(xBatch2, yBatch, w9, lambdaReg)
print('w start =', w9, ', Gradient norm = {:.3e}'.format(gradientNorm))
w9Learn = [np.concatenate((w9, [gradientNorm]))]
# Continue
nIter = 0
while (gradientNorm > threshold) and (nIter < nIterMax):
    w9 = w9 - learningRate * gradient
    # Draw a fresh batch and rebuild the power features
    xBatch, yBatch, yBC = generateBatch(nBatch)
    xBatch2 = np.concatenate((onesBatch, xBatch, xBatch**2, xBatch**3, xBatch**4), axis = 1)
    gradient, gradientNorm = calcGradientRegularizeL2(xBatch2, yBatch, w9, lambdaReg)
    w9Learn.append(np.concatenate((w9, [gradientNorm])))
    # No learning-rate decay in this trial
    # learningRate = learningRate * 0.9999
    nIter += 1
print('w end =', w9, ', Gradient norm = {:.3e}'.format(gradientNorm), ', num iteration =', len(w9Learn))
df9 = pandas.DataFrame(w9Learn, columns = ('b', 'w0_1', 'w1_1', 'w0_2', 'w1_2', 'w0_3', 'w1_3', 'w0_4', 'w1_4', 'Gradient norm'))
fig = plt.figure(figsize=(12,20))
plotToRef('$w_{0,1}$', 5,2,1, fPoly[0,0], df9['w0_1'])
plotToRef('$w_{1,1}$', 5,2,2, fPoly[0,1], df9['w1_1'])
plotToRef('$w_{0,2}$', 5,2,3, fPoly[1,0], df9['w0_2'])
plotToRef('$w_{1,2}$', 5,2,4, fPoly[1,1], df9['w1_2'])
plotToRef('$w_{0,3}$', 5,2,5, fPoly[2,0], df9['w0_3'])
plotToRef('$w_{1,3}$', 5,2,6, fPoly[2,1], df9['w1_3'])
plotToRef('$w_{0,4}$', 5,2,7, fPoly[3,0], df9['w0_4'])
plotToRef('$w_{1,4}$', 5,2,8, fPoly[3,1], df9['w1_4'])
plotToRef('$b$', 5,2,9, bEst, df9['b'])
plt.subplot(5,2,10)
plt.semilogy(df9['Gradient norm'], label='Gradient norm')
plt.grid()
plt.legend();
Test model¶
yEst9 = np.matmul(xTest2, w9[1:]) + w9[0]
fig = plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
plt.scatter(xTest[:,0], yTestClean, marker='.')
plt.scatter(xTest[:,0], yEst9, marker='.', alpha=0.05)
plt.title('x0')
plt.legend(('ori clean', 'estimated'))
plt.subplot(1,2,2)
plt.scatter(xTest[:,1], yTestClean, marker='.')
plt.scatter(xTest[:,1], yEst9, marker='.', alpha=0.05)
plt.title('x1')
plt.legend(('ori clean', 'estimated'))
mse9 = metrics.mean_squared_error(yTest, yEst9)
print('Stochastic gradient descent with L2 regularizer, MSE = {:.3e}'.format(mse9));
Conclusion on SGD with regularizer¶
The regularizer has a larger effect on the 3rd- and 4th-degree terms, leading to a fit that mostly uses the 1st- and 2nd-degree terms (parabolic)
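For reference, a similar ridge (L2) regression on the same power features can be obtained in closed form with scikit-learn. This is only a cross-check sketch using sklearn.linear_model.Ridge; its alpha is not directly comparable to lambdaReg, since the hand-made gradient works per batch and drops the factor of 2, and xTrainPow / yEstRidge are illustrative names
from sklearn.linear_model import Ridge

# Closed-form ridge regression on the same degree-4 power features
xTrainPow = np.concatenate((xTrain, xTrain**2, xTrain**3, xTrain**4), axis=1)
ridge = Ridge(alpha=lambdaReg)
ridge.fit(xTrainPow, yTrain)
yEstRidge = ridge.predict(xTest2)
print('Intercept: %.4f, coefficients:' % ridge.intercept_, ridge.coef_)
print('Ridge (sklearn) MSE = {:.3e}'.format(metrics.mean_squared_error(yTest, yEstRidge)))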
Where to go from here?¶
- Another two-feature linear implementation using Keras (HTML / Jupyter)
- Theoretical and practical quality aspects of linear regression with the Gaussian model (HTML / Jupyter)
- More complex linear regression with feature engineering or feature learning (HTML / Jupyter)
- Compare with the two-feature binary classification with logistic regression using "raw" Python or libraries (HTML / Jupyter)