Local approximation of a univariate polynomial expression using linear regression¶
Learning goals:¶
- Mathematical statement of linear regression using the simplest data: a univariate continuous function (single feature)
- Simple implementation using this closed form solution
- Definition of the usual metrics
- Move on to existing libraries from SciPy or scikit-learn to perform the linear fitting and to analyze the model
- Increase the capacity of the model with powers of the features (polynomial)
- Understand iterative methods based on gradient descent, including the stochastic gradient descent
import math
import numpy as np
from numpy import random
import matplotlib.pyplot as plt
from sklearn import metrics as skMetrics
from sklearn.linear_model import LinearRegression as skLinearRegression
import scipy as sy
import scipy.stats  # ensure sy.stats.linregress is available below
import sympy
import pandas
Generator Model¶
def generateBatch(N, stochastic = False):
    # Range of the input feature
    xMin = 0
    xMax = 0.5
    # Intercept and noise standard deviation of the generator
    b = 0.35
    std = 0.01
    # Either sample x uniformly at random or on a regular grid
    if stochastic:
        x = random.uniform(xMin, xMax, N)
    else:
        x = np.linspace(xMin, xMax, N)
    yClean = x**4 + (x - 0.3)**3 + b
    y = yClean + random.normal(0, std, N)
    return (x, y, yClean)
Train data¶
xs = sympy.Symbol('x')
sympy.expand(xs**4 + (xs-0.3)**3 + 0.35)
N = 100000
xTrain, yTrain, yClean = generateBatch(N)
plt.plot(xTrain, yTrain, xTrain, yClean);
plt.title('Training data');
Test data¶
xTest, yTest, yTestClean = generateBatch(N)
Closed form / analytical solution¶
Linear, 1st degree approximation of y: \begin{align} y = x w + b \end{align}
Given $N_{feature} = 1$:
- $x$ is a $N_{sample}$ vector
- $w$ is a scalar
- $y$ is a $N_{sample}$ vector
- $b$ is a scalar
Using the mean squared error (squared Euclidean norm of the residual), we are looking for $\hat{\theta} = \{\hat{b}, \hat{w}\}$ such that:
\begin{align} \hat{\theta} \in \underset{\theta}{\operatorname{argmin}} \lVert x w + b - y \rVert_2^2 \end{align}\begin{align} g(\theta) & = \lVert x w + b - y \rVert_2^2 \\ & = \sum_{i=0}^n \left(x_i^2 w^2 + y_i^2 + b^2 + 2x_iwb - 2x_iwy_i - 2y_ib \right)\\ \end{align}Look for the minimum through the partial derivatives:
\begin{cases} \frac{\partial{g(\theta)}}{\partial b} = \sum_{i=0}^n 2(x_iw + b - y_i) \\ \frac{\partial{g(\theta)}}{\partial w} = \sum_{i=0}^n 2(x_iw + b - y_i)x_i \end{cases}Let's define for any variable z: \begin{align} \overline{z_n} & = \frac 1n \sum_{i=0}^n z_i \end{align}
Then: \begin{align} & \begin{cases} \frac{\partial{g(\theta)}}{\partial b} = 0 \\ \frac{\partial{g(\theta)}}{\partial w} = 0 \end{cases} \\ \iff & \begin{cases} \overline{x_n} w + b = \overline{y_n} & Eq1\\ \overline{x_n^2 }w + \overline{x_n} b = \overline{x_n y_n} & Eq2\\ \end{cases} \end{align}
$Eq2 - Eq1 \overline{x_n}$ :
\begin{align} & w \left( \overline{x_n^2} - \overline{x_n}^2 \right) = \overline{x_n y_n} - \overline{x_n}.\overline{y_n}\\ \end{align}Leading to:
\begin{align} w &= \frac{\overline{x_n y_n} - \overline{x_n} \cdot \overline{y_n}}{\overline{x_n^2} - \overline{x_n}^2}\\ &= \frac{\sum_{i=0}^n \left(x_i - \overline{x_n} \right) \left(y_i - \overline{y_n} \right)}{\sum_{i=0}^n \left(x_i - \overline{x_n} \right)^2 } \\ b &= \frac 1n \sum_{i=0}^n \left(y_i - x_i w\right) = \overline{y_n} - \overline{x_n} w \\ \end{align}Remove the means from $x$ and $y$ (center the data):
xUnB = xTrain - np.mean(xTrain)
yUnB = yTrain - np.mean(yTrain)
# In case x is univariate, the matrix product x^T x is a scalar
wEst = np.dot(xUnB, yUnB) / np.dot(xUnB, xUnB)
bEst = np.mean(yTrain - wEst * xTrain)
print('Linear regression estimate: y = {:.3f} x + {:.3f}'.format(wEst, bEst))
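As a quick cross-check (a sketch added for illustration, not part of the original derivation), the same slope can be recovered from the sample covariance and variance, since the formula above amounts to $w = \operatorname{cov}(x, y) / \operatorname{var}(x)$:
# Cross-check of the closed-form slope: w = cov(x, y) / var(x), both with denominator N
covXY = np.cov(xTrain, yTrain, bias=True)[0, 1]
varX = np.var(xTrain)
print('Covariance-based estimate: w = {:.3f}'.format(covXY / varX))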
Test model¶
yEst1 = wEst * xTest + bEst
plt.plot(xTest, yTest, xTest, yEst1);
res = yTest - yEst1
mse1 = np.dot(res, res) / N
print('Closed form, MSE = {:.3e}'.format(mse1));
Numpy polyfit, 1st degree¶
http://www.python-simple.com/python-numpy-scipy/fitting-regression.php
fit2 = np.polyfit(xTrain, yTrain, 1)
fit2
Test model¶
yEst2 = fit2[0] * xTest + fit2[1]
plt.plot(xTest, yTest, xTest, yEst2);
mse2 = skMetrics.mean_squared_error(yTest, yEst2)
print('Numpy polyfit 1st degree, MSE =', "{:.3e}".format(mse2));
Numpy polyfit, 4th degree¶
fit3 = np.polyfit(xTrain, yTrain, 4)
fit3
Test model¶
yEst3 = xTest**4 * fit3[0] + xTest**3 * fit3[1] + xTest**2 * fit3[2] + xTest * fit3[3] + fit3[4]
plt.plot(xTest, yTestClean, xTest, yEst3);
mse3 = skMetrics.mean_squared_error(yTest, yEst3)
plt.legend(['Original (clean)', 'Fitted'])
print('Numpy polyfit 4th degree, MSE =', "{:.3e}".format(mse3));
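As a side note (an added sketch, reusing the fit3 coefficients above), NumPy's np.polyval evaluates the fitted polynomial directly and should match the manual expansion:
# np.polyval applies the coefficients returned by np.polyfit, highest degree first
yEst3b = np.polyval(fit3, xTest)
print('Max difference with the manual evaluation: {:.3e}'.format(np.max(np.abs(yEst3b - yEst3))))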
NumPy least square¶
http://www.python-simple.com/python-numpy-scipy/fitting-regression.php
fit4, residuals, rank, s = np.linalg.lstsq(np.reshape(xUnB, (N, 1)), yUnB, rcond=None)
fit4
Test model¶
yEst4 = fit4[0] * xTest + bEst  # lstsq was run on centered data, so reuse the intercept estimated earlier
plt.plot(xTest, yTest, xTest, yEst4);
mse4 = skMetrics.mean_squared_error(yTest, yEst4)
print('Numpy Least square, MSE =', "{:.3e}".format(mse4));
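An alternative sketch (assumption: we let lstsq estimate the intercept as well, instead of removing the means beforehand) augments x with a column of ones so that b and w are solved jointly:
# Design matrix [1, x] so the least squares solution contains both the intercept and the slope
A = np.column_stack([np.ones(N), xTrain])
coefs, _, _, _ = np.linalg.lstsq(A, yTrain, rcond=None)
print('Least squares with intercept: y = {:.3f} x + {:.3f}'.format(coefs[1], coefs[0]))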
Scipy linear regression¶
http://www.python-simple.com/python-numpy-scipy/fitting-regression.php
fit5 = sy.stats.linregress(xTrain, yTrain)
fit5
Test model¶
yEst5 = fit5.slope * xTest + fit5.intercept
plt.plot(xTest, yTest, xTest, yEst5)
mse5 = skMetrics.mean_squared_error(yTest, yEst5)
print('Scipy, MSE =', "{:.3e}".format(mse5));
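scipy.stats.linregress also returns correlation and uncertainty statistics; a short sketch of how they could be inspected:
# Additional statistics provided by linregress
print('r^2 = {:.4f}, p-value = {:.3e}, slope std err = {:.3e}'.format(
    fit5.rvalue**2, fit5.pvalue, fit5.stderr))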
SK Learn linear regression¶
model6 = skLinearRegression()  # note: the 'normalize' argument has been removed from recent scikit-learn versions
model6.fit(xTrain.reshape(-1,1), yTrain.reshape(-1,1))
print('SciKit Learn linear regression, b =', model6.intercept_[0], ', w =', model6.coef_[0][0])
print('SciKit Learn, R^2-score =', model6.score(xTest.reshape(-1,1), yTest))
Test model¶
yEst6 = model6.predict(xTest.reshape(-1,1))
plt.plot(xTest, yTest, xTest, yEst6)
mse6 = skMetrics.mean_squared_error(yTest, yEst6)
print('SciKit Learn, MSE =', "{:.3e}".format(mse6));
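The 4th degree fit can also be reproduced with scikit-learn; a sketch (assuming the standard PolynomialFeatures and make_pipeline utilities, not used elsewhere in this notebook) combining them with LinearRegression:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
# 4th degree polynomial regression, comparable in capacity to np.polyfit(xTrain, yTrain, 4)
model6b = make_pipeline(PolynomialFeatures(degree=4), skLinearRegression())
model6b.fit(xTrain.reshape(-1, 1), yTrain)
yEst6b = model6b.predict(xTest.reshape(-1, 1))
print('SciKit Learn polynomial (degree 4), MSE =', '{:.3e}'.format(skMetrics.mean_squared_error(yTest, yEst6b)))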
Gradient descent¶
Let's first plot the MSE for several values of the slope in order to verify the convex shape of the cost function.
Ns = 100
slope = np.linspace(-1, 1, Ns)
sx = np.matmul(np.reshape(xUnB, (N, 1)), np.reshape(slope,(1,Ns)))
er_sx_y = sx - np.reshape(yUnB, (N, 1))
mse = np.mean(er_sx_y.T**2, axis = 1)
plt.plot(slope, mse)
plt.xlabel('w')
plt.ylabel('MSE')
plt.title('Min MSE = %.3e' % np.min(mse))
plt.grid();
The convexity of the MSE as a function of the linear regression slope (the $w$ coefficient) is shown above.
Gradient descent computation¶
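The update rule applied in the loop below follows the partial derivatives computed above, with the constant factor 2 absorbed into the learning rate $\eta$:
\begin{align} b &\leftarrow b - \eta \sum_{i=0}^n \left(x_i w + b - y_i\right) \\ w &\leftarrow w - \eta \sum_{i=0}^n \left(x_i w + b - y_i\right) x_i \end{align}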
def calcGradient(X, Y, b, w):
    # Residuals of the current linear model
    A = w * X + b - Y
    # Partial derivatives of the squared-error cost with respect to b and w (constant factor 2 dropped)
    F_db = np.sum(A)
    F_dw = np.dot(A, X)
    return (F_db, F_dw, math.sqrt(F_db**2 + F_dw**2))
# Initial coef
b7, w7 = 1, 1
threshold = 1e-1
learningRate = 1e-6
nIterMax = 1e5
# Init
gradient_b, gradient_w, gradientNorm = calcGradient(xTrain, yTrain, b7, w7)
print('START: b = %.3e' % b7, ', w = %.3e' % w7, ', Gradient norm = %.3e' % gradientNorm)
w7Learn = [np.array([b7, w7, gradientNorm])]
nIter = 0
while (gradientNorm > threshold) & (nIter < nIterMax):
    b7 = b7 - learningRate * gradient_b
    w7 = w7 - learningRate * gradient_w
    gradient_b, gradient_w, gradientNorm = calcGradient(xTrain, yTrain, b7, w7)
    w7Learn.append(np.array([b7, w7, gradientNorm]))
    nIter += 1
print('END : b = %.3e' % b7, ', w = %.3e' % w7, ', Gradient norm = %.3e' % gradientNorm, ', num iteration =', len(w7Learn))
df7 = pandas.DataFrame(w7Learn, columns = ('b', 'w', 'Gradient norm'))
fig = plt.figure(figsize=(16,12))
plt.subplot(2,2,1)
plt.plot(df7['b'])
plt.grid()
plt.title('b');
plt.subplot(2,2,2)
plt.plot(df7['w'])
plt.grid()
plt.title('w');
plt.subplot(2,2,3)
plt.semilogy(df7['Gradient norm'])
plt.grid()
plt.title('Gradient norm');
fig = plt.figure(figsize=(16,8))
plt.subplot(1,2,1)
plt.semilogy(df7['b'], df7['Gradient norm'])
plt.grid()
plt.xlabel('b')
plt.ylabel('gradient norm');
plt.title('Trajectory of b and Gradient norm');
plt.subplot(1,2,2)
plt.semilogy(df7['w'], df7['Gradient norm'])
plt.grid()
plt.xlabel('w')
plt.ylabel('gradient norm');
plt.title('Trajectory of w and Gradient norm');
Test model¶
yEst7 = w7*xTest + b7
plt.plot(xTest, yTest, xTest, yEst7);
mse7 = skMetrics.mean_squared_error(yTest, yEst7)
print('Gradient descent, MSE =', '{:.3e}'.format(mse7));
Stochastic gradient descent¶
Using a new data batch at each iteration.
Alternatively, on a finite data set, shuffle the samples, as sketched below.
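A sketch of the shuffling option (assumption: the fixed training set of size N is reused over one epoch; batchSize, perm and the Ex-suffixed names are illustrative only):
batchSize = 100
perm = random.permutation(N)                 # shuffle the sample indices once per epoch
for start in range(0, N, batchSize):
    idx = perm[start:start + batchSize]      # indices of the current mini-batch
    xBatchEx, yBatchEx = xTrain[idx], yTrain[idx]
    # the gradient would then be computed on (xBatchEx, yBatchEx) and b, w updated as below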
nBatch = 100
b8, w8 = 2, 2
threshold = 1e-3
learningRate = 1e-3
nIterMax = 1e5
# Initial batch
xBatch, yBatch, yBC = generateBatch(nBatch, True)
gradient_b, gradient_w, gradientNorm = calcGradient(xBatch, yBatch, b8, w8)
print('START: b = %.3e' % b8, ', w = %.3e' % w8, ', Gradient norm = %.3e' % gradientNorm)
w8Learn = [np.array([b8, w8, gradientNorm])]
# Continue
nIter = 0
while (gradientNorm > threshold) & (nIter < nIterMax):
    b8 = b8 - learningRate * gradient_b
    w8 = w8 - learningRate * gradient_w
    xBatch, yBatch, yBC = generateBatch(nBatch, True)
    gradient_b, gradient_w, gradientNorm = calcGradient(xBatch, yBatch, b8, w8)
    w8Learn.append(np.array([b8, w8, gradientNorm]))
    learningRate = learningRate * 0.9999  # Decreasing learning rate
    nIter += 1
print('END : b = %.3e' % b8, ', w = %.3e' % w8, ', Gradient norm = %.3e' % gradientNorm, ', num iteration =', len(w8Learn))
df8 = pandas.DataFrame(w8Learn, columns = ('b', 'w', 'Gradient norm'))
fig = plt.figure(figsize=(16,12))
plt.subplot(2,2,1)
plt.plot(df8['b'])
plt.grid()
plt.title('b');
plt.subplot(2,2,2)
plt.plot(df8['w'])
plt.grid()
plt.title('w');
plt.subplot(2,2,3)
plt.semilogy(df8['Gradient norm'])
plt.grid()
plt.title('Gradient norm');
The gradient norm becomes very noisy once it drops below $10^{-1}$. A more adaptive learning rate would be needed there, as well as an average of the gradient norm to improve the stopping condition of the gradient descent, as sketched below.
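A sketch of that stopping-condition improvement (assumption: a window of 100 iterations, chosen arbitrarily): average the recorded gradient norm over a moving window instead of using its instantaneous value.
# Moving average of the recorded gradient norm (illustration of a smoother stopping signal)
window = 100
gradNorms = df8['Gradient norm'].values
smoothed = np.convolve(gradNorms, np.ones(window) / window, mode='valid')
plt.semilogy(smoothed)
plt.grid()
plt.title('Moving average of the gradient norm');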
fig = plt.figure(figsize=(16,8))
plt.subplot(1,2,1)
plt.semilogy(df8['b'], df8['Gradient norm'])
plt.grid()
plt.xlabel('b')
plt.ylabel('gradient norm');
plt.title('Trajectory of b and Gradient norm');
plt.subplot(1,2,2)
plt.semilogy(df8['w'], df8['Gradient norm'])
plt.grid()
plt.xlabel('w')
plt.ylabel('gradient norm');
plt.title('Trajectory of w and Gradient norm');
Test model¶
yEst8 = w8*xTest + b8
plt.scatter(xBatch, yBatch, marker='.', color='black', label='Last batch');
plt.plot(xTrain, yClean, label='Generator model')
plt.plot(xTest, yEst8, label='Estimated model')
plt.legend()  # labels attached to each artist to avoid any ordering mismatch
mse8 = skMetrics.mean_squared_error(yTest, yEst8)
print('Stochastic gradient descent, MSE =', "{:.3e}".format(mse8));
Gradient descent with SciKit Learn¶
from sklearn.linear_model import SGDRegressor as skSGDRegressor
model9 = skSGDRegressor(alpha=0.0001, average=False,
    early_stopping=True, epsilon=0.1, eta0=0.0, fit_intercept=True,
    learning_rate='optimal', loss='squared_error',  # 'squared_loss' in scikit-learn < 1.0
    max_iter=1000, n_iter_no_change=5, penalty='l2', power_t=0.5,
    random_state=None, shuffle=True, tol=0.001,
    validation_fraction=0.1, verbose=0, warm_start=False)
# l1_ratio=0.15,
Notes:
- Regularizer is called 'penalty' and parameterized by 'alpha' (and 'l1_ratio')
- Early stopping is available and parameterized by 'early_stopping', 'max_iter', 'tol' and 'n_iter_no_change'
- Shuffling between epochs enabled by 'shuffle'
model9.fit(xTrain.reshape(-1, 1), yTrain)
print('Y = {0} X + {1}'.format(model9.coef_, model9.intercept_))
yEst9 = model9.predict(xTest.reshape(-1,1))
plt.plot(xTrain, yClean)
plt.plot(xTest, yEst9)
plt.legend(['Generator model', 'Estimated model'])
mse9 = skMetrics.mean_squared_error(yTest, yEst9)
print('Gradient descent with SK Learn, MSE =', "{:.3e}".format(mse9));
Main take-aways¶
Linear regression has a closed-form solution leading to the best fit. Many Python libraries provide this linear or polynomial fitting.
We have also learnt gradient descent on this simple case; it will be very useful for upcoming projects based on neural networks.
Where to go from here?¶
Other single feature linear implementation using TensorFlow (Notebook)
More complex bivariate models using "raw" Python (Notebook), up to gradient descent with a regularizer, or using Keras (Notebook)
Compare with the single feature binary classification using logistic regression, with "raw" Python or libraries (Notebook)