
Linear regression estimate quality (bivariate with Gaussian noise)

Up to now, the regression models with 1 or 2 features were based on a (virtually) infinite-length dataset. As a consequence, all estimates were almost perfect.

In a "real life" application, the dataset might be limited for many reasons, for example:

  • data collection is slow (low frequency) and has only started recently
  • the phenomenon to explain is not stable (stationary), and the quasi-stationarity time interval corresponds to a small dataset

Data analysis and modeling are still possible, but we must then account for the variability of the estimates.

Learning goals

  • Learn about the Gaussian linear model
  • Show the impact of a reduced-length dataset on estimate quality (bias, variance)
  • Theory and experiments on small training datasets
  • Confidence intervals on the estimates

References

  • Lecture notes on ordinary least squares, Telecom Paris, François Portier and Anne Sabourin (unpublished)
  • High-Dimensional Statistics, chapter 2: Fixed Design Linear Regression, MIT 18.S997, 2015
In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.preprocessing import normalize as skNormalize
import pandas as pd
import math
import scipy.stats as stats
import seaborn as sns

Helpers

In [2]:
def plotHistParams(x, label, ax, span, nBins = 100, pdfRefX=None, pdfRefY=None):
    """Plot histogram of x with its empirical mean and variance, optionally overlaying a reference PDF"""
    ax.hist(x, bins=nBins, density=True, range=span, label='histogram')
    mu = np.mean(x)
    x_c = x - mu
    ax.set_title('%s\n$\mu$=%.3f, $\sigma^2$=%.3f' % (label, mu, np.dot(x_c,x_c) / len(x)))
    if pdfRefX is not None:
        ax.plot(pdfRefX, pdfRefY, label='ref')
        ax.legend()
        
def plotToRef(x, y, ref, ax, title, xLabel=None, yErr=None):
    """ Plot y and a reference (or target) value"""
    if yErr is None:
        ax.plot(x, y)
    else:
        ax.errorbar(x, y, yerr=yErr, ecolor='red', capsize=3.0)
    if xLabel:
        ax.set_xlabel(xLabel)
    ax.plot(x, np.ones((len(x)))*ref, alpha=0.5, color='orange')
    ax.set_title(title)
    ax.grid()
    
def plotHeatMap(X, classes, title=None, fmt='.2g', ax=None):
    """ Fix heatmap plot from Seaborn with pyplot 3.1.0, 3.1.1
        https://stackoverflow.com/questions/56942670/matplotlib-seaborn-first-and-last-row-cut-in-half-of-heatmap-plot
    """
    ax = sns.heatmap(X, xticklabels=classes, yticklabels=classes, annot=True, fmt=fmt, cmap=plt.cm.Blues, ax=ax) #notation: "annot" not "annote"
    bottom, top = ax.get_ylim()
    ax.set_ylim(bottom + 0.5, top - 0.5)
    if title:
        ax.set_title(title)

Model

We will use a linear model with Gaussian noise, in the so-called fixed design setting: $f_1(x) = 0.5 x_0 - 0.7 x_1 + 0.35 + \epsilon$

With :

  • $x_0 \in [0, 0.5], x_1 \in [-0.5, 0.5]$ the two features (covariates)
  • $\epsilon \sim \mathcal{N}(0, \sigma^2)$ the Gaussian distributed noise (unknown)
In [3]:
nFeatures = 2
bLin = 0.35
wLin = [0.5, -0.7]
sigmaLin = 0.5
def generateBatchLinear(n, sigma=sigmaLin):
    #
    xMin = np.array([0, -0.5])
    xMax = np.array([0.5, 0.5])
    #
    x = np.random.uniform(xMin, xMax, (n, 2))
    yClean = np.dot(x, wLin) + bLin
    return x, yClean, yClean + np.random.normal(0, sigma, n)
In [4]:
x_1000, yClean_1000, y_1000 = generateBatchLinear(1000, sigmaLin)
fix, ax = plt.subplots(1,2, figsize=(15, 6))
for i,a in enumerate(ax):
    a.scatter(x_1000[:, i], y_1000)
    a.set_xlabel('$x_%d$' % i)
    a.set_ylabel('$y$');

With a priori knowledge restricted to the model family, i.e. linear with Gaussian noise, there are five quantities to estimate:

  • $\hat{w}_0, \hat{w}_1$ the weights to apply to the features
  • $\hat{b}$ the intercept
  • the Gaussian noise with the mean $\hat{\mu}$, and variance (power) $\hat{\sigma}^2$

Closed form linear regression

As explained in more detail in the previous notebook (HTML / Jupyter), the linear regression estimates are computed in closed form by minimizing the squared Euclidean norm $\lVert X_m \Theta - Y \rVert_2^2$ where:

  • $X \in \mathbb{R}^{n \times 2}$
  • $\Theta = \begin{bmatrix} b \\ w_0 \\ w_1 \end{bmatrix} $
  • $X_m = \begin{bmatrix} \mathbb{1}_n & X \\ \end{bmatrix}$, a modified $X$ with a column of 1s to evaluate the intercept

Then : $ Y = X_m \Theta + \epsilon $

Leading to the closed form computation of the linear regression, assuming the matrix is invertible (or at least pseudo invertible) : $$\hat{\Theta} = (X_m^T X_m)^{-1} X_m^T Y$$

In [5]:
def xWithIntercept(x):
    """ Add a column of ones in order to compute the intercept along with coefficients related to X"""
    return np.concatenate((np.ones((x.shape[0], 1)), x), axis=1)

x_m_1000 = xWithIntercept(x_1000)

def linearRegression(x, y):
    xTxInv = np.linalg.inv(np.matmul(x.T, x))
    return np.matmul(xTxInv, np.matmul(x.T, y))

thetaEst0 = linearRegression(x_m_1000, yClean_1000)
print("Estimation on clean y : Intercept=%.4f, coefficients=%s" % (thetaEst0[0], thetaEst0[1:]))

thetaEst_1000 = linearRegression(x_m_1000, y_1000)
print("Estimation on noisy y : Intercept=%.4f, coefficients=%s" % (thetaEst_1000[0], thetaEst_1000[1:]))
Estimation on clean y : Intercept=0.3500, coefficients=[ 0.5 -0.7]
Estimation on noisy y : Intercept=0.3715, coefficients=[ 0.45111285 -0.67426465]
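
As a sanity check (a minimal sketch, assuming the arrays defined above are still in scope), the same estimate should come out of NumPy's least-squares solver, which avoids forming $(X_m^T X_m)^{-1}$ explicitly:

# Cross-check of the closed form with np.linalg.lstsq (numerically more stable)
thetaLstsq, _, _, _ = np.linalg.lstsq(x_m_1000, y_1000, rcond=None)
print("np.linalg.lstsq estimate :", thetaLstsq)  # expected to match the noisy-y estimate above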

Noise estimation

The noise mean is simply computed as the mean of the residuals $Y - X_m \hat{\Theta}$. It is expected to be 0 in our case.

Given a noise mean equal to 0, the unbiased noise variance estimation is then : $$ \hat{\sigma}^2 = \frac{1}{n-p-1} \Vert Y - X_m \hat{\Theta} \Vert_2^2$$

with:

  • $n$ the number of samples in our dataset
  • $p$ the number of features, i.e. the rank of the matrix $X^TX$, equal to 2 in our example
In [6]:
def noiseEstimation(yEst, y, p):
    """ Unbiased noise mean and variance estimation """
    epsilonEst = y - yEst
    mu = np.mean(epsilonEst)
    epsilonEst_c = epsilonEst - mu
    var = 1 / (len(y) - p - 1) * np.dot(epsilonEst_c, epsilonEst_c)
    return mu, var

noiseMu_1000, noiseVar_1000 = noiseEstimation(np.matmul(x_m_1000, thetaEst_1000), y_1000, nFeatures)
print('Estimated noise mean=%.2e and std deviation=%.4f' % (noiseMu_1000, math.sqrt(noiseVar_1000)))
Estimated noise mean=-1.82e-16 and std deviation=0.5051
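
The same unbiased estimate can be cross-checked with NumPy (a minimal sketch reusing the fit above): np.var with ddof set to the number of fitted coefficients, i.e. $p+1 = 3$, applies the same $n-p-1$ normalization.

# Cross-check: np.var with ddof = p + 1 also divides by n - p - 1
residuals_1000 = y_1000 - np.matmul(x_m_1000, thetaEst_1000)
print('np.var with ddof=3 : %.4f' % np.var(residuals_1000, ddof=nFeatures + 1))  # should match noiseVar_1000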

Other important theoretical results on the Gaussian linear model

Given $\hat{\Theta_{n,k}}$ the component $k$ of the estimate of $\Theta$ on $n$ samples, and $\Theta_k$ the actual value of this component:

  • Lemma 1: variations on the estimated components of $\Theta$ are Gaussian distributed $$\left(\hat{\Theta_n} - \Theta \right) \sim \mathcal{N}\left( 0, (X^T X)^{-1} \sigma^2 \right)$$

  • Lemma 2: the ratio of the estimated $\hat{\sigma}^2$ to the true (unknown) $\sigma^2$ is $\chi^2$ distributed $$\frac{\hat{\sigma_n}^2(n-p-1)}{\sigma^2} \sim \chi^2(n-p-1)$$

    • $\chi^2(n-p-1)$ is the Chi-2 law with $n-p-1$ degrees of freedom
  • Lemma 3: the ratio of the variation of the coefficient over the estimated $\hat{\sigma}$ follows a Student law $$\sqrt{\frac{n}{\hat{\sigma_n}^2 S_{n,k}}} (\hat{\Theta_{n,k}} - \Theta_k) \sim \mathcal{T}_{n-p-1}$$
    • $\mathcal{T}_{n-p-1}$ is the Student law with $(n-p-1)$ degrees of freedom
    • $S_{n,k} = n (X^TX)^{-1}_{k,k}$ (element $k,k$ of the inverse Gram matrix, scaled by $n$)
In [7]:
xx = [np.linspace(0, 16), np.linspace(-5, 5)]
fig,ax = plt.subplots(1, 2, figsize=(12, 5))
for i, a in enumerate(ax):
    a.plot(xx[i], stats.norm.pdf(xx[i]), label='$\mathcal{N}(0,1)$')
for i in [1, 2, 4, 8]:
    label0 = '$\chi^2(%d)$' % i
    ax[0].plot(xx[0], stats.chi2(i).pdf(xx[0]), label=label0)
    label1 = '$\mathcal{T}(%d)$' % i
    ax[1].plot(xx[1], stats.t(i).pdf(xx[1]), label=label1)
for a in ax:
    a.legend()
    a.set_ylabel('Density (PDF)')
    a.grid()
ax[0].set_title('$\chi^2$ distributions vs. Normal')
ax[1].set_title('$\mathcal{T}$ distributions vs. Normal');

As shown in the right-hand figure above, the Student law quickly converges to the Gaussian law as the degrees of freedom increase.
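
A quick numeric illustration of this convergence (a small sketch using scipy.stats only): the 97.5% quantile of the Student law approaches the Gaussian value of about 1.96 as the degrees of freedom grow.

# Student quantiles shrink toward the Gaussian quantile as degrees of freedom grow
for df in [2, 5, 30, 100]:
    print('t(%3d) 97.5%% quantile = %.3f' % (df, stats.t(df).ppf(0.975)))
print('N(0,1) 97.5%% quantile = %.3f' % stats.norm.ppf(0.975))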

Application of Lemma 1:

In [8]:
def weightCovariance(x, sigma):
    """ Compute weight covariance as per Lemma 1"""
    return np.linalg.inv(np.matmul(x.T, x)) * sigma**2

wCovTheo_1000 = weightCovariance(x_m_1000, sigmaLin) 
plotHeatMap(wCovTheo_1000, ('b','$w_0$', '$w_1$'), title='Theoretical covariance matrix', fmt='.2e')
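
The diagonal of this matrix gives the theoretical variance of each estimated coefficient; its square root is the standard deviation to expect on each estimate (a small sketch reusing wCovTheo_1000 from the cell above):

# Theoretical standard deviation of each coefficient estimate (square root of the diagonal)
stdTheo_1000 = np.sqrt(np.diagonal(wCovTheo_1000))
print('Theoretical std deviation of [b, w0, w1] :', stdTheo_1000)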

Sweep over dataset length

As a first step, let's sweep the dataset length from 4 to 1000 samples and estimate the linear regression coefficients and the noise parameters.

Note: the minimum value of n is: $\min_{n \in \mathbb{N}} (n-p-1 > 0) = 4$

In [9]:
fig, ax = plt.subplots(2, 2, figsize=(16, 10), subplot_kw={'xscale':'log'})

regressions = []
for n in [4, 5, 10, 20, 40, 100, 200, 500, 1000]:
    x_m_n = x_m_1000[:n]
    y_n = y_1000[:n]
    thetaEst = linearRegression(x_m_n, y_n)
    noiseMu, noiseVar = noiseEstimation(np.matmul(x_m_n, thetaEst), y_n, nFeatures)
    regressions.append([n, thetaEst[0], thetaEst[1], thetaEst[2], noiseMu, noiseVar])

dfSweepN = pd.DataFrame(regressions, columns=('n', 'b', 'w0','w1', 'noise mean', 'noise var'))
plotToRef(dfSweepN['n'], dfSweepN['b'], bLin, ax[0,0], '$\hat{b}$')
plotToRef(dfSweepN['n'], dfSweepN['noise var'], sigmaLin**2, ax[1,0], '$\hat{\sigma}^2$', xLabel='n')
plotToRef(dfSweepN['n'], dfSweepN['w0'], wLin[0], ax[0,1], '$\hat{w_0}$')
plotToRef(dfSweepN['n'], dfSweepN['w1'], wLin[1], ax[1,1], '$\hat{w_1}$', xLabel='n');

Distributions with $n = 5$, fixed design

In this set of experiments, new noise samples are drawn at each run, while the $x_i$ are known and fixed.

Let's run many experiments with $n = 5$ in order to plot the histograms of the estimators.

In [10]:
regressions = []
n = 5
x_m_5 = x_m_1000[:n]
yClean_5 = yClean_1000[:n]
for l in range(50000):
    # Draw noise
    y_5 = yClean_5 + np.random.normal(0, sigmaLin, n)
    thetaEst = linearRegression(x_m_5, y_5)
    noiseMu, noiseVar = noiseEstimation(np.matmul(x_m_5, thetaEst), y_5, nFeatures)
    regressions.append([thetaEst[0], thetaEst[1], thetaEst[2], noiseMu, noiseVar])
    
df_5 = pd.DataFrame(regressions, columns=('b', 'w0','w1', 'noise mean', 'noise var'))
In [11]:
fig, ax = plt.subplots(1, 4, figsize=(16, 5))
plotHistParams(df_5['b']  - bLin,    '$\hat{b} - b$',     ax[0], [-3, 3])
plotHistParams(df_5['w0'] - wLin[0], '$\hat{w_0} - w_0$', ax[1], [-5, 5])
plotHistParams(df_5['w1'] - wLin[1], '$\hat{w_1} - w_1$', ax[2], [-5, 5])

dFree = n - nFeatures - 1
xx = np.linspace(0,15)
plotHistParams(df_5['noise var'] * dFree / sigmaLin**2, '$(n-p-1)\hat{\sigma}^2 / \sigma^2$ vs. $\chi^2(2)$', 
                    ax[3], [0,20], pdfRefX=xx, pdfRefY=stats.chi2(dFree).pdf(xx))

In the graphs above, the regression intercept $b$ and weights ($w_0, w_1$) look Gaussian distributed, and the noise variance scaled by $\frac{n-p-1}{\sigma^2}$, with $n=5, p=2$, matches the expected $\chi^2(2)$ (see Lemma 2).

Verification of Lemma 1

As per Lemma 1, we expect the following covariance matrix for the coefficients: $$\left(\hat{\Theta_n} - \Theta \right) \sim \mathcal{N}\left( 0, (X^T X)^{-1} \sigma^2 \right)$$

In [12]:
def autoCovariance(x, axis=0):
    x_c = x - np.mean(x, axis=0)
    return 1 / x.shape[axis] *  np.matmul(x_c.T, x_c)

# Covariance of the intercept and weights
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ThetaCov_5 = autoCovariance(df_5[['b', 'w0', 'w1']].values)
plotHeatMap(ThetaCov_5, ('b','$w_0$', '$w_1$'), title='Weight covariance matrix', ax=ax[0], fmt='.3f');

# Covariance matrix based on the design matrix (X^T X)
wCovTheo_5 = weightCovariance(x_m_5, sigmaLin)
plotHeatMap(wCovTheo_5, ('b','$w_0$', '$w_1$'), title='Theoretical covariance matrix', ax=ax[1], fmt='.3f');

The covariance matrices match!
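
A minimal quantitative check of this match (assuming the two matrices computed in the cell above):

# Largest element-wise deviation between empirical and theoretical covariance
print('Max absolute difference : %.4f' % np.max(np.abs(ThetaCov_5 - wCovTheo_5)))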

Verification of Lemma 3

Let's now verify the law involving only the $\hat{\Theta}$ and $\hat{\sigma}^2$ estimates, as per Lemma 3:

  • $J(\hat{\Theta_{n,k}}) = \sqrt{\frac{n}{\hat{\sigma_n}^2 S_{n,k}}} (\hat{\Theta_{n,k}} - \Theta_k) \sim \mathcal{T}_{n-p-1}$
    • $\mathcal{T}_{n-p-1}$ is the Student law with (n-p-1) degrees of freedom
    • $S_{n,k} = n (X^TX)^{-1}_{k,k}$ (select element k,k of the inverted Gram matrix)
In [13]:
n = 5
Theta_5Est = df_5[['b', 'w0', 'w1']].values
Theta_5Exp = np.array([bLin, wLin[0], wLin[1]])
estim = np.zeros((3, len(df_5)))
xTXInv_5 = n * np.linalg.inv(np.matmul(x_m_5.T, x_m_5))
nOverNoiseVarSqrt = np.sqrt(n / df_5['noise var'])
for k, theta in enumerate(Theta_5Est.T):
    Snk = xTXInv_5[k, k]
    estim[k] = math.sqrt(1/Snk) \
        * np.multiply(nOverNoiseVarSqrt, (theta - Theta_5Exp[k]))
In [14]:
fig, ax = plt.subplots(1, 3, figsize=(16, 5))
span, xx = [-10, 10], np.linspace(-10, 10)
pdfXx = stats.t(n - nFeatures - 1).pdf(xx)
labels = ['$J(\hat{b})$ vs. $\mathcal{T}(2)$', '$J(\hat{w_0})$ vs. $\mathcal{T}(2)$', '$J(\hat{w_1})$ vs. $\mathcal{T}(2)$']
for est, l, a in zip(estim, labels, ax):
    plotHistParams(est, l, a, span, pdfRefX=xx, pdfRefY=pdfXx)

The theory is verified; we now have a closed-form expression to compute confidence intervals on the estimates.

Confidence intervals

For this Gaussian linear model, for each coefficient, the confidence interval at level $1-\alpha$ is: $$IC(\theta_k) = \left[ \hat{\theta_{n,k}} + \sqrt{\frac{\hat{\sigma_n}^2 S_{n,k}}{n}} q_{\frac{\alpha}{2}}, \hat{\theta_{n,k}} + \sqrt{\frac{\hat{\sigma_n}^2 S_{n,k}}{n}} q_{1-\frac{\alpha}{2}} \right]$$

With $q_{\frac{\alpha}{2}}$ and $q_{1-\frac{\alpha}{2}}$ the quantiles of the Student law $\mathcal{T}(n-p-1)$
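
For instance, with $n=5$ and $p=2$, the Student law has only 2 degrees of freedom and its quantiles are much larger than the Gaussian ones, which widens the interval accordingly (a small sketch):

# Quantiles at 1 - alpha/2 for alpha = 0.1: Student with 2 degrees of freedom vs. Gaussian
alphaEx = 0.1
print('t(2) quantile     : %.3f' % stats.t(2).ppf(1 - alphaEx/2))   # ~2.92
print('Gaussian quantile : %.3f' % stats.norm.ppf(1 - alphaEx/2))   # ~1.64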

We may now create a regression procedure that estimates the noise variance $\hat{\sigma}^2$ and provides confidence intervals for the coefficients:

In [15]:
def linearRegressionLinearGaussian(x, y, alpha):
    """ Compute linear regression coefficients and their confidence interval
            Assume that X is full rank
    """
    n = len(y)
    rank = x.shape[1] # = p + 1
    # Design matrix
    xTxInv = np.linalg.inv(np.matmul(x.T, x))
    # Regression coefficients
    thetas = np.matmul(xTxInv, np.matmul(x.T, y))
    # Predictions
    yEst = np.matmul(x, thetas)
    # Noise variance
    epsilonEst = y - yEst
    noiseMu = np.mean(epsilonEst) # for verification, but assumed 0
    epsilonEst_c = epsilonEst - noiseMu
    noiseVar = 1 / (n - rank) * np.dot(epsilonEst_c, epsilonEst_c)
    # Student quantile
    qAlpha = stats.t(n - rank).ppf(1 - alpha/2)
    # Confidence interval half range
    xTxInvDiag = np.diagonal(xTxInv)
    confidence = np.sqrt(noiseVar * xTxInvDiag) * qAlpha
    return thetas, confidence, noiseVar

Test with $n=5$

In [16]:
alpha = 0.1
thetas, confidence, noiseVar = linearRegressionLinearGaussian(x_m_5, y_5, alpha)
print('Coefficients =', thetas)
print('Confidence intervals @ %.0f%% :\n' % ((1-alpha)*100), 
      np.vstack((thetas - confidence, thetas + confidence)).T)
print('Noise variance estimate = %.3f' % noiseVar)
Coefficients = [ 0.29391061 -1.18995093 -0.81258792]
Confidence intervals @ 90% :
 [[-0.84904567  1.43686688]
 [-6.26703783  3.88713597]
 [-3.28128106  1.65610521]]
Noise variance estimate = 0.116

With so few samples in the dataset, the confidence intervals are very wide, as already visible in the three series of histograms above.

Note: all confidence intervals depend on the noise variance estimate; any error in that estimate has a large impact on the resulting intervals.
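
As a cross-check (a sketch assuming the statsmodels package is available, which this notebook does not otherwise use), the same coefficients and intervals should be returned by statsmodels' OLS, since x_m_5 already contains the intercept column:

# Cross-check with statsmodels (x_m_5 already includes the column of ones)
import statsmodels.api as sm
olsFit = sm.OLS(y_5, x_m_5).fit()
print(olsFit.params)               # expected to match thetas
print(olsFit.conf_int(alpha=0.1))  # expected to match the 90% intervals above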

Full test of the confidence interval

Let's redo the sweep on the dataset length computing also confidence intervals:

In [17]:
alpha = 0.05

regressions = []
for n in [5, 10, 20, 40, 100, 200, 500, 1000]:
    x_m_n = x_m_1000[:n]
    y_n = y_1000[:n]
    thetaEst, confidence, noiseVar = linearRegressionLinearGaussian(x_m_n, y_n, alpha)
    regressions.append([n, 
                        thetaEst[0], thetaEst[1], thetaEst[2], 
                        confidence[0], confidence[1], confidence[2],
                        noiseVar])

dfSweepN2 = pd.DataFrame(regressions, columns=('n', 'b', 'w0','w1', 'bConf', 'w0Conf', 'w1Conf', 'noise var'))
fig, ax = plt.subplots(2, 2, figsize=(16, 8), subplot_kw={'xscale':'log'})
plotToRef(dfSweepN2['n'], dfSweepN2['b'], bLin, ax[0,0], '$\hat{b}$', yErr=dfSweepN2['bConf'])
plotToRef(dfSweepN2['n'], dfSweepN2['noise var'], sigmaLin**2, ax[1,0], '$\hat{\sigma}^2$', xLabel='n')
plotToRef(dfSweepN2['n'], dfSweepN2['w0'], wLin[0], ax[0,1], '$\hat{w_0}$', yErr=dfSweepN2['w0Conf'])
plotToRef(dfSweepN2['n'], dfSweepN2['w1'], wLin[1], ax[1,1], '$\hat{w_1}$', yErr=dfSweepN2['w1Conf'], xLabel='n');

In the graphs above, the red error bars correspond to the confidence intervals. For short dataset lengths the confidence interval is huge; from about 10 samples onward it narrows quickly.

The true value (orange line) should lie within the error bar; if not, that case falls outside the 95% confidence level.
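
The 95% level can also be validated empirically (a minimal sketch reusing the helpers defined above): repeat the regression many times with fresh noise on a small fixed design, and count how often the true $w_0$ falls inside its interval. The fraction should be close to $1-\alpha$.

# Monte-Carlo coverage check of the confidence interval on w0 (fixed design, n=10)
nTrials, nSamples, alphaCov = 2000, 10, 0.05
covered = 0
for _ in range(nTrials):
    yTrial = yClean_1000[:nSamples] + np.random.normal(0, sigmaLin, nSamples)
    thetasT, confT, _ = linearRegressionLinearGaussian(x_m_1000[:nSamples], yTrial, alphaCov)
    if abs(thetasT[1] - wLin[0]) <= confT[1]:
        covered += 1
print('Empirical coverage of w0 : %.3f (target %.2f)' % (covered / nTrials, 1 - alphaCov))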

Distributions with $n = 5$ and stochastic $X$ values

A different set of experiments in which the noise is drawn from the Gaussian distribution and $X$ is drawn from the uniform distributions at each run.

In [18]:
n = 5

regressions = []
for l in range(50000):
    # Draw a new batch
    x, yClean, y = generateBatchLinear(n)
    x_m = xWithIntercept(x)
    thetaEst = linearRegression(x_m, y)
    noiseMu, noiseVar = noiseEstimation(np.matmul(x_m, thetaEst), y, nFeatures)
    regressions.append([thetaEst[0], thetaEst[1], thetaEst[2], noiseMu, noiseVar])
    
df_5s = pd.DataFrame(regressions, columns=('b', 'w0','w1', 'noise mean', 'noise var'))
In [19]:
fig, ax = plt.subplots(1, 4, figsize=(16, 5))
plotHistParams(df_5s['b']  - bLin,    '$\hat{b} - b$',     ax[0], [-3, 3])
plotHistParams(df_5s['w0'] - wLin[0], '$\hat{w_0} - w_0$', ax[1], [-5, 5])
plotHistParams(df_5s['w1'] - wLin[1], '$\hat{w_1} - w_1$', ax[2], [-5, 5])

dFree = n - nFeatures - 1
ax[3].hist(df_5s['noise var'] * dFree / sigmaLin**2, bins=100, density=True, label='histogram')
xx = np.linspace(0,15)
ax[3].plot(xx, stats.chi2(dFree).pdf(xx), label='$\chi^2(2)$')
ax[3].legend()
ax[3].set_title('$(n-p-1)\hat{\sigma}^2 / \sigma^2$ vs. $\chi^2(2)$');

In the graphs above, compared to the fixed-design case, we observe that the shapes are similar but the variances of $\hat{b}, \hat{w_0}, \hat{w_1}$ are higher, whereas the scaled noise variance still matches the Chi-2 distribution.

In [20]:
# Covariance of the intercept and weights
ThetaCov_5s = autoCovariance(df_5s[['b', 'w0', 'w1']].values)
plotHeatMap(ThetaCov_5s, ('b','$w_0$', '$w_1$'), title='Weight covariance matrix', fmt='.3f');

Globally, the variance of the coefficients has increased; this reflects the additional variability of fitting $Y$ as a linear function of a now random $X$.

Conclusion

Starting from "raw theory", we have first shown how to match it with experiments, and then presented an application of this theory.

In this notebook, the underlying data model is Gaussian, which is somewhat ideal. In the next notebooks, we will investigate more complex models and situations.