Linear Regression#


The Model#

\[y_i = \beta x_i + \alpha + \epsilon_i\]

where

  • \(y_i\) is the number of minutes user \(i\) spends on the site daily,

  • \(x_i\) is the number of friends user \(i\) has,

  • \(\beta\) is the slope: the expected change in minutes per additional friend,

  • \(\alpha\) is the intercept: the predicted value of \(y_i\) when \(x_i = 0\),

  • \(\epsilon_i\) is a (hopefully small) error term representing the fact that there are other factors not accounted for by this simple model.

Least Squares Fit#

The method of least squares chooses the parameters that minimize the sum of squared errors. In vector form, the model is

\[ y_i = X_i^T w\]

The constant term \(\alpha\) is represented by a 1 in each \(X_i\).

The sum of squared errors can then be written as:

\[ \sum_{i = 1}^m (y_i -X_i^T w)^2 \]

If we know \(\alpha\) and \(\beta\), then we can make predictions.

Since we know the actual output \(y_i\), we can compute the error for each pair.

Since negative errors would cancel positive ones if we simply summed them, we work with squared errors instead.

The least squares solution chooses the \(\alpha\) and \(\beta\) that make the sum of squared errors as small as possible:

\[ y_i = \alpha + \beta x_i + \varepsilon_i \]
\[ \hat\varepsilon_i = y_i - \alpha - \beta x_i \]
\[ \text{Find }\min_{\alpha,\, \beta} Q(\alpha, \beta), \quad \text{for } Q(\alpha, \beta) = \sum_{i=1}^n \hat\varepsilon_i^{\,2} = \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2 \]
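
Setting the partial derivatives \(\partial Q/\partial \alpha\) and \(\partial Q/\partial \beta\) to zero yields the familiar closed form

\[ \hat\beta = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad \hat\alpha = \bar y - \hat\beta \bar x \]

A minimal sketch in plain Python (the function name is our own, not from any library):

def least_squares_fit(x, y):
    """Return (alpha, beta) minimizing Q(alpha, beta) for paired lists x, y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    # slope: covariance of x and y divided by the variance of x
    beta = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
            / sum((xi - x_bar) ** 2 for xi in x))
    # intercept: the fitted line passes through the point of means
    alpha = y_bar - beta * x_bar
    return alpha, beta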

The Geometric Meaning of Squared Error: Euclidean Distance#

The idea of least squares is to find the straight line that minimizes the total Euclidean distance from all the sample points to the line (more precisely, the squared vertical distances in \(y\)).


%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd 
import statsmodels.api as sm
import statsmodels.formula.api as smf 
import matplotlib
matplotlib.style.use('fivethirtyeight')
matplotlib.style.available
['seaborn-dark',
 'seaborn-darkgrid',
 'seaborn-ticks',
 'fivethirtyeight',
 'seaborn-whitegrid',
 'classic',
 '_classic_test',
 'fast',
 'seaborn-talk',
 'seaborn-dark-palette',
 'seaborn-bright',
 'seaborn-pastel',
 'grayscale',
 'seaborn-notebook',
 'ggplot',
 'seaborn-colorblind',
 'seaborn-muted',
 'seaborn',
 'Solarize_Light2',
 'seaborn-paper',
 'bmh',
 'tableau-colorblind10',
 'seaborn-white',
 'dark_background',
 'seaborn-poster',
 'seaborn-deep']
num_friends_good = [49,41,40,25,21,21,19,19,18,18,16,15,15,15,15,14,14,13,13,13,13,12,12,11,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,8,8,8,8,8,8,8,8,8,8,8,8,8,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
daily_minutes_good = [68.77,51.25,52.08,38.36,44.54,57.13,51.4,41.42,31.22,34.76,54.01,38.79,47.59,49.1,27.66,41.03,36.73,48.65,28.12,46.62,35.57,32.98,35,26.07,23.77,39.73,40.57,31.65,31.21,36.32,20.45,21.93,26.02,27.34,23.49,46.94,30.5,33.8,24.23,21.4,27.94,32.24,40.57,25.07,19.42,22.39,18.42,46.96,23.72,26.41,26.97,36.76,40.32,35.02,29.47,30.2,31,38.11,38.18,36.31,21.03,30.86,36.07,28.66,29.08,37.28,15.28,24.17,22.31,30.17,25.53,19.85,35.37,44.6,17.23,13.47,26.33,35.02,32.09,24.81,19.33,28.77,24.26,31.98,25.73,24.86,16.28,34.51,15.23,39.72,40.8,26.06,35.76,34.76,16.13,44.04,18.03,19.65,32.62,35.59,39.43,14.18,35.24,40.13,41.82,35.45,36.07,43.67,24.61,20.9,21.9,18.79,27.61,27.21,26.61,29.77,20.59,27.53,13.82,33.2,25,33.1,36.65,18.63,14.87,22.2,36.81,25.53,24.62,26.25,18.21,28.08,19.42,29.79,32.8,35.99,28.32,27.79,35.88,29.06,36.28,14.1,36.63,37.49,26.9,18.58,38.48,24.48,18.95,33.55,14.24,29.04,32.51,25.63,22.22,19,32.73,15.16,13.9,27.2,32.01,29.27,33,13.74,20.42,27.32,18.23,35.35,28.48,9.08,24.62,20.12,35.26,19.92,31.02,16.49,12.16,30.7,31.22,34.65,13.13,27.51,33.2,31.57,14.1,33.42,17.44,10.12,24.42,9.82,23.39,30.93,15.03,21.67,31.09,33.29,22.61,26.89,23.48,8.38,27.81,32.35,23.84]
alpha, beta = 22.9475, 0.90386  # least squares estimates (cf. the statsmodels fit below)
plt.scatter(num_friends_good, daily_minutes_good)
plt.plot(num_friends_good, [alpha + beta*i for i in num_friends_good], 'r-')
plt.xlabel('# of friends', fontsize = 20)
plt.ylabel('Minutes per day', fontsize = 20)
plt.title('linear regression', fontsize = 20)
plt.show()

Of course, we need a better way to figure out how well we’ve fit the data than staring at the graph.

A common measure is the coefficient of determination (or R-squared), which measures the fraction of the total variation in the dependent variable that is captured by the model.
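
As a quick check, here is a minimal sketch that computes R-squared for the fit above directly from its definition, using the alpha, beta, and data already defined:

import numpy as np

y = np.array(daily_minutes_good)
y_hat = alpha + beta * np.array(num_friends_good)
ss_res = ((y - y_hat) ** 2).sum()     # variation left unexplained by the model
ss_tot = ((y - y.mean()) ** 2).sum()  # total variation around the mean
print(1 - ss_res / ss_tot)            # about 0.33, matching the OLS summary below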

The Matrix Method#

\[ y_i = X_i^T w\]

As before, the constant term is represented by a 1 in each \(X_i\), and the sum of squared errors is:

\[ \sum_{i = 1}^m (y_i -X_i^T w)^2 \]


We can also write this in matrix notation as \((y-Xw)^T(y-Xw)\).

If we take the derivative of this with respect to \(w\), we get

\[-2X^T(y-Xw)\]

Setting this to zero and solving for \(w\) gives the least squares estimate:

\[\hat w = (X^T X)^{-1}X^T y\]
# https://github.com/computational-class/machinelearninginaction/blob/master/Ch08/regression.py
import pandas as pd
import random

dat = pd.read_csv('./data/ex0.txt', sep = '\t', names = ['x1', 'x2', 'y'])
dat['x3'] = [yi*.3 + .5*random.random() for yi in dat['y']]
dat.head()
    x1        x2         y        x3
0  1.0  0.067732  3.176513  1.063914
1  1.0  0.427810  3.816464  1.314866
2  1.0  0.995731  4.550095  1.775487
3  1.0  0.738336  4.256571  1.670366
4  1.0  0.981083  4.560815  1.368580
from numpy import mat, linalg, corrcoef

def standRegres(xArr, yArr):
    """Ordinary least squares via the normal equations: w = (X^T X)^{-1} X^T y."""
    xMat = mat(xArr)
    yMat = mat(yArr).T
    xTx = xMat.T * xMat
    # a zero determinant means X^T X cannot be inverted
    if linalg.det(xTx) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = xTx.I * (xMat.T * yMat)
    return ws
xs = [[dat.x1[i], dat.x2[i], dat.x3[i]] for i in dat.index]
y = dat.y
print(xs[:2])
ws = standRegres(xs, y)
print(ws)
[[1.0, 0.067732, 1.063913920495336], [1.0, 0.42781, 1.3148661204308572]]
[[2.91879051]
 [1.64782563]
 [0.08003454]]
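
Inverting \(X^T X\) explicitly works here, but a dedicated solver is numerically safer when \(X^T X\) is close to singular. A sketch of the same fit with numpy's built-in least squares routine:

import numpy as np

ws_lstsq, residuals, rank, svals = np.linalg.lstsq(np.array(xs), np.array(y), rcond=None)
print(ws_lstsq)  # should closely match ws above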
xMat = mat(xs)
yMat = mat(y)
# sort whole rows by the x2 column so the fitted line plots left to right
# (sorting each column independently would break the row pairing)
order = xMat[:, 1].A.ravel().argsort()
xCopy = xMat[order]
yHat = xCopy * ws
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(xMat[:, 1].flatten().A[0], yMat.T[:, 0].flatten().A[0])
ax.plot(xCopy[:, 1], yHat, 'r-')
plt.ylim(0, 5)
plt.show()
yHat = xMat*ws
corrcoef(yHat.T, yMat)
array([[1.        , 0.98685418],
       [0.98685418, 1.        ]])
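
The correlation between the fitted values \(\hat y\) and the observed \(y\) is about 0.987, confirming that the model tracks the data closely.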

Regression with Statsmodels#

http://www.statsmodels.org/stable/index.html

statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration.

dat = pd.read_csv('./data/ex0.txt', sep = '\t', names = ['x1', 'x2', 'y'])
dat['x3'] = [yi*.3 - .1*random.random() for yi in dat['y']]  # x3 is built from y plus noise
dat.head()
    x1        x2         y        x3
0  1.0  0.067732  3.176513  0.928033
1  1.0  0.427810  3.816464  1.134750
2  1.0  0.995731  4.550095  1.335951
3  1.0  0.738336  4.256571  1.206655
4  1.0  0.981083  4.560815  1.352873
results = smf.ols('y ~ x2 + x3', data=dat).fit()

results.summary()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.985
Model:                            OLS   Adj. R-squared:                  0.985
Method:                 Least Squares   F-statistic:                     6603.
Date:                Fri, 01 Dec 2023   Prob (F-statistic):          3.01e-181
Time:                        14:47:41   Log-Likelihood:                 275.97
No. Observations:                 200   AIC:                            -545.9
Df Residuals:                     197   BIC:                            -536.0
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.6757      0.105     16.015      0.000       1.469       1.882
x2             0.9091      0.063     14.356      0.000       0.784       1.034
x3             1.5579      0.122     12.772      0.000       1.317       1.798
==============================================================================
Omnibus:                        9.415   Durbin-Watson:                   2.096
Prob(Omnibus):                  0.009   Jarque-Bera (JB):                4.294
Skew:                           0.000   Prob(JB):                        0.117
Kurtosis:                       2.282   Cond. No.                         62.6
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
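
Note that x3 was constructed from y itself (plus noise), so it is nearly a linear function of the response; including such a predictor inflates R-squared without telling us anything about the real effect of x2.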
fig = plt.figure(figsize=(12,8))
fig = sm.graphics.plot_partregress_grid(results, fig = fig)
plt.show()
# simple regression: daily minutes on the site vs. number of friends
import numpy as np
X = np.array(num_friends_good)
X = sm.add_constant(X, prepend=False)
mod = sm.OLS(daily_minutes_good, X)
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.329
Model:                            OLS   Adj. R-squared:                  0.326
Method:                 Least Squares   F-statistic:                     98.60
Date:                Fri, 01 Dec 2023   Prob (F-statistic):           3.68e-19
Time:                        14:49:24   Log-Likelihood:                -711.76
No. Observations:                 203   AIC:                             1428.
Df Residuals:                     201   BIC:                             1434.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.9039      0.091      9.930      0.000       0.724       1.083
const         22.9476      0.846     27.133      0.000      21.280      24.615
==============================================================================
Omnibus:                       26.873   Durbin-Watson:                   2.027
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                7.541
Skew:                           0.004   Prob(JB):                       0.0230
Kurtosis:                       2.056   Cond. No.                         13.9
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
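
The fitted line is minutes ≈ 22.95 + 0.90 × friends, reproducing the alpha and beta used in the scatter plot above; an R-squared of 0.329 means the number of friends accounts for about a third of the variation in time spent on the site.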
fig = plt.figure(figsize=(12,6))
fig = sm.graphics.plot_partregress_grid(res, fig = fig)
plt.show()

Regression towards mediocrity#

The concept of regression comes from genetics and was popularized by Sir Francis Galton during the late 19th century with the publication of Regression towards mediocrity in hereditary stature. Galton observed that extreme characteristics (e.g., height) in parents are not passed on completely to their offspring.

df = pd.read_csv('./data/galton.csv')
df['father_above_average'] = df['father'] - df['father'].mean()
df['mother_above_average'] = df['mother'] - df['mother'].mean()
df['height_more_than_father'] = df['height'] - df['father']
df['height_more_than_mother'] = df['height'] - df['mother']
df.head()
   family  father  mother sex  height  nkids  father_above_average  mother_above_average  height_more_than_father  height_more_than_mother
0       1    78.5    67.0   M    73.2      4              9.267149               2.91559                     -5.3                      6.2
1       1    78.5    67.0   F    69.2      4              9.267149               2.91559                     -9.3                      2.2
2       1    78.5    67.0   F    69.0      4              9.267149               2.91559                     -9.5                      2.0
3       1    78.5    67.0   F    69.0      4              9.267149               2.91559                     -9.5                      2.0
4       2    75.5    66.5   M    73.5      4              6.267149               2.41559                     -2.0                      7.0
import seaborn as sns
sns.set(font_scale=1.5)

g = sns.PairGrid(df, y_vars=["height"], x_vars=["father", "mother"], hue="sex", height=4)
g.map(sns.regplot)
g.add_legend();
g = sns.PairGrid(df, y_vars=["height_more_than_father", "height_more_than_mother"], 
                 x_vars=["father_above_average", "mother_above_average"], hue="sex", height=4)
g.map(sns.regplot)
g.add_legend();
results = smf.ols('height ~ father + mother + C(sex) + nkids', data=df).fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 height   R-squared:                       0.641
Model:                            OLS   Adj. R-squared:                  0.639
Method:                 Least Squares   F-statistic:                     398.1
Date:                Fri, 01 Dec 2023   Prob (F-statistic):          9.09e-197
Time:                        15:16:13   Log-Likelihood:                -1960.1
No. Observations:                 898   AIC:                             3930.
Df Residuals:                     893   BIC:                             3954.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      16.1877      2.794      5.794      0.000      10.704      21.671
C(sex)[T.M]     5.2099      0.144     36.125      0.000       4.927       5.493
father          0.3983      0.030     13.472      0.000       0.340       0.456
mother          0.3210      0.031     10.269      0.000       0.260       0.382
nkids          -0.0438      0.027     -1.612      0.107      -0.097       0.010
==============================================================================
Omnibus:                       12.177   Durbin-Watson:                   1.566
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               16.265
Skew:                          -0.149   Prob(JB):                     0.000294
Kurtosis:                       3.588   Cond. No.                     3.68e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.68e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
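
The large condition number here largely comes from using raw heights as predictors; centering them, as in the next model, brings it down from 3.68e+03 to 20.4.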
results2 = smf.ols('height_more_than_father ~ father_above_average + mother_above_average + C(sex) + nkids', data=df).fit()
print(results2.summary())
                               OLS Regression Results                              
===================================================================================
Dep. Variable:     height_more_than_father   R-squared:                       0.672
Model:                                 OLS   Adj. R-squared:                  0.671
Method:                      Least Squares   F-statistic:                     457.6
Date:                     Fri, 01 Dec 2023   Prob (F-statistic):          1.84e-214
Time:                             15:16:16   Log-Likelihood:                -1960.1
No. Observations:                      898   AIC:                             3930.
Df Residuals:                          893   BIC:                             3954.
Df Model:                                4                                         
Covariance Type:                 nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept               -4.9011      0.201    -24.426      0.000      -5.295      -4.507
C(sex)[T.M]              5.2099      0.144     36.125      0.000       4.927       5.493
father_above_average    -0.6017      0.030    -20.351      0.000      -0.660      -0.544
mother_above_average     0.3210      0.031     10.269      0.000       0.260       0.382
nkids                   -0.0438      0.027     -1.612      0.107      -0.097       0.010
==============================================================================
Omnibus:                       12.177   Durbin-Watson:                   1.566
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               16.265
Skew:                          -0.149   Prob(JB):                     0.000294
Kurtosis:                       3.588   Cond. No.                         20.4
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
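
The coefficient on father_above_average is the point of the exercise: each inch a father stands above the average predicts that his child ends up about 0.6 inches less above him. Exceptional height is only partially inherited, which is exactly Galton's regression towards mediocrity.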
fig = plt.figure(figsize=(8,8))
fig = sm.graphics.plot_partregress_grid(results, fig = fig)
plt.show()
fig = plt.figure(figsize=(12,12))
fig = sm.graphics.plot_partregress_grid(results2, fig = fig)
plt.show()
