Linear Regression


The Model

$$y_i = \beta x_i + \alpha + \epsilon_i$$

where

  • $y_i$ is the number of minutes user $i$ spends on the site daily,

  • $x_i$ is the number of friends user $i$ has,

  • $\beta$ is the slope: how many extra minutes per day each additional friend predicts,

  • $\alpha$ is the intercept: the predicted value of $y_i$ when $x_i = 0$,

  • $\varepsilon_i$ is a (hopefully small) error term representing the fact that there are other factors not accounted for by this simple model.
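
Ignoring the error term, the model's prediction is a one-liner in code; a minimal sketch:

def predict(alpha: float, beta: float, x_i: float) -> float:
    """Predicted minutes per day for a user with x_i friends."""
    return beta * x_i + alpha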

Least Squares Fit



If we know $\alpha$ and $\beta$, then we can make predictions.

Since we know the actual output $y_i$, we can compute the prediction error for each pair.

If we simply summed these errors, the negative ones would cancel out the positive ones, so we sum squared errors instead.

The least squares solution is to choose the $\alpha$ and $\beta$ that make the sum of squared errors as small as possible.

$$ y_i = \alpha + \beta x_i + \varepsilon_i $$

$$ \hat\varepsilon_i = y_i - \alpha - \beta x_i $$

$$ \text{Find } \min_{\alpha, \beta} Q(\alpha, \beta), \quad \text{for } Q(\alpha, \beta) = \sum_{i=1}^n \hat\varepsilon_i^{\,2} = \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2 $$
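
Setting the partial derivatives of $Q$ to zero yields the closed-form solution $\hat\beta = \operatorname{Cov}(x, y)/\operatorname{Var}(x)$ and $\hat\alpha = \bar y - \hat\beta\,\bar x$; a minimal sketch:

import numpy as np

def least_squares_fit(x, y):
    """Alpha and beta that minimize the sum of squared errors."""
    x, y = np.asarray(x), np.asarray(y)
    beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # Cov(x, y) / Var(x)
    alpha = y.mean() - beta * x.mean()  # the fitted line passes through the means
    return alpha, beta

Applied to the num_friends_good and daily_minutes_good data below, this reproduces the alpha ≈ 22.95 and beta ≈ 0.904 used in the plotting cell.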

Geometrically, the mean squared error corresponds to Euclidean distance: the least squares method tries to find the straight line that minimizes the total Euclidean distance from all of the sample points to the line.


%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf 
import matplotlib
matplotlib.style.use('fivethirtyeight')
num_friends_good = [49,41,40,25,21,21,19,19,18,18,16,15,15,15,15,14,14,13,13,13,13,12,12,11,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,8,8,8,8,8,8,8,8,8,8,8,8,8,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
daily_minutes_good = [68.77,51.25,52.08,38.36,44.54,57.13,51.4,41.42,31.22,34.76,54.01,38.79,47.59,49.1,27.66,41.03,36.73,48.65,28.12,46.62,35.57,32.98,35,26.07,23.77,39.73,40.57,31.65,31.21,36.32,20.45,21.93,26.02,27.34,23.49,46.94,30.5,33.8,24.23,21.4,27.94,32.24,40.57,25.07,19.42,22.39,18.42,46.96,23.72,26.41,26.97,36.76,40.32,35.02,29.47,30.2,31,38.11,38.18,36.31,21.03,30.86,36.07,28.66,29.08,37.28,15.28,24.17,22.31,30.17,25.53,19.85,35.37,44.6,17.23,13.47,26.33,35.02,32.09,24.81,19.33,28.77,24.26,31.98,25.73,24.86,16.28,34.51,15.23,39.72,40.8,26.06,35.76,34.76,16.13,44.04,18.03,19.65,32.62,35.59,39.43,14.18,35.24,40.13,41.82,35.45,36.07,43.67,24.61,20.9,21.9,18.79,27.61,27.21,26.61,29.77,20.59,27.53,13.82,33.2,25,33.1,36.65,18.63,14.87,22.2,36.81,25.53,24.62,26.25,18.21,28.08,19.42,29.79,32.8,35.99,28.32,27.79,35.88,29.06,36.28,14.1,36.63,37.49,26.9,18.58,38.48,24.48,18.95,33.55,14.24,29.04,32.51,25.63,22.22,19,32.73,15.16,13.9,27.2,32.01,29.27,33,13.74,20.42,27.32,18.23,35.35,28.48,9.08,24.62,20.12,35.26,19.92,31.02,16.49,12.16,30.7,31.22,34.65,13.13,27.51,33.2,31.57,14.1,33.42,17.44,10.12,24.42,9.82,23.39,30.93,15.03,21.67,31.09,33.29,22.61,26.89,23.48,8.38,27.81,32.35,23.84]
# intercept and slope from the least squares fit (see the statsmodels summary below)
alpha, beta = 22.9475, 0.90386
plt.scatter(num_friends_good, daily_minutes_good)
plt.plot(num_friends_good, [alpha + beta * i for i in num_friends_good], 'b-')
plt.xlabel('# of friends', fontsize=20)
plt.ylabel('minutes per day', fontsize=20)
plt.title('linear regression', fontsize=20)
plt.show()
[figure: scatter of minutes per day vs. number of friends with the fitted line]

Of course, we need a better way to figure out how well we’ve fit the data than staring at the graph.

A common measure is the coefficient of determination (or R-squared), which measures the fraction of the total variation in the dependent variable that is captured by the model.
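
As a concrete check, $R^2 = 1 - \mathrm{SSE}/\mathrm{SST}$ can be computed directly from the residuals; a minimal sketch:

import numpy as np

def r_squared(x, y, alpha, beta):
    """Fraction of the variation in y captured by the line alpha + beta * x."""
    x, y = np.asarray(x), np.asarray(y)
    sse = np.sum((y - (alpha + beta * x)) ** 2)  # unexplained (residual) variation
    sst = np.sum((y - y.mean()) ** 2)            # total variation around the mean
    return 1 - sse / sst

print(r_squared(num_friends_good, daily_minutes_good, alpha, beta))  # ~0.329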

The Matrix Method

$$ y_i = X_i^T w$$

The constant term can be represented by a column of 1s in $X$.

The squared error can then be written as:

$$ \sum_{i = 1}^m (y_i -X_i^T w)^2 $$


We can also write this in matrix notation as $(y-Xw)^T(y-Xw)$.

If we take the derivative of this with respect to $w$, we get

$$-2X^T(y-Xw)$$

We can set this to zero and solve for w to get the following equation:

$$\hat w = (X^T X)^{-1}X^T y$$
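
This closed form assumes $X^TX$ is invertible, i.e. no column of $X$ is an exact linear combination of the others. As a quick check of the formula on synthetic data (the variable names here are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X_demo = np.column_stack([np.ones(100), rng.normal(size=100)])  # intercept column + one feature
w_true = np.array([2.0, 3.0])
y_demo = X_demo @ w_true + 0.1 * rng.normal(size=100)

# solve the normal equations (X^T X) w = X^T y rather than inverting explicitly
w_hat = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)
print(w_hat)  # close to [2.0, 3.0]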

# https://github.com/computational-class/machinelearninginaction/blob/master/Ch08/regression.py
import pandas as pd
import random

dat = pd.read_csv('../data/ex0.txt', sep='\t', names=['x1', 'x2', 'y'])
# add a third feature built from y itself: 0.3*y plus uniform noise in [0, 0.5)
dat['x3'] = [yi * .3 + .5 * random.random() for yi in dat['y']]
dat.head()
    x1        x2         y        x3
0  1.0  0.067732  3.176513  1.432701
1  1.0  0.427810  3.816464  1.149538
2  1.0  0.995731  4.550095  1.638831
3  1.0  0.738336  4.256571  1.587099
4  1.0  0.981083  4.560815  1.840716
from numpy import mat, linalg, corrcoef

def standRegres(xArr, yArr):
    """Ordinary least squares via the normal equations: w = (X^T X)^{-1} X^T y."""
    xMat = mat(xArr)
    yMat = mat(yArr).T
    xTx = xMat.T * xMat
    if linalg.det(xTx) == 0.0:
        # X^T X is singular, so the closed-form solution does not exist
        print("This matrix is singular, cannot do inverse")
        return
    ws = xTx.I * (xMat.T * yMat)
    return ws
xs = [[dat.x1[i], dat.x2[i], dat.x3[i]] for i in dat.index]
y = dat.y
print(xs[:2])
ws = standRegres(xs, y)
print(ws)
[[1.0, 0.067732, 1.4327013306690486], [1.0, 0.42781, 1.1495375558478735]]
[[2.92237739]
 [1.65664071]
 [0.07440558]]
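
For comparison, NumPy's built-in least-squares solver recovers the same coefficients without forming and inverting $X^TX$ explicitly (it uses an SVD, so it also copes with singular designs):

import numpy as np

w_lstsq, *_ = np.linalg.lstsq(np.asarray(xs), np.asarray(y), rcond=None)
print(w_lstsq)  # should be close to the ws returned by standRegres
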
xMat = mat(xs)
yMat = mat(y)
xCopy = xMat.copy()
xCopy.sort(0)  # sort so the fitted line plots left to right
yHat = xCopy * ws
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(xMat[:, 1].flatten().A[0], yMat.T[:, 0].flatten().A[0])
ax.plot(xCopy[:, 1], yHat, 'r-')
plt.ylim(0, 5)
plt.show()
[figure: ex0 scatter (x2 vs. y) with the fitted regression line]
yHat = xMat*ws
corrcoef(yHat.T, yMat)
array([[ 1.        ,  0.98666465],
       [ 0.98666465,  1.        ]])

The off-diagonal entry is the correlation between the predicted and actual values; for a least squares fit that includes a constant column, its square (here about $0.9867^2 \approx 0.973$) equals the model's $R^2$.

Regression with Statsmodels

http://www.statsmodels.org/stable/index.html

statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and for statistical data exploration.

dat = pd.read_csv('../data/ex0.txt', sep='\t', names=['x1', 'x2', 'y'])
# x3 again depends on y, this time with uniform noise subtracted
dat['x3'] = [yi * .3 - .1 * random.random() for yi in dat['y']]
dat.head()
    x1        x2         y        x3
0  1.0  0.067732  3.176513  0.901243
1  1.0  0.427810  3.816464  1.114610
2  1.0  0.995731  4.550095  1.312534
3  1.0  0.738336  4.256571  1.266192
4  1.0  0.981083  4.560815  1.314824
results = smf.ols('y ~ x2 + x3', data=dat).fit()

results.summary()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.984
Model:                            OLS   Adj. R-squared:                  0.984
Method:                 Least Squares   F-statistic:                     6062.
Date:                Sun, 07 Jun 2020   Prob (F-statistic):          1.19e-177
Time:                        08:56:46   Log-Likelihood:                 267.56
No. Observations:                 200   AIC:                            -529.1
Df Residuals:                     197   BIC:                            -519.2
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.7243      0.111     15.508      0.000       1.505       1.944
x2             0.9567      0.066     14.576      0.000       0.827       1.086
x3             1.4967      0.129     11.579      0.000       1.242       1.752
==============================================================================
Omnibus:                       11.851   Durbin-Watson:                   1.973
Prob(Omnibus):                  0.003   Jarque-Bera (JB):                5.411
Skew:                          -0.143   Prob(JB):                       0.0668
Kurtosis:                       2.247   Cond. No.                         63.4
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
fig = plt.figure(figsize=(12,8))
fig = sm.graphics.plot_partregress_grid(results, fig = fig)
plt.show()
[figure: partial regression plots for y ~ x2 + x3]
# simple regression of daily minutes on number of friends
import numpy as np
X = np.array(num_friends_good)
X = sm.add_constant(X, prepend=False)  # append a column of 1s for the intercept
mod = sm.OLS(daily_minutes_good, X)
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.329
Model:                            OLS   Adj. R-squared:                  0.326
Method:                 Least Squares   F-statistic:                     98.60
Date:                Sun, 07 Jun 2020   Prob (F-statistic):           3.68e-19
Time:                        08:56:55   Log-Likelihood:                -711.76
No. Observations:                 203   AIC:                             1428.
Df Residuals:                     201   BIC:                             1434.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.9039      0.091      9.930      0.000       0.724       1.083
const         22.9476      0.846     27.133      0.000      21.280      24.615
==============================================================================
Omnibus:                       26.873   Durbin-Watson:                   2.027
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                7.541
Skew:                           0.004   Prob(JB):                       0.0230
Kurtosis:                       2.056   Cond. No.                         13.9
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
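
Rather than reading numbers off the printed table, they can be pulled from the results object directly; note that they match the alpha and beta hard-coded in the earlier plotting cell:

print(res.params)    # array([ 0.9039, 22.9476]): slope first, constant last
print(res.rsquared)  # ~0.329
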
fig = plt.figure(figsize=(12,6))
fig = sm.graphics.plot_partregress_grid(res, fig = fig)
plt.show()
[figure: partial regression plot of daily minutes vs. number of friends]

Regression towards mediocrity

The concept of regression comes from genetics and was popularized by Sir Francis Galton in the late 19th century with the publication of "Regression towards mediocrity in hereditary stature". Galton observed that extreme characteristics (e.g., height) in parents are not passed on completely to their offspring: children of unusually tall or short parents tend to be closer to the average height than their parents are.

df = pd.read_csv('../data/galton.csv')
# center each parent's height and compute the child-parent height gaps
df['father_above_average'] = df['father'] - df['father'].mean()
df['mother_above_average'] = df['mother'] - df['mother'].mean()
df['height_more_than_father'] = df['height'] - df['father']
df['height_more_than_mother'] = df['height'] - df['mother']
df.head()
   family  father  mother sex  height  nkids  father_above_average  mother_above_average  height_more_than_father  height_more_than_mother
0       1    78.5    67.0   M    73.2      4              9.267149               2.91559                     -5.3                      6.2
1       1    78.5    67.0   F    69.2      4              9.267149               2.91559                     -9.3                      2.2
2       1    78.5    67.0   F    69.0      4              9.267149               2.91559                     -9.5                      2.0
3       1    78.5    67.0   F    69.0      4              9.267149               2.91559                     -9.5                      2.0
4       2    75.5    66.5   M    73.5      4              6.267149               2.41559                     -2.0                      7.0
import seaborn as sns
sns.set(font_scale=1.5)

g = sns.PairGrid(df, y_vars=["height"], x_vars=["father", "mother"], hue="sex", height=4)
g.map(sns.regplot)
g.add_legend();
[figure: child height vs. father and mother height, colored by sex]
g = sns.PairGrid(df, y_vars=["height_more_than_father", "height_more_than_mother"], 
                 x_vars=["father_above_average", "mother_above_average"], hue="sex", height=4)
g .map(sns.regplot)
g.add_legend();
[figure: height gap over each parent vs. parent height above average, colored by sex]
results = smf.ols('height ~ father + mother + C(sex) + nkids', data=df).fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 height   R-squared:                       0.641
Model:                            OLS   Adj. R-squared:                  0.639
Method:                 Least Squares   F-statistic:                     398.1
Date:                Sun, 07 Jun 2020   Prob (F-statistic):          9.09e-197
Time:                        11:04:16   Log-Likelihood:                -1960.1
No. Observations:                 898   AIC:                             3930.
Df Residuals:                     893   BIC:                             3954.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      16.1877      2.794      5.794      0.000      10.704      21.671
C(sex)[T.M]     5.2099      0.144     36.125      0.000       4.927       5.493
father          0.3983      0.030     13.472      0.000       0.340       0.456
mother          0.3210      0.031     10.269      0.000       0.260       0.382
nkids          -0.0438      0.027     -1.612      0.107      -0.097       0.010
==============================================================================
Omnibus:                       12.177   Durbin-Watson:                   1.566
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               16.265
Skew:                          -0.149   Prob(JB):                     0.000294
Kurtosis:                       3.588   Cond. No.                     3.68e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.68e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
results2 = smf.ols('height_more_than_father ~ father_above_average + mother_above_average + C(sex) + nkids', data=df).fit()
print(results2.summary())
                               OLS Regression Results                              
===================================================================================
Dep. Variable:     height_more_than_father   R-squared:                       0.672
Model:                                 OLS   Adj. R-squared:                  0.671
Method:                      Least Squares   F-statistic:                     457.6
Date:                     Sun, 07 Jun 2020   Prob (F-statistic):          1.84e-214
Time:                             11:04:21   Log-Likelihood:                -1960.1
No. Observations:                      898   AIC:                             3930.
Df Residuals:                          893   BIC:                             3954.
Df Model:                                4                                         
Covariance Type:                 nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept               -4.9011      0.201    -24.426      0.000      -5.295      -4.507
C(sex)[T.M]              5.2099      0.144     36.125      0.000       4.927       5.493
father_above_average    -0.6017      0.030    -20.351      0.000      -0.660      -0.544
mother_above_average     0.3210      0.031     10.269      0.000       0.260       0.382
nkids                   -0.0438      0.027     -1.612      0.107      -0.097       0.010
==============================================================================
Omnibus:                       12.177   Durbin-Watson:                   1.566
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               16.265
Skew:                          -0.149   Prob(JB):                     0.000294
Kurtosis:                       3.588   Cond. No.                         20.4
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
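
A quick sanity check ties the two fits together: because the second model just recenters the predictors and moves father to the left-hand side, its father_above_average coefficient is exactly the first model's father coefficient minus one (0.3983 − 1 = −0.6017), while the sex, mother, and nkids coefficients are unchanged.

import numpy as np

# same model in different coordinates: moving father to the left-hand side
# shifts its slope down by exactly 1; centering only changes the intercept
assert np.isclose(results.params['father'] - 1,
                  results2.params['father_above_average'])
assert np.isclose(results.params['mother'], results2.params['mother_above_average'])
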
fig = plt.figure(figsize=(8,8))
fig = sm.graphics.plot_partregress_grid(results, fig = fig)
plt.show()
[figure: partial regression plots for the height model]
fig = plt.figure(figsize=(12,12))
fig = sm.graphics.plot_partregress_grid(results2, fig = fig)
plt.show()
[figure: partial regression plots for the height-gap model]
