Chapter 5: Statistical Thinking#


  1. Introduction

  2. A Crash Course in Python

  3. Visualizing Data

  4. Linear Algebra

  5. Statistics

  6. Probability

  7. Hypothesis and Inference

  8. Gradient Descent

_images/school.png

The School of Athens by Raphael (1509–1510), fresco at the Apostolic Palace, Vatican City. https://en.wikipedia.org/wiki/Platonic_Academy

_images/socrats2.png

The School of Athens by Raphael (1509–1510), fresco at the Apostolic Palace, Vatican City. https://en.wikipedia.org/wiki/Platonic_Academy

_images/plato.jpg

Plato & Typological Thinking#

Pythagoras held that

  • all things are number

  • the cosmos comes from numerical principles.

The theory of Forms (or theory of Ideas), attributed to Plato, holds that the physical world is not as real or true as the timeless, absolute, unchangeable Ideas.

_images/plato2.png

True knowledge resides in universal and eternal laws.

The physical world of becoming is an imitation of the mathematical world of being.

  • the realm of being (the world of essences, the world of Ideas)

    • perfect, eternal, and changeless Forms

  • the sensible world of becoming (the empirical world)

    • imperfect

Social Physics: Taking Physics as a Role Model#

  • In 1803 the French social thinker Henri de Saint-Simon proposed describing society using laws similar to those of the physical and biological sciences.

  • His student and collaborator Auguste Comte, the French philosopher who founded sociology, first defined the term:

Social physics is that science which occupies itself with social phenomena, considered in the same light as astronomical, physical, chemical, and physiological phenomena, that is to say as being subject to natural and invariable laws, the discovery of which is the special object of its researches.

  • Computational Social Science

https://en.wikipedia.org/wiki/Social_physics

_images/Adolphe_Quételet.jpg

Lambert Adolphe Jacques Quetelet (1796–1874) introduced statistical methods to the social sciences in his book Essays on Social Physics, proposing:

  • the concept of the “average man”,

  • characterized by the mean values of measured variables,

  • which follow a normal distribution.

  • He collected data about many such variables.

    • developed the body mass index scale

“His goal was to understand the statistical laws underlying such phenomena as crime rates, marriage rates or suicide rates. He wanted to explain the values of these variables by other social factors”.

_images/darwin.jpg

Population Thinking#

Charles Robert Darwin (12 February 1809 – 19 April 1882)

Favourable variations would make organisms better at surviving and passing the variations on to their offspring, while unfavourable variations would be lost.

  • Variation is the basis of natural selection.

In typological thinking, the mean is the central concern. In population thinking, what matters is variation; the mean is only one summary characteristic of the population, a means of investigating real causes rather than a cause itself.

Statisticism or Damn Lies#

“Statisticism” naively takes statistics to be a complete foundation for scientific method, neglecting:

  • improving measurement instruments

  • research design and conceptualization

Duncan, O. D. 1984. Notes on Social Measurement, Historical and Critical. New York: Russell Sage Foundation, p. 226.

The Paradigm of Demography#

Otis Dudley Duncan (1921-2004) established a new scholarly tradition:

  • disdain for imitating the natural sciences’ attempt to find universal laws;

  • documenting and understanding empirical patterns in real populations comes first;

  • variation is the essence of human society.

    • For Plato, by contrast, variation was a poor copy of the world of essences.

_images/aristotle2.png

The School of Athens by Raphael (1509–1510), fresco at the Apostolic Palace, Vatican City. https://en.wikipedia.org/wiki/Platonic_Academy

_images/galton.jpg

Ontology: The Nature of the World#

“I believe the natural sciences take ‘excavating’ the truths of the world of essences as their ultimate goal; that is their essence. The social sciences, by contrast, take ‘understanding’ the world as it has formed as their ultimate goal. Throughout history many people have wanted to find in social science a kind of truth that applies everywhere, and many such attempts have been made. I do not think social science should be like that. In the social sciences our goal is to understand the real world, not to excavate eternal truths. This may differ from what you imagine. … Since variation is the essence of the world, variation should be the object of study.” (Yu Xie 谢宇)

  • Galton thought Quetelet’s social physics was of little use; the average man is not a universal answer.

    • Left hand in ice, right hand in fire: what does the average temperature tell you?

  • Galton held that (social) scientific inquiry must attend to variation and covariation.

    • variation & covariation

The measurements have both

  • a central tendency, or mean, and

  • a spread around this central value, or variance.

    • In the late 1860s, Galton conceived of a measure to quantify normal variation:

      • the standard deviation.

  • “Regression to mediocrity”

Epistemology: The Origin, Nature, Methods, and Limits of Human Knowledge#

Yu Xie: “What exactly can you know, and how do you come to know the world?”

  • The natural sciences pursue eternal truths and focus on typical phenomena;

    • typical phenomena & the average man

  • The social sciences focus on the state of the population made up of all individual cases.

Methodology: What Methods to Use#

Yu Xie: “Social science is complicated because the data we use come from observation, and observational data are inevitably affected by extraneous factors, any of which might explain your data.”

  • The natural sciences use experiments to isolate the influence of extraneous factors;

  • “The social sciences can use statistics to rule out some extraneous influences, but you cannot rule out all extraneous factors.”

Three Basic Principles of Social Science Research#

  • Variability Principle

  • Social Grouping Principle

  • Social Context Principle

Statistics for Describing Data#

The mathematics and techniques with which we understand data.

from collections import Counter
import math
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

matplotlib.style.use('ggplot')
def dot(v, w):
    """v_1 * w_1 + ... + v_n * w_n"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def sum_of_squares(v):
    """v_1 * v_1 + ... + v_n * v_n"""
    return dot(v, v)
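
A quick sanity check of these helpers (the example values here are ours, not part of the original):

dot([1, 2, 3], [4, 5, 6])   # 1*4 + 2*5 + 3*6 = 32
sum_of_squares([3, 4])      # 3*3 + 4*4 = 25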
daily_minutes = [1,68.77,51.25,52.08,38.36,44.54,57.13,
                 51.4,41.42,31.22,34.76,54.01,38.79,
                 47.59,49.1,27.66,41.03,36.73,48.65,28.12,
                 46.62,35.57,32.98,35,26.07,23.77,39.73,
                 40.57,31.65,31.21,36.32,20.45,21.93,26.02,
                 27.34,23.49,46.94,30.5,33.8,24.23,21.4,27.94,
                 32.24,40.57,25.07,19.42,22.39,18.42,46.96,23.72,
                 26.41,26.97,36.76,40.32,35.02,29.47,30.2,31,
                 38.11,38.18,36.31,21.03,30.86,36.07,28.66,
                 29.08,37.28,15.28,24.17,22.31,30.17,25.53,
                 19.85,35.37,44.6,17.23,13.47,26.33,35.02,
                 32.09,24.81,19.33,28.77,24.26,31.98,25.73,
                 24.86,16.28,34.51,15.23,39.72,40.8,26.06,
                 35.76,34.76,16.13,44.04,18.03,19.65,32.62,
                 35.59,39.43,14.18,35.24,40.13,41.82,35.45,
                 36.07,43.67,24.61,20.9,21.9,18.79,27.61,27.21,
                 26.61,29.77,20.59,27.53,13.82,33.2,25,33.1,36.65,
                 18.63,14.87,22.2,36.81,25.53,24.62,26.25,18.21,
                 28.08,19.42,29.79,32.8,35.99,28.32,27.79,35.88,29.06,
                 36.28,14.1,36.63,37.49,26.9,18.58,38.48,24.48,
                 18.95,33.55,14.24,29.04,32.51,25.63,22.22,19,
                 32.73,15.16,13.9,27.2,32.01,29.27,33,13.74,20.42,
                 27.32,18.23,35.35,28.48,9.08,24.62,20.12,35.26,
                 19.92,31.02,16.49,12.16,30.7,31.22,34.65,13.13,
                 27.51,33.2,31.57,14.1,33.42,17.44,10.12,24.42,
                 9.82,23.39,30.93,15.03,21.67,31.09,33.29,22.61,
                 26.89,23.48,8.38,27.81,32.35,23.84]
num_friends = [100,49,41,40,25,21,21,19,19,18,
               18,16,15,15,15,15,14,14,13,13,
               13,13,12,12,11,10,10,10,10,10,
               10,10,10,10,10,10,10,10,10,10,
               9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,
               9,9,9,8,8,8,8,8,8,8,8,8,8,8,8,8
               ,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
               6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
               6,6,6,6,6,6,5,5,5,5,5,5,5,5,5,5,
               5,5,5,5,5,5,5,4,4,4,4,4,4,4,4,4,
               4,4,4,4,4,4,4,4,4,4,4,3,3,3,3,
               3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
               2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
               1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
               1,1,1,1,1,1,1]

Distribution and Histogram#

time_counts = Counter(map(int, daily_minutes))  # truncate each value to its integer part and count
xs = range(69)
ys = [time_counts[x] for x in xs]
plt.bar(xs, ys)
plt.axis([0,69,0,14])
plt.title("Histogram of Daily Minutes")
plt.xlabel("Daily minutes")
plt.ylabel("# of people")
plt.show()
_images/c637b1d85825922660f8357829108b9236172332a3ffd2158ad894e6d82f023e.png
friend_counts = Counter(num_friends)
xs = range(101)
ys = [friend_counts[x] for x in xs]
plt.bar(xs, ys)
plt.axis([0,101,0,25])
plt.title("Histogram of Friend Counts")
plt.xlabel("# of friends")
plt.ylabel("# of people")
plt.show()
_images/be02207d54c19d6a5b1eb2167d3d194e1e4c1598c97cbacd2d40fc6193e07edc.png

We can also draw them with plt.hist

plt.hist(daily_minutes, bins = 50)
plt.xlabel('Daily minutes')
plt.ylabel('Frequency')
plt.show()
_images/95eaca23e2f408eb457645e6b3c64015bd0d4d9ebcb11afe5f0f8c87be669090.png
plt.hist(num_friends, bins= 50)
plt.xlabel("# of friends")
plt.ylabel('Frequency')
plt.show()
_images/99315c9deb56a877899c9760d3f5f47be69ad63c16833545b6c66a0916b23151.png

Unfortunately, this chart is still too difficult to interpret.

  • So you start generating some statistics.

From Max to Min#

num_points = len(num_friends)               # 204

largest_value = max(num_friends)            # 100
smallest_value = min(num_friends)           # 1

print(num_points, largest_value, smallest_value)
204 100 1
sorted_values = sorted(num_friends)
smallest_value = sorted_values[0]           # 1

second_smallest_value = sorted_values[1]    # 1
second_largest_value = sorted_values[-2]    # 49

Mean, Median, Mode, and Quantile#

def mean(x):
    return sum(x) / len(x)
print("mean(num_friends)", mean(num_friends))
mean(num_friends) 7.333333333333333
np.mean(num_friends)
7.333333333333333
def median(v):
    """finds the 'middle-most' value of v"""
    n = len(v)
    sorted_v = sorted(v)
    midpoint = n // 2

    if n % 2 == 1: 
        # if odd, return the middle value
        return sorted_v[midpoint]
    else:
        # if even, return the average of the middle values
        lo = midpoint - 1
        hi = midpoint
        return (sorted_v[lo] + sorted_v[hi]) / 2
print("median(num_friends)", median(num_friends))
median(num_friends) 6.0
np.median(num_friends)
6.0
def quantile(x, p):
    """returns the value at fraction p (0 <= p <= 1) of the sorted data"""
    p_index = int(p * len(x))
    return sorted(x)[p_index]
print("quantile(num_friends, 0.10)", quantile(num_friends, 0.10))
print("quantile(num_friends, 0.25)", quantile(num_friends, 0.25))
print("quantile(num_friends, 0.50)", quantile(num_friends, 0.50))
print("quantile(num_friends, 0.75)", quantile(num_friends, 0.75))
print("quantile(num_friends, 0.90)", quantile(num_friends, 0.90))
quantile(num_friends, 0.10) 1
quantile(num_friends, 0.25) 3
quantile(num_friends, 0.50) 6
quantile(num_friends, 0.75) 9
quantile(num_friends, 0.90) 13
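# note: np.percentile expects a percentage in [0, 100], not a fraction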
np.percentile(num_friends, 90)
13.0
def mode(x):
    """returns a list, might be more than one mode"""
    counts = Counter(x)
    max_count = max(counts.values())
    return [x_i for x_i, count in counts.items()
            if count == max_count]
print("mode(num_friends)", mode(num_friends))
mode(num_friends) [6, 1]
np.argmax(np.bincount(num_friends))
# Only the first occurrence is returned.
1
np.bincount(num_friends)
array([ 0, 22, 17, 20, 20, 17, 22, 15, 13, 18, 15,  1,  2,  4,  2,  4,  1,
        0,  2,  2,  0,  2,  0,  0,  0,  1,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  1,  1,  0,  0,  0,  0,  0,  0,  0,  1,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1])
from scipy import stats
stats.mode(num_friends, axis=None)
ModeResult(mode=array([1]), count=array([22]))
def data_range(x):
    return max(x) - min(x)
print("data_range(num_friends)", data_range(num_friends))
data_range(num_friends) 99
def interquartile_range(x):
    return quantile(x, 0.75) - quantile(x, 0.25)
print("interquartile_range(num_friends)", interquartile_range(num_friends))
interquartile_range(num_friends) 6
_images/boxplot.png
import seaborn as sns
sns.set(style="ticks", palette="pastel")
sns.boxplot(y = daily_minutes);
_images/51aafc4ca63653e20bfdd8282cfcf997919b6d3c3aed536850417daae93ba00d.png
import seaborn as sns
sns.set(style="ticks", palette="pastel")
sns.boxplot(y = num_friends);
_images/b093597c4e8538225487feab7be5a96d08153136057cca27e6e901b1d55f67dc.png

Variance and Standard Deviation#

\[s = \sqrt{\frac{\sum_{i=1}^N (x_i - \overline{x})^2}{N-1}}\]
\[s^2 = \frac{\sum_{i=1}^N (x_i - \overline{x})^2}{N-1}\]

Why is the sample variance divided by n-1? Because the deviations are taken from the sample mean rather than the true mean, the raw sum of squares is too small on average; dividing by n-1 (Bessel’s correction) makes the estimator unbiased.
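
A small simulation (ours, not from the original) makes the correction visible: dividing by n systematically underestimates the true variance, while dividing by n-1 is right on average.

rng = np.random.default_rng(0)
n = 5                                            # small samples make the bias visible
samples = rng.normal(0, 2, size=(100_000, n))    # true variance = 4
biased = samples.var(axis=1, ddof=0).mean()      # divide by n
unbiased = samples.var(axis=1, ddof=1).mean()    # divide by n - 1
print(biased, unbiased)                          # ≈ 3.2 (i.e., 4 * (n-1)/n) vs ≈ 4.0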

def de_mean(x):
    """translate x by subtracting its mean 
    so the result has mean 0"""
    x_bar = mean(x)
    return [x_i - x_bar for x_i in x]
def variance(x):
    """assumes x has at least two elements"""
    n = len(x)
    deviations = de_mean(x)
    return sum_of_squares(deviations) / (n - 1)

print("variance(num_friends)", variance(num_friends))
variance(num_friends) 81.54351395730716
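# np.var defaults to ddof=0 (divides by n); use np.var(num_friends, ddof=1) to match variance() above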
print(np.var(num_friends))
81.14379084967321
def standard_deviation(x):
    return math.sqrt(variance(x))

print("standard_deviation(num_friends)", standard_deviation(num_friends))
standard_deviation(num_friends) 9.03014473623248
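# likewise np.std defaults to ddof=0; np.std(num_friends, ddof=1) matches standard_deviation()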
np.std(num_friends)
9.007984838446012

Covariance, Correlation, and Scatter Plot#

matplotlib.style.use('ggplot')
plt.scatter(num_friends, daily_minutes, 
            alpha = .1)
plt.xlabel('number of friends')
plt.ylabel('daily minutes')
plt.title('outliers')
plt.show()
_images/a5e6bba4ecaab33608e8d7e28b8f15207ed240ac56219c4f2690b9934165875c.png
import seaborn as sns
sns.set(style="white")
g = sns.jointplot(x=num_friends, y=daily_minutes,
                  kind="kde", height=7, space=0)
_images/a68477d1a521014f7041086b6940c3c260cb13ad9b4e25ed0e6e3fdd57294aed.png
def covariance(x, y):
    n = len(x)
    return dot(de_mean(x), de_mean(y)) / (n - 1)

print("covariance(num_friends, daily_minutes)", covariance(num_friends, daily_minutes))
covariance(num_friends, daily_minutes) 22.425435139573064
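# np.cov returns the covariance matrix: variances on the diagonal, covariances off it (ddof=1 by default)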
np.cov(num_friends, daily_minutes) 
array([[ 81.54351396,  22.42543514],
       [ 22.42543514, 100.78589895]])
def correlation(x, y):
    stdev_x = standard_deviation(x)
    stdev_y = standard_deviation(y)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(x, y) / stdev_x / stdev_y
    else:
        return 0 # if no variation, correlation is zero
print("correlation(num_friends, daily_minutes)", correlation(num_friends, daily_minutes))
correlation(num_friends, daily_minutes) 0.24736957366478218
import seaborn as sns
sns.set(style="ticks", palette="pastel")
sns.boxplot(x=num_friends, y=daily_minutes)
sns.despine(offset=10, trim=True)
_images/49d6a59ebbd1a90ab104a32812a156f4954c28abf83bfee4fa3f929aeb3691e6.png
np.corrcoef(num_friends, daily_minutes)
array([[1.        , 0.24736957],
       [0.24736957, 1.        ]])
from scipy.stats import pearsonr

pearsonr(num_friends, daily_minutes)
(0.2473695736647822, 0.0003610473973450684)
outlier = num_friends.index(100) # index of outlier

num_friends_good = [x
                    for i, x in enumerate(num_friends)
                    if i != outlier]

daily_minutes_good = [x
                      for i, x in enumerate(daily_minutes)
                      if i != outlier]


print("correlation(num_friends_good, daily_minutes_good)", \
      correlation(num_friends_good, daily_minutes_good))
correlation(num_friends_good, daily_minutes_good) 0.5736792115665573

Regression Analysis#

slope, intercept, r_value, p_value, std_err = stats.linregress(num_friends_good, daily_minutes_good)
print(slope, intercept, r_value, p_value)
xi = np.arange(1, np.max(num_friends_good) + 1)  # x values for the fitted line (an array, so slope*xi works)
# plotting the line
matplotlib.style.use('ggplot')
y_fit = slope * xi + intercept
plt.scatter(num_friends, daily_minutes)
plt.plot(xi, y_fit, 'r-', label=r'$Linear\,Fit$')
plt.xlabel('$x_i$', fontsize=20)
plt.ylabel('$y$', fontsize=20)
plt.legend(loc=0, numpoints=1, fontsize=13)
plt.show()
0.9038659456058652 22.947552413469026 0.5736792115665575 3.676825862770313e-19
_images/3d388c2d6497049d5bcb6edacf9db57b4eae14cbc1a016196dc6485cb0354bda.png

Comparing Means: The t-Test#


income_male = [1000, 1500, 2000, 3000, 2500, 4000, 5000, 3500]
income_female=[6000, 6200, 7000, 7100, 9000, 10000, 12000]
income_male_median = np.median(income_male)
income_female_median = np.median(income_female)
print(income_male_median, income_female_median) 
2750.0 7100.0
# Plot the boxplot to see
# minimum value, 25%,50%,75% percentile, maximum value
%matplotlib inline
import matplotlib.pyplot as plt
plt.boxplot([income_male, income_female], # meanline=True,showmeans=True,
            labels = ['$male$', '$female$'])
plt.show()
_images/aaf429eb12901bc991241f6ae3b91f20b63a8cbf41c0c531702d3ef89cf60253.png


Welch’s t-test: Equal or unequal sample sizes, unequal variances

https://en.wikipedia.org/wiki/Student%27s_t-test

from scipy import stats
stats.ttest_ind(income_male, income_female)
Ttest_indResult(statistic=-5.757056463981614, pvalue=6.631425817426509e-05)
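
Note that stats.ttest_ind assumes equal variances by default, which is the classic Student’s t-test. To run the Welch’s t-test named above, pass equal_var=False (this variant of the call is our addition):

# Welch's t-test: do not assume equal variances
stats.ttest_ind(income_male, income_female, equal_var=False)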

One-way Analysis of Variance (ANOVA)#

ANOVA: Analysis of Variance

Setup: we have k groups of data, mutually independent and normally distributed, with means \(\mu_1,\mu_2,...\mu_k\) and common variance \(\sigma^2\).

  • H0: all groups share the same mean

  • H1: at least two of the means differ


https://en.wikipedia.org/wiki/One-way_analysis_of_variance

Consider an experiment to study the effect of three different levels of a factor on a response (e.g. three levels of a fertilizer on plant growth). If we had 6 observations for each level, we could write the outcome of the experiment in a table like this, where a1, a2, and a3 are the three levels of the factor being studied.

a1	a2	a3
6	8	13
8	12	9
4	9	11
5	11	8
3	6	7
4	8	12

The null hypothesis, denoted H0, for the overall F-test for this experiment would be that all three levels of the factor produce the same response, on average.

To calculate the F-ratio:

Step 1: Calculate the mean within each group:

\[\begin{split} \begin{aligned}{\overline {Y}}_{1}&={\frac {1}{6}}\sum Y_{1i}={\frac {6+8+4+5+3+4}{6}}=5\\{\overline {Y}}_{2}&={\frac {1}{6}}\sum Y_{2i}={\frac {8+12+9+11+6+8}{6}}=9\\{\overline {Y}}_{3}&={\frac {1}{6}}\sum Y_{3i}={\frac {13+9+11+8+7+12}{6}}=10\end{aligned} \end{split}\]

Step 2: Calculate the overall mean:

\(\overline{Y}=\frac{\sum _{i} \overline{Y_i}}{a}=\frac {\overline{Y_1}+\overline{Y_2}+\overline{Y_3}}{a}=\frac {5+9+10}{3}=8\)

where \(a\) is the number of groups.

Step 3: Calculate the “between-group” sum of squared differences:

\[\begin{split} \begin{aligned}S_{B}&=n({\overline {Y}}_{1}-{\overline {Y}})^{2}+n({\overline {Y}}_{2}-{\overline {Y}})^{2}+n({\overline {Y}}_{3}-{\overline {Y}})^{2}\\[8pt]&=6(5-8)^{2}+6(9-8)^{2}+6(10-8)^{2}=84\end{aligned} \end{split}\]

where n is the number of data values per group.

The between-group degrees of freedom is one less than the number of groups \(f_{b}=3-1=2\)

The between-group mean square value is \(MS_{B}=84/2=42\)

Step 4: Calculate the “within-group” sum of squares. Begin by centering the data in each group

a1	a2	a3
6−5=1	8−9=−1	13−10=3
8−5=3	12−9=3	9−10=−1
4−5=−1	9−9=0	11−10=1
5−5=0	11−9=2	8−10=−2
3−5=−2	6−9=−3	7−10=−3
4−5=−1	8−9=−1	12−10=2

The within-group sum of squares:

\[\begin{split} \begin{aligned}S_{W}=&(1)^{2}+(3)^{2}+(-1)^{2}+(0)^{2}+(-2)^{2}+(-1)^{2}+\\&(-1)^{2}+(3)^{2}+(0)^{2}+(2)^{2}+(-3)^{2}+(-1)^{2}+\\&(3)^{2}+(-1)^{2}+(1)^{2}+(-2)^{2}+(-3)^{2}+(2)^{2}= 68\\\end{aligned} \end{split}\]

The within-group degrees of freedom is

\(f_{W}=a(n-1)=3(6-1)=15\)

Thus the within-group mean square value is

\(MS_{W}=S_{W}/f_{W}=68/15\approx 4.5\)

Step 5: The F-ratio is

\( F={\frac {MS_{B}}{MS_{W}}}\approx 42/4.5\approx 9.3\)

\(F_{crit}(2,15) = 3.68\) at α = 0.05. Since \(F \approx 9.3 > F_{crit}\), the results are significant at the 5% significance level.
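
These hand calculations can be verified in a few lines (a sketch we add; the lists mirror the table above):

a1 = [6, 8, 4, 5, 3, 4]
a2 = [8, 12, 9, 11, 6, 8]
a3 = [13, 9, 11, 8, 7, 12]
groups = [a1, a2, a3]
n, a = 6, 3
grand_mean = sum(sum(g) for g in groups) / (n * a)                        # 8.0
ms_b = sum(n * (sum(g)/n - grand_mean)**2 for g in groups) / (a - 1)      # 84 / 2 = 42
ms_w = sum((x - sum(g)/n)**2 for g in groups for x in g) / (a * (n - 1))  # 68 / 15
print(ms_b / ms_w)                                                        # ≈ 9.26, the F-ratio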


a1 = [6, 8, 4, 5, 3, 4]
a2 = [8, 12, 9, 11, 6, 8]
a3 = [13, 9, 11, 8, 7, 12]

f,p = stats.f_oneway(a1, a2, a3)
print(f,p)
9.264705882352942 0.0023987773293929083
# Are the mean mussel shell lengths the same across the 5 locations?
from scipy import stats
tillamook = [0.0571, 0.0813, 0.0831, 0.0976, 0.0817, 0.0859, 0.0735, 0.0659, 0.0923, 0.0836]
newport = [0.0873, 0.0662, 0.0672, 0.0819, 0.0749, 0.0649, 0.0835,0.0725]
petersburg = [0.0974, 0.1352, 0.0817, 0.1016, 0.0968, 0.1064, 0.105]
magadan = [0.1033, 0.0915, 0.0781, 0.0685, 0.0677, 0.0697, 0.0764, 0.0689]
tvarminne = [0.0703, 0.1026, 0.0956, 0.0973, 0.1039, 0.1045]
plt.boxplot([tillamook, newport, petersburg, magadan, tvarminne], # meanline=True,showmeans=True,
            labels = ['tillamook', 'newport', 'petersburg', 'magadan', 'tvarminne'])
plt.show()
_images/70f80da25834011a766851a825e5a6688c0f40d73c3cc5834f68468d67a54790.png
f,p = stats.f_oneway(tillamook, newport, petersburg, magadan, tvarminne)
print(f,p)
7.121019471642447 0.0002812242314534544
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
import pandas as pd

df = []
labels = ['tillamook', 'newport', 'petersburg', 'magadan', 'tvarminne']
for k, i in enumerate([tillamook, newport, petersburg, magadan, tvarminne]):
    for j in i:
        df.append([labels[k],j])
        
df = pd.DataFrame(df, columns = ['treatment', 'value'])
model = ols('value ~ C(treatment)', df).fit()
print(anova_lm(model)) 
                df    sum_sq   mean_sq         F    PR(>F)
C(treatment)   4.0  0.004520  0.001130  7.121019  0.000281
Residual      34.0  0.005395  0.000159       NaN       NaN

Two-way ANOVA#

ANOVA with interaction

Measurement of fetal head circumference hs, by four observers in three fetuses.

# https://raw.githubusercontent.com/thomas-haslwanter/statsintro_python/master/ipynb/Data/data_altman/altman_12_6.txt
import pandas as pd
df = pd.read_csv('./data/altman_12_6.txt', names=['hs', 'fetus', 'observer'])
df.head()
hs fetus observer
0 14.3 1 1
1 14.0 1 1
2 14.8 1 1
3 13.6 1 2
4 13.6 1 2
from statsmodels.graphics.api import interaction_plot
plt.figure(figsize=(8,6))
fig = interaction_plot(df['fetus'], df['observer'], df['hs'],
        ms=10, ax=plt.gca())
_images/c3694c2235637c85a459cefa6ff4060333e01f44a687ba5c63694df57266e7c8.png
formula = 'hs ~ C(fetus) + C(observer) + C(fetus):C(observer)'
lm = ols(formula, df).fit() 
print(anova_lm(lm))  
                        df      sum_sq     mean_sq            F        PR(>F)
C(fetus)               2.0  324.008889  162.004444  2113.101449  1.051039e-27
C(observer)            3.0    1.198611    0.399537     5.211353  6.497055e-03
C(fetus):C(observer)   6.0    0.562222    0.093704     1.222222  3.295509e-01
Residual              24.0    1.840000    0.076667          NaN           NaN

The Chi-squared Test#

https://en.wikipedia.org/wiki/Chi-squared_test

Suppose we look at the House of Representatives for the 113th Congress. This data is taken from www.senate.gov.

	Republican	Democrat	Total
Male	215	143	358
Female	19	64	83
Total	234	207	441

  • A chi-squared test, also referred to as a χ² test, is any statistical hypothesis test in which the sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true.

  • A chi-squared test can then be used to reject the null hypothesis that the data are independent.

  • Test statistics that follow a chi-squared distribution arise from an assumption of independent normally distributed data, which is valid in many cases due to the central limit theorem.

  • Chi-squared tests are often constructed from a sum of squared errors, or through the sample variance.

Chi-Square Formula#

\(\chi^2 = \sum \frac{(O - E)^2}{E}\)

  • \(\sum\) means to sum up (see Sigma Notation)

  • O = each Observed (actual) value

  • E = each Expected value

So we calculate \((O−E)^2/E\) for each pair of observed and expected values, then sum them all up.

total = 441
republican_prob = 234.0/441
democrat_prob = 207.0/441

male_prob = 358.0/441
female_prob = 83.0/441
male_republican_expected = republican_prob * male_prob * total
male_democrat_expected = democrat_prob * male_prob * total
female_republican_expected = republican_prob * female_prob * total
female_democrat_expected = democrat_prob * female_prob * total
print(male_republican_expected, male_democrat_expected,'\n', 
      female_republican_expected, female_democrat_expected)
189.9591836734694 168.04081632653063 
 44.04081632653061 38.95918367346939

Under the null hypothesis, it has approximately a chi-square distribution whose number of degrees of freedom are

(number of rows−1)(number of columns−1)=(2−1)(2−1)=1.

Then in that “cell” of the table, we have

\[\frac{(observed−expected)^2}{expected}=\frac{(215−189.9)^2}{189.9} = 3.3\]

The sum of these quantities over all of the cells is the Chi-square statistic.

3.3009 + 3.7315 + 14.2378 + 16.0949 ≈ 37.37

Observed

	Republican	Democrat
Male	215	143
Female	19	64

Expected

	Republican	Democrat
Male	189.9	168.0
Female	44.0	38.9

Chi Square

	Republican	Democrat
Male	\(\frac{(215-189.9)^2}{189.9}\)	\(\frac{(143-168.0)^2}{168.0}\)
Female	\(\frac{(19-44.0)^2}{44.0}\)	\(\frac{(64-38.9)^2}{38.9}\)

observed = [215, 143, 19, 64]
expected = [male_republican_expected, male_democrat_expected,
            female_republican_expected, female_democrat_expected]
# The sum of these quantities over all of the cells is the test statistic.
chi_square = np.sum([(o - e)**2 / e for o, e in zip(observed, expected)])
print(chi_square)
37.365036593000426
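
With (2−1)(2−1) = 1 degree of freedom, the p-value follows from the chi-squared distribution (this check is our addition):

from scipy import stats
print(stats.chi2.sf(chi_square, df=1))   # survival function P(X >= 37.37) ≈ 9.8e-10, matching chi2_contingency below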
import scipy
house = [ [ 215, 143 ], [ 19, 64 ] ]
chi2, p, ddof, expected = scipy.stats.chi2_contingency( house, correction = False)
msg = "Test Statistic: {}\np-value: {}\nDegrees of Freedom: {}\n"
print( msg.format( chi2, p, ddof ) ) 
print( expected )
Test Statistic: 37.36503659300042
p-value: 9.796267016333982e-10
Degrees of Freedom: 1

[[189.95918367 168.04081633]
 [ 44.04081633  38.95918367]]
scipy.stats.chi2_contingency?
  • Suppose there is a city of 1 million residents with four neighborhoods: A, B, C, and D.

  • A random sample of 650 residents of the city is taken and their occupation is recorded as “blue collar”, “white collar”, or “no collar”.

  • The null hypothesis is that each person’s neighborhood of residence is independent of the person’s occupational classification. The data are tabulated as:

	A	B	C	D	Total
White collar	90	60	104	95	349
Blue collar	30	50	51	20	151
No collar	30	40	45	35	150
Total	150	150	200	150	650

  • Probability of living in neighborhood A: let us take the sample living in neighborhood A, 150/650, to estimate what proportion of the whole 1 million people live in neighborhood A.

  • Probability of being white collar: similarly we take 349/650 to estimate what proportion of the 1 million people are white-collar workers.

  • Expected number of white-collar workers in neighborhood A: by the assumption of independence under the hypothesis we should “expect” the number of white-collar workers in neighborhood A to be

\[\frac{150}{650}\frac{349}{650}650=80.54\]

Then in that “cell” of the table, we have

\[\frac{(observed−expected)^2}{expected}=\frac{(90−80.54)^2}{80.54} = 1.11\]

The sum of these quantities over all of the cells is the test statistic.

Under the null hypothesis, it has approximately a chi-square distribution whose number of degrees of freedom are

(number of rows−1)(number of columns−1)=(3−1)(4−1)=6.

house = [ [ 90, 60, 104, 95 ], [ 30, 50, 51, 20 ], [30, 40, 45, 35] ]
chi2, p, ddof, expected = scipy.stats.chi2_contingency( house )
msg = "Test Statistic: {}\n p-value: {}\n Degrees of Freedom: {}\n"
print( msg.format( chi2, p, ddof ) )
print( expected )
Test Statistic: 24.5712028585826
 p-value: 0.0004098425861096696
 Degrees of Freedom: 6

[[ 80.53846154  80.53846154 107.38461538  80.53846154]
 [ 34.84615385  34.84615385  46.46153846  34.84615385]
 [ 34.61538462  34.61538462  46.15384615  34.61538462]]
