Logistic Regression of Titanic Data

Statsmodels

http://statsmodels.sourceforge.net/

Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests.

An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and for each estimator.

Researchers across fields may find that statsmodels fully meets their needs for statistical computing and data analysis in Python.

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm

Features include:

  • Linear regression models

  • Generalized linear models

  • Discrete choice models

  • Robust linear models

  • Many models and functions for time series analysis

  • Nonparametric estimators

  • A collection of datasets for examples

  • A wide range of statistical tests

  • Input-output tools for producing tables in a number of formats and for reading Stata files into NumPy and Pandas.

  • Plotting functions

  • Extensive unit tests to ensure correctness of results

  • Many more models and extensions in development

train = pd.read_csv('./data/tatanic_train.csv',sep = ",", header=0)
test = pd.read_csv('./data/tatanic_test.csv',sep = ",", header=0)

Describing Data

  • .describe() summarizes the columns/features of the DataFrame, including the count of observations, mean, max and so on.

  • Another useful trick is to look at the dimensions of the DataFrame through its .shape attribute (e.g., your_data.shape), which returns a (rows, columns) tuple.

train.head()
Unnamed: 0 PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
train.describe()
Unnamed: 0 PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 445.000000 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 0.000000 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 222.500000 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 445.000000 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 667.500000 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 890.000000 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
train.shape  # (number of rows, number of columns)
(891, 13)
# First three values of the Survived column (1 = survived, 0 = did not survive)
train["Survived"][:3] 
0    0
1    1
2    1
Name: Survived, dtype: int64

Value Counts

value_counts() returns the frequency of each distinct value in the specified column as a Series.

# Passengers that survived vs passengers that passed away
train["Survived"].value_counts()
0    549
1    342
Name: Survived, dtype: int64
# As proportions
train["Survived"].value_counts(normalize = True)
0    0.616162
1    0.383838
Name: Survived, dtype: float64
train['Sex'].value_counts()
male      577
female    314
Name: Sex, dtype: int64
# First three female passengers
train[train['Sex'] == 'female'][:3]
Unnamed: 0 PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
# Survived and Fare for the first three male passengers
train[["Survived", 'Fare']][train["Sex"] == 'male'][:3]
Survived Fare
0 0 7.2500
4 0 8.0500
5 0 8.4583
# Males that survived vs males that passed away
train["Survived"][train["Sex"] == 'male'].value_counts() 
0    468
1    109
Name: Survived, dtype: int64
# Females that survived vs Females that passed away
train["Survived"][train["Sex"] == 'female'].value_counts() 
1    233
0     81
Name: Survived, dtype: int64
# Normalized male survival
train["Survived"][train["Sex"] == 'male'].value_counts(normalize = True) 
0    0.811092
1    0.188908
Name: Survived, dtype: float64
# Normalized female survival
train["Survived"][train["Sex"] == 'female'].value_counts(normalize = True)
1    0.742038
0    0.257962
Name: Survived, dtype: float64
# Create the column Child: 1 if the passenger is under 5 years old, 0 otherwise.
# Rows with a missing Age stay NaN. Using .loc avoids chained assignment.
train["Child"] = float('NaN')
train.loc[train["Age"] < 5, "Child"] = 1
train.loc[train["Age"] >= 5, "Child"] = 0
print(train["Child"][:3])
0    0.0
1    0.0
2    0.0
Name: Child, dtype: float64
# Normalized survival rates for children under 5
train.Survived[train.Child == 1].value_counts(normalize = True)
1    0.675
0    0.325
Name: Survived, dtype: float64
# Normalized survival rates for passengers aged 5 and over
train.Survived[train.Child == 0].value_counts(normalize = True)
0    0.609792
1    0.390208
Name: Survived, dtype: float64

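Pivot Table (pivot_table)

A pivot table takes specified columns of the original DataFrame as the row index and the column index, and then applies an aggregation function (the mean by default) to the specified value column.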
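As a minimal sketch of this idea, the toy DataFrame below is pivoted with one column as the row index and another as the column index. The column names and values are made up for illustration and are not taken from the Titanic data. With no aggfunc given, pivot_table averages the values that fall into each cell.

```python
import pandas as pd

# Toy data (hypothetical, for illustration only)
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "kind": ["x", "y", "x", "x", "y"],
    "value": [1.0, 2.0, 3.0, 5.0, 4.0],
})

# "group" becomes the row index, "kind" the column index;
# each cell holds the mean of "value" for that combination.
table = toy.pivot_table(values="value", index="group", columns="kind")
print(table)

# Passing aggfunc="mean" explicitly gives the same table.
explicit = toy.pivot_table(values="value", index="group",
                           columns="kind", aggfunc="mean")
```

The cell for group "b" and kind "x" averages the two rows with values 3.0 and 5.0.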

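Contingency Table (crosstab)

A cross tabulation is a special kind of pivot table that counts how often each combination of groups occurs.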

Compute a simple cross tabulation of two (or more) factors. By default computes a frequency table of the factors unless an array of values and an aggregation function are passed.
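The sketch below illustrates the default behaviour on a small made-up DataFrame (the values are hypothetical, not the Titanic data). It shows that the frequency table from pd.crosstab matches a groupby count reshaped into a table, and that normalize='index' turns each row into proportions.

```python
import pandas as pd

# Toy data (hypothetical, for illustration only)
toy = pd.DataFrame({
    "sex": ["female", "male", "male", "female", "male"],
    "survived": [1, 0, 1, 1, 0],
})

# Default: count how often each (sex, survived) pair occurs.
counts = pd.crosstab(toy["sex"], toy["survived"])

# The same counts, built by hand with groupby and unstack.
same_counts = (toy.groupby(["sex", "survived"])
                  .size()
                  .unstack(fill_value=0))

# Row-wise proportions: each row sums to 1.
rates = pd.crosstab(toy["sex"], toy["survived"], normalize="index")

print(counts)
print(rates)
```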

pd.crosstab(train['Sex'],train['Survived'],margins=True)
Survived 0 1 All
Sex
female 81 233 314
male 468 109 577
All 549 342 891
pd.crosstab(train['Sex'],train['Survived'],margins=True, normalize='index')
Survived 0 1
Sex
female 0.257962 0.742038
male 0.811092 0.188908
All 0.616162 0.383838
pd.crosstab(train['Sex'],[train['Survived'], train['Pclass']],margins=True)
Survived      0              1            All
Pclass        1    2    3    1    2    3
Sex
female        3    6   72   91   70   72  314
male         77   91  300   45   17   47  577
All          80   97  372  136   87  119  891
pd.crosstab(train['Sex'],[train['Survived'], train['Pclass']], normalize='index')
Survived         0                             1
Pclass           1         2         3         1         2         3
Sex
female    0.009554  0.019108  0.229299  0.289809  0.222930  0.229299
male      0.133449  0.157712  0.519931  0.077990  0.029463  0.081456
# Survival rate by sex and class (mean of Survived in each cell)
pd.crosstab(train['Sex'], train['Pclass'], values=train['Survived'], aggfunc='mean')
Pclass 1 2 3
Sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447
# The same survival rates, with row and column margins added
pd.crosstab(train['Sex'], train['Pclass'], values=train['Survived'], aggfunc='mean', margins=True)
Pclass 1 2 3 All
Sex
female 0.968085 0.921053 0.500000 0.742038
male 0.368852 0.157407 0.135447 0.188908
All 0.629630 0.472826 0.242363 0.383838
train[['Survived','Sex','Pclass']].pivot_table(index=['Sex','Pclass'])
               Survived
Sex    Pclass
female 1       0.968085
       2       0.921053
       3       0.500000
male   1       0.368852
       2       0.157407
       3       0.135447
train[['Fare','Sex','Pclass']].pivot_table(index=['Sex','Pclass'])
                     Fare
Sex    Pclass
female 1       106.125798
       2        21.970121
       3        16.118810
male   1        67.226127
       2        19.741782
       3        12.661633
age = pd.cut(train['Age'], [0, 18, 80])
train.pivot_table('Survived', ['Sex', age], 'Pclass')
Pclass                  1         2         3
Sex    Age
female (0, 18]   0.909091  1.000000  0.511628
       (18, 80]  0.972973  0.900000  0.423729
male   (0, 18]   0.800000  0.600000  0.215686
       (18, 80]  0.375000  0.071429  0.133663
fare = pd.qcut(train['Fare'], 2)
train.pivot_table('Survived', ['Sex', age], [fare, 'Pclass'])
Fare            (-0.001, 14.454]                      (14.454, 512.329]
Pclass                         1         2         3                  1         2         3
Sex    Age
female (0, 18]               NaN  1.000000  0.714286           0.909091  1.000000  0.318182
       (18, 80]              NaN  0.880000  0.444444           0.972973  0.914286  0.391304
male   (0, 18]               NaN  0.000000  0.260870           0.800000  0.818182  0.178571
       (18, 80]              0.0  0.098039  0.125000           0.391304  0.030303  0.192308

Logistic Regression


The logistic function (a kind of sigmoid function): $$y = \frac{1}{1+e^{-z}} = \frac{1}{1+e^{-(w^Tx + b)}}$$

The log odds (logit): $$\operatorname{logit}(y) = \ln \frac{y}{1-y} = w^Tx + b$$
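The two formulas are inverses of each other: solving $y = 1/(1+e^{-z})$ for $z$ gives $z = \ln\frac{y}{1-y}$. The short sketch below checks this numerically. The function names sigmoid and logit are chosen here for illustration and are not part of the notebook.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(y):
    """Log odds: maps a probability in (0, 1) back to the real line."""
    return np.log(y / (1.0 - y))

z = np.array([-2.0, 0.0, 1.5])   # example linear scores w^T x + b
y = sigmoid(z)                   # corresponding probabilities
recovered = logit(y)             # should reproduce z

print(y)
print(recovered)
```

A score of 0 corresponds to a probability of 0.5, that is, even odds.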

# load data with pandas
import pandas as pd
import statsmodels.api as sm

train = pd.read_csv('../data/tatanic_train.csv',sep = ",", header=0)

Data Cleaning

# Fill missing Age and Fare values with the column median
train["Age"] = train["Age"].fillna(train["Age"].median())
train["Fare"] = train["Fare"].fillna(train["Fare"].median())
# Forward-fill any missing Sex values, then encode female as 1 and male as 0
train['Sex'] = train['Sex'].ffill()
train['female'] = (train['Sex'] == 'female').astype(int)
# Fill missing Embarked values with 'S', then create dummies for C and Q (S is the baseline)
train["Embarked"] = train["Embarked"].fillna('S')
train['embarked_c'] = (train['Embarked'] == 'C').astype(int)
train['embarked_q'] = (train['Embarked'] == 'Q').astype(int)
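# Fit the logit model. sm.Logit does not add an intercept on its own, so this
# specification has no constant term (sm.add_constant would add one).
logit = sm.Logit(train['Survived'],
                 train[['female', 'Fare', 'Age', 'Pclass', 'embarked_c', 'embarked_q']])
result = logit.fit()
result.summary()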
Optimization terminated successfully.
         Current function value: 0.458435
         Iterations 6
Logit Regression Results
Dep. Variable:      Survived           No. Observations:   891
Model:              Logit              Df Residuals:       885
Method:             MLE                Df Model:           5
Date:               Sun, 07 Jun 2020   Pseudo R-squ.:      0.3116
Time:               11:17:13           Log-Likelihood:     -408.47
converged:          True               LL-Null:            -593.33
Covariance Type:    nonrobust          LLR p-value:        9.902e-78

                coef   std err         z    P>|z|     [0.025    0.975]
female        2.6053     0.182    14.296    0.000      2.248     2.962
Fare          0.0041     0.002     1.973    0.049   2.66e-05     0.008
Age          -0.0113     0.005    -2.216    0.027     -0.021    -0.001
Pclass       -0.6710     0.070    -9.528    0.000     -0.809    -0.533
embarked_c    0.6349     0.226     2.814    0.005      0.193     1.077
embarked_q    0.3518     0.310     1.134    0.257     -0.256     0.960
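Because the model is linear in the log odds, exponentiating a coefficient gives an odds ratio: the factor by which the odds of survival change when that variable increases by one unit, holding the others fixed. The sketch below uses the coefficients copied from the summary above. The example passenger is hypothetical, and the model has no intercept, so the linear score is just the weighted sum of the inputs. With the fitted object at hand, np.exp(result.params) should give the same odds ratios.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the regression summary
coef = pd.Series({
    "female": 2.6053, "Fare": 0.0041, "Age": -0.0113,
    "Pclass": -0.6710, "embarked_c": 0.6349, "embarked_q": 0.3518,
})

# Odds ratio for a one-unit increase in each variable
odds_ratios = np.exp(coef)
print(odds_ratios)

# Hypothetical passenger: 30-year-old woman, first class,
# fare of 50, embarked at Southampton (both dummies are 0)
passenger = pd.Series({
    "female": 1, "Fare": 50.0, "Age": 30.0,
    "Pclass": 1, "embarked_c": 0, "embarked_q": 0,
})

z = (coef * passenger).sum()       # linear score (log odds)
p = 1.0 / (1.0 + np.exp(-z))       # predicted survival probability
print(z, p)
```

An odds ratio above 1 (female, embarked_c) raises the odds of survival, while one below 1 (Age, Pclass) lowers them.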
