Logistic Regression of Titanic Data

Statsmodels#

http://statsmodels.sourceforge.net/

Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests.

An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and each estimator.

Researchers across fields may find that statsmodels fully meets their needs for statistical computing and data analysis in Python.

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm

Features include:

Linear regression models
Generalized linear models
Discrete choice models
Robust linear models
Many models and functions for time series analysis
Nonparametric estimators
A collection of datasets for examples
A wide range of statistical tests
Input-output tools for producing tables in a number of formats and for reading Stata files into NumPy and Pandas.
Plotting functions
Extensive unit tests to ensure correctness of results
Many more models and extensions in development

train = pd.read_csv('./data/tatanic_train.csv',sep = ",", header=0)
test = pd.read_csv('./data/tatanic_test.csv',sep = ",", header=0)

Describing Data#

.describe() summarizes the numeric columns (features) of the DataFrame, reporting the count of non-missing observations, mean, standard deviation, minimum, quartiles, and maximum.
Another useful check is the dimensions of the DataFrame, available through its .shape attribute (e.g., your_data.shape), which returns a (rows, columns) tuple.

train.head()

	Unnamed: 0	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

train.describe()

	Unnamed: 0	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	445.000000	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	0.000000	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	222.500000	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	445.000000	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	667.500000	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	890.000000	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

# (rows, columns); len(train) returns the row count alone
train.shape

(891, 13)

# First three values of the Survived column (0 = died, 1 = survived)
train["Survived"][:3]

0    0
1    1
2    1
Name: Survived, dtype: int64

Value Counts#

value_counts() returns the frequency of each distinct value in the specified column as a Series.

# Passengers that survived vs passengers that passed away
train["Survived"].value_counts()

0    549
1    342
Name: Survived, dtype: int64

# As proportions
train["Survived"].value_counts(normalize = True)

0    0.616162
1    0.383838
Name: Survived, dtype: float64

train['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

# First three female passengers
train[train['Sex'] == 'female'][:3]

	Unnamed: 0	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
1	1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S

# Survived and Fare for the first three male passengers
train.loc[train["Sex"] == 'male', ["Survived", "Fare"]][:3]

	Survived	Fare
0	0	7.2500
4	0	8.0500
5	0	8.4583

# Males that survived vs males that passed away
train["Survived"][train["Sex"] == 'male'].value_counts() 

0    468
1    109
Name: Survived, dtype: int64

# Females that survived vs Females that passed away
train["Survived"][train["Sex"] == 'female'].value_counts() 

1    233
0     81
Name: Survived, dtype: int64

# Normalized male survival
train["Survived"][train["Sex"] == 'male'].value_counts(normalize = True) 

0    0.811092
1    0.188908
Name: Survived, dtype: float64

# Normalized female survival
train["Survived"][train["Sex"] == 'female'].value_counts(normalize = True) 

1    0.742038
0    0.257962
Name: Survived, dtype: float64

# Create the column Child: 1 if Age < 5, 0 if Age >= 5, NaN where Age is missing.
# Assign through .loc; chained indexing such as train.Child[train.Age < 5] = 1
# raises a SettingWithCopyWarning and may write to a temporary copy.
train["Child"] = float('NaN')
train.loc[train.Age < 5, "Child"] = 1
train.loc[train.Age >= 5, "Child"] = 0
print(train.Child[:3])

0    0.0
1    0.0
2    0.0
Name: Child, dtype: float64
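Mask-based labelling like the Child column is safest when the row mask and the column label go into a single .loc call, because chained indexing (df.col[mask] = value) can trigger pandas' SettingWithCopyWarning. A minimal sketch on made-up ages (the toy frame is an assumption, not the Titanic data), which also shows that rows with a missing Age keep NaN:

```python
import numpy as np
import pandas as pd

# Made-up ages; one is missing on purpose.
toy = pd.DataFrame({"Age": [2.0, 30.0, np.nan, 4.0, 5.0]})

# Start with NaN everywhere so rows with a missing Age stay unlabeled.
toy["Child"] = np.nan

# A single .loc call selects rows by mask and the column by label,
# so the assignment writes into toy itself rather than a temporary copy.
toy.loc[toy["Age"] < 5, "Child"] = 1
toy.loc[toy["Age"] >= 5, "Child"] = 0

print(toy)
```

Comparisons against NaN evaluate to False, so neither mask matches the row with the missing age and it remains NaN.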

# Normalized survival rates for children under 5
train.Survived[train.Child == 1].value_counts(normalize = True)

1    0.675
0    0.325
Name: Survived, dtype: float64

# Normalized survival rates for passengers aged 5 and over
train.Survived[train.Child == 0].value_counts(normalize = True)

0    0.609792
1    0.390208
Name: Survived, dtype: float64

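Pivot Table (pivotTab)#

A pivot table takes specified columns of an existing DataFrame as the row index and the column index, then applies an aggregation function (mean by default) to the specified value columns.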
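A minimal sketch of that mechanism with DataFrame.pivot_table, using a small made-up frame (the toy data is an assumption; the column names mirror the Titanic data):

```python
import pandas as pd

# Made-up toy data with the same column names as the Titanic training set.
toy = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "female", "male"],
    "Pclass": [1, 3, 1, 3, 3, 3],
    "Survived": [1, 0, 1, 1, 0, 0],
})

# Sex becomes the row index, Pclass the column index, and each cell holds
# the mean of Survived for that combination (mean is the default aggfunc).
rates = toy.pivot_table(values="Survived", index="Sex", columns="Pclass")
print(rates)

# Passing aggfunc changes the aggregation, e.g. counting passengers per cell.
counts = toy.pivot_table(values="Survived", index="Sex", columns="Pclass", aggfunc="count")
print(counts)
```

Since Survived is coded 0/1, the default mean in each cell reads directly as a survival rate.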

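Contingency Table (crossTab)#

A cross tabulation is a special kind of pivot table used to count how often each combination of groups occurs.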

Compute a simple cross tabulation of two (or more) factors. By default computes a frequency table of the factors unless an array of values and an aggregation function are passed.
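A minimal sketch of both modes of pd.crosstab on a small made-up frame (the toy data is an assumption): plain frequency counts, and aggregation when values and aggfunc are supplied, which matches what pivot_table computes.

```python
import numpy as np
import pandas as pd

# Made-up toy data with the same column names as the Titanic training set.
toy = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "female", "male"],
    "Pclass": [1, 3, 1, 3, 3, 3],
    "Survived": [1, 0, 1, 1, 0, 0],
})

# With only the two factors, crosstab counts how often each combination occurs.
freq = pd.crosstab(toy["Sex"], toy["Pclass"])
print(freq)

# With values and aggfunc, crosstab aggregates instead of counting.
mean_ct = pd.crosstab(toy["Sex"], toy["Pclass"], values=toy["Survived"], aggfunc="mean")
print(mean_ct)

# The same table built with pivot_table, for comparison.
mean_pt = toy.pivot_table(values="Survived", index="Sex", columns="Pclass")
print(np.allclose(mean_ct.values, mean_pt.values))
```

crosstab takes the factors as separate arrays or Series, whereas pivot_table takes column names of a single DataFrame, so crosstab is convenient when the factors do not live in the same frame.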

pd.crosstab(train['Sex'],train['Survived'],margins=True)

Survived	0	1	All
Sex
female	81	233	314
male	468	109	577
All	549	342	891

pd.crosstab(train['Sex'],train['Survived'],margins=True, normalize='index')

Survived	0	1
Sex
female	0.257962	0.742038
male	0.811092	0.188908
All	0.616162	0.383838

pd.crosstab(train['Sex'],[train['Survived'], train['Pclass']],margins=True)

Survived	0			1			All
Pclass	1	2	3	1	2	3
Sex
female	3	6	72	91	70	72	314
male	77	91	300	45	17	47	577
All	80	97	372	136	87	119	891

pd.crosstab(train['Sex'],[train['Survived'], train['Pclass']], normalize='index')

Survived	0			1
Pclass	1	2	3	1	2	3
Sex
female	0.009554	0.019108	0.229299	0.289809	0.222930	0.229299
male	0.133449	0.157712	0.519931	0.077990	0.029463	0.081456

import numpy as np
pd.crosstab(train['Sex'],train['Pclass'], values=train['Survived'], aggfunc=np.average)

Pclass	1	2	3
Sex
female	0.968085	0.921053	0.500000
male	0.368852	0.157407	0.135447

pd.crosstab(train['Sex'],train['Pclass'], values=train['Survived'], aggfunc=np.average, margins=True)

Pclass	1	2	3	All
Sex
female	0.968085	0.921053	0.500000	0.742038
male	0.368852	0.157407	0.135447	0.188908
All	0.629630	0.472826	0.242363	0.383838

train[['Survived','Sex','Pclass']].pivot_table(index=['Sex','Pclass'])

		Survived
Sex	Pclass
female	1	0.968085
	2	0.921053
	3	0.500000
male	1	0.368852
	2	0.157407
	3	0.135447

train[['Fare','Sex','Pclass']].pivot_table(index=['Sex','Pclass'])

		Fare
Sex	Pclass
female	1	106.125798
	2	21.970121
	3	16.118810
male	1	67.226127
	2	19.741782
	3	12.661633

age = pd.cut(train['Age'], [0, 18, 80])
train.pivot_table('Survived', ['Sex', age], 'Pclass')

	Pclass	1	2	3
Sex	Age
female	(0, 18]	0.909091	1.000000	0.511628
female	(18, 80]	0.972973	0.900000	0.423729
male	(0, 18]	0.800000	0.600000	0.215686
male	(18, 80]	0.375000	0.071429	0.133663

fare = pd.qcut(train['Fare'], 2)
train.pivot_table('Survived', ['Sex', age], [fare, 'Pclass'])

	Fare	(-0.001, 14.454]			(14.454, 512.329]
	Pclass	1	2	3	1	2	3
Sex	Age
female	(0, 18]	NaN	1.000000	0.714286	0.909091	1.000000	0.318182
female	(18, 80]	NaN	0.880000	0.444444	0.972973	0.914286	0.391304
male	(0, 18]	NaN	0.000000	0.260870	0.800000	0.818182	0.178571
male	(18, 80]	0.0	0.098039	0.125000	0.391304	0.030303	0.192308