{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
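"# Case study: the data behind 《转角遇到爱》 (\"Love Around the Corner\")\n",
"\n",
"Another perspective on data journalism and how it is built: an analysis of 《转角遇到爱》 (discussion #43)\n",
"\n",
"- Link to the piece: https://h5.thepaper.cn/html/zt/2018/08/seekinglove/index.html\n",
"- Summary: The piece won a bronze award for best digital design from the Society for News Design (SND) in 2018. Pick a sunny Sunday, leave Shanghai's People's Square metro station by Exit 9, and on your left is the nationally known People's Square marriage market. Men and women who look to be in their fifties and sixties bring umbrellas and small stools and search for a match on behalf of their children. Data reporters from The Paper (澎湃新闻, www.thepaper.cn) and its English-language sister publication Sixth Tone spent six weekends collecting 874 marriage ads, which tell the stories of 618 women and 256 men looking for love.\n",
"- Commentary: https://github.com/data-journalism/data-journalism.github.io/discussions/43"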
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T06:46:47.995690Z",
"start_time": "2021-10-24T06:46:47.992003Z"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"\n",
"plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']  # font that can render the Chinese labels\n",
"plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly; note: this did not work with ['SimHei']\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T06:18:13.389614Z",
"start_time": "2021-10-24T06:18:13.270272Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MPs' expenses claims, Jul-Dec, 2009.xlsx\r\n",
"data.js*\r\n",
"db_new.csv\r\n"
]
}
],
"source": [
"ls './data/'"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T06:18:38.320003Z",
"start_time": "2021-10-24T06:18:38.246064Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" id Gender.self Year.self Born.self Hukou.self Live.self Marriage.self \\\n",
"0 1 女性 1987 N N N N \n",
"1 2 男性 1983 浙江 有上海户口 N N \n",
"2 3 男性 1970 上海 有上海户口 N 单身 \n",
"3 4 男性 1983 上海 有上海户口 上海 N \n",
"4 5 女性 1988 上海 有上海户口 上海 N \n",
"\n",
" Height.self Weight.self Looking.self ... Hobby.wanted Edu.min.wanted \\\n",
"0 1.65 N 皮肤白,大眼睛 ... 无不良嗜好 N \n",
"1 1.83 N N ... N 本科 \n",
"2 1.75 N N ... N N \n",
"3 1.8 N N ... N 本科 \n",
"4 1.69 N 清纯、秀丽、有气质 ... 不抽烟 本科 \n",
"\n",
" Edu.min.n.wanted Job.wanted Salary.min.wanted Apt.wanted Family.wanted \\\n",
"0 N 稳定 N 有婚房 N \n",
"1 4 N N 要有婚房 单亲家庭勿扰 \n",
"2 N 白领工作 N N N \n",
"3 4 稳定工作 N N N \n",
"4 4 N N N N \n",
"\n",
" Other.wanted interesting.wanted Similar.wanted \n",
"0 81年不考虑。家境较好,有独立婚房。靠近长宁区 81年不考虑 N \n",
"1 89年不要,独生女,家庭条件相当 89年不要 Y \n",
"2 条件相当 白领工作 Y \n",
"3 N 本分 N \n",
"4 条件相当 不抽烟 Y \n",
"\n",
"[5 rows x 43 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('./data/db_new.csv')\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T06:19:15.640192Z",
"start_time": "2021-10-24T06:19:15.635514Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"874"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T06:19:20.850626Z",
"start_time": "2021-10-24T06:19:20.835902Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
" id Year.self\n",
"count 874.000000 874.000000\n",
"mean 438.929062 1982.425629\n",
"std 253.149643 7.339553\n",
"min 1.000000 1945.000000\n",
"25% 219.500000 1980.000000\n",
"50% 439.500000 1984.000000\n",
"75% 657.750000 1987.000000\n",
"max 876.000000 1995.000000"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T06:20:06.957776Z",
"start_time": "2021-10-24T06:20:06.953230Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['id', 'Gender.self', 'Year.self', 'Born.self', 'Hukou.self',\n",
" 'Live.self', 'Marriage.self', 'Height.self', 'Weight.self',\n",
" 'Looking.self', 'Personality.self', 'Edu.self', 'Eduno.self',\n",
" 'top.self', 'Abroad.self', 'Major.self', 'Job.self', 'Salary.self',\n",
" 'Apt.self', 'Family.self', 'Hobby.self', 'Other.self',\n",
" 'interesting.self', 'Gender.wanted', 'Year.max.wanted',\n",
" 'Year.min.wanted', 'Year.text.wanted', 'Hukou.wanted', 'Live.wanted',\n",
" 'Marriage.wanted', 'Height.min.wanted', 'Looking.wanted',\n",
" 'Personality.wanted', 'Hobby.wanted', 'Edu.min.wanted',\n",
" 'Edu.min.n.wanted', 'Job.wanted', 'Salary.min.wanted', 'Apt.wanted',\n",
" 'Family.wanted', 'Other.wanted', 'interesting.wanted',\n",
" 'Similar.wanted'],\n",
" dtype='object')"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.columns"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T06:36:42.242387Z",
"start_time": "2021-10-24T06:36:42.099005Z"
}
},
"outputs": [],
"source": [
"# approximate age in 2018, the year the piece was published\n",
"df['Age'] = 2018 - df['Year.self']"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T06:38:16.342486Z",
"start_time": "2021-10-24T06:38:15.932000Z"
}
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(8, 4), dpi=100)\n",
"\n",
"# age distribution by gender; log_scale=True puts the age axis on a log scale\n",
"sns.histplot(\n",
" df,\n",
" x=\"Age\", hue=\"Gender.self\",\n",
" edgecolor=\".3\",\n",
" linewidth=.5,\n",
" log_scale=True,\n",
");"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T06:42:08.666934Z",
"start_time": "2021-10-24T06:42:08.511165Z"
}
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# https://seaborn.pydata.org/generated/seaborn.violinplot.html#seaborn.violinplot\n",
"plt.figure(figsize =(8, 4), dpi = 100)\n",
"sns.violinplot(x=\"Gender.self\", y=\"Age\", data=df);"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T06:47:09.837345Z",
"start_time": "2021-10-24T06:47:09.832998Z"
}
},
"outputs": [],
"source": [
"# 'N' marks a missing value: convert it to NaN and every other entry to float\n",
"df['Height.self'] = [float(i) if i != 'N' else np.nan for i in df['Height.self']]"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T06:47:52.379338Z",
"start_time": "2021-10-24T06:47:52.177822Z"
}
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize =(8, 4), dpi = 100)\n",
"sns.violinplot(x=\"Gender.self\", y=\"Height.self\", data=df);"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T06:42:51.657499Z",
"start_time": "2021-10-24T06:42:51.652864Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array(['N', '有上海户口', '没有上海户口'], dtype=object)"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Hukou.self'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T06:43:13.243742Z",
"start_time": "2021-10-24T06:43:13.238432Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"有上海户口 421\n",
"N 409\n",
"没有上海户口 44\n",
"Name: Hukou.self, dtype: int64"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Hukou.self'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T06:52:10.207425Z",
"start_time": "2021-10-24T06:52:10.201328Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"N 628\n",
"气质佳 14\n",
"帅气 9\n",
"貌佳清秀 8\n",
"清秀 6\n",
" ... \n",
"肤白、身材好 1\n",
"帅 1\n",
"品貌端庄 1\n",
"形象好、气质佳 1\n",
"英俊帅气 1\n",
"Name: Looking.self, Length: 159, dtype: int64"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Looking.self'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T06:52:41.622357Z",
"start_time": "2021-10-24T06:52:41.616607Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 N\n",
"1 N\n",
"2 N\n",
"3 N\n",
"4 善良、进取、阳光、有责任心\n",
" ... \n",
"869 N\n",
"870 性格文静、善良贤惠、老实本分\n",
"871 N\n",
"872 善良\n",
"873 开朗、稳重、有责任心\n",
"Name: Personality.self, Length: 874, dtype: object"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Personality.self']"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T06:54:48.501071Z",
"start_time": "2021-10-24T06:54:48.495102Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"本科 373\n",
"研究生 272\n",
"N 120\n",
"大专 81\n",
"博士 14\n",
"中专 7\n",
"高中 6\n",
"初中 1\n",
"Name: Edu.self, dtype: int64"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Edu.self'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:06:37.089895Z",
"start_time": "2021-10-24T07:06:37.084709Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"4 373\n",
"5 272\n",
"N 120\n",
"3 81\n",
"6 14\n",
"1 7\n",
"2 6\n",
"0 1\n",
"Name: Eduno.self, dtype: int64"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Eduno.self'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:07:04.224003Z",
"start_time": "2021-10-24T07:07:04.218551Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"N 680\n",
"重点大学毕业 194\n",
"Name: top.self, dtype: int64"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['top.self'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:08:46.474953Z",
"start_time": "2021-10-24T07:08:46.469495Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"N 782\n",
"Y 92\n",
"Name: Abroad.self, dtype: int64"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Abroad.self'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:11:01.330966Z",
"start_time": "2021-10-24T07:11:01.324309Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"N 111\n",
"外企 50\n",
"银行 31\n",
"国企 28\n",
"公务员 21\n",
" ... \n",
"互联网研发 1\n",
"上海二级医院 1\n",
"制造企业进出口专员 1\n",
"外企白领 1\n",
"外资航空公司 1\n",
"Name: Job.self, Length: 466, dtype: int64"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Job.self'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:10:23.283854Z",
"start_time": "2021-10-24T07:10:23.278202Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"N 803\n",
"金融 13\n",
"会计 4\n",
"英语 3\n",
"生物化学 3\n",
"电子信息 2\n",
"计算机 2\n",
"传媒 2\n",
"财务管理 2\n",
"法学 2\n",
"财会 2\n",
"外语 2\n",
"金融数学 1\n",
"建筑学 1\n",
"工商管理 1\n",
"法学和经济 1\n",
"中医 1\n",
"播音主持 1\n",
"计算机科学 1\n",
"医学 1\n",
"电力电子 1\n",
"新闻媒体 1\n",
"金融学 1\n",
"护理 1\n",
"同声传译 1\n",
"幼儿师范学前教育 1\n",
"微电子 1\n",
"英语和会计 1\n",
"认证检测 1\n",
"经贸外语 1\n",
"政法 1\n",
"金融统计 1\n",
"电气自动化、工商管理 1\n",
"数据分析 1\n",
"生物 1\n",
"数理统计 1\n",
"绘画,设计 1\n",
"中文 1\n",
"药剂 1\n",
"通信专业 1\n",
"临床医学 1\n",
"服装设计 1\n",
"工程 1\n",
"通信工程 1\n",
"建筑 1\n",
"机电工程 1\n",
"Name: Major.self, dtype: int64"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Major.self'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:11:43.569746Z",
"start_time": "2021-10-24T07:11:43.563988Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"N 553\n",
"120000 64\n",
"200000 46\n",
"300000 27\n",
"60000 17\n",
"96000 15\n",
"100000 14\n",
"84000 14\n",
"240000 13\n",
"180000 13\n",
"250000 12\n",
"150000 12\n",
"400000 10\n",
"72000 9\n",
"500000 9\n",
"78000 4\n",
"350000 3\n",
"1000000 3\n",
"360000 2\n",
"162000 2\n",
"48000 2\n",
"90000 2\n",
"144000 2\n",
"66000 2\n",
"108000 2\n",
"800000 2\n",
"57600 2\n",
"70000 1\n",
"6000000 1\n",
"140000 1\n",
"54000 1\n",
"1300000 1\n",
"36000 1\n",
"600000 1\n",
"700000 1\n",
"52000 1\n",
"370000 1\n",
"42000 1\n",
"450000 1\n",
"660000 1\n",
"270000 1\n",
"160000 1\n",
"156000 1\n",
"24000 1\n",
"74400 1\n",
"Name: Salary.self, dtype: int64"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Salary.self'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:12:57.262759Z",
"start_time": "2021-10-24T07:12:57.257039Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"N 461\n",
"有房 413\n",
"Name: Apt.self, dtype: int64"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Apt.self'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:14:50.067552Z",
"start_time": "2021-10-24T07:14:50.061602Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"N 635\n",
"父母退休 36\n",
"父母已退休 9\n",
"家境好 8\n",
"知识分子家庭 8\n",
" ... \n",
"父母退休、家庭和睦 1\n",
"知识分子家庭出身 1\n",
"父亲在政法部门工作,母亲是教师 1\n",
"父亲母亲已退休 1\n",
"纯朴家风,父母均为事业单位退休 1\n",
"Name: Family.self, Length: 149, dtype: int64"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Family.self'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:16:24.706467Z",
"start_time": "2021-10-24T07:16:24.700302Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"N 775\n",
"无不良嗜好 24\n",
"烟酒不沾 9\n",
"爱国画 3\n",
"兴趣爱好广泛 2\n",
" ... \n",
"爱健身、游泳、做饭 1\n",
"无烟酒不良嗜好 1\n",
"钢琴十级,擅长中英文演讲 1\n",
"爱好书法、古筝 1\n",
"钢琴八级 1\n",
"Name: Hobby.self, Length: 61, dtype: int64"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Hobby.self'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:18:24.758063Z",
"start_time": "2021-10-24T07:18:24.751908Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"N 286\n",
"备有婚房 158\n",
"独生女 23\n",
"家境好 11\n",
"独生女。备有婚房 10\n",
" ... \n",
"爱清洁。名下有1000万房产 1\n",
"三个专业毕业 1\n",
"独生子 1\n",
"有四套房车 1\n",
"独生女,闵行有房,无贷 1\n",
"Name: Other.self, Length: 323, dtype: int64"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Other.self'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:19:33.233114Z",
"start_time": "2021-10-24T07:19:33.226390Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"N 634\n",
"无同居史 4\n",
"自己创业 3\n",
"有绿卡 3\n",
"高知家庭 3\n",
" ... \n",
"稳重大方 1\n",
"活泼文静 1\n",
"外语10级 1\n",
"肤白 1\n",
"明年毕业 1\n",
"Name: interesting.self, Length: 215, dtype: int64"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['interesting.self'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:20:41.416534Z",
"start_time": "2021-10-24T07:20:41.410990Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"N 781\n",
"上海 51\n",
"美国 12\n",
"澳大利亚悉尼 5\n",
"澳大利亚 3\n",
"德国 2\n",
"加拿大 2\n",
"日本 2\n",
"新加坡 2\n",
"加拿大多伦多 2\n",
"山东 1\n",
"浙江杭州 1\n",
"西班牙 1\n",
"英国伦敦 1\n",
"日本大阪 1\n",
"美国纽约 1\n",
"美国加州 1\n",
"西班牙巴塞罗那 1\n",
"美国芝加哥 1\n",
"加拿大温哥华 1\n",
"美国旧金山 1\n",
"江苏昆山 1\n",
"Name: Live.self, dtype: int64"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Live.self'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:21:25.899002Z",
"start_time": "2021-10-24T07:21:25.893364Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"N 777\n",
"Y 97\n",
"Name: Similar.wanted, dtype: int64"
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Similar.wanted'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:22:32.004135Z",
"start_time": "2021-10-24T07:22:31.997765Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"N 754\n",
"五官端正 10\n",
"清秀 10\n",
"相貌端正 5\n",
"貌佳 4\n",
" ... \n",
"相貌较好 1\n",
"容貌稍好 1\n",
"气质佳、品貌优秀 1\n",
"长相好 1\n",
"靓女、甜美可爱 1\n",
"Name: Looking.wanted, Length: 71, dtype: int64"
]
},
"execution_count": 83,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Looking.wanted'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:26:54.473386Z",
"start_time": "2021-10-24T07:26:54.468044Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"N 760\n",
"上海户口 97\n",
"江浙沪 15\n",
"澳大利亚悉尼 1\n",
"美国/加拿大 1\n",
"Name: Hukou.wanted, dtype: int64"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Hukou.wanted'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:27:30.205036Z",
"start_time": "2021-10-24T07:27:30.193707Z"
}
},
"outputs": [],
"source": [
"# dummy variables: 1 if the ad says anything for this field, 0 if it is missing ('N')\n",
"df['Looking.self.dummy'] = [1 if i != 'N' else 0 for i in df['Looking.self']]\n",
"df['Looking.wanted.dummy'] = [1 if i != 'N' else 0 for i in df['Looking.wanted']]\n",
"df['Personality.self.dummy'] = [1 if i != 'N' else 0 for i in df['Personality.self']]\n",
"df['Family.self.dummy'] = [1 if i != 'N' else 0 for i in df['Family.self']]\n",
"df['Hobby.self.dummy'] = [1 if i != 'N' else 0 for i in df['Hobby.self']]\n",
"df['Other.self.dummy'] = [1 if i != 'N' else 0 for i in df['Other.self']]\n",
"df['interesting.self.dummy'] = [1 if i != 'N' else 0 for i in df['interesting.self']]\n",
"df['Hukou.wanted.dummy'] = [1 if i != 'N' else 0 for i in df['Hukou.wanted']]\n"
]
},
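{
"cell_type": "markdown",
"metadata": {},
"source": [
"The list comprehensions above can also be written with vectorized pandas calls. The sketch below is only an illustration on a hypothetical three-row frame named `toy` (not the real data); it assumes, as in this dataset, that the string 'N' marks a missing value. Passing `na_values=['N']` to `pd.read_csv` for the whole file would probably be too aggressive, because columns such as `Abroad.self` and `Similar.wanted` appear to use 'N' as a real no."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# hypothetical toy frame that mimics two columns of db_new.csv\n",
"toy = pd.DataFrame({\n",
"    'Height.self': ['1.65', 'N', '1.8'],\n",
"    'Looking.self': ['N', '清秀', 'N'],\n",
"})\n",
"\n",
"# errors='coerce' turns anything non-numeric (here 'N') into NaN\n",
"height = pd.to_numeric(toy['Height.self'], errors='coerce')\n",
"\n",
"# 1 where the field is filled in, 0 where it is 'N'\n",
"looking_dummy = toy['Looking.self'].ne('N').astype(int)\n",
"\n",
"print(height.tolist())\n",
"print(looking_dummy.tolist())"
]
},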
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T06:50:18.807222Z",
"start_time": "2021-10-24T06:50:18.800519Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['id',\n",
" 'Gender.self',\n",
" 'Year.self',\n",
" 'Born.self',\n",
" 'Hukou.self',\n",
" 'Live.self',\n",
" 'Marriage.self',\n",
" 'Height.self',\n",
" 'Weight.self',\n",
" 'Looking.self',\n",
" 'Personality.self',\n",
" 'Edu.self',\n",
" 'Eduno.self',\n",
" 'top.self',\n",
" 'Abroad.self',\n",
" 'Major.self',\n",
" 'Job.self',\n",
" 'Salary.self',\n",
" 'Apt.self',\n",
" 'Family.self',\n",
" 'Hobby.self',\n",
" 'Other.self',\n",
" 'interesting.self',\n",
" 'Gender.wanted',\n",
" 'Year.max.wanted',\n",
" 'Year.min.wanted',\n",
" 'Year.text.wanted',\n",
" 'Hukou.wanted',\n",
" 'Live.wanted',\n",
" 'Marriage.wanted',\n",
" 'Height.min.wanted',\n",
" 'Looking.wanted',\n",
" 'Personality.wanted',\n",
" 'Hobby.wanted',\n",
" 'Edu.min.wanted',\n",
" 'Edu.min.n.wanted',\n",
" 'Job.wanted',\n",
" 'Salary.min.wanted',\n",
" 'Apt.wanted',\n",
" 'Family.wanted',\n",
" 'Other.wanted',\n",
" 'interesting.wanted',\n",
" 'Similar.wanted']"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"['id', 'Gender.self', 'Year.self', 'Born.self', 'Hukou.self',\n",
" 'Live.self', 'Marriage.self', 'Height.self', 'Weight.self',\n",
" 'Looking.self', 'Personality.self', 'Edu.self', 'Eduno.self',\n",
" 'top.self', 'Abroad.self', 'Major.self', 'Job.self', 'Salary.self',\n",
" 'Apt.self', 'Family.self', 'Hobby.self', 'Other.self',\n",
" 'interesting.self', 'Gender.wanted', 'Year.max.wanted',\n",
" 'Year.min.wanted', 'Year.text.wanted', 'Hukou.wanted', 'Live.wanted',\n",
" 'Marriage.wanted', 'Height.min.wanted', 'Looking.wanted',\n",
" 'Personality.wanted', 'Hobby.wanted', 'Edu.min.wanted',\n",
" 'Edu.min.n.wanted', 'Job.wanted', 'Salary.min.wanted', 'Apt.wanted',\n",
" 'Family.wanted', 'Other.wanted', 'interesting.wanted',\n",
" 'Similar.wanted']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Contingency table analysis"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T06:56:15.254424Z",
"start_time": "2021-10-24T06:56:15.210277Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Looking.self.dummy 0 1 All\n",
"Gender.self \n",
"女性 406 212 618\n",
"男性 222 34 256\n",
"All 628 246 874"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(df['Gender.self'],df['Looking.self.dummy'],margins=True)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T06:56:37.148573Z",
"start_time": "2021-10-24T06:56:37.105086Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Looking.self.dummy 0 1\n",
"Gender.self \n",
"女性 0.656958 0.343042\n",
"男性 0.867188 0.132812\n",
"All 0.718535 0.281465"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(df['Gender.self'],df['Looking.self.dummy'],margins=True, normalize='index')"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T06:59:38.526237Z",
"start_time": "2021-10-24T06:59:38.508379Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[406, 212], [222, 34]]\n",
"Test Statistic: 38.525189790079125\n",
" p-value: 5.405153041255414e-10\n",
" Degrees of Freedom: 1\n",
"\n",
"[[444.05491991 173.94508009]\n",
" [183.94508009 72.05491991]]\n"
]
}
],
"source": [
"import numpy as np\n",
"from scipy import stats\n",
"\n",
"# observed counts as a nested list (rows: gender, columns: dummy value)\n",
"alist = np.array(pd.crosstab(df['Gender.self'], df['Looking.self.dummy'], margins=False)).tolist()\n",
"print(alist)\n",
"\n",
"# Chi-square test of independence; for a 2x2 table SciPy applies\n",
"# Yates' continuity correction by default (correction=True)\n",
"chi2, p, dof, expected = stats.chi2_contingency(alist)\n",
"msg = \"Test Statistic: {}\\n p-value: {}\\n Degrees of Freedom: {}\\n\"\n",
"print(msg.format(chi2, p, dof))\n",
"print(expected)"
]
},
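{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see where the numbers printed above come from, the following sketch recomputes the expected counts and the test statistic by hand for the 2x2 table of gender against `Looking.self.dummy`. The variable names are illustrative and the counts are copied from the printed output; the manual formula assumes SciPy's default Yates continuity correction, which applies to tables with one degree of freedom."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from scipy import stats\n",
"\n",
"# observed counts copied from the printed output above\n",
"observed = np.array([[406, 212], [222, 34]])\n",
"\n",
"# expected count = row total * column total / grand total\n",
"row_totals = observed.sum(axis=1, keepdims=True)\n",
"col_totals = observed.sum(axis=0, keepdims=True)\n",
"expected = row_totals * col_totals / observed.sum()\n",
"\n",
"# plain Pearson statistic, then the Yates version that shrinks\n",
"# each |observed - expected| by 0.5 before squaring\n",
"chi2_plain = ((observed - expected) ** 2 / expected).sum()\n",
"chi2_yates = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()\n",
"\n",
"chi2_scipy, p, dof, expected_scipy = stats.chi2_contingency(observed)\n",
"print(chi2_plain, chi2_yates, chi2_scipy)"
]
},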
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T06:57:38.874823Z",
"start_time": "2021-10-24T06:57:38.836237Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Personality.self.dummy 0 1 All\n",
"Gender.self \n",
"女性 371 247 618\n",
"男性 173 83 256\n",
"All 544 330 874"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(df['Gender.self'],df['Personality.self.dummy'],margins=True)"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T06:58:00.202271Z",
"start_time": "2021-10-24T06:58:00.160745Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Personality.self.dummy 0 1\n",
"Gender.self \n",
"女性 0.600324 0.399676\n",
"男性 0.675781 0.324219\n",
"All 0.622426 0.377574"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(df['Gender.self'],df['Personality.self.dummy'],margins=True, normalize='index')"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:00:23.105353Z",
"start_time": "2021-10-24T07:00:23.087789Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[371, 247], [173, 83]]\n",
"Test Statistic: 4.070440019813376\n",
" p-value: 0.04363990497274837\n",
" Degrees of Freedom: 1\n",
"\n",
"[[384.6590389 233.3409611]\n",
" [159.3409611 96.6590389]]\n"
]
}
],
"source": [
"alist = np.array(pd.crosstab(df['Gender.self'], df['Personality.self.dummy'], margins=False)).tolist()\n",
"print(alist)\n",
"\n",
"# Chi-square test of independence (Yates-corrected by default for a 2x2 table)\n",
"chi2, p, dof, expected = stats.chi2_contingency(alist)\n",
"msg = \"Test Statistic: {}\\n p-value: {}\\n Degrees of Freedom: {}\\n\"\n",
"print(msg.format(chi2, p, dof))\n",
"print(expected)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:01:48.574356Z",
"start_time": "2021-10-24T07:01:48.526096Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Edu.self N 中专 初中 博士 大专 本科 研究生 高中 All\n",
"Gender.self \n",
"女性 74 3 1 4 47 278 207 4 618\n",
"男性 46 4 0 10 34 95 65 2 256\n",
"All 120 7 1 14 81 373 272 6 874"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(df['Gender.self'],df['Edu.self'],margins=True)"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:02:26.422921Z",
"start_time": "2021-10-24T07:02:26.371978Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Edu.self N 中专 初中 博士 大专 本科 \\\n",
"Gender.self \n",
"女性 0.119741 0.004854 0.001618 0.006472 0.076052 0.449838 \n",
"男性 0.179688 0.015625 0.000000 0.039062 0.132812 0.371094 \n",
"All 0.137300 0.008009 0.001144 0.016018 0.092677 0.426773 \n",
"\n",
"Edu.self 研究生 高中 \n",
"Gender.self \n",
"女性 0.334951 0.006472 \n",
"男性 0.253906 0.007812 \n",
"All 0.311213 0.006865 "
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(df['Gender.self'],df['Edu.self'],margins=True, normalize='index')"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:03:14.977664Z",
"start_time": "2021-10-24T07:03:14.957983Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[74, 3, 1, 4, 47, 278, 207, 4], [46, 4, 0, 10, 34, 95, 65, 2]]\n",
"Test Statistic: 32.566864831926054\n",
" p-value: 3.187587277414393e-05\n",
" Degrees of Freedom: 7\n",
"\n",
"[[ 84.85125858 4.94965675 0.70709382 9.8993135 57.27459954\n",
" 263.74599542 192.32951945 4.24256293]\n",
" [ 35.14874142 2.05034325 0.29290618 4.1006865 23.72540046\n",
" 109.25400458 79.67048055 1.75743707]]\n"
]
}
],
"source": [
"alist = np.array(pd.crosstab(df['Gender.self'], df['Edu.self'], margins=False)).tolist()\n",
"print(alist)\n",
"\n",
"# Chi-square test of independence: gender vs. education level.\n",
"# Several expected counts printed below are under 5, so treat the p-value with caution.\n",
"chi2, p, dof, expected = stats.chi2_contingency(alist)\n",
"msg = \"Test Statistic: {}\\n p-value: {}\\n Degrees of Freedom: {}\\n\"\n",
"print(msg.format(chi2, p, dof))\n",
"print(expected)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:04:41.795432Z",
"start_time": "2021-10-24T07:04:41.757324Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th>Hukou.self</th><th>N</th><th>有上海户口</th><th>没有上海户口</th><th>All</th></tr>\n",
"    <tr><th>Gender.self</th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>女性</th><td>293</td><td>290</td><td>35</td><td>618</td></tr>\n",
"    <tr><th>男性</th><td>116</td><td>131</td><td>9</td><td>256</td></tr>\n",
"    <tr><th>All</th><td>409</td><td>421</td><td>44</td><td>874</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Hukou.self N 有上海户口 没有上海户口 All\n",
"Gender.self \n",
"女性 293 290 35 618\n",
"男性 116 131 9 256\n",
"All 409 421 44 874"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(df['Gender.self'],df['Hukou.self'],margins=True)"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:05:10.881381Z",
"start_time": "2021-10-24T07:05:10.840172Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th>Hukou.self</th><th>N</th><th>有上海户口</th><th>没有上海户口</th></tr>\n",
"    <tr><th>Gender.self</th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>女性</th><td>0.474110</td><td>0.469256</td><td>0.056634</td></tr>\n",
"    <tr><th>男性</th><td>0.453125</td><td>0.511719</td><td>0.035156</td></tr>\n",
"    <tr><th>All</th><td>0.467963</td><td>0.481693</td><td>0.050343</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Hukou.self N 有上海户口 没有上海户口\n",
"Gender.self \n",
"女性 0.474110 0.469256 0.056634\n",
"男性 0.453125 0.511719 0.035156\n",
"All 0.467963 0.481693 0.050343"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(df['Gender.self'],df['Hukou.self'],margins=True, normalize = 'index')"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:05:36.153888Z",
"start_time": "2021-10-24T07:05:36.137371Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[293, 290, 35], [116, 131, 9]]\n",
"Test Statistic: 2.506628461110095\n",
" p-value: 0.28555682567352375\n",
" Degrees of Freedom: 2\n",
"\n",
"[[289.201373 297.68649886 31.11212815]\n",
" [119.798627 123.31350114 12.88787185]]\n"
]
}
],
"source": [
"alist = np.array(pd.crosstab(df['Gender.self'], df['Hukou.self'], margins=False)).tolist()\n",
"print(alist)\n",
"\n",
"# Chi-square test of independence: gender vs. Shanghai hukou status\n",
"chi2, p, dof, expected = stats.chi2_contingency(alist)\n",
"msg = \"Test Statistic: {}\\n p-value: {}\\n Degrees of Freedom: {}\\n\"\n",
"print(msg.format(chi2, p, dof))\n",
"print(expected)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:07:27.876117Z",
"start_time": "2021-10-24T07:07:27.838061Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th>top.self</th><th>N</th><th>重点大学毕业</th><th>All</th></tr>\n",
"    <tr><th>Gender.self</th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>女性</th><td>470</td><td>148</td><td>618</td></tr>\n",
"    <tr><th>男性</th><td>210</td><td>46</td><td>256</td></tr>\n",
"    <tr><th>All</th><td>680</td><td>194</td><td>874</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"top.self N 重点大学毕业 All\n",
"Gender.self \n",
"女性 470 148 618\n",
"男性 210 46 256\n",
"All 680 194 874"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(df['Gender.self'],df['top.self'],margins=True)"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:07:52.683890Z",
"start_time": "2021-10-24T07:07:52.642530Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th>top.self</th><th>N</th><th>重点大学毕业</th></tr>\n",
"    <tr><th>Gender.self</th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>女性</th><td>0.760518</td><td>0.239482</td></tr>\n",
"    <tr><th>男性</th><td>0.820312</td><td>0.179688</td></tr>\n",
"    <tr><th>All</th><td>0.778032</td><td>0.221968</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"top.self N 重点大学毕业\n",
"Gender.self \n",
"女性 0.760518 0.239482\n",
"男性 0.820312 0.179688\n",
"All 0.778032 0.221968"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(df['Gender.self'],df['top.self'],margins=True, normalize = 'index')"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:08:06.234954Z",
"start_time": "2021-10-24T07:08:06.218357Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[470, 148], [210, 46]]\n",
"Test Statistic: 3.4093710125149688\n",
" p-value: 0.0648271518621084\n",
" Degrees of Freedom: 1\n",
"\n",
"[[480.82379863 137.17620137]\n",
" [199.17620137 56.82379863]]\n"
]
}
],
"source": [
"alist = np.array(pd.crosstab(df['Gender.self'], df['top.self'], margins=False)).tolist()\n",
"print(alist)\n",
"\n",
"# Chi-square test of independence: gender vs. top-university graduate.\n",
"# For a 2x2 table, chi2_contingency applies Yates' continuity correction by default.\n",
"chi2, p, dof, expected = stats.chi2_contingency(alist)\n",
"msg = \"Test Statistic: {}\\n p-value: {}\\n Degrees of Freedom: {}\\n\"\n",
"print(msg.format(chi2, p, dof))\n",
"print(expected)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:09:11.820413Z",
"start_time": "2021-10-24T07:09:11.783232Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th>Abroad.self</th><th>N</th><th>Y</th><th>All</th></tr>\n",
"    <tr><th>Gender.self</th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>女性</th><td>550</td><td>68</td><td>618</td></tr>\n",
"    <tr><th>男性</th><td>232</td><td>24</td><td>256</td></tr>\n",
"    <tr><th>All</th><td>782</td><td>92</td><td>874</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Abroad.self N Y All\n",
"Gender.self \n",
"女性 550 68 618\n",
"男性 232 24 256\n",
"All 782 92 874"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(df['Gender.self'],df['Abroad.self'],margins=True)"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:09:26.480175Z",
"start_time": "2021-10-24T07:09:26.438190Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th>Abroad.self</th><th>N</th><th>Y</th></tr>\n",
"    <tr><th>Gender.self</th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>女性</th><td>0.889968</td><td>0.110032</td></tr>\n",
"    <tr><th>男性</th><td>0.906250</td><td>0.093750</td></tr>\n",
"    <tr><th>All</th><td>0.894737</td><td>0.105263</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Abroad.self N Y\n",
"Gender.self \n",
"女性 0.889968 0.110032\n",
"男性 0.906250 0.093750\n",
"All 0.894737 0.105263"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(df['Gender.self'],df['Abroad.self'],margins=True, normalize = 'index')"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:09:39.072601Z",
"start_time": "2021-10-24T07:09:39.056631Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[550, 68], [232, 24]]\n",
"Test Statistic: 0.351325749125499\n",
" p-value: 0.5533636134064016\n",
" Degrees of Freedom: 1\n",
"\n",
"[[552.94736842 65.05263158]\n",
" [229.05263158 26.94736842]]\n"
]
}
],
"source": [
"alist = np.array(pd.crosstab(df['Gender.self'], df['Abroad.self'], margins=False)).tolist()\n",
"print(alist)\n",
"\n",
"# Chi-square test of independence: gender vs. Abroad.self.\n",
"# For a 2x2 table, chi2_contingency applies Yates' continuity correction by default.\n",
"chi2, p, dof, expected = stats.chi2_contingency(alist)\n",
"msg = \"Test Statistic: {}\\n p-value: {}\\n Degrees of Freedom: {}\\n\"\n",
"print(msg.format(chi2, p, dof))\n",
"print(expected)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:13:25.333267Z",
"start_time": "2021-10-24T07:13:25.295343Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th>Apt.self</th><th>N</th><th>有房</th><th>All</th></tr>\n",
"    <tr><th>Gender.self</th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>女性</th><td>385</td><td>233</td><td>618</td></tr>\n",
"    <tr><th>男性</th><td>76</td><td>180</td><td>256</td></tr>\n",
"    <tr><th>All</th><td>461</td><td>413</td><td>874</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Apt.self N 有房 All\n",
"Gender.self \n",
"女性 385 233 618\n",
"男性 76 180 256\n",
"All 461 413 874"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(df['Gender.self'],df['Apt.self'],margins=True)"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:13:39.284921Z",
"start_time": "2021-10-24T07:13:39.244306Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th>Apt.self</th><th>N</th><th>有房</th></tr>\n",
"    <tr><th>Gender.self</th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>女性</th><td>0.622977</td><td>0.377023</td></tr>\n",
"    <tr><th>男性</th><td>0.296875</td><td>0.703125</td></tr>\n",
"    <tr><th>All</th><td>0.527460</td><td>0.472540</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Apt.self N 有房\n",
"Gender.self \n",
"女性 0.622977 0.377023\n",
"男性 0.296875 0.703125\n",
"All 0.527460 0.472540"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(df['Gender.self'],df['Apt.self'],margins=True, normalize = 'index')"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:14:07.344002Z",
"start_time": "2021-10-24T07:14:07.326516Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[385, 233], [76, 180]]\n",
"Test Statistic: 75.92908969295243\n",
" p-value: 2.9403634235931047e-18\n",
" Degrees of Freedom: 1\n",
"\n",
"[[325.97025172 292.02974828]\n",
" [135.02974828 120.97025172]]\n"
]
}
],
"source": [
"alist = np.array(pd.crosstab(df['Gender.self'], df['Apt.self'], margins=False)).tolist()\n",
"print(alist)\n",
"\n",
"# Chi-square test of independence: gender vs. owning an apartment.\n",
"# For a 2x2 table, chi2_contingency applies Yates' continuity correction by default.\n",
"chi2, p, dof, expected = stats.chi2_contingency(alist)\n",
"msg = \"Test Statistic: {}\\n p-value: {}\\n Degrees of Freedom: {}\\n\"\n",
"print(msg.format(chi2, p, dof))\n",
"print(expected)"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:23:36.012768Z",
"start_time": "2021-10-24T07:23:35.995071Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[549, 79], [205, 41]]\n",
"Test Statistic: 2.1596107037130725\n",
" p-value: 0.14168058143484263\n",
" Degrees of Freedom: 1\n",
"\n",
"[[541.77574371 86.22425629]\n",
" [212.22425629 33.77574371]]\n"
]
}
],
"source": [
"alist = np.array(pd.crosstab(df['Looking.self.dummy'], df['Looking.wanted.dummy'], margins=False)).tolist()\n",
"print(alist)\n",
"\n",
"# Chi-square test of independence: Looking.self.dummy vs. Looking.wanted.dummy.\n",
"# For a 2x2 table, chi2_contingency applies Yates' continuity correction by default.\n",
"chi2, p, dof, expected = stats.chi2_contingency(alist)\n",
"msg = \"Test Statistic: {}\\n p-value: {}\\n Degrees of Freedom: {}\\n\"\n",
"print(msg.format(chi2, p, dof))\n",
"print(expected)"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:26:15.526929Z",
"start_time": "2021-10-24T07:26:15.509099Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[104, 16], [5, 2], [1, 0], [10, 4], [72, 9], [318, 55], [238, 34], [6, 0]]\n",
"Test Statistic: 6.176235299193836\n",
" p-value: 0.5193287773742754\n",
" Degrees of Freedom: 7\n",
"\n",
"[[1.03524027e+02 1.64759725e+01]\n",
" [6.03890160e+00 9.61098398e-01]\n",
" [8.62700229e-01 1.37299771e-01]\n",
" [1.20778032e+01 1.92219680e+00]\n",
" [6.98787185e+01 1.11212815e+01]\n",
" [3.21787185e+02 5.12128146e+01]\n",
" [2.34654462e+02 3.73455378e+01]\n",
" [5.17620137e+00 8.23798627e-01]]\n"
]
}
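],
"source": [
"# The stored output is an 8x2 table whose row totals match the Edu.self categories,\n",
"# so the row variable here is Edu.self (not Looking.self.dummy).\n",
"alist = np.array(pd.crosstab(df['Edu.self'], df['Looking.wanted.dummy'], margins=False)).tolist()\n",
"print(alist)\n",
"\n",
"# Chi-square test of independence: education level vs. Looking.wanted.dummy.\n",
"# Several expected counts printed below are under 5, so treat the p-value with caution.\n",
"chi2, p, dof, expected = stats.chi2_contingency(alist)\n",
"msg = \"Test Statistic: {}\\n p-value: {}\\n Degrees of Freedom: {}\\n\"\n",
"print(msg.format(chi2, p, dof))\n",
"print(expected)"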
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### People who describe their own looks state a hukou requirement more often than those who do not (18.3% vs. 11.0%)"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:28:39.144270Z",
"start_time": "2021-10-24T07:28:39.104232Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th>Hukou.wanted.dummy</th><th>0</th><th>1</th></tr>\n",
"    <tr><th>Looking.self.dummy</th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>0.890127</td><td>0.109873</td></tr>\n",
"    <tr><th>1</th><td>0.817073</td><td>0.182927</td></tr>\n",
"    <tr><th>All</th><td>0.869565</td><td>0.130435</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Hukou.wanted.dummy 0 1\n",
"Looking.self.dummy \n",
"0 0.890127 0.109873\n",
"1 0.817073 0.182927\n",
"All 0.869565 0.130435"
]
},
"execution_count": 92,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(df['Looking.self.dummy'],df['Hukou.wanted.dummy'],margins=True, normalize = 'index')"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {
"ExecuteTime": {
"end_time": "2021-10-24T07:28:07.273806Z",
"start_time": "2021-10-24T07:28:07.257734Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[559, 69], [201, 45]]\n",
"Test Statistic: 7.6855978565757\n",
" p-value: 0.005566323672839524\n",
" Degrees of Freedom: 1\n",
"\n",
"[[546.08695652 81.91304348]\n",
" [213.91304348 32.08695652]]\n"
]
}
],
"source": [
"alist = np.array(pd.crosstab(df['Looking.self.dummy'], df['Hukou.wanted.dummy'], margins=False)).tolist()\n",
"print(alist)\n",
"\n",
"# Chi-square test of independence: Looking.self.dummy vs. Hukou.wanted.dummy.\n",
"# For a 2x2 table, chi2_contingency applies Yates' continuity correction by default.\n",
"chi2, p, dof, expected = stats.chi2_contingency(alist)\n",
"msg = \"Test Statistic: {}\\n p-value: {}\\n Degrees of Freedom: {}\\n\"\n",
"print(msg.format(chi2, p, dof))\n",
"print(expected)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": true
}
},
"nbformat": 4,
"nbformat_minor": 4
}