{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"\n",
"\n",
"# 使用Pandas进行数据清洗\n",
"\n",
"\n",
"\n",
"![image.png](./images/author.png)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## 简介\n",
"\n",
"\n",
"\n",
"Pandas 是 Python 中很流行的类库,使用它可以进行数据科学计算和数据分析。数据分析结果的好坏依赖于数据的好坏。\n",
"- 存在数据缺失、数据格式不统一、数据错误的情况。\n",
"- 不管是不完善的报表,还是技术处理数据的失当,都会不可避免的产生“脏”数据。\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"首先要检查数据,明确数据分析目的。在此基础上:\n",
"- 处理缺失数据\n",
"- 添加默认值\n",
"- 删除不完整的行\n",
"- 删除不完整的列\n",
"- 规范化数据类型\n",
"- 必要的转换\n",
"- 变量重命名\n",
"- 保存结果\n",
"\n",
"https://www.cnblogs.com/BoyceYang/p/8182053.html"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"\n",
"\n",
"\n",
"\n",
"![](./images/pandas.png)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## 案例1: 电影数据清洗\n",
"\n",
"Imdb5000数据包含了很多信息,例如演员、导演、预算、总收入,以及 IMDB 评分和上映时间。有一些列的值是缺失的,有些列的默认值是0,有的是 NaN(Not a Number)。下面我们通过使用 Pandas 提供的功能来清洗“脏”数据。数据来源:\n",
"- https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset\n",
"- https://raw.githubusercontent.com/sundeepblue/movie_rating_prediction/master/movie_metadata.csv"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-06T13:08:30.866389Z",
"start_time": "2020-06-06T13:08:30.700051Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',\n",
" 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',\n",
" 'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',\n",
" 'movie_title', 'num_voted_users', 'cast_total_facebook_likes',\n",
" 'actor_3_name', 'facenumber_in_poster', 'plot_keywords',\n",
" 'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',\n",
" 'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',\n",
" 'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],\n",
" dtype='object')"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"data = pd.read_csv('../data/movie_metadata.csv')\n",
"data.columns"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2019-10-24T01:05:44.568427Z",
"start_time": "2019-10-24T01:05:44.563651Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### 检查数据\n",
"\n",
"Pandas 提供了`describe()`方法描述数据,还提供了 `head() 方法`,输出前几行数据,让我们对读入的数据有一个大致的了解。此外,Pandas 提供了一些数据切片的方法:\n",
"- 查看一列的一些基本统计信息:`data['columnname'].describe()`\n",
"- 选择一列:`data['columnname']`\n",
"- 选择一列的前几行数据:`data['columnsname'][:n]`\n",
"- 选择多列:`data[['column1','column2']]`\n",
"- Where 条件过滤:`data[data['columnname'] > condition]`\n"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-06T13:09:03.162722Z",
"start_time": "2020-06-06T13:09:03.117734Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" color | \n",
" director_name | \n",
" num_critic_for_reviews | \n",
" duration | \n",
" director_facebook_likes | \n",
" actor_3_facebook_likes | \n",
" actor_2_name | \n",
" actor_1_facebook_likes | \n",
" gross | \n",
" genres | \n",
" ... | \n",
" num_user_for_reviews | \n",
" language | \n",
" country | \n",
" content_rating | \n",
" budget | \n",
" title_year | \n",
" actor_2_facebook_likes | \n",
" imdb_score | \n",
" aspect_ratio | \n",
" movie_facebook_likes | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Color | \n",
" James Cameron | \n",
" 723.0 | \n",
" 178.0 | \n",
" 0.0 | \n",
" 855.0 | \n",
" Joel David Moore | \n",
" 1000.0 | \n",
" 760505847.0 | \n",
" Action|Adventure|Fantasy|Sci-Fi | \n",
" ... | \n",
" 3054.0 | \n",
" English | \n",
" USA | \n",
" PG-13 | \n",
" 237000000.0 | \n",
" 2009.0 | \n",
" 936.0 | \n",
" 7.9 | \n",
" 1.78 | \n",
" 33000 | \n",
"
\n",
" \n",
" 1 | \n",
" Color | \n",
" Gore Verbinski | \n",
" 302.0 | \n",
" 169.0 | \n",
" 563.0 | \n",
" 1000.0 | \n",
" Orlando Bloom | \n",
" 40000.0 | \n",
" 309404152.0 | \n",
" Action|Adventure|Fantasy | \n",
" ... | \n",
" 1238.0 | \n",
" English | \n",
" USA | \n",
" PG-13 | \n",
" 300000000.0 | \n",
" 2007.0 | \n",
" 5000.0 | \n",
" 7.1 | \n",
" 2.35 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" Color | \n",
" Sam Mendes | \n",
" 602.0 | \n",
" 148.0 | \n",
" 0.0 | \n",
" 161.0 | \n",
" Rory Kinnear | \n",
" 11000.0 | \n",
" 200074175.0 | \n",
" Action|Adventure|Thriller | \n",
" ... | \n",
" 994.0 | \n",
" English | \n",
" UK | \n",
" PG-13 | \n",
" 245000000.0 | \n",
" 2015.0 | \n",
" 393.0 | \n",
" 6.8 | \n",
" 2.35 | \n",
" 85000 | \n",
"
\n",
" \n",
" 3 | \n",
" Color | \n",
" Christopher Nolan | \n",
" 813.0 | \n",
" 164.0 | \n",
" 22000.0 | \n",
" 23000.0 | \n",
" Christian Bale | \n",
" 27000.0 | \n",
" 448130642.0 | \n",
" Action|Thriller | \n",
" ... | \n",
" 2701.0 | \n",
" English | \n",
" USA | \n",
" PG-13 | \n",
" 250000000.0 | \n",
" 2012.0 | \n",
" 23000.0 | \n",
" 8.5 | \n",
" 2.35 | \n",
" 164000 | \n",
"
\n",
" \n",
" 4 | \n",
" NaN | \n",
" Doug Walker | \n",
" NaN | \n",
" NaN | \n",
" 131.0 | \n",
" NaN | \n",
" Rob Walker | \n",
" 131.0 | \n",
" NaN | \n",
" Documentary | \n",
" ... | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 12.0 | \n",
" 7.1 | \n",
" NaN | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 28 columns
\n",
"
"
],
"text/plain": [
" color director_name num_critic_for_reviews duration \\\n",
"0 Color James Cameron 723.0 178.0 \n",
"1 Color Gore Verbinski 302.0 169.0 \n",
"2 Color Sam Mendes 602.0 148.0 \n",
"3 Color Christopher Nolan 813.0 164.0 \n",
"4 NaN Doug Walker NaN NaN \n",
"\n",
" director_facebook_likes actor_3_facebook_likes actor_2_name \\\n",
"0 0.0 855.0 Joel David Moore \n",
"1 563.0 1000.0 Orlando Bloom \n",
"2 0.0 161.0 Rory Kinnear \n",
"3 22000.0 23000.0 Christian Bale \n",
"4 131.0 NaN Rob Walker \n",
"\n",
" actor_1_facebook_likes gross genres ... \\\n",
"0 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... \n",
"1 40000.0 309404152.0 Action|Adventure|Fantasy ... \n",
"2 11000.0 200074175.0 Action|Adventure|Thriller ... \n",
"3 27000.0 448130642.0 Action|Thriller ... \n",
"4 131.0 NaN Documentary ... \n",
"\n",
" num_user_for_reviews language country content_rating budget \\\n",
"0 3054.0 English USA PG-13 237000000.0 \n",
"1 1238.0 English USA PG-13 300000000.0 \n",
"2 994.0 English UK PG-13 245000000.0 \n",
"3 2701.0 English USA PG-13 250000000.0 \n",
"4 NaN NaN NaN NaN NaN \n",
"\n",
" title_year actor_2_facebook_likes imdb_score aspect_ratio \\\n",
"0 2009.0 936.0 7.9 1.78 \n",
"1 2007.0 5000.0 7.1 2.35 \n",
"2 2015.0 393.0 6.8 2.35 \n",
"3 2012.0 23000.0 8.5 2.35 \n",
"4 NaN 12.0 7.1 NaN \n",
"\n",
" movie_facebook_likes \n",
"0 33000 \n",
"1 0 \n",
"2 85000 \n",
"3 164000 \n",
"4 0 \n",
"\n",
"[5 rows x 28 columns]"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.head()"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-06T13:09:41.527185Z",
"start_time": "2020-06-06T13:09:41.508092Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[(nan, 104),\n",
" ('Steven Spielberg', 26),\n",
" ('Woody Allen', 22),\n",
" ('Martin Scorsese', 20),\n",
" ('Clint Eastwood', 20),\n",
" ('Ridley Scott', 17),\n",
" ('Tim Burton', 16),\n",
" ('Steven Soderbergh', 16),\n",
" ('Spike Lee', 16),\n",
" ('Renny Harlin', 15)]"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from collections import Counter\n",
"\n",
"freq = Counter(data['director_name'].tolist())\n",
"sorted(list(freq.items()), key=lambda x: x[1], reverse=True)[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### 一、缺失数据\n",
"\n",
"为什么要考虑数值缺失的情况?\n",
"\n",
"- 真实世界的数据很少是干净和一致的\n",
"- 许多有趣的数据集都会丢失一些数据\n",
"- 不同的数据源可能以不同的方式表示缺失的数据\n",
"\n",
"缺失数据是最常见的问题之一。产生这个问题可能的原因\n",
"\n",
"① 从来没有填正确过② 数据不可用③ 计算错误\n",
"\n",
"处理缺失数据的方法:\n",
"一、补充缺失值; 二、删除缺失值; 三、不处理"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### 1. 使用默认值填补缺失"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-06T13:08:35.837102Z",
"start_time": "2020-06-06T13:08:35.789917Z"
},
"scrolled": false,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" color | \n",
" director_name | \n",
" num_critic_for_reviews | \n",
" duration | \n",
" director_facebook_likes | \n",
" actor_3_facebook_likes | \n",
" actor_2_name | \n",
" actor_1_facebook_likes | \n",
" gross | \n",
" genres | \n",
" ... | \n",
" num_user_for_reviews | \n",
" language | \n",
" country | \n",
" content_rating | \n",
" budget | \n",
" title_year | \n",
" actor_2_facebook_likes | \n",
" imdb_score | \n",
" aspect_ratio | \n",
" movie_facebook_likes | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" ... | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 1 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" ... | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 2 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" ... | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 3 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" ... | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 4 | \n",
" True | \n",
" False | \n",
" True | \n",
" True | \n",
" False | \n",
" True | \n",
" False | \n",
" False | \n",
" True | \n",
" False | \n",
" ... | \n",
" True | \n",
" True | \n",
" True | \n",
" True | \n",
" True | \n",
" True | \n",
" False | \n",
" False | \n",
" True | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 28 columns
\n",
"
"
],
"text/plain": [
" color director_name num_critic_for_reviews duration \\\n",
"0 False False False False \n",
"1 False False False False \n",
"2 False False False False \n",
"3 False False False False \n",
"4 True False True True \n",
"\n",
" director_facebook_likes actor_3_facebook_likes actor_2_name \\\n",
"0 False False False \n",
"1 False False False \n",
"2 False False False \n",
"3 False False False \n",
"4 False True False \n",
"\n",
" actor_1_facebook_likes gross genres ... num_user_for_reviews language \\\n",
"0 False False False ... False False \n",
"1 False False False ... False False \n",
"2 False False False ... False False \n",
"3 False False False ... False False \n",
"4 False True False ... True True \n",
"\n",
" country content_rating budget title_year actor_2_facebook_likes \\\n",
"0 False False False False False \n",
"1 False False False False False \n",
"2 False False False False False \n",
"3 False False False False False \n",
"4 True True True True False \n",
"\n",
" imdb_score aspect_ratio movie_facebook_likes \n",
"0 False False False \n",
"1 False False False \n",
"2 False False False \n",
"3 False False False \n",
"4 False True False \n",
"\n",
"[5 rows x 28 columns]"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.isnull()[:5] # 通过这个函数查看是否有缺失值。True表示缺失,False表示没有缺失。"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"##### 👉使用空字符串填补"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-06T07:56:23.249017Z",
"start_time": "2020-06-06T07:56:23.225795Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 James Cameron\n",
"1 Gore Verbinski\n",
"2 Sam Mendes\n",
"3 Christopher Nolan\n",
"4 Doug Walker\n",
" ... \n",
"5038 Scott Smith\n",
"5039 \n",
"5040 Benjamin Roberds\n",
"5041 Daniel Hsia\n",
"5042 Jon Gunn\n",
"Name: director_name, Length: 5043, dtype: object"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.director_name.fillna('')\n",
"# 此处用空字符串替代缺失值\n",
"# 该函数的使用方法:列表名称.列名.fillna('替代NA的字符')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"##### 👉使用数值0进行填补"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-06T13:21:57.188223Z",
"start_time": "2020-06-06T13:21:57.175838Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 178.0\n",
"1 169.0\n",
"2 148.0\n",
"3 164.0\n",
"4 0.0\n",
" ... \n",
"5038 87.0\n",
"5039 43.0\n",
"5040 76.0\n",
"5041 100.0\n",
"5042 90.0\n",
"Name: duration, Length: 5043, dtype: float64"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['duration'].fillna(0)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### 2.删除缺失数据\n",
"##### ①有缺失数据的情况就删除(一行任意出现一个缺失值我就删掉)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-06T07:56:53.854452Z",
"start_time": "2020-06-06T07:56:53.784373Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" color | \n",
" director_name | \n",
" num_critic_for_reviews | \n",
" duration | \n",
" director_facebook_likes | \n",
" actor_3_facebook_likes | \n",
" actor_2_name | \n",
" actor_1_facebook_likes | \n",
" gross | \n",
" genres | \n",
" ... | \n",
" num_user_for_reviews | \n",
" language | \n",
" country | \n",
" content_rating | \n",
" budget | \n",
" title_year | \n",
" actor_2_facebook_likes | \n",
" imdb_score | \n",
" aspect_ratio | \n",
" movie_facebook_likes | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Color | \n",
" James Cameron | \n",
" 723.0 | \n",
" 178.0 | \n",
" 0.0 | \n",
" 855.0 | \n",
" Joel David Moore | \n",
" 1000.0 | \n",
" 760505847.0 | \n",
" Action|Adventure|Fantasy|Sci-Fi | \n",
" ... | \n",
" 3054.0 | \n",
" English | \n",
" USA | \n",
" PG-13 | \n",
" 237000000.0 | \n",
" 2009.0 | \n",
" 936.0 | \n",
" 7.9 | \n",
" 1.78 | \n",
" 33000 | \n",
"
\n",
" \n",
" 1 | \n",
" Color | \n",
" Gore Verbinski | \n",
" 302.0 | \n",
" 169.0 | \n",
" 563.0 | \n",
" 1000.0 | \n",
" Orlando Bloom | \n",
" 40000.0 | \n",
" 309404152.0 | \n",
" Action|Adventure|Fantasy | \n",
" ... | \n",
" 1238.0 | \n",
" English | \n",
" USA | \n",
" PG-13 | \n",
" 300000000.0 | \n",
" 2007.0 | \n",
" 5000.0 | \n",
" 7.1 | \n",
" 2.35 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" Color | \n",
" Sam Mendes | \n",
" 602.0 | \n",
" 148.0 | \n",
" 0.0 | \n",
" 161.0 | \n",
" Rory Kinnear | \n",
" 11000.0 | \n",
" 200074175.0 | \n",
" Action|Adventure|Thriller | \n",
" ... | \n",
" 994.0 | \n",
" English | \n",
" UK | \n",
" PG-13 | \n",
" 245000000.0 | \n",
" 2015.0 | \n",
" 393.0 | \n",
" 6.8 | \n",
" 2.35 | \n",
" 85000 | \n",
"
\n",
" \n",
"
\n",
"
3 rows × 28 columns
\n",
"
"
],
"text/plain": [
" color director_name num_critic_for_reviews duration \\\n",
"0 Color James Cameron 723.0 178.0 \n",
"1 Color Gore Verbinski 302.0 169.0 \n",
"2 Color Sam Mendes 602.0 148.0 \n",
"\n",
" director_facebook_likes actor_3_facebook_likes actor_2_name \\\n",
"0 0.0 855.0 Joel David Moore \n",
"1 563.0 1000.0 Orlando Bloom \n",
"2 0.0 161.0 Rory Kinnear \n",
"\n",
" actor_1_facebook_likes gross genres ... \\\n",
"0 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... \n",
"1 40000.0 309404152.0 Action|Adventure|Fantasy ... \n",
"2 11000.0 200074175.0 Action|Adventure|Thriller ... \n",
"\n",
" num_user_for_reviews language country content_rating budget \\\n",
"0 3054.0 English USA PG-13 237000000.0 \n",
"1 1238.0 English USA PG-13 300000000.0 \n",
"2 994.0 English UK PG-13 245000000.0 \n",
"\n",
" title_year actor_2_facebook_likes imdb_score aspect_ratio \\\n",
"0 2009.0 936.0 7.9 1.78 \n",
"1 2007.0 5000.0 7.1 2.35 \n",
"2 2015.0 393.0 6.8 2.35 \n",
"\n",
" movie_facebook_likes \n",
"0 33000 \n",
"1 0 \n",
"2 85000 \n",
"\n",
"[3 rows x 28 columns]"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.dropna()[:3]\n",
"# 删除任何包含 NA 值的行:\n",
"# 可以看到原本有NA值得行全部被删除"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"##### ② 全部是缺失值才删除\n",
"(一般情况下如果有部分缺失值是不影响整体行列的,如果有就删除会影响数据完整性)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"ExecuteTime": {
"end_time": "2019-10-24T00:41:09.360267Z",
"start_time": "2019-10-24T00:41:09.333261Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" color | \n",
" director_name | \n",
" num_critic_for_reviews | \n",
" duration | \n",
" director_facebook_likes | \n",
" actor_3_facebook_likes | \n",
" actor_2_name | \n",
" actor_1_facebook_likes | \n",
" gross | \n",
" genres | \n",
" ... | \n",
" num_user_for_reviews | \n",
" language | \n",
" country | \n",
" content_rating | \n",
" budget | \n",
" title_year | \n",
" actor_2_facebook_likes | \n",
" imdb_score | \n",
" aspect_ratio | \n",
" movie_facebook_likes | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Color | \n",
" James Cameron | \n",
" 723.0 | \n",
" 178.0 | \n",
" 0.0 | \n",
" 855.0 | \n",
" Joel David Moore | \n",
" 1000.0 | \n",
" 760505847.0 | \n",
" Action|Adventure|Fantasy|Sci-Fi | \n",
" ... | \n",
" 3054.0 | \n",
" English | \n",
" USA | \n",
" PG-13 | \n",
" 237000000.0 | \n",
" 2009.0 | \n",
" 936.0 | \n",
" 7.9 | \n",
" 1.78 | \n",
" 33000 | \n",
"
\n",
" \n",
" 1 | \n",
" Color | \n",
" Gore Verbinski | \n",
" 302.0 | \n",
" 169.0 | \n",
" 563.0 | \n",
" 1000.0 | \n",
" Orlando Bloom | \n",
" 40000.0 | \n",
" 309404152.0 | \n",
" Action|Adventure|Fantasy | \n",
" ... | \n",
" 1238.0 | \n",
" English | \n",
" USA | \n",
" PG-13 | \n",
" 300000000.0 | \n",
" 2007.0 | \n",
" 5000.0 | \n",
" 7.1 | \n",
" 2.35 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" Color | \n",
" Sam Mendes | \n",
" 602.0 | \n",
" 148.0 | \n",
" 0.0 | \n",
" 161.0 | \n",
" Rory Kinnear | \n",
" 11000.0 | \n",
" 200074175.0 | \n",
" Action|Adventure|Thriller | \n",
" ... | \n",
" 994.0 | \n",
" English | \n",
" UK | \n",
" PG-13 | \n",
" 245000000.0 | \n",
" 2015.0 | \n",
" 393.0 | \n",
" 6.8 | \n",
" 2.35 | \n",
" 85000 | \n",
"
\n",
" \n",
"
\n",
"
3 rows × 28 columns
\n",
"
"
],
"text/plain": [
" color director_name num_critic_for_reviews duration \\\n",
"0 Color James Cameron 723.0 178.0 \n",
"1 Color Gore Verbinski 302.0 169.0 \n",
"2 Color Sam Mendes 602.0 148.0 \n",
"\n",
" director_facebook_likes actor_3_facebook_likes actor_2_name \\\n",
"0 0.0 855.0 Joel David Moore \n",
"1 563.0 1000.0 Orlando Bloom \n",
"2 0.0 161.0 Rory Kinnear \n",
"\n",
" actor_1_facebook_likes gross genres ... \\\n",
"0 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... \n",
"1 40000.0 309404152.0 Action|Adventure|Fantasy ... \n",
"2 11000.0 200074175.0 Action|Adventure|Thriller ... \n",
"\n",
" num_user_for_reviews language country content_rating budget \\\n",
"0 3054.0 English USA PG-13 237000000.0 \n",
"1 1238.0 English USA PG-13 300000000.0 \n",
"2 994.0 English UK PG-13 245000000.0 \n",
"\n",
" title_year actor_2_facebook_likes imdb_score aspect_ratio \\\n",
"0 2009.0 936.0 7.9 1.78 \n",
"1 2007.0 5000.0 7.1 2.35 \n",
"2 2015.0 393.0 6.8 2.35 \n",
"\n",
" movie_facebook_likes \n",
"0 33000 \n",
"1 0 \n",
"2 85000 \n",
"\n",
"[3 rows x 28 columns]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.dropna(how='all')[:3]\n",
"# 删除任何包含 NA 值的行:\n",
"# 可以看到原本有NA值得行全部被删除"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"→图示解读:我们可以发现,全部是缺失值才删除的方法很低效几乎没用\n",
"\n",
"→原始数据是5043行,经过处理后仍然是5043行"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"##### ③ 至少有n个才删除\n",
"(我们可以设定一个“门槛”,如果一行中有5个以上缺失值就删除,这样比起上面两个方法更加“中庸”)\n",
"\n",
"→图示解读:但我们发现设定5个缺失值为门槛仍然不能处理,我们需要降低删除的门槛,,比如改到4个或3"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"ExecuteTime": {
"end_time": "2019-10-24T00:41:35.165485Z",
"start_time": "2019-10-24T00:41:35.138966Z"
},
"scrolled": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" color | \n",
" director_name | \n",
" num_critic_for_reviews | \n",
" duration | \n",
" director_facebook_likes | \n",
" actor_3_facebook_likes | \n",
" actor_2_name | \n",
" actor_1_facebook_likes | \n",
" gross | \n",
" genres | \n",
" ... | \n",
" num_user_for_reviews | \n",
" language | \n",
" country | \n",
" content_rating | \n",
" budget | \n",
" title_year | \n",
" actor_2_facebook_likes | \n",
" imdb_score | \n",
" aspect_ratio | \n",
" movie_facebook_likes | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Color | \n",
" James Cameron | \n",
" 723.0 | \n",
" 178.0 | \n",
" 0.0 | \n",
" 855.0 | \n",
" Joel David Moore | \n",
" 1000.0 | \n",
" 760505847.0 | \n",
" Action|Adventure|Fantasy|Sci-Fi | \n",
" ... | \n",
" 3054.0 | \n",
" English | \n",
" USA | \n",
" PG-13 | \n",
" 237000000.0 | \n",
" 2009.0 | \n",
" 936.0 | \n",
" 7.9 | \n",
" 1.78 | \n",
" 33000 | \n",
"
\n",
" \n",
" 1 | \n",
" Color | \n",
" Gore Verbinski | \n",
" 302.0 | \n",
" 169.0 | \n",
" 563.0 | \n",
" 1000.0 | \n",
" Orlando Bloom | \n",
" 40000.0 | \n",
" 309404152.0 | \n",
" Action|Adventure|Fantasy | \n",
" ... | \n",
" 1238.0 | \n",
" English | \n",
" USA | \n",
" PG-13 | \n",
" 300000000.0 | \n",
" 2007.0 | \n",
" 5000.0 | \n",
" 7.1 | \n",
" 2.35 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" Color | \n",
" Sam Mendes | \n",
" 602.0 | \n",
" 148.0 | \n",
" 0.0 | \n",
" 161.0 | \n",
" Rory Kinnear | \n",
" 11000.0 | \n",
" 200074175.0 | \n",
" Action|Adventure|Thriller | \n",
" ... | \n",
" 994.0 | \n",
" English | \n",
" UK | \n",
" PG-13 | \n",
" 245000000.0 | \n",
" 2015.0 | \n",
" 393.0 | \n",
" 6.8 | \n",
" 2.35 | \n",
" 85000 | \n",
"
\n",
" \n",
"
\n",
"
3 rows × 28 columns
\n",
"
"
],
"text/plain": [
" color director_name num_critic_for_reviews duration \\\n",
"0 Color James Cameron 723.0 178.0 \n",
"1 Color Gore Verbinski 302.0 169.0 \n",
"2 Color Sam Mendes 602.0 148.0 \n",
"\n",
" director_facebook_likes actor_3_facebook_likes actor_2_name \\\n",
"0 0.0 855.0 Joel David Moore \n",
"1 563.0 1000.0 Orlando Bloom \n",
"2 0.0 161.0 Rory Kinnear \n",
"\n",
" actor_1_facebook_likes gross genres ... \\\n",
"0 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... \n",
"1 40000.0 309404152.0 Action|Adventure|Fantasy ... \n",
"2 11000.0 200074175.0 Action|Adventure|Thriller ... \n",
"\n",
" num_user_for_reviews language country content_rating budget \\\n",
"0 3054.0 English USA PG-13 237000000.0 \n",
"1 1238.0 English USA PG-13 300000000.0 \n",
"2 994.0 English UK PG-13 245000000.0 \n",
"\n",
" title_year actor_2_facebook_likes imdb_score aspect_ratio \\\n",
"0 2009.0 936.0 7.9 1.78 \n",
"1 2007.0 5000.0 7.1 2.35 \n",
"2 2015.0 393.0 6.8 2.35 \n",
"\n",
" movie_facebook_likes \n",
"0 33000 \n",
"1 0 \n",
"2 85000 \n",
"\n",
"[3 rows x 28 columns]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.dropna(thresh=5)[:3]\n",
"# 在下面的例子中,行数据中至少要有 5 个缺失值才会删除"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"##### ④ 某个/几个列值缺失才删除\n",
"(如果我们觉得某几个值是特别重要的、不可或缺的话,如果是缺失值该数据就没意义,就可以指定特定列)\n",
"\n",
"(例如对于一个电影而言,电影名字或者导演通常是必须的要素)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"ExecuteTime": {
"end_time": "2019-10-24T00:41:39.571641Z",
"start_time": "2019-10-24T00:41:39.546419Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" color | \n",
" director_name | \n",
" num_critic_for_reviews | \n",
" duration | \n",
" director_facebook_likes | \n",
" actor_3_facebook_likes | \n",
" actor_2_name | \n",
" actor_1_facebook_likes | \n",
" gross | \n",
" genres | \n",
" ... | \n",
" num_user_for_reviews | \n",
" language | \n",
" country | \n",
" content_rating | \n",
" budget | \n",
" title_year | \n",
" actor_2_facebook_likes | \n",
" imdb_score | \n",
" aspect_ratio | \n",
" movie_facebook_likes | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Color | \n",
" James Cameron | \n",
" 723.0 | \n",
" 178.0 | \n",
" 0.0 | \n",
" 855.0 | \n",
" Joel David Moore | \n",
" 1000.0 | \n",
" 760505847.0 | \n",
" Action|Adventure|Fantasy|Sci-Fi | \n",
" ... | \n",
" 3054.0 | \n",
" English | \n",
" USA | \n",
" PG-13 | \n",
" 237000000.0 | \n",
" 2009.0 | \n",
" 936.0 | \n",
" 7.9 | \n",
" 1.78 | \n",
" 33000 | \n",
"
\n",
" \n",
" 1 | \n",
" Color | \n",
" Gore Verbinski | \n",
" 302.0 | \n",
" 169.0 | \n",
" 563.0 | \n",
" 1000.0 | \n",
" Orlando Bloom | \n",
" 40000.0 | \n",
" 309404152.0 | \n",
" Action|Adventure|Fantasy | \n",
" ... | \n",
" 1238.0 | \n",
" English | \n",
" USA | \n",
" PG-13 | \n",
" 300000000.0 | \n",
" 2007.0 | \n",
" 5000.0 | \n",
" 7.1 | \n",
" 2.35 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" Color | \n",
" Sam Mendes | \n",
" 602.0 | \n",
" 148.0 | \n",
" 0.0 | \n",
" 161.0 | \n",
" Rory Kinnear | \n",
" 11000.0 | \n",
" 200074175.0 | \n",
" Action|Adventure|Thriller | \n",
" ... | \n",
" 994.0 | \n",
" English | \n",
" UK | \n",
" PG-13 | \n",
" 245000000.0 | \n",
" 2015.0 | \n",
" 393.0 | \n",
" 6.8 | \n",
" 2.35 | \n",
" 85000 | \n",
"
\n",
" \n",
"
\n",
"
3 rows × 28 columns
\n",
"
"
],
"text/plain": [
" color director_name num_critic_for_reviews duration \\\n",
"0 Color James Cameron 723.0 178.0 \n",
"1 Color Gore Verbinski 302.0 169.0 \n",
"2 Color Sam Mendes 602.0 148.0 \n",
"\n",
" director_facebook_likes actor_3_facebook_likes actor_2_name \\\n",
"0 0.0 855.0 Joel David Moore \n",
"1 563.0 1000.0 Orlando Bloom \n",
"2 0.0 161.0 Rory Kinnear \n",
"\n",
" actor_1_facebook_likes gross genres ... \\\n",
"0 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... \n",
"1 40000.0 309404152.0 Action|Adventure|Fantasy ... \n",
"2 11000.0 200074175.0 Action|Adventure|Thriller ... \n",
"\n",
" num_user_for_reviews language country content_rating budget \\\n",
"0 3054.0 English USA PG-13 237000000.0 \n",
"1 1238.0 English USA PG-13 300000000.0 \n",
"2 994.0 English UK PG-13 245000000.0 \n",
"\n",
" title_year actor_2_facebook_likes imdb_score aspect_ratio \\\n",
"0 2009.0 936.0 7.9 1.78 \n",
"1 2007.0 5000.0 7.1 2.35 \n",
"2 2015.0 393.0 6.8 2.35 \n",
"\n",
" movie_facebook_likes \n",
"0 33000 \n",
"1 0 \n",
"2 85000 \n",
"\n",
"[3 rows x 28 columns]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.dropna(subset=['director_name','title_year'])[:3]\n",
"# 此处用subset方法删除了 'director_name','title_year'为缺失值的行"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"##### ⑤ 删除列\n",
"(前面的操作是针对行)\n",
"\n",
"(同样的方法也适用于于列,只需要补充axis=1的参数即可。默认值为axis=0 行 axis=1 列)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"ExecuteTime": {
"end_time": "2019-10-24T00:41:46.372347Z",
"start_time": "2019-10-24T00:41:46.345759Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" color | \n",
" director_name | \n",
" num_critic_for_reviews | \n",
" duration | \n",
" director_facebook_likes | \n",
" actor_3_facebook_likes | \n",
" actor_2_name | \n",
" actor_1_facebook_likes | \n",
" gross | \n",
" genres | \n",
" ... | \n",
" num_user_for_reviews | \n",
" language | \n",
" country | \n",
" content_rating | \n",
" budget | \n",
" title_year | \n",
" actor_2_facebook_likes | \n",
" imdb_score | \n",
" aspect_ratio | \n",
" movie_facebook_likes | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Color | \n",
" James Cameron | \n",
" 723.0 | \n",
" 178.0 | \n",
" 0.0 | \n",
" 855.0 | \n",
" Joel David Moore | \n",
" 1000.0 | \n",
" 760505847.0 | \n",
" Action|Adventure|Fantasy|Sci-Fi | \n",
" ... | \n",
" 3054.0 | \n",
" English | \n",
" USA | \n",
" PG-13 | \n",
" 237000000.0 | \n",
" 2009.0 | \n",
" 936.0 | \n",
" 7.9 | \n",
" 1.78 | \n",
" 33000 | \n",
"
\n",
" \n",
" 1 | \n",
" Color | \n",
" Gore Verbinski | \n",
" 302.0 | \n",
" 169.0 | \n",
" 563.0 | \n",
" 1000.0 | \n",
" Orlando Bloom | \n",
" 40000.0 | \n",
" 309404152.0 | \n",
" Action|Adventure|Fantasy | \n",
" ... | \n",
" 1238.0 | \n",
" English | \n",
" USA | \n",
" PG-13 | \n",
" 300000000.0 | \n",
" 2007.0 | \n",
" 5000.0 | \n",
" 7.1 | \n",
" 2.35 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" Color | \n",
" Sam Mendes | \n",
" 602.0 | \n",
" 148.0 | \n",
" 0.0 | \n",
" 161.0 | \n",
" Rory Kinnear | \n",
" 11000.0 | \n",
" 200074175.0 | \n",
" Action|Adventure|Thriller | \n",
" ... | \n",
" 994.0 | \n",
" English | \n",
" UK | \n",
" PG-13 | \n",
" 245000000.0 | \n",
" 2015.0 | \n",
" 393.0 | \n",
" 6.8 | \n",
" 2.35 | \n",
" 85000 | \n",
"
\n",
" \n",
"
\n",
"
3 rows × 28 columns
\n",
"
"
],
"text/plain": [
" color director_name num_critic_for_reviews duration \\\n",
"0 Color James Cameron 723.0 178.0 \n",
"1 Color Gore Verbinski 302.0 169.0 \n",
"2 Color Sam Mendes 602.0 148.0 \n",
"\n",
" director_facebook_likes actor_3_facebook_likes actor_2_name \\\n",
"0 0.0 855.0 Joel David Moore \n",
"1 563.0 1000.0 Orlando Bloom \n",
"2 0.0 161.0 Rory Kinnear \n",
"\n",
" actor_1_facebook_likes gross genres ... \\\n",
"0 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... \n",
"1 40000.0 309404152.0 Action|Adventure|Fantasy ... \n",
"2 11000.0 200074175.0 Action|Adventure|Thriller ... \n",
"\n",
" num_user_for_reviews language country content_rating budget \\\n",
"0 3054.0 English USA PG-13 237000000.0 \n",
"1 1238.0 English USA PG-13 300000000.0 \n",
"2 994.0 English UK PG-13 245000000.0 \n",
"\n",
" title_year actor_2_facebook_likes imdb_score aspect_ratio \\\n",
"0 2009.0 936.0 7.9 1.78 \n",
"1 2007.0 5000.0 7.1 2.35 \n",
"2 2015.0 393.0 6.8 2.35 \n",
"\n",
" movie_facebook_likes \n",
"0 33000 \n",
"1 0 \n",
"2 85000 \n",
"\n",
"[3 rows x 28 columns]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 删除一正列为 NA 的列:\n",
"data.dropna(axis=1, how='all')[:3] "
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"ExecuteTime": {
"end_time": "2019-10-24T00:41:49.858547Z",
"start_time": "2019-10-24T00:41:49.843995Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" genres | \n",
" movie_title | \n",
" num_voted_users | \n",
" cast_total_facebook_likes | \n",
" movie_imdb_link | \n",
" imdb_score | \n",
" movie_facebook_likes | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Action|Adventure|Fantasy|Sci-Fi | \n",
" Avatar | \n",
" 886204 | \n",
" 4834 | \n",
" http://www.imdb.com/title/tt0499549/?ref_=fn_t... | \n",
" 7.9 | \n",
" 33000 | \n",
"
\n",
" \n",
" 1 | \n",
" Action|Adventure|Fantasy | \n",
" Pirates of the Caribbean: At World's End | \n",
" 471220 | \n",
" 48350 | \n",
" http://www.imdb.com/title/tt0449088/?ref_=fn_t... | \n",
" 7.1 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" Action|Adventure|Thriller | \n",
" Spectre | \n",
" 275868 | \n",
" 11700 | \n",
" http://www.imdb.com/title/tt2379713/?ref_=fn_t... | \n",
" 6.8 | \n",
" 85000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" genres movie_title \\\n",
"0 Action|Adventure|Fantasy|Sci-Fi Avatar \n",
"1 Action|Adventure|Fantasy Pirates of the Caribbean: At World's End \n",
"2 Action|Adventure|Thriller Spectre \n",
"\n",
" num_voted_users cast_total_facebook_likes \\\n",
"0 886204 4834 \n",
"1 471220 48350 \n",
"2 275868 11700 \n",
"\n",
" movie_imdb_link imdb_score \\\n",
"0 http://www.imdb.com/title/tt0499549/?ref_=fn_t... 7.9 \n",
"1 http://www.imdb.com/title/tt0449088/?ref_=fn_t... 7.1 \n",
"2 http://www.imdb.com/title/tt2379713/?ref_=fn_t... 6.8 \n",
"\n",
" movie_facebook_likes \n",
"0 33000 \n",
"1 0 \n",
"2 85000 "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 删除任何包含空值的列:\n",
"data.dropna(axis=1, how='any')[:3]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### 3. 通过计算填充\n",
"使用数字类型的数据,比如,电影的时长,计算像电影平均时长可以帮我们甚至是数据集。这并不是最优解,但这个持续时间是根据其他数据估算出来的。这样的方式下,就不会因为像 0 或者 NaN这样的值在我们分析的时候而抛错。"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-06T08:03:10.588855Z",
"start_time": "2020-06-06T08:03:10.565510Z"
},
"code_folding": [],
"scrolled": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 178.000000\n",
"1 169.000000\n",
"2 148.000000\n",
"3 164.000000\n",
"4 107.201074\n",
"Name: duration, dtype: float64"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# hist\n",
"# import matplotlib.pyplot as plt\n",
"# plt.hist(data['duration']);\n",
"data['duration'] = data['duration'].fillna(data['duration'].mean())\n",
"data['duration'][:5]\n",
"# 此外还可以使用最大值 最小值 等方式填补"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 二、对数值类型进行规范(转换类型)\n",
"\n",
"有的时候,尤其当我们读取 csv 中一串数字的时候,有的时候数值类型的数字被读成字符串的数字,或将字符串的数字读成数据值类型的数字。"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"ExecuteTime": {
"end_time": "2019-10-24T00:42:22.324202Z",
"start_time": "2019-10-24T00:42:22.319807Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',\n",
" 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',\n",
" 'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',\n",
" 'movie_title', 'num_voted_users', 'cast_total_facebook_likes',\n",
" 'actor_3_name', 'facenumber_in_poster', 'plot_keywords',\n",
" 'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',\n",
" 'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',\n",
" 'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],\n",
" dtype='object')"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.columns"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-06T08:03:14.503967Z",
"start_time": "2020-06-06T08:03:14.476246Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 178\n",
"1 169\n",
"2 148\n",
"3 164\n",
"4 107\n",
" ... \n",
"5038 87\n",
"5039 43\n",
"5040 76\n",
"5041 100\n",
"5042 90\n",
"Name: duration, Length: 5043, dtype: int64"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['duration'] = data['duration'].astype('int')\n",
"data['duration']\n",
"# 这就是告诉 Pandas ‘duration’列的类型是数值类型。\n",
"# 同样的,如果想把上映年读成字符串而不是数值类型,我们使用和上面类似的方法:"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-06T08:05:42.575174Z",
"start_time": "2020-06-06T08:05:42.556393Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 2009.0\n",
"1 2007.0\n",
"2 2015.0\n",
"Name: title_year, dtype: object"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# import numpy as np\n",
"# data['title_year']=data['title_year'].fillna(np.nan)\n",
"data['title_year'].astype(str)[:3]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 三、不识别数据\n",
"\n",
"有时候我们不想看到某列或者某行的空值,但却该列缺失不是关键量不需要删除\n",
"可以选择忽略/不识别该数据的方法"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-06T08:08:39.164363Z",
"start_time": "2020-06-06T08:08:39.140576Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 James Cameron\n",
"1 Gore Verbinski\n",
"2 Sam Mendes\n",
"3 Christopher Nolan\n",
"4 Doug Walker\n",
" ... \n",
"5038 Scott Smith\n",
"5039 \n",
"5040 Benjamin Roberds\n",
"5041 Daniel Hsia\n",
"5042 Jon Gunn\n",
"Name: director_name, Length: 5043, dtype: object"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['director_name'] = data['director_name'].replace({'NaN',''})\n",
"data['director_name'].fillna('')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"同样的, 对于爬虫文本中可能会出现换错行的情况\n",
"\n",
"例如 原文本为112233\n",
"\n",
"因为错换行情况变成11\n",
"\n",
" 22\n",
" \n",
" 33\n",
" \n",
"因此,用这个方法可以可以替换文本中存在的\\n(还可用来其他转移转义字符 \\n\\t\\r"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 四、重复数据\n",
"\n",
"有的时候数据集中会有一些重复的数据。在我们的数据集中也添加了重复的数据。"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-06T08:09:52.895805Z",
"start_time": "2020-06-06T08:09:52.876394Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# 首先我们校验一下是否存在重复记录。\n",
"# 如果存在重复记录,就使用 Pandas 提供的 drop_duplicates() 来删除重复数据。\n",
"\n",
"data.drop_duplicates(['director_name','duration'],inplace=True)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## 案例2:病人心脏病数据\n",
"\n",
"https://www.cnblogs.com/BoyceYang/p/8186033.html\n"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-06T08:10:12.174552Z",
"start_time": "2020-06-06T08:10:12.122637Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 1 | \n",
" Mickéy Mousé | \n",
" 56 | \n",
" 70kgs | \n",
" 72 | \n",
" 69 | \n",
" 71 | \n",
" - | \n",
" -.1 | \n",
" -.2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2.0 | \n",
" Donald Duck | \n",
" 34.0 | \n",
" 154.89lbs | \n",
" - | \n",
" - | \n",
" - | \n",
" 85 | \n",
" 84 | \n",
" 76 | \n",
"
\n",
" \n",
" 1 | \n",
" 3.0 | \n",
" Mini Mouse | \n",
" 16.0 | \n",
" NaN | \n",
" - | \n",
" - | \n",
" - | \n",
" 65 | \n",
" 69 | \n",
" 72 | \n",
"
\n",
" \n",
" 2 | \n",
" 4.0 | \n",
" Scrooge McDuck | \n",
" NaN | \n",
" 78kgs | \n",
" 78 | \n",
" 79 | \n",
" 72 | \n",
" - | \n",
" - | \n",
" - | \n",
"
\n",
" \n",
" 3 | \n",
" 5.0 | \n",
" Pink Panther | \n",
" 54.0 | \n",
" 198.658lbs | \n",
" - | \n",
" - | \n",
" - | \n",
" 69 | \n",
" NaN | \n",
" 75 | \n",
"
\n",
" \n",
" 4 | \n",
" 6.0 | \n",
" Huey McDuck | \n",
" 52.0 | \n",
" 189lbs | \n",
" - | \n",
" - | \n",
" - | \n",
" 68 | \n",
" 75 | \n",
" 72 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 1 Mickéy Mousé 56 70kgs 72 69 71 - -.1 -.2\n",
"0 2.0 Donald Duck 34.0 154.89lbs - - - 85 84 76\n",
"1 3.0 Mini Mouse 16.0 NaN - - - 65 69 72\n",
"2 4.0 Scrooge McDuck NaN 78kgs 78 79 72 - - -\n",
"3 5.0 Pink Panther 54.0 198.658lbs - - - 69 NaN 75\n",
"4 6.0 Huey McDuck 52.0 189lbs - - - 68 75 72"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"df = pd.read_csv('../data/patient_heart_rate.csv')\n",
"df.head() "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"- 没有列头\n",
"- 一个列有多个参数\n",
"- 列数据的单位不统一\n",
"- 缺失值\n",
"- 空行\n",
"- 重复数据\n",
"- 非 ASCII 字符\n",
"- 有些列头应该是数据,而不应该是列名参数"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### 五、增加列头"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-06T08:10:47.826616Z",
"start_time": "2020-06-06T08:10:47.762710Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" name | \n",
" age | \n",
" weight | \n",
" m0006 | \n",
" m0612 | \n",
" m1218 | \n",
" f0006 | \n",
" f0612 | \n",
" f1218 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" Mickéy Mousé | \n",
" 56.0 | \n",
" 70kgs | \n",
" 72 | \n",
" 69 | \n",
" 71 | \n",
" - | \n",
" - | \n",
" - | \n",
"
\n",
" \n",
" 1 | \n",
" 2.0 | \n",
" Donald Duck | \n",
" 34.0 | \n",
" 154.89lbs | \n",
" - | \n",
" - | \n",
" - | \n",
" 85 | \n",
" 84 | \n",
" 76 | \n",
"
\n",
" \n",
" 2 | \n",
" 3.0 | \n",
" Mini Mouse | \n",
" 16.0 | \n",
" NaN | \n",
" - | \n",
" - | \n",
" - | \n",
" 65 | \n",
" 69 | \n",
" 72 | \n",
"
\n",
" \n",
" 3 | \n",
" 4.0 | \n",
" Scrooge McDuck | \n",
" NaN | \n",
" 78kgs | \n",
" 78 | \n",
" 79 | \n",
" 72 | \n",
" - | \n",
" - | \n",
" - | \n",
"
\n",
" \n",
" 4 | \n",
" 5.0 | \n",
" Pink Panther | \n",
" 54.0 | \n",
" 198.658lbs | \n",
" - | \n",
" - | \n",
" - | \n",
" 69 | \n",
" NaN | \n",
" 75 | \n",
"
\n",
" \n",
" 5 | \n",
" 6.0 | \n",
" Huey McDuck | \n",
" 52.0 | \n",
" 189lbs | \n",
" - | \n",
" - | \n",
" - | \n",
" 68 | \n",
" 75 | \n",
" 72 | \n",
"
\n",
" \n",
" 6 | \n",
" 7.0 | \n",
" Dewey McDuck | \n",
" 19.0 | \n",
" 56kgs | \n",
" - | \n",
" - | \n",
" - | \n",
" 71 | \n",
" 78 | \n",
" 75 | \n",
"
\n",
" \n",
" 7 | \n",
" 8.0 | \n",
" Scööpy Doo | \n",
" 32.0 | \n",
" 78kgs | \n",
" 78 | \n",
" 76 | \n",
" 75 | \n",
" - | \n",
" - | \n",
" - | \n",
"
\n",
" \n",
" 8 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 9 | \n",
" 9.0 | \n",
" Huey McDuck | \n",
" 52.0 | \n",
" 189lbs | \n",
" - | \n",
" - | \n",
" - | \n",
" 68 | \n",
" 75 | \n",
" 72 | \n",
"
\n",
" \n",
" 10 | \n",
" 10.0 | \n",
" Louie McDuck | \n",
" 12.0 | \n",
" 45kgs | \n",
" - | \n",
" - | \n",
" - | \n",
" 92 | \n",
" 95 | \n",
" 87 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id name age weight m0006 m0612 m1218 f0006 f0612 f1218\n",
"0 1.0 Mickéy Mousé 56.0 70kgs 72 69 71 - - -\n",
"1 2.0 Donald Duck 34.0 154.89lbs - - - 85 84 76\n",
"2 3.0 Mini Mouse 16.0 NaN - - - 65 69 72\n",
"3 4.0 Scrooge McDuck NaN 78kgs 78 79 72 - - -\n",
"4 5.0 Pink Panther 54.0 198.658lbs - - - 69 NaN 75\n",
"5 6.0 Huey McDuck 52.0 189lbs - - - 68 75 72\n",
"6 7.0 Dewey McDuck 19.0 56kgs - - - 71 78 75\n",
"7 8.0 Scööpy Doo 32.0 78kgs 78 76 75 - - -\n",
"8 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN\n",
"9 9.0 Huey McDuck 52.0 189lbs - - - 68 75 72\n",
"10 10.0 Louie McDuck 12.0 45kgs - - - 92 95 87"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"column_names= ['id', 'name', 'age', 'weight','m0006','m0612','m1218','f0006','f0612','f1218']\n",
"df = pd.read_csv('../data/patient_heart_rate.csv', names = column_names)\n",
"df.head(15)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### 六、单位列数据的单位不统一\n",
"\n",
"如果仔细观察数据集可以发现 Weight 列的单位不统一。有的单位是 kgs,有的单位是 lbs (磅)"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-06T08:11:44.798237Z",
"start_time": "2020-06-06T08:11:44.758864Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" name | \n",
" age | \n",
" weight | \n",
" m0006 | \n",
" m0612 | \n",
" m1218 | \n",
" f0006 | \n",
" f0612 | \n",
" f1218 | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" 2.0 | \n",
" Donald Duck | \n",
" 34.0 | \n",
" 154.89lbs | \n",
" - | \n",
" - | \n",
" - | \n",
" 85 | \n",
" 84 | \n",
" 76 | \n",
"
\n",
" \n",
" 4 | \n",
" 5.0 | \n",
" Pink Panther | \n",
" 54.0 | \n",
" 198.658lbs | \n",
" - | \n",
" - | \n",
" - | \n",
" 69 | \n",
" NaN | \n",
" 75 | \n",
"
\n",
" \n",
" 5 | \n",
" 6.0 | \n",
" Huey McDuck | \n",
" 52.0 | \n",
" 189lbs | \n",
" - | \n",
" - | \n",
" - | \n",
" 68 | \n",
" 75 | \n",
" 72 | \n",
"
\n",
" \n",
" 9 | \n",
" 9.0 | \n",
" Huey McDuck | \n",
" 52.0 | \n",
" 189lbs | \n",
" - | \n",
" - | \n",
" - | \n",
" 68 | \n",
" 75 | \n",
" 72 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id name age weight m0006 m0612 m1218 f0006 f0612 f1218\n",
"1 2.0 Donald Duck 34.0 154.89lbs - - - 85 84 76\n",
"4 5.0 Pink Panther 54.0 198.658lbs - - - 69 NaN 75\n",
"5 6.0 Huey McDuck 52.0 189lbs - - - 68 75 72\n",
"9 9.0 Huey McDuck 52.0 189lbs - - - 68 75 72"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 获取 weight 数据列中单位为 lbs 的数据\n",
"rows_with_lbs = df['weight'].str.contains('lbs').fillna(False)\n",
"df[rows_with_lbs]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"为了解决这个问题,将单位统一,我们将单位是 lbs(磅) 的数据转换成 kgs(千克)。 `1Kg=2.20462LBS`"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-06T08:12:39.724996Z",
"start_time": "2020-06-06T08:12:39.710618Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"for i,lbs_row in df[rows_with_lbs].iterrows():\n",
" weight = int(float(lbs_row['weight'][:-3])/2.2)\n",
" df.at[i,'weight'] = '{}kgs'.format(weight) "
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-06T08:12:45.365652Z",
"start_time": "2020-06-06T08:12:45.319725Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" name | \n",
" age | \n",
" weight | \n",
" m0006 | \n",
" m0612 | \n",
" m1218 | \n",
" f0006 | \n",
" f0612 | \n",
" f1218 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" Mickéy Mousé | \n",
" 56.0 | \n",
" 70kgs | \n",
" 72 | \n",
" 69 | \n",
" 71 | \n",
" - | \n",
" - | \n",
" - | \n",
"
\n",
" \n",
" 1 | \n",
" 2.0 | \n",
" Donald Duck | \n",
" 34.0 | \n",
" 70kgs | \n",
" - | \n",
" - | \n",
" - | \n",
" 85 | \n",
" 84 | \n",
" 76 | \n",
"
\n",
" \n",
" 2 | \n",
" 3.0 | \n",
" Mini Mouse | \n",
" 16.0 | \n",
" NaN | \n",
" - | \n",
" - | \n",
" - | \n",
" 65 | \n",
" 69 | \n",
" 72 | \n",
"
\n",
" \n",
" 3 | \n",
" 4.0 | \n",
" Scrooge McDuck | \n",
" NaN | \n",
" 78kgs | \n",
" 78 | \n",
" 79 | \n",
" 72 | \n",
" - | \n",
" - | \n",
" - | \n",
"
\n",
" \n",
" 4 | \n",
" 5.0 | \n",
" Pink Panther | \n",
" 54.0 | \n",
" 90kgs | \n",
" - | \n",
" - | \n",
" - | \n",
" 69 | \n",
" NaN | \n",
" 75 | \n",
"
\n",
" \n",
" 5 | \n",
" 6.0 | \n",
" Huey McDuck | \n",
" 52.0 | \n",
" 85kgs | \n",
" - | \n",
" - | \n",
" - | \n",
" 68 | \n",
" 75 | \n",
" 72 | \n",
"
\n",
" \n",
" 6 | \n",
" 7.0 | \n",
" Dewey McDuck | \n",
" 19.0 | \n",
" 56kgs | \n",
" - | \n",
" - | \n",
" - | \n",
" 71 | \n",
" 78 | \n",
" 75 | \n",
"
\n",
" \n",
" 7 | \n",
" 8.0 | \n",
" Scööpy Doo | \n",
" 32.0 | \n",
" 78kgs | \n",
" 78 | \n",
" 76 | \n",
" 75 | \n",
" - | \n",
" - | \n",
" - | \n",
"
\n",
" \n",
" 8 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 9 | \n",
" 9.0 | \n",
" Huey McDuck | \n",
" 52.0 | \n",
" 85kgs | \n",
" - | \n",
" - | \n",
" - | \n",
" 68 | \n",
" 75 | \n",
" 72 | \n",
"
\n",
" \n",
" 10 | \n",
" 10.0 | \n",
" Louie McDuck | \n",
" 12.0 | \n",
" 45kgs | \n",
" - | \n",
" - | \n",
" - | \n",
" 92 | \n",
" 95 | \n",
" 87 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id name age weight m0006 m0612 m1218 f0006 f0612 f1218\n",
"0 1.0 Mickéy Mousé 56.0 70kgs 72 69 71 - - -\n",
"1 2.0 Donald Duck 34.0 70kgs - - - 85 84 76\n",
"2 3.0 Mini Mouse 16.0 NaN - - - 65 69 72\n",
"3 4.0 Scrooge McDuck NaN 78kgs 78 79 72 - - -\n",
"4 5.0 Pink Panther 54.0 90kgs - - - 69 NaN 75\n",
"5 6.0 Huey McDuck 52.0 85kgs - - - 68 75 72\n",
"6 7.0 Dewey McDuck 19.0 56kgs - - - 71 78 75\n",
"7 8.0 Scööpy Doo 32.0 78kgs 78 76 75 - - -\n",
"8 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN\n",
"9 9.0 Huey McDuck 52.0 85kgs - - - 68 75 72\n",
"10 10.0 Louie McDuck 12.0 45kgs - - - 92 95 87"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(12)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### 七、空行 \n",
"\n",
"仔细对比会发现我们的数据中一行空行,除了 index 之外,全部的值都是 NaN。\n",
"\n",
"Pandas 的 read_csv() 并没有可选参数来忽略空行,这样,我们就需要在数据被读入之后再使用 dropna() 进行处理,删除空行."
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-06T08:13:17.073273Z",
"start_time": "2020-06-06T08:13:17.020781Z"
},
"scrolled": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" name | \n",
" age | \n",
" weight | \n",
" m0006 | \n",
" m0612 | \n",
" m1218 | \n",
" f0006 | \n",
" f0612 | \n",
" f1218 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" Mickéy Mousé | \n",
" 56.0 | \n",
" 70kgs | \n",
" 72 | \n",
" 69 | \n",
" 71 | \n",
" - | \n",
" - | \n",
" - | \n",
"
\n",
" \n",
" 1 | \n",
" 2.0 | \n",
" Donald Duck | \n",
" 34.0 | \n",
" 70kgs | \n",
" - | \n",
" - | \n",
" - | \n",
" 85 | \n",
" 84 | \n",
" 76 | \n",
"
\n",
" \n",
" 2 | \n",
" 3.0 | \n",
" Mini Mouse | \n",
" 16.0 | \n",
" NaN | \n",
" - | \n",
" - | \n",
" - | \n",
" 65 | \n",
" 69 | \n",
" 72 | \n",
"
\n",
" \n",
" 3 | \n",
" 4.0 | \n",
" Scrooge McDuck | \n",
" NaN | \n",
" 78kgs | \n",
" 78 | \n",
" 79 | \n",
" 72 | \n",
" - | \n",
" - | \n",
" - | \n",
"
\n",
" \n",
" 4 | \n",
" 5.0 | \n",
" Pink Panther | \n",
" 54.0 | \n",
" 90kgs | \n",
" - | \n",
" - | \n",
" - | \n",
" 69 | \n",
" NaN | \n",
" 75 | \n",
"
\n",
" \n",
" 5 | \n",
" 6.0 | \n",
" Huey McDuck | \n",
" 52.0 | \n",
" 85kgs | \n",
" - | \n",
" - | \n",
" - | \n",
" 68 | \n",
" 75 | \n",
" 72 | \n",
"
\n",
" \n",
" 6 | \n",
" 7.0 | \n",
" Dewey McDuck | \n",
" 19.0 | \n",
" 56kgs | \n",
" - | \n",
" - | \n",
" - | \n",
" 71 | \n",
" 78 | \n",
" 75 | \n",
"
\n",
" \n",
" 7 | \n",
" 8.0 | \n",
" Scööpy Doo | \n",
" 32.0 | \n",
" 78kgs | \n",
" 78 | \n",
" 76 | \n",
" 75 | \n",
" - | \n",
" - | \n",
" - | \n",
"
\n",
" \n",
" 9 | \n",
" 9.0 | \n",
" Huey McDuck | \n",
" 52.0 | \n",
" 85kgs | \n",
" - | \n",
" - | \n",
" - | \n",
" 68 | \n",
" 75 | \n",
" 72 | \n",
"
\n",
" \n",
" 10 | \n",
" 10.0 | \n",
" Louie McDuck | \n",
" 12.0 | \n",
" 45kgs | \n",
" - | \n",
" - | \n",
" - | \n",
" 92 | \n",
" 95 | \n",
" 87 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id name age weight m0006 m0612 m1218 f0006 f0612 f1218\n",
"0 1.0 Mickéy Mousé 56.0 70kgs 72 69 71 - - -\n",
"1 2.0 Donald Duck 34.0 70kgs - - - 85 84 76\n",
"2 3.0 Mini Mouse 16.0 NaN - - - 65 69 72\n",
"3 4.0 Scrooge McDuck NaN 78kgs 78 79 72 - - -\n",
"4 5.0 Pink Panther 54.0 90kgs - - - 69 NaN 75\n",
"5 6.0 Huey McDuck 52.0 85kgs - - - 68 75 72\n",
"6 7.0 Dewey McDuck 19.0 56kgs - - - 71 78 75\n",
"7 8.0 Scööpy Doo 32.0 78kgs 78 76 75 - - -\n",
"9 9.0 Huey McDuck 52.0 85kgs - - - 68 75 72\n",
"10 10.0 Louie McDuck 12.0 45kgs - - - 92 95 87"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 删除全空的行\n",
"df.dropna(how='all',inplace=True)\n",
"df.head(12)"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-06T08:13:58.797251Z",
"start_time": "2020-06-06T08:13:58.611614Z"
},
"code_folding": [],
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" id age weight first_name last_name puls_rate sex hour\n",
"0 1.0 56.0 70kgs Micky Mous 72 m 00-06\n",
"1 1.0 56.0 70kgs Micky Mous 69 m 06-12\n",
"2 1.0 56.0 70kgs Micky Mous 71 m 12-18\n",
"3 2.0 34.0 154.89lbs Donald Duck 85 f 00-06\n",
"4 2.0 34.0 154.89lbs Donald Duck 84 f 06-12\n",
"5 2.0 34.0 154.89lbs Donald Duck 76 f 12-18\n",
"6 3.0 16.0 NaN Mini Mouse 65 f 00-06\n",
"7 3.0 16.0 NaN Mini Mouse 69 f 06-12\n",
"8 3.0 16.0 NaN Mini Mouse 72 f 12-18\n",
"9 4.0 NaN 78kgs Scrooge McDuck 78 m 00-06\n",
"10 4.0 NaN 78kgs Scrooge McDuck 79 m 06-12\n",
"11 4.0 NaN 78kgs Scrooge McDuck 72 m 12-18\n",
"12 5.0 54.0 198.658lbs Pink Panther 69 f 00-06\n",
"13 5.0 54.0 198.658lbs Pink Panther NaN f 06-12\n",
"14 5.0 54.0 198.658lbs Pink Panther 75 f 12-18\n",
"15 6.0 52.0 189lbs Huey McDuck 68 f 00-06\n",
"16 6.0 52.0 189lbs Huey McDuck 75 f 06-12\n",
"17 6.0 52.0 189lbs Huey McDuck 72 f 12-18\n",
"18 7.0 19.0 56kgs Dewey McDuck 71 f 00-06\n",
"19 7.0 19.0 56kgs Dewey McDuck 78 f 06-12\n",
"20 7.0 19.0 56kgs Dewey McDuck 75 f 12-18\n",
"21 8.0 32.0 78kgs Scpy Doo 78 m 00-06\n",
"22 8.0 32.0 78kgs Scpy Doo 76 m 06-12\n",
"23 8.0 32.0 78kgs Scpy Doo 75 m 12-18\n",
"24 10.0 12.0 45kgs Louie McDuck 92 f 00-06\n",
"25 10.0 12.0 45kgs Louie McDuck 95 f 06-12\n",
"26 10.0 12.0 45kgs Louie McDuck 87 f 12-18\n"
]
}
],
"source": [
"import pandas as pd\n",
"# 增加列头\n",
"column_names= ['id', 'name', 'age', 'weight','m0006','m0612','m1218','f0006','f0612','f1218']\n",
"df = pd.read_csv('../data/patient_heart_rate.csv', names = column_names)\n",
"# 切分名字,删除源数据列\n",
"df[['first_name','last_name']] = df['name'].str.split(expand=True)\n",
"df.drop('name', axis=1, inplace=True)\n",
"# 获取 weight 数据列中单位为 lbs 的数据\n",
"rows_with_lbs = df['weight'].str.contains('lbs').fillna(False)\n",
"df[rows_with_lbs]\n",
"# 将 lbs 的数据转换为 kgs 数据\n",
"for i,lbs_row in df[rows_with_lbs].iterrows():\n",
" weight = int(float(lbs_row['weight'][:-3])/2.2)\n",
"df.at[i,'weight'] = '{}kgs'.format(weight)\n",
"# 删除全空的行\n",
"df.dropna(how='all',inplace=True)\n",
"# 删除重复数据行\n",
"df.drop_duplicates(['first_name','last_name'],inplace=True)\n",
"# 删除非 ASCII 字符\n",
"df['first_name'].replace({r'[^\\x00-\\x7F]+':''}, regex=True, inplace=True)\n",
"df['last_name'].replace({r'[^\\x00-\\x7F]+':''}, regex=True, inplace=True)\n",
"# 切分 sex_hour 列为 sex 列和 hour 列\n",
"sorted_columns = ['id','age','weight','first_name','last_name']\n",
"df = pd.melt(df,\n",
"id_vars=sorted_columns,var_name='sex_hour',value_name='puls_rate').sort_values(sorted_columns)\n",
"df[['sex','hour']] = df['sex_hour'].apply(lambda x:pd.Series(([x[:1],'{}-{}'.format(x[1:3],x[3:])])))[[0,1]]\n",
"df.dr op('sex_hour', axis=1, inplace=True)\n",
"# 删除没有心率的数据\n",
"row_with_dashes = df['puls_rate'].str.contains('-').fillna(False)\n",
"df.drop(df[row_with_dashes].index,\n",
"inplace=True)\n",
"# 重置索引,不做也没关系,主要是为了看着美观一点\n",
"df = df.reset_index(drop=True)\n",
"print(df)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"更多练习: https://github.com/computational-class/data_cleaning \n",
"\n",
"![image.png](./images/end.png)"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autoclose": false,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}