{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "# Data Cleaning: Twitter Data\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Data Cleaning\n", "Data cleaning is an important step in data analysis. Its main goal is to turn messy data into data that can be analyzed directly, which usually means converting it into a data frame.\n", "\n", "This chapter uses the cleaning of tweet text as an example to introduce the basic logic of data cleaning.\n", "\n", "- Removing malformed rows\n", "- Splitting columns correctly\n", "- Extracting the content to be analyzed\n", "- Preprocessing large-scale data line by line or in chunks\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Handling the delimiter and the quote character together\n", "\n", "- Delimiter 🔥 (column separator): `sep`, `delimiter`\n", "- Quote character ☁️: `quotechar`\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:07:10.567235Z", "start_time": "2023-11-17T07:07:10.551054Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Note: you may need to change the path below\n", "with open(\"./data/ows_tweets_sample.txt\", 'r') as f:\n", "    chunk = f.readlines()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:07:14.434792Z", "start_time": "2023-11-17T07:07:14.425674Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "2754" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(chunk)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:07:29.063364Z", "start_time": "2023-11-17T07:07:29.035705Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2628\n" ] } ], "source": [ "import csv\n", "# Parse the raw lines as CSV, honoring the delimiter and the quote character\n", "lines_csv = csv.reader(chunk, delimiter=',', quotechar='\"')\n", "print(len(list(lines_csv)))\n", "# next(lines_csv)\n", "# next(lines_csv)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:07:57.675044Z", "start_time":
"2023-11-17T07:07:57.170877Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "
 | Twitter ID | Text | Profile Image URL | Day | Hour | Minute | Created At | Geo | From User | From User ID | Language | To User | To User ID | Source |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 121813144174727168 | RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLIN... | http://a2.twimg.com/profile_images/1539375713/... | 2011-10-06 | 5 | 4 | 2011-10-06 05:04:51 | N; | Anonops_Cop | 401240477 | en | NaN | 0 | <a href="http://twitter.com/">... |
1 | 121813146137657344 | @jamiekilstein @allisonkilkenny Interesting in... | http://a2.twimg.com/profile_images/1574715503/... | 2011-10-06 | 5 | 4 | 2011-10-06 05:04:51 | N; | KittyHybrid | 34532053 | en | jamiekilstein | 2149053 | <a href="http://twitter.com/">... |
2 | 121813150000619521 | @Seductivpancake Right! Those guys have a vict... | http://a1.twimg.com/profile_images/1241412831/... | 2011-10-06 | 5 | 4 | 2011-10-06 05:04:52 | N; | nerdsherpa | 95067344 | en | Seductivpancake | 19695580 | <a href="http://www.echofon.com/"... | \n", "