{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"\n",
"# 对大数据进行预处理\n",
"\n",
"以占领华尔街推特数据为例\n",
"\n",
"\n",
"\n",
"![image.png](./images/author.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## 字节(Byte /bait/)\n",
"\n",
"计算机信息技术用于计量存储容量的一种计量单位,通常情况下一字节等于有八位, [1] 也表示一些计算机编程语言中的数据类型和语言字符。\n",
"- 1B(byte,字节)= 8 bit;\n",
"- 1KB=1000B;1MB=1000KB=1000×1000B。其中1000=10^3。\n",
"- 1KB(kilobyte,千字节)=1000B= 10^3 B;\n",
"- 1MB(Megabyte,兆字节,百万字节,简称“兆”)=1000KB= 10^6 B;\n",
"- 1GB(Gigabyte,吉字节,十亿字节,又称“千兆”)=1000MB= 10^9 B;"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## 分段读取数据并进行处理\n",
"\n",
"Lazy Method for Reading Big File in Python?"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2023-11-17T06:22:59.078274Z",
"start_time": "2023-11-17T06:22:49.044934Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"9\r"
]
}
],
"source": [
"from time import sleep\n",
"# import sys\n",
"\n",
"# flush print\n",
"# def flushPrint(d):\n",
"# sys.stdout.write('\\r')\n",
"# sys.stdout.write(str(d))\n",
"# sys.stdout.flush()\n",
"for i in range(10): \n",
" sleep(1)\n",
" print(i, end= '\\r')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2023-11-17T06:26:50.084136Z",
"start_time": "2023-11-17T06:26:36.844506Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"6900000\r"
]
}
],
"source": [
"# 按行读取数据\n",
"line_num = 0\n",
"cops_num = 0\n",
"# windows users may need to add encoding = 'utf8' into the folling line.\n",
"with open('/Users/datalab/bigdata/cjc/ows-raw.txt', 'r') as f:\n",
" for i in f:\n",
" line_num += 1\n",
" if 'cops' in i:\n",
" cops_num += 1\n",
" if line_num % 100000 ==0:\n",
" print(line_num, end='\\r')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2023-11-17T06:27:01.999022Z",
"start_time": "2023-11-17T06:27:01.988434Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"6911408"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"line_num"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"ExecuteTime": {
"end_time": "2023-11-17T06:27:23.023515Z",
"start_time": "2023-11-17T06:27:23.019331Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0.011413448605551865"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cops_num/line_num"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"ExecuteTime": {
"end_time": "2023-11-17T06:28:47.966196Z",
"start_time": "2023-11-17T06:28:47.953838Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2754\n"
]
}
],
"source": [
"bigfile = open('/Users/datalab/bigdata/cjc/ows-raw.txt', 'r')\n",
"chunkSize = 1000000\n",
"chunk = bigfile.readlines(chunkSize)\n",
"print(len(chunk))\n",
"# with open(\"../data/ows_tweets_sample.txt\", 'w') as f:\n",
"# for i in chunk:\n",
"# f.write(i) "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2020-10-20T01:48:30.428016Z",
"start_time": "2020-10-20T01:48:30.363808Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"bigfile.readlines?"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"ExecuteTime": {
"end_time": "2020-10-20T01:50:03.152797Z",
"start_time": "2020-10-20T01:50:03.149044Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"5%5"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"ExecuteTime": {
"end_time": "2023-11-17T06:34:57.560930Z",
"start_time": "2023-11-17T06:33:53.291904Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"25 6602141\r"
]
}
],
"source": [
"# https://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python?lq=1\n",
"import csv\n",
"bigfile = open('/Users/datalab/bigdata/cjc/ows-raw.txt', 'r')\n",
"chunkSize = 10**8\n",
"chunk = bigfile.readlines(chunkSize)\n",
"num_chunk, num_lines, num_cops = 0, 0, 0\n",
"while chunk:\n",
" lines = csv.reader((line.replace('\\x00','') for line in chunk), \n",
" delimiter=',', quotechar='\"')\n",
" # do sth.\n",
" num_lines += len(list(lines))\n",
" for i in lines:\n",
" if 'cops' in i:\n",
" num_cops +=1\n",
" if num_chunk % 5 ==0:\n",
" print(num_chunk, num_lines, end = '\\r')\n",
" num_chunk += 1\n",
" chunk = bigfile.readlines(chunkSize) # read another chunk"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## 用Pandas的get_chunk功能来处理亿级数据\n",
"\n",
"> 只有在超过5TB数据量的规模下,Hadoop才是一个合理的技术选择。"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"ExecuteTime": {
"end_time": "2023-11-17T06:38:41.480856Z",
"start_time": "2023-11-17T06:38:32.402014Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"100000"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"f = open('/Users/datalab/bigdata/cjc/ows-raw.txt',encoding='utf-8')\n",
"reader = pd.read_table(f, sep=',', quotechar='\"', iterator=True, on_bad_lines='skip') #跳过报错行\n",
"chunkSize = 100000\n",
"chunk = reader.get_chunk(chunkSize)\n",
"len(chunk)\n",
"\n",
"#pd.read_table?"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"ExecuteTime": {
"end_time": "2023-11-17T06:38:55.954580Z",
"start_time": "2023-11-17T06:38:55.938243Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Twitter ID | \n",
" Text | \n",
" Profile Image URL | \n",
" Day | \n",
" Hour | \n",
" Minute | \n",
" Created At | \n",
" Geo | \n",
" From User | \n",
" From User ID | \n",
" Language | \n",
" To User | \n",
" To User ID | \n",
" Source | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 121813144174727168 | \n",
" RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLIN... | \n",
" http://a2.twimg.com/profile_images/1539375713/... | \n",
" 2011-10-06 | \n",
" 5 | \n",
" 4 | \n",
" 2011-10-06 05:04:51 | \n",
" N; | \n",
" Anonops_Cop | \n",
" 401240477 | \n",
" en | \n",
" NaN | \n",
" 0 | \n",
" <a href="http://twitter.com/">... | \n",
"
\n",
" \n",
" 1 | \n",
" 121813146137657344 | \n",
" @jamiekilstein @allisonkilkenny Interesting in... | \n",
" http://a2.twimg.com/profile_images/1574715503/... | \n",
" 2011-10-06 | \n",
" 5 | \n",
" 4 | \n",
" 2011-10-06 05:04:51 | \n",
" N; | \n",
" KittyHybrid | \n",
" 34532053 | \n",
" en | \n",
" jamiekilstein | \n",
" 2149053 | \n",
" <a href="http://twitter.com/">... | \n",
"
\n",
" \n",
" 2 | \n",
" 121813150000619521 | \n",
" @Seductivpancake Right! Those guys have a vict... | \n",
" http://a1.twimg.com/profile_images/1241412831/... | \n",
" 2011-10-06 | \n",
" 5 | \n",
" 4 | \n",
" 2011-10-06 05:04:52 | \n",
" N; | \n",
" nerdsherpa | \n",
" 95067344 | \n",
" en | \n",
" Seductivpancake | \n",
" 19695580 | \n",
" <a href="http://www.echofon.com/"... | \n",
"
\n",
" \n",
" 3 | \n",
" 121813150701072385 | \n",
" RT @bembel "Occupy Wall Street" als ... | \n",
" http://a0.twimg.com/profile_images/1106399092/... | \n",
" 2011-10-06 | \n",
" 5 | \n",
" 4 | \n",
" 2011-10-06 05:04:52 | \n",
" N; | \n",
" hamudistan | \n",
" 35862923 | \n",
" en | \n",
" NaN | \n",
" 0 | \n",
" <a href="http://levelupstudio.com"... | \n",
"
\n",
" \n",
" 4 | \n",
" 121813163778899968 | \n",
" #ows White shirt= Brown shirt. | \n",
" http://a2.twimg.com/profile_images/1568117871/... | \n",
" 2011-10-06 | \n",
" 5 | \n",
" 4 | \n",
" 2011-10-06 05:04:56 | \n",
" N; | \n",
" kl_knox | \n",
" 419580636 | \n",
" en | \n",
" NaN | \n",
" 0 | \n",
" <a href="http://twitter.com/">... | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Twitter ID Text \\\n",
"0 121813144174727168 RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLIN... \n",
"1 121813146137657344 @jamiekilstein @allisonkilkenny Interesting in... \n",
"2 121813150000619521 @Seductivpancake Right! Those guys have a vict... \n",
"3 121813150701072385 RT @bembel "Occupy Wall Street" als ... \n",
"4 121813163778899968 #ows White shirt= Brown shirt. \n",
"\n",
" Profile Image URL Day Hour \\\n",
"0 http://a2.twimg.com/profile_images/1539375713/... 2011-10-06 5 \n",
"1 http://a2.twimg.com/profile_images/1574715503/... 2011-10-06 5 \n",
"2 http://a1.twimg.com/profile_images/1241412831/... 2011-10-06 5 \n",
"3 http://a0.twimg.com/profile_images/1106399092/... 2011-10-06 5 \n",
"4 http://a2.twimg.com/profile_images/1568117871/... 2011-10-06 5 \n",
"\n",
" Minute Created At Geo From User From User ID Language \\\n",
"0 4 2011-10-06 05:04:51 N; Anonops_Cop 401240477 en \n",
"1 4 2011-10-06 05:04:51 N; KittyHybrid 34532053 en \n",
"2 4 2011-10-06 05:04:52 N; nerdsherpa 95067344 en \n",
"3 4 2011-10-06 05:04:52 N; hamudistan 35862923 en \n",
"4 4 2011-10-06 05:04:56 N; kl_knox 419580636 en \n",
"\n",
" To User To User ID \\\n",
"0 NaN 0 \n",
"1 jamiekilstein 2149053 \n",
"2 Seductivpancake 19695580 \n",
"3 NaN 0 \n",
"4 NaN 0 \n",
"\n",
" Source \n",
"0 <a href="http://twitter.com/">... \n",
"1 <a href="http://twitter.com/">... \n",
"2 <a href="http://www.echofon.com/"... \n",
"3 <a href="http://levelupstudio.com"... \n",
"4 <a href="http://twitter.com/">... "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chunk.head()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"ExecuteTime": {
"end_time": "2023-11-17T06:45:05.267226Z",
"start_time": "2023-11-17T06:43:47.216691Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Iteration is stopped.\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"f = open('/Users/datalab/bigdata/cjc/ows-raw.txt',encoding='utf-8')\n",
"reader = pd.read_table(f, sep=',', quotechar='\"', \n",
" iterator=True, on_bad_lines='skip') #跳过报错行\n",
"chunkSize = 100000\n",
"loop = True\n",
"cops_data = []\n",
"num_chunk, num_lines = 0, 0\n",
"while loop:\n",
" try:\n",
" chunk = reader.get_chunk(chunkSize)\n",
" # dat = data_cleaning_funtion(chunk) # do sth. e.g., if cops in dat\n",
" dat=[chunk.loc[k] for k in chunk.index if 'cops' in str(chunk['Text'][k]) ]\n",
" num_lines += len(chunk)\n",
" print(num_chunk, num_lines, end = '\\r')\n",
" num_chunk +=1\n",
" for d in dat:\n",
" cops_data.append(d) \n",
" except StopIteration:\n",
" loop = False\n",
" print(\"Iteration is stopped.\")\n",
"#df = pd.concat(data, ignore_index=True)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2023-11-17T09:09:27.799944Z",
"start_time": "2023-11-17T09:08:54.885171Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total sum of the specified column: 6602120\n"
]
}
],
"source": [
"# chatgpt告诉我这样更简单!!\n",
"file_path = '/Users/datalab/bigdata/cjc/ows-raw.txt'\n",
"\n",
"# Specify the chunk size (number of rows to read at a time)\n",
"chunk_size = 100000\n",
"\n",
"# Create a dataframe reader object\n",
"chunk_reader = pd.read_csv(file_path, sep=',', quotechar='\"', \n",
" iterator=True, on_bad_lines='skip', chunksize=chunk_size)\n",
"\n",
"# Initialize a variable to store the total sum\n",
"total_sum = 0\n",
"num_chunk = 0\n",
"# Iterate over chunks\n",
"for chunk in chunk_reader:\n",
" # Process the chunk as needed\n",
" # For example, calculate the sum of a specific column\n",
" column_sum = len(chunk['Text'])\n",
" \n",
" # Add the sum of the current chunk to the total sum\n",
" total_sum += column_sum\n",
" print(num_chunk, total_sum, end = '\\r')\n",
" num_chunk +=1\n",
"\n",
"# After the loop, you have processed the entire dataset in chunks\n",
"print(\"Total sum of the specified column:\", total_sum)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2023-11-17T09:09:37.708886Z",
"start_time": "2023-11-17T09:09:37.676670Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Twitter ID | \n",
" Text | \n",
" Profile Image URL | \n",
" Day | \n",
" Hour | \n",
" Minute | \n",
" Created At | \n",
" Geo | \n",
" From User | \n",
" From User ID | \n",
" Language | \n",
" To User | \n",
" To User ID | \n",
" Source | \n",
"
\n",
" \n",
" \n",
" \n",
" 6600000 | \n",
" 170726983490211841 | \n",
" Stand Up Mr. US Business Man and take responsi... | \n",
" http://a2.twimg.com/profile_images/1752607483/... | \n",
" 2012-02-18 | \n",
" 4 | \n",
" 30 | \n",
" 2012-02-18 04:30:59 | \n",
" N; | \n",
" bentley_cat | \n",
" 463108759 | \n",
" en | \n",
" NaN | \n",
" 0 | \n",
" <a href="http://www.bestoftheinternets... | \n",
"
\n",
" \n",
" 6600001 | \n",
" 170727024841854976 | \n",
" RT @C0d3Fr0sty: MT( Link shortened) @Kaymee: I... | \n",
" http://a2.twimg.com/profile_images/1599465487/... | \n",
" 2012-02-18 | \n",
" 4 | \n",
" 31 | \n",
" 2012-02-18 04:31:09 | \n",
" N; | \n",
" marylouise996S | \n",
" 15380166 | \n",
" en | \n",
" NaN | \n",
" 0 | \n",
" <a href="http://www.tweetdeck.com"... | \n",
"
\n",
" \n",
" 6600002 | \n",
" 170727037370253312 | \n",
" China had an #ows before everyone else 1989 Ti... | \n",
" http://a0.twimg.com/profile_images/1302276340/... | \n",
" 2012-02-18 | \n",
" 4 | \n",
" 31 | \n",
" 2012-02-18 04:31:12 | \n",
" N; | \n",
" dfwlibrarian | \n",
" 17644162 | \n",
" en | \n",
" NaN | \n",
" 0 | \n",
" <a href="http://janetter.net/" re... | \n",
"
\n",
" \n",
" 6600003 | \n",
" 170727054361362433 | \n",
" Currency, Capital and Evolution: - http://t.co... | \n",
" http://a3.twimg.com/profile_images/1597982571/... | \n",
" 2012-02-18 | \n",
" 4 | \n",
" 31 | \n",
" 2012-02-18 04:31:16 | \n",
" N; | \n",
" OmniusManifesto | \n",
" 394061184 | \n",
" it | \n",
" NaN | \n",
" 0 | \n",
" <a href="http://www.socialoomph.com&qu... | \n",
"
\n",
" \n",
" 6600004 | \n",
" 170727082391900160 | \n",
" Our problems rise much more from Govts corrupt... | \n",
" http://a0.twimg.com/profile_images/1592676372/... | \n",
" 2012-02-18 | \n",
" 4 | \n",
" 31 | \n",
" 2012-02-18 04:31:23 | \n",
" N; | \n",
" IndyPolitico | \n",
" 73935439 | \n",
" en | \n",
" NaN | \n",
" 0 | \n",
" <a href="http://www.socialoomph.com&qu... | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 6602115 | \n",
" 170811007516672000 | \n",
" Man's knowledge makes another leap through the... | \n",
" http://a3.twimg.com/profile_images/1600926992/... | \n",
" 2012-02-18 | \n",
" 10 | \n",
" 4 | \n",
" 2012-02-18 10:04:52 | \n",
" N; | \n",
" darealmaozedong | \n",
" 395911020 | \n",
" en | \n",
" NaN | \n",
" 0 | \n",
" <a href="http://github.com/fons/cl-twi... | \n",
"
\n",
" \n",
" 6602116 | \n",
" 170811073648279552 | \n",
" When we give any president - one man - too muc... | \n",
" http://a2.twimg.com/profile_images/1603734590/... | \n",
" 2012-02-18 | \n",
" 10 | \n",
" 5 | \n",
" 2012-02-18 10:05:08 | \n",
" N; | \n",
" RonPaulsVoice | \n",
" 396995779 | \n",
" en | \n",
" NaN | \n",
" 0 | \n",
" <a href="http://twitter.com/RonPaulsVo... | \n",
"
\n",
" \n",
" 6602117 | \n",
" 170811301411553281 | \n",
" NYC forecast Tue 2/21/12: Partly cloudy. High ... | \n",
" http://a1.twimg.com/profile_images/1612658667/... | \n",
" 2012-02-18 | \n",
" 10 | \n",
" 6 | \n",
" 2012-02-18 10:06:02 | \n",
" N; | \n",
" OccupyWeather | \n",
" 400559295 | \n",
" en | \n",
" NaN | \n",
" 0 | \n",
" <a href="http://24Ahead.com/" rel... | \n",
"
\n",
" \n",
" 6602118 | \n",
" 170811326703206400 | \n",
" The moral promise of a free society involves t... | \n",
" http://a2.twimg.com/profile_images/1603734590/... | \n",
" 2012-02-18 | \n",
" 10 | \n",
" 6 | \n",
" 2012-02-18 10:06:08 | \n",
" N; | \n",
" RonPaulsVoice | \n",
" 396995779 | \n",
" en | \n",
" NaN | \n",
" 0 | \n",
" <a href="http://twitter.com/RonPaulsVo... | \n",
"
\n",
" \n",
" 6602119 | \n",
" 170811328037007360 | \n",
" RT @AnonOfTheAbove: RT @Apneac MT @MelMajik9: ... | \n",
" http://a3.twimg.com/profile_images/1600926992/... | \n",
" 2012-02-18 | \n",
" 10 | \n",
" 6 | \n",
" 2012-02-18 10:06:08 | \n",
" N; | \n",
" darealmaozedong | \n",
" 395911020 | \n",
" en | \n",
" NaN | \n",
" 0 | \n",
" <a href="http://twitterfeed.com" ... | \n",
"
\n",
" \n",
"
\n",
"
2120 rows × 14 columns
\n",
"
"
],
"text/plain": [
" Twitter ID \\\n",
"6600000 170726983490211841 \n",
"6600001 170727024841854976 \n",
"6600002 170727037370253312 \n",
"6600003 170727054361362433 \n",
"6600004 170727082391900160 \n",
"... ... \n",
"6602115 170811007516672000 \n",
"6602116 170811073648279552 \n",
"6602117 170811301411553281 \n",
"6602118 170811326703206400 \n",
"6602119 170811328037007360 \n",
"\n",
" Text \\\n",
"6600000 Stand Up Mr. US Business Man and take responsi... \n",
"6600001 RT @C0d3Fr0sty: MT( Link shortened) @Kaymee: I... \n",
"6600002 China had an #ows before everyone else 1989 Ti... \n",
"6600003 Currency, Capital and Evolution: - http://t.co... \n",
"6600004 Our problems rise much more from Govts corrupt... \n",
"... ... \n",
"6602115 Man's knowledge makes another leap through the... \n",
"6602116 When we give any president - one man - too muc... \n",
"6602117 NYC forecast Tue 2/21/12: Partly cloudy. High ... \n",
"6602118 The moral promise of a free society involves t... \n",
"6602119 RT @AnonOfTheAbove: RT @Apneac MT @MelMajik9: ... \n",
"\n",
" Profile Image URL Day Hour \\\n",
"6600000 http://a2.twimg.com/profile_images/1752607483/... 2012-02-18 4 \n",
"6600001 http://a2.twimg.com/profile_images/1599465487/... 2012-02-18 4 \n",
"6600002 http://a0.twimg.com/profile_images/1302276340/... 2012-02-18 4 \n",
"6600003 http://a3.twimg.com/profile_images/1597982571/... 2012-02-18 4 \n",
"6600004 http://a0.twimg.com/profile_images/1592676372/... 2012-02-18 4 \n",
"... ... ... ... \n",
"6602115 http://a3.twimg.com/profile_images/1600926992/... 2012-02-18 10 \n",
"6602116 http://a2.twimg.com/profile_images/1603734590/... 2012-02-18 10 \n",
"6602117 http://a1.twimg.com/profile_images/1612658667/... 2012-02-18 10 \n",
"6602118 http://a2.twimg.com/profile_images/1603734590/... 2012-02-18 10 \n",
"6602119 http://a3.twimg.com/profile_images/1600926992/... 2012-02-18 10 \n",
"\n",
" Minute Created At Geo From User From User ID \\\n",
"6600000 30 2012-02-18 04:30:59 N; bentley_cat 463108759 \n",
"6600001 31 2012-02-18 04:31:09 N; marylouise996S 15380166 \n",
"6600002 31 2012-02-18 04:31:12 N; dfwlibrarian 17644162 \n",
"6600003 31 2012-02-18 04:31:16 N; OmniusManifesto 394061184 \n",
"6600004 31 2012-02-18 04:31:23 N; IndyPolitico 73935439 \n",
"... ... ... .. ... ... \n",
"6602115 4 2012-02-18 10:04:52 N; darealmaozedong 395911020 \n",
"6602116 5 2012-02-18 10:05:08 N; RonPaulsVoice 396995779 \n",
"6602117 6 2012-02-18 10:06:02 N; OccupyWeather 400559295 \n",
"6602118 6 2012-02-18 10:06:08 N; RonPaulsVoice 396995779 \n",
"6602119 6 2012-02-18 10:06:08 N; darealmaozedong 395911020 \n",
"\n",
" Language To User To User ID \\\n",
"6600000 en NaN 0 \n",
"6600001 en NaN 0 \n",
"6600002 en NaN 0 \n",
"6600003 it NaN 0 \n",
"6600004 en NaN 0 \n",
"... ... ... ... \n",
"6602115 en NaN 0 \n",
"6602116 en NaN 0 \n",
"6602117 en NaN 0 \n",
"6602118 en NaN 0 \n",
"6602119 en NaN 0 \n",
"\n",
" Source \n",
"6600000 <a href="http://www.bestoftheinternets... \n",
"6600001 <a href="http://www.tweetdeck.com"... \n",
"6600002 <a href="http://janetter.net/" re... \n",
"6600003 <a href="http://www.socialoomph.com&qu... \n",
"6600004 <a href="http://www.socialoomph.com&qu... \n",
"... ... \n",
"6602115 <a href="http://github.com/fons/cl-twi... \n",
"6602116 <a href="http://twitter.com/RonPaulsVo... \n",
"6602117 <a href="http://24Ahead.com/" rel... \n",
"6602118 <a href="http://twitter.com/RonPaulsVo... \n",
"6602119 <a href="http://twitterfeed.com" ... \n",
"\n",
"[2120 rows x 14 columns]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chunk"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"ExecuteTime": {
"end_time": "2023-11-17T06:46:45.218279Z",
"start_time": "2023-11-17T06:46:45.212404Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"78397"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(cops_data)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"ExecuteTime": {
"end_time": "2023-11-17T06:46:52.916220Z",
"start_time": "2023-11-17T06:46:52.908455Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 170734732877893632\n",
"1 RT @DiceyTroop: When I got here, cops were has...\n",
"2 http://a2.twimg.com/profile_images/1753747297/...\n",
"3 2012-02-18\n",
"4 5\n",
"5 1\n",
"6 2012-02-18 05:01:47\n",
"7 N;\n",
"8 shushugah\n",
"9 28624302\n",
"10 en\n",
"11 NaN\n",
"12 0\n",
"13 <a href="http://twitter.com/#!/downloa...\n",
"Name: 6600282, dtype: object"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat(dat, ignore_index=True)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"ExecuteTime": {
"end_time": "2023-11-17T06:47:29.967729Z",
"start_time": "2023-11-17T06:47:28.371862Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Twitter ID | \n",
" Text | \n",
" Profile Image URL | \n",
" Day | \n",
" Hour | \n",
" Minute | \n",
" Created At | \n",
" Geo | \n",
" From User | \n",
" From User ID | \n",
" Language | \n",
" To User | \n",
" To User ID | \n",
" Source | \n",
"
\n",
" \n",
" \n",
" \n",
" 57 | \n",
" 121813549478707200 | \n",
" RT @kittylight: Dear #cops THE WHOLE WORLD IS ... | \n",
" http://a2.twimg.com/profile_images/1146887237/... | \n",
" 2011-10-06 | \n",
" 5 | \n",
" 6 | \n",
" 2011-10-06 05:06:28 | \n",
" N; | \n",
" dove_hawk | \n",
" 361839281 | \n",
" en | \n",
" NaN | \n",
" 0 | \n",
" <a href="http://twitter.com/#!/downloa... | \n",
"
\n",
" \n",
" 95 | \n",
" 121813722099482624 | \n",
" The whiny, sanctimonious drivel coming out of ... | \n",
" http://a3.twimg.com/profile_images/1573938172/... | \n",
" 2011-10-06 | \n",
" 5 | \n",
" 7 | \n",
" 2011-10-06 05:07:09 | \n",
" N; | \n",
" wryson | \n",
" 351681669 | \n",
" en | \n",
" NaN | \n",
" 0 | \n",
" <a href="http://twitter.com/">... | \n",
"
\n",
" \n",
" 98 | \n",
" 121813748003508224 | \n",
" RT @KeithOlbermann: Again NYPD supervisors do ... | \n",
" http://a1.twimg.com/profile_images/509909348/t... | \n",
" 2011-10-06 | \n",
" 5 | \n",
" 7 | \n",
" 2011-10-06 05:07:15 | \n",
" N; | \n",
" dannydoodar | \n",
" 76258793 | \n",
" en | \n",
" NaN | \n",
" 0 | \n",
" <a href="http://stone.com/Twittelator&... | \n",
"
\n",
" \n",
" 267 | \n",
" 121814376234754049 | \n",
" RT @kittylight: #isad #stayhungry #ThinkDiffer... | \n",
" http://a2.twimg.com/profile_images/1540184395/... | \n",
" 2011-10-06 | \n",
" 5 | \n",
" 9 | \n",
" 2011-10-06 05:09:45 | \n",
" N; | \n",
" kittylightsCat | \n",
" 406361898 | \n",
" en | \n",
" NaN | \n",
" 0 | \n",
" <a href="http://twitter.com/#!/downloa... | \n",
"
\n",
" \n",
" 278 | \n",
" 121814402025533440 | \n",
" RT @kittylight: Dear #cops THE WHOLE WORLD IS ... | \n",
" http://a2.twimg.com/profile_images/1540184395/... | \n",
" 2011-10-06 | \n",
" 5 | \n",
" 9 | \n",
" 2011-10-06 05:09:51 | \n",
" N; | \n",
" kittylightsCat | \n",
" 406361898 | \n",
" en | \n",
" NaN | \n",
" 0 | \n",
" <a href="http://twitter.com/#!/downloa... | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Twitter ID Text \\\n",
"57 121813549478707200 RT @kittylight: Dear #cops THE WHOLE WORLD IS ... \n",
"95 121813722099482624 The whiny, sanctimonious drivel coming out of ... \n",
"98 121813748003508224 RT @KeithOlbermann: Again NYPD supervisors do ... \n",
"267 121814376234754049 RT @kittylight: #isad #stayhungry #ThinkDiffer... \n",
"278 121814402025533440 RT @kittylight: Dear #cops THE WHOLE WORLD IS ... \n",
"\n",
" Profile Image URL Day Hour \\\n",
"57 http://a2.twimg.com/profile_images/1146887237/... 2011-10-06 5 \n",
"95 http://a3.twimg.com/profile_images/1573938172/... 2011-10-06 5 \n",
"98 http://a1.twimg.com/profile_images/509909348/t... 2011-10-06 5 \n",
"267 http://a2.twimg.com/profile_images/1540184395/... 2011-10-06 5 \n",
"278 http://a2.twimg.com/profile_images/1540184395/... 2011-10-06 5 \n",
"\n",
" Minute Created At Geo From User From User ID Language \\\n",
"57 6 2011-10-06 05:06:28 N; dove_hawk 361839281 en \n",
"95 7 2011-10-06 05:07:09 N; wryson 351681669 en \n",
"98 7 2011-10-06 05:07:15 N; dannydoodar 76258793 en \n",
"267 9 2011-10-06 05:09:45 N; kittylightsCat 406361898 en \n",
"278 9 2011-10-06 05:09:51 N; kittylightsCat 406361898 en \n",
"\n",
" To User To User ID Source \n",
"57 NaN 0 <a href="http://twitter.com/#!/downloa... \n",
"95 NaN 0 <a href="http://twitter.com/">... \n",
"98 NaN 0 <a href="http://stone.com/Twittelator&... \n",
"267 NaN 0 <a href="http://twitter.com/#!/downloa... \n",
"278 NaN 0 <a href="http://twitter.com/#!/downloa... "
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame.from_dict(cops_data)\n",
"df.head()\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"![image.png](./images/end.png)"
]
}
],
"metadata": {
"celltoolbar": "幻灯片",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autoclose": false,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 0,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": false,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {
"height": "1260px",
"left": "1602px",
"top": "-352px",
"width": "159.359px"
},
"toc_section_display": false,
"toc_window_display": true
}
},
"nbformat": 4,
"nbformat_minor": 1
}