Preprocessing Big Data#
Using the Occupy Wall Street Twitter data as an example
Byte (/baɪt/)#
A byte is the unit that computing and information technology use to measure storage capacity; under normal circumstances one byte equals eight bits. [1] The term also names a data type and character unit in some programming languages. The common decimal units are listed below, with a quick numeric check right after them.
1 B (byte) = 8 bit;
1 KB = 1000 B; 1 MB = 1000 KB = 1000 × 1000 B, where 1000 = 10^3.
1 KB (kilobyte) = 1000 B = 10^3 B;
1 MB (megabyte, one million bytes) = 1000 KB = 10^6 B;
1 GB (gigabyte, one billion bytes) = 1000 MB = 10^9 B;
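As a quick check on these units, Python can report a file's size in bytes, which we then convert by powers of 10^3. A minimal sketch using the OWS dump analyzed later in this chapter (any local file path works here):

import os

# size of the raw OWS tweet file in bytes; any local file path works here
size_in_bytes = os.path.getsize('/Users/datalab/bigdata/cjc/ows-raw.txt')
print(size_in_bytes / 10**6, 'MB')  # megabytes
print(size_in_bytes / 10**9, 'GB')  # gigabytes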
Reading and Processing Data in Chunks#
Lazy Method for Reading Big File in Python?
from time import sleep
# import sys
# flush print
# def flushPrint(d):
#     sys.stdout.write('\r')
#     sys.stdout.write(str(d))
#     sys.stdout.flush()
for i in range(10):
    sleep(1)
    print(i, end='\r')
9
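For reference, the commented-out flushPrint above does the same progress display with sys.stdout directly; a runnable version of that sketch:

import sys
from time import sleep

def flushPrint(d):
    sys.stdout.write('\r')     # move back to the start of the line
    sys.stdout.write(str(d))   # overwrite the previous value
    sys.stdout.flush()         # force the terminal to update immediately

for i in range(10):
    sleep(1)
    flushPrint(i)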
# read the data line by line
line_num = 0
cops_num = 0
# Windows users may need to add encoding='utf8' to the following line.
with open('/Users/datalab/bigdata/cjc/ows-raw.txt', 'r') as f:
    for i in f:
        line_num += 1
        if 'cops' in i:
            cops_num += 1
        if line_num % 100000 == 0:
            print(line_num, end='\r')
6900000
line_num
6911408
cops_num/line_num
0.011413448605551865
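About 1.1% of the lines mention 'cops'. The same single pass can count several keywords at once; a small sketch along the lines of the loop above (the keyword list is purely illustrative):

# count several keywords in one pass over the file
keywords = ['cops', 'police', 'occupy']  # illustrative keyword list
counts = {w: 0 for w in keywords}
with open('/Users/datalab/bigdata/cjc/ows-raw.txt', 'r') as f:
    for line in f:
        for w in keywords:
            if w in line:
                counts[w] += 1
print(counts)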
bigfile = open('/Users/datalab/bigdata/cjc/ows-raw.txt', 'r')
chunkSize = 1000000
chunk = bigfile.readlines(chunkSize)
print(len(chunk))
# with open("../data/ows_tweets_sample.txt", 'w') as f:
#     for i in chunk:
#         f.write(i)
2754
bigfile.readlines?
5%5
0
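The Stack Overflow thread referenced in the next cell suggests an even lazier pattern: a generator that yields fixed-size pieces of the file, so only one piece lives in memory at a time. A minimal sketch of that idea (read_in_chunks is our own helper name, not a library function):

def read_in_chunks(file_object, chunk_size=1024 * 1024):
    """Lazily yield pieces of a file, one chunk (default 1 MB) at a time."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

with open('/Users/datalab/bigdata/cjc/ows-raw.txt', 'r') as f:
    for piece in read_in_chunks(f):
        pass  # process each piece here, e.g., count keyword occurrences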
# https://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python?lq=1
import csv
bigfile = open('/Users/datalab/bigdata/cjc/ows-raw.txt', 'r')
chunkSize = 10**8
chunk = bigfile.readlines(chunkSize)
num_chunk, num_lines, num_cops = 0, 0, 0
while chunk:
    # materialize the csv reader so we can both count and scan the rows
    lines = list(csv.reader((line.replace('\x00', '') for line in chunk),
                            delimiter=',', quotechar='"'))
    num_lines += len(lines)
    for i in lines:
        # i is a list of fields; look for 'cops' inside any field
        if any('cops' in field for field in i):
            num_cops += 1
    if num_chunk % 5 == 0:
        print(num_chunk, num_lines, end='\r')
    num_chunk += 1
    chunk = bigfile.readlines(chunkSize)  # read another chunk
bigfile.close()
25 6602141
Using Pandas get_chunk to Process Hundreds of Millions of Records#
Only at scales beyond 5 TB of data does Hadoop become a reasonable technical choice.
import pandas as pd
f = open('/Users/datalab/bigdata/cjc/ows-raw.txt', encoding='utf-8')
reader = pd.read_table(f, sep=',', quotechar='"', iterator=True, on_bad_lines='skip')  # skip malformed lines
chunkSize = 100000
chunk = reader.get_chunk(chunkSize)
len(chunk)
# pd.read_table?
100000
chunk.head()
Twitter ID | Text | Profile Image URL | Day | Hour | Minute | Created At | Geo | From User | From User ID | Language | To User | To User ID | Source | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 121813144174727168 | RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLIN... | http://a2.twimg.com/profile_images/1539375713/... | 2011-10-06 | 5 | 4 | 2011-10-06 05:04:51 | N; | Anonops_Cop | 401240477 | en | NaN | 0 | <a href="http://twitter.com/">... |
1 | 121813146137657344 | @jamiekilstein @allisonkilkenny Interesting in... | http://a2.twimg.com/profile_images/1574715503/... | 2011-10-06 | 5 | 4 | 2011-10-06 05:04:51 | N; | KittyHybrid | 34532053 | en | jamiekilstein | 2149053 | <a href="http://twitter.com/">... |
2 | 121813150000619521 | @Seductivpancake Right! Those guys have a vict... | http://a1.twimg.com/profile_images/1241412831/... | 2011-10-06 | 5 | 4 | 2011-10-06 05:04:52 | N; | nerdsherpa | 95067344 | en | Seductivpancake | 19695580 | <a href="http://www.echofon.com/"... |
3 | 121813150701072385 | RT @bembel "Occupy Wall Street" als ... | http://a0.twimg.com/profile_images/1106399092/... | 2011-10-06 | 5 | 4 | 2011-10-06 05:04:52 | N; | hamudistan | 35862923 | en | NaN | 0 | <a href="http://levelupstudio.com"... |
4 | 121813163778899968 | #ows White shirt= Brown shirt. | http://a2.twimg.com/profile_images/1568117871/... | 2011-10-06 | 5 | 4 | 2011-10-06 05:04:56 | N; | kl_knox | 419580636 | en | NaN | 0 | <a href="http://twitter.com/">... |
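Before writing the full filtering loop, it can help to poke around in a single chunk first. A few illustrative probes (the column names are taken from the header above):

print(chunk.shape)                              # rows and columns in this chunk
print(chunk['Language'].value_counts().head())  # most common tweet languages
print(chunk['Created At'].min(), chunk['Created At'].max())  # time span covered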
import pandas as pd
f = open('/Users/datalab/bigdata/cjc/ows-raw.txt', encoding='utf-8')
reader = pd.read_table(f, sep=',', quotechar='"',
                       iterator=True, on_bad_lines='skip')  # skip malformed lines
chunkSize = 100000
loop = True
cops_data = []
num_chunk, num_lines = 0, 0
while loop:
    try:
        chunk = reader.get_chunk(chunkSize)
        # dat = data_cleaning_function(chunk)  # do sth., e.g., keep rows mentioning 'cops'
        dat = [chunk.loc[k] for k in chunk.index if 'cops' in str(chunk['Text'][k])]
        num_lines += len(chunk)
        print(num_chunk, num_lines, end='\r')
        num_chunk += 1
        for d in dat:
            cops_data.append(d)
    except StopIteration:
        loop = False
        print("Iteration is stopped.")
# df = pd.concat(data, ignore_index=True)
Iteration is stopped.
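The list comprehension above checks the Text column row by row. pandas can also apply this filter to a whole chunk at once with boolean indexing; a sketch of that variant, using read_csv's chunksize argument (which the next cell also relies on):

import pandas as pd

reader = pd.read_csv('/Users/datalab/bigdata/cjc/ows-raw.txt', sep=',', quotechar='"',
                     chunksize=100000, on_bad_lines='skip')
cops_chunks = []
for chunk in reader:
    # keep only the rows whose Text mentions 'cops'
    hits = chunk[chunk['Text'].astype(str).str.contains('cops')]
    cops_chunks.append(hits)
cops_df = pd.concat(cops_chunks, ignore_index=True)
len(cops_df)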
# ChatGPT told me this is even simpler!!
file_path = '/Users/datalab/bigdata/cjc/ows-raw.txt'
# Specify the chunk size (number of rows to read at a time)
chunk_size = 100000
# Create a dataframe reader object
chunk_reader = pd.read_csv(file_path, sep=',', quotechar='"',
                           iterator=True, on_bad_lines='skip', chunksize=chunk_size)
# Initialize a variable to store the total sum
total_sum = 0
num_chunk = 0
# Iterate over chunks
for chunk in chunk_reader:
    # Process the chunk as needed
    # For example, calculate the sum of a specific column
    column_sum = len(chunk['Text'])
    # Add the sum of the current chunk to the total sum
    total_sum += column_sum
    print(num_chunk, total_sum, end='\r')
    num_chunk += 1
# After the loop, you have processed the entire dataset in chunks
print("Total sum of the specified column:", total_sum)
Total sum of the specified column: 6602120
chunk
Twitter ID | Text | Profile Image URL | Day | Hour | Minute | Created At | Geo | From User | From User ID | Language | To User | To User ID | Source | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6600000 | 170726983490211841 | Stand Up Mr. US Business Man and take responsi... | http://a2.twimg.com/profile_images/1752607483/... | 2012-02-18 | 4 | 30 | 2012-02-18 04:30:59 | N; | bentley_cat | 463108759 | en | NaN | 0 | <a href="http://www.bestoftheinternets... |
6600001 | 170727024841854976 | RT @C0d3Fr0sty: MT( Link shortened) @Kaymee: I... | http://a2.twimg.com/profile_images/1599465487/... | 2012-02-18 | 4 | 31 | 2012-02-18 04:31:09 | N; | marylouise996S | 15380166 | en | NaN | 0 | <a href="http://www.tweetdeck.com"... |
6600002 | 170727037370253312 | China had an #ows before everyone else 1989 Ti... | http://a0.twimg.com/profile_images/1302276340/... | 2012-02-18 | 4 | 31 | 2012-02-18 04:31:12 | N; | dfwlibrarian | 17644162 | en | NaN | 0 | <a href="http://janetter.net/" re... |
6600003 | 170727054361362433 | Currency, Capital and Evolution: - http://t.co... | http://a3.twimg.com/profile_images/1597982571/... | 2012-02-18 | 4 | 31 | 2012-02-18 04:31:16 | N; | OmniusManifesto | 394061184 | it | NaN | 0 | <a href="http://www.socialoomph.com&qu... |
6600004 | 170727082391900160 | Our problems rise much more from Govts corrupt... | http://a0.twimg.com/profile_images/1592676372/... | 2012-02-18 | 4 | 31 | 2012-02-18 04:31:23 | N; | IndyPolitico | 73935439 | en | NaN | 0 | <a href="http://www.socialoomph.com&qu... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6602115 | 170811007516672000 | Man's knowledge makes another leap through the... | http://a3.twimg.com/profile_images/1600926992/... | 2012-02-18 | 10 | 4 | 2012-02-18 10:04:52 | N; | darealmaozedong | 395911020 | en | NaN | 0 | <a href="http://github.com/fons/cl-twi... |
6602116 | 170811073648279552 | When we give any president - one man - too muc... | http://a2.twimg.com/profile_images/1603734590/... | 2012-02-18 | 10 | 5 | 2012-02-18 10:05:08 | N; | RonPaulsVoice | 396995779 | en | NaN | 0 | <a href="http://twitter.com/RonPaulsVo... |
6602117 | 170811301411553281 | NYC forecast Tue 2/21/12: Partly cloudy. High ... | http://a1.twimg.com/profile_images/1612658667/... | 2012-02-18 | 10 | 6 | 2012-02-18 10:06:02 | N; | OccupyWeather | 400559295 | en | NaN | 0 | <a href="http://24Ahead.com/" rel... |
6602118 | 170811326703206400 | The moral promise of a free society involves t... | http://a2.twimg.com/profile_images/1603734590/... | 2012-02-18 | 10 | 6 | 2012-02-18 10:06:08 | N; | RonPaulsVoice | 396995779 | en | NaN | 0 | <a href="http://twitter.com/RonPaulsVo... |
6602119 | 170811328037007360 | RT @AnonOfTheAbove: RT @Apneac MT @MelMajik9: ... | http://a3.twimg.com/profile_images/1600926992/... | 2012-02-18 | 10 | 6 | 2012-02-18 10:06:08 | N; | darealmaozedong | 395911020 | en | NaN | 0 | <a href="http://twitterfeed.com" ... |
2120 rows × 14 columns
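Counting rows is the simplest per-chunk statistic; the same pattern works for other aggregates by combining partial results. A sketch that tallies tweets per day across chunks (the Day column appears in the table above; add with fill_value merges the per-chunk counts):

import pandas as pd

file_path = '/Users/datalab/bigdata/cjc/ows-raw.txt'
daily_counts = pd.Series(dtype='int64')
for chunk in pd.read_csv(file_path, sep=',', quotechar='"',
                         chunksize=100000, on_bad_lines='skip'):
    part = chunk['Day'].value_counts()                    # tweets per day in this chunk
    daily_counts = daily_counts.add(part, fill_value=0)   # merge into the running total
print(daily_counts.sort_index().head())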
len(cops_data)
78397
pd.concat(dat, ignore_index=True)
0 170734732877893632
1 RT @DiceyTroop: When I got here, cops were has...
2 http://a2.twimg.com/profile_images/1753747297/...
3 2012-02-18
4 5
5 1
6 2012-02-18 05:01:47
7 N;
8 shushugah
9 28624302
10 en
11 NaN
12 0
13 <a href="http://twitter.com/#!/downloa...
Name: 6600282, dtype: object
df = pd.DataFrame.from_dict(cops_data)
df.head()
Twitter ID | Text | Profile Image URL | Day | Hour | Minute | Created At | Geo | From User | From User ID | Language | To User | To User ID | Source | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
57 | 121813549478707200 | RT @kittylight: Dear #cops THE WHOLE WORLD IS ... | http://a2.twimg.com/profile_images/1146887237/... | 2011-10-06 | 5 | 6 | 2011-10-06 05:06:28 | N; | dove_hawk | 361839281 | en | NaN | 0 | <a href="http://twitter.com/#!/downloa... |
95 | 121813722099482624 | The whiny, sanctimonious drivel coming out of ... | http://a3.twimg.com/profile_images/1573938172/... | 2011-10-06 | 5 | 7 | 2011-10-06 05:07:09 | N; | wryson | 351681669 | en | NaN | 0 | <a href="http://twitter.com/">... |
98 | 121813748003508224 | RT @KeithOlbermann: Again NYPD supervisors do ... | http://a1.twimg.com/profile_images/509909348/t... | 2011-10-06 | 5 | 7 | 2011-10-06 05:07:15 | N; | dannydoodar | 76258793 | en | NaN | 0 | <a href="http://stone.com/Twittelator&... |
267 | 121814376234754049 | RT @kittylight: #isad #stayhungry #ThinkDiffer... | http://a2.twimg.com/profile_images/1540184395/... | 2011-10-06 | 5 | 9 | 2011-10-06 05:09:45 | N; | kittylightsCat | 406361898 | en | NaN | 0 | <a href="http://twitter.com/#!/downloa... |
278 | 121814402025533440 | RT @kittylight: Dear #cops THE WHOLE WORLD IS ... | http://a2.twimg.com/profile_images/1540184395/... | 2011-10-06 | 5 | 9 | 2011-10-06 05:09:51 | N; | kittylightsCat | 406361898 | en | NaN | 0 | <a href="http://twitter.com/#!/downloa... |
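Once the matching rows are collected, a natural last step is to save them so later analysis does not need to rescan the 6.9-million-line raw file. A small sketch (the output file name is just an example):

# cops_data is the list of matching rows built in the get_chunk loop above
df = pd.DataFrame(cops_data)
df.to_csv('ows_cops_tweets.csv', index=False)  # example output path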