Preprocessing Big Data#

Using the Occupy Wall Street Twitter data as an example


Byte (/baɪt/)#

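A byte is a unit used in computing to measure storage capacity; one byte usually equals eight bits [1]. The term also denotes a data type, and a character of text, in some programming languages.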

  • 1 B (byte) = 8 bits;

  • 1 KB = 1000 B; 1 MB = 1000 KB = 1000 × 1000 B, where 1000 = 10^3;

  • 1 KB (kilobyte) = 1000 B = 10^3 B;

  • 1 MB (megabyte, one million bytes) = 1000 KB = 10^6 B;

  • 1 GB (gigabyte, one billion bytes) = 1000 MB = 10^9 B.
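
Before deciding how to read a file, it helps to check how large it is. The following is a minimal sketch, assuming the decimal units listed above; the helper name `human_size` and the temporary demo file are hypothetical and stand in for the real data file.

```python
import os
import tempfile

# Decimal (SI) units, matching the definitions above: 1 KB = 10**3 B, and so on.
UNITS = [("GB", 10**9), ("MB", 10**6), ("KB", 10**3), ("B", 1)]

def human_size(num_bytes):
    """Format a byte count using the largest decimal unit that fits."""
    for name, factor in UNITS:
        if num_bytes >= factor:
            return f"{num_bytes / factor:.2f} {name}"
    return "0.00 B"

# Hypothetical demo file of exactly 2,500 bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 2500)
    demo_path = tmp.name

size_in_bytes = os.path.getsize(demo_path)  # file size in bytes
print(size_in_bytes, "bytes =", human_size(size_in_bytes))  # 2500 bytes = 2.50 KB
os.remove(demo_path)
```

Calling `os.path.getsize` on the real data file in the same way shows whether it fits comfortably in memory or should be read in pieces.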

Reading and Processing Data in Chunks#

Lazy Method for Reading Big File in Python?

from time import sleep
# import sys

# flush print
# def flushPrint(d):
#     sys.stdout.write('\r')
#     sys.stdout.write(str(d))
#     sys.stdout.flush()
for i in range(10): 
    sleep(1)
    print(i, end= '\r')
9
# Read the data line by line
line_num = 0
cops_num = 0
# Windows users may need to add encoding='utf8' to the following line.
with open('/Users/datalab/bigdata/cjc/ows-raw.txt', 'r') as f:
    for i in f:
        line_num += 1
        if 'cops' in i:
            cops_num += 1
        if line_num % 100000 ==0:
            print(line_num, end='\r')
6900000
line_num
6911408
cops_num/line_num
0.011413448605551865
bigfile = open('/Users/datalab/bigdata/cjc/ows-raw.txt', 'r')
chunkSize = 1000000
chunk = bigfile.readlines(chunkSize)
print(len(chunk))
# with open("../data/ows_tweets_sample.txt", 'w') as f:
#     for i in chunk:
#         f.write(i)  
2754
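
The argument passed to `readlines` is a size hint, roughly the number of characters or bytes to read, not a number of lines. That is why a hint of 1,000,000 returned only a few thousand lines above. The sketch below illustrates this on a small, hypothetical temporary file; the file contents and the hint value of 95 are made up for illustration.

```python
import os
import tempfile

# Build a small hypothetical file: 1,000 lines of 10 characters each (newline included).
with tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt",
                                 encoding="utf-8") as tmp:
    for i in range(1000):
        tmp.write(f"{i:09d}\n")
    demo_path = tmp.name

with open(demo_path, "r", encoding="utf-8") as f:
    first = f.readlines(95)   # stops once the lines read so far add up to about 95 characters
    second = f.readlines(95)  # continues from where the first call stopped
    rest = f.readlines()      # no hint: read everything that is left

os.remove(demo_path)

# Each hinted call returns only a handful of lines, far fewer than 95.
print(len(first), len(second), len(rest))
```

Because each call resumes where the previous one stopped, repeated calls walk through the whole file without ever holding all of it in memory.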
bigfile.readlines?
5%5
0
# https://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python?lq=1
import csv
bigfile = open('/Users/datalab/bigdata/cjc/ows-raw.txt', 'r')
chunkSize = 10**8
chunk = bigfile.readlines(chunkSize)
num_chunk, num_lines, num_cops = 0, 0, 0
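while chunk:
    lines = csv.reader((line.replace('\x00','') for line in chunk), 
                       delimiter=',', quotechar='"')
    # csv.reader is a one-shot iterator, so materialize the rows once
    rows = list(lines)
    num_lines += len(rows)
    for row in rows:
        # each row is a list of fields; look for the substring in any field
        if any('cops' in field for field in row):
            num_cops += 1
    if num_chunk % 5 == 0:
        print(num_chunk, num_lines, end = '\r')
    num_chunk += 1
    chunk = bigfile.readlines(chunkSize) # read another chunk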
25 6602141
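
The read-then-loop pattern above can also be wrapped in a generator, which is the idea behind the "lazy method" question linked earlier. This is a minimal sketch rather than the code from that thread: the function name `read_in_chunks` is hypothetical, and an in-memory `io.StringIO` object stands in for the large file.

```python
import io

def read_in_chunks(file_obj, size_hint=10**8):
    """Yield lists of lines; each list holds roughly `size_hint` characters."""
    while True:
        chunk = file_obj.readlines(size_hint)
        if not chunk:  # an empty list means end of file
            break
        yield chunk

# Hypothetical in-memory stand-in for a large text file (20 short lines).
demo_file = io.StringIO("".join(
    f"tweet {i} about cops\n" if i % 4 == 0 else f"tweet {i}\n"
    for i in range(20)
))

num_lines, num_cops = 0, 0
for chunk in read_in_chunks(demo_file, size_hint=50):
    num_lines += len(chunk)
    # count the lines in this chunk that contain the substring 'cops'
    num_cops += sum('cops' in line for line in chunk)

print(num_lines, num_cops)  # 20 5
```

The generator keeps the chunking logic in one place, so the processing loop only has to deal with one list of lines at a time.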

Processing Data at the Hundred-Million-Row Scale with Pandas get_chunk#

Hadoop is a reasonable technology choice only once the data volume exceeds 5 TB.

import pandas as pd

f = open('/Users/datalab/bigdata/cjc/ows-raw.txt',encoding='utf-8')
reader = pd.read_table(f,  sep=',',  quotechar='"', iterator=True, on_bad_lines='skip')  # skip malformed lines
chunkSize = 100000
chunk = reader.get_chunk(chunkSize)
len(chunk)

#pd.read_table?
100000
chunk.head()
Twitter ID Text Profile Image URL Day Hour Minute Created At Geo From User From User ID Language To User To User ID Source
0 121813144174727168 RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLIN... http://a2.twimg.com/profile_images/1539375713/... 2011-10-06 5 4 2011-10-06 05:04:51 N; Anonops_Cop 401240477 en NaN 0 <a href="http://twitter.com/">...
1 121813146137657344 @jamiekilstein @allisonkilkenny Interesting in... http://a2.twimg.com/profile_images/1574715503/... 2011-10-06 5 4 2011-10-06 05:04:51 N; KittyHybrid 34532053 en jamiekilstein 2149053 <a href="http://twitter.com/">...
2 121813150000619521 @Seductivpancake Right! Those guys have a vict... http://a1.twimg.com/profile_images/1241412831/... 2011-10-06 5 4 2011-10-06 05:04:52 N; nerdsherpa 95067344 en Seductivpancake 19695580 <a href="http://www.echofon.com/"...
3 121813150701072385 RT @bembel "Occupy Wall Street" als ... http://a0.twimg.com/profile_images/1106399092/... 2011-10-06 5 4 2011-10-06 05:04:52 N; hamudistan 35862923 en NaN 0 <a href="http://levelupstudio.com&quot...
4 121813163778899968 #ows White shirt= Brown shirt. http://a2.twimg.com/profile_images/1568117871/... 2011-10-06 5 4 2011-10-06 05:04:56 N; kl_knox 419580636 en NaN 0 <a href="http://twitter.com/">...
import pandas as pd

f = open('/Users/datalab/bigdata/cjc/ows-raw.txt',encoding='utf-8')
reader = pd.read_table(f,  sep=',',  quotechar='"', 
                       iterator=True, on_bad_lines='skip')  # skip malformed lines
chunkSize = 100000
loop = True
cops_data = []
num_chunk, num_lines = 0, 0
while loop:
    try:
        chunk = reader.get_chunk(chunkSize)
        # dat = data_cleaning_function(chunk)  # do sth., e.g., keep rows whose text mentions cops
        dat=[chunk.loc[k] for k in chunk.index if 'cops' in str(chunk['Text'][k]) ]
        num_lines += len(chunk)
        print(num_chunk, num_lines, end = '\r')
        num_chunk +=1
        for d in dat:
            cops_data.append(d) 
    except StopIteration:
        loop = False
        print("Iteration is stopped.")
#df = pd.concat(data, ignore_index=True)
Iteration is stopped.
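
The row-by-row list comprehension above works, but filtering each chunk with a boolean mask is usually faster and yields DataFrames that can be concatenated directly. The following is a sketch under assumptions, not a replacement for the cell above: the miniature CSV, the chunk size of 2, and the names `matches` and `cops_df` are hypothetical, and only the `Text` column name is taken from the real data.

```python
import io
import pandas as pd

# Hypothetical miniature of the tweet file, with the same 'Text' column name.
demo_csv = io.StringIO(
    'Twitter ID,Text\n'
    '1,"RT: cops are kettling"\n'
    '2,"#ows white shirt"\n'
    '3,"Dear #cops, the whole world is watching"\n'
    '4,\n'
    '5,"no match here"\n'
)

matches = []
for chunk in pd.read_csv(demo_csv, sep=',', quotechar='"', chunksize=2):
    # astype(str) mirrors the str(...) call used above, so missing text is handled too
    mask = chunk['Text'].astype(str).str.contains('cops', regex=False)
    if mask.any():  # keep only chunks that actually contain matching rows
        matches.append(chunk[mask])

# Combine the matching rows from all chunks into a single DataFrame.
cops_df = pd.concat(matches, ignore_index=True)
print(len(cops_df))  # 2
```

Only the matching rows are kept in memory, so the combined result stays small even when the source file does not.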
# ChatGPT suggested that this approach is simpler!
file_path = '/Users/datalab/bigdata/cjc/ows-raw.txt'

# Specify the chunk size (number of rows to read at a time)
chunk_size = 100000

# Create a dataframe reader object
chunk_reader = pd.read_csv(file_path, sep=',',  quotechar='"', 
                        iterator=True, on_bad_lines='skip', chunksize=chunk_size)

# Running count of rows processed so far
total_rows = 0
num_chunk = 0
# Iterate over chunks
for chunk in chunk_reader:
    # Process the chunk as needed
    # Here we simply count the rows in the 'Text' column
    rows_in_chunk = len(chunk['Text'])
    
    # Add the current chunk's row count to the running total
    total_rows += rows_in_chunk
    print(num_chunk, total_rows, end = '\r')
    num_chunk +=1

# After the loop, you have processed the entire dataset in chunks
print("Total number of rows:", total_rows)
Total number of rows: 6602120
chunk
Twitter ID Text Profile Image URL Day Hour Minute Created At Geo From User From User ID Language To User To User ID Source
6600000 170726983490211841 Stand Up Mr. US Business Man and take responsi... http://a2.twimg.com/profile_images/1752607483/... 2012-02-18 4 30 2012-02-18 04:30:59 N; bentley_cat 463108759 en NaN 0 <a href="http://www.bestoftheinternets...
6600001 170727024841854976 RT @C0d3Fr0sty: MT( Link shortened) @Kaymee: I... http://a2.twimg.com/profile_images/1599465487/... 2012-02-18 4 31 2012-02-18 04:31:09 N; marylouise996S 15380166 en NaN 0 <a href="http://www.tweetdeck.com&quot...
6600002 170727037370253312 China had an #ows before everyone else 1989 Ti... http://a0.twimg.com/profile_images/1302276340/... 2012-02-18 4 31 2012-02-18 04:31:12 N; dfwlibrarian 17644162 en NaN 0 <a href="http://janetter.net/" re...
6600003 170727054361362433 Currency, Capital and Evolution: - http://t.co... http://a3.twimg.com/profile_images/1597982571/... 2012-02-18 4 31 2012-02-18 04:31:16 N; OmniusManifesto 394061184 it NaN 0 <a href="http://www.socialoomph.com&qu...
6600004 170727082391900160 Our problems rise much more from Govts corrupt... http://a0.twimg.com/profile_images/1592676372/... 2012-02-18 4 31 2012-02-18 04:31:23 N; IndyPolitico 73935439 en NaN 0 <a href="http://www.socialoomph.com&qu...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6602115 170811007516672000 Man's knowledge makes another leap through the... http://a3.twimg.com/profile_images/1600926992/... 2012-02-18 10 4 2012-02-18 10:04:52 N; darealmaozedong 395911020 en NaN 0 <a href="http://github.com/fons/cl-twi...
6602116 170811073648279552 When we give any president - one man - too muc... http://a2.twimg.com/profile_images/1603734590/... 2012-02-18 10 5 2012-02-18 10:05:08 N; RonPaulsVoice 396995779 en NaN 0 <a href="http://twitter.com/RonPaulsVo...
6602117 170811301411553281 NYC forecast Tue 2/21/12: Partly cloudy. High ... http://a1.twimg.com/profile_images/1612658667/... 2012-02-18 10 6 2012-02-18 10:06:02 N; OccupyWeather 400559295 en NaN 0 <a href="http://24Ahead.com/" rel...
6602118 170811326703206400 The moral promise of a free society involves t... http://a2.twimg.com/profile_images/1603734590/... 2012-02-18 10 6 2012-02-18 10:06:08 N; RonPaulsVoice 396995779 en NaN 0 <a href="http://twitter.com/RonPaulsVo...
6602119 170811328037007360 RT @AnonOfTheAbove: RT @Apneac MT @MelMajik9: ... http://a3.twimg.com/profile_images/1600926992/... 2012-02-18 10 6 2012-02-18 10:06:08 N; darealmaozedong 395911020 en NaN 0 <a href="http://twitterfeed.com" ...

2120 rows × 14 columns

len(cops_data)
78397
pd.concat(dat, ignore_index=True)
0                                    170734732877893632
1     RT @DiceyTroop: When I got here, cops were has...
2     http://a2.twimg.com/profile_images/1753747297/...
3                                            2012-02-18
4                                                     5
5                                                     1
6                                   2012-02-18 05:01:47
7                                                    N;
8                                             shushugah
9                                              28624302
10                                                   en
11                                                  NaN
12                                                    0
13    <a href="http://twitter.com/#!/downloa...
Name: 6600282, dtype: object
df = pd.DataFrame(cops_data)  # cops_data is a list of row Series, one per matching tweet
df.head()
Twitter ID Text Profile Image URL Day Hour Minute Created At Geo From User From User ID Language To User To User ID Source
57 121813549478707200 RT @kittylight: Dear #cops THE WHOLE WORLD IS ... http://a2.twimg.com/profile_images/1146887237/... 2011-10-06 5 6 2011-10-06 05:06:28 N; dove_hawk 361839281 en NaN 0 <a href="http://twitter.com/#!/downloa...
95 121813722099482624 The whiny, sanctimonious drivel coming out of ... http://a3.twimg.com/profile_images/1573938172/... 2011-10-06 5 7 2011-10-06 05:07:09 N; wryson 351681669 en NaN 0 <a href="http://twitter.com/">...
98 121813748003508224 RT @KeithOlbermann: Again NYPD supervisors do ... http://a1.twimg.com/profile_images/509909348/t... 2011-10-06 5 7 2011-10-06 05:07:15 N; dannydoodar 76258793 en NaN 0 <a href="http://stone.com/Twittelator&...
267 121814376234754049 RT @kittylight: #isad #stayhungry #ThinkDiffer... http://a2.twimg.com/profile_images/1540184395/... 2011-10-06 5 9 2011-10-06 05:09:45 N; kittylightsCat 406361898 en NaN 0 <a href="http://twitter.com/#!/downloa...
278 121814402025533440 RT @kittylight: Dear #cops THE WHOLE WORLD IS ... http://a2.twimg.com/profile_images/1540184395/... 2011-10-06 5 9 2011-10-06 05:09:51 N; kittylightsCat 406361898 en NaN 0 <a href="http://twitter.com/#!/downloa...
