Preprocessing Big Data

Example: the Occupy Wall Street Twitter data

Byte (/baɪt/)

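A byte is a unit used in computing to measure storage capacity; one byte usually consists of eight bits. [1] The term also names a data type in some programming languages and the storage unit for a character.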

  • 1 B (byte) = 8 bit;

  • 1 KB = 1000 B; 1 MB = 1000 KB = 1000 × 1000 B, where 1000 = 10^3.

  • 1 KB (kilobyte) = 1000 B = 10^3 B;

  • 1 MB (megabyte, one million bytes) = 1000 KB = 10^6 B;

  • 1 GB (gigabyte, one billion bytes) = 1000 MB = 10^9 B.
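As a quick illustration of the decimal units above, the sketch below converts a raw byte count into a readable string. The helper name `format_size` is hypothetical, and it assumes the powers-of-1000 convention used in this list rather than the binary (powers of 1024) one.

```python
def format_size(num_bytes):
    """Format a byte count using decimal units: 1 KB = 1000 B."""
    units = ["B", "KB", "MB", "GB", "TB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value drops below 1000,
        # or at the largest unit in the list.
        if size < 1000 or unit == units[-1]:
            return f"{size:.2f} {unit}"
        size /= 1000


print(format_size(8))            # 8.00 B
print(format_size(1_500_000))    # 1.50 MB
print(format_size(2 * 10**9))    # 2.00 GB
```

The byte count of a file on disk can be obtained with `os.path.getsize(path)` and passed to this helper.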

Reading and processing data in chunks

Lazy Method for Reading Big File in Python?
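The idea behind that question is to avoid loading the whole file into memory at once. One lazy approach, sketched minimally here, is a generator that yields fixed-size pieces on demand. The function name `read_in_chunks` and the chunk size are illustrative choices, and the demo uses a small temporary file as a stand-in for a large one.

```python
import os
import tempfile


def read_in_chunks(file_object, chunk_size=1000):
    """Lazily yield successive pieces of at most chunk_size characters."""
    while True:
        data = file_object.read(chunk_size)
        if not data:  # an empty string means end of file
            break
        yield data


# Demo on a small temporary file (a stand-in for a large one).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("x" * 2500)
    tmp_path = tmp.name

with open(tmp_path, "r") as f:
    chunk_lengths = [len(piece) for piece in read_in_chunks(f, chunk_size=1000)]

os.remove(tmp_path)
print(chunk_lengths)  # [1000, 1000, 500]
```

Only one piece is held in memory at a time, so the memory cost is bounded by `chunk_size` rather than by the file size.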

# Read the data line by line
line_num = 0
cops_num = 0
with open('/Users/datalab/bigdata/cjc/ows-raw.txt', 'r') as f:
    for line in f:
        line_num += 1
        if 'cops' in line:
            cops_num += 1
        # Print progress every 10,000 lines
        if line_num % 10000 == 0:
            print(line_num)
10000
20000
30000
...
6900000
6910000
line_num
6911408
cops_num/line_num
0.011413448605551865
# readlines(hint) stops once the lines read so far total roughly `hint` characters
bigfile = open('/Users/datalab/bigdata/cjc/ows-raw.txt', 'r')
chunkSize = 1000000
chunk = bigfile.readlines(chunkSize)
print(len(chunk))
# with open("../data/ows_tweets_sample.txt", 'w') as f:
#     for i in chunk:
#         f.write(i)
2754
bigfile.readlines?
5%5
0
# https://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python?lq=1
import csv

bigfile = open('/Users/datalab/bigdata/cjc/ows-raw.txt', 'r')
chunkSize = 10**8
chunk = bigfile.readlines(chunkSize)
num_chunk, num_lines = 0, 0
while chunk:
    # Strip NUL characters, which can make csv.reader raise an error
    lines = csv.reader((line.replace('\x00', '') for line in chunk),
                       delimiter=',', quotechar='"')
    # Process the parsed rows here
    num_lines += len(list(lines))
    if num_chunk % 5 == 0:
        print(num_chunk, num_lines)
    num_chunk += 1
    chunk = bigfile.readlines(chunkSize)  # read another chunk
bigfile.close()
0 262665
5 1574666
10 2880857
15 4189419
20 5492578
25 6602141
num_lines
6602141
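Because `readlines(hint)` cuts chunks at line boundaries, a quoted field that contains a newline could in principle be split across two chunks. One possible alternative, sketched below, hands the file object to `csv.reader` directly and groups the parsed records with `itertools.islice`. The helper name `iter_record_batches` and the tiny in-memory sample are assumptions for illustration; this is not the method used in the cells above.

```python
import csv
import io
from itertools import islice


def iter_record_batches(file_object, batch_size):
    """Yield lists of up to batch_size parsed CSV records."""
    reader = csv.reader(file_object, delimiter=',', quotechar='"')
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            break
        yield batch


# In-memory sample: one record has a newline inside a quoted field,
# so the text has 5 physical lines but only 4 CSV records.
sample = io.StringIO(
    'id,text\n'
    '1,"line one\nline two"\n'
    '2,plain\n'
    '3,last\n',
    newline='',
)

batch_sizes = [len(batch) for batch in iter_record_batches(sample, batch_size=3)]
print(batch_sizes)  # [3, 1]
```

Batching by parsed records rather than by raw lines keeps multi-line records intact, at the cost of letting the CSV parser, not the byte hint, decide where each batch ends.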

Processing data at the hundred-million-row scale with Pandas get_chunk

As a rule of thumb, Hadoop is a reasonable technology choice only when the data volume exceeds 5 TB.

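import pandas as pd

f = open('/Users/datalab/bigdata/cjc/ows-raw.txt', encoding='utf-8')
# error_bad_lines=False skips malformed lines; it was deprecated in pandas 1.3
# in favor of on_bad_lines ('warn' or 'skip') and removed in pandas 2.0
reader = pd.read_table(f, sep=',', quotechar='"', iterator=True, error_bad_lines=False)
chunkSize = 100000
chunk = reader.get_chunk(chunkSize)
len(chunk)

#pd.read_table?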
100000
chunk.head()
  | Twitter ID | Text | Profile Image URL | Day | Hour | Minute | Created At | Geo | From User | From User ID | Language | To User | To User ID | Source
0 | 121813144174727168 | RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLIN... | http://a2.twimg.com/profile_images/1539375713/... | 2011-10-06 | 5 | 4 | 2011-10-06 05:04:51 | N; | Anonops_Cop | 401240477 | en | NaN | 0 | <a href="http://twitter.com/">...
1 | 121813146137657344 | @jamiekilstein @allisonkilkenny Interesting in... | http://a2.twimg.com/profile_images/1574715503/... | 2011-10-06 | 5 | 4 | 2011-10-06 05:04:51 | N; | KittyHybrid | 34532053 | en | jamiekilstein | 2149053 | <a href="http://twitter.com/">...
2 | 121813150000619521 | @Seductivpancake Right! Those guys have a vict... | http://a1.twimg.com/profile_images/1241412831/... | 2011-10-06 | 5 | 4 | 2011-10-06 05:04:52 | N; | nerdsherpa | 95067344 | en | Seductivpancake | 19695580 | <a href="http://www.echofon.com/"...
3 | 121813150701072385 | RT @bembel "Occupy Wall Street" als ... | http://a0.twimg.com/profile_images/1106399092/... | 2011-10-06 | 5 | 4 | 2011-10-06 05:04:52 | N; | hamudistan | 35862923 | en | NaN | 0 | <a href="http://levelupstudio.com&quot...
4 | 121813163778899968 | #ows White shirt= Brown shirt. | http://a2.twimg.com/profile_images/1568117871/... | 2011-10-06 | 5 | 4 | 2011-10-06 05:04:56 | N; | kl_knox | 419580636 | en | NaN | 0 | <a href="http://twitter.com/">...
import pandas as pd

f = open('/Users/datalab/bigdata/cjc/ows-raw.txt', encoding='utf-8')
# error_bad_lines=False skips malformed lines (pandas >= 1.3: use on_bad_lines instead)
reader = pd.read_table(f, sep=',', quotechar='"',
                       iterator=True, error_bad_lines=False)
chunkSize = 100000
loop = True
#data = []
num_chunk, num_lines = 0, 0
while loop:
    try:
        chunk = reader.get_chunk(chunkSize)
        # dat = data_cleaning_function(chunk) # do sth.
        num_lines += len(chunk)
        print(num_chunk, num_lines)
        num_chunk += 1
        #data.append(dat)
    except StopIteration:
        loop = False
        print("Iteration is stopped.")
#df = pd.concat(data, ignore_index=True)
0 100000
1 200000
2 300000
3 400000
4 500000
5 600000
6 700000
7 800000
8 900000
9 1000000
10 1100000
11 1200000
12 1300000
13 1400000
14 1500000
15 1600000
16 1700000
17 1800000
18 1900000
19 2000000
20 2100000
21 2200000
22 2300000
23 2400000
24 2500000
25 2600000
26 2700000
27 2800000
28 2900000
29 3000000
30 3100000
31 3200000
32 3300000
33 3400000
34 3500000
35 3600000
36 3700000
37 3800000
38 3900000
39 4000000
40 4100000
41 4200000
42 4300000
43 4400000
44 4500000
45 4600000
46 4700000
47 4800000
48 4900000
49 5000000
b'Skipping line 5051743: expected 14 fields, saw 15\n'
50 5100000
51 5200000
b'Skipping line 5254718: expected 14 fields, saw 15\n'
b'Skipping line 5281095: expected 14 fields, saw 15\n'
52 5300000
53 5400000
b'Skipping line 5481759: expected 14 fields, saw 15\nSkipping line 5482014: expected 14 fields, saw 15\nSkipping line 5482532: expected 14 fields, saw 15\n'
54 5500000
b'Skipping line 5516605: expected 14 fields, saw 15\n'
55 5600000
56 5700000
b'Skipping line 5709055: expected 14 fields, saw 15\n'
b'Skipping line 5796658: expected 14 fields, saw 15\n'
57 5800000
58 5900000
b'Skipping line 5927412: expected 14 fields, saw 15\nSkipping line 5927419: expected 14 fields, saw 15\nSkipping line 5927421: expected 14 fields, saw 15\nSkipping line 5927451: expected 14 fields, saw 15\nSkipping line 5927478: expected 14 fields, saw 15\n'
59 6000000
60 6100000
61 6200000
b'Skipping line 6229621: expected 14 fields, saw 16\nSkipping line 6245861: expected 14 fields, saw 17\n'
b'Skipping line 6278728: expected 14 fields, saw 15\n'
62 6300000
b'Skipping line 6350262: expected 14 fields, saw 15\n'
b'Skipping line 6387321: expected 14 fields, saw 15\nSkipping line 6388879: expected 14 fields, saw 15\n'
63 6400000
64 6500000
65 6600000
66 6602120
Iteration is stopped.
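The same kind of loop can be written more compactly by letting pandas drive the iteration: passing `chunksize` to `pd.read_csv` returns an iterator of DataFrames, so no explicit `StopIteration` handling is needed. This is a sketch on a small in-memory CSV, not the original dataset; the helper name `count_rows_in_chunks` and the sample columns are hypothetical.

```python
import io

import pandas as pd


def count_rows_in_chunks(source, chunk_size):
    """Count rows by iterating over DataFrame chunks of chunk_size rows."""
    total = 0
    for chunk in pd.read_csv(source, sep=',', quotechar='"', chunksize=chunk_size):
        # A cleaning step applied to each chunk would go here.
        total += len(chunk)
    return total


sample = io.StringIO("id,text\n1,a\n2,b\n3,c\n4,d\n5,e\n")
total_rows = count_rows_in_chunks(sample, chunk_size=2)
print(total_rows)  # 5
```

Each chunk is an ordinary DataFrame, so per-chunk results can be collected in a list and combined afterwards with `pd.concat`, as the commented-out lines in the cell above suggest.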
