{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "# Data Cleaning: Twitter Data\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Data cleaning\n", "Data cleaning is an important step in data analysis. Its main goal is to turn messy data into data that can be analyzed directly, which usually means converting it into a data frame.\n", "\n", "This chapter uses the cleaning of tweet text as an example to introduce the basic logic of data cleaning:\n", "\n", "- Removing malformed rows\n", "- Splitting columns correctly\n", "- Extracting the content to be analyzed\n", "- Preprocessing large datasets line by line or in chunks\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Considering the delimiter and the quote character together\n", "\n", "- Column delimiter 🔥 (separator): `sep`, `delimiter`\n", "- Quote character ☁️: `quotechar`\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:07:10.567235Z", "start_time": "2023-11-17T07:07:10.551054Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Note: you may need to change the path below\n", "with open(\"./data/ows_tweets_sample.txt\", 'r') as f:\n", "    chunk = f.readlines()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:07:14.434792Z", "start_time": "2023-11-17T07:07:14.425674Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "2754" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(chunk)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:07:29.063364Z", "start_time": "2023-11-17T07:07:29.035705Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2628\n" ] } ], "source": [ "import csv\n", "# Quoted fields can contain line breaks, so the parsed record count can be lower than the raw line count\n", "lines_csv = csv.reader(chunk, delimiter=',', quotechar='\"')\n", "print(len(list(lines_csv)))\n", "# next(lines_csv)\n", "# next(lines_csv)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:07:57.675044Z", "start_time": 
"2023-11-17T07:07:57.170877Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>Twitter ID</th><th>Text</th><th>Profile Image URL</th><th>Day</th><th>Hour</th><th>Minute</th><th>Created At</th><th>Geo</th><th>From User</th><th>From User ID</th><th>Language</th><th>To User</th><th>To User ID</th><th>Source</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>121813144174727168</td><td>RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLIN...</td><td>http://a2.twimg.com/profile_images/1539375713/...</td><td>2011-10-06</td><td>5</td><td>4</td><td>2011-10-06 05:04:51</td><td>N;</td><td>Anonops_Cop</td><td>401240477</td><td>en</td><td>NaN</td><td>0</td><td>&lt;a href=&quot;http://twitter.com/&quot;&gt;...</td></tr>\n", "    <tr><th>1</th><td>121813146137657344</td><td>@jamiekilstein @allisonkilkenny Interesting in...</td><td>http://a2.twimg.com/profile_images/1574715503/...</td><td>2011-10-06</td><td>5</td><td>4</td><td>2011-10-06 05:04:51</td><td>N;</td><td>KittyHybrid</td><td>34532053</td><td>en</td><td>jamiekilstein</td><td>2149053</td><td>&lt;a href=&quot;http://twitter.com/&quot;&gt;...</td></tr>\n", "    <tr><th>2</th><td>121813150000619521</td><td>@Seductivpancake Right! Those guys have a vict...</td><td>http://a1.twimg.com/profile_images/1241412831/...</td><td>2011-10-06</td><td>5</td><td>4</td><td>2011-10-06 05:04:52</td><td>N;</td><td>nerdsherpa</td><td>95067344</td><td>en</td><td>Seductivpancake</td><td>19695580</td><td>&lt;a href=&quot;http://www.echofon.com/&quot;...</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Twitter ID Text \\\n", "0 121813144174727168 RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLIN... \n", "1 121813146137657344 @jamiekilstein @allisonkilkenny Interesting in... \n", "2 121813150000619521 @Seductivpancake Right! Those guys have a vict... \n", "\n", " Profile Image URL Day Hour \\\n", "0 http://a2.twimg.com/profile_images/1539375713/... 2011-10-06 5 \n", "1 http://a2.twimg.com/profile_images/1574715503/... 2011-10-06 5 \n", "2 http://a1.twimg.com/profile_images/1241412831/... 2011-10-06 5 \n", "\n", " Minute Created At Geo From User From User ID Language \\\n", "0 4 2011-10-06 05:04:51 N; Anonops_Cop 401240477 en \n", "1 4 2011-10-06 05:04:51 N; KittyHybrid 34532053 en \n", "2 4 2011-10-06 05:04:52 N; nerdsherpa 95067344 en \n", "\n", " To User To User ID \\\n", "0 NaN 0 \n", "1 jamiekilstein 2149053 \n", "2 Seductivpancake 19695580 \n", "\n", " Source \n", "0 <a href="http://twitter.com/">... \n", "1 <a href="http://twitter.com/">... \n", "2 <a href="http://www.echofon.com/"... " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "df = pd.read_csv(\"./data/ows_tweets_sample.txt\",\n", " sep = ',', quotechar='\"')\n", "df[:3]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:08:02.149131Z", "start_time": "2023-11-17T07:08:02.145843Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "2627" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df) " ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:08:52.981486Z", "start_time": "2023-11-17T07:08:52.976558Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! 
#OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE RT !!HELP!!!!'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# df.columns\n", "df['Text'][0]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:10:19.924927Z", "start_time": "2023-11-17T07:10:19.917805Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "0 Anonops_Cop\n", "1 KittyHybrid\n", "2 nerdsherpa\n", "3 hamudistan\n", "4 kl_knox\n", "5 vickycrampton\n", "6 burgerbuilders\n", "7 neverfox\n", "8 davidgaliel\n", "9 AnonOws\n", "Name: From User, dtype: object" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['From User'][:10]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Distribution of users by number of tweets posted\n", "> How many users posted a given number of tweets" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:11:26.784098Z", "start_time": "2023-11-17T07:11:26.780421Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from collections import defaultdict\n", "# count the number of tweets posted by each user\n", "data_dict = defaultdict(int)\n", "for i in df['From User']:\n", "    data_dict[i] += 1" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:11:32.344723Z", "start_time": "2023-11-17T07:11:32.340738Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[('Anonops_Cop', 1),\n", " ('KittyHybrid', 1),\n", " ('nerdsherpa', 2),\n", " ('hamudistan', 1),\n", " ('kl_knox', 1)]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(data_dict.items())[:5]\n", "#data_dict" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:12:35.074285Z", "start_time": "2023-11-17T07:12:34.057696Z" }, "slideshow": { 
"slide_type": "slide" } }, "outputs": [], "source": [ "import pylab as plt\n", "\n", "plt.rcParams['font.sans-serif'] = ['Microsoft YaHei'] # 用来正常显示中文标签\n", "plt.rcParams['axes.unicode_minus'] = False # 用来正常显示负号, 注意['SimHei']对应这句不行.\n", "\n", "plt.style.use('ggplot') " ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:13:15.957926Z", "start_time": "2023-11-17T07:13:15.273134Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.hist(data_dict.values())\n", "plt.yscale('log')\n", "plt.xscale('log')\n", "plt.xlabel(u'发帖数', fontsize = 20)\n", "plt.ylabel(u'人数', fontsize = 20)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:14:19.854819Z", "start_time": "2023-11-17T07:14:19.391617Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tweet_dict = defaultdict(int)\n", "for i in data_dict.values():\n", " tweet_dict[i] += 1 \n", " \n", "plt.loglog(list(tweet_dict.keys()), list(tweet_dict.values()), 'bo')#linewidth=2) \n", "plt.xlabel(u'推特数', fontsize=20)\n", "plt.ylabel(u'人数', fontsize=20 ) \n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:15:00.984537Z", "start_time": "2023-11-17T07:14:55.590346Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import numpy as np\n", "import statsmodels.api as sm\n", "\n", "def powerPlot(d_value, d_freq, color, marker):\n", " d_freq = [i + 1 for i in d_freq]\n", " d_prob = [float(i)/sum(d_freq) for i in d_freq]\n", " #d_rank = ss.rankdata(d_value).astype(int)\n", " x = np.log(d_value)\n", " y = np.log(d_prob)\n", " xx = sm.add_constant(x, prepend=True)\n", " res = sm.OLS(y,xx).fit()\n", " constant,beta = res.params\n", " r2 = res.rsquared\n", " plt.plot(d_value, d_prob, linestyle = '',\\\n", " color = color, marker = marker)\n", " plt.plot(d_value, np.exp(constant+x*beta),\"red\")\n", " plt.xscale('log'); plt.yscale('log')\n", " plt.text(max(d_value)/2,max(d_prob)/10,\n", " r'$\\beta$ = ' + str(round(beta,2)) +'\\n' + r'$R^2$ = ' + str(round(r2, 2)), fontsize = 20)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:15:05.728842Z", "start_time": "2023-11-17T07:15:05.272443Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "histo, bin_edges = np.histogram(list(data_dict.values()), 15)\n", "bin_center = 0.5*(bin_edges[1:] + bin_edges[:-1])\n", "powerPlot(bin_center,histo, 'r', '^')\n", "#lg=plt.legend(labels = [u'Tweets', u'Fit'], loc=3, fontsize=20)\n", "plt.ylabel(u'概率', fontsize=20)\n", "plt.xlabel(u'推特数', fontsize=20) \n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2021-11-08T08:29:28.805177Z", "start_time": "2021-11-08T08:29:28.796546Z" }, "code_folding": [], "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import statsmodels.api as sm\n", "from collections import defaultdict\n", "import numpy as np\n", "\n", "def powerPlot2(data):\n", " d = sorted(data, reverse = True )\n", " d_table = defaultdict(int)\n", " for k in d:\n", " d_table[k] += 1\n", " d_value = sorted(d_table)\n", " d_value = [i+1 for i in d_value]\n", " d_freq = [d_table[i]+1 for i in d_value]\n", " d_prob = [float(i)/sum(d_freq) for i in d_freq]\n", " x = np.log(d_value)\n", " y = np.log(d_prob)\n", " xx = sm.add_constant(x, prepend=True)\n", " res = sm.OLS(y,xx).fit()\n", " constant,beta = res.params\n", " r2 = res.rsquared\n", " plt.plot(d_value, d_prob, 'ro')\n", " plt.plot(d_value, np.exp(constant+x*beta),\"red\")\n", " plt.xscale('log'); plt.yscale('log')\n", " plt.text(max(d_value)/2,max(d_prob)/5,\n", " 'Beta = ' + str(round(beta,2)) +'\\n' + 'R squared = ' + str(round(r2, 2)))\n", " plt.title('Distribution')\n", " plt.ylabel('P(K)')\n", " plt.xlabel('K')\n", " plt.show()\n", " " ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2021-11-08T08:29:33.438755Z", "start_time": "2021-11-08T08:29:33.052468Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "powerPlot2(data_dict.values())" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "ExecuteTime": { "end_time": "2020-06-06T09:11:14.088105Z", "start_time": "2020-06-06T09:11:09.461725Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting powerlaw\n", " Downloading powerlaw-1.4.6.tar.gz (27 kB)\n", "Requirement already satisfied: scipy in /opt/anaconda3/lib/python3.7/site-packages (from powerlaw) (1.4.1)\n", "Requirement already satisfied: numpy in /opt/anaconda3/lib/python3.7/site-packages (from powerlaw) (1.18.1)\n", "Requirement already satisfied: matplotlib in /opt/anaconda3/lib/python3.7/site-packages (from powerlaw) (3.1.3)\n", "Requirement already satisfied: mpmath in /opt/anaconda3/lib/python3.7/site-packages (from powerlaw) (1.1.0)\n", "Requirement already satisfied: python-dateutil>=2.1 in /opt/anaconda3/lib/python3.7/site-packages (from matplotlib->powerlaw) (2.8.1)\n", "Requirement already satisfied: cycler>=0.10 in /opt/anaconda3/lib/python3.7/site-packages (from matplotlib->powerlaw) (0.10.0)\n", "Requirement already satisfied: kiwisolver>=1.0.1 in /opt/anaconda3/lib/python3.7/site-packages (from matplotlib->powerlaw) (1.1.0)\n", "Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /opt/anaconda3/lib/python3.7/site-packages (from matplotlib->powerlaw) (2.4.6)\n", "Requirement already satisfied: six>=1.5 in /opt/anaconda3/lib/python3.7/site-packages (from python-dateutil>=2.1->matplotlib->powerlaw) (1.14.0)\n", "Requirement already satisfied: setuptools in /opt/anaconda3/lib/python3.7/site-packages (from kiwisolver>=1.0.1->matplotlib->powerlaw) (46.0.0.post20200309)\n", "Building wheels for collected packages: powerlaw\n", " Building wheel for powerlaw (setup.py) ... 
\u001b[?25ldone\n", "\u001b[?25h Created wheel for powerlaw: filename=powerlaw-1.4.6-py3-none-any.whl size=24787 sha256=0e7d23e9100feb4fed1092e73633e496f9c1cb767c3f8632b9d9e9d488215eda\n", " Stored in directory: /Users/datalab/Library/Caches/pip/wheels/ee/51/38/2e0f20cf80e1a0909acdd527df2288bd9feb8356b926d7d775\n", "Successfully built powerlaw\n", "Installing collected packages: powerlaw\n", "Successfully installed powerlaw-1.4.6\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "pip install powerlaw" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2021-11-08T08:30:49.354240Z", "start_time": "2021-11-08T08:30:49.344654Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import powerlaw\n", "def plotPowerlaw(data,ax,col,xlab):\n", " fit = powerlaw.Fit(data,xmin=2)\n", " #fit = powerlaw.Fit(data)\n", " fit.plot_pdf(color = col, linewidth = 2)\n", " a,x = (fit.power_law.alpha,fit.power_law.xmin)\n", " fit.power_law.plot_pdf(color = col, linestyle = 'dotted', ax = ax, \\\n", " label = r\"$\\alpha = %d \\:\\:, x_{min} = %d$\" % (a,x))\n", " ax.set_xlabel(xlab, fontsize = 20)\n", " ax.set_ylabel('$Probability$', fontsize = 20)\n", " plt.legend(loc = 0, frameon = False)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2021-11-08T08:30:58.060920Z", "start_time": "2021-11-08T08:30:58.056644Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from collections import defaultdict\n", "data_dict = defaultdict(int)\n", "\n", "for i in df['From User']:\n", " data_dict[i] += 1" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2021-11-08T08:31:02.499748Z", "start_time": "2021-11-08T08:31:01.946001Z" }, "code_folding": [ 0 ], "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ 
"/opt/anaconda3/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: invalid value encountered in true_divide\n", " (Theoretical_CDF * (1 - Theoretical_CDF))\n", "/opt/anaconda3/lib/python3.7/site-packages/powerlaw.py:700: RuntimeWarning: invalid value encountered in true_divide\n", " (Theoretical_CDF * (1 - Theoretical_CDF))\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# \n", "import matplotlib.cm as cm\n", "cmap = cm.get_cmap('rainbow_r',6)\n", "\n", "fig = plt.figure(figsize=(6, 4),facecolor='white')\n", "ax = fig.add_subplot(1, 1, 1)\n", "plotPowerlaw(list(data_dict.values()), ax,cmap(1), \n", " '$Tweets$')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 清洗tweets文本" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:17:12.275188Z", "start_time": "2023-11-17T07:17:12.272549Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "tweet = '''RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! \n", " #OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE @chengjun @mili http://computational-communication.com \n", " http://ccc.nju.edu.cn RT !!HELP!!!!'''" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:17:53.753775Z", "start_time": "2023-11-17T07:17:53.635162Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "#!pip install twitter-text\n", "import re\n", "import twitter_text \n", "# https://github.com/dryan/twitter-text-py/issues/21\n", "#Macintosh HD ▸ 用户 ▸ datalab ▸ 应用程序 ▸ anaconda ▸ lib ▸ python3.5 ▸ site-packages" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:18:57.391676Z", "start_time": "2023-11-17T07:18:57.384787Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "' @AnonKitsu: @who'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "\n", "tweet = '''RT @AnonKitsu: @who ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! 
\n", " #OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE @chengjun @mili http://computational-communication.com \n", " http://ccc.nju.edu.cn RT !!HELP!!!!'''\n", "\n", "rt_patterns = re.compile(r\"(RT|via)((?:\\b\\W*@\\w+)+)\", re.IGNORECASE)\n", "rt_user_name = rt_patterns.findall(tweet)[0][1]#.strip(' @').split(':')[0]\n", "rt_user_name " ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:19:35.078725Z", "start_time": "2023-11-17T07:19:35.072864Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'AnonKitsu'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "\n", "tweet = '''RT @AnonKitsu: @who ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! \n", " #OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE @chengjun @mili http://computational-communication.com \n", " http://ccc.nju.edu.cn RT !!HELP!!!!'''\n", "\n", "rt_patterns = re.compile(r\"(RT|via)((?:\\b\\W*@\\w+)+)\", \\\n", " re.IGNORECASE)\n", "rt_user_name = rt_patterns.findall(tweet)[0][1].strip(' @').split(':')[0]\n", "rt_user_name" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:20:04.778179Z", "start_time": "2023-11-17T07:20:04.773989Z" }, "scrolled": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[]\n", "None\n" ] } ], "source": [ "import re\n", "\n", "tweet = '''@chengjun:@who ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! 
\n", " #OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE @chengjun @mili http://computational-communication.com \n", " http://ccc.nju.edu.cn RT !!HELP!!!!'''\n", "\n", "rt_patterns = re.compile(r\"(RT|via)((?:\\b\\W*@\\w+)+)\", re.IGNORECASE)\n", "rt_user_name = rt_patterns.findall(tweet)\n", "print(rt_user_name)\n", "\n", "if rt_user_name:\n", "    print('it exists.')\n", "else:\n", "    print('None')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:20:37.233484Z", "start_time": "2023-11-17T07:20:37.229741Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import re\n", "\n", "def extract_rt_user(tweet):\n", "    # return the first user name that follows 'RT' or 'via', or None if the tweet is not a retweet\n", "    rt_patterns = re.compile(r\"(RT|via)((?:\\b\\W*@\\w+)+)\", re.IGNORECASE)\n", "    rt_user_name = rt_patterns.findall(tweet)\n", "    if rt_user_name:\n", "        rt_user_name = rt_user_name[0][1].strip(' @').split(':')[0]\n", "    else:\n", "        rt_user_name = None\n", "    return rt_user_name" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:20:42.433344Z", "start_time": "2023-11-17T07:20:42.429431Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'chengjun'" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tweet = '''RT @chengjun: ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! 
\n", " #OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE @chengjun @mili http://computational-communication.com \n", " http://ccc.nju.edu.cn RT !!HELP!!!!'''\n", "\n", "extract_rt_user(tweet) " ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:20:58.137408Z", "start_time": "2023-11-17T07:20:58.133727Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "None\n" ] } ], "source": [ "tweet = '''@chengjun: ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! \n", " #OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE @chengjun @mili http://computational-communication.com \n", " http://ccc.nju.edu.cn RT !!HELP!!!!'''\n", "\n", "print(extract_rt_user(tweet) )" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:22:24.572818Z", "start_time": "2023-11-17T07:22:24.546424Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "[('RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! #OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE RT !!HELP!!!!',\n", " 'Anonops_Cop'),\n", " ('@jamiekilstein @allisonkilkenny Interesting interview (never aired, wonder why??) by Fox with #ows protester http://t.co/Fte55Kh7',\n", " 'KittyHybrid'),\n", " (\"@Seductivpancake Right! Those guys have a victory condition: regime change. 
#ows doesn't seem to have a goal I can figure out.\",\n", " 'nerdsherpa')]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import csv\n", "\n", "with open(\"./data/ows_tweets_sample.txt\", 'r') as f:\n", "    chunk = f.readlines()\n", "\n", "rt_network = []\n", "# chunk[1:] skips the header line; column 1 is the tweet text and column 8 is the posting user\n", "lines = csv.reader(chunk[1:], delimiter=',', quotechar='\"')\n", "tweet_user_data = [(i[1], i[8]) for i in lines]\n", "tweet_user_data[:3]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:23:21.609921Z", "start_time": "2023-11-17T07:23:21.577762Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[(('Anonops_Cop', 'AnonKitsu'), 1),\n", " (('hamudistan', 'bembel'), 1),\n", " (('vickycrampton', 'TheNewDeal'), 2)]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from collections import defaultdict\n", "\n", "rt_network = []\n", "rt_dict = defaultdict(int)\n", "for tweet, user in tweet_user_data:\n", "    rt_user = extract_rt_user(tweet)\n", "    if rt_user:\n", "        # edge: (user who retweets, user who is retweeted)\n", "        rt_network.append((user, rt_user))\n", "        rt_dict[(user, rt_user)] += 1\n", "#rt_network[:5]\n", "list(rt_dict.items())[:3]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Getting the cleaned tweet text\n", "\n", "The cleaned text contains no user names, URLs, or markers such as `RT @`." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:24:45.464970Z", "start_time": "2023-11-17T07:24:45.461187Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def extract_tweet_text(tweet, at_names, urls):\n", "    for i in at_names:\n", "        tweet = tweet.replace(i, '')\n", "    for j in urls:\n", "        tweet = tweet.replace(j, '')\n", "    # the last mark is a double space, so single spaces between words are kept\n", "    marks = ['RT @', '@', '\"', '#', '\\n', '\\t', '  ']\n", "    for k in marks:\n", "        tweet = tweet.replace(k, '')\n", "    return tweet" ] 
}, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Installing twitter_text\n", "\n", "[twitter-text-py](https://github.com/dryan/twitter-text-py/issues/21) does not work with Python 3.\n", "\n", "Glyph debugged the problem and created [a new repo, twitter-text-py3](https://github.com/glyph/twitter-text-py).\n", "\n", "> pip install twitter-text\n" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:25:48.921110Z", "start_time": "2023-11-17T07:25:48.916523Z" }, "scrolled": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['AnonKitsu', 'chengjun', 'mili'] ['https://computational-communication.com', 'http://ccc.nju.edu.cn'] ['OCCUPYWALLSTREET', 'OWS', 'OCCUPYNY'] AnonKitsu --------> : ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! OCCUPYWALLSTREET OWS OCCUPYNY PLEASE RT !!HELP!!!!\n" ] } ], "source": [ "import twitter_text\n", "\n", "tweet = '''RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! 
\n", " #OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE @chengjun @mili https://computational-communication.com \n", " http://ccc.nju.edu.cn RT !!HELP!!!!'''\n", "\n", "ex = twitter_text.Extractor(tweet)\n", "at_names = ex.extract_mentioned_screen_names()\n", "urls = ex.extract_urls()\n", "hashtags = ex.extract_hashtags()\n", "rt_user = extract_rt_user(tweet)\n", "tweet_text = extract_tweet_text(tweet, at_names, urls)\n", "\n", "print(at_names, urls, hashtags, rt_user,'-------->', tweet_text)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:26:08.918860Z", "start_time": "2023-11-17T07:26:08.899934Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import csv\n", "\n", "lines = csv.reader(chunk,delimiter=',', quotechar='\"')\n", "# column 1 is the tweet text; the first element comes from the header row\n", "tweets = [i[1] for i in lines]" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2023-11-17T07:26:17.280551Z", "start_time": "2023-11-17T07:26:17.275568Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[] [] [] None\n", "['AnonKitsu'] [] ['OCCUPYWALLSTREET', 'OWS', 'OCCUPYNY'] AnonKitsu\n", "['jamiekilstein', 'allisonkilkenny'] ['http://t.co/Fte55Kh7'] ['ows'] None\n", "['Seductivpancake'] [] ['ows'] None\n", "['bembel'] ['http://j.mp/rhHavq'] ['OccupyWallStreet', 'OWS'] bembel\n" ] } ], "source": [ "for tweet in tweets[:5]:\n", "    ex = twitter_text.Extractor(tweet)\n", "    at_names = ex.extract_mentioned_screen_names()\n", "    urls = ex.extract_urls()\n", "    hashtags = ex.extract_hashtags()\n", "    rt_user = extract_rt_user(tweet)\n", "    #tweet_text = extract_tweet_text(tweet, at_names, urls)\n", "\n", "    print(at_names, urls, hashtags, rt_user)\n", "    #print(tweet_text)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "slide" } }, "source": [ "## Homework\n", "\n", "Extract the retweet network between rt_user and user from the raw tweets.\n", "\n", "Format:\n", 
"\n", "rt_user1, user1, 3\n", "\n", "rt_user2, user3, 2\n", "\n", "rt_user2, user4, 1\n", "\n", "...\n", "\n", "数据保存为csv格式" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 0, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": false, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "1260px", "left": "1835px", "top": "224px", "width": "512px" }, "toc_section_display": false, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 1 }