{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"\n",
"# Chapter 8: Text Mining\n",
"\n",
"\n",
"![image.png](images/author.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"What can be learned from 5 million books\n",
"\n",
"https://www.bilibili.com/video/BV1jJ411u7Nd\n",
"\n",
"This talk by Jean-Baptiste Michel and Erez Lieberman Aiden is phenomenal. \n",
"\n",
"\n",
"Michel, J.-B., et al. (2011). Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 331, 176–182."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2021-05-22T02:38:38.471705Z",
"start_time": "2021-05-22T02:38:38.462496Z"
},
"code_folding": [],
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%html \n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"![](./img/books.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Try the Google Books Ngram data: https://books.google.com/ngrams/\n",
" \n",
"\n",
"Data download: http://www.culturomics.org/home"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Bag-of-words model (BOW)\n",
"\n",
"Represent text as numerical feature vectors"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- We create a vocabulary of unique tokens—for example, words—from the entire set of documents.\n",
"- We construct a feature vector from each document that contains the counts of how often each word occurs in the particular document."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Since the unique words in each document represent only a small subset of all the\n",
"words in the bag-of-words vocabulary, the feature vectors will consist of mostly\n",
"zeros, which is why we call them sparse."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"![image.png](images/bow.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
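},
"source": [
"For a given text, the bag-of-words model assumes that:\n",
"- word order, grammar, and syntax are ignored;\n",
"- the text is treated simply as a set or combination of words;\n",
"- each word occurs independently of whether any other word occurs.\n",
" - The word chosen at any position in the text is picked independently and is not influenced by the preceding sentences.\n",
"\n",
"This assumption simplifies natural language and makes it easier to model, but it discards the information carried by word order.\n",
"\n",
"Document-Term Matrix (DTM)\n"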
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
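},
"source": [
"Problem: in personalized news recommendation, suppose a user is very interested in the phrase “Nanjing drunk-driving accident” (南京醉酒驾车事故). Because the bag-of-words model ignores order and syntax, it treats the user as interested in “Nanjing”, “drunk”, “driving”, and “accident” separately, so it may recommend news related to “Nanjing”, “bus”, and “accident”.\n",
"\n",
"Solution: extract the whole phrase, or use a higher-order (second order or above) statistical language model. For example, bigrams and trigrams preserve local word order, which amounts to a bag of bigrams or a bag of trigrams."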
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Transforming words into feature vectors"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. \n",
"\n",
"In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. \n",
"\n",
"There are various schemes for determining the value that each entry in the matrix should take. One such scheme is tf-idf. Such matrices are widely used in natural language processing."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"D1 = \"I like databases\"\n",
"\n",
"D2 = \"I hate databases\"\n",
"\n",
"| | I | like |hate | databases |\n",
"| -------------|:-------------:|:-------------:|:-------------:|-----:|\n",
"| D1| 1| 1 | 0 |1|\n",
"| D2| 1| 0 | 1 |1|"
]
},
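{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The D1/D2 table above can be reproduced with a few lines of plain Python. This is only a minimal sketch, not what scikit-learn does internally: it assumes whitespace tokenization, and the names `docs_dtm`, `vocab_dtm`, and `dtm` are hypothetical.\n",
"\n",
"```python\n",
"# Hypothetical toy corpus matching D1 and D2 above\n",
"docs_dtm = ['I like databases', 'I hate databases']\n",
"\n",
"# Build the vocabulary in first-seen order (assumption: split on whitespace)\n",
"vocab_dtm = []\n",
"for doc in docs_dtm:\n",
"    for token in doc.split():\n",
"        if token not in vocab_dtm:\n",
"            vocab_dtm.append(token)\n",
"\n",
"# One row per document, one column per vocabulary term\n",
"dtm = [[doc.split().count(term) for term in vocab_dtm] for doc in docs_dtm]\n",
"\n",
"print(vocab_dtm)\n",
"for row in dtm:\n",
"    print(row)\n",
"```\n",
"\n",
"The columns follow first-seen order (I, like, databases, hate), so they are arranged differently from the table, but each row holds the same counts."
]
},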
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"ExecuteTime": {
"end_time": "2021-05-22T03:29:57.420415Z",
"start_time": "2021-05-22T03:29:57.414495Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"count = CountVectorizer(ngram_range=(1, 2))\n",
"docs = np.array([\n",
" 'The sun is shining',\n",
" 'The weather is sweet',\n",
" 'The sun is shining and the weather is sweet'])\n",
"bag = count.fit_transform(docs)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"ExecuteTime": {
"end_time": "2021-05-22T03:29:59.422188Z",
"start_time": "2021-05-22T03:29:59.417891Z"
}
},
"outputs": [],
"source": [
"count?"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"ExecuteTime": {
"end_time": "2021-05-22T03:30:05.202621Z",
"start_time": "2021-05-22T03:30:05.198695Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['and',\n",
" 'and the',\n",
" 'is',\n",
" 'is shining',\n",
" 'is sweet',\n",
" 'shining',\n",
" 'shining and',\n",
" 'sun',\n",
" 'sun is',\n",
" 'sweet',\n",
" 'the',\n",
" 'the sun',\n",
" 'the weather',\n",
" 'weather',\n",
" 'weather is']"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
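],
"source": [
"# Note: on scikit-learn >= 1.0, use count.get_feature_names_out() here and in later cells;\n",
"# get_feature_names() was removed in scikit-learn 1.2.\n",
"count.get_feature_names()"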
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"ExecuteTime": {
"end_time": "2021-05-22T03:30:08.176627Z",
"start_time": "2021-05-22T03:30:08.173567Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'the': 10, 'sun': 7, 'is': 2, 'shining': 5, 'the sun': 11, 'sun is': 8, 'is shining': 3, 'weather': 13, 'sweet': 9, 'the weather': 12, 'weather is': 14, 'is sweet': 4, 'and': 0, 'shining and': 6, 'and the': 1}\n"
]
}
],
"source": [
"print(count.vocabulary_)  # maps each term to its column index"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"ExecuteTime": {
"end_time": "2021-05-22T03:30:12.312138Z",
"start_time": "2021-05-22T03:30:12.308561Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"scipy.sparse.csr.csr_matrix"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(bag)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"ExecuteTime": {
"end_time": "2021-05-22T03:30:22.024858Z",
"start_time": "2021-05-22T03:30:22.021489Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[0 0 1 1 0 1 0 1 1 0 1 1 0 0 0]\n",
" [0 0 1 0 1 0 0 0 0 1 1 0 1 1 1]\n",
" [1 1 2 1 1 1 1 1 1 1 2 1 1 1 1]]\n"
]
}
],
"source": [
"print(bag.toarray())"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"ExecuteTime": {
"end_time": "2021-05-22T03:30:26.211674Z",
"start_time": "2021-05-22T03:30:26.197493Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
" and and the is is shining is sweet shining shining and sun sun is \\\n",
"0 0 0 1 1 0 1 0 1 1 \n",
"1 0 0 1 0 1 0 0 0 0 \n",
"2 1 1 2 1 1 1 1 1 1 \n",
"\n",
" sweet the the sun the weather weather weather is \n",
"0 0 1 1 0 0 0 \n",
"1 1 1 0 1 1 1 \n",
"2 1 2 1 1 1 1 "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"pd.DataFrame(bag.toarray(), columns = count.get_feature_names())"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"A bag-of-words model in which each item or token in the vocabulary represents a single word is also called the 1-gram or unigram model. The vocabulary we just created with `ngram_range=(1, 2)` contains both 1-grams and 2-grams. \n",
"\n",
"## n-gram model\n",
"The choice of the number n in the n-gram model depends on the particular application.\n",
"\n",
"- 1-gram: \"the\", \"sun\", \"is\", \"shining\"\n",
"- 2-gram: \"the sun\", \"sun is\", \"is shining\" "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The CountVectorizer class in scikit-learn allows us to use different\n",
"n-gram models via its `ngram_range` parameter. \n",
"\n",
"While a 1-gram representation is used by default, we could switch to a 2-gram\n",
"representation by initializing a new CountVectorizer instance with\n",
"`ngram_range=(2, 2)`."
]
},
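{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"To make the n-gram idea concrete, here is a minimal sketch in plain Python. It is not how `CountVectorizer` is implemented; it assumes lowercased whitespace tokens, and the helper name `word_ngrams` is hypothetical.\n",
"\n",
"```python\n",
"def word_ngrams(text, n):\n",
"    '''Return the word n-grams of text (assumption: lowercased whitespace tokens).'''\n",
"    tokens = text.lower().split()\n",
"    # Slide a window of length n over the token list\n",
"    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]\n",
"\n",
"print(word_ngrams('The sun is shining', 1))\n",
"print(word_ngrams('The sun is shining', 2))\n",
"```\n",
"\n",
"A text with fewer than n tokens yields an empty list, because no full window fits."
]
},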
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## TF-IDF\n",
"Assessing word relevancy via term frequency-inverse document frequency"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
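},
"source": [
"$$\\text{tf-idf}(t, d) = tf(t, d) \\times idf(t)$$\n",
"\n",
"- $tf(t, d)$ is the term frequency of term $t$ in document $d$.\n",
"- The inverse document frequency $idf(t)$ can be calculated as $idf(t) = \\log \\frac{n_d}{1 + df(d, t)}$, where $n_d$ is the total number of documents and $df(d, t)$ is the number of documents $d$ that contain the term $t$.\n",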
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Question: Why do we add the constant 1 to the denominator?\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The tf-idf equation implemented in scikit-learn is $\\text{tf-idf}(t, d) = tf(t, d) \\times idf(t)$, where the constant 1 is added inside $idf(t)$ so that terms occurring in every document are not ignored entirely.\n",
" \n",
"With the default [`smooth_idf=True`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer), scikit-learn uses the natural logarithm: $idf(t) = \\ln \\frac{1+n_d}{1 + df(d, t)} + 1$\n",
"\n",
"where $n_d$ is the total number of documents, and $df(d, t)$ is the number of documents $d$ that contain the term $t$. \n"
]
},
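{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"As a worked sketch of the smoothed formula, take the three-sentence corpus from the `CountVectorizer` cells above, so $n_d = 3$. Assuming the natural logarithm and `smooth_idf=True`, and noting that 'is' occurs in all three documents, 'sun' in two, and 'and' in one:\n",
"\n",
"$$idf(\\text{is}) = \\ln \\frac{1+3}{1+3} + 1 = 1$$\n",
"\n",
"$$idf(\\text{sun}) = \\ln \\frac{1+3}{1+2} + 1 \\approx 1.29$$\n",
"\n",
"$$idf(\\text{and}) = \\ln \\frac{1+3}{1+1} + 1 \\approx 1.69$$\n",
"\n",
"A term that appears in every document receives the smallest possible weight, 1, so its unnormalized tf-idf is simply its raw count."
]
},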
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
" \n",
"### L2-normalization\n",
"\n",
"$$x_{norm} = \\frac{x}{\\lVert x \\rVert_2} = \\frac{x}{\\sqrt{\\sum_i x_i^2}}$$\n",
"\n"
]
},
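{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"As a worked sketch, take the first document, 'The sun is shining', under the `ngram_range=(1, 2)` vocabulary used in this notebook. Assuming the smoothed idf values (1 for terms found in all three documents, $\\ln \\frac{4}{3} + 1 \\approx 1.2877$ for terms found in two), its raw tf-idf vector has two entries equal to 1 ('is', 'the') and five entries equal to 1.2877 ('sun', 'shining', 'the sun', 'sun is', 'is shining'):\n",
"\n",
"$$\\lVert x \\rVert_2 = \\sqrt{2 \\times 1^2 + 5 \\times 1.2877^2} \\approx \\sqrt{10.29} \\approx 3.21$$\n",
"\n",
"$$\\frac{1}{3.21} \\approx 0.31, \\qquad \\frac{1.2877}{3.21} \\approx 0.40$$\n",
"\n",
"These values agree with the first row that `TfidfTransformer` prints with `norm='l2'` for this corpus."
]
},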
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"In-class exercise: using the formulas above, compute the tf-idf value of the word 'is' in document 2.\n",
"\n",
"![](./img/ask.jpeg)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"### TfidfTransformer\n",
"Scikit-learn implements yet another transformer, the TfidfTransformer, that\n",
"takes the raw term frequencies from CountVectorizer as input and transforms\n",
"them into tf-idfs:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"ExecuteTime": {
"end_time": "2021-05-22T03:32:33.176788Z",
"start_time": "2021-05-22T03:32:33.167661Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[0. 0. 0.31 0.4 0. 0.4 0. 0.4 0.4 0. 0.31 0.4 0. 0.\n",
" 0. ]\n",
" [0. 0. 0.31 0. 0.4 0. 0. 0. 0. 0.4 0.31 0. 0.4 0.4\n",
" 0.4 ]\n",
" [0.29 0.29 0.35 0.22 0.22 0.22 0.29 0.22 0.22 0.22 0.35 0.22 0.22 0.22\n",
" 0.22]]\n"
]
}
],
"source": [
"from sklearn.feature_extraction.text import TfidfTransformer\n",
"np.set_printoptions(precision=2)\n",
"\n",
"tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)\n",
"print(tfidf.fit_transform(count.fit_transform(docs)).toarray())\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"ExecuteTime": {
"end_time": "2021-05-22T03:32:44.097290Z",
"start_time": "2021-05-22T03:32:44.090976Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[0. 0. 1. 1.29 0. 1.29 0. 1.29 1.29 0. 1. 1.29 0. 0.\n",
" 0. ]\n",
" [0. 0. 1. 0. 1.29 0. 0. 0. 0. 1.29 1. 0. 1.29 1.29\n",
" 1.29]\n",
" [1.69 1.69 2. 1.29 1.29 1.29 1.69 1.29 1.29 1.29 2. 1.29 1.29 1.29\n",
" 1.29]]\n"
]
}
],
"source": [
"from sklearn.feature_extraction.text import TfidfTransformer\n",
"np.set_printoptions(precision=2)\n",
"\n",
"tfidf = TfidfTransformer(use_idf=True, norm=None, smooth_idf=True)\n",
"print(tfidf.fit_transform(count.fit_transform(docs)).toarray())\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"ExecuteTime": {
"end_time": "2021-05-22T03:32:45.115109Z",
"start_time": "2021-05-22T03:32:45.096807Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
" and and the is is shining is sweet shining shining and \\\n",
"0 0.000000 0.000000 1.0 1.287682 0.000000 1.287682 0.000000 \n",
"1 0.000000 0.000000 1.0 0.000000 1.287682 0.000000 0.000000 \n",
"2 1.693147 1.693147 2.0 1.287682 1.287682 1.287682 1.693147 \n",
"\n",
" sun sun is sweet the the sun the weather weather \\\n",
"0 1.287682 1.287682 0.000000 1.0 1.287682 0.000000 0.000000 \n",
"1 0.000000 0.000000 1.287682 1.0 0.000000 1.287682 1.287682 \n",
"2 1.287682 1.287682 1.287682 2.0 1.287682 1.287682 1.287682 \n",
"\n",
" weather is \n",
"0 0.000000 \n",
"1 1.287682 \n",
"2 1.287682 "
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"bag = tfidf.fit_transform(count.fit_transform(docs))\n",
"pd.DataFrame(bag.toarray(), columns = count.get_feature_names())"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"ExecuteTime": {
"end_time": "2021-05-22T03:32:50.060578Z",
"start_time": "2021-05-22T03:32:50.056153Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tf-idf of term \"is\" = 2.00\n"
]
}
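],
"source": [
"# tf-idf of the term 'is' in the last document (raw, not normalized)\n",
"import numpy as np\n",
"tf_is = 2.0   # 'is' occurs twice in the last document\n",
"n_docs = 3.0  # number of documents in the corpus\n",
"df_is = 3.0   # 'is' appears in all three documents\n",
"# smooth_idf=True & norm=None\n",
"idf_is = np.log((1 + n_docs) / (1 + df_is)) + 1\n",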
"\n",
"tfidf_is = tf_is * idf_is\n",
"print('tf-idf of term \"is\" = %.2f' % tfidf_is)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"ExecuteTime": {
"end_time": "2021-05-22T03:32:53.351686Z",
"start_time": "2021-05-22T03:32:53.344649Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(array([1.69, 1.69, 2. , 1.29, 1.29, 1.29, 1.69, 1.29, 1.29, 1.29, 2. ,\n",
" 1.29, 1.29, 1.29, 1.29]),\n",
" ['and',\n",
" 'and the',\n",
" 'is',\n",
" 'is shining',\n",
" 'is sweet',\n",
" 'shining',\n",
" 'shining and',\n",
" 'sun',\n",
" 'sun is',\n",
" 'sweet',\n",
" 'the',\n",
" 'the sun',\n",
" 'the weather',\n",
" 'weather',\n",
" 'weather is'])"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Raw (unnormalized) tf-idf values of the terms in the *last document*\n",
"tfidf = TfidfTransformer(use_idf=True, norm=None, smooth_idf=True)\n",
"raw_tfidf = tfidf.fit_transform(count.fit_transform(docs)).toarray()[-1]\n",
"raw_tfidf, count.get_feature_names()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"ExecuteTime": {
"end_time": "2021-05-22T03:32:57.954615Z",
"start_time": "2021-05-22T03:32:57.950400Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([0.29, 0.29, 0.35, 0.22, 0.22, 0.22, 0.29, 0.22, 0.22, 0.22, 0.35,\n",
" 0.22, 0.22, 0.22, 0.22])"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# tf-idf values after L2 normalization\n",
"l2_tfidf = raw_tfidf / np.sqrt(np.sum(raw_tfidf**2))\n",
"l2_tfidf "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Text Mining of Government Work Reports"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 0. Reading the data"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"ExecuteTime": {
"end_time": "2021-05-22T03:33:43.031865Z",
"start_time": "2021-05-22T03:33:43.018360Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"with open('./data/gov_reports1954-2021.txt', 'r', encoding = 'utf-8') as f:\n",
" reports = f.readlines()\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"ExecuteTime": {
"end_time": "2021-05-22T03:33:44.893931Z",
"start_time": "2021-05-22T03:33:44.890399Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"52"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(reports)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"ExecuteTime": {
"end_time": "2021-05-22T03:34:39.928274Z",
"start_time": "2021-05-22T03:34:39.924465Z"
},
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1964\t1964年国务院政府工作报告(摘要)——1964年12月21日和22日在第三届全国人民代表大会第一次会议上 国务院总理周恩来 五年来,我国各族人民在中国共产党的英明领导下,高举毛泽东思想的光辉旗帜,坚持鼓足干劲、力争上游、多快好省地建设社会主义的总路线,在全国范围内开展了阶级斗争、生产斗争、科学实验三大革命运动,有力地反击了资本主义和封建势力的进攻,提高了人民群众的社会主义觉悟,基本上完成了调整国民经济的任务,使工农业生产全面高涨,整个国民经济全面好转,我国自力更生的力量大为增强。同时,在国际上,我们同美帝国主义、各国反动派和现代修正主义进行了针锋相对的斗争,打退了他们掀起的一次又一次的反华高潮;积极地支援了各国革命人民,发展了同许多国家的友好合作关系;我国的国际威望更加提高了,我们的朋友遍天下。 我们要进一步开展社会主义教育运动,坚决依靠工人阶级、贫农下中农、革命的干部、革命的知识分子和其他革命分子,根据社会主义的彻底革命的原则,在政治、经济、思想和组织这四个方面,进行清理和基本建设,在人民群众中深刻地进行一次阶级教育和社会主义教育;要进一步开展思想文化战线上的社会主义革命,逐步实现知识分子劳动化,劳动人民知识化;要进一步巩固和发展人民民主统一战线,加强各民族的大团结;各级机关和各级干部必须革命化,都要学习解放军、大庆、大寨的彻底革命的精神和工作作风。在深入广泛开展社会主义教育运动的基础上,一九六五年要大力组织工农业生产的新高潮,为一九六六年开始的第三个五年计划作好准备,争取在不太长的历史时期内,把我国建成一个具有现代农业、现代工业、现代国防和现代科学技术的社会主义强国。在国际方面,我们要继续贯彻我国对外政策的总路线,同全世界人民一起,坚决反对美帝国主义及其走狗,为争取世界和平、民族解放、人民民主和社会主义事业的新胜利而奋斗。 国民经济的成就和今后的建设任务 周恩来总理在报告中首先指出,从第二届全国人民代表大会第一次会议以来,我国各族人民,在中国共产党的英明领导下,高举毛泽东思想的光辉旗帜,坚持鼓足干劲、力争上游、多快好省地建设社会主义的总路线,在全国范围内展开了阶级斗争、生产斗争、科学实验三大革命运动,在国际上同帝国主义、各国反动派和现代修正主义进行了针锋相对的斗争,取得了一个又一个的伟大胜\n"
]
}
],
"source": [
"print(reports[-7][:1000])"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-04T03:28:25.061141Z",
"start_time": "2020-06-04T03:28:25.056942Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1959\t\t\t\t\t\t1959年国务院政府工作报告\t——1959年4月18日在第二届全国人民代表大会第一次会议上\t 国务院总理周恩来\t各位代表:\t 我现在根据国务院的决定,向第二届全国人民代表大会第一次会议作政府工作报告。\t 一、第一个五年计划时期内和第二个五年计划的第一年——一九五八年的伟大成就\t 在第一届全国人民代表大会的四年多的任期中间,我们的国家经历了一系列的具有重大历史意义的变化。\t 当一九五四年第一届全国人民代表大会第一次会议召开的时候,我国社会主义经济已经在国民经济中居于主导的地位,但是,我国还存在着大量的资本主义的工业和商业,并且大量地存在着个体的农业和手工业。农村中劳动互助运动已经广泛地发展起来,参加农业劳动互助组的农户达到了百分之六十左右,但是,组成农业生产合作社的农户还只占农户总数的百分之二左右。在那时候,我国已经完成了经济恢复时期的任务,开始了大规模的、有计划的经济建设。但是,究竟我们能不能在一个较短的时间内,使我国这样一个有六亿多人口的大国,建立起社会主义工业化的基础来,还有待于事实的证明。而现在呢?大家看到,只经过四年\n"
]
}
],
"source": [
"print(reports[4][:500])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
" pip install jieba\n",
"> https://github.com/fxsjy/jieba\n",
"\n",
" pip install wordcloud\n",
"> https://github.com/amueller/word_cloud\n",
"\n",
" pip install gensim\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-04T03:30:39.183616Z",
"start_time": "2020-06-04T03:29:45.923640Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting gensim\n",
" Downloading gensim-3.8.3-cp37-cp37m-macosx_10_9_x86_64.whl (24.2 MB)\n",
"\u001b[K |████████████████████████████████| 24.2 MB 376 kB/s eta 0:00:01\n",
"\u001b[?25hRequirement already satisfied: scipy>=0.18.1 in /opt/anaconda3/lib/python3.7/site-packages (from gensim) (1.4.1)\n",
"Requirement already satisfied: numpy>=1.11.3 in /opt/anaconda3/lib/python3.7/site-packages (from gensim) (1.18.1)\n",
"Requirement already satisfied: six>=1.5.0 in /opt/anaconda3/lib/python3.7/site-packages (from gensim) (1.14.0)\n",
"Collecting smart-open>=1.8.1\n",
" Downloading smart_open-2.0.0.tar.gz (103 kB)\n",
"\u001b[K |████████████████████████████████| 103 kB 625 kB/s eta 0:00:01\n",
"\u001b[?25hRequirement already satisfied: requests in /opt/anaconda3/lib/python3.7/site-packages (from smart-open>=1.8.1->gensim) (2.22.0)\n",
"Requirement already satisfied: boto in /opt/anaconda3/lib/python3.7/site-packages (from smart-open>=1.8.1->gensim) (2.49.0)\n",
"Requirement already satisfied: boto3 in /opt/anaconda3/lib/python3.7/site-packages (from smart-open>=1.8.1->gensim) (1.9.191)\n",
"Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/anaconda3/lib/python3.7/site-packages (from requests->smart-open>=1.8.1->gensim) (1.25.8)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /opt/anaconda3/lib/python3.7/site-packages (from requests->smart-open>=1.8.1->gensim) (2019.11.28)\n",
"Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /opt/anaconda3/lib/python3.7/site-packages (from requests->smart-open>=1.8.1->gensim) (3.0.4)\n",
"Requirement already satisfied: idna<2.9,>=2.5 in /opt/anaconda3/lib/python3.7/site-packages (from requests->smart-open>=1.8.1->gensim) (2.8)\n",
"Requirement already satisfied: s3transfer<0.3.0,>=0.2.0 in /opt/anaconda3/lib/python3.7/site-packages (from boto3->smart-open>=1.8.1->gensim) (0.2.1)\n",
"Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /opt/anaconda3/lib/python3.7/site-packages (from boto3->smart-open>=1.8.1->gensim) (0.9.5)\n",
"Requirement already satisfied: botocore<1.13.0,>=1.12.191 in /opt/anaconda3/lib/python3.7/site-packages (from boto3->smart-open>=1.8.1->gensim) (1.12.191)\n",
"Requirement already satisfied: docutils>=0.10 in /opt/anaconda3/lib/python3.7/site-packages (from botocore<1.13.0,>=1.12.191->boto3->smart-open>=1.8.1->gensim) (0.16)\n",
"Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /opt/anaconda3/lib/python3.7/site-packages (from botocore<1.13.0,>=1.12.191->boto3->smart-open>=1.8.1->gensim) (2.8.1)\n",
"Building wheels for collected packages: smart-open\n",
" Building wheel for smart-open (setup.py) ... \u001b[?25ldone\n",
"\u001b[?25h Created wheel for smart-open: filename=smart_open-2.0.0-py3-none-any.whl size=101341 sha256=373e4939f516de66ae607886c52d7a00529e0930868ce9f5ae1ecec2297f40f4\n",
" Stored in directory: /Users/datalab/Library/Caches/pip/wheels/bb/1c/9c/412ec03f6d5ac7d41f4b965bde3fc0d1bd201da5ba3e2636de\n",
"Successfully built smart-open\n",
"Installing collected packages: smart-open, gensim\n",
"Successfully installed gensim-3.8.3 smart-open-2.0.0\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install gensim"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"ExecuteTime": {
"end_time": "2021-05-22T03:35:23.275747Z",
"start_time": "2021-05-22T03:35:23.268614Z"
},
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.cm as cm\n",
"import matplotlib.pyplot as plt\n",
"import sys \n",
"import numpy as np\n",
"from collections import defaultdict\n",
"import statsmodels.api as sm\n",
"from wordcloud import WordCloud\n",
"import jieba\n",
"import matplotlib\n",
"import gensim\n",
"from gensim import corpora, models, similarities\n",
"from gensim.utils import simple_preprocess\n",
"from gensim.parsing.preprocessing import STOPWORDS\n",
"# matplotlib.rcParams['font.sans-serif'] = ['Microsoft YaHei']  # set the default font\n",
"matplotlib.rc(\"savefig\", dpi=400)"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {
"ExecuteTime": {
"end_time": "2019-06-14T15:54:22.523158Z",
"start_time": "2019-06-14T15:54:22.520252Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"# To make sure Chinese characters render correctly in matplotlib\n",
"# matplotlib.rcParams['font.sans-serif'] = ['Microsoft YaHei']  # set the default font\n",
"# Requires the Microsoft YaHei font to be installed on the system"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {
"ExecuteTime": {
"end_time": "2019-06-14T15:54:23.082594Z",
"start_time": "2019-06-14T15:54:23.080029Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"# import matplotlib\n",
"# my_font = matplotlib.font_manager.FontProperties(\n",
"# fname='/Users/chengjun/github/cjc/data/msyh.ttf')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 1. Word segmentation"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"ExecuteTime": {
"end_time": "2021-05-22T03:35:26.777933Z",
"start_time": "2021-05-22T03:35:26.771814Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Full Mode: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学\n",
"Default Mode: 我/ 来到/ 北京/ 清华大学\n",
"他, 来到, 了, 网易, 杭研, 大厦\n",
"小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, ,, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造\n"
]
}
],
"source": [
"import jieba\n",
"\n",
"seg_list = jieba.cut(\"我来到北京清华大学\", cut_all=True)\n",
"print(\"Full Mode: \" + \"/ \".join(seg_list)) # full mode\n",
"\n",
"seg_list = jieba.cut(\"我来到北京清华大学\", cut_all=False)\n",
"print(\"Default Mode: \" + \"/ \".join(seg_list)) # accurate mode\n",
"\n",
"seg_list = jieba.cut(\"他来到了网易杭研大厦\") # accurate mode is the default\n",
"print(\", \".join(seg_list))\n",
"\n",
"seg_list = jieba.cut_for_search(\"小明硕士毕业于中国科学院计算所,后在日本京都大学深造\") # search-engine mode\n",
"print(\", \".join(seg_list))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 2. Stop words"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"ExecuteTime": {
"end_time": "2021-05-22T03:35:30.325550Z",
"start_time": "2021-05-22T03:35:30.320581Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"filename = './data/stopwords.txt'\n",
"stopwords = {}\n",
"# Read one stop word per line; the value 1 marks words from the base list.\n",
"# Assumes the file is UTF-8 encoded; blank lines are skipped.\n",
"with open(filename, 'r', encoding='utf-8') as f:\n",
"    for line in f:\n",
"        word = line.rstrip()\n",
"        if word:\n",
"            stopwords[word] = 1"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-04T03:56:17.990921Z",
"start_time": "2020-06-04T03:56:17.987684Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# Extra corpus-specific stop words; the value 10 marks them as manually added\n",
"adding_stopwords = ['我们', '要', '地', '有', '这', '人',\n",
"                    '发展', '建设', '加强', '继续', '对', '等',\n",
"                    '推进', '工作', '增加']\n",
"for s in adding_stopwords:\n",
"    stopwords[s] = 10"
]
},
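{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"One way a stop-word dictionary like this might be used is to filter a list of segmented tokens. The following is a minimal sketch, assuming the tokens have already been produced (for example by `jieba.cut`); the names `remove_stopwords`, `demo_stopwords`, and `demo_tokens` are hypothetical, and the tiny dictionary stands in for the real `stopwords`.\n",
"\n",
"```python\n",
"def remove_stopwords(tokens, stopword_dict):\n",
"    '''Keep tokens that are not blank and are not keys of stopword_dict.'''\n",
"    return [t for t in tokens if t.strip() and t not in stopword_dict]\n",
"\n",
"# Hypothetical stand-ins for the real stop-word dictionary and a segmented sentence\n",
"demo_stopwords = {'我们': 10, '要': 10, '的': 1}\n",
"demo_tokens = ['我们', '要', '提高', '农业', '的', '产量']\n",
"\n",
"print(remove_stopwords(demo_tokens, demo_stopwords))\n",
"```\n",
"\n",
"Because the lookup only tests dictionary keys, the values (1 or 10) do not affect filtering; they only record where each stop word came from."
]
},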
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 3. Keyword extraction"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Keyword extraction based on the TF-IDF algorithm"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"ExecuteTime": {
"end_time": "2021-05-22T03:35:40.012338Z",
"start_time": "2021-05-22T03:35:39.202333Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import jieba.analyse\n",
"txt = reports[-1]  # the last report in the file\n",
"# Top 200 keywords as (word, TF-IDF weight) pairs\n",
"tf = jieba.analyse.extract_tags(txt, topK=200, withWeight=True)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"ExecuteTime": {
"end_time": "2021-05-22T03:35:48.291509Z",
"start_time": "2021-05-22T03:35:48.287281Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'人民、我们、国家、我国、一九五三年、工业、一九五四年、必须、工作、建设、发展、和平、一九四九年、社会主义、一九五、国家机关、生产、计划、全国、农业、亚洲、美国、事业、企业、应当、经济、这些、改造、增加、并且、完成、但是、已经、等于、合作社、集团、方面、需要、台湾、资本主义、反对、几年、生活、建立、为了、技术、这个、进行、日内瓦、问题'"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"u\"、\".join([i[0] for i in tf[:50]])"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"ExecuteTime": {
"end_time": "2021-05-22T03:36:22.004539Z",
"start_time": "2021-05-22T03:36:21.853310Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAX8AAAD6CAYAAABJTke4AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAQkklEQVR4nO3df4xldXnH8fdHYAXMYooMXdhddo1IC9jEyjQai5ViLVL+KSYaRKzYwtpICq1Va2z/INVuMGolUtN2K5VWSFtalRYpsSwF25jadPihCBFcYBeWAg5gUtBS2Z2nf8xdvazLznLOmb2z+32/khvu/Z7znfM8c2Y/99xz7z2kqpAkteUFky5AkrT3Gf6S1CDDX5IaZPhLUoMMf0lq0IGTLuCII46otWvXTroMSdqn3HLLLY9V1VTX+RMP/7Vr1zIzMzPpMiRpn5JkS5/5nvaRpAYZ/pLUoD0O/ySHJDluMYuRJO0dC4Z/ksOSXAM8Cnxgp2WHJ3k0yR+MjV2SZGuSO5KcNHzJkqS+9uTIfw64DHjvLpZ9Arh1x4MkpwInA2tH61/ev0RJ0tAWDP+qeqqqbgS2jY8necNo7D/Hht8MXFFV26rqBmAqyYohC5Yk9dfpDd8khwAfBn5vp0WrgfGPHz0EHLWL+euSzCSZmZ2d7VKCJKmHrp/2uRj4dFU9sdP4MuZPE+0wB2zfeXJVbaiq6aqanprq/B0FSVJHXb/kdTZwWpL3AyuASnI/8DCwcmy9o4Gt/UqUJA2tU/hX1eod95NcDGyrqquSPA28J8lVwKnAPbt4dTCYtR+8brF+9G5tvuSMiWxXkoayYPgnWQ7cBiwHDk5yCnB+Vd20i9W/CLweuA94nPlXCJKkJWbB8K+qJ4Fjd7P84rH7c8CFo5skaYny8g6S1CDDX5IaZPhLUoMMf0lqkOEvSQ0y/CWpQYa/JDXI8JekBhn+ktQgw1+SGmT4S1KDDH9JapDhL0kNMvwlqUGGvyQ1yPCXpAYZ/pLUIMNfkhpk+EtSgwx/SWrQHod/kkOSHLeYxUiS9o4Fwz/JYUmuAR4FPjAae0mSv0vy7ST3JjlrbP1LkmxNckeSkxavdElSV3ty5D8HXAa8d2xsCvjTqno5cBrwZ0kOSnIqcDKwdrT+5cOWK0kawoLhX1VPVdWNwLaxsW9V1c2j+5uAZ4BDgDcDV1TVtqq6AZhKsmJRKpckddb7Dd8kpwO3VtX/AKuBLWOLHwKO2sWcdUlmkszMzs72LUGS9Dz1Cv8kxwIfA949GlrG/GmiHeaA7TvPq6oNVTVdVdNTU1N9SpAkddA5/JOsAf4B+LWq2jwafhhYObba0cDWztVJkhZFp/BPshL4AnB+Vd06tug64J1JDkjyRuCeqnpigDolSQM6cKEVkiwHbgOWAwcnOQUIcATwN0l2rHoC8EXg9cB9wOPA2cOXLEnqa8Hwr6ongWOfx8+8cHSTJC1RXt5Bkhpk+EtSgwx/SWqQ4S9JDTL8JalBhr8kNcjwl6QGGf6S1CDDX5IaZPhLUoMMf0lqkOEvSQ0y/CWpQYa/JDXI8JekBhn+ktQgw1+SGmT4S1KDDH9JapDhL0kN2uPwT3JIkuMWsxhJ0t6xYPgnOSzJNcCjwAfGxi9K8kCSu5OcPjZ+SZKtSe5IctLilC1J6uPAPVhnDrgM+BLwGoAkLwMuAE4EVgMbk6wBXgecDKwFfhG4HHjl4FVLknpZ8Mi/qp6qqhuBbWPDZwJXV9WTVXUXsBk4CXgzcEVVbauqG4CpJCsWoW5JUg9d3/BdDWwZe7wVOGoX4w+Nxp8lybokM0lmZmdnO5YgSeqqa/gvY/500A5zwPbdjD9LVW2oqumqmp6amupYgiSpq67h/zCwcuzxKuDBXYwfzfyrAknSEtI1/K8DzkpyaJLjgcOB20fj70xyQJI3AvdU1RMD1SpJGsiCn/ZJshy4DVgOHJzkFOB84ErgTu
Bp4LyqqiRfBF4P3Ac8Dpy9SHVLknpYMPyr6kng2F0suglYv9O6c8CFo5skaYny8g6S1CDDX5IaZPhLUoMMf0lqkOEvSQ0y/CWpQYa/JDXI8JekBhn+ktQgw1+SGmT4S1KDDH9JapDhL0kNMvwlqUGGvyQ1yPCXpAYZ/pLUIMNfkhpk+EtSgwx/SWqQ4S9JDeoV/knem+TbSe5PcsFo7KIkDyS5O8npw5QpSRrSgV0nJlkLXAicCBwM3JfkZuCC0dhqYGOSNVX1TO9KJUmD6XPkvyPQ55h/EnkK+BXg6qp6sqruAjYDJ/WqUJI0uM7hX1UPARcDXwM2Am8DVgFbxlbbChy189wk65LMJJmZnZ3tWoIkqaPO4Z/kMOBs4CLgj4H3AcuYfyWwwxywfee5VbWhqqaranpqaqprCZKkjvqc9jkH+EZV3VxVnx2NPQKsHFtnFfBgj21IkhZBn/B/GnhlkoOSLAeOY/70z1lJDk1yPHA4cPsAdUqSBtT50z7AlcCpwH3A/wJ/VVVfTXIlcCfzTw7nVVX1L1OSNKTO4V9VP2D+1M/O4+uB9X2KkiQtLr/hK0kNMvwlqUGGvyQ1yPCXpAYZ/pLUIMNfkhpk+EtSgwx/SWqQ4S9JDTL8JalBhr8kNcjwl6QGGf6S1CDDX5IaZPhLUoMMf0lqkOEvSQ0y/CWpQYa/JDXI8JekBhn+ktSgXuGf5MVJ/jbJQ0nuTbIsyUVJHkhyd5LThypUkjScA3vOvwz4JvA24IXAauAC4MTR/Y1J1lTVMz23I0kaUOcj/yQrgNcC62ve08CZwNVV9WRV3QVsBk4apFJJ0mD6HPmfCNwPfD7JCcC1wEHMvxLYYStw1M4Tk6wD1gEcc8wxPUqQJHXRJ/yPBE4AXg18F9gIrAC+MbbOHLB954lVtQHYADA9PV09apAkddAn/L8D3FJVWwGS3MB80K8cW2cV8GCPbUiSFkGfT/t8DTghydFJXgj8EvAUcFaSQ5McDxwO3D5AnZKkAXU+8q+q7yX5LeAG5j/pc0VVfWL0RHAn8DRwXlV5WkeSlpheH/WsquuB63caWw+s7/NzJUmLy2/4SlKDDH9JapDhL0kNMvwlqUGGvyQ1yPCXpAYZ/pLUIMNfkhpk+EtSgwx/SWqQ4S9JDTL8JalBhr8kNcjwl6QGGf6S1CDDX5IaZPhLUoMMf0lqkOEvSQ0y/CWpQYa/JDWoV/gnWZbkriSfGT2+KMkDSe5OcvowJUqShnZgz/kfAjYDJHkZcAFwIrAa2JhkTVU903MbkqSBdT7yT3I88HPA1aOhM4Grq+rJqrqL+SeFk3pXKEkaXKfwTxLgU8BFY8OrgS1jj7cCRz3H/HVJZpLMzM7OdilBktRD1yP/3wRurqpNY2PLgLmxx3PA9l1NrqoNVTVdVdNTU1MdS5AkddX1nP87gOVJ3gIcDryI+VcCK8fWWQU82K88SdJi6BT+VfXaHfeTnAucDHwJ+FySjwNrmH9SuH2AGpectR+8bmLb3nzJGRPbtqT9R99P+/xQVd2S5ErgTuBp4LyqqqF+viRpOL3Dv6quAK4Y3V8PrO/7MyVJi8tv+EpSgwx/SWqQ4S9JDTL8JalBhr8kNcjwl6QGGf6S1CDDX5IaZPhLUoMMf0lqkOEvSQ0y/CWpQYa/JDXI8JekBhn+ktQgw1+SGmT4S1KDDH9JapDhL0kNMvwlqUGGvyQ1qHP4Jzk4yYYkdyfZkuR3RuMXJXlgNH76cKVKkoZyYI+5LwK+DLwbeAlwZ5JbgQuAE4HVwMYka6rqmd6VSpIG0/nIv6oer6rP17zHgAeBXwCurqonq+ouYDNw0jClSpKGMsg5/ySvAA4GjgC2jC3aChy1i/XXJZlJMjM7OztECZKk56F3+Cc5Avgc8C5gGTA3tngO2L7znKraUFXTVTU9NTXVtwRJ0vPUK/yT/ARwLfChqvov4GFg5dgqq5g/HSRJWkL6fNrnMOCfgD+qqutHw9cBZyU5NMnxwO
HA7f3LlCQNqc+R/4XAq4BLk2xKsgn4LnAlcCfwBeD8qqr+ZUqShtT5o55V9RHgI7tYtH50kyQtUX7DV5IaZPhLUoMMf0lqkOEvSQ0y/CWpQYa/JDXI8JekBhn+ktQgw1+SGmT4S1KDDH9JapDhL0kNMvwlqUF9/gfumoC1H7xuItvdfMkZE9mupMXhkb8kNcgjf+2RSb3iAF91SIvBI39JapDhL0kN8rSPljzf5JaG55G/JDXI8JekBg1+2ifJW4GPAtuB9VX1l0NvQ9rfeapr72n1dz1o+CdZDnwCeA3z4X97kmuranbI7Uh7wyQ/3jopLfbcqqFP+5wGfKWqHqqqR4B/Bd4w8DYkST0NfdpnNbBl7PFW4KidV0qyDlg3evhUkrsHruP5OgJ4bMI1DME+lp79pRf7GFg+2vtH/FSfyUOH/zJgbuzxHPOnf56lqjYAGwbedmdJZqpqetJ19GUfS8/+0ot9LD1JZvrMH/q0z8PAyrHHq4AHB96GJKmnocP/y8BpSY5MsgJ4LfAvA29DktTToKd9qurRJL8P/Mdo6Her6ntDbmORLJlTUD3Zx9Kzv/RiH0tPr15SVUMVIknaR/gNX0lqkOEvSQ0y/J9DkoOSnDDpOiRpMez34Z/krUnuT7Ipya/vtOwVSb6eZEuSy5K8YDR+JfMfW/3UJGp+Lh17uSTJt5I8kOTjk6n82Tr28bkkd4/mvWMylT9blz7Glm9MsnHvVrxrHffHzUk2j+ZsSnLAZKp/to69HJTk00keGs1dM5nqn1Xr8+ojyTlj+2JTkqeSvGe3G6mq/fYGLGf+ewYrgRXAI8DU2PJ/A04HDgC+AvzqaPwM4E3Axkn3MEAv54/GDgW+Cfz8PtrH0aP//jTw2L66P0bLzgX+eSn8ffXYHzcDaydd/0C9/CHw56PxA4GD9sU+xpa/APg2cMTutrO/H/k/57WGkkwBL62q66tqO3AV84FPVV0HPD2hmp9L117+oqq2V9X3gW8Bh0+m/B/q2sd/j+avBb6+16v+cZ36GC37DeDSyZT9Yzr1sUQ9716SHAS8C3j/6N/Jtqp6ZlINjPTdJ28Cbquq3V7GYn8P/91da2gV8MBzLFuKevUy+tLdzwI3LWKNe6JTH0nenuQR4HLg/XuhzoV03R+XAh8CfrDYBe6hrn38H3BTktuSnLPoVe6ZLr0cAzwBXJrkniSfTXLw3ih2N/rm1nnM/zvZrf09/Hd3raE9ug7REtK5lySHAFcDv11VTy1ynQvp1EdVXVVVK4C3AtckOWwv1Lo7z7uPJG8CvldV/753StwjXffHaVX1UuDtwMeS9LrI2EC69HIk8DLgk8AJwIv50UUnJ6XPv/UVwCuBGxbayP4e/ru71tC+dh2iTr0keSHwBeCvq+ravVDnQnrtk6r6KvNHPsctYo17oksf7wJOTnI78Bng1Uk+uRdq3Z2+++Mu4KvA8YtY457q0st3gPuq6o6q2gZ8iZ5XyxxAn31yLnBlVY0/QezapN+kWeQ3Tn4SeIj5Z/cVwH3Ai8aW3wGcwo/eODl5bNkpLIE35Pr0AhwE/CPw7knX37OPI4FjR8tfPvpjP2xf62On+Uvi76trH2P7Y81of6zdF3sBAnwD+BnmD4b/Hjh3X+tjNB7gHubfE1h4O5PeYXvhF3kucO/odubo9r7RsleNfpEPAh8em3Pr6Jf/fWATcM6k++jSC3AO8+eWN43dXrcP9rFqNHYfcBvwy5Puoevf1tjcU1gC4d+1j9HY/cCdwFsm3UPPXqZHf1f3An8CvGAf7eNU4MY93YbX9pGkBu3v5/wlSbtg+EtSgwx/SWqQ4S9JDTL8JalBhr8kNcjwl6QGGf6S1KD/B5l/9/JxfIcUAAAAAElFTkSuQmCC\n",
"text/plain": [
"