{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 抓取江苏省政协十年提案"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"打开 http://www.jszx.gov.cn/zxta/2022ta/\n",
"\n",
"- 点击下一页,url不变!\n",
"\n",
"> 所以数据的更新是使用js推送的\n",
"- 分析network中的内容,发现proposalList.jsp\n",
" - 查看它的header,并发现了form_data\n",
" \n",
"\n",
"\n",
"http://www.jszx.gov.cn/wcm/zxweb/proposalList.jsp 无法在新的tab中打开\n",
"\n",
"\n",
"根据form data重构url\n",
"\n",
"http://www.jszx.gov.cn/wcm/zxweb/proposalList.jsp?year=2022&pagenum=1&pagesize=20"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2019-10-10T01:38:35.751604Z",
"start_time": "2019-10-10T01:38:35.464535Z"
}
},
"outputs": [],
"source": [
"import requests\n",
"from bs4 import BeautifulSoup"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2019-10-10T01:38:44.359398Z",
"start_time": "2019-10-10T01:38:44.270556Z"
}
},
"outputs": [],
"source": [
"form_data = {'year':2022, # change it to the current year\n",
" 'pagenum':1,\n",
" 'pagesize':20\n",
"}\n",
"url = 'http://www.jszx.gov.cn/wcm/zxweb/proposalList.jsp'\n",
"content = requests.get(url, form_data)\n",
"content.encoding = 'utf-8'\n",
"js = content.json()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2019-10-10T01:38:52.271556Z",
"start_time": "2019-10-10T01:38:52.262613Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'630'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"js['data']['totalcount']"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2019-10-10T01:39:00.241237Z",
"start_time": "2019-10-10T01:39:00.238570Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"32.0"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dat = js['data']['list']\n",
"pagenum = js['data']['pagecount']\n",
"pagenum"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"ExecuteTime": {
"end_time": "2019-10-10T01:39:16.463805Z",
"start_time": "2019-10-10T01:39:11.445898Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"32\r"
]
}
],
"source": [
"for i in range(2, int(pagenum)+1):\n",
" print(i, end = '\\r')\n",
" form_data['pagenum'] = i\n",
" content = requests.get(url, form_data)\n",
" content.encoding = 'utf-8'\n",
" js = content.json()\n",
" for j in js['data']['list']:\n",
" dat.append(j)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"ExecuteTime": {
"end_time": "2019-10-10T01:39:23.103499Z",
"start_time": "2019-10-10T01:39:23.100188Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"630"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(dat)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"ExecuteTime": {
"end_time": "2019-10-10T01:39:31.057027Z",
"start_time": "2019-10-10T01:39:31.053393Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'rownum': 1,\n",
" 'proposal_number': '0001',\n",
" 'reason': '关于深入落实长江大保护战略,推动我省沿江化工产业绿色高质量发展的建议',\n",
" 'pkid': 'dd619f014d23456cb403ceb12506739a',\n",
" 'year': '2022',\n",
" 'publish_time': '2022-01-18 16:12:23',\n",
" 'personnel_name': '严华',\n",
" 'type': '工业商贸'}"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dat[0]"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"ExecuteTime": {
"end_time": "2019-10-10T01:39:40.176411Z",
"start_time": "2019-10-10T01:39:39.125256Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" rownum | \n",
" proposal_number | \n",
" reason | \n",
" pkid | \n",
" year | \n",
" publish_time | \n",
" personnel_name | \n",
" type | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 0001 | \n",
" 关于深入落实长江大保护战略,推动我省沿江化工产业绿色高质量发展的建议 | \n",
" dd619f014d23456cb403ceb12506739a | \n",
" 2022 | \n",
" 2022-01-18 16:12:23 | \n",
" 严华 | \n",
" 工业商贸 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 0002 | \n",
" 关于重视人工智能应用安全的建议 | \n",
" df4b6c2109af42b2a04b135212923f98 | \n",
" 2022 | \n",
" 2022-01-18 10:29:37 | \n",
" 仲盛 | \n",
" 科学技术 | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 0004 | \n",
" 关于打造软件信息产业联动先行区的建议 | \n",
" 7f97456a314444c3b59ced0374bb01fc | \n",
" 2022 | \n",
" 2022-01-18 16:12:23 | \n",
" 钱再见 | \n",
" 工业商贸 | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" 0005 | \n",
" 关于设立“江苏工匠日”的建议 | \n",
" f5f0aa468ecf4af5be2438393d54a49d | \n",
" 2022 | \n",
" 2022-01-18 16:06:13 | \n",
" 马永青等9人 | \n",
" 文化宣传 | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" 0006 | \n",
" 关于进一步重视和支持企业提升人才吸引力的建议 | \n",
" a666191fb1644a5f83009ac1a0dd5e5b | \n",
" 2022 | \n",
" 2022-01-19 19:23:47 | \n",
" 甘霖 | \n",
" 社会事业 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" rownum proposal_number reason \\\n",
"0 1 0001 关于深入落实长江大保护战略,推动我省沿江化工产业绿色高质量发展的建议 \n",
"1 2 0002 关于重视人工智能应用安全的建议 \n",
"2 3 0004 关于打造软件信息产业联动先行区的建议 \n",
"3 4 0005 关于设立“江苏工匠日”的建议 \n",
"4 5 0006 关于进一步重视和支持企业提升人才吸引力的建议 \n",
"\n",
" pkid year publish_time personnel_name \\\n",
"0 dd619f014d23456cb403ceb12506739a 2022 2022-01-18 16:12:23 严华 \n",
"1 df4b6c2109af42b2a04b135212923f98 2022 2022-01-18 10:29:37 仲盛 \n",
"2 7f97456a314444c3b59ced0374bb01fc 2022 2022-01-18 16:12:23 钱再见 \n",
"3 f5f0aa468ecf4af5be2438393d54a49d 2022 2022-01-18 16:06:13 马永青等9人 \n",
"4 a666191fb1644a5f83009ac1a0dd5e5b 2022 2022-01-19 19:23:47 甘霖 \n",
"\n",
" type \n",
"0 工业商贸 \n",
"1 科学技术 \n",
"2 工业商贸 \n",
"3 文化宣传 \n",
"4 社会事业 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.DataFrame(dat)\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"ExecuteTime": {
"end_time": "2019-10-10T01:39:49.950879Z",
"start_time": "2019-10-10T01:39:49.941246Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"type\n",
"农林水利 69\n",
"医卫体育 69\n",
"城乡建设 31\n",
"工业商贸 89\n",
"政治建设 12\n",
"教育事业 68\n",
"文化宣传 33\n",
"法制建设 23\n",
"社会事业 92\n",
"科学技术 18\n",
"经济发展 69\n",
"统战综合 5\n",
"财税金融 14\n",
"资源环境 38\n",
"dtype: int64"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('type').size()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 抓取提案内容\n",
"http://www.jszx.gov.cn/zxta/2019ta/index_61.html?pkid=18b1b347f9e34badb8934c2acec80e9e"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"ExecuteTime": {
"end_time": "2019-10-10T01:40:17.900495Z",
"start_time": "2019-10-10T01:40:17.896621Z"
}
},
"outputs": [],
"source": [
"url_base = 'http://www.jszx.gov.cn/wcm/zxweb/proposalInfo.jsp?pkid='\n",
"urls = [url_base + i for i in df['pkid']]"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"http://www.jszx.gov.cn/wcm/zxweb/proposalInfo.jsp?pkid=dd619f014d23456cb403ceb12506739a\n",
"http://www.jszx.gov.cn/wcm/zxweb/proposalInfo.jsp?pkid=df4b6c2109af42b2a04b135212923f98\n",
"http://www.jszx.gov.cn/wcm/zxweb/proposalInfo.jsp?pkid=7f97456a314444c3b59ced0374bb01fc\n",
"http://www.jszx.gov.cn/wcm/zxweb/proposalInfo.jsp?pkid=f5f0aa468ecf4af5be2438393d54a49d\n",
"http://www.jszx.gov.cn/wcm/zxweb/proposalInfo.jsp?pkid=a666191fb1644a5f83009ac1a0dd5e5b\n"
]
}
],
"source": [
"for i in urls[:5]:\n",
" print(i)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"ExecuteTime": {
"end_time": "2019-10-10T01:41:01.068932Z",
"start_time": "2019-10-10T01:40:37.241768Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"629\r"
]
}
],
"source": [
"text = []\n",
"for k, i in enumerate(urls):\n",
" print(k, end = '\\r')\n",
" content = requests.get(i)\n",
" content.encoding = 'utf-8'\n",
" js = content.json()\n",
" js = js['data']['binfo']['_content']\n",
" soup = BeautifulSoup(js, 'html.parser') \n",
" text.append(soup.text)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"ExecuteTime": {
"end_time": "2019-10-10T01:41:02.945741Z",
"start_time": "2019-10-10T01:41:02.942079Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"630"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(text)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"ExecuteTime": {
"end_time": "2019-10-10T01:41:11.704331Z",
"start_time": "2019-10-10T01:41:11.700986Z"
}
},
"outputs": [],
"source": [
"df['content'] = text"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"ExecuteTime": {
"end_time": "2019-10-10T01:41:19.726270Z",
"start_time": "2019-10-10T01:41:19.715176Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" rownum | \n",
" proposal_number | \n",
" reason | \n",
" pkid | \n",
" year | \n",
" publish_time | \n",
" personnel_name | \n",
" type | \n",
" content | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 0001 | \n",
" 关于深入落实长江大保护战略,推动我省沿江化工产业绿色高质量发展的建议 | \n",
" dd619f014d23456cb403ceb12506739a | \n",
" 2022 | \n",
" 2022-01-18 16:12:23 | \n",
" 严华 | \n",
" 工业商贸 | \n",
" 调研情况:化工产业是江苏省支柱产业之一,是我省重要的基础性产业,产业规模、行业基础、技术水平... | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 0002 | \n",
" 关于重视人工智能应用安全的建议 | \n",
" df4b6c2109af42b2a04b135212923f98 | \n",
" 2022 | \n",
" 2022-01-18 10:29:37 | \n",
" 仲盛 | \n",
" 科学技术 | \n",
" 调研情况:习近平总书记强调:“人工智能是新一轮科技革命和产业变革的重要驱动力量,加快发展新一... | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 0004 | \n",
" 关于打造软件信息产业联动先行区的建议 | \n",
" 7f97456a314444c3b59ced0374bb01fc | \n",
" 2022 | \n",
" 2022-01-18 16:12:23 | \n",
" 钱再见 | \n",
" 工业商贸 | \n",
" 调研情况: 2021年2月8日,南京都市圈发展规划获国家发改委批复,要求以区域间的就近性、互... | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" 0005 | \n",
" 关于设立“江苏工匠日”的建议 | \n",
" f5f0aa468ecf4af5be2438393d54a49d | \n",
" 2022 | \n",
" 2022-01-18 16:06:13 | \n",
" 马永青等9人 | \n",
" 文化宣传 | \n",
" 调研情况:近年来,省政协总工会界别委员认真学习贯彻党的十九大精神,围绕省委省政府和省“两会”... | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" 0006 | \n",
" 关于进一步重视和支持企业提升人才吸引力的建议 | \n",
" a666191fb1644a5f83009ac1a0dd5e5b | \n",
" 2022 | \n",
" 2022-01-19 19:23:47 | \n",
" 甘霖 | \n",
" 社会事业 | \n",
" 调研情况:为进一步加大对民营经济高质量发展支持力度,我省出台《关于促进民营经济高质量发展的意... | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" rownum proposal_number reason \\\n",
"0 1 0001 关于深入落实长江大保护战略,推动我省沿江化工产业绿色高质量发展的建议 \n",
"1 2 0002 关于重视人工智能应用安全的建议 \n",
"2 3 0004 关于打造软件信息产业联动先行区的建议 \n",
"3 4 0005 关于设立“江苏工匠日”的建议 \n",
"4 5 0006 关于进一步重视和支持企业提升人才吸引力的建议 \n",
"\n",
" pkid year publish_time personnel_name \\\n",
"0 dd619f014d23456cb403ceb12506739a 2022 2022-01-18 16:12:23 严华 \n",
"1 df4b6c2109af42b2a04b135212923f98 2022 2022-01-18 10:29:37 仲盛 \n",
"2 7f97456a314444c3b59ced0374bb01fc 2022 2022-01-18 16:12:23 钱再见 \n",
"3 f5f0aa468ecf4af5be2438393d54a49d 2022 2022-01-18 16:06:13 马永青等9人 \n",
"4 a666191fb1644a5f83009ac1a0dd5e5b 2022 2022-01-19 19:23:47 甘霖 \n",
"\n",
" type content \n",
"0 工业商贸 调研情况:化工产业是江苏省支柱产业之一,是我省重要的基础性产业,产业规模、行业基础、技术水平... \n",
"1 科学技术 调研情况:习近平总书记强调:“人工智能是新一轮科技革命和产业变革的重要驱动力量,加快发展新一... \n",
"2 工业商贸 调研情况: 2021年2月8日,南京都市圈发展规划获国家发改委批复,要求以区域间的就近性、互... \n",
"3 文化宣传 调研情况:近年来,省政协总工会界别委员认真学习贯彻党的十九大精神,围绕省委省政府和省“两会”... \n",
"4 社会事业 调研情况:为进一步加大对民营经济高质量发展支持力度,我省出台《关于促进民营经济高质量发展的意... "
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"ExecuteTime": {
"end_time": "2019-10-10T01:41:33.470909Z",
"start_time": "2019-10-10T01:41:33.468514Z"
}
},
"outputs": [],
"source": [
"#df.to_csv('./data/jszx2022.csv', index = False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autoclose": false,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}