Chapter 3: Data Scraping#
An Introduction to Requests and Beautiful Soup
Basic Principles#
A web crawler is an automated program that requests websites and extracts data. Requesting, extracting, and automating are the keys! The basic workflow of a crawler:
Send a request
Use an HTTP library to send a Request to the target site. The request can carry extra information such as headers. Then wait for the server to respond.
Get the response content
If the server responds normally, you receive a Response. Its body is the page content you want, which may be HTML, a JSON string, binary data (an image or video), and so on.
Parse the content
If the content is HTML, parse it with a page-parsing library or regular expressions; if it is JSON, convert it directly to a JSON object; if it is binary data, save it or process it further.
Save the data
The data can be saved in many forms: as plain text, in a database, or in files of a specific format.
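The four steps above can be sketched end to end. To keep the sketch runnable offline, the HTML is given inline instead of being fetched with requests (the fetch is shown as a comment):

```python
from bs4 import BeautifulSoup

# Steps 1-2 (request + response) would normally be:
#   response = requests.get(url)
#   html = response.text
html = """<html><body>
<p class="story">Once upon a time there were three little sisters.</p>
<p class="story">...</p>
</body></html>"""

# Step 3: parse the content
soup = BeautifulSoup(html, 'html.parser')
paragraphs = [p.text for p in soup.find_all('p')]

# Step 4: save the data (plain text here; a database would also work)
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(paragraphs))

print(paragraphs)
```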
When the browser sends a message to the server hosting a site, that process is the HTTP Request; when the server receives the message, processes it according to its content, and sends a message back to the browser, that process is the HTTP Response.
Problems to Solve#
Page parsing
Retrieving source data hidden by JavaScript
Automatic pagination
Automatic login
Connecting to API endpoints
For ordinary scraping tasks, requests combined with beautifulsoup4 is enough.
This is especially true for pages whose URL changes in a regular pattern as you page through them: you only need to handle the patterned URLs.
As a simple example, we will scrape posts on a given keyword from the Tianya forum.
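When the page number is the only part of the URL that changes, pagination reduces to generating the URL list. The forum URL pattern below is a hypothetical placeholder:

```python
# Hypothetical URL pattern where only the page number changes.
base = 'http://bbs.example.com/list?keyword=big+data&page={}'
urls = [base.format(i) for i in range(1, 4)]

# Each URL would then be fetched and parsed the same way:
# for url in urls:
#     html = requests.get(url).text
print(urls)
```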
The First Crawler#
Beautifulsoup Quick Start
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
https://socratesacademy.github.io/bigdata/data/test.html
‘Once upon a time there were three little sisters,’ the Dormouse began in a great hurry; ‘and their names were Elsie, Lacie, and Tillie; and they lived at the bottom of a well–’
‘What did they live on?’ said Alice, who always took a great interest in questions of eating and drinking.
‘They lived on treacle,’ said the Dormouse, after thinking a minute or two.
‘They couldn’t have done that, you know,’ Alice gently remarked; ‘they’d have been ill.’
‘So they were,’ said the Dormouse; ‘very ill.’
Alice's Adventures in Wonderland, Chapter VII: A Mad Tea-Party. http://www.gutenberg.org/files/928/928-h/928-h.htm
import requests
from bs4 import BeautifulSoup
url = 'https://vp.fact.qq.com/home'
content = requests.get(url)
soup = BeautifulSoup(content.text, 'html.parser')
help(requests.get)
Help on function get in module requests.api:
get(url, params=None, **kwargs)
Sends a GET request.
:param url: URL for the new :class:`Request` object.
:param params: (optional) Dictionary, list of tuples or bytes to send
in the query string for the :class:`Request`.
:param \*\*kwargs: Optional arguments that ``request`` takes.
:return: :class:`Response <Response>` object
:rtype: requests.Response
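As the docstring shows, the params argument is encoded into the query string. A prepared request makes this visible without touching the network (example.com is a placeholder URL):

```python
import requests

# Build (but do not send) a GET request to inspect the final URL.
req = requests.Request('GET', 'http://example.com/search',
                       params={'q': 'beautifulsoup', 'page': 2}).prepare()
print(req.url)
```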
url = 'https://socratesclub.github.io/bigdata/data/test.html'
content = requests.get(url)
#help(content)
print(content.text)
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
content.encoding
'utf-8'
Beautiful Soup#
Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful:
Beautiful Soup provides a few simple methods. It doesn’t take much code to write an application
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings unless the document doesn't specify one and Beautiful Soup can't detect it; then you just have to specify the original encoding.
Beautiful Soup sits on top of popular Python parsers like lxml and html5lib.
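The Unicode conversion can be seen directly: hand Beautiful Soup GBK-encoded bytes and, when detection needs help, name the original encoding with the from_encoding argument:

```python
from bs4 import BeautifulSoup

# A short GBK-encoded document; .text comes back as ordinary Unicode.
raw = '<p>你好</p>'.encode('gbk')
soup = BeautifulSoup(raw, 'html.parser', from_encoding='gbk')
print(soup.p.text)
```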
Install beautifulsoup4#
open your terminal/cmd
$ pip install beautifulsoup4
html.parser#
Beautiful Soup supports the html.parser included in Python’s standard library
lxml#
but it also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with one of these commands:
$ apt-get install python-lxml
$ easy_install lxml
$ pip install lxml
html5lib#
Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:
$ apt-get install python-html5lib
$ easy_install html5lib
$ pip install html5lib
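These parsers are not interchangeable on invalid markup; each repairs it differently. html.parser, for example, simply drops an unmatched closing tag, whereas lxml and html5lib would also wrap the fragment in html/body tags:

```python
from bs4 import BeautifulSoup

# html.parser ignores the stray </p> and closes the dangling <a>.
print(BeautifulSoup('<a></p>', 'html.parser'))
```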
# url = 'http://socratesacademy.github.io/bigdata/data/test.html'
# content = requests.get(url)
content = content.text
soup = BeautifulSoup(content, 'html.parser')
soup
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p></body></html>
print(soup.prettify())
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
html
head
title
body
p (class = ‘title’, ‘story’ )
a (class = ‘sister’)
href/id
The Select Method#
Tag names take no prefix
Class names are prefixed with a dot (.)
id names are prefixed with a hash (#)
We can use these rules with the soup.select() method to filter elements; it returns a list.
The Select method in three steps:
Inspect
Copy
Copy Selector
Select the title 'The Dormouse's story' with the mouse, then right-click and choose Inspect. Hover over the highlighted source code, right-click it, and choose Copy → Copy Selector:
body > p.title > b
soup.select('body > p.title > b')[0].text
"The Dormouse's story"
The Select Method: Finding by Tag Name#
soup.select('title')[0].text
"The Dormouse's story"
soup.select('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('b')
[<b>The Dormouse's story</b>]
The Select Method: Finding by Class Name#
soup.select('.story')
[<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>,
<p class="story">...</p>]
soup.select('.sister')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('.title')
[<p class="title"><b>The Dormouse's story</b></p>]
The Select Method: Finding by id#
soup.select('#link1')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select('#link1')[0]['href']
'http://example.com/elsie'
The Select Method: Combined Lookups#
Tag names, class names, and id names can be combined.
For example, to find the element with id link1 inside a p tag:
soup.select('p #link1')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
The Select Method: Direct Children#
The child combinator > connects a parent tag to a child tag and matches only elements that are direct children of the parent:
soup.select("head > title")
[<title>The Dormouse's story</title>]
soup.select("body > p")
[<p class="title"><b>The Dormouse's story</b></p>,
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>,
<p class="story">...</p>]
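The difference between a space (any descendant) and > (direct child only) is easy to miss; a nested link makes it visible:

```python
from bs4 import BeautifulSoup

html = '<p><span><a id="deep">x</a></span><a id="direct">y</a></p>'
soup = BeautifulSoup(html, 'html.parser')
print([a['id'] for a in soup.select('p a')])    # descendants at any depth
print([a['id'] for a in soup.select('p > a')])  # direct children only
```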
The find_all Method#
To find all matches: find_all()
To find only the first match: find()
The general pattern:
find_all('tag name', {'class or id': 'value'})
soup.find_all('p')
[<p class="title"><b>The Dormouse's story</b></p>,
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>,
<p class="story">...</p>]
soup('p')
[<p class="title"><b>The Dormouse's story</b></p>,
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>,
<p class="story">...</p>]
soup.find('p')
<p class="title"><b>The Dormouse's story</b></p>
soup.p
<p class="title"><b>The Dormouse's story</b></p>
[i.text.replace('\n', ' ') for i in soup('p')]
["The Dormouse's story",
'Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.',
'...']
for i in soup('p'):
print(i.text.replace('\n', ' '))
The Dormouse's story
Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.
...
for tag in soup.find_all(True):
print(tag.name)
html
head
title
body
p
b
p
a
a
a
p
soup('title') # or soup.title
[<title>The Dormouse's story</title>]
soup.title.name
'title'
soup.title.string
"The Dormouse's story"
soup.title.text
# using .text is recommended
"The Dormouse's story"
soup.title.parent.name
'head'
soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find_all('a', {'class':'sister'})
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find_all('a', {'class': 'sister'})[0]
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a', {'class': 'sister'})[0].text
'Elsie'
soup.find_all('a', {'class':'sister'})[0]['href']
'http://example.com/elsie'
soup.find_all('a', {'class': 'sister'})[0]['id']
'link1'
soup.find_all('a', {'id':'link1'})#[0]['id']
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.find_all(["a", "b"])
[<b>The Dormouse's story</b>,
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
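Besides a list of names, find_all also accepts a compiled regular expression (or a function) as a filter:

```python
import re
from bs4 import BeautifulSoup

html = "<b>The Dormouse's story</b><i>italic</i>"
soup = BeautifulSoup(html, 'html.parser')
# Matches every tag whose name starts with 'b' -- here just <b>.
print([t.name for t in soup.find_all(re.compile('^b'))])
```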
soup.find(id="link1")
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(soup.get_text())
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
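Putting the pieces together, a typical final step pairs each link's text with its href so the result can be saved as tabular data:

```python
from bs4 import BeautifulSoup

html = '''<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>'''
soup = BeautifulSoup(html, 'html.parser')
# One (text, href) record per matching <a> tag.
links = [(a.text, a['href']) for a in soup.find_all('a', {'class': 'sister'})]
print(links)
```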