The Dormouse's story
\n", "\n", "Once upon a time there were three little sisters; and their names were\n", "Elsie,\n", "Lacie and\n", "Tillie;\n", "and they lived at the bottom of a well.
\n", "\n", "...
\n" ] } ], "source": [ "print(content.text)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2023-10-27T06:43:00.786372Z", "start_time": "2023-10-27T06:43:00.777585Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'utf-8'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "content.encoding" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Beautiful Soup\n", "> Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful:\n", "\n", "- Beautiful Soup provides a few simple methods. It doesn't take much code to write an application.\n", "- Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings unless the document doesn't specify one and Beautiful Soup can't detect it; then you just have to specify the original encoding.\n", "- Beautiful Soup sits on top of popular Python parsers like `lxml` and `html5lib`.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Install beautifulsoup4\n", "\n", "Open your terminal/cmd\n", "\n", "The Dormouse's story
\n", "Once upon a time there were three little sisters; and their names were\n", "Elsie,\n", "Lacie and\n", "Tillie;\n", "and they lived at the bottom of a well.
\n", "...
" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# url = 'http://socratesacademy.github.io/bigdata/data/test.html'\n", "# content = requests.get(url)\n", "content = content.text\n", "soup = BeautifulSoup(content, 'html.parser') \n", "soup" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2023-10-27T06:45:50.190354Z", "start_time": "2023-10-27T06:45:50.186885Z" }, "scrolled": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " \n", "\n", " \n", " The Dormouse's story\n", " \n", "
\n", "\n", " Once upon a time there were three little sisters; and their names were\n", " \n", " Elsie\n", " \n", " ,\n", " \n", " Lacie\n", " \n", " and\n", " \n", " Tillie\n", " \n", " ;\n", "and they lived at the bottom of a well.\n", "
\n", "\n", " ...\n", "
\n", " \n", "\n" ] } ], "source": [ "print(soup.prettify())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- html\n", "    - head\n", "        - title\n", "    - body\n", "        - p (class = 'title', 'story')\n", "            - a (class = 'sister')\n", "                - href/id" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## The select method\n", "\n", "\n", "- Tag names are written without any prefix\n", "- Class names are prefixed with a dot (`.`)\n", "- id names are prefixed with `#`\n", "\n", "We can use these CSS selector rules with the `soup.select()` method to filter elements; it returns a list of matching tags" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Three steps for using select\n", "\n", "- Inspect\n", "- Copy\n", "    - Copy Selector\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Select the title `The Dormouse's story` with the mouse, right-click, and choose Inspect\n", "- Move the mouse to the highlighted source code\n", "- Right-click and choose Copy-->Copy Selector \n", "\n", "`body > p.title > b`\n", "\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2023-10-27T06:54:02.243539Z", "start_time": "2023-10-27T06:54:02.238944Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "\"The Dormouse's story\"" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.select('body > p.title > b')[0].text" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Select method: find by tag name" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2023-10-27T06:54:58.162310Z", "start_time": "2023-10-27T06:54:58.157979Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "\"The Dormouse's story\"" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.select('title')[0].text" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": 
"2023-10-27T06:55:09.473384Z", "start_time": "2023-10-27T06:55:09.469013Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[Elsie,\n", " Lacie,\n", " Tillie]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.select('a') " ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2023-10-27T06:55:24.216571Z", "start_time": "2023-10-27T06:55:24.212141Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[The Dormouse's story]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.select('b')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Select method: find by class name" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2023-10-27T06:56:10.151624Z", "start_time": "2023-10-27T06:56:10.146748Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[Once upon a time there were three little sisters; and their names were\n", " Elsie,\n", " Lacie and\n", " Tillie;\n", " and they lived at the bottom of a well.
,\n", "...
]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.select('.story')" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2023-10-27T06:56:28.766310Z", "start_time": "2023-10-27T06:56:28.761632Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[Elsie,\n", " Lacie,\n", " Tillie]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.select('.sister')" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2023-10-27T06:56:38.557921Z", "start_time": "2023-10-27T06:56:38.553664Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[The Dormouse's story
]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.select('.title')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Select method: find by id" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2023-10-27T06:57:23.636680Z", "start_time": "2023-10-27T06:57:23.632347Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[Elsie]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.select('#link1')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2023-10-27T06:57:49.380331Z", "start_time": "2023-10-27T06:57:49.376333Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'http://example.com/elsie'" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.select('#link1')[0]['href']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Select method: combined lookup\n", "\n", "Combine tag names, class names, and ids\n", "\n", "- For example, find the element whose id is `link1` inside a `p` tag\n", " " ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2023-10-27T06:58:36.289056Z", "start_time": "2023-10-27T06:58:36.284848Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[Elsie]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.select('p #link1')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Select method: attribute lookup\n", "\n", "Add attribute filters\n", "- Attribute filters go in square brackets, for example `a[id=link1]`; `>` is the child combinator, which selects direct children, as in `body > p`\n", "- An attribute filter and its tag belong to the same node, so no space may go between them.\n", " \n", "\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2023-10-27T06:59:05.510790Z", "start_time": "2023-10-27T06:59:05.506829Z" 
}, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[The Dormouse's story
,\n", "Once upon a time there were three little sisters; and their names were\n", " Elsie,\n", " Lacie and\n", " Tillie;\n", " and they lived at the bottom of a well.
,\n", "...
]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.select(\"body > p\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## The find_all method\n", "\n", "- Find all matches:\n", "    - `find_all`\n", "    - `()` (calling the soup directly, as in `soup('p')`)\n", "- Find only the first match:\n", "    - `find`\n", "    - `.` (attribute access, as in `soup.p`)\n", "- `find_all('tag name', {'class or id': 'value'})`" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2023-10-27T07:00:21.412286Z", "start_time": "2023-10-27T07:00:21.408110Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[The Dormouse's story
,\n", "Once upon a time there were three little sisters; and their names were\n", " Elsie,\n", " Lacie and\n", " Tillie;\n", " and they lived at the bottom of a well.
,\n", "...
]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find_all('p')" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2023-10-27T07:00:30.564143Z", "start_time": "2023-10-27T07:00:30.559833Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[The Dormouse's story
,\n", "Once upon a time there were three little sisters; and their names were\n", " Elsie,\n", " Lacie and\n", " Tillie;\n", " and they lived at the bottom of a well.
,\n", "...
]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup('p')" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2023-10-27T07:00:36.932827Z", "start_time": "2023-10-27T07:00:36.928226Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "The Dormouse's story
" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find('p')" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2023-10-27T07:00:40.387915Z", "start_time": "2023-10-27T07:00:40.383875Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "The Dormouse's story
" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.p" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2023-10-27T07:16:10.609548Z", "start_time": "2023-10-27T07:16:10.605052Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[\"The Dormouse's story\",\n", " 'Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.',\n", " '...']" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[i.text.replace('\\n', ' ') for i in soup('p')]" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "ExecuteTime": { "end_time": "2023-10-27T07:16:22.279642Z", "start_time": "2023-10-27T07:16:22.276497Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Dormouse's story\n", "Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.\n", "...\n" ] } ], "source": [ "for i in soup('p'):\n", " print(i.text.replace('\\n', ' '))" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "ExecuteTime": { "end_time": "2023-10-27T07:16:39.366957Z", "start_time": "2023-10-27T07:16:39.362896Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "html\n", "head\n", "title\n", "body\n", "p\n", "b\n", "p\n", "a\n", "a\n", "a\n", "p\n" ] } ], "source": [ "for tag in soup.find_all(True):\n", " print(tag.name)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "ExecuteTime": { "end_time": "2023-10-27T07:16:48.535162Z", "start_time": "2023-10-27T07:16:48.531087Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[