BeautifulSoup4 Study Notes


Quick start

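import urllib.request

from bs4 import BeautifulSoup

# url and header are assumed to be defined by the caller:
# url is the page address, header is a dict of request headers
req = urllib.request.Request(url, headers=header)
response = urllib.request.urlopen(req)
data = response.read()
# Build a bs object from the downloaded page
soup = BeautifulSoup(data, 'html.parser')
# Print the page content, pretty-printed
print(soup.prettify())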
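
To try the same steps without a network request, the sketch below parses a small in-memory string instead of the bytes returned by response.read(). The HTML string and the variable names page_title and first_bold are made up for illustration.

```python
from bs4 import BeautifulSoup

# A small in-memory document stands in for the downloaded page (illustrative only).
html = (
    "<html><head><title>Demo page</title></head>"
    "<body><p>Hello, <b>soup</b>!</p></body></html>"
)

# html.parser ships with Python, so no extra parser package is needed.
soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string   # text inside <title>
first_bold = soup.b.get_text()   # text inside the first <b>

print(page_title)
print(first_bold)
```

Swapping html for the data read from a real response should leave the rest of the code unchanged.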

Navigating the structured data

soup.title
# <title>The Dormouse's story</title>
soup.title.name
# 'title'
soup.title.string
# "The Dormouse's story"
# The value returned here is a <class 'bs4.element.NavigableString'>;
# to get a plain str, use str(soup.title.string)
soup.title.parent.name
# 'head'
# name of the parent tag of title
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
# the first p tag
soup.p['class']
# ['title']
# the class attribute of the p tag (a list, because class is multi-valued)
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Get the href of every a tag

for link in soup.find_all('a'):
    print(link.get('href'))

Get all the text in the document

p = soup.p
print(p.get_text())
print(soup.get_text())
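
get_text() also accepts separator and strip arguments, which help when the raw text is full of stray whitespace. A minimal sketch on a made-up snippet (the HTML and variable names are illustrative):

```python
from bs4 import BeautifulSoup

html = "<div><p> First line </p><p>Second <b>line</b></p></div>"
soup = BeautifulSoup(html, "html.parser")

# Default: every string in the document is concatenated as-is.
raw_text = soup.get_text()

# strip=True trims each string and drops empty ones;
# separator is placed between the remaining pieces.
clean_text = soup.get_text(separator="|", strip=True)

print(repr(raw_text))
print(clean_text)
```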

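Beautiful Soup turns a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds:
Tag, NavigableString, BeautifulSoup, and Comment.

A Tag corresponds to an HTML tag in the document, for example an element whose text is "hello beautiful".
The two things that matter most about a tag are its name and its attributes.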

# Every tag has a name.
# Attributes are what JS calls attr, for example class, data-id, id and so on.
# Attributes are accessed directly, as in a dictionary.
soup1 = BeautifulSoup("<p class='p' id='p1'></p>",'lxml')
p = soup1.p
print(p['class'])
print(p.attrs)
#['p']
#{'class': ['p'], 'id': 'p1'}

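Some attributes can hold several values. The most common one is class, for example class="p p1 p2". Accessing such an attribute through the dictionary interface returns a list containing every class value.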

soup1 = BeautifulSoup("<p class='p p1 p2' id='p1'></p>",'lxml')
p = soup1.p
print(p['class'])
#['p', 'p1', 'p2']
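
A short sketch contrasting a multi-valued attribute with a single-valued one, and showing get() for an attribute that may be absent. It uses html.parser rather than lxml so that it runs without extra packages; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='p p1 p2' id='p1'></p>", "html.parser")
p = soup.p

classes = p["class"]         # multi-valued attribute: a list of class names
tag_id = p["id"]             # single-valued attribute: a plain string
missing = p.get("data-id")   # get() returns None instead of raising KeyError
has_p1 = "p1" in p["class"]  # membership test against the class list

print(classes, tag_id, missing, has_p1)
```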

Strings usually live inside tags. Beautiful Soup wraps the string inside a tag in the NavigableString class:

tag.string
# 'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>

Converting to a plain Python string (unicode() in Python 2, str() in Python 3)

unicode_string = str(tag.string)
unicode_string
# 'Extremely bold'
type(unicode_string)
# <class 'str'>

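A NavigableString supports most, but not all, of the attributes defined for navigating and searching the tree. In particular, a string cannot contain anything else (a tag can contain strings or other tags), so strings do not support the .contents or .string attributes or the find() method.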

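The BeautifulSoup object represents the document as a whole. Most of the time it can be treated as a Tag object.

Because the BeautifulSoup object is not a real HTML or XML tag, it has no name and no attributes. It is still sometimes handy to look at its .name, so the BeautifulSoup object is given the special .name value "[document]".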
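
A minimal check of the two points above, on a made-up one-tag document (variable names are illustrative):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p>hi</p>", "html.parser")

doc_name = soup.name             # the special name of the document object
is_tag = isinstance(soup, Tag)   # BeautifulSoup is a subclass of Tag

# Tag-style navigation and searching work directly on the soup object.
first_p = soup.find("p")

print(doc_name, is_tag, first_p)
```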

Comment objects are mainly used to get at the content of comments

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
print(type(comment))
print(comment)
#<class 'bs4.element.Comment'>
#Hey, buddy. Want to buy a used parser?

A Comment object is a special type of NavigableString.
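
Because Comment is a subclass of NavigableString, code that walks over strings often needs to tell the two apart. One way to sketch that, on made-up markup, is an isinstance check:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

markup = "<p>visible text<!-- hidden note --></p>"
soup = BeautifulSoup(markup, "html.parser")

# Comments are NavigableString instances too, so test for Comment first.
comments = [node for node in soup.p.contents if isinstance(node, Comment)]
texts = [
    node for node in soup.p.contents
    if isinstance(node, NavigableString) and not isinstance(node, Comment)
]

print(comments)
print(texts)
```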

Take the following HTML as an example

html_doc = """
<html><head><title>The Dormouse's story</title></head>
    <body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

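# Parse the example document, then work with its tags
soup = BeautifulSoup(html_doc, 'html.parser')
head = soup.head
title = soup.title
b = soup.body.b
# Dot access only returns the first matching tag; to get all of them, use find_all (findAll is the older alias)
a_list = soup.find_all('a')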
print(a_list)
print(type(a_list))
print(soup.find_all('b'))
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
#<class 'bs4.element.ResultSet'>
#[<b>The Dormouse's story</b>]

A tag's .contents attribute returns the tag's child nodes as a list:

# In the html above, the first p contains a b child tag; the code below prints the list of child nodes of p
tags = soup.p.contents
print(tags)
#[<b>The Dormouse's story</b>]

.contents and .children only include a tag's direct children.
.descendants also returns the children of those children (grandchildren and deeper).

print(type(soup.body))
#<class 'bs4.element.Tag'>
tags = soup.body
for child in tags.descendants:
    print(child)
# <p class="title"><b>The Dormouse's story</b></p>
# <b>The Dormouse's story</b>
# The Dormouse's story
#
#
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# Once upon a time there were three little sisters; and their names were
#
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# Elsie
# ,
#
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# Lacie
#  and
#
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# Tillie
# ;
# and they lived at the bottom of a well.
#
#
# <p class="story">...</p>
# ...

If a tag has exactly one child and that child is a NavigableString, the tag's .string attribute returns that child.
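
A small sketch of how .string behaves, using a made-up two-paragraph snippet. Besides the single-string case, it shows that .string looks through a single child tag and that it is None when a tag has several children.

```python
from bs4 import BeautifulSoup

html = "<p><b>The Dormouse's story</b></p><p>one<b>two</b></p>"
soup = BeautifulSoup(html, "html.parser")
first_p, second_p = soup.find_all("p")

only_string = first_p.b.string   # <b> has a single string child
via_child = first_p.string       # <p> has a single child tag, so .string looks inside it
ambiguous = second_p.string      # several children: .string is None

print(only_string, via_child, ambiguous)
```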

Use the .parent attribute to get the parent of an element. In the "Alice" example document, the <head> tag is the parent of the <title> tag.

.parents is the counterpart of .descendants: it iterates over all the ancestors of an element.
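
A minimal sketch of .parent and .parents on a cut-down version of the example document (the shortened HTML and the variable names are illustrative):

```python
from bs4 import BeautifulSoup

html_doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p><a id='link1'>Elsie</a></p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# .parent is the single node directly above an element.
title_parent = soup.title.parent.name

# .parents walks upward from the element to the BeautifulSoup object itself.
ancestors = [parent.name for parent in soup.a.parents]

print(title_parent)
print(ancestors)
```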

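Everything above covered how to get a tag, the elements under it, and its parent, child, and sibling elements.
This section covers how to search the document.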

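First, the filter types

The simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for an exact match against it. The example below finds all the <b> tags in the document:

soup.find_all('b')
# [<b>The Dormouse's story</b>]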

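Regular expressions

If you pass in a regular expression, Beautiful Soup filters against it using its search() method. The example below finds all tags whose names start with b, so both the <body> tag and the <b> tag are found:

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b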

Lists

If you pass in a list, Beautiful Soup returns everything that matches any item in the list. The code below finds all the <a> tags and <b> tags in the document:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

True
True matches anything. The code below finds every tag in the document, but it does not return string nodes

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p

Some search examples

Search for all p tags

soup.find_all('p')

Search for the tag whose id is apple

soup.find_all(id='apple')

Search by attribute

data_soup.find_all(attrs={"attr": "value"})

Search for a tags whose class is sister

soup.find_all('a', class_='sister')
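
The searches above can be tried together on one small document. The snippet below is a sketch with made-up markup; the data-foo attribute and the class names are illustrative, not taken from the example document.

```python
from bs4 import BeautifulSoup

html = (
    '<div data-foo="value" id="apple">foo!</div>'
    '<p class="sister big">x</p>'
    '<p class="brother">y</p>'
)
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all(id="apple")

# data-* names are not valid Python keyword arguments, so they go in attrs.
by_attr = soup.find_all(attrs={"data-foo": "value"})

# class is a reserved word in Python, hence class_; it matches any one of a tag's classes.
by_class = soup.find_all("p", class_="sister")

print(by_id, by_attr, by_class)
```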

string parameter

The string parameter searches the string content of the document. Like the name parameter, it accepts a string, a regular expression, a list, or True. For example:

soup.find_all(string="Elsie")
# ['Elsie']

soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']

soup.find_all(string=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]

soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

limit parameter

# Return only the first two results
soup.find_all("a", limit=2)
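
A short sketch of limit on made-up markup, together with how it relates to find(): find() behaves like find_all() with limit=1, except that it returns the element itself, or None when nothing matches.

```python
from bs4 import BeautifulSoup

html = '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>'
soup = BeautifulSoup(html, "html.parser")

# limit stops the search once that many results have been found.
first_two_ids = [a["id"] for a in soup.find_all("a", limit=2)]

first_link = soup.find("a")       # a single Tag, not a list
no_table = soup.find("table")     # None when nothing matches

print(first_two_ids, first_link, no_table)
```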

CSS selectors

# Get the title
soup.select("title")
# [<title>The Dormouse's story</title>]

# Get the third p element
soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]

Find tags beneath other tags, level by level:

soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("html head title")
# [<title>The Dormouse's story</title>]

Find the direct children of a tag

soup.select("head > title")
# [<title>The Dormouse's story</title>]

soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("body > a")
# []

Find sibling tags

soup.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find by CSS class name:

soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Find elements whose class includes sister
soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find by tag id:

soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Match any of several CSS selectors at once:

soup.select("#link1,#link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find by the presence of an attribute:

soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find by attribute value:

soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

# Find a tags whose href starts with http://example.com/
soup.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

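Find by language setting:

multilingual_markup = """
 <p lang="en">Hello</p>
 <p lang="en-us">Howdy, y'all</p>
 <p lang="en-gb">Pip-pip, old fruit</p>
 <p lang="fr">Bonjour mes amis</p>
"""
multilingual_soup = BeautifulSoup(multilingual_markup, 'html.parser')
# All p elements whose lang is en or starts with en-
multilingual_soup.select('p[lang|=en]')
# [<p lang="en">Hello</p>,
#  <p lang="en-us">Howdy, y'all</p>,
#  <p lang="en-gb">Pip-pip, old fruit</p>]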

Return only the first matching element

soup.select_one(".sister")
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
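
To contrast select() and select_one(), here is a sketch on a shortened version of the example markup. select() always returns a list, while select_one() returns a single element, or None when nothing matches. The selector "#nope" is a made-up id used only to show the no-match case.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="story">'
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

# select() returns every match, in document order.
hrefs = [a["href"] for a in soup.select("p.story > a.sister")]

first = soup.select_one(".sister")   # the first match only
missing = soup.select_one("#nope")   # None, not an empty list

print(hrefs, first["id"], missing)
```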


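When scraping pages you can use urllib together with regular expressions, but regular expressions are tedious and awkward to use, which is where bs4 comes in. Besides bs4 there is also scrapy, a full-scale crawling framework.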

Object types

Tag

Multi-valued attributes

NavigableString (a navigable string)

BeautifulSoup

Comments and special strings

Core (navigating the document tree)

.contents and .children

.descendants

.string

Parent node: .parent

Ancestor nodes: .parents

Sibling nodes

Searching the document tree (important)

find and find_all

Strings

Regular expressions

Lists

Some search examples

string parameter

limit parameter

CSS selectors