Python & bs4 basics

python & bs4

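Scraping web pages with regular expressions alone is tedious, and regular expressions are hard to master. With bs4, select() returns a list of every tag that matches a CSS selector, and find() returns the first matching tag (or None if nothing matches), so extracting content from HTML or XML tags becomes very convenient.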

Example:

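import urllib.request  # Python 3 replacement for the Python 2 urllib2 module
from bs4 import BeautifulSoup

# target_url, _headers and encoding are assumed to be defined by the caller
req = urllib.request.Request(target_url, headers=_headers)
myPage = urllib.request.urlopen(req).read().decode(encoding)
soup = BeautifulSoup(myPage, 'lxml')

# select() returns a list of every <a> tag matching the CSS selector
dom_tag_a = soup.select('div[class*="right_wrap"] > div[class*="content"] > div[class*="phref"] > a')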
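
To see what the selector above actually returns without hitting the network, here is a minimal sketch that runs it against an inline HTML string. The markup, URLs and link text are made up for illustration (the real page structure is not shown in this post), and it uses the standard-library html.parser instead of lxml so that nothing beyond bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mirrors the nesting the selector expects.
html = """
<div class="right_wrap clearfix">
  <div class="content">
    <div class="phref"><a href="/post/1">First post</a></div>
    <div class="phref"><a href="/post/2">Second post</a></div>
  </div>
  <div class="sidebar"><a href="/about">About</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")  # stdlib parser, lxml not required

selector = 'div[class*="right_wrap"] > div[class*="content"] > div[class*="phref"] > a'
links = soup.select(selector)  # list of Tag objects, empty list if no match

# Pull the attribute and the text out of each matched tag.
hrefs = [a.get("href") for a in links]
titles = [a.get_text(strip=True) for a in links]

first = soup.find("a")        # first <a> in document order
missing = soup.find("table")  # no <table> in the markup, so this is None

print(hrefs)
print(titles)
```

The sidebar link is skipped because ">" only matches direct children, and class*="..." is a substring match, which is why "right_wrap clearfix" still qualifies. Since find() can return None, it is worth checking the result before indexing into it.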
