Python Web Page Downloader and Parser


A piece of code in the Python crawler video course at http://www.imooc.com/learn/563 no longer works as-is in Python 3.x: in 3.x, the 2.x module urllib2 was merged into urllib.request.
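In short, the same calls move from the top-level urllib2 module to the urllib.request submodule (and cookielib becomes http.cookiejar). A minimal before/after sketch:

# Python 2.x
import urllib2
response = urllib2.urlopen("http://www.baidu.com")

# Python 3.x: urllib2 is now urllib.request
import urllib.request
response = urllib.request.urlopen("http://www.baidu.com")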


Original code from the video:
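(The original code is not reproduced here. Judging from the 3.x rewrite below, the 2.x version was presumably along these lines; this is a reconstruction, not the video's exact code.)

# -*- coding: utf-8 -*-
# Python 2.x (reconstructed from the 3.x rewrite below)
import urllib2
import cookielib

url = "http://www.baidu.com"

print "Method 1: the simplest way"
response1 = urllib2.urlopen(url)           # issue the request directly
print response1.getcode()                  # HTTP status code; 200 means success
print len(response1.read())                # length of the page's HTML content

print "Method 2: add data and an HTTP header"
request = urllib2.Request(url)
request.add_header("user-agent", "Mozilla/5.5")  # add an HTTP header
response2 = urllib2.urlopen(request)       # send the Request and fetch the result
print response2.getcode()
print len(response2.read())

print "Method 3: cookie handling"
cj = cookielib.CookieJar()                 # create a cookie container
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))  # build an opener
urllib2.install_opener(opener)             # install the opener on urllib2
response3 = urllib2.urlopen(url)           # this request now carries cookies
print response3.getcode()
print len(response3.read())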


Changed to the 3.x version:

# -*- coding: utf-8 -*-
# Python 3.5
# Web page downloader test
import urllib.request
import http.cookiejar

url = "http://www.baidu.com"

print("Method 1: the simplest way")
response1 = urllib.request.urlopen(url)  # issue the request directly
print(response1.getcode())               # HTTP status code; 200 means success
print(len(response1.read()))             # length of the page's HTML content

print("Method 2: add data and an HTTP header")
request = urllib.request.Request(url)
request.add_header("user-agent", "Mozilla/5.5")  # add an HTTP header
response2 = urllib.request.urlopen(request)      # send the Request object so the header is actually used
print(response2.getcode())
print(len(response2.read()))

print("Method 3: cookie handling")
cj = http.cookiejar.CookieJar()  # create a cookie container
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))  # build an opener with cookie support
urllib.request.install_opener(opener)    # install the opener on urllib
response3 = urllib.request.urlopen(url)  # this request now carries cookies
print(response3.getcode())
print(len(response3.read()))
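One 3.x pitfall worth noting here: read() returns bytes, not str, so decode before treating the content as text. A minimal sketch, assuming the page is UTF-8 encoded:

import urllib.request

response = urllib.request.urlopen("http://www.baidu.com")
html = response.read().decode("utf-8")  # bytes -> str; UTF-8 is an assumption about this page
print(html[:100])                       # first 100 characters of the page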

Web page parsing test:

# -*- coding: utf-8 -*-
# Python 3.5
# Web page parsing test
import re
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

print('Get all links')
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())

print('Get the lacie link')
link_node = soup.find('a', href='http://example.com/lacie')
print(link_node.name, link_node['href'], link_node.get_text())

print('Regex match')
link_node = soup.find('a', href=re.compile(r"ill"))
print(link_node.name, link_node['href'], link_node.get_text())

print('Get the paragraph text')
p_node = soup.find('p', class_="title")
print(p_node.name, p_node.get_text())

Output:

Get all links
a http://example.com/elsie Elsie
a http://example.com/lacie Lacie
a http://example.com/tillie Tillie
Get the lacie link
a http://example.com/lacie Lacie
Regex match
a http://example.com/tillie Tillie
Get the paragraph text
p The Dormouse's story
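Putting the downloader and the parser together: the sketch below fetches a live page and lists its links. The target URL and the UTF-8 encoding are assumptions for illustration.

# Download a page, then parse it with BeautifulSoup
import urllib.request
from bs4 import BeautifulSoup

url = "http://www.baidu.com"
html = urllib.request.urlopen(url).read().decode("utf-8")
soup = BeautifulSoup(html, 'html.parser')

for link in soup.find_all('a'):  # print every link on the page
    print(link.get('href'), link.get_text())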

Finally, a diagram of a simple crawler framework:
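The diagram itself is not reproduced here. As a rough stand-in, below is a minimal code sketch of the shape such a framework usually takes: a scheduler looping over a URL manager, a downloader, and a parser. All class and function names are illustrative assumptions, not code from the course.

# A minimal crawler-framework sketch: scheduler + URL manager + downloader + parser.
# All class and function names are illustrative assumptions.
import urllib.request
from bs4 import BeautifulSoup

class UrlManager:
    """Tracks URLs waiting to be crawled and URLs already seen."""
    def __init__(self):
        self.new_urls, self.old_urls = set(), set()

    def add(self, url):
        if url and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new(self):
        return bool(self.new_urls)

    def get(self):
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url

def download(url):
    # Downloader: fetch the raw HTML (UTF-8 assumed)
    return urllib.request.urlopen(url).read().decode("utf-8")

def parse(html):
    # Parser: pull out the title and all outgoing links
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.find('title')
    links = [a.get('href') for a in soup.find_all('a')]
    return (title.get_text() if title else ''), links

def crawl(root_url, max_pages=3):
    # Scheduler: drive the loop until the URL pool is empty or the limit is hit
    urls = UrlManager()
    urls.add(root_url)
    count = 0
    while urls.has_new() and count < max_pages:
        url = urls.get()
        try:
            title, links = parse(download(url))
            print(count, url, title)
            for link in links:
                if link and link.startswith('http'):
                    urls.add(link)
        except Exception as e:
            print('failed:', url, e)
        count += 1

crawl("http://www.baidu.com")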