Fetching Web Pages with Python
Source: Internet  Published by: 网络商品直销  Editor: 程序博客网  Time: 2024/05/21 11:22
1. Fetch a web page
http://blog.csdn.net/zsuguangh/article/details/6226385
---------------------------------------------------------------------------------------------------------------------------------------------------------
#!/usr/bin/env python
# 1.py
# use UTF-8
# Python 3.3.0
# get code of given URL as html text string
# Python3 uses urllib.request.urlopen()
# instead of Python2's urllib.urlopen() or urllib2.urlopen()
# http://blog.csdn.net/zsuguangh/article/details/6226385
import urllib.request

fp = urllib.request.urlopen("http://www.baidu.com")
mybytes = fp.read()
# note that Python3 does not read the html code as string
# but as a bytes object, convert to string with
# decode(); "utf8" declares that the received data is UTF-8
# (so that Chinese text can be parsed and displayed)
mystr = mybytes.decode("utf8")
fp.close()
print(mystr)
---------------------------------------------------------------------------------------------------------------------------------------------------------
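A minimal sketch of the same fetch-and-decode step, written a little more defensively. The helper names `decode_html` and `fetch_text`, the 10-second timeout, and the choice to replace undecodable bytes rather than raise are all assumptions made for illustration; they are not part of the original script.

```python
import urllib.request


def decode_html(raw, encoding="utf-8"):
    """Decode raw bytes; undecodable bytes become U+FFFD instead of raising."""
    return raw.decode(encoding, errors="replace")


def fetch_text(url, encoding="utf-8", timeout=10):
    """Fetch url and return its body as a str (hypothetical helper)."""
    # The with-block closes the response even if read() raises.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
    return decode_html(raw, encoding)
```

Calling `print(fetch_text("http://www.baidu.com"))` would then reproduce what 1.py does, provided the page really is UTF-8.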
2. Detect the HTML encoding (really just string parsing)
---------------------------------------------------------------------------------------------------------------------------------------------------------
#!/usr/bin/env python
# 2.py
# use UTF-8
# Python 3.3.0
# get the code of a given URL as html text string
# Python3 uses urllib.request.urlopen()
# get the encoding used first
# tested with Python 3.1 with the Editra IDE
import urllib.request

def extract(text, sub1, sub2):
    """
    extract a substring from text between first
    occurrences of substrings sub1 and sub2
    """
    return text.split(sub1, 1)[-1].split(sub2, 1)[0]

fp = urllib.request.urlopen("http://www.baidu.com")  # open the URL
mybytes = fp.read()  # read the HTML data
fp.close()
# search the HTML data for "charset=" to find the encoding;
# without the membership test, extract() would return the whole
# page when "charset=" is missing and the check below would never fail
text = str(mybytes).lower()
encoding = extract(text, 'charset=', '"') if 'charset=' in text else ''
print('-'*50)
print("Encoding type = %s" % encoding)
print('-'*50)
if encoding:
    # note that Python3 does not read the html code as string
    # but as a bytes object, convert to string with
    mystr = mybytes.decode(encoding)
    print(mystr)
else:
    print("Encoding type not found!")
---------------------------------------------------------------------------------------------------------------------------------------------------------
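The string-splitting approach above only recognizes the form `charset=utf-8"`. It misses the quoted HTML5 form `<meta charset="utf-8">` and ignores the charset a server may send in its Content-Type header. The following is a rough sketch of a more tolerant detector. The function name `detect_encoding`, the 4096-byte search window, and the UTF-8 fallback are assumptions chosen for illustration, and this is not a full implementation of the encoding-sniffing rules browsers follow.

```python
import codecs
import re

# Matches charset=utf-8, charset="utf-8" and charset='utf-8' in raw bytes.
META_CHARSET = re.compile(rb'charset\s*=\s*["\']?\s*([a-z0-9_.:-]+)',
                          re.IGNORECASE)


def detect_encoding(raw, header_charset=None, default="utf-8"):
    """Guess the encoding of an HTML page given as bytes.

    header_charset is the charset from the HTTP Content-Type header,
    if the caller has one; it is tried before the <meta> declaration.
    """
    candidates = [header_charset]
    match = META_CHARSET.search(raw[:4096])  # only look near the top
    if match:
        candidates.append(match.group(1).decode("ascii"))
    for name in candidates:
        if not name:
            continue
        try:
            # codecs.lookup() rejects names Python cannot decode with
            # and normalizes spellings such as "UTF8".
            return codecs.lookup(name).name
        except LookupError:
            continue
    return default
```

With the response object from `urllib.request.urlopen()`, the header value can be obtained through `fp.headers.get_content_charset()`, so a call might look like `mybytes.decode(detect_encoding(mybytes, fp.headers.get_content_charset()))`.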