Crawler Notes (9/23): Using the urllib Library

Source: Internet  Editor: 程序博客网  Time: 2024/05/18 11:17

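1. Reading content

1) file.read() reads the entire content and returns it as one value (bytes when file is a urlopen() response; call decode() to get a string)

2) file.readlines() reads the entire content and returns it as a list, one element per line

3) file.readline() reads a single line of the content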
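To make the difference between the three calls concrete, here is a minimal sketch. It assumes that file is the response object returned by urllib.request.urlopen(); the helper name read_three_ways and the data: URL are illustrative only and are used so the example runs without network access.

```python
import urllib.request

# Hypothetical sample: a data: URL carrying two lines, so no network is needed.
SAMPLE_URL = "data:text/plain,first%0Asecond%0A"


def read_three_ways(url=SAMPLE_URL):
    """Return what read(), readlines() and readline() each produce for url."""
    with urllib.request.urlopen(url) as file:
        whole = file.read()        # entire body as one bytes object
    with urllib.request.urlopen(url) as file:
        lines = file.readlines()   # entire body as a list of bytes, one per line
    with urllib.request.urlopen(url) as file:
        first = file.readline()    # only the first line
    return whole, lines, first
```

The URL is reopened for each call because every read consumes the response stream; a second read() on the same object would return nothing.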

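2. Simulating a browser

1) Modifying the header (the build_opener() approach)

    import urllib.request

    url = "URL of the page to crawl"
    # User-Agent value copied from any site's request headers (press F12 in the browser)
    headers = ("User-Agent", "User-Agent string found via F12")
    opener = urllib.request.build_opener()
    opener.addheaders = [headers]  # replaces the opener's default header list
    data = opener.open(url).read()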

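2) Adding a header (add_header())

    import urllib.request

    url = "URL of the page to crawl"
    req = urllib.request.Request(url)
    # User-Agent value copied from any page's request headers (press F12 in the browser)
    req.add_header("User-Agent", "User-Agent string found via F12")
    data = urllib.request.urlopen(req).read()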

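Both approaches serve the same purpose: avoiding a 403 (Forbidden) error, which some sites return when a request does not look like it comes from a browser.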
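The reason a header change helps can be checked without sending anything. The sketch below is only an illustration; the helper names default_user_agent and request_with_user_agent are made up for this note. It shows the User-Agent that an unmodified opener carries and how a Request object stores a replacement.

```python
import urllib.request


def default_user_agent():
    """Return the User-Agent that an unmodified opener would send."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) tuples
    return dict(opener.addheaders).get("User-agent")


def request_with_user_agent(url, user_agent):
    """Build a Request carrying the given User-Agent, without sending it."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

By default the value starts with "Python-urllib/", which is easy for a server to recognise as a script. Note that Request stores the header name capitalised as "User-agent", so that is the spelling to pass to req.get_header().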

2. Timeout setting

    urllib.request.urlopen(url_to_open, timeout=seconds)
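A timeout is only useful if the resulting exception is handled. The following is a minimal sketch under that assumption; fetch_with_timeout is a hypothetical helper name, and returning None on failure is just one possible design.

```python
import socket
import urllib.error
import urllib.request


def fetch_with_timeout(url, timeout):
    """Return the response body, or None if the request fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout):
        # A timeout while connecting is reported as URLError; a timeout while
        # waiting for the response may surface as socket.timeout directly.
        return None
```

A call such as fetch_with_timeout("http://www.baidu.com", 1) then either returns the page bytes or None, so a crawler loop can skip slow pages instead of hanging.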

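3. HTTP requests

1) GET request

· First, the page must use the GET method; the parameters are then visible in the URL in the form "field1=value1"

· Build a Request object with the corresponding URL as its argument

· Open the Request object with urlopen()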

    import urllib.request
    import urllib.parse

    url = "http://www.baidu.com/s?wd="
    key = "微微一笑"
    # Percent-encode the Chinese keyword with quote() to avoid an ASCII encoding error
    key_code = urllib.parse.quote(key)
    url_all = url + key_code
    req = urllib.request.Request(url_all)
    data = urllib.request.urlopen(req).read()
    # Save the raw response bytes to a local file
    with open("path/5.html", "wb") as fhandle:
        fhandle.write(data)
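When a GET request has more than one field, concatenating strings by hand becomes error-prone. As an alternative sketch (not the method used in these notes), urllib.parse.urlencode() can build the whole query string from a dict; build_get_request is a hypothetical helper name.

```python
import urllib.parse
import urllib.request


def build_get_request(base_url, params):
    """Return a GET Request whose query string is built from params."""
    # urlencode() percent-encodes non-ASCII values as UTF-8 by default
    query = urllib.parse.urlencode(params)
    return urllib.request.Request(base_url + "?" + query)
```

For example, build_get_request("http://www.baidu.com/s", {"wd": "微微一笑"}) yields a request that can be passed to urlopen() exactly like the one built with quote().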

2) POST request

    import urllib.request
    import urllib.parse

    url = "http://www.iqianyue.com/mypost/"
    # urlencode() builds the form body; encode() turns it into the bytes that Request expects
    postdata = urllib.parse.urlencode({
        "name": "ceo@iqianyue.com",
        "pass": "aA123456",
    }).encode("utf-8")
    req = urllib.request.Request(url, postdata)  # passing data makes this a POST request
    req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 LBBROWSER")
    data = urllib.request.urlopen(req).read()
    # Save the raw response bytes to a local file
    with open("path/6.html", "wb") as fhandle:
        fhandle.write(data)
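What turns the request into a POST is the data argument, and this can be verified before anything is sent. A minimal sketch, with build_post_request as a hypothetical helper name:

```python
import urllib.parse
import urllib.request


def build_post_request(url, fields):
    """Return a Request that will be sent as POST with a form-encoded body."""
    body = urllib.parse.urlencode(fields).encode("utf-8")  # Request needs bytes, not str
    return urllib.request.Request(url, data=body)
```

Calling get_method() on the result returns "POST", while the same Request built without data reports "GET"; the encoded body is available as the data attribute.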

4. Proxy server

    import urllib.request

    def use_proxy(proxy_addr, url):  # (proxy server address, URL of the page to crawl)
        proxy = urllib.request.ProxyHandler({"http": proxy_addr})  # {"http": proxy server address}
        opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
        urllib.request.install_opener(opener)  # later urlopen() calls go through this opener
        data = urllib.request.urlopen(url).read().decode("utf-8")
        return data

    proxy_addr = "110.73.41.125:8123"
    data = use_proxy(proxy_addr, "http://www.baidu.com")
    print(len(data))
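install_opener() changes the behaviour of every later urlopen() call in the program. If the proxy should apply to selected requests only, one option is to keep the opener as a local object and call its open() method. This is a sketch of that variation, not the approach from the notes; make_proxy_opener is a hypothetical name.

```python
import urllib.request


def make_proxy_opener(proxy_addr):
    """Return an opener that routes http requests through proxy_addr."""
    proxy = urllib.request.ProxyHandler({"http": proxy_addr})
    # No install_opener() here, so plain urlopen() calls stay unaffected
    return urllib.request.build_opener(proxy)
```

Usage would look like make_proxy_opener("110.73.41.125:8123").open("http://www.baidu.com", timeout=5), assuming the proxy address is still alive; free proxies often stop working, so a timeout is advisable.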

5. DebugLog

    import urllib.request

    # debuglevel=1 prints the request and response details while the program runs
    httphd = urllib.request.HTTPHandler(debuglevel=1)
    httpshd = urllib.request.HTTPSHandler(debuglevel=1)
    opener = urllib.request.build_opener(httphd, httpshd)
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen("http://edu.51cto.com")

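6. Exception handling: URLError

A URLError is raised when:

1) the server cannot be reached

2) the remote URL does not exist

3) there is no network connection

4) an HTTPError was triggered

    import urllib.request
    import urllib.error

    try:
        urllib.request.urlopen("http://blog.baidusss.net")
    except urllib.error.URLError as e:
        # HTTPError is a subclass of URLError, so only it carries a status code
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)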
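Because HTTPError is a subclass of URLError, the hasattr() checks can also be written as two except clauses, with the more specific class first. The sketch below is one possible arrangement; classify_failure and its return strings are made up for illustration.

```python
import urllib.error
import urllib.request


def classify_failure(url, timeout=5):
    """Return a short description of how opening url turned out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return "ok"
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 403 or 404
        return "HTTPError %d" % e.code
    except urllib.error.URLError as e:
        # No usable answer at all: bad host, refused connection, no network
        return "URLError: %s" % e.reason
```

The order of the clauses matters: if URLError came first, it would also catch every HTTPError and the status code branch would never run.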
