Web Crawler Self-Study Notes (Python 3.6.1)
Source: Internet  Editor: 程序博客网  Date: 2024/06/05 19:43
1. GET requests

#!/usr/bin/env python
# coding: utf8
import urllib.parse
import urllib.request

url = 'http://www.baidu.com/s'
data = {'wd': '王浩然'}
# Encode the query parameters into a URL query string
data = urllib.parse.urlencode(data)
full_url = url + '?' + data
response = urllib.request.urlopen(full_url)
result = response.read().decode('utf-8')
print(result)
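The urlencode() function converts its argument into a URL-encoded query string. If data were initialized as data='王浩然' instead, the call would fail, because urlencode() only accepts a mapping object or a sequence of two-element tuples. The official documentation says:
urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)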
Convert a mapping object or a sequence of two-element tuples, which may contain str or bytes objects, to a percent-encoded ASCII text string. If the resultant string is to be used as a data for POST operation with the urlopen() function, then it should be encoded to bytes, otherwise it would result in a TypeError.
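To make the accepted input types concrete, here is a small offline sketch. The sample key/value pairs ('a', 'b', 'x y') are made up for illustration; only urlencode() and parse_qs() from the standard library are used, and no request is sent.

```python
from urllib import parse

# Both a mapping and a sequence of two-element tuples are accepted.
query_from_dict = parse.urlencode({'wd': '王浩然'})
query_from_pairs = parse.urlencode([('a', '1'), ('b', 'x y')])
print(query_from_dict)
print(query_from_pairs)

# A plain string is rejected with TypeError.
error_message = None
try:
    parse.urlencode('王浩然')
except TypeError as exc:
    error_message = str(exc)
    print('TypeError:', error_message)

# For a POST body, the encoded string has to be converted to bytes first.
post_body = query_from_pairs.encode('ascii')
print(type(post_body))

# parse_qs() reverses the encoding, which is handy for checking the result.
print(parse.parse_qs(query_from_dict))
```

Passing post_body as the data argument of urlopen() or Request would turn the request into a POST, which is the case the quoted documentation warns about.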
2. Handling exceptions

#!/usr/bin/env python
# coding: utf8
import urllib.error
import urllib.request

# 'http://111' is a deliberately unreachable address used to trigger an error
request = urllib.request.Request('http://111')
try:
    response = urllib.request.urlopen(request)
    html = response.read()
    # print(html.decode('utf-8'))
except urllib.error.URLError as e:
    print(e.reason)
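URLError also has a subclass, HTTPError, which is raised when the server does answer but with an error status. The sketch below shows one possible way to tell the two apart. The helper names describe_error and fetch are my own, and the exception objects are built by hand so the example runs without a network connection.

```python
import urllib.error
import urllib.request


def describe_error(exc):
    """Return a short description of a urllib error."""
    # HTTPError is a subclass of URLError, so it has to be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return 'HTTP error {}: {}'.format(exc.code, exc.reason)
    if isinstance(exc, urllib.error.URLError):
        return 'URL error: {}'.format(exc.reason)
    return 'Unexpected error: {!r}'.format(exc)


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as exc:
        # Catches HTTPError as well, because it derives from URLError.
        print(describe_error(exc))
        return None


# Offline demonstration with hand-built exception objects.
not_found = urllib.error.HTTPError(
    'http://example.com/missing', 404, 'Not Found', None, None)
no_route = urllib.error.URLError('name resolution failed')
print(describe_error(not_found))
print(describe_error(no_route))
```

Calling fetch('http://111') should take the URLError branch, as in the script above.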
3. Headers: masquerading as a browser

#!/usr/bin/env python
# coding: utf-8
import urllib.request

url = "http://www.qiushibaike.com/hot/page/1"
# Send a browser-like User-Agent instead of urllib's default one
user_agent = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
}
request = urllib.request.Request(url, headers=user_agent)
response = urllib.request.urlopen(request)
result = response.read()
print(result.decode('utf-8'))
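To check which headers will be sent without contacting the site, the Request object can be inspected directly. This is only a sketch: the URL and the User-Agent string below are placeholders, not the ones used above.

```python
import urllib.request

# Placeholder target and User-Agent string, used only for illustration.
url = 'http://www.example.com/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'}

req = urllib.request.Request(url, headers=headers)

# Request stores header names with str.capitalize(), so the key
# 'User-Agent' reads back as 'User-agent'.
print(req.header_items())
print(req.get_header('User-agent'))

# More headers can be added after construction.
req.add_header('Referer', 'http://www.example.com/')
print(req.has_header('Referer'))
```

Without an explicit User-Agent, urllib identifies itself as Python-urllib followed by the Python version, which some sites reject.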
4. Using cookies
4.1 Getting cookies

#!/usr/bin/env python
# coding: utf-8
import urllib.request
import http.cookiejar

# Create a CookieJar instance to hold the cookies
cookie = http.cookiejar.CookieJar()
# Use urllib's HTTPCookieProcessor to create a cookie handler
handler = urllib.request.HTTPCookieProcessor(cookie)
# Build an opener from the handler
opener = urllib.request.build_opener(handler)
# opener.open() works like urllib's urlopen() and also accepts a Request
response = opener.open('http://www.baidu.com')
for item in cookie:
    print('Name = ' + item.name)
    print('Value = ' + item.value)
4.2 Saving cookies to a file

#!/usr/bin/env python
# coding: utf-8
import urllib.request
import http.cookiejar

# File in which the cookies are saved
filename = 'data.txt'
# Create a MozillaCookieJar instance to hold the cookies
cookie = http.cookiejar.MozillaCookieJar(filename)
handle = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handle)
response = opener.open('http://www.baidu.com')
# Save the cookies to the file
cookie.save(ignore_discard=True, ignore_expires=True)
4.3 Loading cookies from a file and making a request

#!/usr/bin/env python
# coding: utf-8
import urllib.request
import http.cookiejar

cookie = http.cookiejar.MozillaCookieJar()
# Load the cookies saved in section 4.2
cookie.load('data.txt', ignore_discard=True, ignore_expires=True)
req = urllib.request.Request('http://www.baidu.com')
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
response = opener.open(req)
print(response.read().decode('utf-8'))
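The ignore_discard flag in 4.2 and 4.3 matters because session cookies (those with no expiry time) are marked as "discard" and are skipped by save() unless the flag is set. The following offline sketch illustrates this with a hand-built cookie. The cookie name, value, and domain are invented, and the helpers make_session_cookie and round_trip are my own.

```python
import os
import tempfile
from http.cookiejar import Cookie, MozillaCookieJar


def make_session_cookie(name, value, domain):
    """Build a session cookie (no expiry) by hand; values are illustrative."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def round_trip(ignore_discard):
    """Save one session cookie, reload the file, and return {name: value}."""
    fd, filename = tempfile.mkstemp(suffix='.txt')
    os.close(fd)
    try:
        jar = MozillaCookieJar(filename)
        jar.set_cookie(make_session_cookie('session_id', 'abc123', 'example.com'))
        jar.save(ignore_discard=ignore_discard, ignore_expires=True)

        # Always load everything that made it into the file.
        loaded = MozillaCookieJar()
        loaded.load(filename, ignore_discard=True, ignore_expires=True)
        return {cookie.name: cookie.value for cookie in loaded}
    finally:
        os.remove(filename)


print(round_trip(ignore_discard=True))   # the session cookie survives
print(round_trip(ignore_discard=False))  # the session cookie is not written
```

If data.txt ends up containing only the header lines, a missing ignore_discard=True on save() is a likely cause.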
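A very basic URL crawler

import re
import urllib.error
import urllib.request
from collections import deque

queue = deque()
visited = set()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
# Compile the link pattern once, outside the loop
link_pattern = re.compile('href="(.+?)"')

url = 'https://www.python.org'
queue.append(url)

while queue:
    url = queue.popleft()
    visited.add(url)
    try:
        r = urllib.request.Request(url, headers=headers)
        req = urllib.request.urlopen(r)
    except urllib.error.URLError as e:
        print(e.reason)
        continue
    # Skip non-HTML responses; the header may be missing entirely
    if 'html' not in (req.getheader('Content-Type') or ''):
        continue
    try:
        data = req.read().decode('utf-8')
    except Exception:
        # Skip pages that cannot be read or decoded as UTF-8
        continue
    for x in link_pattern.findall(data):
        # Follow absolute http(s) links that have not been visited yet
        if x not in visited and x.startswith(('http://', 'https://')):
            queue.append(x)
            print(x)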
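The regular expression above only picks up absolute links and can queue the same address more than once. As one possible refinement (not part of the original notes), the sketch below extracts links with html.parser and resolves relative ones with urljoin. The names LinkParser and extract_links and the sample HTML are my own, and the example runs offline.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                self.hrefs.append(value)


def extract_links(base_url, html):
    """Return unique absolute http(s) links found in html, in page order."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    seen = set()
    for href in parser.hrefs:
        # Resolve relative links and drop the #fragment part.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(('http://', 'https://')) and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


sample = '''
<a href="/about/">About</a>
<a href="https://docs.python.org/3/">Docs</a>
<a href="/about/#team">Team</a>
<a href="mailto:someone@example.com">Mail</a>
'''
print(extract_links('https://www.python.org', sample))
```

In the crawler loop, extract_links(url, data) could stand in for link_pattern.findall(data); adding each link to visited at the moment it is queued would also avoid duplicate entries in the queue.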