Web Crawler Self-Study Notes (Python 3.6.1)


1. GET requests

#!/usr/bin/env python
# coding: utf-8
import urllib.parse
import urllib.request

url = 'http://www.baidu.com/s'
data = {'wd': '王浩然'}
data = urllib.parse.urlencode(data)  # percent-encode the query parameters
full_url = url + '?' + data
response = urllib.request.urlopen(full_url)
result = response.read().decode('utf-8')
print(result)

The urlencode() function converts its argument into URL (percent) encoding. Initializing data as data='王浩然' here would raise an error, because urlencode() accepts only a mapping object or a sequence of two-element tuples. The official description is attached below:

urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)
Convert a mapping object or a sequence of two-element tuples, which may contain str or bytes objects, to a percent-encoded ASCII text string. If the resultant string is to be used as a data for POST operation with the urlopen() function, then it should be encoded to bytes, otherwise it would result in a TypeError.
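As the quoted documentation notes, if the encoded string is to be used as POST data it must first be encoded to bytes. A minimal sketch of that case; httpbin.org (a public request-echo service) stands in for a real endpoint and is not part of the original notes:

#!/usr/bin/env python
# coding: utf-8
import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'  # stand-in test endpoint, not from the original notes
data = urllib.parse.urlencode({'wd': '王浩然'}).encode('utf-8')  # str -> bytes, required for POST
response = urllib.request.urlopen(url, data=data)  # passing data= makes this a POST request
print(response.read().decode('utf-8'))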

2. Exception handling

#!/usr/bin/env python
# coding: utf-8
import urllib.error
import urllib.request

request = urllib.request.Request('http://111')  # an unreachable host
try:
    response = urllib.request.urlopen(request)
    html = response.read()
    # print(html.decode('utf-8'))
except urllib.error.URLError as e:
    print(e.reason)
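A note on granularity: urllib.error.HTTPError is a subclass of URLError that also carries the HTTP status code, so it can be caught first for finer-grained handling. A sketch (the /some-missing-page path is made up for illustration):

#!/usr/bin/env python
# coding: utf-8
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://www.baidu.com/some-missing-page')
except urllib.error.HTTPError as e:
    print('HTTP error:', e.code)               # e.g. 404
except urllib.error.URLError as e:
    print('Failed to reach the server:', e.reason)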

3. Headers: masquerading as a browser

#!/usr/bin/env python
# coding: utf-8
import urllib.request

url = 'http://www.qiushibaike.com/hot/page/1'  # no stray leading space in the URL
user_agent = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
}
request = urllib.request.Request(url, headers=user_agent)
response = urllib.request.urlopen(request)
result = response.read()
print(result.decode('utf-8'))

4. Using cookies

4.1 Simple cookie retrieval

#!/usr/bin/env python
# coding: utf-8
import http.cookiejar
import urllib.request

# Declare a CookieJar instance to hold the cookies
cookie = http.cookiejar.CookieJar()
# Build a cookie handler with urllib's HTTPCookieProcessor
handler = urllib.request.HTTPCookieProcessor(cookie)
# Build an opener from the handler
opener = urllib.request.build_opener(handler)
# opener.open() works like urllib's urlopen(); it also accepts a Request object
response = opener.open('http://www.baidu.com')
for item in cookie:
    print('Name = ' + item.name)
    print('Value = ' + item.value)

4.2 Saving cookies to a file

#!/usr/bin/env python
# coding: utf-8
import http.cookiejar
import urllib.request

filename = 'data.txt'  # file in which to save the cookies
# Declare a MozillaCookieJar instance to hold the cookies
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# Save the cookies to the file
cookie.save(ignore_discard=True, ignore_expires=True)

4.3 Loading cookies from a file and using them in a request

#!/usr/bin/env python
# coding: utf-8
import http.cookiejar
import urllib.request

cookie = http.cookiejar.MozillaCookieJar()
cookie.load('data.txt', ignore_discard=True, ignore_expires=True)
req = urllib.request.Request('http://www.baidu.com')
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
response = opener.open(req)
print(response.read().decode('utf-8'))
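As a follow-up, urllib.request.install_opener() can register the cookie-aware opener globally, so that plain urlopen() calls also send the loaded cookies. A sketch, assuming the data.txt saved in section 4.2 exists:

#!/usr/bin/env python
# coding: utf-8
import http.cookiejar
import urllib.request

cookie = http.cookiejar.MozillaCookieJar()
cookie.load('data.txt', ignore_discard=True, ignore_expires=True)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
urllib.request.install_opener(opener)  # all later urlopen() calls go through this opener
response = urllib.request.urlopen('http://www.baidu.com')  # cookies sent automatically
print(response.getcode())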

A very basic crawler

import re
import urllib.error
import urllib.request
from collections import deque

queue = deque()   # URLs waiting to be crawled, breadth-first
visited = set()   # URLs already fetched
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}

url = 'https://www.python.org'
queue.append(url)

while queue:
    url = queue.popleft()
    visited |= {url}
    try:
        r = urllib.request.Request(url, headers=headers)
        req = urllib.request.urlopen(r)
    except urllib.error.URLError as e:
        print(e.reason)
        continue
    # Skip non-HTML responses; getheader() may return None
    content_type = req.getheader('Content-Type')
    if not content_type or 'html' not in content_type:
        continue
    try:
        data = req.read().decode('utf-8')
    except UnicodeDecodeError:
        continue
    # Extract href attributes with a regex
    da = re.compile('href="(.+?)"')
    link = da.findall(data)
    for x in link:
        # Follow only absolute http(s) links that are neither visited nor already queued
        if x not in visited and x not in queue and 'http' in x:
            queue.append(x)
            print(x)
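One limitation of the crawler above: the 'http' in x test silently drops relative links. A minimal, self-contained sketch of resolving them with urllib.parse.urljoin (the example href values are made up):

from urllib.parse import urljoin

page_url = 'https://www.python.org'                  # page on which the links were found
hrefs = ['/about/', 'https://docs.python.org/3/']    # hypothetical extracted href values
for href in hrefs:
    # urljoin() resolves a relative href against the page URL
    # and leaves absolute URLs unchanged
    print(urljoin(page_url, href))
# https://www.python.org/about/
# https://docs.python.org/3/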