Python CAPTCHA recognition, simulating AJAX requests, and scraping Youku VIP accounts (smirk)

Source: Internet  Editor: 程序博客网  Date: 2024/05/19 13:19

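I wanted to write a scraper for a site that shares Youku VIP accounts, but the site requires a CAPTCHA.
First, I used Chrome's developer tools to inspect the request that loads the CAPTCHA image.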


Then I pieced the URL together and requested it directly, only to find that access was restricted.
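As a small illustration of the URL piecing, here is a sketch of how the relative image path `../code.php?s=...` resolves against the page URL `http://vip.cengfan6.com/y/` (both appear in the code comments later in this post). The function name `captcha_url` is made up for this example.

```python
from urllib.parse import urljoin

PAGE_URL = "http://vip.cengfan6.com/y/"  # page that embeds the CAPTCHA
CAPTCHA_SRC = "../code.php?s=%d"         # relative image path used by the page


def captcha_url(seed):
    """Resolve the relative CAPTCHA path against the page URL."""
    return urljoin(PAGE_URL, CAPTCHA_SRC % seed)


print(captcha_url(992671249))
# prints http://vip.cengfan6.com/code.php?s=992671249
```

The `..` segment is collapsed during resolution, so this is the same resource as the `/y/../code.php` form used in the scraper code.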


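So the server is probably checking the request headers.
I copied the request headers from the CAPTCHA-refresh request, built the request myself, and wrote the image to disk.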

import random

import requests


def getyzm():
    # Request headers copied from the browser's CAPTCHA-refresh request
    headers = {
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        # Cookie:PHPSESSID=d763fd34e25925880c490955de8e0f2c
        'Host': 'vip.cengfan6.com',
        'Referer': 'http://vip.cengfan6.com/y/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest'
    }
    i = random.randint(1, 999999)
    print(i)
    url = 'http://vip.cengfan6.com/y/../code.php?s=%i' % i
    html = requests.get(url, headers=headers)
    # Write the image to disk
    with open('yzm.png', 'wb') as f:
        f.write(html.content)

Next came recognizing the CAPTCHA. I started with pytesser, which is really not easy to install (wry smile).
References:
http://www.th7.cn/Program/Python/201602/768304.shtml

http://m.blog.csdn.net/article/details?id=53537010

https://my.oschina.net/jhao104/blog/647326?fromerr=xJxwPW5X

That was too much trouble, so I switched to pytesseract.

Test:

import pytesseract
from PIL import Image

image = Image.open('c:/yzm.png')
code = pytesseract.image_to_string(image)
print(code)

Ah, it recognized English letters. But my CAPTCHA is all digits, orz.
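One thing that might help with letters showing up in a digits-only CAPTCHA is passing Tesseract a character whitelist through pytesseract's `config` argument. The sketch below is untested against this site's images. The helper names are made up for illustration, and depending on the Tesseract version and engine the whitelist may be ignored, which is why the result is also filtered afterward.

```python
import re


def tesseract_digit_config():
    """Build a Tesseract config string that only allows the characters 0-9."""
    # --psm 7: treat the image as a single line of text.
    # tessedit_char_whitelist: restrict the recognizer's alphabet.
    return "--psm 7 -c tessedit_char_whitelist=0123456789"


def keep_digits(text):
    """Drop everything that is not an ASCII digit (spaces, newlines, letters)."""
    return re.sub(r"[^0-9]", "", text)


def recognize_digits(path):
    """OCR the image at `path` and return only the digits found."""
    # Imported lazily so the helpers above work without the OCR stack installed.
    import pytesseract
    from PIL import Image

    with Image.open(path) as image:
        gray = image.convert("L")  # grayscale often helps on noisy CAPTCHAs
        raw = pytesseract.image_to_string(gray, config=tesseract_digit_config())
    return keep_digits(raw)


# Usage (needs Tesseract, pytesseract and Pillow installed):
# code = recognize_digits("yzm.png")
```

Even with the whitelist, a noisy or distorted CAPTCHA can still be misread, so falling back to manual entry stays a reasonable option.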

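I figured I could look into machine learning and train a model. Ah, but I don't know how yet, so I need to learn!
Reference for learning: http://www.cnblogs.com/beer/p/5672678.html
For now, let's do it by hand (sad).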

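import time

from PIL import Image


# Read the CAPTCHA (by hand, for now)
def viewyzm():
    print('please input yanzhengma')
    time.sleep(2)
    image = Image.open('yzm.png')
    image.show()
    # Close the image viewer first, then type what you saw
    yzm = input('Close the image, then enter the code: ')
    print(yzm)
    return yzm


getyzm()
viewyzm()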

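Next I ran into an AJAX request.
I found the request in Chrome.
Interestingly, refreshing the page requests the history, which returns the accounts and passwords fetched earlier.
I wrote two functions: one requests new accounts and passwords, the other requests the accounts and passwords from the history.
The site imposes a limit: you can only get 5 accounts. Even with a proxy I still got only 5. What? Isn't the limit per IP?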

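import re

import requests


def get_vip():
    # Request new accounts. The response is not decrypted here; the accounts
    # obtained can be read back from the history instead.
    headers = {
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        # 'Cookie':PHPSESSID=d3a9d9a7a9ad9fee71a9588773388ead
        'Host': 'vip.cengfan6.com',
        'Referer': 'http://vip.cengfan6.com/y/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest'
    }
    # Note: requests looks proxies up by URL scheme ('http', 'https'),
    # so a mapping keyed by IP like this one does not take effect.
    proxies = {
        '117.90.6.65': 9000
    }
    vip_url = 'http://vip.cengfan6.com/ajax.php?code=%s&typename=2' % viewyzm()
    viphtml = requests.get(vip_url, headers=headers, proxies=proxies)
    print(viphtml.content)


def get_host_vip():
    proxies = {
        '117.90.6.65': 9000
    }
    vip_url = 'http://vip.cengfan6.com/ajax_jilu.php?viptype=2'
    viphtml = requests.get(vip_url, proxies=proxies)
    # The pattern stays in Chinese because it has to match the page text.
    # Note: (土豆) is parsed as a capture group, so vip[0] is that group;
    # escape the parentheses if the page contains them literally.
    vips = re.findall('<p>优酷(土豆)帐号:(.+?)密码:(.+?)</p>', viphtml.content.decode('utf-8'))
    for vip in vips:
        print(vip[0] + ":" + vip[1])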
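The headers above carry a PHPSESSID cookie that is either commented out or hard-coded. Assuming the site ties the CAPTCHA to that session cookie (the post does not confirm this), one way to keep the CAPTCHA request and the AJAX request in the same session is `requests.Session`. This is only a sketch: `fetch_vip`, `vip_params` and the `read_code` callback are hypothetical names, and only three of the browser headers are carried over.

```python
import random

CAPTCHA_URL = "http://vip.cengfan6.com/code.php"
AJAX_URL = "http://vip.cengfan6.com/ajax.php"

# Header values taken from the post; the remaining browser headers are omitted.
BROWSER_HEADERS = {
    "Referer": "http://vip.cengfan6.com/y/",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}


def vip_params(code):
    """Query parameters for ajax.php; strip() avoids stray whitespace in the code."""
    return {"code": code.strip(), "typename": 2}


def fetch_vip(read_code, image_path="yzm.png"):
    """Fetch the CAPTCHA and submit the answer over one shared session.

    `read_code` is a hypothetical callback that takes the image path and
    returns the CAPTCHA text (typed by hand or produced by OCR).
    """
    import requests  # third-party; imported lazily so vip_params works without it

    with requests.Session() as session:
        session.headers.update(BROWSER_HEADERS)
        image = session.get(
            CAPTCHA_URL, params={"s": random.randint(1, 999999)}, timeout=10
        )
        with open(image_path, "wb") as f:
            f.write(image.content)
        # The session re-sends any cookie (e.g. PHPSESSID) set by the first response.
        reply = session.get(
            AJAX_URL, params=vip_params(read_code(image_path)), timeout=10
        )
        return reply.text
```

Passing the query through `params=` also lets requests handle the URL encoding, rather than formatting the code into the URL by hand.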

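My proxy setup was probably wrong: requests expects the proxies dict to map a URL scheme such as 'http' to a proxy URL, not an IP to a port.
Still, 5 is enough. I use VIP accounts from this site a lot. (manual smirk)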
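For reference, here is a sketch of the proxies mapping in the shape requests documents, with scheme keys and full proxy URLs as values. The IP and port are the ones from the post; whether that proxy is still alive, or can tunnel HTTPS, is unknown. `build_proxies` and `get_history` are names invented for this example.

```python
def build_proxies(host, port):
    """Return a proxies mapping keyed by URL scheme, as requests expects."""
    proxy_url = "http://%s:%d" % (host, port)
    return {"http": proxy_url, "https": proxy_url}


def get_history(proxies):
    """Request the history endpoint through the given proxies."""
    import requests  # third-party; imported lazily so build_proxies works without it

    return requests.get(
        "http://vip.cengfan6.com/ajax_jilu.php",
        params={"viptype": 2},
        proxies=proxies,
        timeout=10,
    )


print(build_proxies("117.90.6.65", 9000))
# prints {'http': 'http://117.90.6.65:9000', 'https': 'http://117.90.6.65:9000'}
```

If the 5-account limit still holds through a working proxy, the site is presumably counting by something other than IP, such as the session cookie.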

Full code, for the record:

# -*- coding: UTF-8 -*-
# ../code.php?s=992671249
# url='http://vip.cengfan6.com/y/'
import requests
from bs4 import BeautifulSoup
import random
from PIL import Image
import time
import re


# Fetch the CAPTCHA image
def getyzm():
    headers = {
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        # Cookie:PHPSESSID=d763fd34e25925880c490955de8e0f2c
        'Host': 'vip.cengfan6.com',
        'Referer': 'http://vip.cengfan6.com/y/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest'
    }
    i = random.randint(1, 999999)
    print(i)
    url = 'http://vip.cengfan6.com/y/../code.php?s=%i' % i
    html = requests.get(url, headers=headers)
    # Write the image to disk
    with open('yzm.png', 'wb') as f:
        f.write(html.content)


# Read the CAPTCHA (by hand)
def viewyzm():
    print('please input yanzhengma')
    time.sleep(2)
    image = Image.open('yzm.png')
    image.show()
    # Close the image viewer first, then type what you saw
    yzm = input('Close the image, then enter the code: ')
    return yzm


# Raw XHR request headers as copied from the browser
xhrhd = '''Accept-Encoding:gzip, deflate, sdch
Accept-Language:zh-CN,zh;q=0.8
Connection:keep-alive
Cookie:PHPSESSID=d3a9d9a7a9ad9fee71a9588773388ead
Host:vip.cengfan6.com
Referer:http://vip.cengfan6.com/y/
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36
X-Requested-With:XMLHttpRequest'''


def get_vip():
    # Request new accounts. The response is not decrypted here; the accounts
    # obtained can be read back from the history instead.
    headers = {
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        'Cookie': 'PHPSESSID=d3a9d9a7a9ad9fee71a9588773388ewd',
        'Host': 'vip.cengfan6.com',
        'Referer': 'http://vip.cengfan6.com/y/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest'
    }
    # Note: requests looks proxies up by URL scheme ('http', 'https'),
    # so a mapping keyed by IP like this one does not take effect.
    proxies = {
        '117.90.6.65': 9000
    }
    vip_url = 'http://vip.cengfan6.com/ajax.php?code=%s&typename=2' % viewyzm()
    viphtml = requests.get(vip_url, headers=headers, proxies=proxies)
    print(viphtml.content)


def get_host_vip():
    proxies = {
        '117.90.6.65': 9000
    }
    vip_url = 'http://vip.cengfan6.com/ajax_jilu.php?viptype=2'
    viphtml = requests.get(vip_url, proxies=proxies)
    # The pattern stays in Chinese because it has to match the page text.
    # Note: (土豆) is parsed as a capture group, so vip[0] is that group;
    # escape the parentheses if the page contains them literally.
    vips = re.findall('<p>优酷(土豆)帐号:(.+?)密码:(.+?)</p>', viphtml.content.decode('utf-8'))
    for vip in vips:
        print(vip[0] + ":" + vip[1])


getyzm()
get_vip()
get_host_vip()

Honestly, I didn't do this just to get VIP accounts. Mainly I wanted to write more things down. If I don't write, I forget easily.
