Getting Proxy IPs with Python (Ultimate Edition)

Source: Internet | Published by: 手机广告屏蔽软件 | Editor: 程序博客网 | Time: 2024/06/12 01:36

Related link: python获取代理IP (Getting Proxy IPs with Python)

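First, thanks to all the websites that provide free proxy IPs. The quality of the IPs can't be guaranteed, but technology changes life, so let's start "panning for gold" with a program.

I wrote a proxy IP fetcher once before (see the related link at the top). I was new to this at the time and missed some flaws that only showed up after a few runs. So I spent an afternoon, plus a late night until 3 a.m., reworking the program into a third version, and so far I'm fairly satisfied with the result.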

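1. Version 1 fetches IPs from the proxy site once, checks them, and then reuses them. An IP is removed from the list only after it has failed twice.

Pros: high IP utilization, few requests to the proxy site

Cons: accesses the site directly with the local IP; slow, one IP at a time

2. Version 2 fetches IPs from the proxy site once, uses each of them once, discards them all, and then fetches again.

Pros: fast

Cons: low IP utilization; the local IP is easily blacklisted by the server of the site the proxy IPs are scraped from, which then returns 503 Service Unavailable errors. I experienced this first-hand.

3. Version 3 fetches proxy IPs from the site and then turns around and uses those same proxy IPs to crawl the proxy site.

Pros: multithreaded and fast; the local IP doesn't get blacklisted by the server; the IP buffer is monitored automatically, so fetching starts when it drops below the lower limit and stops when it rises above the upper limit

Cons: obtaining the first "startup IP" is critical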
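The lower/upper limit behaviour described for version 3 is a hysteresis switch. Below is a minimal Python 3 sketch of just that logic; the class name `RefillGate` and the thresholds 100 and 400 are made up for illustration and do not appear in the original program.

```python
class RefillGate:
    """Hysteresis switch: start refilling below `low`, stop at or above `high`."""

    def __init__(self, low, high):
        if low >= high:
            raise ValueError("low must be smaller than high")
        self.low = low
        self.high = high
        self.refilling = False

    def update(self, pool_size):
        """Return True while the proxy pool should keep being refilled."""
        if pool_size < self.low:
            self.refilling = True      # buffer ran low: start fetching
        elif pool_size >= self.high:
            self.refilling = False     # buffer is full: stop fetching
        # between the two limits the previous state is kept
        return self.refilling


gate = RefillGate(low=100, high=400)   # hypothetical limits
states = [gate.update(n) for n in (50, 250, 400, 250, 99)]
print(states)
```

Keeping the previous state between the two limits is what stops the program from flipping between fetching and idling on every small change in pool size.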

getProxyIP_V3.py

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    # author: dasuda
    # CSDN blog: HelloHaibo
    # date: 2017.8.25
    import urllib2
    import re
    import socket
    import threading
    import global_para
    import targetURLs
    import random
    import datetime
    import time
    import user_agents
    import get_first_ip
    import os

    # thread lock protecting the shared lists in global_para
    lock = threading.Lock()

    # get ip and port from html
    def html_to_ip(html):
        a = re.compile(r'(?<=<td>)[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}')
        b = re.compile(r'(?<=<td>)[\d]{2,5}(?=</td>)')
        html_to_ip_ip_table = []
        findIP = re.findall(a, str(html))
        findPORT = re.findall(b, str(html))
        # zip() stops at the shorter list, so a missing port cannot raise IndexError
        for ip, port in zip(findIP, findPORT):
            html_to_ip_ip_table.append(ip + ":" + port)
        # html_to_ip_ip_table is a list of 'ip:port' strings
        return html_to_ip_ip_table

    # thread function
    # para1: 0 -> get (harvest proxies), 1 -> attack (visit url_get through the proxy)
    # para2: in mode 1, failing proxies are only dropped while the pool size is >= para2
    def get_one(url_get, good_ip, cnt, para1, para2):
        # open url timeout
        socket.setdefaulttimeout(5)
        try:
            # assemble ip like {'http': '1.1.1.1:8888'}
            assemble_ip = {'http': good_ip}
            proxy_support = urllib2.ProxyHandler(assemble_ip)
            # Use this opener directly. install_opener() is process-wide, so
            # threads would overwrite each other's proxy settings.
            openercheck = urllib2.build_opener(proxy_support)
            if para1 == 0:
                # pick one of the first three list pages at random
                temp_url_get = url_get + str(random.randint(1, 3))
            elif para1 == 1:
                temp_url_get = url_get
            request = urllib2.Request(temp_url_get)
            temp_agent = random.choice(user_agents.user_agents)
            request.add_header('User-Agent', temp_agent)
            content = openercheck.open(request).read()
            if para1 == 0:
                tt = html_to_ip(content)
                print 'this one is ok ', temp_url_get
                # "with" releases the lock even if an exception is raised
                with lock:
                    # only add new IPs while the pool is not larger than the target
                    if len(global_para.IP_data) <= cnt:
                        global_para.IP_data.extend(tt)
                    # remove duplicates
                    temp_tt = {}.fromkeys(global_para.IP_data).keys()
                    print 'temp_tt:', len(temp_tt)
                    global_para.IP_data = temp_tt
                openercheck.close()
            elif para1 == 1:
                openercheck.close()
        except Exception as e:
            with lock:
                # in get mode (para1 == 0) a failed proxy is simply ignored
                if para1 == 1 and len(global_para.IP_data) >= para2:
                    if good_ip in global_para.IP_error_list:
                        # second failure: drop the proxy
                        global_para.IP_error_list.remove(good_ip)
                        if good_ip in global_para.IP_data:
                            global_para.IP_data.remove(good_ip)
                        print 'now drop ip:%s' % good_ip
                        print 'proxy ip left:%d' % len(global_para.IP_data)
                    else:
                        # first failure: remember the proxy
                        global_para.IP_error_list.append(good_ip)
                        print 'ip:%s go to IP_error_list' % good_ip
                        print 'IP_error_list:', len(global_para.IP_error_list)

    def mul_thread_get(url_mul_get, get_counter, get_mode, go_refresh):
        # one thread per proxy currently in the pool
        threads = []
        for i in range(len(global_para.IP_data)):
            thread = threading.Thread(
                target=get_one,
                args=[url_mul_get, global_para.IP_data[i], get_counter, get_mode, go_refresh])
            threads.append(thread)
        for thr in threads:
            thr.start()
        for thread in threads:
            thread.join()
        if get_mode == 0:
            if len(global_para.IP_data) >= get_counter:
                print 'ok,get ip done'
                return 1
            else:
                print 'getting ip...'
                return 0
        elif get_mode == 1:
            if len(global_para.IP_data) <= go_refresh:
                return 0
            else:
                return 1
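html_to_ip collects IPs and ports with two independent regular expressions and pairs them by position, so any other two-to-five-digit table cell on the page can shift the ports out of alignment. One possible alternative, sketched here in Python 3, matches the IP cell and the port cell together. It assumes the port `<td>` directly follows the IP `<td>`; the sample markup and the name `html_to_proxies` are made up for illustration.

```python
import re

# One pattern for an IP cell immediately followed by a port cell.
ROW_RE = re.compile(
    r"<td>\s*(\d{1,3}(?:\.\d{1,3}){3})\s*</td>\s*<td>\s*(\d{2,5})\s*</td>"
)


def html_to_proxies(html):
    """Return unique 'ip:port' strings in page order."""
    seen = set()
    result = []
    for ip, port in ROW_RE.findall(html):
        proxy = f"{ip}:{port}"
        if proxy not in seen:          # skip duplicates, keep first occurrence
            seen.add(proxy)
            result.append(proxy)
    return result


# Hypothetical table rows; real proxy sites may use different markup.
sample = """
<tr><td>1.2.3.4</td><td>8080</td><td>HTTP</td></tr>
<tr><td>5.6.7.8</td>
    <td>3128</td><td>HTTP</td></tr>
<tr><td>1.2.3.4</td><td>8080</td><td>HTTP</td></tr>
"""
proxies = html_to_proxies(sample)
print(proxies)
```

Because each match carries its own port, a row without a port cell is skipped instead of shifting every later pair.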

global_para.py:

    # getProxyIP.py, this file is about global para
    IP_data = []
    IP_data_temp = []
    IP_data_checked = []
    findIP = []
    findPORT = []
    available_table = []
    IP_error_list = []
    csdn_url_cnt = 0
    error_cnt = 0
    start_time = 0L
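IP_error_list is what get_one uses for its two-strike rule: the first failure puts a proxy on the list, and the second removes it from IP_data. Here is a minimal Python 3 sketch of that rule in isolation; the function name `record_failure` is hypothetical, and the sketch leaves out the pool-size condition (`para2`) that the original also checks.

```python
def record_failure(pool, strikes, proxy):
    """Apply the two-strike rule; return True if `proxy` was dropped."""
    if proxy in strikes:
        # second failure: forget the strike and remove the proxy from the pool
        strikes.discard(proxy)
        if proxy in pool:
            pool.remove(proxy)
        return True
    # first failure: only remember it
    strikes.add(proxy)
    return False


pool = ["1.2.3.4:8080", "5.6.7.8:3128"]   # made-up addresses
strikes = set()
first = record_failure(pool, strikes, "1.2.3.4:8080")
second = record_failure(pool, strikes, "1.2.3.4:8080")
print(first, second, pool)
```

As in the original code, a successful request does not clear an earlier strike, so the two failures need not be consecutive.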

user_agents.py: stores the User-Agent headers; one is picked at random for each request

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    user_agents = [
        'Opera/9.25 (Windows NT 5.1; U; en)',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
        'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
        'Mozilla/5.0 (X11; U; linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
        # the comma after the next entry was missing, which silently merged it with the following string
        'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    ]
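For readers on Python 3, where urllib2 has become urllib.request, attaching a randomly chosen User-Agent might look like the sketch below. The helper `build_request` and the example URL are assumptions for illustration. The request is only constructed, not sent.

```python
import random
import urllib.request

# Two entries copied from the list above; any list of strings works.
USER_AGENTS = [
    "Opera/9.25 (Windows NT 5.1; U; en)",
    "Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9",
]


def build_request(url, agents=USER_AGENTS):
    """Create a Request carrying one randomly chosen User-Agent header."""
    request = urllib.request.Request(url)
    request.add_header("User-Agent", random.choice(agents))
    return request


req = build_request("http://example.com/")   # placeholder URL
# Request stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```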

About the mul_thread_get() function

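    # Get mode: stop fetching once at least 400 distinct IPs have been collected
    # Features: multithreaded, collected IPs are de-duplicated, no availability check (not needed)
    # Return value: 1 if the pool holds at least 400 IPs, otherwise 0
    rez = mul_thread_get('proxy IP list page of the proxy site', 400, 0, 0)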

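However, you first need at least one working proxy IP to "light the fire", haha. You can also modify the program slightly. Here is the idea: first access the proxy site with your local IP to obtain a "startup IP", then call mul_thread_get so that fetching starts whenever the number of IPs in global_para.IP_data drops below the specified value. Under normal conditions this keeps the IPs flowing continuously, and the site's server won't blacklist you.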
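The "startup IP" idea can be sketched as follows in Python 3. This is only one possible shape for it: `fetch_direct` and `bootstrap_pool` are hypothetical names, the URL is a placeholder, and the demonstration uses stand-in fetch and parse functions so that it runs without network access.

```python
import urllib.request


def fetch_direct(url, timeout=5):
    """Fetch `url` from the local IP; an empty ProxyHandler disables proxies."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def bootstrap_pool(pool, url, parse, fetch=fetch_direct):
    """Seed an empty `pool` in place with the proxies found by one direct request."""
    if not pool:
        for proxy in parse(fetch(url)):
            if proxy not in pool:
                pool.append(proxy)
    return pool


# Offline demonstration: a fake page body and str.split as the "parser".
fake_fetch = lambda url: "1.2.3.4:8080 5.6.7.8:3128 1.2.3.4:8080"
seeded = bootstrap_pool([], "http://proxy-site.invalid/page/1", str.split, fetch=fake_fetch)
print(seeded)
```

Only this single request goes out from the local IP; once the pool holds at least one working entry, later fetches can go through the proxies themselves.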
Usage example:

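    while True:
        # fill in the URL yourself; there are only a handful of free proxy sites
        rez = mul_thread_get('http://www.***.com/page/', 400, 0, 0)
        if rez == 0:
            time.sleep(1)
            continue
        else:
            # at this point global_para.IP_data holds the collected IPs
            print 'get over!!!!!!!!!'
            break  # stop once the pool is full instead of fetching forever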

Here is the output after running:

Ta-da! Pretty nice result, isn't it? See u next time~
