Python Image-Scraping Script

Source: Internet | Published by: 怎样制作软件 | Editor: 程序博客网 | Time: 2024/04/30 09:58

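My internship was getting dull, so two weeks ago I finally decided to look at Python, which everyone praises for its development speed. I wanted to write something with it, and I happened to see a similar demo in the Python section, so I went with that. I started on Saturday morning after a losing streak in Dota, and it was Sunday evening before the script ran. I pulled a few hundred MB from one site as a test and it held up. After two more days of patching it is now fairly stable. I am posting it here for criticism, since I had never written Python before. Enough talk, here it is.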

Note: written for Python 3.3.

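I use HTMLParser to analyze the page data and extract links, and a ScratchFactory class to filter the links and save the images. ScratchFactory inherits from threading.Thread, so each page is processed in its own thread, and the images on a page are downloaded concurrently as well.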

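Two global lists, UrlSrc and UrlDiged, hold the links that have been collected but not yet visited and the links that have already been visited.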
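As a side note on those two lists: checking whether a link is already in a list gets slower as the list grows. A minimal sketch of an alternative, assuming a pending queue plus a set of everything seen so far. Frontier and its method names are made up for this illustration and are not part of the script, and a lock would still be needed when several threads share it.

```python
from collections import deque


class Frontier:
    """Pending queue plus a 'seen' set, so each URL is queued at most once."""

    def __init__(self, seeds=()):
        self.pending = deque()   # plays the role of UrlSrc
        self.seen = set()        # everything ever queued (UrlSrc + UrlDiged)
        for url in seeds:
            self.add(url)

    def add(self, url):
        if url not in self.seen:      # constant-time membership test
            self.seen.add(url)
            self.pending.append(url)

    def next(self):
        """Return the next URL to visit, or None when nothing is pending."""
        return self.pending.popleft() if self.pending else None
```

A URL that was handed out by next() stays in seen, so adding it again later is a no-op.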

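More concurrency is not always better, though: if requests arrive too fast, the server may treat them as a DDoS attack and refuse connections. So the main thread caps the number of threads and waits a fixed interval between requests, to avoid a DDoS-like effect.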

Rate control in the main thread:

while True:
    if len(threading.enumerate()) > THREAD_NUM:
        time.sleep(0.1)            # too many threads: wait briefly instead of spinning
        continue
    mLock.acquire()
    if UrlSrc:
        temp = UrlSrc.pop(0)
        t = ScratchFactory(temp)
        UrlDiged.append(temp)
        t.start()
    mLock.release()
    # print pending links, thread count, and len(UrlSrc + UrlDiged) in thousands
    print("Connections:", len(UrlSrc), "*****threads:",
          len(threading.enumerate()), "****TableLength:",
          (len(UrlSrc) + len(UrlDiged)) / 1000)
    if time.localtime().tm_min % 2 == 0 \
            and time.time() - savetime > 60:
        save()                     # save the current state
        savetime = time.time()
    time.sleep(SLEEP_TIME)
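The loop above enforces the cap by polling threading.enumerate(). As a rough sketch of another way to get the same effect, and not what the script itself does, a semaphore can block the main thread until a slot is free. run_limited, worker, max_threads and delay are names invented for this example; max_threads stands in for THREAD_NUM and delay for SLEEP_TIME.

```python
import threading
import time


def run_limited(tasks, worker, max_threads=4, delay=0.0):
    """Run worker(task) for every task, never more than max_threads at once."""
    slots = threading.BoundedSemaphore(max_threads)
    threads = []

    def guarded(task):
        try:
            worker(task)
        finally:
            slots.release()          # free the slot even if worker raises

    for task in tasks:
        slots.acquire()              # blocks while max_threads workers are running
        t = threading.Thread(target=guarded, args=(task,))
        t.start()
        threads.append(t)
        time.sleep(delay)            # spacing between request starts
    for t in threads:
        t.join()                     # wait for the remaining workers
```

With this shape the main thread sleeps inside acquire() instead of re-checking the thread count in a loop.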

The two main classes:

MyHtmlParser

Inherits from HTMLParser. It parses the HTML text and extracts the title, the encoding, the links, and the image links.
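To give an idea of the approach, here is a much smaller parser than the one in the full listing. It is only a sketch: LinkCollector and its attribute names are hypothetical, and it looks at <a href> and <img src> only, without the regular-expression matching the real class does.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the page title, <a href> targets and <img src> targets."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data       # text between <title> and </title>


parser = LinkCollector()
parser.feed('<html><head><title>Demo - site</title></head>'
            '<body><a href="/a.html">a</a><img src="http://x/y.jpg"></body></html>')
print(parser.title, parser.links, parser.images)
```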

ScratchFactory

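Its main functions are:

addHeader: prepends the site root or the current directory to links extracted as relative paths, turning them into absolute paths

clearData: filters the extracted links and cleans up the title

saveImages: starts threads that call save to download the images

run: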

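The save path and file name for the images are taken from the title of the scraped page, and Chinese titles raise an encoding problem. The encoding is usually given in the HTTP response headers, but I did not know how to read the response headers in Python (please tell me if you do), so I had to look for it in the <meta> tag of the HTML text. (This is now solved: following @pjx2013's suggestion I use chardet. chardet did not run under Python 3, but someone has published a modified version.) Thanks, @pjx2013.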

http://www.cnblogs.com/dajianshi/archive/2012/12/18/2827083.html
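On the open question above about response headers: the object returned by urllib.request.urlopen has a headers attribute, which is an email.message.Message, and its get_content_charset() method returns the charset parameter of Content-Type in lower case, or None when the server does not send one. A small sketch follows; pick_charset is a hypothetical helper, and the header is built by hand so the example runs without a network connection. chardet remains useful as a fallback when the header carries no charset.

```python
from email.message import Message


def pick_charset(headers, default="utf-8"):
    """Return the charset declared in Content-Type, or default if none."""
    return headers.get_content_charset() or default


# Offline stand-in for response.headers. With a live request this would be:
#   resp = urllib.request.urlopen(url)
#   headers = resp.headers
headers = Message()
headers["Content-Type"] = "text/html; charset=GB2312"
print(pick_charset(headers))
```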

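feed here does not accept bytes, so the first 500 bytes are force-decoded as UTF-8 to find the declared charset, and then the whole page is decoded with it: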

conect = urllib.request.urlopen(self.url)    # download the page
data = conect.read()
conect.close()
htmlx = MyHtmlParser()
htmlx.feed(data[:500].decode('utf-8', 'ignore'))
t = htmlx.charset                            # encoding declared in the HTML
if t == '':
    t = 'gb2312'
htmlx.reset()
htmlx.feed(data.decode(t, 'ignore'))
self.title = htmlx.title

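All in all, MyHtmlParser is poorly written and could be improved. I picked HTMLParser out of laziness, thinking it would be convenient, and still ended up using a lot of regular expressions.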
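One place where the regular expressions could probably go is addHeader: the standard library's urllib.parse.urljoin already resolves root-relative, "./" and plain relative links against the URL of the page they came from. A minimal sketch, with absolutize and the example.com URLs made up for illustration:

```python
from urllib.parse import urljoin


def absolutize(page_url, links):
    """Resolve each link against the URL of the page it was found on."""
    return [urljoin(page_url, link) for link in links]


page = "http://www.example.com/gallery/index.html"
result = absolutize(page, ["/top.html", "./a.jpg", "b.jpg",
                           "http://other.example/c.png"])
print(result)
```

Links that are already absolute pass through unchanged, so no separate check for a leading "http" is needed.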

Source code:

Lines marked with '<-' can be changed as needed. If error 10054 keeps appearing, the request rate is probably too high.
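Error 10054 is the Windows socket code for a connection reset by the remote host. Besides lowering the request rate, one option is to retry a failed request after a growing pause. The following is only a sketch of that idea and is not in the script: fetch_with_backoff, its parameters, and the fetch callable are all hypothetical names.

```python
import time


def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on a connection error wait and retry, doubling the wait.

    fetch is any callable that returns the body or raises OSError
    (ConnectionResetError and urllib.error.URLError are OSError subclasses).
    """
    delay = base_delay
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except OSError:
            if attempt == retries:
                raise                # out of retries: let the caller see the error
            sleep(delay)
            delay *= 2
```

The sleep parameter exists only so the pause can be replaced in a test; in normal use the default time.sleep applies.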

# coding: utf-8
'''
Created on 2013-2-1
@author: 李鹏飞
@mailto: andres.lee4fun@gmail.com
Runtime: Python 3
'''
import os
import re
import threading
import time
import urllib.request
from html.parser import HTMLParser

import chardet

try:
    from html.parser import HTMLParseError
except ImportError:
    # HTMLParseError was removed in Python 3.5; fall back so the
    # except clauses below still resolve the name
    HTMLParseError = Exception

class MyHtmlParser(HTMLParser):
    # matches absolute image URLs found in attribute values
    IMG_ATTR = re.compile(r'http://.+\.(jpg|jpeg|png)')

    def __init__(self):
        HTMLParser.__init__(self)
        self.url = []            # href values of <a> tags
        self.img = []            # image URLs
        self.title = ''          # text of <title>
        self.in_title = False    # True between <title> and its text

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.url.append(value)
        elif tag == "title":
            self.in_title = True
        for name, value in attrs:
            if self.IMG_ATTR.match(str(value)):
                self.img.append(value)

    def handle_data(self, data):
        if self.in_title:
            self.title = data
            self.in_title = False
        # image URLs embedded in text or inline scripts
        self.img += re.findall(r'http://.+?\.jpg', data)

    def handle_startendtag(self, tag, attrs):
        # self-closing tags such as <img ... /> carry the same attributes
        self.handle_starttag(tag, attrs)

class ScratchFactory(threading.Thread):
    def __init__(self, url):
        threading.Thread.__init__(self)
        self.url = url
        self.tempImgs = []
        self.tempUrls = []
        self.title = ''
        global seed
        # directory of the current page, used to resolve relative links;
        # falls back to the site root when the URL is outside the seed
        self.pwd = seed + '/'
        match = re.search(re.escape(seed) + '.*/', url)
        if match:
            self.pwd = match.group()

    def addHeader(self, data):
        global seed
        for i in range(0, len(data)):
            if re.match("http.+", data[i]) is None:
                if re.match("/.*", data[i]):
                    data[i] = seed + data[i]            # root-relative link
                elif re.match(r'\./.*', data[i]):
                    data[i] = self.pwd + data[i][2:]    # "./x" form
                else:
                    data[i] = self.pwd + data[i]
        return data

    def run(self):
        try:
            conect = urllib.request.urlopen(self.url)    # download the page
            data = conect.read()
            conect.close()
            htmlx = MyHtmlParser()
            t = chardet.detect(data)                     # detect the HTML encoding
            if t['encoding']:
                charset = t['encoding']
            else:
                charset = 'utf-8'
            htmlx.feed(data.decode(charset, 'ignore'))
            self.title = htmlx.title
            self.tempUrls = self.addHeader(htmlx.url)    # make relative links absolute
            self.tempImgs = self.addHeader(htmlx.img)
            htmlx.close()
            self.clearData()                             # drop unwanted links
            threading.Thread(target=self.saveImages).start()  # download the images
        except HTMLParseError as e:
            print("####Error : 1 ######:", e, '--->', self.url)
        except Exception as e:
            print("####Error : 2 ######:", e, '--->', self.url)

        # queue links that are neither visited nor already pending
        global UrlSrc, UrlDiged, mLock
        with mLock:
            for temp in self.tempUrls:
                if temp not in UrlDiged and temp not in UrlSrc:
                    UrlSrc.append(temp)

    def clearData(self):
        # remove duplicate links
        self.tempUrls = set(self.tempUrls)
        self.tempImgs = set(self.tempImgs)
        global seed
        t = []
        for temp in self.tempUrls:                    # <- link filter (regex)
            if re.match(re.escape(seed) + "/.*", temp):
                t.append(temp)
        self.tempUrls = t

        t = []
        for temp in self.tempImgs:                    # <- image filter (regex)
            if re.match(r".+\.(gif|jpg|png)", temp):
                t.append(temp)
        self.tempImgs = t
        # <- split the page title and keep the leading keyword
        title = self.title if isinstance(self.title, str) else ''
        self.title = re.split('(-|_)', title)[0]
        # strip whitespace and characters that are illegal in file names
        self.title = re.sub(r'[ /\\:|?*<>\r\n\t]', '', self.title)

    def save(self, path, url):
        global MinSize
        try:
            req = urllib.request.Request(url)
            req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1) "
                           "AppleWebKit/537.11 (KHTML, like Gecko) "
                           "Chrome/23.0.1271.64 Safari/537.11")
            # some sites block hotlinking, so send a Referer header
            req.add_header("Referer", self.url)
            conect = urllib.request.urlopen(req)
            t = conect.read()
            conect.close()
            if len(t) < MinSize:                   # skip small images
                return
            # exist_ok avoids a race between concurrent save threads
            os.makedirs(path, exist_ok=True)
            name = self.title + time.strftime("%H%M%S", time.localtime()) + ".jpg"
            with open(os.path.join(path, name), "wb") as f:
                f.write(t)
        except HTMLParseError as e:
            print("####Error : 3 ######:", e, '--->', url)
        except Exception as e:
            print("####Error : 4 ######:", e, '--->', url)

    def saveImages(self):
        global IMG_TIME
        global SAVE_PATH
        if len(self.tempImgs) == 0:
            return
        path = os.path.join(SAVE_PATH, self.title)
        print("Download------->", self.title)
        workers = []
        while len(self.tempImgs) != 0:
            t = threading.Thread(target=self.save,
                                 args=(path, self.tempImgs.pop(0)))
            t.start()
            workers.append(t)
            if len(self.tempImgs) != 0:
                time.sleep(IMG_TIME)       # pace the image requests
        for t in workers:
            t.join()                       # wait for every download, not only the last
        # remove the directory if nothing was saved into it
        if os.path.exists(path) and len(os.listdir(path)) == 0:
            os.rmdir(path)


def save():
    global mLock
    global UrlSrc
    global SAVE_PATH
    mLock.acquire()
    try:
        with open(SAVE_PATH + r"\UrlDiged.txt", 'w') as f:
            for i in UrlDiged:
                f.write(i + '\n')

        with open(SAVE_PATH + r"\UrlSrc.txt", 'w') as f:
            for i in UrlSrc:
                f.write(i + '\n')

        print("********************* Saved **********************")
    except Exception as e:
        print(e)
    finally:
        mLock.release()


def readBackup():
    global UrlDiged
    global UrlSrc
    try:
        with open(SAVE_PATH + r"\UrlDiged.txt", 'r') as f:
            for line in f:
                UrlDiged.append(line.rstrip('\n'))
        with open(SAVE_PATH + r"\UrlSrc.txt", 'r') as f:
            for line in f:
                UrlSrc.append(line.rstrip('\n'))
    except Exception as e:
        print(e)


#*****************************start********************************


if __name__ == '__main__':

    #timeout = 20
    #socket.setdefaulttimeout(timeout)
    seed = "http://www.xxxx.com/"    # <- root page of the site
    SAVE_PATH = r"e:\scratch"        # <- output directory
    THREAD_NUM = 35                  # <- thread cap: limits download speed, avoids DDoS-like load
    SLEEP_TIME = 2.5                 # <- seconds between page requests; faster is not always better
    MinSize = 72000                  # <- skip images smaller than this (bytes), initially 32k
    IMG_TIME = 1.5                   # <- seconds between image downloads, initially 1.5
    UrlSrc = [seed]                  # links collected but not yet visited
    UrlDiged = []                    # links already visited
    mLock = threading.Lock()         # lock guarding UrlSrc and UrlDiged
    savetime = time.time()

    if not os.path.exists(SAVE_PATH):
        os.mkdir(SAVE_PATH)
    if seed[-1:] == '/':
        seed = seed[:-1]

    # restore the state of the previous run
    if not os.path.exists(SAVE_PATH + r'\UrlDiged.txt'):
        try:
            f = open(SAVE_PATH + r'\UrlDiged.txt', 'w')
            f.close()
            f = open(SAVE_PATH + r'\UrlSrc.txt', 'w')
            f.close()
        except Exception as e:
            print(e)
    else:
        readBackup()

    while True:
        if len(threading.enumerate()) > THREAD_NUM:
            time.sleep(0.1)            # too many threads: wait briefly instead of spinning
            continue
        mLock.acquire()
        if UrlSrc:
            temp = UrlSrc.pop(0)
            t = ScratchFactory(temp)
            UrlDiged.append(temp)
            t.start()
        mLock.release()
        # print pending links, thread count, and len(UrlSrc + UrlDiged) in thousands
        print("Connections:", len(UrlSrc), "*****threads:",
              len(threading.enumerate()), "****TableLength:",
              (len(UrlSrc) + len(UrlDiged)) / 1000)
        if time.localtime().tm_min % 2 == 0 \
                and time.time() - savetime > 60:
            save()                     # save the current state
            savetime = time.time()
        time.sleep(SLEEP_TIME)
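
The title clean-up in clearData can also be written as a single substitution. This is a minimal sketch of that step on its own, assuming the same character set as the listing; title_to_dirname and the fallback name "untitled" are my own additions for illustration, not part of the script.

```python
import re

# Characters clearData removes: space, / \ : | ? * < > and CR, LF, TAB
ILLEGAL = re.compile(r'[ /\\:|?*<>\r\n\t]')


def title_to_dirname(title, fallback="untitled"):
    """Keep the part of the title before the first '-' or '_' and strip
    characters that cannot appear in a Windows file name."""
    head = re.split(r'[-_]', title, maxsplit=1)[0]
    cleaned = ILLEGAL.sub('', head)
    return cleaned or fallback       # avoid an empty directory name


print(title_to_dirname("My Page: demo? - Some Site"))
```

The fallback matters because a page with no title, or a title that starts with a separator, would otherwise produce an empty directory name.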