Implementing a Simple Multi-threaded Web Crawler


I have been studying web crawlers these past few days. There is plenty of crawler code online, as well as some excellent frameworks such as Scrapy, but to get some hands-on practice I spent a day writing a simple crawler of my own. A crawler, simply put, just fetches, analyzes, and fetches again. Of course, a real large-scale project is far more complex: it may need to be distributed, and the analysis stage gets much harder. This little crawler only downloads the articles of a single person's CSDN blog, for example http://blog.csdn.net/apple_boys?viewmode=contents, and only the first page at that, though you could download everything by modifying the Spider class inside.

The overall framework I designed looks like this:

At the start, the initial URL is placed into the URL list and Scheduing takes over. Based on the current urllist and the maximum thread count, Scheduing decides whether to spawn a new thread; each new thread processes the URL that Scheduing assigns to it. While running, a thread may add new URLs to the list, so the list must be thread-safe. When no URLs are left (I check whether the dispatch cursor equals the length of the list, which means every URL has been handed out) and no thread is still running, Scheduing stops scheduling.
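To make the termination condition concrete, here is a minimal, self-contained sketch of that loop. It is only illustrative: the names (urllist, cursor, ThreadCountNow, ThreadNumMax) mirror the real Scheduing class shown later, and the worker here is a stub that just sleeps instead of downloading and analyzing.

import time
import threading

# Sketch only: names mirror the Scheduing class below; the worker is a stub.
urllist = ['http://blog.csdn.net/yueqian_scut?viewmode=contents']
cursor = 0              # how many URLs have been dispatched so far
ThreadCountNow = 0      # how many worker threads are still running
ThreadNumMax = 10
lock = threading.RLock()

def worker(url):
    global ThreadCountNow
    time.sleep(0.1)          # stand-in for download + analysis (may append to urllist)
    with lock:
        ThreadCountNow -= 1  # this worker is done

while len(urllist) != cursor or ThreadCountNow != 0:
    if ThreadCountNow < ThreadNumMax and cursor < len(urllist):
        with lock:
            ThreadCountNow += 1
        threading.Thread(target=worker, args=(urllist[cursor],)).start()
        cursor += 1
    time.sleep(1)
# the loop ends only when every URL was dispatched and no worker is still running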

My class diagram is as follows:


It is all fairly simple, so without further explanation, here is the code:

setting.py

When writing a program I like to keep a settings class or text file, which makes things easy to configure (you can skip it if you prefer). This configuration file holds only a couple of items: whether to use a proxy, and the list of proxies.

USE_PROXY = False
# USE_PROXY = True
PROXY_LIST = ['42.121.28.111:3128', '42.121.105.155:8888']

main.py

The program's main entry point; simple enough to need no explanation.

# -*- coding:utf-8 -*-
import Scheduing

if __name__ == "__main__":
    # initial URL, maximum number of threads
    scheduing = Scheduing.Scheduing("http://blog.csdn.net/yueqian_scut?viewmode=contents", 10)
    scheduing.run()

Scheduing.py

The URL-scheduling module. Its main job is to start new threads and hand each one a URL to download and analyze.

import mythread
import time
import threading

lockthreadNum = threading.RLock()


class Scheduing(object):
    '''a new thread is created for every URL, which may hurt performance'''
    ThreadNumMax = 30
    urllist = []
    ThreadCountNow = 0

    def __init__(self, url, ThreadNum=10):  # ,spider,downloader):
        self.urllist.append(url)
        self.cursor = 0
        # self.spider = spider
        # self.downloader = downloader
        self.locklist = threading.RLock()
        if ThreadNum < self.ThreadNumMax:
            self.ThreadNumMax = ThreadNum
        # keep the "http://host" prefix so relative links can be made absolute
        self.domainstop = url.find('/', 7)
        self.domain = url[:self.domainstop]
        print self.domain

    def run(self):
        while len(self.urllist) != self.cursor or self.ThreadCountNow != 0:
            # print len(self.urllist)
            if self.ThreadCountNow < self.ThreadNumMax and self.cursor < len(self.urllist):
                lockthreadNum.acquire()
                self.ThreadCountNow += 1
                lockthreadNum.release()
                self.threadid = mythread.mythread(str(self.cursor), self)
                self.threadid.seturl(self.urllist[self.cursor])
                self.threadid.start()
                self.cursor += 1
                # print 'list length:', len(self.urllist)
                print 'cursor', self.cursor
            # print self.cursor
            time.sleep(1)
        print self.urllist
        # dump every collected URL to a file once scheduling is finished
        file = open('./1.txt', 'w')
        for urlliststr in self.urllist:
            file.write(urlliststr)
            file.write('\n')
        file.close()

mythread.py

The thread class; it just calls the downloader and the spider, so it needs little explanation.

import threading
import Scheduing
import Spider
import downloader


class mythread(threading.Thread):
    def __init__(self, threadname, sched):
        self.scheduing = sched
        self.threadname = threadname
        threading.Thread.__init__(self, name=threadname)

    def seturl(self, urlpath):
        self.urlpath = urlpath

    def run(self):
        print 'threadid:', self.threadname
        # print self.threadname, ':', self.urlpath
        # download the assigned page, then hand it to the spider for analysis
        web = downloader.downloader(self.urlpath)
        response = web.downloadWebPage()
        self.spider = Spider.mySpider(response, self.scheduing)
        self.spider.analysis()
        # this thread is done: decrement the running-thread counter
        Scheduing.lockthreadNum.acquire()
        self.scheduing.ThreadCountNow -= 1
        # self.scheduing.urllist.append()
        Scheduing.lockthreadNum.release()

downloader.py

The downloader class. It is somewhat cluttered: downloading blog pages alone would only need a GET request, but since I wanted to reuse it on other sites I also implemented a POST method, and to guard against IP problems I added proxy support, so there is more code here than strictly necessary.

#-*- encoding=utf-8 -*-
import urllib2
import cookielib
import socket
import urllib
import time
import setting
import random
import os
from urlparse import urlparse


class downloader():
    webpage = None
    response = None

    def __init__(self, url='', title=''):
        self.title = title
        self.urlpath = url
        self.initSession = False
        self.pageinfo = {}  # per-instance, so concurrent downloads do not share state

    def saveToFile(self, content):
        file = open(self.title, 'wb+')
        file.write(content)
        file.close()

    def Cookie(self):
        # build an opener with cookie support, optional proxy, and browser-like headers
        firsturl = 'http://www.baidu.com'
        socket.setdefaulttimeout(20)
        self.agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)'
        self.cookie = cookielib.CookieJar()
        self.cookie_support = urllib2.HTTPCookieProcessor(self.cookie)
        if setting.USE_PROXY:
            ip_str = random.choice(setting.PROXY_LIST)
            print ip_str
            self.webpage = urllib2.build_opener(self.cookie_support, urllib2.ProxyHandler({'http': ip_str}), urllib2.HTTPRedirectHandler)
        else:
            self.webpage = urllib2.build_opener(self.cookie_support, urllib2.HTTPRedirectHandler)
        self.webpage.addheaders = [('User-agent', self.agent), ('Accept', '*/*'), ('referer', firsturl)]
        urllib2.install_opener(self.webpage)
        # response = urllib2.urlopen(self.urlpath)  # first request just to pick up a cookie

    def download(self, values):
        if not self.initSession:
            self.Cookie()
            self.initSession = True
        if values:
            data = urllib.urlencode(values)
            request = urllib2.Request(self.urlpath, data)  # POST when form values are given
        else:
            request = urllib2.Request(self.urlpath)        # plain GET otherwise
        # print self.urlpath
        tryCount = 0
        while tryCount < 5:
            try:
                # res = self.webpage.open(self.urlpath)
                res = urllib2.urlopen(request)
                # print res.info()
                self.content = res.read()
                self.pageinfo['url'] = res.geturl()
                self.pageinfo['info'] = res.info()
                self.pageinfo['status'] = res.getcode()
                self.pageinfo['body'] = self.content
                print 'read over'
            except Exception as e:
                print str(e), 'try again'
                tryCount += 1
                time.sleep(2)
            else:
                tryCount = 5
        if self.title != '':
            self.saveToFile(self.content)
        else:
            return
        # res.close()

    def downloadFile(self):
        # POST method
        values = {'username': '', 'password': '', 'imageField.x': '', 'imageField.y': ''}
        self.download(values)

    def downloadWebPage(self):
        self.download(values={})
        # print self.pageinfo['url']
        return self.pageinfo


if __name__ == '__main__':
    # standalone test stub: fill in a URL (and a title to save to) before running
    down = downloader("", '')
    down.downloadFile()

Spider.py

This class analyzes the downloaded page; it can extract whatever content you want, or hand the URLs you want crawled next back to Scheduing.

import downloader
import re
import os


class Spider(object):
    def __init__(self, response, sche):
        # print response
        self.pageinfo = response
        self.content = response['body']
        self.sched = sche

    def analysis(self):
        pass

    def addTolist(self):
        # urllist is shared by all threads, so guard it with the scheduler's lock
        self.sched.locklist.acquire()
        if not self.url_get.startswith('http'):
            self.url_get = self.sched.domain + self.url_get
        # print self.url_get
        if self.url_get in self.sched.urllist:
            # print len(self.sched.urllist)
            pass
        else:
            self.sched.urllist.append(self.url_get)
        self.sched.locklist.release()


# write your own spider below
class mySpider(Spider):
    def analysis(self):
        # save self.content to disk, using a file name derived from the URL
        title = self.pageinfo['url'].replace('?', '#').replace('\\', '@').replace('/', '+').replace(':', '-')
        pwd = os.getcwd()
        file = open(os.path.join(pwd, title), 'wb+')
        file.write(self.content)
        file.close()
        # walk through every '"link_title"' anchor and collect its href
        self.index = self.content.find('"link_title"')
        while self.index > 0:
            self.indexhref = self.content[self.index:].find("href")
            # print self.indexhref
            self.indexurl = self.content[self.indexhref + self.index + 1:].find('"')
            # print self.indexurl
            self.indexurllast = self.content[self.indexhref + self.index + self.indexurl + 2:].find('"')
            # print self.indexurllast
            self.index_ga = self.indexhref + self.index + self.indexurl + 2
            self.index_gb = self.indexhref + self.index + self.indexurl + self.indexurllast + 2
            self.url_get = self.content[self.index_ga:self.index_gb]
            self.addTolist()
            # continue scanning after the URL we just extracted
            self.content = self.content[self.index_gb:]
            self.index = self.content.find('"link_title"')
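The index arithmetic in mySpider.analysis is a little hard to follow. For comparison, here is a sketch of the same extraction done with a regular expression instead of repeated str.find calls. It rests on the same assumption as the code above, namely that each article link sits in an element carrying class="link_title" followed by an <a href="...">; the helper name extract_article_links and the pattern are only illustrative, not part of the original code.

# -*- coding:utf-8 -*-
import re

# Assumed markup (as in the string-searching version): ... class="link_title"><a href="...">
LINK_TITLE_HREF = re.compile(r'"link_title"[^>]*>\s*<a\s+href="([^"]+)"', re.IGNORECASE)

def extract_article_links(html, domain):
    urls = []
    for href in LINK_TITLE_HREF.findall(html):
        if not href.startswith('http'):
            href = domain + href   # make relative links absolute
        if href not in urls:
            urls.append(href)      # de-duplicate, keeping first-seen order
    return urls

Inside mySpider.analysis this would replace the whole while loop: each URL returned by extract_article_links(self.content, self.sched.domain) would then be appended to the scheduler's urllist under locklist, exactly as addTolist does.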

With everything above in place, a simple crawler is done. A run looks like the screenshot below (the printed output is jumbled because of the multiple threads):



Full source download: http://download.csdn.net/detail/apple_boys/7396509

Early morning, May 25, 2014, at the Xixi campus of Zhejiang University
