[python] Scraping the text and images of 《活学活用wxPython》 (wxPython in Action) from the 啄木鸟 (Woodpecker) community

Source: Internet; Editor: 程序博客网; Time: 2024/04/30 11:16

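Please see crifan's blog series on how to use Python, C#, and other languages to scrape static pages, scrape dynamic pages, and simulate logging in to a website. It is the most thorough series on crawling and simulated login that I have seen; it summarizes and organizes a great deal, and I learned a lot from it. (Edited 20130105, 瑾诚)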

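I may not always have internet access over the New Year holiday, so I decided to scrape 《活学活用wxPython》 for offline reading and get some practice along the way.

Thanks to 啄木鸟 (Woodpecker) and pyug; I have always taught myself from the open-source books hosted by the community.

Two other blog posts deserve a mention: AstralWind's Python threading guide and deerchao's 30-minute introduction to regular expressions. Both were very helpful while I was writing this script.

The code is not very well written, and I hope readers will offer suggestions.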

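The full code is on github.

Because the script was written specifically for 《活学活用wxPython》, it is highly targeted: all the regular expressions were written for this one site and are not general-purpose.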

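Approach

1. Get the list of urls for all chapters

2. Get the addresses of all images in each chapter

3. Download each chapter: to avoid downloading anything twice, put the urls in a dictionary, {url:True}, and flip the value to False once a url has been claimed

4. Download the images: these are tracked in a dictionary in the same way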
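The {url:True} bookkeeping from steps 3 and 4 can be sketched as follows. This is a minimal Python 3 illustration rather than the post's Python 2 code; the helper name `claim_next` and the two example paths are made up for the sketch. The point is that checking the flag and flipping it have to happen under one lock shared by every thread.

```python
import threading

def claim_next(pending, lock):
    """Return one key still marked True and flip it to False, or None if none is left."""
    with lock:  # the check and the flip must happen as a single step
        for key, todo in pending.items():
            if todo:
                pending[key] = False
                return key
    return None

# Hypothetical chapter paths, used only for illustration.
pending = {"/moin/Example/Chapter1": True, "/moin/Example/Chapter2": True}
lock = threading.Lock()   # one lock shared by every worker
claimed = []

def worker():
    key = claim_next(pending, lock)
    if key is not None:
        with lock:
            claimed.append(key)

# More workers than urls: the extra workers simply find nothing to claim.
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(claimed))
```

Each url ends up claimed by exactly one worker, even though four workers were started for two urls.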

Code

0. Libraries used

# -*- coding:gb2312 -*-
from sgmllib import SGMLParser
import urllib2, re
from urllib import urlretrieve
import threading
import time
import os

1. Get the chapter list

class URLLister(SGMLParser):
    """
    urls:chapter url
    """
    match = ''      # regex a chapter href must match
    dicurl = {}     # {url: True}, True means not downloaded yet

    def start_a(self, attrs):
        for k, v in attrs:
            if k == 'href' and re.match(self.match, v):
                self.dicurl[v] = True

This subclasses SGMLParser and overrides start_a, using a regular expression to keep only the urls we are looking for.
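sgmllib only exists in Python 2. For readers on Python 3, roughly the same idea can be sketched with the standard library's html.parser.HTMLParser. The class name `LinkLister` and the sample HTML (including the chapter names in it) are assumptions made for this illustration, not part of the original script.

```python
import re
from html.parser import HTMLParser

class LinkLister(HTMLParser):
    """Collect href values of <a> tags whose value matches a pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.urls = {}            # per-instance dict, {url: True}

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and self.pattern.match(value):
                self.urls[value] = True

# A made-up page fragment for illustration.
sample = (
    '<a href="/moin/WxPythonInAction/ChapterOne">1</a>'
    '<a href="/moin/FrontPage">home</a>'
    '<a href="/moin/WxPythonInAction/ChapterTwo">2</a>'
)
lister = LinkLister(r"/moin/WxPythonInAction/Chapter")
lister.feed(sample)
print(sorted(lister.urls))
```

Keeping the dictionary on the instance (created in `__init__`) rather than on the class means two listers never share results by accident.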

2. Get the image list

class IMGLister(SGMLParser):
    search = ''     # regex an image src must contain
    dicimg = {}     # {src: True}, True means not downloaded yet

    def start_img(self, attrs):
        for k, v in attrs:
            if k == 'src' and re.search(self.search, v):
                self.dicimg[v] = True

3. Download the chapter content

g_mutex = threading.Lock()     # thread lock shared by all spider threads

def spider(baseurl, dicurl, filepath):
    url = ''
    # claim one url that has not been downloaded yet
    with g_mutex:
        for k in dicurl:
            if dicurl[k]:
                dicurl[k] = False
                url = baseurl + k
                break
    if url != '':
        content = urllib2.urlopen(url).read()
        filename = url.split('/')[-1]      # last path segment of the url
        filepath += filename + '.html'
        try:
            fw = open(filepath, 'w')
            fw.write(content)
            print filepath + ' written successfully'
            fw.close()
        except IOError, e:
            print e
    time.sleep(1)
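The file-naming step in spider (take the last segment of the url and append .html) can be tried without any network access. The sketch below is Python 3, and the helper names `chapter_filename` and `save_page` as well as the ChapterOne url are illustrative assumptions; the page bytes are a stand-in for what urlopen would return.

```python
import os
import tempfile

def chapter_filename(url):
    """Use the last path segment of the url as the file name."""
    return url.rsplit("/", 1)[-1] + ".html"

def save_page(content, directory, url):
    """Write already-downloaded bytes to directory/<last segment>.html."""
    path = os.path.join(directory, chapter_filename(url))
    with open(path, "wb") as fw:   # 'with' closes the file even if write fails
        fw.write(content)
    return path

demo_dir = tempfile.mkdtemp()      # throwaway directory for the demo
saved = save_page(
    b"<html>demo</html>",
    demo_dir,
    "http://wiki.woodpecker.org.cn/moin/WxPythonInAction/ChapterOne",
)
print(os.path.basename(saved))
```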

4. Download the image files

g_img_mutex = threading.Lock()   # thread lock shared by all spiderimg threads

def spiderimg(baseurl, dicimg, filepath):
    url = ''
    # claim one image that has not been downloaded yet
    with g_img_mutex:
        for k in dicimg:
            if dicimg[k]:
                dicimg[k] = False
                url = baseurl + k
                break
    if url != '':
        downfile(url, filepath)
    time.sleep(1)

def downfile(netpath, localpath):
    # the file name is whatever follows "target=" in the attachment url
    filenamerule = re.compile(r'(?<=\btarget\b=)(.*\..*)$')
    filenameres = re.search(filenamerule, netpath)
    filename = filenameres.group(0)
    try:
        urlretrieve(netpath, localpath + filename)
        print localpath + filename + ' saved successfully'
    except IOError, e:
        print e
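To see what the lookbehind pattern in downfile extracts, it can be run on its own. This is a Python 3 sketch; the function name `attachment_name` and the example url (chapter and file name) are made up for illustration, and unlike downfile it returns None instead of failing when the url has no target= part.

```python
import re

# Same pattern as in downfile: the text after "target=" that contains a dot.
FILENAME_RE = re.compile(r'(?<=\btarget\b=)(.*\..*)$')

def attachment_name(url):
    """Return the attachment file name from the url, or None if there is none."""
    m = FILENAME_RE.search(url)
    return m.group(0) if m else None

# Hypothetical attachment url for illustration.
url = "/moin/WxPythonInAction/ChapterOne?action=AttachFile&do=get&target=figure1.gif"
print(attachment_name(url))
```

The lookbehind `(?<=...)` requires "target=" to sit immediately before the match without making it part of the result, which is why group(0) is already the bare file name.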

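5. Putting it together

The code below is rather messy. It mainly starts threads to download the chapters and then starts threads to download the images; to save effort I did not wrap it in functions.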

#begin
url_base = 'http://wiki.woodpecker.org.cn'
print 'Opening page...' + url_base + '/moin/WxPythonInAction'
content = urllib2.urlopen(url_base + '/moin/WxPythonInAction').read()
print 'Searching for href...'
lister = URLLister()
lister.match = '/moin/WxPythonInAction/Chapter'
lister.feed(content)
dicurl = lister.dicurl
threadpool = []
print 'Directory to save files (such as d:\\docs\\)'
filepath = raw_input()
'''
filepathrule = re.compile(r'\\$')
res = re.search(filepathrule,filepath)
if res.group(0):
    filepath += '\\'
print filepath
'''
try:
    os.makedirs(filepath)
    print 'Folder did not exist, created it'
except OSError:
    print 'Folder exists, continuing'
# one thread per chapter; each thread claims one url from dicurl
for k in lister.dicurl:
    th = threading.Thread(target = spider, args = (url_base, dicurl, filepath))
    threadpool.append(th)
for th in threadpool:
    th.start()
for th in threadpool:
    th.join()
print 'Chapters downloaded, starting on the images'
folder = 'images\\'
if not os.path.exists(filepath + folder):
    os.makedirs(filepath + folder)
for k in dicurl:
    url = url_base + k
    content = urllib2.urlopen(url).read()
    imglister = IMGLister()
    imglister.search = r'/moin/WxPythonInAction/\bChapter\w+\b\?action=AttachFile\&do=get\&target=(.*\..*)$'
    imglister.feed(content)
    dicimg = imglister.dicimg
    '''
    folderrule = re.compile(r'\bChapter\w+\b')
    for val in dicimg:
        folderres = re.search(folderrule, val)
        folder = folderres.group(0)
        folder += '\\'
        break
    if not os.path.exists(filepath + folder):
        os.makedirs(filepath + folder)
    '''
    # one thread per image found on this chapter page
    threadpool2 = []
    for val in imglister.dicimg:
        th = threading.Thread(target = spiderimg, args = (url_base, dicimg, filepath + folder))
        threadpool2.append(th)
    for th in threadpool2:
        th.start()
    for th in threadpool2:
        th.join()
    print k + ' images downloaded'
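The start-all-then-join-all pattern used here can also be expressed with a thread pool. The sketch below is Python 3 and swaps in concurrent.futures.ThreadPoolExecutor in place of the hand-managed threads and flag dictionary; it is an alternative, not what the script above does. `fetch` is a stand-in that fabricates page bytes so the example runs offline, and `download_all` and the chapter names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Stand-in for the real download; returns fake page bytes."""
    return ("<html>%s</html>" % url).encode("utf-8")

def download_all(urls, fetch_one, workers=4):
    """Fetch every url once using a small pool; return {url: content}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map hands each url to exactly one worker, so no flag dict is needed
        contents = list(pool.map(fetch_one, urls))
    return dict(zip(urls, contents))

# Hypothetical chapter paths for illustration.
chapter_urls = [
    "/moin/WxPythonInAction/ChapterOne",
    "/moin/WxPythonInAction/ChapterTwo",
]
pages = download_all(chapter_urls, fetch)
print(sorted(pages))
```

Leaving the `with` block waits for all workers, which plays the role of the join loops above.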

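6. Problems

1. The images should really be downloaded at the same time as the chapters, but my attempt at that failed and needs more investigation

2. The images should go into a folder per chapter rather than all into a single images folder

3. The save path has to be entered as d:\docs\ and not d:\docs, because the check for a trailing backslash has not been added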
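Problems 2 and 3 could perhaps be handled together along the following lines. This is a Python 3 sketch under stated assumptions: `image_dir` is a hypothetical helper, the example url is made up, and the pattern is the same `\bChapter\w+\b` that appears in the disabled folderrule code. os.path.join inserts a separator only when one is missing, so the root works with or without a trailing backslash.

```python
import os
import re

CHAPTER_RE = re.compile(r"\bChapter\w+\b")   # same pattern as the disabled folderrule

def image_dir(root, img_url):
    """Return root/<ChapterName> for an image url, or root/images as a fallback."""
    m = CHAPTER_RE.search(img_url)
    folder = m.group(0) if m else "images"
    return os.path.join(root, folder)

# Hypothetical attachment url for illustration.
img_url = "/moin/WxPythonInAction/ChapterOne?action=AttachFile&do=get&target=figure1.gif"
with_sep = image_dir("docs" + os.sep, img_url)   # root entered with a trailing separator
without_sep = image_dir("docs", img_url)         # root entered without one
print(with_sep == without_sep)
```

The returned path would still need to be created (for example with os.makedirs) before saving images into it.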

Appendix

Attached is a table of simple Python regular expression syntax. Last modified 2013-01-15.

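Syntax   Description                                        Example
.        Matches any character except the newline \n        b.c matches bac, bdc
*        Matches the preceding character 0 or more times    b*c matches c or bbbc
+        Matches the preceding character 1 or more times    b+c matches bc or bbbc
?        Matches the preceding character 0 or 1 times       b?c matches c or bc
{m}      Matches the preceding character m times            b{2}c matches bbc
{m,n}    Matches the preceding character m to n times       b{2,5}c matches bbc or bbbbc
[abc]    Matches any one character inside []                [bc] matches b or c
\d       Matches a digit, [0-9]                             b\dc matches b1c, etc.
\D       Matches a non-digit, equivalent to [^\d]           b\Dc matches bAc
\s       Matches a whitespace character                     b\sc matches b c
\S       Matches a non-whitespace character, [^\s]          b\Sc matches bac
\w       Matches [A-Za-z0-9_]                               b\wc matches bAc, etc.
\W       Equivalent to [^\w]                                b\Wc matches b c
\        Escape character                                   b\\c matches b\c
^        Matches the start of the string                    ^bc matches bc at the start
$        Matches the end of the string                      bc$ matches a string ending in bc
\A       Matches only at the start of the string            \Abc matches bc at the start of the string
\Z       Matches only at the end of the string              bc\Z matches bc at the end of the string
|        Matches either the left or the right expression    b|c matches b or c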
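A few rows of the table can be checked directly with the re module. This is a small Python 3 sketch; the `checks` list is my own selection of pattern and text pairs, and re.fullmatch is used so that each pattern has to match the whole sample string.

```python
import re

# (pattern, text, whether the whole text is expected to match)
checks = [
    (r"b.c", "bac", True),
    (r"b*c", "bbbc", True),
    (r"b+c", "c", False),       # + needs at least one b
    (r"b?c", "bc", True),
    (r"b{2}c", "bbc", True),
    (r"b{2,5}c", "bbbbc", True),
    (r"b\dc", "b1c", True),
    (r"b\Dc", "bAc", True),
    (r"b\sc", "b c", True),
    (r"b\wc", "b_c", True),
    (r"b\\c", "b\\c", True),    # the text is b, backslash, c
    (r"\Abc", "abc", False),    # bc is not at the start
    (r"b|c", "c", True),
]

for pattern, text, expected in checks:
    got = re.fullmatch(pattern, text) is not None
    print(pattern, repr(text), got)
```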