Python + Ghost: Scraping Images from Dynamic Web Pages and Simulating the Page's GET Request

Source: Internet. Editor: 程序博客网. Time: 2024/05/01 17:26

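Last time we covered how to scrape images from the 豆瓣妹子 (Douban Meizi) and 暴走漫画 (Baozou Manhua) pages. Those pages are static, so a few lines of code solve the problem, because each image's src is already present in the page's original HTML. (How exactly 暴走漫画 and 糗事百科 generate their static pages automatically remains to be discussed.) The advantage of a static page is that it loads extremely fast.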

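However, not all scraping is this simple. Some pages are dynamic: the image elements are generated by JavaScript, and the original HTML contains no src information for them. So we want Python to simulate a browser loading the JavaScript and to return the page as it looks after the scripts have run, so that the src becomes visible. Once we know where the image is stored, we can download it locally, right? (Actually, even with the link you may not be able to fetch it; more on that below.)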
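To make the static/dynamic difference concrete, here is a minimal sketch using Python 3's standard-library html.parser (not the BeautifulSoup or Ghost packages used later in this post). The two HTML snippets, the class name ImgSrcCollector and the function find_img_sources are made up for illustration; only the element id 'comic' is borrowed from the script further down. It shows nothing more than this: parsing the raw HTML finds a src on a static page and finds none on a page whose image is filled in by JavaScript.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag found in raw HTML."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_img_sources(html):
    """Return the list of image URLs visible without running any JavaScript."""
    parser = ImgSrcCollector()
    parser.feed(html)
    return parser.sources


# Hypothetical static page: the src is written directly in the HTML.
STATIC_HTML = '<div><img id="comic" src="/images/page_001.jpg"></div>'

# Hypothetical dynamic page: the src is only set once the script runs.
DYNAMIC_HTML = (
    '<div><img id="comic"></div>'
    '<script>document.getElementById("comic").src = "/images/page_001.jpg";</script>'
)

print(find_img_sources(STATIC_HTML))
print(find_img_sources(DYNAMIC_HTML))
```

The second call comes back empty because the parser never executes the script, which is the gap a JavaScript-capable loader such as Ghost is meant to fill.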

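Some sites use various methods to keep people from downloading their images, for intellectual property reasons among others. Take the comic sites 爱漫画 (imanhua) and 腾讯漫画 (Tencent Comics). The former generates its images dynamically as described above, so when you open a comic page the image loads slowly because it is produced by JavaScript (they are not going to let you grab it easily). The latter is trickier: it loads images through Flash, and scraping them would require Python to simulate Flash. That part is left for later study.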

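Continuing from above: even once Python can load a page with JavaScript and obtain the image element's src, requesting that src directly returns a 404. Take a comic page from 全职猎人 (Hunter x Hunter) on 爱漫画 as an example. Using the browser's F12 developer tools I found the image's src attribute, but when I pasted that link into the browser it reported a 404, page not found. Why? This is clearly the right address, and it does not change across repeated page refreshes. (Don't tell me you can see the image; that is the browser cache. Clear the cache and try again.) The reason shows up if you capture the traffic while the page loads: the GET request that fetches the image carries the following information: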

GET /Files/Images/76/59262/imanhua_001.jpg HTTP/1.1
Accept: image/png, image/svg+xml, image/*;q=0.8, */*;q=0.5
Referer: http://www.imanhua.com/comic/76/list_59262.html
Accept-Language: zh-CN
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
Accept-Encoding: gzip, deflate
Host: t6.mangafiles.com
Connection: Keep-Alive

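Here you only need to simulate this GET request to obtain the image. The site filters GET requests and returns the image only when the request comes from its own pages, so we have to add the information above to the request headers. In testing, adding only Referer: http://www.imanhua.com/comic/76/list_59262.html was enough. The URL in that header is the URL of the current page.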
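As a rough illustration of the Referer trick, here is a small sketch. It uses Python 3's urllib.request as a stand-in for the Python 2 urllib2 module used in the rest of this post; the function names build_image_request and fetch_image are invented for the example. Building the request needs no network; actually downloading depends on the site still being reachable and still applying the same filter, which is not guaranteed.

```python
import urllib.request


def build_image_request(img_url, page_url):
    """Return a GET request for img_url that names page_url as its Referer."""
    req = urllib.request.Request(img_url)
    req.add_header("Referer", page_url)
    return req


def fetch_image(img_url, page_url, timeout=30):
    """Download the image bytes (requires network access)."""
    req = build_image_request(img_url, page_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


# Same URLs as in the captured request above.
req = build_image_request(
    "http://t6.mangafiles.com/Files/Images/76/59262/imanhua_001.jpg",
    "http://www.imanhua.com/comic/76/list_59262.html",
)
print(req.get_method(), req.full_url)
print(req.get_header("Referer"))
```

Keeping request construction separate from the download makes it possible to check the headers without touching the network.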

That covers the principle behind the implementation. Now let's look at the packages used:

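1. BeautifulSoup: extracts element information from a static page. We use it to get the URLs of all chapters of a comic on 爱漫画, then, from each chapter URL, the chapter's total page count and the URL of every page.

2. Ghost: loads each page URL with its JavaScript executed, returns the page source after loading, and lets us read the src attribute of the image tag (see the Ghost official site).

3. urllib2: simulates the GET request, using add_header to add the Referer parameter, and retrieves the returned image.

4. chardet: solves the garbled-text (encoding) problem of the page.

We will illustrate these four steps in turn, again using the 爱漫画 site as the example: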

1. Enter the comic number, then use BeautifulSoup to get all chapters and the sub-page URLs under each chapter

webURL = 'http://www.imanhua.com/'
cartoonNum = raw_input("Enter the comic number: ")
basicURL = webURL + u'comic/' + cartoonNum

# Get the comic title (html is the decoded page source produced in step 4)
soup = BeautifulSoup(html)
cartoonName = soup.find('div',class_='share').find_next_sibling('h1').get_text()
print u'Downloading comic: ' + cartoonName

# Create the output folder under the current working directory
path = os.getcwd()
new_path = os.path.join(path,cartoonName)
if not os.path.isdir(new_path):
    os.mkdir(new_path)

# Parse the URLs of all chapters
chapterURLList = []
chapterLI_all = soup.find('ul',id = 'subBookList').find_all('a')
for chapterLI in chapterLI_all:
    chapterURLList.append(chapterLI.get('href'))
    #print chapterLI.get('href')

# Iterate over the chapter URLs
for chapterURL in chapterURLList:
    chapter_soup = BeautifulSoup(urllib2.urlopen(webURL+str(chapterURL),timeout=120).read())
    chapterName = chapter_soup.find('div',id = 'title').find('h2').get_text()
    print u'Downloading chapter: ' + chapterName
    # Read the total page count from the pager at the bottom of the page
    allChapterPage = chapter_soup.find('strong',id = 'pageCurrent').find_next_sibling('strong').get_text()
    print allChapterPage
    # Then walk through every page, build its URLs, and save the image
    currentPage = 1
    fetcher = FetcherCartoon()
    uurrll = str(webURL+str(chapterURL))
    imgurl = fetcher.getCartoonUrl(uurrll)
    if imgurl is not None:
        while currentPage <= int(allChapterPage):
            wholeurl = str(webURL+str(chapterURL)+u'?p='+str(currentPage))
            page = "%03d"%(currentPage)
            # Swap the three-digit page number into the first image's URL
            url = str(imgurl[:-7] + str(page) + imgurl[-4:])
            print wholeurl
            print url
            GetImageContent(wholeurl,url)
            currentPage += 1
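The loop above only runs Ghost once per chapter and derives every other image URL by string slicing. The sketch below isolates that step in Python 3 so it can be checked on its own. The function name page_image_urls is invented here, and it relies on the same assumption the script makes: the first image's filename ends in a three-digit page number followed by a four-character extension such as ".jpg". If a chapter used a different naming scheme, the slicing would produce wrong URLs.

```python
def page_image_urls(first_img_url, total_pages):
    """Build one image URL per page from the URL of the first page's image.

    Assumes the URL ends with "NNN.ext": a zero-padded three-digit page
    number plus a four-character extension (for example "001.jpg").
    """
    prefix = first_img_url[:-7]   # everything before "001.jpg"
    extension = first_img_url[-4:]  # ".jpg"
    return [prefix + "%03d" % page + extension
            for page in range(1, total_pages + 1)]


urls = page_image_urls(
    "http://t6.mangafiles.com/Files/Images/76/59262/imanhua_001.jpg", 3
)
for url in urls:
    print(url)
```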

2. Using the page URL obtained in step 1, load the page dynamically with Ghost and read the src of the page's image

# Use Ghost to run the page's JavaScript and get the dynamically generated image src
class FetcherCartoon:
    def getCartoonUrl(self,url):
        if url is None:
            return None
        #todo many decide about url
        try:
            ghost = Ghost()
            #open webkit
            ghost.open(url)
            #execute javascript and get what you want
            page, resources = ghost.wait_for_page_loaded()
            result, resources = ghost.evaluate("document.getElementById('comic').getAttribute('src');", expect_loading=True)
            del resources
        except Exception,e:
            print e
            return None
        return result

3. Simulate the GET request with urllib2 and write the image to disk

# Simulate the GET request for imgurl (with wholeurl as Referer) and save the image content
def GetImageContent(wholeurl,imgurl):
    time.sleep(0.1)
    req = urllib2.Request(imgurl)
    req.add_header('Referer', wholeurl)
    content = urllib2.urlopen(req).read()
    # Strip characters that are illegal in file names
    rstr = r"[\/\\\:\*\?\"\<\>\|]"  # '/\:*?"<>|'
    new_title = re.sub(rstr, "", str(imgurl)[-20:])
    with open(cartoonName+'/'+new_title,'wb') as code:
        code.write(content)
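The file name in GetImageContent comes from the last 20 characters of the image URL with the characters `/ \ : * ? " < > |` removed. Here is that step alone as a Python 3 sketch; the helper name safe_filename and its tail parameter are made up for the example, and the character set is the one from the script.

```python
import re

# Characters the script removes before using the text as a file name.
_ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|]')


def safe_filename(img_url, tail=20):
    """Take the last `tail` characters of the URL and drop illegal characters."""
    return _ILLEGAL_CHARS.sub("", img_url[-tail:])


name = safe_filename(
    "http://t6.mangafiles.com/Files/Images/76/59262/imanhua_001.jpg"
)
print(name)
```

Because the cut is a fixed 20 characters, part of the directory name can end up in the file name, as it does here. Taking the text after the last "/" instead would be one alternative.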

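4. Solve the garbled-text problem with chardet

# Fix garbled text: detect the encoding and normalise the page to UTF-8
html_1 = urllib2.urlopen(basicURL,timeout=120).read()
mychar = chardet.detect(html_1)
bianma = mychar['encoding']
if bianma == 'utf-8' or bianma == 'UTF-8':
    html = html_1
else :
    # Anything not detected as UTF-8 is treated as gb2312
    html = html_1.decode('gb2312','ignore').encode('utf-8')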
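For readers without chardet installed, the same idea can be approximated with the standard library alone. The Python 3 sketch below does not detect the encoding the way chardet does; it simply tries UTF-8 first and falls back to a Chinese encoding if that fails. The function name decode_html and the choice of gb18030 as the fallback (a superset of gb2312) are assumptions of this example, not part of the original script.

```python
def decode_html(raw, fallback="gb18030"):
    """Decode page bytes as UTF-8, falling back to `fallback` on failure."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Undecodable bytes are dropped, mirroring the 'ignore' in the script.
        return raw.decode(fallback, "ignore")


utf8_page = "漫画".encode("utf-8")
gb_page = "漫画".encode("gb2312")

print(decode_html(utf8_page))
print(decode_html(gb_page))
```

This fallback only suits pages that are known to be either UTF-8 or a GB-family encoding; for arbitrary sites a real detector such as chardet remains the safer choice.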

The complete code is as follows:

# -*- coding:utf8 -*-
import urllib2,re,os,time
import chardet
import cookielib,httplib,urllib
from bs4 import BeautifulSoup
from ghost import Ghost

webURL = 'http://www.imanhua.com/'
cartoonNum = raw_input("Enter the comic number: ")
basicURL = webURL + u'comic/' + cartoonNum

# Use Ghost to run the page's JavaScript and get the dynamically generated image src
class FetcherCartoon:
    def getCartoonUrl(self,url):
        if url is None:
            return None
        #todo many decide about url
        try:
            ghost = Ghost()
            #open webkit
            ghost.open(url)
            #execute javascript and get what you want
            page, resources = ghost.wait_for_page_loaded()
            result, resources = ghost.evaluate("document.getElementById('comic').getAttribute('src');", expect_loading=True)
            del resources
        except Exception,e:
            print e
            return None
        return result

# Fix garbled text: detect the encoding and normalise the page to UTF-8
html_1 = urllib2.urlopen(basicURL,timeout=120).read()
mychar = chardet.detect(html_1)
bianma = mychar['encoding']
if bianma == 'utf-8' or bianma == 'UTF-8':
    html = html_1
else :
    html = html_1.decode('gb2312','ignore').encode('utf-8')

# Get the comic title
soup = BeautifulSoup(html)
cartoonName = soup.find('div',class_='share').find_next_sibling('h1').get_text()
print u'Downloading comic: ' + cartoonName

# Simulate the GET request for imgurl (with wholeurl as Referer) and save the image content
def GetImageContent(wholeurl,imgurl):
    #time.sleep(0.1)
    req = urllib2.Request(imgurl)
    req.add_header('Referer', wholeurl)
    content = urllib2.urlopen(req).read()
    rstr = r"[\/\\\:\*\?\"\<\>\|]"  # '/\:*?"<>|'
    new_title = re.sub(rstr, "", str(imgurl)[-20:])
    with open(cartoonName+'/'+new_title,'wb') as code:
        code.write(content)

# Create the output folder under the current working directory
path = os.getcwd()
new_path = os.path.join(path,cartoonName)
if not os.path.isdir(new_path):
    os.mkdir(new_path)

# Parse the URLs of all chapters
chapterURLList = []
chapterLI_all = soup.find('ul',id = 'subBookList').find_all('a')
for chapterLI in chapterLI_all:
    chapterURLList.append(chapterLI.get('href'))
    #print chapterLI.get('href')

# Iterate over the chapter URLs
for chapterURL in chapterURLList:
    chapter_soup = BeautifulSoup(urllib2.urlopen(webURL+str(chapterURL),timeout=120).read())
    chapterName = chapter_soup.find('div',id = 'title').find('h2').get_text()
    print u'Downloading chapter: ' + chapterName
    # Read the total page count from the pager at the bottom of the page
    allChapterPage = chapter_soup.find('strong',id = 'pageCurrent').find_next_sibling('strong').get_text()
    print allChapterPage
    # Then walk through every page, build its URLs, and save the image
    currentPage = 1
    fetcher = FetcherCartoon()
    uurrll = str(webURL+str(chapterURL))
    imgurl = fetcher.getCartoonUrl(uurrll)
    if imgurl is not None:
        while currentPage <= int(allChapterPage):
            wholeurl = str(webURL+str(chapterURL)+u'?p='+str(currentPage))
            page = "%03d"%(currentPage)
            url = str(imgurl[:-7] + str(page) + imgurl[-4:])
            print wholeurl
            print url
            GetImageContent(wholeurl,url)
            currentPage += 1
        print "~~~~~~~~~~~~~~~~~~~~~~~~~~END~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"

# Added so the window does not close immediately when the script is run by double-clicking
raw_input("Press <Enter> To Quit!")

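What still needs improvement:

If the JavaScript page loads too slowly, a TimeoutError occurs. The timeout is not handled specifically yet (the script just prints the exception and skips the chapter); this is left for a later improvement.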
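One possible direction for that improvement is to retry the slow page load a few times before giving up. The Python 3 sketch below is generic and does not call Ghost: call_with_retries and flaky_page_load are hypothetical names, and the built-in TimeoutError stands in for whatever exception Ghost actually raises, which would need to be checked against the installed version.

```python
import time


def call_with_retries(func, attempts=3, delay=1.0, retry_on=(Exception,)):
    """Call func() up to `attempts` times, sleeping `delay` seconds between tries.

    Returns func()'s result on the first success; re-raises the last error
    if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)
    raise last_error


# Stand-in for a slow page load: fails twice, then succeeds.
call_log = []


def flaky_page_load():
    call_log.append(1)
    if len(call_log) < 3:
        raise TimeoutError("page took too long to load")
    return "/Files/Images/76/59262/imanhua_001.jpg"


result = call_with_retries(flaky_page_load, attempts=3, delay=0,
                           retry_on=(TimeoutError,))
print(result, "after", len(call_log), "attempts")
```

In the script, the function passed in would be something like a lambda wrapping fetcher.getCartoonUrl(uurrll), provided getCartoonUrl is changed to let the timeout propagate instead of swallowing it.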

2 0