Automatically downloading all of your Renren friends' albums with Python

Source: Internet  Editor: 程序博客网  Time: 2024/04/29 07:06

Author: 华亮

Please credit the source when reposting: http://blog.csdn.net/cedricporter

 

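Yesterday afternoon I wrote some Python code that automatically grabs my own Renren albums. Its only use seemed to be backing up my own photos, so today I reworked this Renren-specific crawler: it now fetches the list of all my friends, visits each of their pages, and downloads all of their albums (handy for those of you with lots of pretty friends...).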

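Yesterday's post contained a lot of tags and ended up too long, so, sadly, Tencent refused to accept the submission when I tried to edit it. XXXXX (ten thousand words omitted...)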

Renren is a site very much like Facebook.... Why so similar? Chinese characteristics....

On to the main topic. I am writing this down because I am afraid I will forget it later...

OK, first, a definition.

What is a crawler?

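According to Baidu Baike: "A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. ....... A traditional crawler starts from the URLs of one or several seed pages, obtains the URLs on those pages, and, while crawling, keeps extracting new URLs from the current page and putting them into a queue until some stopping condition of the system is met."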

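I specialized the crawler for Renren (in other words, it is useless on any other site). To access Renren you first need an account, which means you have to log in. After that, the server uses a session or a cookie to determine your login state on other pages; for Renren, the cookie is all we need. Of course, logging in with one browser usually does not log you in with another, because browsers normally do not share cookies unless you deliberately read another browser's cookie store. So for a crawler to crawl Renren, it must first log in..... and then keep the cookie.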
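The log-in-then-keep-the-cookie step can be sketched roughly as follows. This is only a sketch under assumptions: it uses the Python 3 standard library (the code in this post is Python 2 and uses urllib2), the form is reduced to three of the fields the Renren library below sends, and the helper names `make_session` and `build_login_request` are made up for illustration. Nothing is sent over the network here.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Login URL as used by the Renren library in this post
LOGIN_URL = 'http://www.renren.com/PLogin.do'


def make_session():
    """Return (opener, jar): the opener stores cookies in jar and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(username, password):
    """Build, but do not send, the login POST (hypothetical reduced form)."""
    form = {'email': username, 'password': password, 'isplogin': 'true'}
    body = urllib.parse.urlencode(form).encode('utf-8')
    # Passing data= turns the request into a POST
    return urllib.request.Request(LOGIN_URL, data=body)


opener, jar = make_session()
request = build_login_request('user@example.com', 'secret')
# opener.open(request) would send the login; cookies set by the server would
# land in `jar` and be attached to every later request made through `opener`.
```

Reusing the same `opener` for every later request is what plays the role of "staying logged in".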

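Browsers and servers talk to each other mainly over HTTP, chiefly with the GET and POST methods (according to Computer Systems: A Programmer's Perspective, GET accounts for 99% of HTTP requests). GET mainly sends short data to the server, with the parameters written into the URL, whereas POST can send longer data; publishing this post, for example, uses POST. For instance, we can run "telnet www.google.com 80" and then type "GET /" to receive the same thing we get by entering "http://www.google.com/" in a browser. A crawler works the same way: it just keeps issuing GETs and POSTs...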
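To make the difference between the two methods concrete, here is a small sketch (Python 3, standard library only; the three helper names are invented for this example). It builds the bytes one would type into telnet, a GET URL with its parameters in the query string, and a POST body carrying the same kind of parameters.

```python
import urllib.parse


def raw_get_request(host, path='/'):
    """Return the bytes of a minimal HTTP/1.0 GET request for host and path."""
    return ('GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' % (path, host)).encode('ascii')


def get_url(base, params):
    """GET carries its parameters in the query string of the URL."""
    return base + '?' + urllib.parse.urlencode(params)


def post_body(params):
    """POST carries its parameters in the request body instead."""
    return urllib.parse.urlencode(params).encode('utf-8')


telnet_bytes = raw_get_request('www.google.com')
profile_url = get_url('http://www.renren.com/profile.do',
                      {'id': '254905709', 'v': 'photo_ajax'})
status_body = post_body({'content': 'hi there'})
```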

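There are two ways to grab every visible album of every friend. One is to do it by hand, friend by friend and album by album; the other is to hand the job to the computer and let it crawl.... Since I am lazy, I chose the second.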

Once again it is time for the "to do X, first do Y" sentence pattern~

To get all friends, visit http://friend.renren.com/myfriendlistx.do while logged in. If you open it in a browser, JavaScript splits the friends across many pages. One of the scripts in the page has a variable named friends that holds the information for every friend. It is made up of records like {"id":254905709,"vip":false,"selected":true,"mo":true,"name":"\u5b89\u8feaAndy","head":"http:\/\/hdn.xnimg.cn\/photos\/hdn321\/20110612\/1600\/h_tiny_zFLc_715e000281932f76.jpg","groups":["\u534e\u5357\u7406\u5de5\u5927\u5b66"]}, and from these we can get the id of every friend.
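As a rough illustration of pulling the ids out of that variable, the sketch below (Python 3) locates the `friends` array with a regular expression and parses it as JSON, which also decodes the `\uXXXX` escapes in the names. The surrounding `<script>` markup in `SAMPLE_PAGE` is an assumption; only the record itself comes from the post, and `parse_friends` is a name invented here.

```python
import json
import re

# Matches "var friends=[ ... ];" and captures the JSON array
FRIENDS_VAR = re.compile(r'var friends\s*=\s*(\[.*?\]);', re.S)


def parse_friends(page_source):
    """Return a list of (id, name) pairs found in the friends variable."""
    found = FRIENDS_VAR.search(page_source)
    if found is None:
        return []
    return [(str(friend['id']), friend['name'])
            for friend in json.loads(found.group(1))]


# Hypothetical page: the record is the one quoted above, the wrapper is assumed
SAMPLE_PAGE = (
    '<script>var friends=['
    r'{"id":254905709,"vip":false,"selected":true,"mo":true,'
    r'"name":"\u5b89\u8feaAndy",'
    r'"head":"http:\/\/hdn.xnimg.cn\/photos\/hdn321\/20110612\/1600\/h_tiny_zFLc_715e000281932f76.jpg",'
    r'"groups":["\u534e\u5357\u7406\u5de5\u5927\u5b66"]}'
    '];</script>'
)

friends = parse_friends(SAMPLE_PAGE)
```

Parsing the array as JSON rather than matching `"id"` and `"name"` separately keeps each id paired with the right name even if the field order changes.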

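To get all of one person's albums, visit http://www.renren.com/profile.do?id=(the person's id)&v=photo_ajax&undefined. How did I find this? When we open someone's profile page and click on the albums tab, the page does not reload; AJAX only replaces part of it. What the page does is GET that path, which returns a fragment of HTML that replaces the current content. So we can GET the same path ourselves and obtain a page containing the ids of all the albums.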
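A minimal sketch of this step (Python 3): build the AJAX URL for a given id and pull the album links out of the returned fragment. The link pattern is the one used by the Renren library in this post; the sample fragment, the album id 123456, and the function names are made up for illustration, and the real markup may differ.

```python
import re

# Album link pattern as used by the Renren library in this post
ALBUM_LINK = re.compile(
    r'<a href="(.*?)" stats="album_album"><img.*?/>(.*?)</a>', re.S)


def albums_url(user_id):
    """URL of the AJAX fragment listing one person's albums."""
    return 'http://www.renren.com/profile.do?id=%s&v=photo_ajax&undefined' % user_id


def parse_albums(fragment):
    """Return (album name, album URL) pairs found in the fragment."""
    return [(name.strip(), url) for url, name in ALBUM_LINK.findall(fragment)]


# Hypothetical fragment, shaped only to fit the pattern above
SAMPLE_FRAGMENT = (
    '<a href="http://photo.renren.com/photo/254905709/album-123456" '
    'stats="album_album"><img src="cover.jpg"/> Holiday </a>'
)

albums = parse_albums(SAMPLE_FRAGMENT)
```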

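Getting all the photos in an album relies on a Renren bug that I found quite by accident: you can open the "reorder photos" page of someone else's album. That page lists every photo in the album, so a regular expression gives us the id of each photo. The reorder page is http://photo.renren.com/photo/(the person's id)/album-(album id)/reorder.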
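The extraction from the reorder page might look like the sketch below (Python 3). The `<li pid="...">` shape follows the regular expression in the Renren library in this post; the sample HTML, the ids in it, and the function names are assumptions for illustration.

```python
import re

# Each photo on the reorder page is assumed to be an <li pid="..."> element
PHOTO_ID = re.compile(r'<li pid="(\d+)"')


def reorder_url(user_id, album_id):
    """URL of the reorder page that lists every photo in an album."""
    return 'http://photo.renren.com/photo/%s/album-%s/reorder' % (user_id, album_id)


def photo_page_urls(user_id, reorder_html):
    """Return (photo id, photo page URL) pairs for every photo listed."""
    return [(pid, 'http://photo.renren.com/photo/%s/photo-%s' % (user_id, pid))
            for pid in PHOTO_ID.findall(reorder_html)]


# Hypothetical reorder page content
SAMPLE_REORDER = '<ul><li pid="111" class="x"><img/></li><li pid="222"></li></ul>'

pages = photo_page_urls('254905709', SAMPLE_REORDER)
```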

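After three rounds of "to do X, first do Y", we have the ids of all friends, the ids of all of their albums, and the ids of all the photos in those albums. Why is everything an id? My personal guess is that an integer performs better as the primary key of a database row, and a 32-bit integer takes only 4 bytes yet can identify 4294967296 things. Ids are also convenient to pass between client and server.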
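The arithmetic is easy to check. The snippet below (Python 3) packs the sample friend id from the record quoted earlier into an unsigned 32-bit integer; treating ids as unsigned 32-bit values is only this post's guess about how Renren stores them.

```python
import struct

friend_id = 254905709                  # sample id from the friends record
packed = struct.pack('<I', friend_id)  # '<I' = little-endian unsigned 32-bit
distinct_values = 2 ** 32              # how many ids 32 bits can tell apart
restored = struct.unpack('<I', packed)[0]
```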

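What can we do with these ids? Nothing yet. If we visit http://photo.renren.com/photo/(the person's id)/photo-(photo id), the page source contains a piece of data returned by AJAX, and in it there is an entry such as "largeurl":"http:\/\/fmn.rrimg.com\/fmn049\/20110621\/1520\/p_large_S5jA_37eb000165dc5c3f.jpg". That is the real address of the photo; once we strip the "\" characters from it, we can download it.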
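Extracting and cleaning that address can be sketched like this (Python 3). The pattern matches the `largeurl` entry quoted above; the rest of `SAMPLE_SOURCE` and the function name `real_photo_url` are assumptions for illustration.

```python
import re

LARGE_URL = re.compile(r'"largeurl":"(.*?)"')


def real_photo_url(photo_page_source):
    """Return the unescaped large-image URL, or None if the page has none."""
    found = LARGE_URL.search(photo_page_source)
    if found is None:
        return None
    # The value escapes every "/" as "\/", so dropping backslashes fixes it
    return found.group(1).replace('\\', '')


# Hypothetical page source wrapped around the entry quoted in the post
SAMPLE_SOURCE = (
    r'{"id":1,"largeurl":"http:\/\/fmn.rrimg.com\/fmn049\/20110621\/1520'
    r'\/p_large_S5jA_37eb000165dc5c3f.jpg","width":720}'
)

photo_url = real_photo_url(SAMPLE_SOURCE)
```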

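OK, so now we can write an incomplete crawler.......... For Renren's news feed, you can extract the URLs from a page, filter them, and put them into a priority queue; then pick the best one from the queue, visit it, and repeat the previous step until the queue is empty or some other condition is met.... Er, the legendary Chinese pseudocode....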
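The pseudocode above corresponds to a best-first crawl. Below is a minimal sketch in Python 3 using `heapq`; the `fetch_links` and `score` callables, the `max_pages` limit, and the tiny in-memory link graph are all stand-ins invented for this example, not part of the Renren crawler.

```python
import heapq


def crawl(start_url, fetch_links, score, max_pages=100):
    """Visit pages best-first and return the URLs in visiting order.

    fetch_links(url) returns the URLs found on a page;
    score(url) returns a number, where higher means "visit sooner".
    """
    # heapq is a min-heap, so scores are negated to pop the best first
    heap = [(-score(start_url), start_url)]
    seen = {start_url}
    visited = []
    while heap and len(visited) < max_pages:
        _, url = heapq.heappop(heap)
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:  # filter: never queue a URL twice
                seen.add(link)
                heapq.heappush(heap, (-score(link), link))
    return visited


# Toy link graph and scores standing in for real pages
GRAPH = {'a': ['b', 'c'], 'b': ['d'], 'c': [], 'd': []}
SCORES = {'a': 0, 'b': 1, 'c': 5, 'd': 3}

order = crawl('a', lambda url: GRAPH.get(url, []), lambda url: SCORES[url])
```

The stopping conditions here are "queue empty" and "page limit reached"; a real crawler would probably add others, such as a time budget.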

           

 

More code is available at: http://code.google.com/p/stupidet/


The program was tested successfully on Ubuntu 11.04 and Windows 7 x64. On Windows, please open and run it in IDLE.


Main program:

# -*-coding:utf-8-*-
# Filename:main.py
# Author: 华亮
#
from Renren import SuperRenren
import time


def main():
    renren = SuperRenren()
    # Replace the two placeholders with your own Renren account and password
    if renren.Create('Renren account', 'Renren password'):
        #renren.PostMsg(time.asctime())
        #renren.PostGroupMsg('387635422', '%s' % time.asctime())
        #renren.DownloadAlbum('333982368', 'sss')
        renren.DownloadAllFriendsAlbums(threadnumber = 1)


if __name__ == '__main__':
    main()



Renren library:

# -*- coding:utf-8 -*-
# Filename:Renren.py
# Author: 华亮
#
from HTMLParser import HTMLParser
from Queue import Empty
from Queue import Queue
from re import match
from sys import exit
from urllib import urlencode
import os
import re
import socket
import threading
import time
import urllib
import urllib2
import shelve

# Mutex protecting console output
GlobalPrintMutex = threading.Lock()
# Mutex protecting writes to config.cfg
GlobalWriteConfigMutex = threading.Lock()
# Mutex protecting the shelve that stores the remaining task list
GlobalShelveMutex = threading.Lock()

# Pick the path separator according to the platform
Delimiter = '/' if os.name == 'posix' else '\\'

ConfigFilename = 'config.cfg'            # ids of the photos already downloaded, one file per album
LastUpdatedFileName = 'lastupdated.cfg'  # last update time of every user
UpdateThreashold = 10 * 60               # update interval


# Thread-safe print
def MutexPrint(content):
    with GlobalPrintMutex:
        print content


# Thread-safe write to the shared config file
def MutexWriteFile(file, content):
    with GlobalWriteConfigMutex:
        file.write(content)
        file.flush()


# Turn literal "\uXXXX" escapes into the real characters,
# leaving every other character (e.g. ASCII letters) untouched
def Str2Uni(text):
    pat = re.compile(r'\\u([0-9a-fA-F]{4})')
    return pat.sub(lambda m: unichr(int(m.group(1), 16)), text)


#------------------------------------------------------------------------------
# Worker thread that downloads files
class Downloader(threading.Thread):
    def __init__(self, urlQueue, failedQueue, file=None):
        threading.Thread.__init__(self)
        self.queue = urlQueue
        self.failedQueue = failedQueue
        self.file = file

    def run(self):
        while True:
            # get_nowait() never blocks, so a thread cannot hang if another
            # thread empties the queue between a check and the get
            try:
                pid, url, filename = self.queue.get_nowait()
            except Empty:
                break
            try:
                isfile = os.path.isfile(filename.decode('utf-8'))
                MutexPrint(("\tDownloading %s" if not isfile else "\tExists %s") % filename.decode('utf-8'))
                if not isfile: urllib.urlretrieve(url, filename.decode('utf-8'))
                # Record the photo id as finished
                MutexWriteFile(self.file, pid + '\r\n')
            except Exception as e:
                self.failedQueue.put(pid)
                MutexPrint('\tError occurred when downloading photo with id = %s' % pid)
                MutexPrint(e)


#------------------------------------------------------------------------------
# Parser for the Renren album list
class RenrenAlbums(HTMLParser):
    in_key_div = False
    in_ul = False
    in_li = False
    in_a = False

    def __init__(self):
        HTMLParser.__init__(self)
        self.albumsUrl = []   # per-instance list, not shared between parsers

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div' and 'class' in attrs and attrs['class'] == 'big-album album-list clearfix':
            self.in_key_div = True
        elif self.in_key_div:
            if tag == 'ul':
                self.in_ul = True
            elif self.in_ul and tag == 'li':
                self.in_li = True
            if self.in_li and tag == 'a' and 'href' in attrs:
                self.in_a = True
                self.albumsUrl.append(attrs['href'])

    def handle_data(self, data):
        pass

    def handle_endtag(self, tag):
        if self.in_key_div and tag == 'div':
            self.in_key_div = False
        elif self.in_ul and tag == 'ul':
            self.in_ul = False
        elif self.in_li and tag == 'li':
            self.in_li = False
        elif self.in_a and tag == 'a':
            self.in_a = False


class RenrenRequester:
    '''
    Renren requester: logs in and sends requests with the login cookie
    '''
    LoginUrl = 'http://www.renren.com/PLogin.do'
    __isLogin = False   # so Request() is safe to call before a successful login

    # Log in with the given username and password
    def Create(self, username, password):
        loginData = {'email':username,
                'password':password,
                'origURL':'',
                'formName':'',
                'method':'',
                'isplogin':'true',
                'submit':'登录'}
        postData = urlencode(loginData)
        cookieFile = urllib2.HTTPCookieProcessor()
        self.opener = urllib2.build_opener(cookieFile)
        req = urllib2.Request(self.LoginUrl, postData)
        result = self.opener.open(req)
        # A successful login redirects to the home page or the guide page
        if not result.geturl().startswith(('http://www.renren.com/home',
                                           'http://guide.renren.com/guide')):
            return False

        rawHtml = result.read()

        # Get the user id
        useridPattern = re.compile(r'user : {"id" : (\d+?)}')
        self.userid = useridPattern.search(rawHtml).group(1)

        # Find the requestToken (it may be negative)
        pos = rawHtml.find("get_check:'")
        if pos == -1: return False
        rawHtml = rawHtml[pos + 11:]
        token = match(r'-?\d+', rawHtml)
        if token is None: return False
        self.requestToken = token.group()

        self.__isLogin = True
        return self.__isLogin

    def GetRequestToken(self):
        return self.requestToken

    def GetUserId(self):
        return self.userid

    # GET when data is None, POST otherwise
    def Request(self, url, data = None):
        if self.__isLogin:
            if data:
                encodeData = urlencode(data)
                request = urllib2.Request(url, encodeData)
            else:
                request = urllib2.Request(url)
            result = self.opener.open(request)
            return result
        else:
            return None


class RenrenPostMsg:
    '''
    RenrenPostMsg
        Posts a Renren status
    '''
    newStatusUrl = 'http://status.renren.com/doing/updateNew.do'

    def Handle(self, requester, param):
        requestToken, msg = param
        statusData = {'content':msg,
                    'isAtHome':'1',
                    'requestToken':requestToken}
        # Request() url-encodes the data itself
        requester.Request(self.newStatusUrl, statusData)
        return True


class RenrenPostGroupMsg:
    '''
    RenrenPostGroupMsg
        Posts a status to a Renren group
    '''
    newGroupStatusUrl = 'http://qun.renren.com/qun/ugc/create/status'

    def Handle(self, requester, param):
        requestToken, groupId, msg = param
        statusData = {'minigroupId':groupId,
                    'content':msg,
                    'requestToken':requestToken}
        requester.Request(self.newGroupStatusUrl, statusData)


class RenrenFriendList:
    '''
    RenrenFriendList
        Renren friend list
    '''
    def Handler(self, requester, param):
        friendUrl = 'http://friend.renren.com/myfriendlistx.do'
        rawHtml = requester.Request(friendUrl).read()

        friendInfoPack = re.search(r'var friends=\[(.*?)\];', rawHtml).group(1)
        friendIdPattern = re.compile(r'"id":(\d+).*?"name":"(.*?)"')
        friendIdList = []
        for id, name in friendIdPattern.findall(friendInfoPack):
            friendIdList.append((id, Str2Uni(name)))

        return friendIdList


class RenrenAlbumDownloader:
    '''
    AlbumDownloader
        Album downloader. Records the ids of downloaded photos in config.cfg
        so that they are not downloaded again
    '''
    threadNumber = 10    # number of download threads

    def Handler(self, requester, param):
        self.requester = requester
        userid, path = param
        self.__DownloadOneAlbum(userid, path)

    # Parse the HTML to get the person's name
    def __GetPeopleNameFromHtml(self, rawHtml):
        peopleNamePattern = re.compile(r'<h2>(.*?)<span>')
        peopleName = peopleNamePattern.search(rawHtml).group(1).strip()
        return peopleName

    def __GetAlbumsNameFromHtml(self, rawHtml):
        albumUrlPattern = re.compile(r'<a href="(.*?)" stats="album_album"><img.*?/>(.*?)</a>')
        albums = []
        # Point each album URL at its reorder page, which lists the ids of
        # all the photos in the album
        for album_url, album_name in albumUrlPattern.findall(rawHtml):
            albums.append((album_name.strip(), album_url + '/reorder'))
        return albums

    def __GetAlbumPhotos(self, userid, albumUrl):
        # Regular expression for the photo id
        pidPattern = re.compile(r'<li pid="(\d+)".*?>.*?</li>', re.S)

        # Fetch the page that lists every photo in the album
        result = self.requester.Request(albumUrl)
        rawHtml = result.read()

        photohtmlurl = []   # page URL of each photo
        for pid in pidPattern.findall(rawHtml):
            photohtmlurl.append((pid, 'http://photo.renren.com/photo/%s/photo-%s' % (userid, pid)))

        return photohtmlurl

    def __GetRealPhotoUrls(self, photohtmlurl):
        # Visit every photo page and fix up the real photo URL
        imgPattern = re.compile(r'"largeurl":"(.*?)"')
        imgUrl = []   # (photo id, real photo URL)
        for pid, url in photohtmlurl:
            result = self.requester.Request(url)
            rawHtml = result.read()
            for img in imgPattern.findall(rawHtml):
                imgUrl.append((pid, img.replace('\\', '')))
                break   # only the first match is needed

        return imgUrl

    # Download all the photos of one album
    def __DownloadAlbum(self, savepath, album_name, imgUrl, file):
        # Push the files to download into a queue
        queue = Queue()
        failedQueue = Queue()
        for pid, url in imgUrl:
            imgname = url.split('/')[-1]
            queue.put((pid, url, savepath + Delimiter + imgname))

        # Start the download threads
        threads = []
        for i in range(self.threadNumber):
            downloader = Downloader(queue, failedQueue, file)
            threads.append(downloader)
            downloader.start()

        # Wait for all threads to finish
        for t in threads:
            t.join()

        # Return the queue of photos that failed
        return failedQueue

    # Download all the albums of one person
    def __DownloadOneAlbum(self, userid, path='albums'):
        if not os.path.exists(path.decode('utf-8')): os.mkdir(path.decode('utf-8'))

        albumsUrl = 'http://www.renren.com/profile.do?id=%s&v=photo_ajax&undefined' % userid

        try:
            # Fetch the albums and their URLs
            result = self.requester.Request(albumsUrl)
            rawHtml = result.read()

            # Get the person's name
            peopleName = self.__GetPeopleNameFromHtml(rawHtml).strip()
            albums = self.__GetAlbumsNameFromHtml(rawHtml)

            # Create a folder named after the person
            path += Delimiter + peopleName
            if not os.path.exists(path.decode('utf-8')): os.mkdir(path.decode('utf-8'))

            # Start downloading the albums
            MutexPrint('Enter %s' % peopleName.decode('utf-8'))

            for album_name, albumUrl in albums:
                MutexPrint('Downloading Album: %s' % album_name.decode('utf-8'))

                # Get the photo ids and photo page URLs of this album
                photohtmlurl = self.__GetAlbumPhotos(userid, albumUrl)

                # Create a folder named after the album
                album_name = album_name.replace('\\', '')  # strip path separators
                album_name = album_name.replace('/', '')
                savepath = path + Delimiter + album_name
                if not os.path.exists(savepath.decode('utf-8')): os.mkdir(savepath.decode('utf-8'))

                newDownloadIdSet = set()
                finishedIdSet = set()
                totalIdSet = set()
                for pid, url in photohtmlurl:
                    totalIdSet.add(pid)

                configFile = savepath + Delimiter + ConfigFilename
                if os.path.isfile(configFile.decode('utf-8')):
                    # Read the ids of finished photos so that their
                    # large-image pages are not fetched again
                    file = open(configFile.decode('utf-8'), 'r')
                    photoIdMap = []
                    for line in file.readlines():
                        pid = line.strip()
                        photoIdMap.append(pid)
                    file.close()
                    finishedIdSet = set(photoIdMap)

                newDownloadIdSet = totalIdSet - finishedIdSet
                newDownloadPhotoHtmlUrl = ((pid, url) for pid, url in photohtmlurl if pid in newDownloadIdSet)
                imgUrl = self.__GetRealPhotoUrls(newDownloadPhotoHtmlUrl)

                # Download the photos
                file = None
                failedQueue = Queue()
                try:
                    # Rewrite the finished ids; the download threads append new ones
                    file = open(configFile.decode('utf-8'), 'w')
                    for id in finishedIdSet:
                        file.write(id + '\r\n')
                    file.flush()

                    failedQueue = self.__DownloadAlbum(savepath, album_name, imgUrl, file)
                except Exception as e:
                    print 'Error when downloading.', e
                finally:
                    # Take out the ids of the photos that failed to download
                    while not failedQueue.empty():
                        totalIdSet.discard(failedQueue.get())
                    if file is not None: file.close()

        except AttributeError as e:
            # Propagate so that the caller can tell the user about the captcha
            raise
        except Exception as e:
            print 'Error! Please contact QQ: 414112390'
            print e


class AutoRenrenDownloader:
    '''
    AutoRenrenDownloader
        Automatically downloads the albums of all friends, with resume support:
        if one run does not finish, the next run picks up where it left off
    '''
    def handler(self, requester, param):
        self.requester = requester
        path, threadnumber = param
        self.__DownloadFriendsAlbums(path, threadnumber)

    #------------------------------------------------------------------------------
    # Worker thread that downloads the albums of one friend at a time
    class FriendDownloader(threading.Thread):
        def __init__(self, requester, queue, file):
            threading.Thread.__init__(self)
            self.file = file
            self.requester = requester
            self.queue = queue

        def run(self):
            id = None
            try:
                while True:
                    # Raises Empty when no friend is left, which ends the thread
                    id, path = self.queue.get_nowait()
                    downloader = RenrenAlbumDownloader()
                    downloader.Handler(self.requester, (id, path))
                    # Remove the finished friend from the saved task list
                    with GlobalShelveMutex:
                        self.file['TaskList'].remove(id)
            except Empty:
                pass
            except AttributeError as e:
                print 'Renren may have decided that you visited 100 friends. Please open any friend\'s profile page on Renren and enter the captcha'
                #print e
            except ValueError as e:
                print id
                print e

    def __DownloadFriendsAlbums(self, path='albums', threadnumber=10):
        if not os.path.exists(path.decode('utf-8')): os.mkdir(path.decode('utf-8'))

        # Get the friend list
        friendsList = RenrenFriendList().Handler(self.requester, None)
        friendNames = dict(friendsList)

        # The task list survives between runs, which gives resume support
        db = shelve.open(LastUpdatedFileName, writeback = True)
        if not db.has_key('TaskList'): db['TaskList'] = []
        if len(db['TaskList']) == 0:
            db['TaskList'] = [id for id, realName in friendsList]

        updateList = db['TaskList']

        i = 1
        print "Friends to update in this run:"
        queue = Queue()
        for id in updateList:
            print "%s:\t%s\t" % (i, id),
            print friendNames.get(id, '')
            i += 1
            queue.put((id, path))

        # Download the friends' albums
        DownloadersList = []
        try:
            for i in range(threadnumber):
                friendDownloader = self.FriendDownloader(self.requester, queue, db)
                friendDownloader.start()
                DownloadersList.append(friendDownloader)

            for downloader in DownloadersList:
                downloader.join()
        except Exception as e:
            print '-' * 100 + "\nPlease Goto Renren.com\n" + '-' * 100
            print e
        finally:
            db.close()


class SuperRenren:
    '''
    SuperRenren
        Renren controller
    '''
    # Create: log in
    def Create(self, username, password):
        self.requester = RenrenRequester()
        if self.requester.Create(username, password):
            self.userid = self.requester.userid
            self.requestToken = self.requester.requestToken
            return True
        return False

    # Post a personal status
    def PostMsg(self, msg):
        poster = RenrenPostMsg()
        poster.Handle(self.requester, (self.requestToken, msg))

    # Post a group status
    def PostGroupMsg(self, groupId, msg):
        poster = RenrenPostGroupMsg()
        poster.Handle(self.requester, (self.requestToken, groupId, msg))

    # Download one person's albums
    def DownloadAlbum(self, userId, path = 'albums'):
        downloader = RenrenAlbumDownloader()
        downloader.Handler(self.requester, (userId, path))

    # Automatically download the albums of all friends
    def DownloadAllFriendsAlbums(self, path = 'albums', threadnumber = 10):
        downloader = AutoRenrenDownloader()
        downloader.handler(self.requester, (path, threadnumber))






