Automatically saving blog pages with a Python script


2016-02-24 update: added handling of invalid characters in the title, fixing the abnormal exit when saving files.


1. Introduction

In the earlier article "Saving web pages with a Python script", I described how to save all of a user's CSDN blog pages with a Python script. In hindsight, that code was not very readable, so it has now been rewritten.


2. Overall structure

2.1 File list

The project currently consists of the following Python files:

  • export_blog.py: the top-level entry point for exporting the blog data;
  • web_utils.py: helper functions that fetch a page's content and save a page given its URL;
  • page_count_parser.py: parses the blog's main page to obtain the URLs of the blog-list pages;
  • blog_item_parser.py: extracts each blog entry's URL and title from every blog-list page.

2.2 Main workflow

The overall flow is as follows (a condensed sketch follows the list):

  • 1. Determine the blog's main page (called the main page).
  • 2. From the main page, obtain the URLs of all the blog-list pages (called page lists).
  • 3. Walk through each page-list page and collect each blog entry's URL, title, and related details (each entry is called a blog item).
  • 4. With all the blog details in hand, fetch the data from those URLs and save it locally.
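The sketch below condenses the four steps into one driver, using the real functions from the files listed in section 2.1. It is only an outline of what export_blog.py (section 4.1) does: progress messages, user-id detection, and title sanitization are omitted, and the wrapper name export_all is made up for illustration.

#encoding: utf-8
# A condensed sketch of the workflow; see export_blog.py (section 4.1) for the real entry point.
import os
import time

import web_utils
import page_count_parser
import blog_item_parser

def export_all(user_name, user_id, saved_path, sleep_len):
    # Step 1: the main page.
    main_page_url = 'http://blog.csdn.net/%s/' % (user_name,)
    content = web_utils.get_page_content(main_page_url)

    # Step 2: the URLs of all the blog-list pages.
    page_lists = page_count_parser.get_page_lists(content, user_id)

    # Step 3: every blog item's URL and title.
    articles = []
    for page_list in page_lists:
        page_content = web_utils.get_page_content(page_list)
        articles.extend(blog_item_parser.get_article_items(page_content, user_name))
        time.sleep(sleep_len)  # be polite to the server

    # Step 4: save each article locally.
    for article in articles:
        filename = os.path.join(saved_path, article.title + ".htm")
        web_utils.save_page(article.url, filename)
        time.sleep(sleep_len)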

2.3 Domain terms

The previous section introduced several terms. This section is intended to illustrate them with screenshots and the corresponding HTML snippets. TODO

3. Class and function help for the scripts

The following subsections show the pydoc-style help dump of each module.

3.1 export_blog.py

NAME
    export_blog - #encoding: utf-8

FILE
    d:\examples\python\export_blog\export_blog.py

FUNCTIONS
    export_csdn_blogs(user_name, blog_saved_path, sleep_len)
        Read the main_page_url, and parse all the blog information, then save to blog_saved_path.
        (user_id is parsed from the main page, so it is no longer a parameter.)
        e.g.:
        user_name = 'a_flying_bird'
        blog_saved_path = "D:\examples\python\export_blog\2015-07-25"
        sleep_len = 5
        export_csdn_blogs(user_name, blog_saved_path, sleep_len)

3.2 web_utils.py

NAME
    web_utils

FILE
    d:\examples\python\export_blog\web_utils.py

FUNCTIONS
    fix_content(content)
        <script type="text/javascript">
            var protocol = window.location.protocol;
            document.write('<script type="text/javascript" src="' + protocol + '//csdnimg.cn/pubfooter/js/repoAddr2.js?v=' + Math.random() + '"></' + 'script>');
        </script>

        While parsing the 'document.write...' line, an error occurs, so we delete this line.

    get_page_content(url)
        Get the web page's content.

    save_page(url, filename)
        Save the web page specified by url.

3.3 page_count_parser.py

NAME
    page_count_parser - #encoding: utf-8

FILE
    d:\examples\python\export_blog\page_count_parser.py

CLASSES
    HTMLParser.HTMLParser(markupbase.ParserBase)
        PageCountParser

    class PageCountParser(HTMLParser.HTMLParser)
     |  Get the page count from this 'div'.
     |
     |  example:
     |  <div id="papelist" class="pagelist">
     |      <span> 137条数据  共10页</span>
     |      <strong>1</strong>
     |      <a href="http://blog.csdn.net/u013344915/article/list/2">2</a>
     |      <a href="http://blog.csdn.net/u013344915/article/list/3">3</a>
     |      <a href="http://blog.csdn.net/u013344915/article/list/4">4</a>
     |      <a href="http://blog.csdn.net/u013344915/article/list/5">5</a>
     |      <a href="http://blog.csdn.net/u013344915/article/list/6">...</a>
     |      <a href="http://blog.csdn.net/u013344915/article/list/2">下一页</a>
     |      <a href="http://blog.csdn.net/u013344915/article/list/10">尾页</a>
     |  </div>
     |
     |  Method resolution order:
     |      PageCountParser
     |      HTMLParser.HTMLParser
     |      markupbase.ParserBase
     |
     |  Methods defined here:
     |
     |  __init__(self, user_id)
     |
     |  get_page_count(self)
     |
     |  get_page_lists(self)
     |
     |  handle_data(self, text)
     |
     |  handle_endtag(self, tag)
     |
     |  handle_starttag(self, tag, attrs)
     |
     |  save_page_count(self, attrs)
     |      Save the page count.
     |
     |      example:
     |      <a href="http://blog.csdn.net/u013344915/article/list/6">...</a>
     |      <a href="http://blog.csdn.net/u013344915/article/list/2">下一页</a>
     |      <a href="http://blog.csdn.net/u013344915/article/list/10">尾页</a>
     |
     |      Windows 8, Firefox 38.0.6:
     |      <a href="/u013344915/article/list/2">
     |
     |  ----------------------------------------------------------------------
     |  Methods inherited from HTMLParser.HTMLParser:
     |
     |  ......

FUNCTIONS
    get_page_lists(content, user_id)
        Get the page lists' urls.

3.4 blog_item_parser.py

NAME
    blog_item_parser - #encoding: utf-8

FILE
    d:\examples\python\export_blog\blog_item_parser.py

CLASSES
    HTMLParser.HTMLParser(markupbase.ParserBase)
        BlogItemsParser
    __builtin__.object
        BlogItem

    class BlogItem(__builtin__.object)
     |  Methods defined here:
     |
     |  __init__(self, id)
     |
     |  dump(self)
     |
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |
     |  __dict__
     |      dictionary for instance variables (if defined)
     |
     |  __weakref__
     |      list of weak references to the object (if defined)

    class BlogItemsParser(HTMLParser.HTMLParser)
     |  Get all the articles' urls and titles.
     |
     |  Method resolution order:
     |      BlogItemsParser
     |      HTMLParser.HTMLParser
     |      markupbase.ParserBase
     |
     |  Methods defined here:
     |
     |  __init__(self, user_name)
     |
     |  get_blog_items(self)
     |
     |  handle_data(self, text)
     |
     |  handle_endtag(self, tag)
     |
     |  handle_starttag(self, tag, attrs)
     |
     |  save_article_title(self, attrs)
     |      Save the article_title.
     |
     |      example:
     |      <a href="http://blog.csdn.net/a_flying_bird/article/details/47028939">
     |          Linux环境下列出指定目录下的所有文件
     |      </a>
     |
     |  ----------------------------------------------------------------------
     |  Methods inherited from HTMLParser.HTMLParser:
     |
     |  ........

FUNCTIONS
    get_article_items(content, user_name)

4. Python scripts

To simplify this step, the scripts were originally packaged and uploaded to the download page http://download.csdn.net/detail/u013344915/8935181, but that link was inexplicably removed. Copy the code below directly, or fetch it from the network drive: http://pan.baidu.com/s/1pJYo2ZD

If you cannot download it (for example, because you are not logged in), you can also copy the code directly from here.

4.1 export_blog.py

<pre name="code" class="python">#!/usr/bin/env python  #encoding: utf-8    ''''' Export csdn's blog.  e.g.: 1. Linux:  ./export_blog.py a_flying_bird ./2015-07-25 5 2. Windows python export_blog.py 2005-07-27 5 '''    import time   import os   import re   import sys    import web_utils  import page_count_parser  import blog_item_parser    def get_user_id(content):      '''''Get user id from the content of main page.          e.g.:     <script type="text/javascript">         var username = "u013344915";         var _blogger = username;         var blog_address = "http://blog.csdn.net/a_flying_bird";         var static_host = "http://static.blog.csdn.net";         var currentUserName = "u013344915";       </script>     '''      username_pattern = '^var\s+username\s+=\s+\"(u[\d]+)\";$'      lines = content.split('\n')      for line in lines:          #print line          line = line.strip()          matched = re.match(username_pattern, line)          if matched:              return matched.group(1)            return None  # Create a file name.# In fact, we delete the invalid characters in the blog's title.# e.g. C/C++ -> CC++def replace_invalid_filename_char(title, replaced_char='_'):    '''Replace the invalid characaters in the filename with specified characater.    The default replaced characater is '_'.    e.g.     C/C++ -> C_C++    '''    valid_filename = title    invalid_characaters = '\\/:*?"<>|'    for c in invalid_characaters:        #print 'c:', c        valid_filename = valid_filename.replace(c, replaced_char)            return valid_filename     def export_csdn_blogs(user_name, blog_saved_path, sleep_len):      '''''     Read the main_page_url, and parse all the blog information, then save to blog_saved_path.          e.g.:     user_name = 'a_flying_bird'     user_id = 'u013344915'     blog_saved_path = "D:\\examples\\python\\export_blog\\2015-07-25"     sleep_len = 5     export_csdn_blogs(user_name, user_id, blog_saved_path, sleep_len)     '''      step = 1            print "Step %d: mkdir the destination directory: %s" % (step, blog_saved_path)      step = step + 1       if not os.path.exists(blog_saved_path):          os.makedirs(blog_saved_path)                print "Step %d: Retrieve the main page's content." % (step,)      step = step + 1       main_page_url = 'http://blog.csdn.net/%s/' % (user_name,)      content = web_utils.get_page_content(main_page_url)            print "Step %d: Get user id from the main page." % (step,)      step = step + 1      user_id = get_user_id(content)      if user_id is None:          print "Can not get user id from the main page. Correct it first."          return      else:          print "user id: ", user_id            print "Step %d: Get the pagelist's URLs." % (step,)      step = step + 1       page_lists = page_count_parser.get_page_lists(content, user_id)            print "Step %d: Read all of the article information, includes: url, title." % (step,)      step = step + 1             articles = []      for page_list in page_lists:          print "current pagelist: ", page_list          page_list_content = web_utils.get_page_content(page_list)          the_articles = blog_item_parser.get_article_items(page_list_content, user_name)          articles.extend(the_articles)          time.sleep(sleep_len)            print "Step %d: Save the articles." 
% (step,)      step = step + 1             total_article_count = len(articles)      print "Total count:", total_article_count       index = 1      for article in articles:          print "%d/%d: %s, %s ..." % (index, total_article_count, article.url, article.title)          index = index + 1          web_utils.save_page(article.url, os.path.join(blog_saved_path, replace_invalid_filename_char(article.title) + ".htm"))          time.sleep(sleep_len)        def usage(process_name):        print "Usage: %s user_name saved_path sleep_len" % (process_name,)        print "For example:"        print "    user_name: a_flying_bird"        print "    savedDirectory: /home/csdn/"      print "    sleep_len: 5"      if __name__ == "__main__":      argc = len(sys.argv)        if argc != 4:          usage(sys.argv[0])          sys.exit(-1)            user_name = sys.argv[1]        blog_saved_path = sys.argv[2]        sleep_len = int(sys.argv[3])            export_csdn_blogs(user_name, blog_saved_path, sleep_len)            print "DONE!!!"     
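With the three arguments described by usage(), a typical invocation looks like this (the directories are only examples):

Linux:
    ./export_blog.py a_flying_bird ./2015-07-25 5

Windows:
    python export_blog.py a_flying_bird D:\examples\python\export_blog\2015-07-25 5

The third argument is the number of seconds to sleep between two requests, which keeps the script from hammering the server.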


4.2 web_utils.py

import urllib2


def fix_content(content):
    '''
    <script type="text/javascript">
        var protocol = window.location.protocol;
        document.write('<script type="text/javascript" src="' + protocol + '//csdnimg.cn/pubfooter/js/repoAddr2.js?v=' + Math.random() + '"></' + 'script>');
    </script>

    While parsing the 'document.write...' line, an error occurs, so we delete this line.
    '''
    lines = content.split('\n')

    for index in range(0, len(lines)):
        if lines[index].find('window.location.protocol') > 0:
            # The offending 'document.write...' line directly follows this one.
            # Delete it by position; the original 'lines.remove(lines[index + 1])'
            # removed the first line with the same value, which could be wrong.
            del lines[index + 1]
            break

    content = ""
    for line in lines:
        content = content + line + '\n'

    return content


def save_page(url, filename):
    '''Save the web page specified by url.'''
    content = get_page_content(url)

    f = open(filename, "wt")
    f.write(content)
    f.close()


def get_page_content(url):
    '''Get the web page's content.'''
    # Use a browser-like User-Agent, otherwise the server may reject the request.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    req = urllib2.Request(url, headers=headers)
    content = urllib2.urlopen(req).read()  # 'UTF-8'
    content = fix_content(content)
    return content


def _test():
    url = 'http://blog.csdn.net/a_flying_bird'
    filename = "main_page.htm"
    save_page(url, filename)


def _test_error_string():
    filename = "a.htm"
    content = open(filename, "r").read()
    content = fix_content(content)

    f = open("fix.htm", "wt")
    f.write(content)
    f.close()


if __name__ == '__main__':
    #_test()
    _test_error_string()
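A quick standalone check of fix_content (the snippet is hand-made; only the 'window.location.protocol' marker matters): the line that follows it is dropped from the output.

#encoding: utf-8
# Hand-made test input for fix_content; the document.write line is abbreviated.
import web_utils

snippet = ('<script type="text/javascript">\n'
           '    var protocol = window.location.protocol;\n'
           "    document.write('...repoAddr2.js...');\n"
           '</script>\n')

print web_utils.fix_content(snippet)
# The document.write line no longer appears in the output.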


4.3 page_count_parser.py

#!/usr/bin/env python
#encoding: utf-8

import re
from HTMLParser import HTMLParser


class PageCountParser(HTMLParser):
    '''
    Get the page count from this 'div'.

    example:
    <div id="papelist" class="pagelist">
        <span> 137条数据  共10页</span>
        <strong>1</strong>
        <a href="http://blog.csdn.net/u013344915/article/list/2">2</a>
        <a href="http://blog.csdn.net/u013344915/article/list/3">3</a>
        <a href="http://blog.csdn.net/u013344915/article/list/4">4</a>
        <a href="http://blog.csdn.net/u013344915/article/list/5">5</a>
        <a href="http://blog.csdn.net/u013344915/article/list/6">...</a>
        <a href="http://blog.csdn.net/u013344915/article/list/2">下一页</a>
        <a href="http://blog.csdn.net/u013344915/article/list/10">尾页</a>
    </div>
    '''

    def __init__(self, user_id):
        HTMLParser.__init__(self)
        self.is_page_list = False
        self.page_count = 1
        # Build the URL header from user_id (it was hard-coded to u013344915 before).
        self.page_list_url_header = "http://blog.csdn.net/%s/article/list/" % (user_id,)

        # Windows 7: the href is an absolute URL.
        #self.prefix = ""
        #self.pattern = "^http://blog.csdn.net/%s/article/list/([\\d]+)$" % (user_id,)

        # Windows 8, Firefox 38.0.6: the href is a relative URL.
        self.prefix = "http://blog.csdn.net"
        self.pattern = "^/%s/article/list/([\d]+)$" % (user_id,)

    def _is_page_list(self, tag, attrs):
        '''Whether the tag is the div containing the page list.

        e.g.:
        <div id="papelist" class="pagelist">
        '''
        if tag != 'div': return False

        for attr in attrs:
            name, value = attr
            if name == 'id' and value == 'papelist':  # Oooh, it is papelist, not pagelist!
                print "enter pagelist"
                return True

        return False

    def save_page_count(self, attrs):
        '''Save the page count.

        example:
        <a href="http://blog.csdn.net/u013344915/article/list/6">...</a>
        <a href="http://blog.csdn.net/u013344915/article/list/2">下一页</a>
        <a href="http://blog.csdn.net/u013344915/article/list/10">尾页</a>

        Windows 8, Firefox 38.0.6:
        <a href="/u013344915/article/list/2">
        '''
        for attr in attrs:
            name, value = attr
            if name == 'href':
                matched = re.match(self.pattern, value)
                if matched:
                    count = int(matched.group(1))
                    if count > self.page_count: self.page_count = count
                    return

    def handle_starttag(self, tag, attrs):
        if self._is_page_list(tag, attrs):
            self.is_page_list = True
            return

        if self.is_page_list:
            if tag == 'a':
                self.save_page_count(attrs)

    def handle_endtag(self, tag):
        if self.is_page_list and tag == 'div':
            self.is_page_list = False

    def handle_data(self, text):
        pass

    def get_page_count(self):
        return self.page_count

    def get_page_lists(self):
        page_lists = []
        for index in range(1, self.page_count + 1):
            page_lists.append(self.page_list_url_header + str(index))

        return page_lists


def get_page_lists(content, user_id):
    '''Get the page lists' urls.'''
    parser = PageCountParser(user_id)
    parser.feed(content)
    parser.close()

    page_count = parser.get_page_count()
    print "page count: ", page_count

    page_lists = parser.get_page_lists()
    for page_list in page_lists:
        print page_list

    return page_lists


def _test():
    content = open('main_page.htm', 'r').read()
    get_page_lists(content, 'u013344915')


if __name__ == "__main__":
    _test()
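A standalone check of PageCountParser, feeding it a trimmed version of the 'papelist' div from the docstring. Given links to pages 2 and 10, the page count should come out as 10 and get_page_lists() should return ten URLs.

#encoding: utf-8
from page_count_parser import PageCountParser

html = '''<div id="papelist" class="pagelist">
    <a href="/u013344915/article/list/2">2</a>
    <a href="/u013344915/article/list/10">尾页</a>
</div>'''

parser = PageCountParser('u013344915')
parser.feed(html)
parser.close()

print parser.get_page_count()  # expected: 10
for url in parser.get_page_lists():
    print url  # .../u013344915/article/list/1 up to .../article/list/10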


4.4 blog_item_parser.py

   <pre name="code" class="python">#!/usr/bin/env python#encoding: utf-8import htmllibimport urllib2from HTMLParser import HTMLParserimport reimport platform '''article_list={"list_item article_item"}+"list_item article_item"={article_title} + {article_description} + {article_manage} + {clear}<div id="article_list" class="list">    <div class="list_item article_item">        <div class="article_title">               <span class="ico ico_type_Original"></span>            <h1>                <span class="link_title">                    <a href="http://blog.csdn.net/a_flying_bird/article/details/47028939">                        Linux环境下列出指定目录下的所有文件                                </a>                </span>            </h1>        </div>        <div class="article_description">            递归方式列出指定目录下的所有子目录和文件。...                </div>        <div class="article_manage">            <span class="link_postdate">2015-07-23 21:27</span>            <span class="link_view" title="阅读次数">                <a href="http://blog.csdn.net/a_flying_bird/article/details/47028939" title="阅读次数">                    阅读                </a>                (4)            </span>            <span class="link_comments" title="评论次数">                <a                    href="http://blog.csdn.net/a_flying_bird/article/details/47028939#comments"                    title="评论次数"                    onclick="_gaq.push(['_trackEvent','function', 'onclick', 'blog_articles_pinglun'])"                    >                    评论                </a>                (0)            </span>            <span class="link_edit">                <a href="http://write.blog.csdn.net/postedit/47028939" title="编辑">编辑</a>            </span>            <span class="link_delete">                <a                    href="javascript:void(0);"                    onclick="javascript:deleteArticle(47028939);return false;"                    title="删除">                    删除                </a>            </span>        </div>        <div class="clear"></div>    </div>    <div class="list_item article_item">        <div class="article_title">        ..........The key hierarchy of blog's title:<div id="article_list" class="list">    <div class="list_item article_item">        <div class="article_title">               <span class="ico ico_type_Original"></span>            <h1>                <span class="link_title">                    <a href="http://blog.csdn.net/a_flying_bird/article/details/47028939">                        Linux环境下列出指定目录下的所有文件                                </a>                </span>Furthermore, only the div of 'article_title' is enough!'''class BlogItem(object):    def __init__(self, id):        self.id = id        self.url = None        self.title = None    def dump(self):        print "(%s, %s, %s)" % (self.id, self.url, self.title)class BlogItemsParser(HTMLParser):    '''    Get all the article's url and title.    '''        def __init__(self, user_name):        HTMLParser.__init__(self)        self.is_article_title = False        self.ready_for_article_title = False # having reading the tag 'a', ready for handle the 'data'.        
self.current_article_id = None        self.blogItems = {}                self.is_windows_platform = False         if platform.system() == 'Windows':            self.is_windows_platform = True                 #self.prefix = ""        #self.pattern = "^http://blog.csdn.net/a_flying_bird/article/details/([\\d]+)$" # windows 7                # windows 8, Firefox 38.0.6        self.prefix = "http://blog.csdn.net"        self.pattern = "^/%s/article/details/([\d]+)$" % (user_name,)    def _is_start_tag_of_article_title(self, tag, attrs):        '''        Whether the tag is responding to article_title.        e.g.:        <div class="article_title">        '''        if tag != 'div': return False        for attr in attrs:            name, value = attr            if name == 'class' and value == 'article_title': return True        return False    def save_article_title(self, attrs):        '''        Save the article_title.        example:        <a href="http://blog.csdn.net/a_flying_bird/article/details/47028939">            Linux环境下列出指定目录下的所有文件                    </a>        '''        for attr in attrs:            name, value = attr            if name == 'href':                matched = re.match(self.pattern, value)                if matched:                    id = matched.group(1)                    self.current_article_id = id                    blogItem = BlogItem(id)                    blogItem.url = self.prefix + value                    blogItem.title = None                    self.blogItems[id] = blogItem                    self.ready_for_article_title = True                    return    def handle_starttag(self, tag, attrs):        #print "start tag(), tag:", tag        #print "attrs:", attrs        if self._is_start_tag_of_article_title(tag, attrs):            self.is_article_title = True            return        if self.is_article_title:            if tag == 'a':                self.save_article_title(attrs)    def handle_endtag(self, tag):        #print "end tag(), tag:", tag        if self.is_article_title and tag == 'div':            self.is_article_title = False    def handle_data(self, text):        #print "handle data(), text:", text        if self.ready_for_article_title:            self.ready_for_article_title = False                        title = text.strip()            if self.is_windows_platform:                title = title.decode('UTF-8').encode('MBCS')                            self.blogItems[self.current_article_id].title = title            assert(self.blogItems[self.current_article_id].id                   == self.current_article_id)            return    def get_blog_items(self):        return self.blogItems         def get_article_items(content, user_name):    parser = BlogItemsParser(user_name)    parser.feed(content)    parser.close()        blogItems = parser.get_blog_items()    print "article's count:", len(blogItems)    for blogItem in blogItems.values():        blogItem.dump()        return blogItems.values()def _test():    content = open('main_page.htm', 'r').read()    get_article_items(content)    if __name__ == "__main__":    _test()    
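A standalone check of get_article_items, feeding it a trimmed 'article_title' div from the docstring. It should record the id 47028939, the absolute URL, and the title.

#encoding: utf-8
from blog_item_parser import get_article_items

html = '''<div class="article_title">
    <a href="/a_flying_bird/article/details/47028939">
        Linux环境下列出指定目录下的所有文件
    </a>
</div>'''

items = get_article_items(html, 'a_flying_bird')
# dump() prints something like:
# (47028939, http://blog.csdn.net/a_flying_bird/article/details/47028939, Linux环境下列出指定目录下的所有文件)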


5. TODO

The scripts above have been verified on Windows 8; they still need to be verified in a Linux environment.

The images embedded in the blog posts also need to be saved; a sketch of one possible approach follows.
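This is a minimal sketch only, reusing HTMLParser to collect the src of every <img> tag and downloading each one with urllib2. The names ImgSrcParser and save_images are made up for illustration; this is not yet part of the scripts above.

#encoding: utf-8
import os
import urllib2
import urlparse
from HTMLParser import HTMLParser

class ImgSrcParser(HTMLParser):
    '''Collect the src attribute of every <img> tag.'''
    def __init__(self):
        HTMLParser.__init__(self)
        self.img_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != 'img': return
        for name, value in attrs:
            if name == 'src':
                self.img_urls.append(value)

def save_images(page_url, content, saved_path):
    '''Download every image referenced by the page into saved_path.'''
    parser = ImgSrcParser()
    parser.feed(content)
    parser.close()

    for img_url in parser.img_urls:
        img_url = urlparse.urljoin(page_url, img_url)  # resolve relative src
        data = urllib2.urlopen(img_url).read()
        filename = os.path.join(saved_path, os.path.basename(img_url))
        f = open(filename, "wb")  # binary mode: images must not be newline-translated
        f.write(data)
        f.close()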
