Saving blog pages automatically with a Python script
2016-02-24 update: added handling of invalid characters in blog titles, fixing the abnormal exit when saving files.
1. Introduction
In the earlier article Saving Web Pages with a Python Script, I described how to use a Python script to save all of a user's CSDN blog pages. Looking back, that code was not very readable, so I have now rewritten it.
2. Main structure
2.1 File list
The project currently consists of the following Python files:
- export_blog.py: the entry point for exporting the blog data;
- web_utils.py: helper functions that save a page given its URL and fetch a page's content;
- page_count_parser.py: parses the user's main blog page to obtain the URLs of the blog-list pages;
- blog_item_parser.py: extracts each blog entry's URL and title from every blog-list page.
2.2 Main flow
The flow is as follows:
- 1. Determine the blog's main page (called the main page).
- 2. From the main page, obtain the URLs of all the blog-list pages (called page lists).
- 3. Walk through each page-list page and collect the details of every blog entry, notably its URL and title (each entry is called a blog item).
- 4. With all the blog details in hand, read the content at each URL and save it locally.
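The four steps above can be sketched as a small driver loop. This is a minimal Python 3 sketch, not the real script: `get_page_lists` and `get_article_items` are stubs standing in for the parsers described below, and the download-and-save step is reduced to collecting (url, title) tuples.

```python
import time

def get_page_lists(main_page_content):
    # Stub: the real script parses the paging 'div' of the main page.
    return ["http://blog.csdn.net/u013344915/article/list/%d" % i for i in (1, 2)]

def get_article_items(page_list_url):
    # Stub: the real script downloads and parses each list page for (url, title) pairs.
    return [("http://blog.csdn.net/a_flying_bird/article/details/47028939",
             "sample title")]

def export_blogs(main_page_content, sleep_len=0):
    saved = []
    # Step 2: find every blog-list page from the main page.
    for page_list in get_page_lists(main_page_content):
        # Step 3: collect the URL and title of each blog item on the list page.
        for url, title in get_article_items(page_list):
            # Step 4: the real script would download url here and write it to disk.
            saved.append((url, title))
        time.sleep(sleep_len)  # be polite to the server between requests
    return saved

print(len(export_blogs("<html>...</html>")))  # 2 list pages x 1 stub item each
```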
2.3 Domain dictionary
The previous section introduced several terms; this section is meant to illustrate them with screenshots and the corresponding HTML snippets. TODO
3. Class and function help for the scripts
The sections below show the help (pydoc) output for each file.
3.1 export_blog.py
```
NAME
    export_blog - #encoding: utf-8

FILE
    d:\examples\python\export_blog\export_blog.py

FUNCTIONS
    export_csdn_blogs(user_name, user_id, blog_saved_path, sleep_len)
        Read the main_page_url, and parse all the blog information,
        then save to blog_saved_path.

        e.g.:
        user_name = 'a_flying_bird'
        user_id = 'u013344915'
        blog_saved_path = "D:\examples\python\export_blog\2015-07-25"
        sleep_len = 5
        export_csdn_blogs(user_name, user_id, blog_saved_path, sleep_len)
```
3.2 web_utils.py
```
NAME
    web_utils

FILE
    d:\examples\python\export_blog\web_utils.py

FUNCTIONS
    fix_content(content)
        <script type="text/javascript">
            var protocol = window.location.protocol;
            document.write('<script type="text/javascript" src="' + protocol
                + '//csdnimg.cn/pubfooter/js/repoAddr2.js?v=' + Math.random()
                + '"></' + 'script>');
        </script>

        While parsing the line of 'document.write...', there is some error,
        so we will delete this line.

    get_page_content(url)
        Get the web page's content.

    save_page(url, filename)
        Save the web page specified by url.
```
3.3 page_count_parser.py
```
NAME
    page_count_parser - #encoding: utf-8

FILE
    d:\examples\python\export_blog\page_count_parser.py

CLASSES
    HTMLParser.HTMLParser(markupbase.ParserBase)
        PageCountParser

    class PageCountParser(HTMLParser.HTMLParser)
     |  Get the page count from this 'div'.
     |
     |  example:
     |  <div id="papelist" class="pagelist">
     |      <span> 137条数据  共10页</span>
     |      <strong>1</strong>
     |      <a href="http://blog.csdn.net/u013344915/article/list/2">2</a>
     |      <a href="http://blog.csdn.net/u013344915/article/list/3">3</a>
     |      <a href="http://blog.csdn.net/u013344915/article/list/4">4</a>
     |      <a href="http://blog.csdn.net/u013344915/article/list/5">5</a>
     |      <a href="http://blog.csdn.net/u013344915/article/list/6">...</a>
     |      <a href="http://blog.csdn.net/u013344915/article/list/2">下一页</a>
     |      <a href="http://blog.csdn.net/u013344915/article/list/10">尾页</a>
     |  </div>
     |
     |  Method resolution order:
     |      PageCountParser
     |      HTMLParser.HTMLParser
     |      markupbase.ParserBase
     |
     |  Methods defined here:
     |
     |  __init__(self, user_id)
     |
     |  get_page_count(self)
     |
     |  get_page_lists(self)
     |
     |  handle_data(self, text)
     |
     |  handle_endtag(self, tag)
     |
     |  handle_starttag(self, tag, attrs)
     |
     |  save_page_count(self, attrs)
     |      Save the page count.
     |
     |      example:
     |      <a href="http://blog.csdn.net/u013344915/article/list/6">...</a>
     |      <a href="http://blog.csdn.net/u013344915/article/list/2">下一页</a>
     |      <a href="http://blog.csdn.net/u013344915/article/list/10">尾页</a>
     |
     |      Windows 8, Firefox 38.0.6:
     |      <a href="/u013344915/article/list/2">
     |
     |  ----------------------------------------------------------------------
     |  Methods inherited from HTMLParser.HTMLParser:
     |  .......

FUNCTIONS
    get_page_lists(content, user_id)
        Get the page lists' url.
```
3.4 blog_item_parser.py
```
NAME
    blog_item_parser - #encoding: utf-8

FILE
    d:\examples\python\export_blog\blog_item_parser.py

CLASSES
    HTMLParser.HTMLParser(markupbase.ParserBase)
        BlogItemsParser
    __builtin__.object
        BlogItem

    class BlogItem(__builtin__.object)
     |  Methods defined here:
     |
     |  __init__(self, id)
     |
     |  dump(self)
     |
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |
     |  __dict__
     |      dictionary for instance variables (if defined)
     |
     |  __weakref__
     |      list of weak references to the object (if defined)

    class BlogItemsParser(HTMLParser.HTMLParser)
     |  Get all the articles' urls and titles.
     |
     |  Method resolution order:
     |      BlogItemsParser
     |      HTMLParser.HTMLParser
     |      markupbase.ParserBase
     |
     |  Methods defined here:
     |
     |  __init__(self, user_name)
     |
     |  get_blog_items(self)
     |
     |  handle_data(self, text)
     |
     |  handle_endtag(self, tag)
     |
     |  handle_starttag(self, tag, attrs)
     |
     |  save_article_title(self, attrs)
     |      Save the article_title.
     |
     |      example:
     |      <a href="http://blog.csdn.net/a_flying_bird/article/details/47028939">
     |          Linux环境下列出指定目录下的所有文件
     |      </a>
     |
     |  ----------------------------------------------------------------------
     |  Methods inherited from HTMLParser.HTMLParser:
     |  ........

FUNCTIONS
    get_article_items(content, user_name)
```
4. The Python scripts
To simplify this step, the scripts were packaged and uploaded to the download page http://download.csdn.net/detail/u013344915/8935181. That link was inexplicably deleted, so use the code below directly, or fetch the package from the network drive: http://pan.baidu.com/s/1pJYo2ZD
If you cannot download it (for example, because you are not logged in), you can also copy the code directly from here.
4.1 export_blog.py
```python
#!/usr/bin/env python
#encoding: utf-8
'''
Export csdn's blog.
e.g.:
1. Linux:
   ./export_blog.py a_flying_bird ./2015-07-25 5
2. Windows:
   python export_blog.py a_flying_bird 2015-07-25 5
'''

import time
import os
import re
import sys

import web_utils
import page_count_parser
import blog_item_parser


def get_user_id(content):
    '''Get user id from the content of the main page.

    e.g.:
    <script type="text/javascript">
        var username = "u013344915";
        var _blogger = username;
        var blog_address = "http://blog.csdn.net/a_flying_bird";
        var static_host = "http://static.blog.csdn.net";
        var currentUserName = "u013344915";
    </script>
    '''
    username_pattern = '^var\s+username\s+=\s+\"(u[\d]+)\";$'
    lines = content.split('\n')
    for line in lines:
        line = line.strip()
        matched = re.match(username_pattern, line)
        if matched:
            return matched.group(1)
    return None


# Create a file name.
# In fact, we replace the invalid characters in the blog's title.
def replace_invalid_filename_char(title, replaced_char='_'):
    '''Replace the invalid characters in the filename with the specified
    character. The default replacement character is '_'.
    e.g. C/C++ -> C_C++
    '''
    valid_filename = title
    invalid_characters = '\\/:*?"<>|'
    for c in invalid_characters:
        valid_filename = valid_filename.replace(c, replaced_char)
    return valid_filename


def export_csdn_blogs(user_name, blog_saved_path, sleep_len):
    '''Read the main page, parse all the blog information,
    then save to blog_saved_path.

    e.g.:
    user_name = 'a_flying_bird'
    blog_saved_path = "D:\\examples\\python\\export_blog\\2015-07-25"
    sleep_len = 5
    export_csdn_blogs(user_name, blog_saved_path, sleep_len)
    '''
    step = 1
    print "Step %d: mkdir the destination directory: %s" % (step, blog_saved_path)
    step = step + 1
    if not os.path.exists(blog_saved_path):
        os.makedirs(blog_saved_path)

    print "Step %d: Retrieve the main page's content." % (step,)
    step = step + 1
    main_page_url = 'http://blog.csdn.net/%s/' % (user_name,)
    content = web_utils.get_page_content(main_page_url)

    print "Step %d: Get user id from the main page." % (step,)
    step = step + 1
    user_id = get_user_id(content)
    if user_id is None:
        print "Can not get user id from the main page. Correct it first."
        return
    else:
        print "user id: ", user_id

    print "Step %d: Get the pagelist's URLs." % (step,)
    step = step + 1
    page_lists = page_count_parser.get_page_lists(content, user_id)

    print "Step %d: Read all of the article information, includes: url, title." % (step,)
    step = step + 1
    articles = []
    for page_list in page_lists:
        print "current pagelist: ", page_list
        page_list_content = web_utils.get_page_content(page_list)
        the_articles = blog_item_parser.get_article_items(page_list_content, user_name)
        articles.extend(the_articles)
        time.sleep(sleep_len)

    print "Step %d: Save the articles." % (step,)
    step = step + 1
    total_article_count = len(articles)
    print "Total count:", total_article_count
    index = 1
    for article in articles:
        print "%d/%d: %s, %s ..." % (index, total_article_count, article.url, article.title)
        index = index + 1
        web_utils.save_page(article.url,
                os.path.join(blog_saved_path,
                             replace_invalid_filename_char(article.title) + ".htm"))
        time.sleep(sleep_len)


def usage(process_name):
    print "Usage: %s user_name saved_path sleep_len" % (process_name,)
    print "For example:"
    print "  user_name: a_flying_bird"
    print "  saved_path: /home/csdn/"
    print "  sleep_len: 5"


if __name__ == "__main__":
    argc = len(sys.argv)
    if argc != 4:
        usage(sys.argv[0])
        sys.exit(-1)

    user_name = sys.argv[1]
    blog_saved_path = sys.argv[2]
    sleep_len = int(sys.argv[3])

    export_csdn_blogs(user_name, blog_saved_path, sleep_len)

    print "DONE!!!"
```
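As a quick sanity check, the filename-sanitizing step can be exercised on its own. Below is a Python 3 re-statement of the same `replace_invalid_filename_char` logic:

```python
def replace_invalid_filename_char(title, replaced_char='_'):
    """Replace characters that Windows forbids in filenames."""
    for c in '\\/:*?"<>|':
        title = title.replace(c, replaced_char)
    return title

print(replace_invalid_filename_char('C/C++'))      # C_C++
print(replace_invalid_filename_char('a: b? c|d'))  # a_ b_ c_d
```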
4.2 web_utils.py
```python
import urllib2


def fix_content(content):
    '''Work around a fragment that breaks the HTML parser:

    <script type="text/javascript">
        var protocol = window.location.protocol;
        document.write('<script type="text/javascript" src="' + protocol
            + '//csdnimg.cn/pubfooter/js/repoAddr2.js?v=' + Math.random()
            + '"></' + 'script>');
    </script>

    While parsing the line of 'document.write...', there is some error,
    so we delete this line.
    '''
    lines = content.split('\n')
    for index in range(0, len(lines)):
        if lines[index].find('window.location.protocol') >= 0:
            # Remove the following 'document.write...' line.
            del lines[index + 1]
            break
    return '\n'.join(lines)


def save_page(url, filename):
    '''Save the web page specified by url to the given file.'''
    content = get_page_content(url)
    f = open(filename, "wt")
    f.write(content)
    f.close()


def get_page_content(url):
    '''Get the web page's content.'''
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; '
                             'rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    req = urllib2.Request(url, headers=headers)
    content = urllib2.urlopen(req).read()  # 'UTF-8'
    content = fix_content(content)
    return content


def _test():
    url = 'http://blog.csdn.net/a_flying_bird'
    filename = "main_page.htm"
    save_page(url, filename)


def _test_error_string():
    filename = "a.htm"
    content = open(filename, "r").read()
    content = fix_content(content)
    f = open("fix.htm", "wt")
    f.write(content)
    f.close()


if __name__ == '__main__':
    #_test()
    _test_error_string()
```
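The line-deletion trick in `fix_content` is easy to verify in isolation. Here is a Python 3 sketch of the same logic, fed the problem fragment quoted in the docstring:

```python
def fix_content(content):
    # Drop the line *after* the one containing 'window.location.protocol';
    # that is the document.write(...) line the HTML parser chokes on.
    lines = content.split('\n')
    for index, line in enumerate(lines):
        if 'window.location.protocol' in line:
            del lines[index + 1]
            break
    return '\n'.join(lines)

sample = '\n'.join([
    '<script type="text/javascript">',
    '    var protocol = window.location.protocol;',
    "    document.write('<script ...>');",
    '</script>',
])
print('document.write' in fix_content(sample))  # False
```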
4.3 page_count_parser.py
```python
#!/usr/bin/env python
#encoding: utf-8

import re
from HTMLParser import HTMLParser


class PageCountParser(HTMLParser):
    '''Get the page count from this 'div'.

    example:
    <div id="papelist" class="pagelist">
        <span> 137条数据  共10页</span>
        <strong>1</strong>
        <a href="http://blog.csdn.net/u013344915/article/list/2">2</a>
        <a href="http://blog.csdn.net/u013344915/article/list/3">3</a>
        <a href="http://blog.csdn.net/u013344915/article/list/4">4</a>
        <a href="http://blog.csdn.net/u013344915/article/list/5">5</a>
        <a href="http://blog.csdn.net/u013344915/article/list/6">...</a>
        <a href="http://blog.csdn.net/u013344915/article/list/2">下一页</a>
        <a href="http://blog.csdn.net/u013344915/article/list/10">尾页</a>
    </div>
    '''

    def __init__(self, user_id):
        HTMLParser.__init__(self)
        self.is_page_list = False
        self.page_count = 1
        self.page_list_url_header = "http://blog.csdn.net/%s/article/list/" % (user_id,)

        # Windows 7:
        #self.prefix = ""
        #self.pattern = "^http://blog.csdn.net/%s/article/list/([\\d]+)$" % (user_id,)

        # Windows 8, Firefox 38.0.6:
        self.prefix = "http://blog.csdn.net"
        self.pattern = "^/%s/article/list/([\d]+)$" % (user_id,)

    def _is_page_list(self, tag, attrs):
        '''Whether the tag is the paging div. e.g.:
        <div id="papelist" class="pagelist">
        '''
        if tag != 'div':
            return False

        for attr in attrs:
            name, value = attr
            if name == 'id' and value == 'papelist':  # Oooh, it is papelist, not pagelist!
                print "enter pagelist"
                return True

        return False

    def save_page_count(self, attrs):
        '''Save the page count.

        example:
        <a href="http://blog.csdn.net/u013344915/article/list/6">...</a>
        <a href="http://blog.csdn.net/u013344915/article/list/2">下一页</a>
        <a href="http://blog.csdn.net/u013344915/article/list/10">尾页</a>

        Windows 8, Firefox 38.0.6:
        <a href="/u013344915/article/list/2">
        '''
        for attr in attrs:
            name, value = attr
            if name == 'href':
                matched = re.match(self.pattern, value)
                if matched:
                    count = int(matched.group(1))
                    if count > self.page_count:
                        self.page_count = count
                return

    def handle_starttag(self, tag, attrs):
        if self._is_page_list(tag, attrs):
            self.is_page_list = True
            return

        if self.is_page_list and tag == 'a':
            self.save_page_count(attrs)

    def handle_endtag(self, tag):
        if self.is_page_list and tag == 'div':
            self.is_page_list = False

    def handle_data(self, text):
        pass

    def get_page_count(self):
        return self.page_count

    def get_page_lists(self):
        page_lists = []
        for index in range(1, self.page_count + 1):
            page_lists.append(self.page_list_url_header + str(index))
        return page_lists


def get_page_lists(content, user_id):
    '''Get the page lists' url.'''
    parser = PageCountParser(user_id)
    parser.feed(content)
    parser.close()

    page_count = parser.get_page_count()
    print "page count: ", page_count

    page_lists = parser.get_page_lists()
    for page_list in page_lists:
        print page_list

    return page_lists


def _test():
    content = open('main_page.htm', 'r').read()
    get_page_lists(content, 'u013344915')


if __name__ == "__main__":
    _test()
```
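The core of the page-count parser is simply "take the maximum list index among the paging hrefs". That piece can be tested without any HTML parsing; a Python 3 sketch using the same relative-href pattern:

```python
import re

def max_page_count(hrefs, user_id):
    # Same pattern the parser uses for relative hrefs like /u013344915/article/list/10
    pattern = r'^/%s/article/list/(\d+)$' % user_id
    count = 1
    for href in hrefs:
        matched = re.match(pattern, href)
        if matched:
            count = max(count, int(matched.group(1)))
    return count

hrefs = ['/u013344915/article/list/2', '/u013344915/article/list/10',
         '/u013344915/article/list/6', '/other/article/list/99']
print(max_page_count(hrefs, 'u013344915'))  # 10
```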
4.4 blog_item_parser.py
```python
#!/usr/bin/env python
#encoding: utf-8

import re
import platform
from HTMLParser import HTMLParser

'''
article_list = {"list_item article_item"}+

"list_item article_item" = {article_title} + {article_description}
                         + {article_manage} + {clear}

<div id="article_list" class="list">
    <div class="list_item article_item">
        <div class="article_title">
            <span class="ico ico_type_Original"></span>
            <h1>
                <span class="link_title">
                    <a href="http://blog.csdn.net/a_flying_bird/article/details/47028939">
                        Linux环境下列出指定目录下的所有文件
                    </a>
                </span>
            </h1>
        </div>
        <div class="article_description">
            递归方式列出指定目录下的所有子目录和文件。...
        </div>
        <div class="article_manage">
            <span class="link_postdate">2015-07-23 21:27</span>
            <span class="link_view" title="阅读次数">
                <a href="http://blog.csdn.net/a_flying_bird/article/details/47028939"
                   title="阅读次数">阅读</a>(4)
            </span>
            <span class="link_comments" title="评论次数">
                <a href="http://blog.csdn.net/a_flying_bird/article/details/47028939#comments"
                   title="评论次数"
                   onclick="_gaq.push(['_trackEvent','function', 'onclick', 'blog_articles_pinglun'])">评论</a>(0)
            </span>
            <span class="link_edit">
                <a href="http://write.blog.csdn.net/postedit/47028939" title="编辑">编辑</a>
            </span>
            <span class="link_delete">
                <a href="javascript:void(0);"
                   onclick="javascript:deleteArticle(47028939);return false;"
                   title="删除">删除</a>
            </span>
        </div>
        <div class="clear"></div>
    </div>
    <div class="list_item article_item">
        <div class="article_title">
        ..........

The key hierarchy of a blog's title:

<div id="article_list" class="list">
    <div class="list_item article_item">
        <div class="article_title">
            <span class="ico ico_type_Original"></span>
            <h1>
                <span class="link_title">
                    <a href="http://blog.csdn.net/a_flying_bird/article/details/47028939">
                        Linux环境下列出指定目录下的所有文件
                    </a>
                </span>

Furthermore, only the div of 'article_title' is enough!
'''


class BlogItem(object):

    def __init__(self, id):
        self.id = id
        self.url = None
        self.title = None

    def dump(self):
        print "(%s, %s, %s)" % (self.id, self.url, self.title)


class BlogItemsParser(HTMLParser):
    '''Get all the articles' urls and titles.'''

    def __init__(self, user_name):
        HTMLParser.__init__(self)
        self.is_article_title = False
        # Having read the tag 'a'; ready to handle the 'data'.
        self.ready_for_article_title = False
        self.current_article_id = None
        self.blogItems = {}

        self.is_windows_platform = (platform.system() == 'Windows')

        # Windows 7:
        #self.prefix = ""
        #self.pattern = "^http://blog.csdn.net/%s/article/details/([\\d]+)$" % (user_name,)

        # Windows 8, Firefox 38.0.6:
        self.prefix = "http://blog.csdn.net"
        self.pattern = "^/%s/article/details/([\d]+)$" % (user_name,)

    def _is_start_tag_of_article_title(self, tag, attrs):
        '''Whether the tag corresponds to article_title. e.g.:
        <div class="article_title">
        '''
        if tag != 'div':
            return False

        for attr in attrs:
            name, value = attr
            if name == 'class' and value == 'article_title':
                return True

        return False

    def save_article_title(self, attrs):
        '''Save the article_title.

        example:
        <a href="http://blog.csdn.net/a_flying_bird/article/details/47028939">
            Linux环境下列出指定目录下的所有文件
        </a>
        '''
        for attr in attrs:
            name, value = attr
            if name == 'href':
                matched = re.match(self.pattern, value)
                if matched:
                    id = matched.group(1)
                    self.current_article_id = id
                    blogItem = BlogItem(id)
                    blogItem.url = self.prefix + value
                    blogItem.title = None
                    self.blogItems[id] = blogItem
                    self.ready_for_article_title = True
                return

    def handle_starttag(self, tag, attrs):
        if self._is_start_tag_of_article_title(tag, attrs):
            self.is_article_title = True
            return

        if self.is_article_title and tag == 'a':
            self.save_article_title(attrs)

    def handle_endtag(self, tag):
        if self.is_article_title and tag == 'div':
            self.is_article_title = False

    def handle_data(self, text):
        if self.ready_for_article_title:
            self.ready_for_article_title = False
            title = text.strip()
            if self.is_windows_platform:
                title = title.decode('UTF-8').encode('MBCS')
            self.blogItems[self.current_article_id].title = title
            assert self.blogItems[self.current_article_id].id == self.current_article_id

    def get_blog_items(self):
        return self.blogItems


def get_article_items(content, user_name):
    parser = BlogItemsParser(user_name)
    parser.feed(content)
    parser.close()

    blogItems = parser.get_blog_items()
    print "article's count:", len(blogItems)
    for blogItem in blogItems.values():
        blogItem.dump()

    return blogItems.values()


def _test():
    content = open('main_page.htm', 'r').read()
    get_article_items(content, 'a_flying_bird')


if __name__ == "__main__":
    _test()
```
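The same title-extraction approach carries over to Python 3's `html.parser`. The sketch below is a simplified, hypothetical re-implementation rather than the script above: it only tracks the `article_title` div and the first matching `<a>`, and the HTML fragment is abbreviated from the docstring example (with an English title for illustration).

```python
from html.parser import HTMLParser
import re

class TitleParser(HTMLParser):
    """Collect [url, title] pairs from 'article_title' divs."""

    def __init__(self, user_name):
        super().__init__()
        # Relative hrefs look like /a_flying_bird/article/details/47028939
        self.pattern = r'^/%s/article/details/(\d+)$' % user_name
        self.in_title_div = False
        self.ready_for_text = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div' and attrs.get('class') == 'article_title':
            self.in_title_div = True
        elif self.in_title_div and tag == 'a':
            if re.match(self.pattern, attrs.get('href', '')):
                self.items.append(['http://blog.csdn.net' + attrs['href'], None])
                self.ready_for_text = True

    def handle_data(self, data):
        # The first non-blank data after a matching <a> is the title.
        if self.ready_for_text and data.strip():
            self.items[-1][1] = data.strip()
            self.ready_for_text = False

    def handle_endtag(self, tag):
        if tag == 'div':
            self.in_title_div = False

html = '''<div class="article_title"><h1><span class="link_title">
<a href="/a_flying_bird/article/details/47028939">Listing all files under a directory on Linux</a>
</span></h1></div>'''
parser = TitleParser('a_flying_bird')
parser.feed(html)
print(parser.items[0][1])  # Listing all files under a directory on Linux
```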
5. TODO
The scripts above have been verified on Windows 8; they still need to be verified on Linux.
The images embedded in the blog posts also need to be saved.