Crawling the CSDN Blog Channel with Python

Source: Internet  Editor: 程序博客网  Time: 2024/05/16 07:43

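This is my first time working with Python, so the code is quite simple. I used PyCharm with Python 3.4, which was convenient.

Some Python modules need additional supporting packages when they are installed. To get around this, first run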

pip install wheel

Then download the whl file directly and install it:

pip install lxml-3.5.0-cp34-none-win32.whl
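Before running the crawler, it can help to confirm that the third-party packages are importable. The following is a minimal sketch using only the standard library. The helper name `missing_modules` is hypothetical, and `bs4` and `lxml` are assumed to be the import names of the two packages the script in this post relies on.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # bs4 and lxml are the packages the crawler in this post imports or uses.
    missing = missing_modules(["bs4", "lxml"])
    if missing:
        print("Please install:", ", ".join(missing))
    else:
        print("All required modules are available.")
```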

Define a class that holds the fields to be saved:

class CnblogArticle:
    def __init__(self):
        self.num = ''
        self.category = ''
        self.title = ''
        self.author = ''
        self.postTime = ''
        self.articleComment = ''
        self.articleView = ''

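The CSDN blog channel has only 18 pages, so the script parses 18 pages. The main function contains both a multithreaded version (the commented-out part) and a plain sequential version.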
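The two ways of walking the 18 pages can be sketched as follows. This is not the code from this post: it swaps the one-thread-per-page approach for a small `concurrent.futures.ThreadPoolExecutor` pool, and the names `page_jobs`, `crawl_sequential`, `crawl_threaded` and the `workers=6` default are hypothetical. The `parse` argument stands in for a function such as `parsePage(url, page_num)`.

```python
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://blog.csdn.net/?&page={}"  # URL pattern used in this post


def page_jobs(page_count=18):
    """Build (url, page_num) pairs the same way the main function does."""
    return [(BASE_URL.format(i), i + 1) for i in range(page_count)]


def crawl_sequential(parse, page_count=18):
    """Call parse(url, page_num) for each page, one after another."""
    return [parse(url, num) for url, num in page_jobs(page_count)]


def crawl_threaded(parse, page_count=18, workers=6):
    """Do the same work on a small pool of worker threads."""
    jobs = page_jobs(page_count)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in job order even if pages finish out of order.
        return list(pool.map(lambda job: parse(*job), jobs))


if __name__ == "__main__":
    # Stand-in for the real page parser so the sketch runs offline.
    def fake_parse(url, page_num):
        return page_num

    print(crawl_threaded(fake_parse))
```

A bounded pool keeps the number of simultaneous requests small, which may be gentler on both the site and a slow connection than starting all 18 threads at once.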

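Note: each item is delimited by class=blog_list. Most items contain a link with class=category, but a few do not, so this case has to be handled or the parser will raise an error.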

<div class="blog_list">
    <h1>
        <a href="/other/index.html" class="category">[综合]</a>
        <a name="49786427" href="http://blog.csdn.net/matrix_space/article/details/49786427" target="_blank">Python: scikit-image canny 边缘检测</a>
        <img src="http://static.blog.csdn.net/images/icon-zhuanjia.gif" class="blog-icons" alt="专家" title="专家">
    </h1>
    <dl>
        <dt>
            <a href="http://blog.csdn.net/matrix_space" target="_blank">
                <img src="http://avatar.csdn.net/F/9/7/3_shinian1987.jpg" alt="shinian1987" />
            </a>
        </dt>
        <dd>这个用例说明canny 边缘检测的用法import numpy as npimport matplotlib.pyplot as pltfrom scipy import ndimage as ndifrom skimage import feature# Generate noisy image of a squareim = np.zeros((128, 128))im[3...</dd>
    </dl>
    <p>
        <a class="tag" href="/tag/details.html?tag=python" target="_blank">python</a>
    </p>
    <div class="about_info">
        <span class="fr digg" id="digg_49786427" blog="1164951" digg="0" bury="0"></span>
        <span class="fl">
            <a href="http://blog.csdn.net/matrix_space" target="_blank" class="user_name">shinian1987</a>
            <span class="time">3小时前</span>
            <a href="http://blog.csdn.net/matrix_space/article/details/49786427" target="_blank" class="view">阅读(104)</a>
            <a href="http://blog.csdn.net/matrix_space/article/details/49786427#comments" target="_blank" class="comment">评论(0)</a>
        </span>
    </div>
</div>

<div class="blog_list">
    <h1>
        <a name="50524490" href="http://blog.csdn.net/u010579068/article/details/50524490" target="_blank">STL_算法 for_each 和 transform 比较</a>
    </h1>
    <dl>
        <dt>
            <a href="http://blog.csdn.net/u010579068" target="_blank">
                <img src="http://avatar.csdn.net/9/9/B/3_u010579068.jpg" alt="u010579068" />
            </a>
        </dt>
        <dd>C++ Primer 学习中。。。&#160;简单记录下我的学习过程&#160;(代码为主)所有容器适用/**----------------------------------------------------------------------------------for_each &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;速度快 &#160; &#160; &#160; &#160; &#160; &#160; &#160;...</dd>
    </dl>
    <p>
        <a class="tag" href="/tag/details.html?tag=STL_算法" target="_blank">STL_算法</a>
        <a class="tag" href="/tag/details.html?tag=for_each" target="_blank">for_each</a>
        <a class="tag" href="/tag/details.html?tag=transform" target="_blank">transform</a>
        <a class="tag" href="/tag/details.html?tag=STL" target="_blank">STL</a>
    </p>
    <div class="about_info">
        <span class="fr digg" id="digg_50524490" blog="1499803" digg="0" bury="0"></span>
        <span class="fl">
            <a href="http://blog.csdn.net/u010579068" target="_blank" class="user_name">u010579068</a>
            <span class="time">3小时前</span>
            <a href="http://blog.csdn.net/u010579068/article/details/50524490" target="_blank" class="view">阅读(149)</a>
            <a href="http://blog.csdn.net/u010579068/article/details/50524490#comments" target="_blank" class="comment">评论(0)</a>
        </span>
    </div>
</div>
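The difference between the two items above comes down to whether a class="category" link is present. As a rough illustration of guarding against the missing case, here is a sketch that deliberately uses the standard library's `html.parser` instead of Beautiful Soup so that it runs without extra packages. The names `BlogItemParser` and `category_and_title` are hypothetical, and the shortened `href="#"` values in the demo are placeholders, not real CSDN markup.

```python
from html.parser import HTMLParser


class BlogItemParser(HTMLParser):
    """Pull the optional category and the title out of one blog_list item."""

    def __init__(self):
        super().__init__()
        self.category = None  # stays None when no class="category" link exists
        self.title = ""
        self._field = None    # field that the current <a> text belongs to

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        if attrs.get("class") == "category":
            self._field = "category"
        elif "name" in attrs:  # the title link carries a name attribute
            self._field = "title"

    def handle_data(self, data):
        if self._field == "category":
            self.category = (self.category or "") + data
        elif self._field == "title":
            self.title += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._field = None


def category_and_title(item_html):
    """Return (category, title); category falls back to the string 'None'."""
    parser = BlogItemParser()
    parser.feed(item_html)
    parser.close()
    return (parser.category or "None").strip(), parser.title.strip()


if __name__ == "__main__":
    with_cat = ('<h1><a href="/other/index.html" class="category">[综合]</a>'
                '<a name="49786427" href="#">Python: scikit-image canny 边缘检测</a></h1>')
    without_cat = '<h1><a name="50524490" href="#">STL_算法 for_each 和 transform 比较</a></h1>'
    print(category_and_title(with_cat))
    print(category_and_title(without_cat))
```

With Beautiful Soup the same idea is to check whether `find('a', 'category')` returned `None` before reading `.string` from it.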

The Beautiful Soup 4.2.0 documentation can be read directly on the official site.

# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
import urllib.request
import time
import threading  # only needed for the optional multithreaded variant in main()


class CnblogArticle:
    """Holds the fields saved for each post (the class defined earlier)."""
    def __init__(self):
        self.num = ''
        self.category = ''
        self.title = ''
        self.author = ''
        self.postTime = ''
        self.articleComment = ''
        self.articleView = ''


class CnblogUtils(object):
    def __init__(self):
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36'}
        self.contentAll = set()

    def getPage(self, url=None):
        # Fetch one page and parse it with the lxml backend
        request = urllib.request.Request(url, headers=self.headers)
        response = urllib.request.urlopen(request)
        soup = BeautifulSoup(response.read(), "lxml")
        return soup

    def parsePage(self, url=None, page_num=None):
        soup = self.getPage(url)
        itemBlog = soup.find_all('div', 'blog_list')
        for i, itemSingle in enumerate(itemBlog):
            # One CnblogArticle instance per item
            cnArticle = CnblogArticle()
            cnArticle.num = i
            cnArticle.author = itemSingle.find('a', 'user_name').string
            cnArticle.postTime = itemSingle.find('span', 'time').string
            cnArticle.articleComment = itemSingle.find('a', 'comment').string
            cnArticle.articleView = itemSingle.find('a', 'view').string
            # Some items have no class="category" link, so look it up explicitly
            categoryTag = itemSingle.find('a', 'category')
            if categoryTag is not None:
                cnArticle.category = categoryTag.string
                cnArticle.title = itemSingle.find('a', attrs={'name': True}).string
            else:
                cnArticle.category = "None"
                cnArticle.title = itemSingle.find('a').string
            self.contentAll.add(str(cnArticle.author))
            self.writeFile(page_num, cnArticle.num, cnArticle.author, cnArticle.postTime, cnArticle.articleComment, cnArticle.articleView, cnArticle.category, cnArticle.title)

    def writeFile(self, page_num, num, author, postTime, articleComment, articleView, category, title):
        # Append one tab-separated row; UTF-8 keeps Chinese titles safe on any platform
        fields = ['page_num is {}'.format(page_num), num, author, postTime, articleComment, articleView, category, title]
        with open("a.txt", 'a+', encoding='utf-8') as f:
            f.write('\t'.join(str(x) for x in fields) + '\n')


def main(thread_num):
    # perf_counter() replaces time.clock(), which no longer exists in Python 3.8+
    start = time.perf_counter()
    cnblog = CnblogUtils()
    # Multithreaded variant: uncomment this block and comment out the loop below.
    # thread_list = []
    # for i in range(0, thread_num):
    #     thread_list.append(threading.Thread(target=cnblog.parsePage, args=('http://blog.csdn.net/?&page={}'.format(i), i + 1)))
    # for thread in thread_list:
    #     thread.start()
    # for thread in thread_list:
    #     thread.join()
    # print(cnblog.contentAll)
    for i in range(0, 18):
        cnblog.parsePage('http://blog.csdn.net/?&page={}'.format(i), i + 1)
    end = time.perf_counter()
    print('time = {}'.format(end - start))


if __name__ == '__main__':
    main(18)

 

Program output:

page_num is 1    0    foruok    18分钟前    评论(0)    阅读(0)    [编程语言]    Windows下从源码编译SKIA
page_num is 1    1    u013467442    31分钟前    评论(0)    阅读(3)    [编程语言]    Cubieboard学习资源
page_num is 1    2    tuke_tuke    32分钟前    评论(0)    阅读(15)    [移动开发]    UI组件之AdapterView及其子类关系,Adapter接口及其实现类关系
page_num is 1    3    xiaominghimi    53分钟前    评论(0)    阅读(51)    [移动开发]    【COCOS2D-X 备注篇】ASSETMANAGEREX使用异常解决备注->CHECK_JNI/CC‘JAVA.LANG.NOCLASSDEFFOUNDERROR’
page_num is 1    4    shinian1987    1小时前    评论(0)    阅读(64)    [综合]    Python: scikit-image canny 边缘检测
page_num is 1    5    u010579068    1小时前    评论(0)    阅读(90)    None    STL_算法 for_each 和 transform 比较
page_num is 1    6    u013467442    1小时前    评论(0)    阅读(94)    [编程语言]    OpenGLES2.0着色器语言glsl
page_num is 1    7    u013467442    1小时前    评论(0)    阅读(89)    [编程语言]    OpenGl 坐标转换
page_num is 1    8    AaronGZK    1小时前    评论(0)    阅读(95)    [编程语言]    bzoj4390【Usaco2015 Dec】Max Flow
page_num is 1    9    AaronGZK    1小时前    评论(0)    阅读(95)    [编程语言]    bzoj1036【ZJOI2008】树的统计Count
page_num is 1    10    danhuang2012    1小时前    评论(0)    阅读(90)    [编程语言]    Node.js如何处理健壮性
page_num is 1    11    EbowTang    1小时前    评论(0)    阅读(102)    [编程语言]    <LeetCode OJ> 121. Best Time to Buy and Sell Stock
page_num is 1    12    cartzhang    2小时前    评论(0)    阅读(98)    [架构设计]    给虚幻4添加内存跟踪功能
page_num is 1    13    u013595419    2小时前    评论(0)    阅读(93)    [综合]    第2章第1节练习题3 共享栈的基本操作
page_num is 1    14    ghostbear    2小时前    评论(0)    阅读(115)    [系统运维]    Dynamics CRM 2016 Series: Overview
page_num is 1    15    u014723529    2小时前    评论(0)    阅读(116)    [编程语言]    将由BeanUtils的getProperty方法返回的Date对象的字符串表示还原为对象
page_num is 1    16    Evankaka    2小时前    评论(1)    阅读(142)    [架构设计]    Jenkins详细安装与构建部署使用教程
page_num is 1    17    Evankaka    2小时前    评论(0)    阅读(141)    [编程语言]    Ubuntu安装配置JDK、Tomcat、SVN服务器

With a poor network connection, the multithreaded version may raise errors.
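One possible way to soften such failures is to retry a failed fetch a few times before giving up. The sketch below is an assumption of mine, not part of the original script: `with_retries` and its parameters are hypothetical names, and it relies on the fact that `urllib.error.URLError` and socket timeouts are subclasses of `OSError`.

```python
import time


def with_retries(func, attempts=3, delay=1.0, exceptions=(OSError,)):
    """Call func(); on a listed exception, wait and try again.

    Re-raises the last error if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)  # wait a little longer each time


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky():
        # Simulated fetch that fails twice before succeeding.
        calls["n"] += 1
        if calls["n"] < 3:
            raise OSError("simulated timeout")
        return "ok"

    print(with_retries(flaky, delay=0.01))
```

In the crawler this could wrap the page fetch, for example `with_retries(lambda: cnblog.getPage(url))`. Wrapping the fetch rather than the whole `parsePage` call avoids writing the same rows to a.txt twice.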

Once the data has been collected, you can analyze it or crawl deeper, for example by using each author name to fetch that author's own blog.
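As a small example of such analysis, the tab-separated rows written to a.txt can be tallied per author and per category. This is only a sketch: `parse_line`, `count_in` and `summarize` are hypothetical helpers, and the column order is assumed to match the one used by `writeFile` (page, index, author, time, comments, views, category, title).

```python
import re
from collections import Counter

COUNT_RE = re.compile(r"\((\d+)\)")  # digits inside parentheses, e.g. 阅读(104)


def count_in(text):
    """Extract the number from a string like 阅读(104); 0 if none is found."""
    match = COUNT_RE.search(text)
    return int(match.group(1)) if match else 0


def parse_line(line):
    """Split one tab-separated row into a dict of the fields used below."""
    page, num, author, post_time, comments, views, category, title = (
        line.rstrip("\n").split("\t", 7)
    )
    return {
        "author": author,
        "category": category,
        "title": title,
        "comments": count_in(comments),
        "views": count_in(views),
    }


def summarize(lines):
    """Count posts per author and total views per category."""
    posts_per_author = Counter()
    views_per_category = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row = parse_line(line)
        posts_per_author[row["author"]] += 1
        views_per_category[row["category"]] += row["views"]
    return posts_per_author, views_per_category
```

To run it on the real output, open a.txt with the same encoding it was written in and pass the file object to `summarize`; `posts_per_author.most_common(5)` then lists the most active authors.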
