CSDN Blog Crawler Update


I hadn't visited my CSDN blog for a few days, and for some reason the CSDN homepage was redesigned; the page layout is no longer what it used to be, so the CSDN blog crawler I wrote earlier stopped working. Today I updated the XPath-based crawler; I haven't touched the regex-based one, since fixing that would be more of a hassle.

# -*- coding:gbk -*-
import sys
import requests
import re
from lxml import etree
from lxml import html as ht

def download(url):
    # Fetch a page with a browser User-Agent so CSDN does not reject the request
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"}
    html = requests.get(url, headers=headers).text
    return html

html2 = download("http://blog.csdn.net/Joliph")
selector2 = etree.HTML(html2)
pagelist = selector2.xpath('//*[@id="papelist"]/a[last()-2]/text()')[0]
# Potential issue: once the blog grows past 5 pages the pager shows "...",
# and the page count can no longer be determined this way
pagelist = int(pagelist)

for page in range(1, pagelist + 1):
    url = "http://blog.csdn.net/Joliph/article/list/" + str(page)
    html = download(url)
    selector = etree.HTML(html)
    titlelist = selector.xpath('//*[@class="link_title"]/a/text()')
    datelist = selector.xpath('//*[@class="article_manage"]/span[1]/text()')
    # Note: the trailing /text() is required here
    number = len(titlelist)
    for i in range(1, number + 1):
        # string(.) collapses the title node into clean plain text
        tree = ht.fromstring(titlelist[i - 1])
        strcom = tree.xpath('string(.)')
        print(datelist[i - 1] + "----" + strcom)
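A possible workaround for the page-count problem noted in the comment above: instead of reading the third-from-last pager link, join all the text inside the #papelist container and look for a "共N页" ("N pages in total") summary with a regular expression. This is only a sketch; it assumes the old-style CSDN pager actually prints such a summary, and the get_page_count helper and the regex pattern are my own additions, not part of the original script.

import re
from lxml import etree

def get_page_count(html):
    # Hypothetical helper: join every text node inside the pager container and
    # search for a "共N页" summary (an assumption about the old CSDN pager markup)
    selector = etree.HTML(html)
    pager_text = "".join(selector.xpath('//*[@id="papelist"]//text()'))
    match = re.search(r'共(\d+)页', pager_text)
    if match:
        return int(match.group(1))
    # Fall back to a single page if the summary text is not found
    return 1

If that assumption holds, pagelist = get_page_count(html2) could replace the a[last()-2] lookup, trading a dependency on the pager links for one on the summary text.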