Scraping Douban movie lists with a Python crawler


Motivation

It started when I got the sudden urge to scrape Douban's movie lists and pull out the text of each entry. I had previously followed an online tutorial on scraping and downloading the images from a page; it worked, but it didn't feel like enough, so I spent a day or two studying crawlers properly and finally figured them out. Here is what I learned.

How a crawler works

The principle is simple: you request a page, the server sends back an HTML document, and you pull out the elements you want, whether with Python regular expressions, the BeautifulSoup library, or XPath. That is really all there is to it. A crawler is a very simple thing, so don't imagine it is harder than it is. A minimal sketch of that request-then-parse loop follows.
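Here is a minimal sketch of the request-then-parse loop, using requests plus BeautifulSoup. The URL and the "doulist-item"/"title" class names are taken from the full script further down; Douban can change its markup at any time, so treat them as assumptions.

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://www.douban.com/doulist/240962/")
soup = BeautifulSoup(resp.text, "html.parser")

# Each movie entry on a doulist page sits in a div with class "doulist-item".
for item in soup.find_all(class_="doulist-item"):
    title = item.find(class_="title")
    if title is not None:
        print(title.get_text(strip=True))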

Techniques used

Multiplexed (concurrent) fetching, via grequests (see the sketch after this list)
Document parsing with the BeautifulSoup library
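This is what the multiplexed fetching looks like in isolation: grequests builds the requests lazily and fires them all at once on gevent greenlets, instead of fetching pages one by one. The second URL assumes Douban's start= pagination parameter; adjust as needed.

import grequests  # import before requests-based code: it monkey-patches sockets via gevent

urls = [
    "https://www.douban.com/doulist/240962/",
    "https://www.douban.com/doulist/240962/?start=25",
]

# Build the request objects lazily, then send them all concurrently.
pending = (grequests.get(u) for u in urls)
for resp in grequests.map(pending):
    if resp is None:  # grequests.map() yields None for a failed request
        continue
    print(resp.url, resp.status_code)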

The code

The script below parses Douban movie list pages and extracts the text of each entry. Worth noting: a simple crawler really isn't hard. The work with real substance is parsing the content you fetch; that is where the value lies.

#!/usr/bin/python3
# encoding: utf-8
import grequests  # must come first: it monkey-patches sockets via gevent
import urllib.request
from bs4 import BeautifulSoup

visited = []  # URLs already fetched (previously named 'list', shadowing the builtin)


def findINFO(url):
    """Recursive, one-request-at-a-time variant (kept for reference)."""
    if url in visited:
        return
    visited.append(url)
    page = urllib.request.urlopen(url).read()
    for next_url in ReturnList(page):
        findINFO(next_url)


def find_group(urls):
    """Concurrent variant: fetch a whole batch of pages at once with grequests."""
    urls = [url for url in urls if url not in visited]
    if len(urls) == 0:
        return
    new_urls_set = set()
    pending = (grequests.get(u) for u in urls)
    for res in grequests.map(pending):
        if res is None:  # grequests yields None for a failed request
            continue
        visited.append(res.url)
        for url in ReturnList(res.text):
            new_urls_set.add(url)
    find_group(new_urls_set)


def ReturnList(HTML):
    """Parse one doulist page: print each entry's text, return pagination links."""
    links = []
    soup = BeautifulSoup(HTML, "html.parser")
    # Walk the page structure: each entry is a div with class "doulist-item".
    for link in soup.find_all(class_="doulist-item"):
        print(link.find(class_="title").a.get_text())
        print(link.find(class_="rating").find(class_="rating_nums").get_text())
        print(link.find(class_="rating").find_all("span")[2].get_text())
        print(link.find(class_="abstract").get_text())
    # Collect the pagination links so the caller can crawl the next pages.
    for pp in soup.find(class_="paginator").find_all("a"):
        links.append(pp.attrs["href"])
    return links


# findINFO("https://www.douban.com/doulist/240962/")
# print(visited)
find_group(["https://www.douban.com/doulist/240962/"])
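Since XPath was mentioned above as a third option, here is a sketch of the same parsing step done with lxml instead of BeautifulSoup. The XPath expressions mirror the class names used in the script; they are assumptions about Douban's markup, not guaranteed selectors.

from lxml import html as lxml_html

def return_list_xpath(page_html):
    tree = lxml_html.fromstring(page_html)
    # Each entry is a div whose class attribute contains "doulist-item".
    for item in tree.xpath('//div[contains(@class, "doulist-item")]'):
        titles = item.xpath('.//div[@class="title"]/a/text()')
        if titles:
            print(titles[0].strip())
    # Return the pagination links, as the BeautifulSoup version does.
    return tree.xpath('//div[@class="paginator"]/a/@href')

You can test this parsing step offline by saving a doulist page to disk and feeding the file contents in, which is a handy way to iterate on selectors without hammering the site.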