Web Crawlers (Part 1)

Source: Internet  Published by: 农大网络教学综合平台  Editor: 程序博客网  Time: 2024/05/16 07:36


Web crawlers are used more and more these days, and crawling is an important part of data mining. Python, as a scripting language, plays a major role in this field. There are many crawler examples online, most built on the urllib library or on the requests and bs4 libraries. Since I want to work on some crawler projects, I collected some material to study; below is a simple web-crawler example (scraping basic information about movies from Douban) that illustrates the core functions of these libraries and how they fit together.
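Before diving into the full script, here is a requests + bs4 round trip in miniature. The HTML is a literal string so the snippet runs without network access; the markup only mimics the `div.pl2` blocks Douban's search results used at the time (the title and subject URL are made up for illustration), and the real page layout may differ today.

```python
from bs4 import BeautifulSoup

# A hand-written snippet imitating one Douban search-result block.
html = '''
<div class="pl2">
  <a href="https://movie.douban.com/subject/0000000/">泰坦尼克号</a>
  <p class="pl">1997-11-01 / 詹姆斯·卡梅隆</p>
</div>
'''

# 'html.parser' is the stdlib parser, so no lxml install is needed here.
soup = BeautifulSoup(html, 'html.parser')
block = soup.find('div', {'class': 'pl2'})
title = block.find('a').get_text().strip()
link = block.find('a')['href']
print(title, link)
```

With a real page, the same two `find` calls would run against `response.text` from requests instead of a literal string.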

Basic implementation code


This code was written on 2017-03-23 at 23:06:32. The major sites are redesigned from time to time, so there is no guarantee that the code below will still work later; it mainly provides some basic ideas and some of my own study techniques.

```python
import time
from urllib import parse

import requests
from bs4 import BeautifulSoup as bs

URL_GET = 'https://movie.douban.com/subject_search'


def url_api():
    """
    Build the URLs for requests; change the range to get more pages.
    :return: a generator of URLs.
    """
    for number in range(0, 50):
        page = number * 15
        param = {'start': page, 'search_text': '科幻'}
        url = '?'.join([URL_GET, '%s']) % parse.urlencode(param)
        yield url


def get_response(url):
    res = []
    response = requests.get(url)
    if response.status_code == 200:
        time.sleep(1)  # be polite: pause between requests
        res.append(response)
        return res
    else:
        raise Exception('RequestError: %s' % response.status_code)


def local_data(response):
    """
    Return the name, actors, score and rating count of each movie,
    skipping movies that have no score.
    :param response: a requests.Response object
    :return: a list of (name, actors, score, rating count) tuples.
    """
    amovie = []
    bsobj = bs(response.text, 'lxml')
    movies = bsobj.find_all('div', {'class': 'pl2'})
    for movie in movies:
        name = movie.find('a').get_text().strip('\n').replace(' ', '')
        actors = movie.find('p').get_text().strip('\n').replace(' ', '')
        try:
            comment = movie.find('div').find_all('span')
            score = comment[1].get_text()
            comm = comment[2].get_text()[1:-4]
            amovie.append((name, actors, score, comm))
        except Exception:
            pass  # no score block: skip this movie
    return amovie


def local_movie_page(response):
    """
    Return the name and index page of each movie.
    :param response: a requests.Response object
    :return: a dict mapping name to URL
    """
    movie_page = {}
    bsobj = bs(response.text, 'lxml')
    # note: the attrs argument must be a dict ({'class': ''}),
    # not a set ({'class', ''})
    mov_url = bsobj.find_all('table', {'class': ''})
    for url in mov_url:
        ac = url.find('a')
        name = ac.find('img')
        movie_page[name['alt']] = ac['href']
    return movie_page


def write_file(filename, line):
    """
    Append one movie name and index URL to a file.
    :param filename: file in which to save the movies
    :param line: a string of the form 'name:url'
    :return:
    """
    with open(filename, 'a+', encoding='utf-8') as f:
        f.write(line)


if __name__ == '__main__':
    # locate the movies' index pages
    for url in url_api():
        res = get_response(url)
        for i in res:
            abc = local_movie_page(i)
            for key, value in abc.items():
                write_file('b.txt', '{0}:{1}'.format(key, value) + '\n')
```
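The URL-building step can be sanity-checked without touching the network. The sketch below re-creates the generator with the page count and search keyword exposed as parameters (that parameterization is my addition, not part of the original script) and prints the first URL it yields.

```python
from urllib import parse

URL_GET = 'https://movie.douban.com/subject_search'


def url_api(pages=50, keyword='科幻'):
    # Same logic as the generator above; each Douban result page
    # holds 15 entries, hence the 'start' offset of number * 15.
    for number in range(pages):
        param = {'start': number * 15, 'search_text': keyword}
        yield URL_GET + '?' + parse.urlencode(param)


urls = list(url_api(pages=3))
print(urls[0])
# → https://movie.douban.com/subject_search?start=0&search_text=%E7%A7%91%E5%B9%BB
```

Note how `urlencode` percent-encodes the UTF-8 bytes of '科幻' automatically, which is why the original builds the query string this way instead of concatenating it by hand.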

Error message notes


TypeError: expected string or buffer

Explanation: this is a type error, meaning a passed-in argument is wrong. Check that the argument types the function expects match the types you are actually passing in.
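A typical way to hit this error in a crawler is handing something other than a string (for example `None`, or a whole `requests.Response` object instead of `response.text`) to a function that wants text. The snippet below provokes the same TypeError with the stdlib `re` module; note that Python 3 words the message "expected string or bytes-like object", while "expected string or buffer" is the Python 2 wording of the same error.

```python
import re

try:
    # None where a string is expected, e.g. a lookup that returned nothing
    re.search(r'\d+', None)
    err = None
except TypeError as exc:
    err = exc

print(type(err).__name__, err)
```

The fix is the same in every case: trace the value back to where it was produced and confirm it is the string you think it is before passing it on.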
