A practical crawler with requests + beautifulsoup4


The mobile pages of a movie site show a rating for every title but offer no way to sort by it. To find the high-scoring films, I wrote a crawler that downloads each title together with its rating and writes them to a file; the sorting is then done afterwards in Excel.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
"""Here is docstring"""
# __author__ = c08762
import time

import requests
from bs4 import BeautifulSoup

names = []
scores = []
headers = {'User-Agent': 'Mozilla/5.0 (iPhone; U; CPU iPhone OS 5_1_1 like Mac OS X; en) AppleWebKit/534.46.0 (KHTML, like Gecko) CriOS/19.0.1084.60 Mobile/9B206 Safari/7534.48.3'}
root_url = 'http://www.dyaihao.com/type/5.html'
i = 1
print('Fetching %s' % root_url)
resp = requests.get(root_url, headers=headers, timeout=15)
while resp.status_code == 200:
    print('Pausing 5 seconds after each page\n')
    time.sleep(5)
    resp.encoding = 'utf-8'
    soup = BeautifulSoup(resp.text, 'lxml')
    # h3s is a list of Tag objects; extract the movie titles
    h3s = soup.select('li h3')
    for h in h3s:
        # h.text is a str; drop the 3-character prefix before the title
        th = h.text
        names.append(th[3:])
    # Extract the ratings
    ps = soup.select('li p')
    for p in ps:
        tp = p.text
        scores.append(tp[:-1])
    # Is there a next page?
    next_p = soup.find('a', class_="btn btn-primary btn-block")
    if next_p is None:
        print('Crawling finished, writing results to a text file...')
        name_score = dict(zip(names, scores))
        fileObject = open('/home/c08762/sample.txt', 'w')
        for k, v in name_score.items():
            fileObject.write(str(k))
            fileObject.write(",")
            fileObject.write(str(v))
            fileObject.write('\n')
        fileObject.close()
        print('File written. Done.')
        break
    else:
        # Build the absolute URL of the next page, then fetch it
        build_url = "http://www.dyaihao.com" + next_p['href']
        i += 1
        if 0 == i % 20:
            print('\nAnti-crawler precaution: pausing 30 seconds\n')
            time.sleep(30)
        print('Fetching %s' % build_url)
        resp = requests.get(build_url, headers=headers, timeout=60)
else:
    # The while/else branch runs when the loop ends without break,
    # i.e. when a request comes back with a non-200 status code
    print('Failed to open the page')
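If you would rather skip the Excel step, the sorting can also be done in a few lines of Python. The sketch below is only an illustration: it assumes the input is the "name,score" text file written above, that the score strings parse as numbers, and the helper name sort_scores is hypothetical.

# Read the "name,score" lines written by the crawler, sort them by score
# (highest first) and write a sorted copy next to the original file.
def sort_scores(path='/home/c08762/sample.txt'):
    with open(path, encoding='utf-8') as f:
        rows = [line.rstrip('\n').rsplit(',', 1) for line in f if line.strip()]

    def key(row):
        # Rows whose score does not parse as a number sink to the bottom
        try:
            return float(row[1])
        except ValueError:
            return float('-inf')

    rows.sort(key=key, reverse=True)
    with open(path + '.sorted', 'w', encoding='utf-8') as f:
        for name, score in rows:
            f.write('%s,%s\n' % (name, score))

if __name__ == '__main__':
    sort_scores()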

To do: implement a daily incremental email reminder.
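A minimal sketch of that reminder with the standard library's smtplib could look like the following. It assumes today's and yesterday's results are stored in the same "name,score" format as above; the SMTP host, addresses and password are placeholders, and scheduling would be left to cron.

import smtplib
from email.mime.text import MIMEText
from email.header import Header

def load_results(path):
    """Return the name -> score mapping stored in a crawler output file."""
    results = {}
    try:
        with open(path, encoding='utf-8') as f:
            for line in f:
                if line.strip():
                    name, score = line.rstrip('\n').rsplit(',', 1)
                    results[name] = score
    except FileNotFoundError:
        pass  # first run: nothing to compare against
    return results

def send_daily_report(today_path, yesterday_path,
                      sender='me@example.com', receiver='me@example.com'):
    today = load_results(today_path)
    yesterday = load_results(yesterday_path)
    # Only titles that were not in yesterday's file count as "incremental"
    new_items = [(name, score) for name, score in today.items()
                 if name not in yesterday]
    body = '\n'.join('%s,%s' % (name, score) for name, score in new_items) \
           or 'No new titles today.'
    msg = MIMEText(body, 'plain', 'utf-8')
    msg['Subject'] = Header('Daily new high-score movies', 'utf-8')
    msg['From'] = sender
    msg['To'] = receiver
    with smtplib.SMTP('smtp.example.com', 25) as server:  # placeholder SMTP server
        server.login(sender, 'app-password')              # placeholder credentials
        server.sendmail(sender, [receiver], msg.as_string())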
