[Simple Crawler] Recording Blog Traffic - day day up

Source: Internet | Editor: 程序博客网 | Time: 2024/06/05 17:44

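I built a small tool that records the daily traffic of my csdn blog. If the program catches an exception while running, it sends a mail to my mailbox so that I can deal with the problem. Each day's traffic is recorded in a csv file, whose contents can conveniently be loaded and plotted with pandas.
The tools used are: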
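
As a rough illustration of how the recorded file could be read back, here is a minimal sketch that uses only the standard csv module. It assumes each row has the form `date,total`, as written by the tool, and turns the running totals into per-day increases. The function name `load_daily_deltas` is made up for this example; `pandas.read_csv` could load the same file if plotting is wanted.

```python
import csv
from datetime import date


def load_daily_deltas(lines):
    """Return a list of (day, total, delta) from rows of 'YYYY-MM-DD,total'.

    `lines` is any iterable of csv lines, for example an open file object.
    The first valid row has no previous day, so its delta is None.
    """
    result = []
    previous_total = None
    for row in csv.reader(lines):
        # Skip blank rows and rows whose count is not a plain number
        # (for example "None" written after a failed fetch).
        if len(row) < 2 or not row[1].isdigit():
            continue
        day = date.fromisoformat(row[0])
        total = int(row[1])
        delta = None if previous_total is None else total - previous_total
        result.append((day, total, delta))
        previous_total = total
    return result


if __name__ == "__main__":
    # Assumes flux.csv exists in the current directory.
    with open("flux.csv", newline="") as f:
        for day, total, delta in load_daily_deltas(f):
            print(day, total, delta)
```

Because the function accepts any iterable of lines, it can be tried on a small list of strings before pointing it at the real file.
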

requests
bs4(beautifulsoup4)
csv (built-in)
smtplib (built-in)

See the code comments for more detail.

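import csv                            # read and write csv files
import smtplib                        # send mail
import time                           # sleep
from datetime import datetime
from email.mime.text import MIMEText  # build a UTF-8 mail message

import bs4                            # parse the page content
import requests                       # send the network request

# csdn blog address; the one below is mine
__url = "http://blog.csdn.net/pengjian444?skin=skin-yellow"


def send_mail(content, to_address='1823403153@qq.com'):
    try:
        # content may be an exception object, so convert it to str first
        content = "[Blog recording system failure]: " + str(content)
        # account on the smtp server used below
        username = "××××××"
        # mailbox authorization code
        password = "××××××"
        message = MIMEText(content, 'plain', 'utf-8')
        message['Subject'] = "Blog recording system failure"
        message['From'] = username
        message['To'] = to_address
        # This uses NetEase's smtp server.
        # The connection uses ssl; the default port is 465.
        smtp = smtplib.SMTP_SSL("smtp.163.com", port=465)
        smtp.login(username, password)
        smtp.sendmail(username, to_address, message.as_string())
        smtp.quit()
    except Exception as base_e:
        log_line = "{}:{}".format(datetime.now(), str(base_e))
        print(log_line)
        with open('log.txt', 'a') as f:
            f.write(log_line + "\n")


def get_page_content(url):
    """
    Fetch the page content; returns "" if the status code is not 200.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                      'AppleWebKit/537.36 ('
                      'KHTML, like Gecko) Chrome/58.0.3029.81 '
                      'Safari/537.36',
        'Host'      : 'blog.csdn.net'
    }
    # the timeout keeps a stalled connection from blocking the loop forever
    r = requests.get(url, headers=headers, timeout=10)
    if r.status_code != 200:
        return ""
    else:
        return r.text


def get_flux(page_content):
    # get_page_content returns "" on failure, so test for emptiness
    if not page_content:
        return None
    soup = bs4.BeautifulSoup(page_content, "html.parser")
    # the text ends with one non-digit character, which [:-1] strips
    return int(soup.select("#blog_rank li span")[0].string[:-1])


def write_csv(flux_data, file_name='flux.csv'):
    """
    Append one row (date, traffic) to the csv file
    :param flux_data: traffic count to record
    :param file_name: path of the csv file
    :return:
    """
    row = [str(datetime.now().date()), str(flux_data)]
    # newline='' stops the csv module from writing blank lines on Windows
    with open(file_name, 'a', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(row)


if __name__ == '__main__':
    while True:
        try:
            content = get_page_content(url=__url)
            flux = get_flux(content)
            if flux is None:
                # report a failed fetch instead of recording "None"
                raise ValueError("could not fetch the blog page")
            print(flux)
            write_csv(flux_data=flux)
        except Exception as e:
            send_mail(e)
        time.sleep(24 * 60 * 60)  # record once a day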
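
The get_flux function strips the last character of the counter text with `[:-1]`, which only works while the text is a number followed by exactly one other character. A slightly more tolerant approach might be to pull out the digits with a regular expression. The sketch below is only an illustration: the helper name `parse_visit_count` and the sample strings such as `12345次` are assumptions, not taken from the real page.

```python
import re


def parse_visit_count(text):
    """Extract the first number from text such as '12345次'.

    Returns None when text is None or contains no digits.
    Thousands separators (commas) inside the number are dropped.
    """
    if text is None:
        return None
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))


if __name__ == "__main__":
    print(parse_visit_count("12345次"))
```

Returning None instead of raising lets the caller decide whether a missing counter should trigger the alert mail.
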

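This brings together a few small knowledge points, including the following:
+ Reading and writing csv files with the csv module
+ Simple use of the requests library (sending a GET request, setting headers)
+ Parsing page content with bs4
+ Sending mail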
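
For the mail step, one way to make the alert easier to check is to build the message separately from sending it, so that the subject and body can be inspected without any network access. This is a minimal sketch using the standard library's `email.message.EmailMessage`; the function name `build_alert`, the subject text, and the example addresses are all assumptions made for illustration.

```python
from email.message import EmailMessage


def build_alert(content, sender, recipient):
    """Build an alert mail; content may be a string or an exception."""
    msg = EmailMessage()
    msg["Subject"] = "[Blog traffic recorder failure]"
    msg["From"] = sender
    msg["To"] = recipient
    # str() makes exception objects safe to use as the body
    msg.set_content(str(content))
    return msg


if __name__ == "__main__":
    alert = build_alert(ValueError("page not reachable"),
                        "sender@example.com", "receiver@example.com")
    print(alert)
```

A message built this way can then be handed to `send_message` on an `smtplib.SMTP_SSL` connection after logging in.
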

The sparrow may be small, but it has all its vital organs. I hope this is helpful to everyone.

0 0