Python: Crawler Series - 01


I've been reading Learning Python for a while now and have made it roughly as far as classes, but I had never actually put any of it into practice.
So I decided to build something small. Not knowing what to write, I settled on a crawler.

  • Following a crawler tutorial I found online, I wrote a small exercise that crawls the links on a web page.
    • common_var.py
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @author : cat
# @date   : 2017/6/25.

user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
headers = {"User-Agent": user_agent}

if __name__ == '__main__':
    pass
```
    • http_file.py
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @author : cat
# @date   : 2017/6/24.

from urllib import request
import ssl
from web.common_var import headers
import re

# URL-matching regex borrowed from Django
regex = re.compile(
    r'^(?:http|ftp)s?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)

csdn = 'http://www.csdn.com'


def get_urls(url_in=csdn, key="href="):
    """
    Crawl all the URLs found on the page at the entry URL.
    :param url_in: the entry URL
    :param key: 'href='
    :return: a set of URLs
    """
    url_sets = set()
    ssl_context = ssl._create_unverified_context()
    req = request.Request(url_in, headers=headers)
    resp_bytes = request.urlopen(req, context=ssl_context)
    for line in resp_bytes:
        line_html = line.decode('utf-8')
        # print(line_html)
        if key in line_html:
            # print(line_html)
            index = line_html.index(key)
            sub_url = line_html[index + len(key):].replace('"', "#").split('#')[1]
            match = regex.search(sub_url)
            if match:
                # print(match.group())
                # yield match.group()
                url_sets.add(match.group())
                # print(url_sets)
    return url_sets


if __name__ == '__main__':
    # print(list(get_urls("http://news.baidu.com/?tn=news")))
    baidu_news = "http://news.baidu.com/?tn=news"
    urls = get_urls(baidu_news)
    # print(urls)
    for u in urls:
        print(u)
    print("total url size in this website({}) = {}"
          .format(baidu_news, len(urls)))
```

The code isn't particularly concise, but it is at least easy to follow.

The output looks like this:

```
/web/http_file.py
https://baijia.baidu.com/s?id=1571043179126899
http://net.china.cn/chinese/index.htm
http://newsalert.baidu.com/na?cmd=0
http://tech.baidu.com/
http://tv.cctv.com/2017/06/24/VIDE9KYKPMTmLLENgIgdhyut170624.shtml
http://xinwen.eastday.com/a/170624122900408.html
http://shehui.news.baidu.com/
…  # many more URLs follow; not pasting them all here
…
total url size in this website(http://news.baidu.com/?tn=news) = 116

Process finished with exit code 0
```
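As an aside, the `replace('"', "#").split('#')` trick for digging the URL out of each line works, but it is easy to lose track of. A more compact variant would let a regex grab the `href` values directly and read the whole response at once, so attributes split across line boundaries aren't missed. This is just a sketch of my own, not part of the original exercise; the `fetch_hrefs` name is made up:

```python
import re
import ssl
from urllib import request

# Matches absolute http/https URLs inside href="..." attributes.
HREF_RE = re.compile(r'href="(https?://[^"]+)"', re.IGNORECASE)


def fetch_hrefs(url, user_agent="Mozilla/5.0"):
    """Return the set of absolute http(s) URLs found in href attributes (hypothetical helper)."""
    ctx = ssl._create_unverified_context()  # same unverified-SSL shortcut as the original code
    req = request.Request(url, headers={"User-Agent": user_agent})
    html = request.urlopen(req, context=ctx).read().decode("utf-8", errors="ignore")
    return set(HREF_RE.findall(html))


if __name__ == "__main__":
    links = fetch_hrefs("http://news.baidu.com/?tn=news")
    print(len(links))
```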

Next step:

Next, I plan to follow the child links and see how many links there are in total. That sounds like a huge undertaking, and I'm not sure whether I'll ever finish it…
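If I do get to it, the rough shape would probably be a breadth-first walk that keeps re-applying get_urls to every link it finds, bounded by a depth limit and a visited set so it doesn't run forever. A minimal sketch under those assumptions (the crawl helper and the max_depth parameter are my own, not code I've actually written for this series):

```python
# Sketch only: breadth-first crawl over the links that get_urls() returns,
# bounded by max_depth and a visited set. crawl() is my own guess at what
# the next step could look like, not code from this post.
from collections import deque

from web.http_file import get_urls  # the get_urls() defined above


def crawl(start_url, max_depth=1):
    visited = set()
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        try:
            for child in get_urls(url):
                queue.append((child, depth + 1))
        except Exception as exc:  # many links won't be HTML or will time out
            print("skipped {}: {}".format(url, exc))
    return visited


if __name__ == "__main__":
    all_urls = crawl("http://news.baidu.com/?tn=news", max_depth=1)
    print("total links reachable within 1 hop = {}".format(len(all_urls)))
```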
