[Crawler Notes] Getting Started with Web Crawlers

Source: Internet | Published by: 用算法实现100以内质数 | Editor: 程序博客网 | Time: 2024/05/16 05:23

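After a lot of stumbling I can now crawl some data, so I count myself as having one foot in the door. That said, there is undeniably still a long way to go.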

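Because I had already written Python 3 for a while during an internship, I went straight to learning crawlers from a MOOC course. Here is the link: link to the crawler introduction. After watching that video you will have a basic understanding of crawlers. I also read many technical blogs. For a mature technology like this, a baidu/google search usually turns up plenty of good blog posts to learn from.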

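Python 3 has only urllib and no urllib2. More precisely, urllib2 has not disappeared: Python 2's urllib and urllib2 were merged into a single package named urllib, with submodules such as urllib.request and urllib.parse. For more detail on the differences, see this link: link on some differences between urllib in python3 and python2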
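As a rough illustration of that merge, the sketch below shows where a few familiar Python 2 names live in Python 3. The query parameter and the User-Agent value are made up for the example, and no request is actually sent.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://acm.nyist.net/JudgeOnline/problemset.php"

# Python 2: urllib.urlencode(...)  ->  Python 3: urllib.parse.urlencode(...)
query = urllib.parse.urlencode({"page": 1})
url = BASE_URL + "?" + query

# Python 2: urllib2.Request(...)  ->  Python 3: urllib.request.Request(...)
# The User-Agent string here is a placeholder, not a real browser string.
request = urllib.request.Request(url, headers={"User-Agent": "example-agent/0.1"})

# Python 2: urllib2.urlopen(request)  ->  Python 3: urllib.request.urlopen(request)
# It is not called here, so the sketch runs without network access.
print(request.full_url)
print(request.get_method())
```

A Request built without a data argument uses the GET method, which is why the method never has to be set by hand.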

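Knowledge only sinks in through practice:

First, you should have a basic picture of a crawler's overall architecture. This matters a great deal: it is like the overall plan for a task, and with it your general path will not go wrong.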

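1. URL manager: manages the URLs of the pages you need to crawl, have already crawled, and have yet to crawl.

2. Page downloader (urllib): downloads the HTML of the page at a given URL to the local machine.

3. Page parser (BeautifulSoup): parses the page structurally into a DOM (Document Object Model), turning the HTML/XML page into a tree structure from which useful data can be extracted.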

Of course, each part has plenty of depth you could dig into. Here I only outline the overall framework.
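To make the first component concrete, here is a minimal sketch of a URL manager built on two sets. The class name UrlManager and its method names are hypothetical, chosen only for this illustration.

```python
class UrlManager:
    """Tracks URLs waiting to be crawled and URLs already crawled."""

    def __init__(self):
        self.new_urls = set()  # pending URLs
        self.old_urls = set()  # URLs that have been handed out for crawling

    def add(self, url):
        """Queue a URL unless it is empty or has been seen before."""
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new(self):
        """Return True while there are pending URLs."""
        return bool(self.new_urls)

    def get(self):
        """Take one pending URL and mark it as crawled."""
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


manager = UrlManager()
manager.add("http://acm.nyist.net/JudgeOnline/problemset.php?page=1")
manager.add("http://acm.nyist.net/JudgeOnline/problemset.php?page=1")  # duplicate, ignored
first = manager.get()
manager.add(first)  # already crawled, ignored
```

Using sets makes the "have I seen this URL" check cheap, at the cost of an undefined crawl order, since set.pop() returns an arbitrary element.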

"""
    Crawl the problem names from http://acm.nyist.net/JudgeOnline/problemset.php
"""
import re
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup

"""
    Here I wrapped my crawler in a MySpider class.
    1. Sets serve as the URL manager: new_urls holds the URLs of pages still
       to be crawled, and old_urls holds the URLs of pages already crawled.
    2. url_downloader() is the page downloader. Given a URL, it downloads the
       page's html/xml.
    3. page_resolver() is the page parser. Given an html/xml string, it
       extracts the useful information.
"""
class MySpider(object):
    def __init__(self, root_url):
        # Per-instance sets, so that separate spiders do not share state.
        self.new_urls = set()
        self.old_urls = set()
        self.new_urls.add(root_url)

    def url_downloader(self, url):
        req = urllib.request.Request(url)
        req.add_header("User-Agent", "Mozilla/5.0........ Firefox/50.0")
        # GET is the request method, not a header, so it is not added here;
        # a Request without data is sent as GET.
        req.add_header("Host", "acm.nyist.net")
        req.add_header("Referer", "http://acm.nyist.net/JudgeOnline/problemset.php")
        # To build the request headers, look at what your own browser sends:
        #   Host: acm.nyist.net
        #   User-Agent: Mozilla/5.0 (Windows NT 10.0; rv:50.0) Gecko/20100101 Firefox/50.0
        #   Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
        #   Accept-Language: zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3
        #   Accept-Encoding: gzip, deflate
        #   Cookie: __utma=1.777807425.1476802115.1485247339.1485251033.16; __utmz=1.1476802115.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); _gscu_771983383=7991431907sfcl37; PHPSESSID=1e816be3352a5e380b670153ccb7f0bd; __utmc=1
        #   Connection: keep-alive
        #   Upgrade-Insecure-Requests: 1
        #   Cache-Control: max-age=0
        response = urllib.request.urlopen(req)
        return response.read()

    def page_resolver(self, page_content):
        # BeautifulSoup is a parsing tool.
        soup = BeautifulSoup(page_content, 'html.parser', from_encoding='utf-8')

        # Append the text of every problem link to problem.txt.
        problem = soup.find_all('a', href=re.compile(r'problem\.php\?pid=\d+'))
        with open('problem.txt', 'a+', encoding='utf-8') as _file:
            for item in problem:
                print(item.get_text(), file=_file)

        # Queue every pagination link that has not been seen yet.
        page_url = soup.find_all('a', href=re.compile(r'\?page=\d+'))
        print(page_url)
        for item in page_url:
            newurl = item['href']
            newfullurl = urllib.parse.urljoin("http://acm.nyist.net/JudgeOnline/problemset.php", newurl)
            if newfullurl not in self.new_urls and newfullurl not in self.old_urls:
                self.new_urls.add(newfullurl)

    # crewl schedules the crawler and is itself part of the crawler.
    def crewl(self):
        while self.new_urls:
            url = self.new_urls.pop()
            self.old_urls.add(url)
            page = self.url_downloader(url)
            self.page_resolver(page)


initurl = "http://acm.nyist.net/JudgeOnline/problemset.php?page=1"
spider = MySpider(initurl)
spider.crewl()
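The link handling in page_resolver() rests on two pieces: the regular expressions that pick out links, and urllib.parse.urljoin, which turns a relative href into a full URL. The small sketch below exercises both without any network access. The href values are made up for illustration and are only assumed to resemble what the page contains.

```python
import re
import urllib.parse

BASE_URL = "http://acm.nyist.net/JudgeOnline/problemset.php"
PAGE_LINK = re.compile(r'\?page=\d+')
PROBLEM_LINK = re.compile(r'problem\.php\?pid=\d+')

# Hypothetical href values, standing in for those found on the real page.
hrefs = ["?page=2", "problem.php?pid=1", "index.php"]

# A query-only reference such as "?page=2" keeps the base path and
# replaces only the query string.
page_urls = [urllib.parse.urljoin(BASE_URL, h) for h in hrefs if PAGE_LINK.search(h)]
problem_hrefs = [h for h in hrefs if PROBLEM_LINK.search(h)]

print(page_urls)
print(problem_hrefs)
```

Resolving every href to a full URL before storing it is what lets the set-based duplicate check work, since the same page could otherwise appear under several relative spellings.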

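Notes:

1. On viewing the request information: your browser's developer tools show the headers it sends with each request.

2. On BeautifulSoup: a baidu/google search will turn up good technical blog posts to get started with.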
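As a small starting point for BeautifulSoup, the sketch below parses an inline HTML string instead of a downloaded page. The HTML fragment and the problem names in it are invented for the example. Only the link pattern is taken from the crawler above.

```python
import re

from bs4 import BeautifulSoup

# Made-up HTML fragment; the real page's markup may differ.
html = """
<table>
  <tr><td><a href="problem.php?pid=1">Sample problem one</a></td></tr>
  <tr><td><a href="problem.php?pid=2">Sample problem two</a></td></tr>
</table>
<a href="?page=2">2</a>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all accepts a compiled regex as an attribute filter and keeps
# the tags whose href matches it.
links = soup.find_all("a", href=re.compile(r"problem\.php\?pid=\d+"))
names = [link.get_text() for link in links]
print(names)
```

Trying a selector on a short string like this is a quick way to check it before pointing the crawler at a live site.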

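Of course, my MySpider still has plenty of room for improvement. For example, this approach cannot crawl every site, such as sites that require login information. Likewise, JavaScript embedded in the fetched pages is not executed, so whatever it would produce cannot be parsed out, and so on.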
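For the login case, one possible direction (not something MySpider does) is to let urllib keep cookies between requests, so that a session set up by a login request is sent along with later ones. The sketch below only builds such an opener. The login URL and form fields in the comments are placeholders, and whether this is enough depends entirely on how a given site handles login.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = build_cookie_opener()

# Hypothetical usage, not executed here (URL and form fields are placeholders):
#   import urllib.parse
#   form = urllib.parse.urlencode({"user": "...", "password": "..."}).encode("utf-8")
#   opener.open("http://example.com/login", data=form)
#   page = opener.open("http://example.com/members-only").read()
```

Pages whose content is produced by JavaScript are a separate problem that cookie handling does not address.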

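In short, different pages call for different crawling methods; no single approach fits them all.

All in all, this is only the most basic introductory article on crawlers. With crawlers, there is still a long way to go.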

1 0