Web Scraping Methods (Part 6) -- python/urllib3/BeautifulSoup


1. Introduction

This article describes how to scrape web pages with Python. urllib3 is used here to fetch the pages (urllib2 works as well, but requests made with it are more easily blocked), and BeautifulSoup is used to parse the fetched HTML.
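
As a minimal sketch of that flow (the URL and the short User-Agent string here are placeholders, not values from this article), fetching a page with urllib3 and handing it to BeautifulSoup looks roughly like this:

# Minimal fetch-and-parse sketch (assumes urllib3 and beautifulsoup4 are installed)
import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()
# Sending a browser-like User-Agent makes the request less likely to be rejected
resp = http.request('GET', 'http://www.ifeng.com/',
                    headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(resp.data, 'html.parser')
print(soup.title.string if soup.title else 'no <title> found')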

2. Notes

1. When parsing HTML with BeautifulSoup and selecting child elements with CSS selectors, nth-child must be rewritten as nth-of-type. For example, ul:nth-child(1) should be written as ul:nth-of-type(1); otherwise the error "Only the following pseudo-classes are implemented: nth-of-type." is raised.
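
A small illustration of that rule, using a made-up HTML fragment rather than anything from this article:

# nth-of-type vs nth-child in BeautifulSoup's CSS selectors (illustrative fragment)
from bs4 import BeautifulSoup

html = '<div><ul><li>first list</li></ul><ul><li>second list</li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')

# Works: select the first <ul> among the <ul> elements inside the <div>
print(soup.select('div > ul:nth-of-type(1)')[0].li.text)

# On the BeautifulSoup versions this article targets, the line below raises
# "Only the following pseudo-classes are implemented: nth-of-type."
# (newer releases that use soupsieve do accept nth-child)
# soup.select('div > ul:nth-child(1)')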

3. Example Code

#!/usr/bin/env python
from bs4 import BeautifulSoup
import urllib3


def get_html(url):
    """Fetch the raw HTML of a page; return None on failure."""
    try:
        userAgent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
        http = urllib3.PoolManager(timeout=2)
        # The header name must be 'User-Agent', not 'User_Agent'
        response = http.request('GET', url, headers={'User-Agent': userAgent})
        html = response.data
        return html
    except Exception as e:
        print(e)
        return None


def get_soup(url):
    """Parse the fetched page into a BeautifulSoup object."""
    if not url:
        return None
    try:
        soup = BeautifulSoup(get_html(url), 'html.parser')
    except Exception as e:
        print(e)
        return None
    return soup


def get_ele(soup, selector):
    """Run a CSS selector and return the list of matching elements."""
    try:
        ele = soup.select(selector)
        return ele
    except Exception as e:
        print(e)
    return None


def main():
    url = 'http://www.ifeng.com/'
    soup = get_soup(url)
    # Note the nth-of-type pseudo-class, as explained in section 2
    ele = get_ele(soup, '#headLineDefault > ul > ul:nth-of-type(1) > li.topNews > h1 > a')
    headline = ele[0].text.strip()
    print(headline)


if __name__ == '__main__':
    main()
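
If the target site throttles or drops requests, urllib3's PoolManager can also be given an explicit timeout object and automatic retries. The sketch below is a hedged variant of the get_html function above; the function name, retry count, and timeout values are illustrative choices, not part of the original article:

# Variant of get_html with explicit timeouts and retries (illustrative settings)
import urllib3
from urllib3.util.retry import Retry

def get_html_with_retries(url):
    http = urllib3.PoolManager(
        timeout=urllib3.Timeout(connect=2.0, read=5.0),  # fail fast on slow hosts
        retries=Retry(total=3, backoff_factor=0.5)       # retry up to 3 times with backoff
    )
    try:
        response = http.request('GET', url,
                                headers={'User-Agent': 'Mozilla/5.0'})
        return response.data
    except Exception as e:
        print(e)
        return None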
