Web Scraping Basics 1: urllib.request

Source: Internet | Published by: 淘宝人群标签 | Editor: 程序博客网 | Time: 2024/06/07 04:09

The urllib.request module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world — basic and digest authentication, redirections, cookies and more.

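In short, the module provides functions and classes for opening URLs, covering plain requests as well as ones that involve authentication, redirects, and cookies.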

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

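In practice only the first parameter is usually needed; I have not used the later ones yet. The optional cafile and capath parameters specify a set of trusted CA certificates for HTTPS requests. cafile should point to a single file containing a bundle of CA certificates, whereas capath should point to a directory of hashed certificate files. More information can be found in ssl.SSLContext.load_verify_locations().

The parameters starting with ca relate to CA certificate verification for HTTPS; I have not run into a need for them here. (They have been deprecated since Python 3.6 in favor of context.)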
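As a minimal sketch of the parameters beyond url, the helper below passes timeout and context. The name fetch_bytes and the 10-second default are assumptions made for illustration, not something from the documentation quoted above. ssl.create_default_context() builds a context that trusts the system CA certificates, which covers the usual purpose of cafile and capath.

```python
import ssl
import urllib.request


def fetch_bytes(url, timeout=10.0):
    """Open url and return the raw response body as bytes (hypothetical helper)."""
    # A default context verifies HTTPS certificates against the system CA store.
    context = ssl.create_default_context()
    # timeout is in seconds and applies to blocking operations such as connecting.
    with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
        return response.read()
```

Calling fetch_bytes("http://www.baidu.com") would return the page as bytes. If a custom CA bundle were needed, it could be passed as ssl.create_default_context(cafile=...) instead of using the cafile parameter of urlopen.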

This function always returns an object which can work as a context manager and has methods such as

geturl(): return the URL of the resource retrieved, commonly used to determine if a redirect was followed
info(): return the meta-information of the page, such as headers, in the form of an email.message_from_string() instance (see Quick Reference to HTTP Headers)
getcode(): return the HTTP status code of the response.

The return values are exactly what the names say: the URL, the header information, and the status code.
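The three methods can be tried together in a short sketch. The helper name describe_response and the dictionary keys are hypothetical; the calls themselves are the ones listed above.

```python
import urllib.request


def describe_response(url):
    """Return the final URL, status code and content type for url (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        headers = response.info()  # header object in the email.message style
        return {
            "final_url": response.geturl(),  # differs from url if a redirect was followed
            "status": response.getcode(),    # HTTP status code; None for non-HTTP schemes
            "content_type": headers.get_content_type(),
        }
```

For an HTTP page that loads normally, status would be 200. Recent Python versions also expose the same information as the attributes response.url, response.headers and response.status.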

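urllib.request.install_opener(opener) installs an opener as the default one used by urlopen(). It is rarely needed; the documentation itself says installing an opener is only necessary if you want urlopen() to use it, and otherwise you can simply call the opener's open() method.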

urllib.request.build_opener([handler, …])

Returns an OpenerDirector instance; that class is described below.

OpenerDirector Objects

OpenerDirector.add_handler(handler)

OpenerDirector.open(url, data=None[, timeout])
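A rough sketch of how build_opener(), add_handler() and open() fit together is shown here. The helper name make_opener, the User-Agent string, and the choice of a cookie handler are assumptions picked for illustration.

```python
import http.cookiejar
import urllib.request


def make_opener(user_agent="example-crawler/0.1"):
    """Build an OpenerDirector with cookie support (hypothetical helper)."""
    jar = http.cookiejar.CookieJar()
    # build_opener() with no arguments returns an opener with the default handlers.
    opener = urllib.request.build_opener()
    # add_handler() registers one more handler on the existing opener.
    opener.add_handler(urllib.request.HTTPCookieProcessor(jar))
    # Headers listed in addheaders are sent with every request made through this opener.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar
```

The opener is then used like urlopen, for example opener.open("http://www.baidu.com", timeout=10), and cookies set by the server end up in jar. Calling urllib.request.install_opener(opener) is only needed if plain urlopen() calls should go through the same opener.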

HTTPResponse Objects

This is the object returned by the urlopen call above for HTTP and HTTPS URLs; the main thing here is to look at its methods.

HTTPResponse.read([amt]): Reads and returns the response body, or up to the next amt bytes.

HTTPResponse.readinto(b): Reads up to the next len(b) bytes of the response body into the buffer b. Returns the number of bytes read.
New in version 3.3.

HTTPResponse.getheader(name, default=None): Return the value of the header name, or default if there is no header matching name. If there is more than one header with the name name, return all of the values joined by ', '. If default is any iterable other than a single string, its elements are similarly returned joined by commas.

HTTPResponse.getheaders(): Return a list of (header, value) tuples.

HTTPResponse.fileno(): Return the fileno of the underlying socket.

HTTPResponse.msg: An http.client.HTTPMessage instance containing the response headers.
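To try these methods without depending on an external site, the sketch below starts a throwaway HTTP server on the local machine and reads from it. Everything about the server (the HelloHandler class, the fixed "hello world" body, the demo function) is an assumption made up for the example; only the calls inside inspect_response are the methods listed above.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that always answers with a fixed 11-byte body."""

    def do_GET(self):
        body = b"hello world"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def inspect_response(resp):
    """Exercise read(amt), readinto(), getheader() and getheaders() on resp."""
    first = resp.read(5)        # at most 5 bytes of the body
    buf = bytearray(3)
    n = resp.readinto(buf)      # fills buf, returns how many bytes were read
    rest = resp.read()          # whatever is left
    return {
        "first": first,
        "buffer": bytes(buf[:n]),
        "rest": rest,
        "content_type": resp.getheader("Content-Type"),
        "missing": resp.getheader("X-Not-There", "n/a"),
        "header_names": [name for name, _ in resp.getheaders()],
    }


def demo():
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: pick any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # An empty ProxyHandler ignores proxy environment variables for this local request.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        url = f"http://127.0.0.1:{server.server_port}/"
        with opener.open(url, timeout=5) as resp:
            return inspect_response(resp)
    finally:
        server.shutdown()
        server.server_close()
```

Calling demo() returns a dictionary whose first, buffer and rest entries together make up the full body, which shows that each read continues where the previous one stopped.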

I will not need the rest for a while, so I will write about them when I do.

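import urllib.request

url = "http://www.baidu.com"
# urlopen() returns a response object; read() gives the body as bytes
with urllib.request.urlopen(url) as response:
    data = response.read()
# Decode the bytes to text (the page is assumed to be UTF-8)
data = data.decode('UTF-8')
print(data)

This simply reads the content of the Baidu home page.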
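Hard-coding UTF-8 works for this page, but other sites may declare a different encoding. A possible variation, sketched under the assumption that the server states its charset in the Content-Type header, reads that charset and falls back to UTF-8 otherwise. The helper name fetch_text and its fallback_encoding parameter are hypothetical.

```python
import urllib.request


def fetch_text(url, fallback_encoding="utf-8"):
    """Fetch url and decode the body using the charset the server declares."""
    with urllib.request.urlopen(url, timeout=10) as response:
        raw = response.read()
        # get_content_charset() reads the charset parameter of Content-Type
        # and returns None when the header does not declare one.
        charset = response.info().get_content_charset() or fallback_encoding
    # errors="replace" keeps going if a few bytes do not match the charset.
    return raw.decode(charset, errors="replace")
```

With this helper the example above would shrink to print(fetch_text("http://www.baidu.com")).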
