urllib in Python 3

Source: Internet | Published by: 心理咨询网络培训班 | Editor: 程序博客网 | Time: 2024/05/17 00:53

The urllib module from Python 2 was reorganized in Python 3 into the following modules:

urllib.request: Extensible library for opening URLs
urllib.response: Response classes used by urllib
urllib.parse: Parse URLs into components
urllib.error: Exception classes raised by urllib.request
urllib.robotparser: Parser for robots.txt

As a result, the commonly used urllib.urlopen() function became urllib.request.urlopen().
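As a small offline illustration of one of these modules, the sketch below uses urllib.parse to split a URL into components and to build a query string. The URL and the parameter names are hypothetical and chosen only for the example.

```python
from urllib.parse import urlparse, parse_qs, urlencode, urljoin

# Hypothetical URL, used only for illustration.
url = "http://www.example.com/search/list?q=python+urllib&page=2#results"

# Split the URL into its components.
parts = urlparse(url)
print(parts.scheme)    # http
print(parts.netloc)    # www.example.com
print(parts.path)      # /search/list
print(parts.fragment)  # results

# Decode the query string into a dict of lists.
query = parse_qs(parts.query)
print(query)           # {'q': ['python urllib'], 'page': ['2']}

# Build a query string from a mapping.
encoded = urlencode({"q": "python urllib", "page": 2})
print(encoded)         # q=python+urllib&page=2

# Resolve a relative reference against the base URL.
joined = urljoin(url, "detail?id=7")
print(joined)          # http://www.example.com/search/detail?id=7
```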

1 Performing an HTTP GET with urllib in Python 3:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from urllib import request

# Send a browser-like User-Agent; some servers reject the default Python one.
header_dic = {'User-Agent': 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25'}
req = request.Request(url='http://www.qiushibaike.com/', headers=header_dic)

# No data is passed, so the request is sent as a GET.
with request.urlopen(req) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read())
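GET requests often carry parameters in the query string. The following is a minimal sketch of building such a request with urllib.parse.urlencode(); the endpoint, the parameter names and the helper fetch_text() are assumptions made for illustration. Nothing is sent over the network unless fetch_text() is called.

```python
from urllib import parse, request

base_url = "http://www.example.com/search"   # hypothetical endpoint
params = {"keyword": "urllib", "page": 1}     # hypothetical parameters

# Append the encoded parameters to the URL as a query string.
url = base_url + "?" + parse.urlencode(params)
req = request.Request(url, headers={"User-Agent": "Mozilla/5.0"})

print(req.full_url)      # http://www.example.com/search?keyword=urllib&page=1
print(req.get_method())  # GET, because no data was supplied
print(req.data)          # None


def fetch_text(req, timeout=10):
    """Send the request and return the body decoded as text.

    The charset is taken from the response headers, falling back to UTF-8.
    """
    with request.urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Decoding with the charset announced by the server avoids printing the raw bytes object, which is what f.read() returns.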

2 Performing an HTTP POST with urllib in Python 3:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from urllib import request, parse

print('Login...')
email = input('Email: ')
passwd = input('Password: ')
header_dic = {'User-Agent': 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25'}

# urlencode() takes a mapping or a sequence of 2-tuples and returns a str.
# Call it only once: passing its result to urlencode() again raises TypeError.
login_data = parse.urlencode([
    ('username', email),
    ('password', passwd),
    ('pagerefer', 'http://www.qiushibaike.com/')
])
req = request.Request(url='http://www.qiushibaike.com/', headers=header_dic)

# Supplying data (as bytes) turns the request into a POST.
with request.urlopen(req, data=login_data.encode('utf-8')) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))
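The form-encoding step can be checked without contacting any server. The sketch below builds a POST request object and inspects it; the login URL and the credentials are placeholders, not values from a real site.

```python
from urllib import parse, request

# Placeholder form fields, for illustration only.
form = {"username": "alice@example.com", "password": "s3cret"}

# urlencode() returns a str; the data argument must be bytes.
body = parse.urlencode(form).encode("utf-8")
print(body)  # b'username=alice%40example.com&password=s3cret'

req = request.Request(
    "http://www.example.com/login",   # hypothetical URL
    data=body,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)
print(req.get_method())  # POST, because data was supplied

# urlencode() rejects a plain string, so encoding twice fails.
try:
    parse.urlencode("username=alice")
    double_encode_failed = False
except TypeError:
    double_encode_failed = True
print(double_encode_failed)  # True
```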

3 HTTP errors

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from urllib import request, error

# Deliberately invalid host name, used to trigger an error.
req = request.Request("http://www.ahcj_11.c0m/")
try:
    response = request.urlopen(req)
except error.HTTPError as e:
    # The server answered, but with an error status code.
    print('The server couldn\'t fulfill the request.')
    print('Error code: ', e.code)
except error.URLError as e:
    # No usable response at all (DNS failure, refused connection, ...).
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    print("good!")
    print(response.read().decode("utf8"))
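The order of the except clauses matters because HTTPError is a subclass of URLError: if URLError were caught first, the HTTPError branch would never run. The sketch below wraps that pattern in a helper; the name fetch() and its return convention are assumptions made for this example. It is exercised with an unknown URL scheme so that it fails without any network access.

```python
from urllib import error, request

# HTTPError derives from URLError, so it has to be caught first.
print(issubclass(error.HTTPError, error.URLError))  # True


def fetch(url, timeout=10):
    """Return a (kind, detail) tuple describing the outcome of the request."""
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return "ok", resp.read()
    except error.HTTPError as e:
        # The server responded with an error status code.
        return "http-error", e.code
    except error.URLError as e:
        # The request never produced an HTTP response.
        return "url-error", str(e.reason)


# An unknown scheme raises URLError locally, before any connection is made.
kind, detail = fetch("nosuchscheme://example.invalid/")
print(kind, detail)
```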

Function prototypes

The urllib.request module defines the following functions:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Open the URL url, which can be either a string or a Request object.

data must be a bytes object specifying additional data to be sent to the server, or None if no such data is needed. data may also be an iterable object and in that case Content-Length value must be specified in the headers. Currently HTTP requests are the only ones that use data; the HTTP request will be a POST instead of a GET when the data parameter is provided.

data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or sequence of 2-tuples and returns an ASCII text string in this format. It should be encoded to bytes before being used as the data parameter.

urllib.request module uses HTTP/1.1 and includes Connection:close header in its HTTP requests.

The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and FTP connections.

If context is specified, it must be a ssl.SSLContext instance describing the various SSL options. See HTTPSConnection for more details.

The optional cafile and capath parameters specify a set of trusted CA certificates for HTTPS requests. cafile should point to a single file containing a bundle of CA certificates, whereas capath should point to a directory of hashed certificate files. More information can be found in ssl.SSLContext.load_verify_locations().

The cadefault parameter is ignored.

This function always returns an object which can work as a context manager and has methods such as

    geturl(): return the URL of the resource retrieved, commonly used to determine if a redirect was followed
    info(): return the meta-information of the page, such as headers, in the form of an email.message_from_string() instance (see Quick Reference to HTTP Headers)
    getcode(): return the HTTP status code of the response.

For HTTP and HTTPS URLs, this function returns a http.client.HTTPResponse object slightly modified. In addition to the three new methods above, the msg attribute contains the same information as the reason attribute (the reason phrase returned by server) instead of the response headers as it is specified in the documentation for HTTPResponse.

For FTP, file, and data URLs and requests explicitly handled by legacy URLopener and FancyURLopener classes, this function returns a urllib.response.addinfourl object.

Raises URLError on protocol errors.

Note that None may be returned if no handler handles the request (though the default installed global OpenerDirector uses UnknownHandler to ensure this never happens).

In addition, if proxy settings are detected (for example, when a *_proxy environment variable like http_proxy is set), ProxyHandler is default installed and makes sure the requests are handled through the proxy.

The legacy urllib.urlopen function from Python 2.6 and earlier has been discontinued; urllib.request.urlopen() corresponds to the old urllib2.urlopen. Proxy handling, which was done by passing a dictionary parameter to urllib.urlopen, can be obtained by using ProxyHandler objects.

Changed in version 3.2: cafile and capath were added.
Changed in version 3.2: HTTPS virtual hosts are now supported if possible (that is, if ssl.HAS_SNI is true).
New in version 3.2: data can be an iterable object.
Changed in version 3.3: cadefault was added.
Changed in version 3.4.3: context was added.
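The returned object's geturl(), info() and context-manager behaviour can be tried offline with a data URL, one of the cases in which urlopen() returns a urllib.response.addinfourl object. This is a minimal sketch: the data URL is made up for the example, and the empty ProxyHandler mapping is just one way of configuring proxies through a handler object.

```python
from urllib import request

# A made-up data URL; nothing is fetched from the network.
url = "data:text/plain;charset=utf-8,hello%20urllib"

# timeout is accepted for every scheme, although it only has an effect
# on HTTP, HTTPS and FTP connections.
with request.urlopen(url, timeout=5) as resp:
    final_url = resp.geturl()   # URL of the resource retrieved
    headers = resp.info()       # message object holding the headers
    body = resp.read()          # always bytes

print(final_url == url)             # True
print(headers.get_content_type())   # text/plain
print(body.decode("utf-8"))         # hello urllib

# Proxies are configured with ProxyHandler objects instead of a dict
# argument to urlopen(); an empty mapping means "use no proxy".
opener = request.build_opener(request.ProxyHandler({}))
with opener.open(url) as resp:
    body_via_opener = resp.read()
print(body_via_opener == body)      # True
```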

The following classes are provided:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

This class is an abstraction of a URL request.

url should be a string containing a valid URL.

data must be a bytes object specifying additional data to send to the server, or None if no such data is needed. Currently HTTP requests are the only ones that use data; the HTTP request will be a POST instead of a GET when the data parameter is provided. data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or sequence of 2-tuples and returns an ASCII string in this format. It should be encoded to bytes before being used as the data parameter.

headers should be a dictionary, and will be treated as if add_header() was called with each key and value as arguments. This is often used to "spoof" the User-Agent header value, which is used by a browser to identify itself; some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11", while urllib's default user agent string is "Python-urllib/2.6" (on Python 2.6).

An example of using Content-Type header with data argument would be sending a dictionary like {"Content-Type": "application/x-www-form-urlencoded"}.

The final two arguments are only of interest for correct handling of third-party HTTP cookies:

origin_req_host should be the request-host of the origin transaction, as defined by RFC 2965. It defaults to http.cookiejar.request_host(self). This is the host name or IP address of the original request that was initiated by the user. For example, if the request is for an image in an HTML document, this should be the request-host of the request for the page containing the image.

unverifiable should indicate whether the request is unverifiable, as defined by RFC 2965. It defaults to False. An unverifiable request is one whose URL the user did not have the option to approve. For example, if the request is for an image in an HTML document, and the user had no option to approve the automatic fetching of the image, this should be true.

method should be a string that indicates the HTTP request method that will be used (e.g. 'HEAD'). If provided, its value is stored in the method attribute and is used by get_method(). Subclasses may indicate a default method by setting the method attribute in the class itself.

Changed in version 3.3: Request.method argument is added to the Request class.
Changed in version 3.4: Default Request.method may be indicated at the class level.
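The method and headers arguments described above can be explored without sending anything. The sketch below is illustrative only: the URL, the PutRequest subclass and the header values are assumptions, not part of the library.

```python
from urllib import request

url = "http://www.example.com/resource"   # hypothetical URL

# An explicit method overrides the GET/POST default.
head_req = request.Request(url, method="HEAD")
print(head_req.get_method())   # HEAD


# A subclass can declare its default method at class level (Python 3.4+).
class PutRequest(request.Request):
    method = "PUT"


put_req = PutRequest(url, data=b"payload")
print(put_req.get_method())    # PUT

# Headers passed to the constructor are handled like add_header() calls.
req = request.Request(url, headers={"User-Agent": "demo-client/0.1"})
req.add_header("Accept", "text/html")

# Header names are stored capitalized, e.g. "User-agent".
print(req.get_header("User-agent"))   # demo-client/0.1
print(req.has_header("Accept"))       # True
```

Because header names are normalized with str.capitalize(), lookups through get_header() and has_header() need the capitalized form.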

[Reference] https://docs.python.org/3/library/urllib.request.html

0 0