Implementing URL Encoding and Decoding in Python

In Python 3.x, a str object can be encoded into a bytes sequence by calling its encode method, and a bytes object has a decode method that turns the byte sequence back into a str. An example:

>>> song = '海阔天空'
>>> song_bytes = song.encode('utf-8')  # encode the string song as UTF-8
>>> song_bytes
b'\xe6\xb5\xb7\xe9\x98\x94\xe5\xa4\xa9\xe7\xa9\xba'
>>> song_bytes.decode('utf-8')
'海阔天空'



>>> song = '宽恕'
>>> song_bytes = song.encode('utf-8')
>>> song_bytes
b'\xe5\xae\xbd\xe6\x81\x95'



1. From the str "%E5%AE%BD%E6%81%95", obtain the bytes sequence b'\xe5\xae\xbd\xe6\x81\x95';
2. Decode the byte sequence from step 1 to get a str holding the "normal" file name.


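bytes.fromhex converts a string of hexadecimal digits (e.g. 'E5 AE BD E6 81 95') into the corresponding bytes object (e.g. b'\xe5\xae\xbd\xe6\x81\x95'): every two hexadecimal digits in the input become one byte in the output, and whitespace is ignored. The code is as follows: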

>>> strange_file_name = "%E5%AE%BD%E6%81%95"
>>> strange_file_name = strange_file_name.replace('%', '')
>>> strange_file_name
'E5AEBDE68195'
>>> strange_file_name_bytes = bytes.fromhex(_)
>>> strange_file_name_bytes
b'\xe5\xae\xbd\xe6\x81\x95'
>>> _.decode('utf-8')
'宽恕'


from re import compile as re_compile

_percent_pat = re_compile(r'(?:%[A-Fa-f0-9]{2})+')

def percent_decode(string):
    for substr in _percent_pat.findall(string):
        substr_dec = bytes.fromhex(
            substr.replace('%', '')).decode('utf-8')
        string = string.replace(substr, substr_dec)
    return string
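A possible variation on the function above, offered only as a sketch: instead of calling str.replace once per match, re.sub can decode each run of %XX escapes in place. The name percent_decode_sub is hypothetical and not part of the original code. Like the original, it assumes the escapes were produced from UTF-8 bytes.

```python
import re

# Same pattern as above: one or more consecutive %XX escapes.
_percent_pat = re.compile(r'(?:%[A-Fa-f0-9]{2})+')


def percent_decode_sub(string):
    """Decode every run of %XX escapes in string as UTF-8 (hypothetical helper)."""
    def decode_run(match):
        # Strip the percent signs, turn the hex digits into bytes,
        # then decode the whole run at once.
        hex_digits = match.group(0).replace('%', '')
        return bytes.fromhex(hex_digits).decode('utf-8')
    return _percent_pat.sub(decode_run, string)


print(percent_decode_sub('%E5%AE%BD%E6%81%95'))
print(percent_decode_sub('Beyond-%E6%B5%B7%E9%98%94%E5%A4%A9%E7%A9%BA'))
```

Because each match is replaced where it was found, a decoded fragment can never be matched again by a later replacement.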



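After some searching, I found that producing file names containing percent signs is what is known as "URL Encoding" or "Percent Encoding" (I also found a website that offers online URL encoding/decoding). Moreover, the Python standard library already provides a module that implements the "encoding" and "decoding" described above (sample code). (In fact, I only named my decoding function above percent_decode after learning all this.)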

In Python 3.x, the urllib.parse module provides the following functions:
urllib.parse.quote(string, safe='/', encoding=None, errors=None)

urllib.parse.quote_plus(string, safe='', encoding=None, errors=None)
Same as above, but each space character ' ' in string is replaced with '+';

urllib.parse.quote_from_bytes(bytes, safe='/')

urllib.parse.unquote(string, encoding='utf-8', errors='replace')
urllib.parse.unquote_plus(string, encoding='utf-8', errors='replace')

urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None)

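This function builds a URL query string from the data in query by calling quote_plus. For example, when we log in to a forum with a username and password, or search a website for a keyword, urlencode helps us get the final query URL: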

>>> from urllib.parse import urlencode
>>> query_filter = {'song': '宽恕', 'artist': '王菲'}
>>> query_parms = urlencode(query_filter)
>>> query_parms
'artist=%E7%8E%8B%E8%8F%B2&song=%E5%AE%BD%E6%81%95'
>>> query_url = 'http://www.example.com/query?{}'.format(query_parms)
>>> query_url
'http://www.example.com/query?artist=%E7%8E%8B%E8%8F%B2&song=%E5%AE%BD%E6%81%95'
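In the session above, the order of the parameters follows the dict's iteration order, which is not guaranteed in older Python versions. As a small sketch (the extra 'note' and 'tag' parameters are made up for illustration), passing a list of pairs fixes the order, doseq=True expands sequence values, and parse_qs reverses the process:

```python
from urllib.parse import urlencode, parse_qs

# A list of (key, value) pairs keeps the parameters in the given order.
pairs = [('song', '宽恕'), ('artist', '王菲'), ('note', 'live version')]
query = urlencode(pairs)
print(query)

# doseq=True expands a sequence value into repeated keys.
multi = urlencode({'tag': ['rock', 'pop']}, doseq=True)
print(multi)

# parse_qs decodes the percent-escapes and turns '+' back into a space.
print(parse_qs(query))
```

Since urlencode relies on quote_plus, the space in 'live version' comes out as '+' rather than '%20'.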




The following code is taken from the urllib.parse module of Python 3.3.3 (the comments starting with "##" are my own notes on this code):

## In percent-encoding, the following ASCII characters are left unchanged
## during encoding. They are the so-called "unreserved characters".
## Through the safe parameter of quote and quote_plus we can name
## additional characters that should be left unencoded.
_ALWAYS_SAFE = frozenset(b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                         b'abcdefghijklmnopqrstuvwxyz'
                         b'0123456789'
                         b'_.-')
_ALWAYS_SAFE_BYTES = bytes(_ALWAYS_SAFE)
_safe_quoters = {}

## The encoding step of percent-encoding turns a bytes sequence into the
## corresponding string containing percent signs.
## For example: b'\xe5\xae\xbd\xe6\x81\x95' -> %E5%AE%BD%E6%81%95
## This class wraps that functionality. It works like a dictionary lookup
## keyed by byte value (an int in range(0, 256)). Given:
## quoter = Quoter(safe).__getitem__
## calling quoter(0xE5) returns '%E5'. For an "unreserved character" the
## character itself is returned, i.e. quoter(ord('a')) returns 'a'.
class Quoter(collections.defaultdict):
    """A mapping from bytes (in range(0,256)) to strings.

    String values are percent-encoded byte values, unless the key < 128, and
    in the "safe" set (either the specified safe set, or default set).
    """
    # Keeps a cache internally, using defaultdict, for efficiency (lookups
    # of cached keys don't call Python code at all).
    def __init__(self, safe):
        """safe: bytes object."""
        self.safe = _ALWAYS_SAFE.union(safe)

    def __repr__(self):
        # Without this, will just display as a defaultdict
        return "<Quoter %r>" % dict(self)

    def __missing__(self, b):
        # Handle a cache miss. Store quoted string in cache and return.
        ## self.safe is the union of _ALWAYS_SAFE (the set of "unreserved
        ## characters") and the extra characters passed through the safe
        ## parameter of quote or quote_plus.
        ## A byte found in self.safe is returned as its character;
        ## any other byte is returned as %XX, where 'XX' is the byte
        ## in hexadecimal.
        res = chr(b) if b in self.safe else '%{:02X}'.format(b)
        self[b] = res
        return res

def quote(string, safe='/', encoding=None, errors=None):
    """quote('abc def') -> 'abc%20def'

    Each part of a URL, e.g. the path info, the query, etc., has a
    different set of reserved characters that must be quoted.

    RFC 2396 Uniform Resource Identifiers (URI): Generic Syntax lists
    the following reserved characters.

    reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                  "$" | ","

    Each of these characters is reserved in some component of a URL,
    but not necessarily in all of them.

    By default, the quote function is intended for quoting the path
    section of a URL.  Thus, it will not encode '/'.  This character
    is reserved, but in typical usage the quote function is being
    called on a path where the existing slash characters are used as
    reserved characters.

    string and safe may be either str or bytes objects. encoding must
    not be specified if string is a str.

    The optional encoding and errors parameters specify how to deal with
    non-ASCII characters, as accepted by the str.encode method.
    By default, encoding='utf-8' (characters are encoded with UTF-8), and
    errors='strict' (unsupported characters raise a UnicodeEncodeError).
    """
    if isinstance(string, str):
        if not string:
            return string
        if encoding is None:
            encoding = 'utf-8'
        if errors is None:
            errors = 'strict'
        ## A str is first encoded into a byte sequence.
        string = string.encode(encoding, errors)
    else:
        if encoding is not None:
            raise TypeError("quote() doesn't support 'encoding' for bytes")
        if errors is not None:
            raise TypeError("quote() doesn't support 'errors' for bytes")
    ## Call quote_from_bytes to turn the byte sequence produced by the
    ## encoding step into the percent-encoded string, i.e.:
    ## b'\xe5\xae\xbd\xe6\x81\x95' -> %E5%AE%BD%E6%81%95
    return quote_from_bytes(string, safe)

## This function first protects the spaces in the string (by adding the
## space character to safe, so that spaces do not become %20), then calls
## quote to do the percent-encoding, and finally replaces the spaces in
## the result with plus signs.
def quote_plus(string, safe='', encoding=None, errors=None):
    """Like quote(), but also replace ' ' with '+', as required for quoting
    HTML form values. Plus signs in the original string are escaped unless
    they are included in safe. It also does not have safe default to '/'.
    """
    # Check if ' ' in string, where string may either be a str or bytes.  If
    # there are no spaces, the regular quote will produce the right answer.
    if ((isinstance(string, str) and ' ' not in string) or
        (isinstance(string, bytes) and b' ' not in string)):
        return quote(string, safe, encoding, errors)
    if isinstance(safe, str):
        space = ' '
    else:
        space = b' '
    string = quote(string, safe + space, encoding, errors)
    return string.replace(' ', '+')

## The actual conversion is done with the help of the Quoter class, i.e.:
## b'\xe5\xae\xbd\xe6\x81\x95' -> %E5%AE%BD%E6%81%95
def quote_from_bytes(bs, safe='/'):
    """Like quote(), but accepts a bytes object rather than a str, and does
    not perform string-to-bytes encoding.  It always returns an ASCII string.
    quote_from_bytes(b'abc def\x3f') -> 'abc%20def%3f'
    """
    if not isinstance(bs, (bytes, bytearray)):
        raise TypeError("quote_from_bytes() expected bytes")
    if not bs:
        return ''
    if isinstance(safe, str):
        # Normalize 'safe' by converting to bytes and removing non-ASCII chars
        safe = safe.encode('ascii', 'ignore')
    else:
        safe = bytes([c for c in safe if c < 128])
    ## If every byte in bs is one that must be kept as-is, rstrip returns
    ## an empty bytes object. Nothing needs escaping then, so calling
    ## decode to change the type is enough. For example, if bs is
    ## b'Beyond', simply return b'Beyond'.decode(), the string 'Beyond'.
    if not bs.rstrip(_ALWAYS_SAFE_BYTES + safe):
        return bs.decode()

    ## Build a Quoter object that answers lookups such as:
    ## quoter(0xE4) gives '%E4'
    ## quoter(ord('A')) gives 'A'
    try:
        quoter = _safe_quoters[safe]
    except KeyError:
        _safe_quoters[safe] = quoter = Quoter(safe).__getitem__
    ## Process every byte of bs in a list comprehension and join the
    ## results into the returned string.
    return ''.join([quoter(char) for char in bs])
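To see the caching idea behind Quoter in isolation, here is a minimal sketch. ByteQuoter is a hypothetical stand-in written for this illustration, not the standard library class. It relies only on the documented behaviour that a dict subclass's __missing__ method is called when a key is absent.

```python
import collections


class ByteQuoter(collections.defaultdict):
    """Hypothetical stand-in for Quoter: maps a byte value (int) to a str."""

    def __init__(self, safe):
        super().__init__()
        # safe is a bytes object; iterating over bytes yields ints.
        self.safe = frozenset(b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                              b'abcdefghijklmnopqrstuvwxyz'
                              b'0123456789_.-') | frozenset(safe)

    def __missing__(self, b):
        # Runs only on a cache miss; the result is stored for next time.
        res = chr(b) if b in self.safe else '%{:02X}'.format(b)
        self[b] = res
        return res


quoter = ByteQuoter(b'/')
print(quoter[0xE5], quoter[ord('a')], quoter[ord('/')], quoter[ord(' ')])
print(''.join(quoter[b] for b in '宽恕'.encode('utf-8')))
print(0xE5 in quoter)  # the key is now cached
```

The keys are integers because iterating over a bytes object in Python 3 yields integers, which is also why quote_from_bytes can pass each element of bs straight to the lookup.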


_unreserved_chars = frozenset(b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                              b'abcdefghijklmnopqrstuvwxyz'
                              b'0123456789'
                              b'_.-')

# A simple implementation of "urllib.parse.quote"
def percent_encode(string, safe = '/', encoding = 'utf-8', errors = 'strict'):
    if not string:
        return string
    string = string.encode(encoding, errors)
    bytes_unchanged = _unreserved_chars.union(
        safe.encode('ascii', 'ignore'))
    ## Here a lambda provides what the Quoter class above provides.
    process_byte = lambda byte: chr(byte) if byte in bytes_unchanged \
                   else '%{:02X}'.format(byte)
    return ''.join((process_byte(b) for b in string))

# A simple implementation of "urllib.parse.quote_plus"
def percent_encode_plus(string, safe = '', encoding = 'utf-8',
                        errors = 'strict'):
    safe += ' '
    string = percent_encode(string, safe, encoding, errors)
    return string.replace(' ', '+')


def percent_decode(string, encoding = 'utf-8'):
    for substr in _percent_pat.findall(string):
        substr_dec = bytes.fromhex(
            substr.replace('%', '')).decode(encoding)
        string = string.replace(substr, substr_dec)
    return string


>>> from re import compile as re_compile
>>> _percent_pat = re_compile(r'(?:%[A-Fa-f0-9]{2})+')
>>> def percent_decode(string, encoding = 'utf-8'):
    for substr in _percent_pat.findall(string):
        substr_dec = bytes.fromhex(
            substr.replace('%', '')).decode(encoding)
        string = string.replace(substr, substr_dec)
    return string

>>> song = 'Beyond-海阔天空'
>>> from urllib.parse import quote, unquote
>>> song_pct_enc = quote(song, encoding = 'utf-8')
>>> song_pct_enc
'Beyond-%E6%B5%B7%E9%98%94%E5%A4%A9%E7%A9%BA'
>>> percent_decode(_, 'utf-8')
'Beyond-海阔天空'
>>> unquote(song_pct_enc)
'Beyond-海阔天空'
>>> song_pct_enc_utf16 = quote(song, encoding = 'utf-16')
>>> song_pct_enc_utf16
'%FF%FEB%00e%00y%00o%00n%00d%00-%00wm%14%96%29Yzz'
>>> percent_decode(_, 'utf-16')
Traceback (most recent call last):
  File "<pyshell#27>", line 1, in <module>
    percent_decode(_, 'utf-16')
  File "<pyshell#18>", line 4, in percent_decode
    substr.replace('%', '')).decode(encoding)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x00 in position 0: truncated data
>>> unquote(song_pct_enc_utf16, 'utf-16')
'Beyond-海阔天空'



Decoding the single byte b'\x00' as UTF-16 raises a UnicodeDecodeError. (b'\x00\x00'.decode('utf-16') works.)


>>> 'B'.encode('utf-16')
b'\xff\xfeB\x00' # little-endian, with the BOM: FF FE
>>> 'B'.encode('utf-16-le')
b'B\x00' # little-endian
>>> 'B'.encode('utf-16-be')
b'\x00B' # big-endian


>>> b'B\x00'.decode('utf-16-le')
'B'


>>> b'B'.decode('utf-16-le')
Traceback (most recent call last):
  File "<pyshell#39>", line 1, in <module>
    b'B'.decode('utf-16-le')
  File "D:\Program Files\Python33\lib\encodings\utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x42 in position 0: truncated data
>>> b'\x00'.decode('utf-16-le')
Traceback (most recent call last):
  File "<pyshell#40>", line 1, in <module>
    b'\x00'.decode('utf-16-le')
  File "D:\Program Files\Python33\lib\encodings\utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x00 in position 0: truncated data

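The former fails with a "truncated data" error: the sequence to be decoded has been cut short and cannot be decoded. The latter fares no better. In fact, b'\x00' on its own does not correspond to any character in UTF-16, not even the '\0' character, whose ASCII value is 0: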

>>> '\0'.encode('utf-16-le')
b'\x00\x00'
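The behaviour shown in the sessions above can be collected into one small sketch. The helper try_decode is a hypothetical name introduced here. It returns the decoded text, or the codec's error reason when decoding fails.

```python
def try_decode(data, encoding='utf-16-le'):
    """Return the decoded text, or the error reason if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return 'error: ' + exc.reason


print(repr(try_decode(b'B\x00')))     # a complete 2-byte code unit
print(repr(try_decode(b'\x00\x00')))  # also complete: the NUL character
print(repr(try_decode(b'B')))         # only half a code unit
print(repr(try_decode(b'\x00')))      # only half a code unit
```

In UTF-16, every character occupies a whole number of 2-byte code units, so any input with an odd number of bytes is necessarily incomplete.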


>>> quote('B', encoding = 'utf-16-le')
'B%00'

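When we want to decode a string such as 'B%00', we should first turn it into a byte sequence of the form b'B\x00' and then call the bytes decode method on the whole sequence. That way, the error above does not occur. In fact, this is exactly what the unquote function in Python 3.3.3 does: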
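The two steps can be sketched with the public helper urllib.parse.unquote_to_bytes. This only illustrates the idea of decoding the whole byte sequence at once; it is not the standard library's internal code.

```python
from urllib.parse import quote, unquote_to_bytes

song = 'Beyond-海阔天空'
encoded = quote(song, encoding='utf-16-le')

# Step 1: turn every %XX escape back into its byte; unescaped ASCII
# characters pass through as their own byte values.
raw = unquote_to_bytes(encoded)

# Step 2: decode the complete byte sequence in one go.
decoded = raw.decode('utf-16-le')

print(encoded)
print(raw == song.encode('utf-16-le'))
print(decoded)
```

Because no fragment is decoded on its own, it no longer matters that some bytes of a character were escaped while others were left as plain ASCII.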

import re

_asciire = re.compile('([\x00-\x7f]+)')

_hexdig = '0123456789ABCDEFabcdef'
## Build the following mapping:
## b'00': b'\x00'
## b'01': b'\x01'
## ...
## b'FF': b'\xff'
## i.e. from the two-digit hexadecimal form of a byte (as bytes) to the
## byte itself, much like bytes.fromhex applied to one pair of digits.
_hextobyte = {(a + b).encode(): bytes([int(a + b, 16)])
              for a in _hexdig for b in _hexdig}

def unquote_to_bytes(string):
    """unquote_to_bytes('abc%20def') -> b'abc def'."""
    # Note: strings are encoded as UTF-8. This is only an issue if it contains
    # unescaped non-ASCII characters, which URIs should not.
    if not string:
        # Is it a string-like object?
        ## The next line looks useless. I believe it only acts as a check:
        ## the empty byte sequence is returned only when string has a
        ## split attribute.
        string.split
        return b''

    ## If string is a str, convert it to a byte sequence.
    ## I think 'ascii' would also work as the encoding here: after all,
    ## a properly percent-encoded string cannot contain anything other
    ## than ASCII characters.
    ## But the Python documentation says:
    ## The source character set is defined by the encoding declaration;
    ## it is UTF-8 if no encoding declaration is given in the source file
    ## In other words, in a Python script without an encoding declaration,
    ## Python 3.x treats string literals as UTF-8 encoded. So using UTF-8
    ## here is reasonable too.
    if isinstance(string, str):
        string = string.encode('utf-8')

    ## Split on the byte b'%' to get a list of bytes objects.
    bits = string.split(b'%')
    if len(bits) == 1:
        return string
    res = [bits[0]]
    append = res.append
    for item in bits[1:]:
        try:
            ## This is really res.append(_hextobyte[item[:2]]).
            ## Take 'B%00', the percent-encoding of 'B' in UTF-16-LE:
            ## string is b'B%00'
            ## bits is [b'B', b'00']
            ## Looking up b'00' in _hextobyte gives b'\x00', so res ends
            ## up as [b'B', b'\x00', b''], which joins to b'B\x00'.
            append(_hextobyte[item[:2]])
            ## The rest of the item is appended unchanged.
            ## For example, the UTF-16-BE percent-encoding of 'B' is
            ## '%00B'. The lookup above only turns b'00' into b'\x00';
            ## the remaining b'B' just needs to be appended to res.
            append(item[2:])
        except KeyError:
            append(b'%')
            append(item)
    ## After the b'%XX' -> b'\xXX' mapping, join the pieces to get the
    ## complete byte sequence.
    return b''.join(res)

def unquote(string, encoding='utf-8', errors='replace'):
    """Replace %xx escapes by their single-character equivalent. The optional
    encoding and errors parameters specify how to decode percent-encoded
    sequences into Unicode characters, as accepted by the bytes.decode()
    method.
    By default, percent-encoded sequences are decoded with UTF-8, and invalid
    sequences are replaced by a placeholder character.

    unquote('abc%20def') -> 'abc def'.
    """
    if '%' not in string:
        string.split
        return string
    if encoding is None:
        encoding = 'utf-8'
    if errors is None:
        errors = 'replace'
    ## Split the string into runs of ASCII characters (odd indices) and
    ## runs of non-ASCII characters (even indices). Only the ASCII runs
    ## can contain %XX escapes, so only they go through unquote_to_bytes.
    bits = _asciire.split(string)
    res = [bits[0]]
    append = res.append
    for i in range(1, len(bits), 2):
        ## Decode the byte sequence returned by unquote_to_bytes.
        append(unquote_to_bytes(bits[i]).decode(encoding, errors))
        append(bits[i + 1])
    return ''.join(res)

That is how the unquote function in Python 3.3.3 is implemented.
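To make the role of the _asciire pattern more concrete, the sketch below applies the same regular expression to a string that mixes unescaped non-ASCII text with an escape. The variable names and the sample string are chosen here for illustration.

```python
import re
from urllib.parse import unquote

# Same pattern as _asciire above: runs of ASCII characters, captured so
# that re.split keeps them in the result list.
ascii_runs = re.compile('([\x00-\x7f]+)')

bits = ascii_runs.split('海阔%20天空.mp3')
print(bits)

# Even indices hold non-ASCII text (possibly empty); odd indices hold
# the ASCII runs, the only places where %XX escapes can occur.
print(unquote('海阔%20天空.mp3'))
```

Only the odd-indexed runs are percent-decoded. The non-ASCII pieces are copied to the result untouched, so they are never pushed through an encode and decode round trip with a possibly different encoding.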

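Later, while writing code on a Debian 7.3 machine with Python 3.3.2 installed, I was surprised to find that the implementation of unquote in urllib.parse is completely different between Python 3.3.2 and Python 3.3.3. In fact, unquote in Python 3.3.2 has a problem: when I take a Chinese string, quote it with some encoding (for example UTF-16) and then unquote it, the result is no longer the same as the original. Using the diff tool that comes with TortoiseSVN, I compared the parse.py of urllib.parse from Python 3.3.2 on Debian 7 with the parse.py of urllib.parse from Python 3.3.3 on Windows 7, and found that the biggest difference between the two is precisely the changed implementation of the unquote and unquote_to_bytes functions.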

The following code is taken from the urllib.parse module of Python 3.3.2:

def unquote_to_bytes(string):
    """unquote_to_bytes('abc%20def') -> b'abc def'."""
    # Note: strings are encoded as UTF-8. This is only an issue if it contains
    # unescaped non-ASCII characters, which URIs should not.
    if not string:
        # Is it a string-like object?
        string.split
        return b''
    if isinstance(string, str):
        string = string.encode('utf-8')
    res = string.split(b'%')
    if len(res) == 1:
        return string
    string = res[0]
    for item in res[1:]:
        try:
            string += bytes([int(item[:2], 16)]) + item[2:]
        except ValueError:
            string += b'%' + item
    return string

def unquote(string, encoding='utf-8', errors='replace'):
    """Replace %xx escapes by their single-character equivalent. The optional
    encoding and errors parameters specify how to decode percent-encoded
    sequences into Unicode characters, as accepted by the bytes.decode()
    method.
    By default, percent-encoded sequences are decoded with UTF-8, and invalid
    sequences are replaced by a placeholder character.

    unquote('abc%20def') -> 'abc def'.
    """
    if string == '':
        return string
    res = string.split('%')
    if len(res) == 1:
        return string
    if encoding is None:
        encoding = 'utf-8'
    if errors is None:
        errors = 'replace'
    # pct_sequence: contiguous sequence of percent-encoded bytes, decoded
    pct_sequence = b''
    string = res[0]
    for item in res[1:]:
        try:
            if not item:
                raise ValueError
            pct_sequence += bytes.fromhex(item[:2])
            rest = item[2:]
            if not rest:
                # This segment was just a single percent-encoded character.
                # May be part of a sequence of code units, so delay decoding.
                # (Stored in pct_sequence).
                continue
        except ValueError:
            rest = '%' + item
        # Encountered non-percent-encoded characters. Flush the current
        # pct_sequence.
        string += pct_sequence.decode(encoding, errors) + rest
        pct_sequence = b''
    if pct_sequence:
        # Flush the final pct_sequence
        string += pct_sequence.decode(encoding, errors)
    return string
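The problem with the older algorithm can be reproduced on any Python 3 with a condensed copy of it. In the sketch below, unquote_old is a hypothetical name for that copy (the input checks are trimmed), and UTF-16-LE is used rather than UTF-16 so that the result does not depend on the machine's byte order.

```python
from urllib.parse import quote, unquote


def unquote_old(string, encoding='utf-8', errors='replace'):
    """Condensed copy of the Python 3.3.2 algorithm (hypothetical name)."""
    res = string.split('%')
    if len(res) == 1:
        return string
    pct_sequence = b''
    string = res[0]
    for item in res[1:]:
        try:
            if not item:
                raise ValueError
            pct_sequence += bytes.fromhex(item[:2])
            rest = item[2:]
            if not rest:
                continue  # keep collecting consecutive %XX escapes
        except ValueError:
            rest = '%' + item
        # A literal character ends the run, so the run is decoded alone.
        string += pct_sequence.decode(encoding, errors) + rest
        pct_sequence = b''
    if pct_sequence:
        string += pct_sequence.decode(encoding, errors)
    return string


song = 'Beyond-海阔天空'
encoded = quote(song, encoding='utf-16-le')
print(unquote_old(encoded, 'utf-16-le') == song)  # the old approach
print(unquote(encoded, 'utf-16-le') == song)      # the current library
print(unquote_old(quote(song)) == song)           # UTF-8 still round-trips
```

The old code decodes each run of escapes separately, so a lone escaped byte such as %00 is decoded without the unescaped byte that belongs to the same character.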



#! /usr/bin/env python3
# -*- coding: utf-8 -*-
# By mayadong7349 2014-01-19 19:39

from re import compile as re_compile

_percent_pat = re_compile(b'((?:%[A-Fa-f0-9]{2})+)')
_unreserved_chars = frozenset(b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                              b'abcdefghijklmnopqrstuvwxyz'
                              b'0123456789'
                              b'_.-')

# A simple implementation of "urllib.parse.unquote"
def percent_decode(string, encoding = 'utf-8', errors = 'replace'):
    str_bytes = string.encode('utf-8')
    hex_to_byte = lambda match_ret: \
                  bytes.fromhex(
                      match_ret.group(0).replace(b'%', b'').decode('utf-8'))
    str_bytes = _percent_pat.sub(hex_to_byte, str_bytes)
    string = str_bytes.decode(encoding, errors)
    return string

# A simple implementation of "urllib.parse.unquote_plus"
def percent_decode_plus(string, encoding = 'utf-8', errors = 'replace'):
    return percent_decode(string.replace('+', '%20'), encoding, errors)

# A simple implementation of "urllib.parse.quote"
def percent_encode(string, safe = '/', encoding = 'utf-8', errors = 'strict'):
    if not string:
        return string
    string = string.encode(encoding, errors)
    bytes_unchanged = _unreserved_chars.union(
        safe.encode('ascii', 'ignore'))
    process_byte = lambda byte: chr(byte) if byte in bytes_unchanged \
                   else '%{:02X}'.format(byte)
    return ''.join((process_byte(b) for b in string))

# A simple implementation of "urllib.parse.quote_plus"
def percent_encode_plus(string, safe = '', encoding = 'utf-8',
                        errors = 'strict'):
    safe += ' '
    string = percent_encode(string, safe, encoding, errors)
    return string.replace(' ', '+')

if __name__ == '__main__':
    import unittest
    import urllib.parse

    class TestURIParse(unittest.TestCase):
        def setUp(self):
            pass

        def tearDown(self):
            pass

        def doTest(self, str_, str_with_space, encoding_list):
            for en in encoding_list:
                # print('Test encoding:', en)
                str_enc = percent_encode(str_, encoding = en)
                self.assertEqual(
                    str_enc, urllib.parse.quote(str_, encoding = en))
                str_with_space_enc = percent_encode_plus(
                    str_with_space, encoding = en)
                self.assertEqual(
                    str_with_space_enc,
                    urllib.parse.quote_plus(str_with_space, encoding = en))
                # print('Test decoding:', en)
                self.assertEqual(percent_decode(str_enc, encoding = en),
                                 urllib.parse.unquote(str_enc, encoding = en))
                self.assertEqual(
                    percent_decode(str_with_space_enc, encoding = en),
                    urllib.parse.unquote(str_with_space_enc, encoding = en))
                self.assertEqual(
                    percent_decode_plus(str_with_space_enc, encoding = en),
                    urllib.parse.unquote_plus(
                        str_with_space_enc, encoding = en))

        def testChinese(self):
            fn = 'Beyond-海阔天空'
            fn_with_space = 'Beyond 海 阔 天 空'
            encoding_list = ('utf-8', 'gb2312', 'gbk', 'utf-16', 'utf-16-le',
                             'utf-16-be', 'utf-32', 'utf-32-le', 'utf-32-be',
                             'gb18030')
            self.doTest(fn, fn_with_space, encoding_list)

        def testReservedChars(self):
            reserved_chars = "!*'();:@&=+$,/?#[]"
            encoding_list = ('utf-8', 'gb2312', 'gbk', 'utf-16', 'utf-16-le',
                             'utf-16-be', 'utf-32', 'utf-32-le', 'utf-32-be',
                             'gb18030')
            self.doTest(reserved_chars, reserved_chars, encoding_list)

        def testEmptyString(self):
            self.doTest('', '', ('utf-8', 'utf-16-be', 'utf-32-le'))

        def testURL(self):
            url = 'http://www.baidu.com/'
            url_with_space = 'http://www.baidu.com/黑 客 帝 国.rmvb'
            encoding_list = ('utf-8', 'gb2312', 'gbk', 'utf-16', 'utf-16-le',
                             'utf-32', 'utf-32-le', 'gb18030')
            self.doTest(url, url_with_space, encoding_list)

        def testRealURL(self):
            wiki_page = 'http://zh.wikipedia.org/wiki/%E7%99%BE%E5%88%86%E5%8F%B7%E7%BC%96%E7%A0%81'
            self.assertEqual(percent_decode(wiki_page),
                             urllib.parse.unquote(wiki_page))

    unittest.main()



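This article does not cover the encodings themselves, such as UTF-8 and UTF-16, because I do not understand them well enough. Please consult Wikipedia to learn about them (and search for the RFC documents related to URL encoding yourself).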

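1. For unquote and unquote_plus, encoding is the second parameter, whereas for quote and quote_plus it is the third. Keep this in mind when calling them;
2. For quote and quote_from_bytes, the second parameter safe defaults to '/', whereas for quote_plus it defaults to the empty string ''. I do not yet know the full reason for this inconsistency, although the docstrings quoted above hint at it: quote is meant for the path section of a URL, where '/' acts as a separator, while quote_plus is meant for HTML form values. Keep this in mind as well;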
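A short sketch of both points, using a made-up path as sample input: passing encoding by keyword sidesteps the difference in parameter positions, and the three print calls at the top show the effect of the different safe defaults.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

path = 'music/海阔 天空.mp3'

# quote leaves '/' alone by default; quote_plus escapes it and turns
# spaces into '+'.
print(quote(path))
print(quote_plus(path))
print(quote(path, safe=''))  # escape '/' as well

# Passing encoding by keyword avoids mixing up the positions:
# quote(string, safe, encoding) versus unquote(string, encoding).
gbk = quote(path, encoding='gbk')
print(unquote(gbk, encoding='gbk'))
print(unquote_plus(quote_plus(path, encoding='gbk'), encoding='gbk'))
```

Calling quote(path, 'gbk') by position would instead pass 'gbk' as safe, which silently marks the letters g, b and k as safe rather than selecting an encoding.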
The official Python 3.3 documentation, under The Python Standard Library / 4. Built-in Types / 4.8. Binary Sequence Types — bytes, bytearray, memoryview / 4.8.1. Bytes, contains this passage:
Only ASCII characters are permitted in bytes literals (regardless of the declared source code encoding). Any binary values over 127 must be entered into bytes literals using the appropriate escape sequence.
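In July 1992, the X/Open committee XoJIG was looking for a better encoding. Dave Prosser of Unix System Laboratories (USL) submitted a proposal for one. It could be implemented more efficiently and introduced an improvement: 7-bit ASCII characters would only represent themselves, and every byte of a multi-byte sequence would have its 8th bit, the most significant bit, set.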

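When a str object in Python 3.x is encoded as UTF-8, the result is a bytes sequence. In this process, the ASCII characters in the str stay as they are, while Unicode characters outside the ASCII range become multi-byte sequences, displayed as escape sequences. Every byte in such a sequence has its highest bit set to 1, so none of these bytes can be mistaken for an ASCII character. This is why the first version of percent_decode works when decoding a URL that was URL-encoded with UTF-8.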
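The high-bit property can be checked directly. The sketch below prints each character's UTF-8 bytes in binary and then filters the encoded data for bytes below 0x80. The sample string is the one used earlier in this article.

```python
text = 'Beyond-海阔天空'
data = text.encode('utf-8')

for ch in text:
    encoded = ch.encode('utf-8')
    # ASCII characters give one byte starting with 0; every byte of a
    # multi-byte sequence starts with 1.
    print(repr(ch), [format(b, '08b') for b in encoded])

# The bytes below 0x80 are exactly the ASCII characters of the text.
ascii_bytes = bytes(b for b in data if b < 0x80)
print(ascii_bytes)
```

This is what keeps percent-encoded UTF-8 unambiguous: a byte that is left unescaped is always a genuine ASCII character, never part of a longer sequence.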


>>> '海阔天空'.encode('utf-16-le')
b'wm\x14\x96)Yzz'
>>> quote('海阔天空', encoding = 'utf-16-le')
'wm%14%96%29Yzz'
>>> for ch in '海阔天空':
...     print(repr(quote(ch, encoding = 'utf-16-le')))
...
'wm'
'%14%96'
'%29Y'
'zz'
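Applying the regular expression from the first version of percent_decode to this string shows why that approach breaks down for UTF-16-LE. This is a sketch; percent_pat is simply a local copy of that pattern.

```python
import re
from urllib.parse import quote

# Local copy of the pattern used by the first percent_decode.
percent_pat = re.compile(r'(?:%[A-Fa-f0-9]{2})+')

encoded = quote('海阔天空', encoding='utf-16-le')
runs = percent_pat.findall(encoded)
print(encoded)
print(runs)

for run in runs:
    raw = bytes.fromhex(run.replace('%', ''))
    # An odd number of bytes cannot be a whole number of 16-bit units.
    print(run, len(raw), 'bytes')
```

The single run found covers three bytes: both bytes of one character plus the first byte of the next, whose second byte was left unescaped as the ASCII letter Y. Decoding the run on its own therefore cannot succeed.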



