Python 3.0 Encoding Changes

Source: Internet | Posted by: java程序员分级 | Editor: 程序博客网 | Time: 2024/05/17 03:55

Text Vs. Data Instead Of Unicode Vs. 8-bit

Everything you thought you knew about binary data and Unicode has changed.

Python 3.0 uses the concepts of text and (binary) data instead of Unicode strings and 8-bit strings. All text is Unicode; however, encoded Unicode is represented as binary data. The type used to hold text is str, the type used to hold data is bytes. The biggest difference with the 2.x situation is that any attempt to mix text and data in Python 3.0 raises TypeError, whereas if you were to mix Unicode and 8-bit strings in Python 2.x, it would work if the 8-bit string happened to contain only 7-bit (ASCII) bytes, but you would get UnicodeDecodeError if it contained non-ASCII values. This value-specific behavior has caused numerous sad faces over the years.
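The strict separation can be seen in a minimal sketch like the following. The helper name try_concat is made up for illustration; the point is that the failure no longer depends on whether the bytes happen to be pure ASCII.

```python
text = "caf\u00e9"             # str: Unicode text
data = text.encode("utf-8")    # bytes: the same text, encoded as binary data


def try_concat(a, b):
    """Hypothetical helper: return a + b, or the name of the exception raised."""
    try:
        return a + b
    except TypeError as exc:
        return type(exc).__name__


# Mixing text and data fails regardless of the byte values involved.
mixed_result = try_concat(text, data)
ascii_only_result = try_concat("abc", b"abc")
print(mixed_result, ascii_only_result)
```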
As a consequence of this change in philosophy, pretty much all code that uses Unicode, encodings or binary data most likely has to change. The change is for the better, as in the 2.x world there were numerous bugs having to do with mixing encoded and unencoded text. To be prepared in Python 2.x, start using unicode for all unencoded text, and str for binary or encoded data only. Then the 2to3 tool will do most of the work for you.
You can no longer use u"..." literals for Unicode text. However, you must use b"..." literals for binary data.
As the str and bytes types cannot be mixed, you must always explicitly convert between them. Use str.encode() to go from str to bytes, and bytes.decode() to go from bytes to str. You can also use bytes(s, encoding=...) and str(b, encoding=...), respectively.
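A short sketch of both conversion spellings, assuming UTF-8 as the encoding (the choice of encoding is an assumption for the example, not something the text prescribes):

```python
text = "\u20ac100"                 # the euro sign followed by "100"

# Method-style conversions.
data = text.encode("utf-8")        # str -> bytes
round_trip = data.decode("utf-8")  # bytes -> str

# Constructor-style conversions, equivalent to the calls above.
same_data = bytes(text, encoding="utf-8")
same_text = str(data, encoding="utf-8")

print(data)        # the euro sign occupies three bytes in UTF-8
print(round_trip == text)
```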
Like str, the bytes type is immutable. There is a separate mutable type to hold buffered binary data, bytearray. Nearly all APIs that accept bytes also accept bytearray. The mutable API is based on collections.MutableSequence.
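The difference between the two types might look like this in practice. Note one assumption: the sketch imports MutableSequence from collections.abc, which is where current Python versions keep it; the collections.MutableSequence spelling named above is the Python 3.0 location.

```python
from collections.abc import MutableSequence  # collections.MutableSequence in 3.0

frozen = b"spam"           # bytes: immutable
buf = bytearray(frozen)    # bytearray: mutable copy of the same data

# In-place edits work on bytearray; items are integers in range(256).
buf[0] = ord("S")
buf.extend(b"!")

# The same item assignment on bytes raises TypeError.
try:
    frozen[0] = ord("S")
    immutable_error = None
except TypeError as exc:
    immutable_error = type(exc).__name__

is_mutable_sequence = isinstance(buf, MutableSequence)
print(buf, immutable_error, is_mutable_sequence)
```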
All backslashes in raw string literals are interpreted literally. This means that '\U' and '\u' escapes in raw strings are not treated specially. For example, r'\u20ac' is a string of 6 characters in Python 3.0, whereas in 2.6, ur'\u20ac' was the single “euro” character. (Of course, this change only affects raw string literals; the euro character is '\u20ac' in Python 3.0.)
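The euro example can be checked directly. This is only a small sketch of the Python 3 behavior; the variable names are illustrative.

```python
raw = r"\u20ac"      # raw literal: backslash, 'u', '2', '0', 'a', 'c'
cooked = "\u20ac"    # ordinary literal: the escape becomes one character

raw_length = len(raw)
cooked_length = len(cooked)

print(raw_length, cooked_length)
print(raw == "\\u20ac")   # the raw string really starts with a backslash
```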
The builtin basestring abstract type was removed. Use str instead. The str and bytes types don’t have functionality enough in common to warrant a shared base class. The 2to3 tool (see below) replaces every occurrence of basestring with str.
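A typical Python 2 check such as isinstance(value, basestring) would therefore be written against str. The function name is_text below is hypothetical and only serves as a sketch:

```python
def is_text(value):
    """Hypothetical helper: True only for text (str), never for bytes."""
    return isinstance(value, str)


text_check = is_text("hello")
bytes_check = is_text(b"hello")

# basestring itself is no longer a builtin name.
try:
    basestring
    basestring_exists = True
except NameError:
    basestring_exists = False

print(text_check, bytes_check, basestring_exists)
```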
Files opened as text files (still the default mode for open()) always use an encoding to map between strings (in memory) and bytes (on disk). Binary files (opened with a b in the mode argument) always use bytes in memory. This means that if a file is opened using an incorrect mode or encoding, I/O will likely fail loudly, instead of silently producing incorrect data. It also means that even Unix users will have to specify the correct mode (text or binary) when opening a file. There is a platform-dependent default encoding, which on Unixy platforms can be set with the LANG environment variable (and sometimes also with some other platform-specific locale-related environment variables). In many cases, but not all, the system default is UTF-8; you should never count on this default. Any application reading or writing more than pure ASCII text should probably have a way to override the encoding. There is no longer any need for using the encoding-aware streams in the codecs module.
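The following sketch writes a small file and reads it back in text mode, in binary mode, and with a deliberately wrong encoding. The file name sample.txt and the use of a temporary directory are assumptions made so the example is self-contained; passing encoding explicitly is one way to avoid relying on the platform default.

```python
import os
import tempfile

text = "price: \u20ac5\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name

    # Text mode: str in memory, encoded bytes on disk.
    # newline="\n" disables newline translation so the bytes are predictable.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)

    with open(path, "r", encoding="utf-8") as f:
        as_text = f.read()       # str

    # Binary mode ("b" in the mode argument): bytes in memory.
    with open(path, "rb") as f:
        as_bytes = f.read()      # bytes

    # A wrong encoding fails loudly instead of returning garbage.
    try:
        with open(path, "r", encoding="ascii") as f:
            f.read()
        wrong_encoding_error = None
    except UnicodeDecodeError as exc:
        wrong_encoding_error = type(exc).__name__

print(type(as_text).__name__, type(as_bytes).__name__, wrong_encoding_error)
```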
Filenames are passed to and returned from APIs as (Unicode) strings. This can present platform-specific problems because on some platforms filenames are arbitrary byte strings. (On the other hand, on Windows filenames are natively stored as Unicode.) As a work-around, most APIs (e.g. open() and many functions in the os module) that take filenames accept bytes objects as well as strings, and a few APIs have a way to ask for a bytes return value. Thus, os.listdir() returns a list of bytes instances if the argument is a bytes instance, and os.getcwdb() returns the current working directory as a bytes instance. Note that when os.listdir() returns a list of strings, filenames that cannot be decoded properly are omitted rather than raising UnicodeError.
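A minimal sketch of the str and bytes flavors of these calls. Two assumptions: the file name a.txt is made up for the example, and os.fsencode() is used to obtain a bytes path; that helper was added in a later release (Python 3.2) and is not part of the 3.0 API described here.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file with a hypothetical name.
    with open(os.path.join(tmp, "a.txt"), "w"):
        pass

    str_names = os.listdir(tmp)                   # str in, list of str out
    bytes_names = os.listdir(os.fsencode(tmp))    # bytes in, list of bytes out

cwd_text = os.getcwd()     # current directory as str
cwd_bytes = os.getcwdb()   # current directory as bytes

print(str_names, bytes_names)
```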
Some system APIs like os.environ and sys.argv can also present problems when the bytes made available by the system are not interpretable using the default encoding. Setting the LANG variable and rerunning the program is probably the best approach.
PEP 3138: The repr() of a string no longer escapes non-ASCII characters. It still escapes control characters and code points with non-printable status in the Unicode standard, however.
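A quick way to see the new behavior is to compare repr() with the ascii() builtin, which PEP 3138 also introduced for callers that still want every non-ASCII character escaped. The sample string is arbitrary.

```python
s = "caf\u00e9\n"   # "café" followed by a newline (a control character)

shown = repr(s)     # keeps the printable é, still escapes the newline
escaped = ascii(s)  # escapes the non-ASCII é as well

print(shown)
print(escaped)
```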
PEP 3120: The default source encoding is now UTF-8.
PEP 3131: Non-ASCII letters are now allowed in identifiers. (However, the standard library remains ASCII-only with the exception of contributor names in comments.)
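As a small sketch of what PEP 3131 permits, assuming the source file is saved as UTF-8 (the default source encoding under PEP 3120). The names größe and π are arbitrary examples.

```python
# Non-ASCII letters are valid in variable and function names.
größe = 3
π = 3.14159


def fläche(radius):
    """Area of a circle, using the non-ASCII names defined above."""
    return π * radius ** 2


area = fläche(größe)
valid_name = "größe".isidentifier()
invalid_name = "3größe".isidentifier()   # still may not start with a digit
print(area, valid_name, invalid_name)
```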
The StringIO and cStringIO modules are gone. Instead, import the io module and use io.StringIO or io.BytesIO for text and data respectively.
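A brief sketch of the two in-memory stream types. As with real files, each one accepts only its own kind of content; the sample strings are illustrative.

```python
import io

# In-memory text stream: accepts and returns str.
text_buf = io.StringIO()
text_buf.write("h\u00e9llo")
text_value = text_buf.getvalue()

# In-memory binary stream: accepts and returns bytes.
data_buf = io.BytesIO()
data_buf.write("h\u00e9llo".encode("utf-8"))
data_value = data_buf.getvalue()

# Writing bytes to a text stream is rejected.
try:
    text_buf.write(b"raw bytes")
    mix_error = None
except TypeError as exc:
    mix_error = type(exc).__name__

print(text_value, data_value, mix_error)
```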
See also the Unicode HOWTO, which was updated for Python 3.0.