Python 3.0 Encoding Changes
Source: Internet  Published by: java程序员分级  Editor: 程序博客网  Time: 2024/05/17 03:55
Text Vs. Data Instead Of Unicode Vs. 8-bit
Everything you thought you knew about binary data and Unicode has changed.
- Python 3.0 uses the concepts of text and (binary) data instead of Unicode strings and 8-bit strings. All text is Unicode; however encoded Unicode is represented as binary data. The type used to hold text is str, the type used to hold data is bytes. The biggest difference with the 2.x situation is that any attempt to mix text and data in Python 3.0 raises TypeError, whereas if you were to mix Unicode and 8-bit strings in Python 2.x, it would work if the 8-bit string happened to contain only 7-bit (ASCII) bytes, but you would get UnicodeDecodeError if it contained non-ASCII values. This value-specific behavior has caused numerous sad faces over the years.
- As a consequence of this change in philosophy, pretty much all code that uses Unicode, encodings or binary data most likely has to change. The change is for the better, as in the 2.x world there were numerous bugs having to do with mixing encoded and unencoded text. To be prepared in Python 2.x, start using unicode for all unencoded text, and str for binary or encoded data only. Then the 2to3 tool will do most of the work for you.
- You can no longer use u"..." literals for Unicode text. However, you must use b"..." literals for binary data.
- As the str and bytes types cannot be mixed, you must always explicitly convert between them. Use str.encode() to go from str to bytes, and bytes.decode() to go from bytes to str. You can also use bytes(s, encoding=...) and str(b, encoding=...), respectively.
- Like str, the bytes type is immutable. There is a separate mutable type to hold buffered binary data, bytearray. Nearly all APIs that accept bytes also accept bytearray. The mutable API is based on collections.MutableSequence.
- All backslashes in raw string literals are interpreted literally. This means that '\U' and '\u' escapes in raw strings are not treated specially. For example, r'\u20ac' is a string of 6 characters in Python 3.0, whereas in 2.6, ur'\u20ac' was the single “euro” character. (Of course, this change only affects raw string literals; the euro character is '\u20ac' in Python 3.0.)
- The builtin basestring abstract type was removed. Use str instead. The str and bytes types don’t have functionality enough in common to warrant a shared base class. The 2to3 tool (see below) replaces every occurrence of basestring with str.
- Files opened as text files (still the default mode for open()) always use an encoding to map between strings (in memory) and bytes (on disk). Binary files (opened with a b in the mode argument) always use bytes in memory. This means that if a file is opened using an incorrect mode or encoding, I/O will likely fail loudly, instead of silently producing incorrect data. It also means that even Unix users will have to specify the correct mode (text or binary) when opening a file. There is a platform-dependent default encoding, which on Unixy platforms can be set with the LANG environment variable (and sometimes also with some other platform-specific locale-related environment variables). In many cases, but not all, the system default is UTF-8; you should never count on this default. Any application reading or writing more than pure ASCII text should probably have a way to override the encoding. There is no longer any need for using the encoding-aware streams in the codecs module.
- Filenames are passed to and returned from APIs as (Unicode) strings. This can present platform-specific problems because on some platforms filenames are arbitrary byte strings. (On the other hand, on Windows filenames are natively stored as Unicode.) As a work-around, most APIs (e.g. open() and many functions in the os module) that take filenames accept bytes objects as well as strings, and a few APIs have a way to ask for a bytes return value. Thus, os.listdir() returns a list of bytes instances if the argument is a bytes instance, and os.getcwdb() returns the current working directory as a bytes instance. Note that when os.listdir() returns a list of strings, filenames that cannot be decoded properly are omitted rather than raising UnicodeError.
- Some system APIs like os.environ and sys.argv can also present problems when the bytes made available by the system are not interpretable using the default encoding. Setting the LANG variable and rerunning the program is probably the best approach.
- PEP 3138: The repr() of a string no longer escapes non-ASCII characters. It still escapes control characters and code points with non-printable status in the Unicode standard, however.
- PEP 3120: The default source encoding is now UTF-8.
- PEP 3131: Non-ASCII letters are now allowed in identifiers. (However, the standard library remains ASCII-only with the exception of contributor names in comments.)
- The StringIO and cStringIO modules are gone. Instead, import the io module and use io.StringIO or io.BytesIO for text and data respectively.
- See also the Unicode HOWTO, which was updated for Python 3.0.
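
The str/bytes points in the list above (explicit conversion, TypeError on mixing, the mutable bytearray, and raw string literals) can be illustrated with a short sketch. The variable names and sample strings are made up for illustration; only the built-in calls named in the list are used.

```python
# Text (str) and binary data (bytes) are distinct types in Python 3.
text = "caf\u00e9 \u20ac"        # str: a sequence of Unicode code points
data = text.encode("utf-8")      # str -> bytes, with an explicit encoding

# Both spellings of each conversion give the same result.
same_decode = data.decode("utf-8") == str(data, encoding="utf-8") == text
same_encode = bytes(text, encoding="utf-8") == data

# Mixing text and data raises TypeError, whatever the byte values are.
try:
    "abc" + b"abc"
    mixing_failed = False
except TypeError:
    mixing_failed = True

# bytes is immutable; bytearray is the mutable counterpart.
buf = bytearray(b"abc")
buf[0] = ord("x")                # item assignment takes an integer 0..255
buf.extend(b"!")

# In a raw string the backslash-u sequence is not an escape.
raw = r"\u20ac"                  # six characters: \ u 2 0 a c
euro = "\u20ac"                  # one character: the euro sign
```

Note that the pure-ASCII case fails just like any other: the error depends on the types involved, not on the values.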
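
For the file-mode and io module items in the list above, the following is a minimal sketch of passing an encoding explicitly rather than relying on the platform default. The file name and sample text are hypothetical, and the example assumes a current Python 3 release (tempfile.TemporaryDirectory did not exist in 3.0 itself).

```python
import io
import os
import tempfile

text = "price: 10 \u20ac\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")   # hypothetical file name

    # Text mode: state the encoding instead of trusting the locale default.
    # newline="\n" keeps the on-disk bytes identical on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)

    # Binary mode: reads return bytes and no decoding takes place.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode with the matching encoding round-trips the string.
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()

    # A wrong encoding fails loudly instead of returning corrupt text.
    try:
        with open(path, "r", encoding="ascii") as f:
            f.read()
        wrong_encoding_failed = False
    except UnicodeDecodeError:
        wrong_encoding_failed = True

# In-memory replacements for the removed StringIO and cStringIO modules.
text_buffer = io.StringIO()
text_buffer.write(text)                    # accepts str only
byte_buffer = io.BytesIO()
byte_buffer.write(text.encode("utf-8"))    # accepts bytes only
```

An application that handles more than pure ASCII would typically expose the encoding as a setting instead of hard-coding "utf-8" as this sketch does.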
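
The filename item in the list above says that os.listdir() mirrors the type of its argument and that os.getcwdb() returns bytes. A small sketch of that behavior follows, using a hypothetical file name in a temporary directory. It assumes a current Python 3 release, since os.fsencode() and tempfile.TemporaryDirectory were added after 3.0.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one file so that the listing is not empty.
    with open(os.path.join(tmp, "example.txt"), "w", encoding="utf-8") as f:
        f.write("x")

    # A str argument yields str names; a bytes argument yields bytes names.
    names_as_str = os.listdir(tmp)
    names_as_bytes = os.listdir(os.fsencode(tmp))

# The current working directory is available in both forms.
cwd_text = os.getcwd()     # str
cwd_bytes = os.getcwdb()   # bytes
```

The bytes form is the one to reach for when a directory may contain names that are not valid in the filesystem encoding.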