Converting Unicode Strings to 8-bit Strings (converting Unicode to UTF-8)
Source: Internet. Published by: 剑灵龙族捏脸数据图. Editor: 程序博客网. Time: 2024/06/15 10:57
A Unicode string holds characters from the Unicode character set.
If you want an 8-bit string, you need to decide what encoding you want to use. Common encodings are US-ASCII (which is the default if you convert from Unicode to 8-bit strings in Python), ISO-8859-1 (aka Latin-1), and UTF-8 (a variable-width encoding that can represent all Unicode strings).
For example, if you want Latin-1 strings, you can use one of:
s = u.encode("iso-8859-1") # fail if some character cannot be converted
s = u.encode("iso-8859-1", "replace") # instead of failing, replace with ?
s = u.encode("iso-8859-1", "ignore") # instead of failing, leave it out
If you want an ASCII string, replace “iso-8859-1” above with “ascii” or “us-ascii”.
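As a rough illustration of the three error modes, here is a minimal sketch written for Python 3, where the Unicode string type is str and encode returns bytes (an assumption; the snippets above use Python 2 notation). The sample text and the try_encode helper are made up for the demonstration.

```python
# Python 3 sketch: str.encode() returns a bytes object.
u = "caf\u00e9 \u20ac5"   # e-acute exists in Latin-1, the euro sign does not

def try_encode(text, encoding, errors="strict"):
    """Hypothetical helper: return the encoded bytes, or None on failure."""
    try:
        return text.encode(encoding, errors)
    except UnicodeEncodeError:
        return None

strict = try_encode(u, "iso-8859-1")               # fails on the euro sign
replaced = try_encode(u, "iso-8859-1", "replace")  # euro sign becomes ?
ignored = try_encode(u, "iso-8859-1", "ignore")    # euro sign is left out
utf8 = try_encode(u, "utf-8")                      # UTF-8 can encode everything

print(strict)     # None
print(replaced)   # b'caf\xe9 ?5'
print(ignored)    # b'caf\xe9 5'
print(utf8)       # b'caf\xc3\xa9 \xe2\x82\xac5'
```

Note that "replace" and "ignore" lose information, while UTF-8 can be decoded back to the original string.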
If you want to output the data to a web browser or an XML file, you can use:
import cgi
s = cgi.escape(u).encode("ascii", "xmlcharrefreplace")
The cgi.escape function converts reserved characters (<, > and &) to character entities (&lt;, &gt; and &amp;), and the xmlcharrefreplace flag tells the encoder to use character references (&#nn;) for any character that cannot be encoded in the given encoding. The browser (or XML parser) at the other end will convert things back to Unicode.
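The cgi.escape function was removed in Python 3.8. A minimal sketch of the same idea for Python 3, assuming html.escape as the replacement and using html.unescape to stand in for what the browser does at the other end (the sample string is made up):

```python
import html

u = "1 < 2 & \u00e5 > \u00e4"   # reserved characters plus two non-ASCII letters

# quote=False mimics the default behaviour of cgi.escape (quotes left alone)
s = html.escape(u, quote=False).encode("ascii", "xmlcharrefreplace")
print(s)   # b'1 &lt; 2 &amp; &#229; &gt; &#228;'

# Roughly what a browser or XML parser does when it reads the data back
roundtrip = html.unescape(s.decode("ascii"))
print(roundtrip == u)   # True
```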
Note that cgi.escape doesn’t escape quotes by default. To use the value in an attribute, you need to pass in an extra flag to escape, and put the result in double quotes:
s = 'attr="%s"' % cgi.escape(u,1).encode("ascii", "xmlcharrefreplace")
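For Python 3, the attribute case might look like the sketch below. It assumes html.escape in place of cgi.escape; html.escape escapes quotes by default (quote=True), and the encoded bytes are decoded again before being interpolated into a str. The sample value is made up.

```python
import html

u = 'say "hej" to \u00c5sa'

# Escape the reserved characters and quotes, then turn non-ASCII into &#nn;
escaped = html.escape(u, quote=True).encode("ascii", "xmlcharrefreplace")

# In Python 3, bytes must be decoded before %-formatting into a str
s = 'attr="%s"' % escaped.decode("ascii")
print(s)   # attr="say &quot;hej&quot; to &#197;sa"
```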
The unaccent.py script shows how to strip accents from Latin characters:
import unicodedata, sys

CHAR_REPLACEMENT = {
    # latin-1 characters that don't have a unicode decomposition
    0xc6: u"AE", # LATIN CAPITAL LETTER AE
    0xd0: u"D",  # LATIN CAPITAL LETTER ETH
    0xd8: u"OE", # LATIN CAPITAL LETTER O WITH STROKE
    0xde: u"Th", # LATIN CAPITAL LETTER THORN
    0xdf: u"ss", # LATIN SMALL LETTER SHARP S
    0xe6: u"ae", # LATIN SMALL LETTER AE
    0xf0: u"d",  # LATIN SMALL LETTER ETH
    0xf8: u"oe", # LATIN SMALL LETTER O WITH STROKE
    0xfe: u"th", # LATIN SMALL LETTER THORN
    }

##
# Translation dictionary.  Translation entries are added to this
# dictionary as needed.

class unaccented_map(dict):

    ##
    # Maps a unicode character code (the key) to a replacement code
    # (either a character code or a unicode string).

    def mapchar(self, key):
        ch = self.get(key)
        if ch is not None:
            return ch
        de = unicodedata.decomposition(unichr(key))
        if de:
            try:
                ch = int(de.split(None, 1)[0], 16)
            except (IndexError, ValueError):
                ch = key
        else:
            ch = CHAR_REPLACEMENT.get(key, key)
        self[key] = ch
        return ch

    if sys.version >= "2.5":
        # use __missing__ where available
        __missing__ = mapchar
    else:
        # otherwise, use standard __getitem__ hook (this is slower,
        # since it's called for each character)
        __getitem__ = mapchar

if __name__ == "__main__":

    text = u"""
"Jo, når'n da ha gått ett stôck te, så kommer'n te e å,
å i åa ä e ö."
"Vasa", sa'n.
"Å i åa ä e ö", sa ja.
"Men va i all ti ä dä ni säjer, a, o?", sa'n.
"D'ä e å, vett ja", skrek ja, för ja ble rasen, "å i åa
ä e ö, hörer han lite, d'ä e å, å i åa ä e ö."
"A, o, ö", sa'n å dämmä geck'en.
Jo, den va nôe te dum den.
(taken from the short story "Dumt fôlk" in Gustaf Fröding's
"Räggler å paschaser på våra mål tå en bonne" (1895).
"""

    print text.translate(unaccented_map())

    # note that non-letters are passed through as is; you can use
    # encode("ascii", "ignore") to get rid of them.  alternatively,
    # you can tweak the translation dictionary to return None for
    # characters >= "\x80".

    map = unaccented_map()

    print repr(u"12\xbd inch".translate(map))
    print repr(u"12\xbd inch".translate(map).encode("ascii", "ignore"))
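The script above targets Python 2 (unichr, print statements). A minimal sketch of the same lazy translation table for Python 3 follows, assuming chr in place of unichr and relying on __missing__ only. The class name UnaccentedMap and the shortened replacement table are illustrative, not part of the original script.

```python
import unicodedata

# Subset of the replacement table above (characters without a decomposition)
CHAR_REPLACEMENT = {
    0xc6: "AE", 0xd8: "OE", 0xdf: "ss", 0xe6: "ae", 0xf8: "oe",
}

class UnaccentedMap(dict):
    """Translation table that fills itself in, one code point at a time."""

    def __missing__(self, key):
        de = unicodedata.decomposition(chr(key))
        if de:
            try:
                # the first field of the decomposition is the base character
                ch = int(de.split(None, 1)[0], 16)
            except (IndexError, ValueError):
                # compatibility tags such as "<fraction>": keep the character
                ch = key
        else:
            ch = CHAR_REPLACEMENT.get(key, key)
        self[key] = ch   # cache the result so each character is looked up once
        return ch

table = UnaccentedMap()
plain = "\u00e5 i \u00e5a \u00e4 e \u00f6".translate(table)
print(plain)                                   # a i aa a e o

half = "12\u00bd inch".translate(table)        # the fraction is passed through
print(ascii(half))                             # '12\xbd inch'
print(half.encode("ascii", "ignore"))          # b'12 inch'
```

As in the original, non-letters such as the fraction sign survive the translation and have to be dropped separately with encode("ascii", "ignore").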
Comment:
1. I'm not sure whether "eth" should be converted into "d" or "dh", and the "capital O with stroke" into "OE" or "Oe", but you as a Scandinavian surely know better.
2. Please don't confine the translation to Latin-1 only. I especially miss the "l with stroke", which is very frequent in Polish. Here is a fragment of my program performing the same task, with additional non-decomposable characters that you may want to add:
# non-decomposable characters from Latin-1 and Latin Extended A
charmap = {
    u'\N{Latin capital letter AE}': 'AE',
    u'\N{Latin small letter ae}': 'ae',
    u'\N{Latin capital letter Eth}': 'Dh',
    u'\N{Latin small letter eth}': 'dh',
    u'\N{Latin capital letter O with stroke}': 'Oe',
    u'\N{Latin small letter o with stroke}': 'oe',
    u'\N{Latin capital letter Thorn}': 'Th',
    u'\N{Latin small letter thorn}': 'th',
    u'\N{Latin small letter sharp s}': 'ss',
    u'\N{Latin capital letter D with stroke}': 'Dj',
    u'\N{Latin small letter d with stroke}': 'dj',
    u'\N{Latin capital letter H with stroke}': 'H',
    u'\N{Latin small letter h with stroke}': 'h',
    u'\N{Latin small letter dotless i}': 'i',
    u'\N{Latin small letter kra}': 'q',
    u'\N{Latin capital letter L with stroke}': 'L',
    u'\N{Latin small letter l with stroke}': 'l',
    u'\N{Latin capital letter Eng}': 'Ng',
    u'\N{Latin small letter eng}': 'ng',
    u'\N{Latin capital ligature OE}': 'Oe',
    u'\N{Latin small ligature oe}': 'oe',
    u'\N{Latin capital letter T with stroke}': 'Th',
    u'\N{Latin small letter t with stroke}': 'th',
}
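The comment does not show how the table is applied. One possible way to use such a character-keyed table, sketched for Python 3 under the assumption that str.maketrans is used to convert the single-character keys into the code points that str.translate expects (only a few entries are repeated here):

```python
# A few entries from the table above, written as Python 3 string literals
charmap = {
    "\N{LATIN CAPITAL LETTER L WITH STROKE}": "L",
    "\N{LATIN SMALL LETTER L WITH STROKE}": "l",
    "\N{LATIN SMALL LETTER ENG}": "ng",
    "\N{LATIN SMALL LIGATURE OE}": "oe",
}

# str.translate wants code points as keys; str.maketrans does the conversion
table = str.maketrans(charmap)

city = "\u0141\u00f3d\u017a".translate(table)   # Polish city name
print(ascii(city))                              # 'L\xf3d\u017a'

word = "c\u0153ur".translate(table)
print(word)                                     # coeur
```

Only the non-decomposable characters are handled by this table; accented letters such as the o with acute are left alone and still need the decomposition step from unaccent.py.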