[Study Notes] Python Encoding (2017-11-18)

Source: Internet | Published by: 模拟网络攻击软件 | Editor: 程序博客网 | Time: 2024/06/06 09:15

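Common encodings
ASCII (English, single-byte, 65 = A, 90 = Z, 128 characters)
GB2312 (Simplified Chinese)
GBK (extension of GB2312; Simplified and Traditional Chinese, plus Japanese kana)
BIG5 (Traditional Chinese as used in Taiwan and Hong Kong)
ANSI (on Simplified Chinese Windows, saving a text file as "ANSI" means GBK)
Unicode (a character set covering the characters of all languages)
UTF-8 (variable-length encoding of Unicode using 1 to 4 bytes per character, typically 3 for Chinese; fully compatible with the 128 ASCII characters)
cp936: the default code page of cmd on Chinese Windows is code page 936, which is GBK (a superset of GB2312)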

#encoding=utf-8
s=raw_input("请输入一句话".decode("utf-8").encode("gbk"))
print s
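The byte widths and code points in the list of encodings can be checked directly. The following is a minimal sketch, assuming a Python 3 interpreter (these notes otherwise target Python 2); the sample characters and the names `samples` and `sizes` are chosen purely for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch: compare how many bytes one character takes in each encoding.
# Assumption: Python 3, where text literals are Unicode.

samples = [u"A", u"\u4e2d"]  # the letter A and the Chinese character for "middle"

sizes = {}
for ch in samples:
    for codec in ("utf-8", "gbk", "cp936", "big5"):
        sizes[(ch, codec)] = len(ch.encode(codec))

# ASCII code points of the first and last capital letters
code_a = ord("A")
code_z = ord("Z")

# A character outside the Basic Multilingual Plane needs 4 bytes in UTF-8
emoji_len = len(u"\U0001F600".encode("utf-8"))

for (ch, codec), n in sorted(sizes.items()):
    print("U+%04X in %s: %d byte(s)" % (ord(ch), codec, n))
print(code_a, code_z, emoji_len)
```

In Python, `cp936` is accepted as an alias of the `gbk` codec, which is why both rows report the same sizes.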

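Using Chinese text

1. Save the source file as UTF-8
2. Declare the encoding in the file header: #encoding=utf-8
3. Prefix string literals with u: u"中国"
* There are two string types: byte strings (str) and Unicode strings (unicode). Text is handled as Unicode in memory (servers store Unicode); the application encodes it as UTF-8 on output before it is displayed in the browser.
A Chinese literal with the u prefix and one without it are different types.
In Python, echoing a value and printing it call different methods: repr and str. To display Chinese readably, print the value.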
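The difference between the two string types, and between repr and str, can be seen in a few lines. This is a sketch under the assumption of Python 3, where the byte string type is called `bytes` and the Unicode type is called `str`; in Python 2 the same roles are played by `str` and `unicode`.

```python
# -*- coding: utf-8 -*-
# Sketch (Python 3 naming): bytes plays the role of Python 2's str,
# and str plays the role of Python 2's unicode.

text = u"\u4e2d\u56fd"        # the two-character text u"中国"
raw = text.encode("utf-8")    # the same text as UTF-8 bytes

# Echoing a value in the interactive shell uses repr(); print uses str().
shown_by_repr = repr(raw)
shown_by_print = str(text)

print(type(text).__name__, type(raw).__name__)
print(len(text), len(raw))    # 2 characters versus 6 bytes
print(shown_by_repr)          # escaped bytes, not readable Chinese
print(shown_by_print)         # readable Chinese, if the console can display it
```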


Encoding and conversion
To convert from any other encoding to Unicode, use decode
To convert from Unicode to any other encoding, use encode
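The decode-then-encode rule means that converting between two byte encodings always goes through Unicode. A minimal sketch follows; `transcode` is a hypothetical helper name introduced here for illustration, not a library function.

```python
# -*- coding: utf-8 -*-
# Sketch: convert bytes from one encoding to another via Unicode.
# `transcode` is a hypothetical helper, not part of any library.

def transcode(data, src, dst):
    """Decode `data` from `src` into Unicode, then encode it as `dst`."""
    return data.decode(src).encode(dst)

gbk_bytes = u"\u4e2d\u56fd".encode("gbk")            # u"中国" in GBK
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")    # the same text in UTF-8

print(len(gbk_bytes), len(utf8_bytes))
print(utf8_bytes.decode("utf-8") == u"\u4e2d\u56fd")
```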

chardet
Detects the encoding of a byte string
Install it with pip install chardet
Import it with import chardet

#encoding=utf-8
import chardet

print type("中国".decode("utf-8"))
print "中国".decode("utf-8").encode("gbk")
print type("中国".decode("utf-8").encode("gbk"))
print chardet.detect("中国人民富裕了，开始走向伟大复兴".decode("utf-8").encode("gbk"))

#encoding=utf-8
import chardet
import sys

print sys.getdefaultencoding()
reload(sys)
sys.setdefaultencoding("utf-8")  # setdefaultencoding is only reachable after reload(sys)
# Calling encode on a byte string first decodes it implicitly to unicode
# with the default encoding, then encodes the result.
print "中国".encode("gbk")

>>> "s".decode("gbk")
u's'
>>> "s".decode("gbk").encode("utf-8")
's'
>>> u"s".decode("gbk")
u's'
>>> u"中国".decode("gbk")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gbk' codec can't decode bytes in position ...: illegal multibyte sequence

# Converting between string types
>>> "a".decode("gbk")
u'a'
>>> "a".decode("gbk").encode("gbk")
'a'
>>> type("a".decode("gbk").encode("gbk"))
<type 'str'>

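Example: converting the encoding while reading and writing a file
First create a txt file that contains Chinese text and save it in ANSI format, then run the following code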

#-*- coding: UTF-8 -*-
fp1 = open('d:\\testfile.txt', 'r')  # file created by hand and saved as ANSI (GBK)
info1 = fp1.read()
fp1.close()
# The content is known to be GBK, so decode it to Unicode (the file must contain Chinese)
tmp = info1.decode('GBK')
fp2 = open('d:\\testfile.txt', 'w')
# Encode to a UTF-8 byte string
info2 = tmp.encode('UTF-8')
fp2.write(info2)  # write the UTF-8 bytes
fp2.close()       # the file is now saved as UTF-8
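The same conversion can also be written with `io.open`, which takes an `encoding` argument and does the decode and encode steps itself. The sketch below is one possible variant, not the method used in these notes: `convert_file` is a hypothetical helper, and a temporary file stands in for d:\testfile.txt so the example can run anywhere.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

def convert_file(path, src="gbk", dst="utf-8"):
    """Rewrite the text file at `path` from encoding `src` to `dst`."""
    with io.open(path, "r", encoding=src) as f:
        text = f.read()                      # decoded to Unicode on read
    with io.open(path, "w", encoding=dst) as f:
        f.write(text)                        # encoded to `dst` on write

# Demo on a throwaway file (assumption: any writable temp directory)
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(demo_path, "wb") as f:
    f.write(u"\u4e2d\u6587".encode("gbk"))   # u"中文" saved as GBK ("ANSI")

convert_file(demo_path)

with io.open(demo_path, "rb") as f:
    converted = f.read()
os.remove(demo_path)

print(converted == u"\u4e2d\u6587".encode("utf-8"))
```

Using `with` blocks closes both file handles even if decoding fails partway through.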

How to tell whether a value is a string
if isinstance(s, str): pass  # does not cover unicode strings
if isinstance(s, basestring): pass  # True for both Unicode and byte strings

# Type-checking example
#-*- coding: UTF-8 -*-
s = "hello normal string"
u = u"unicode"
if isinstance( s, basestring ):
    print u"是字符串"
if isinstance( u, basestring ):
    print u"是字符串"

# Practice
>>> isinstance("s",str)
True
>>> isinstance(u"s",str)
False
>>> isinstance(u"s",unicode)
True
>>> isinstance(u"s",basestring)
True
>>> isinstance("s", ...)
True
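`basestring` exists only in Python 2. For code that may also run on Python 3, one possible approach is to fall back to a tuple of types. This is a sketch; `is_string` and `string_types` are hypothetical names introduced here.

```python
# -*- coding: utf-8 -*-
# Sketch: a string check that works on both Python 2 and Python 3.
# `is_string` and `string_types` are hypothetical names.

try:
    string_types = (basestring,)   # Python 2: common parent of str and unicode
except NameError:
    string_types = (str, bytes)    # Python 3: text strings and byte strings

def is_string(value):
    """Return True for both text strings and byte strings."""
    return isinstance(value, string_types)

print(is_string(u"unicode"), is_string(b"bytes"), is_string(42))
```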

Example:

import chardet
import urllib
# Choose a different data source as needed
TestData = urllib.urlopen('http://www.baidu.com/').read()
print chardet.detect(TestData)

Constants
1. Save const.py and test.py in the same directory
2. Running test.py reports an error on the second assignment
const.py

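#-*-coding:UTF-8-*-
#Filename: const.py
# Define a class that provides constant-like behavior.
#
# The class defines the method __setattr__() and an exception ConstError,
# which inherits from TypeError. __setattr__() looks in the instance's own
# __dict__ to see whether the constant has already been defined: if it has,
# ConstError is raised; otherwise the new constant is assigned.
# The last two lines register an instance of _const in the global
# dictionary sys.modules.
class _const:
    class ConstError(TypeError):pass
    def __setattr__(self, name, value):
        if self.__dict__.has_key(name):
            raise self.ConstError, "Can't rebind const (%s)" %name
        self.__dict__[name]=value
import sys
sys.modules[__name__]=_const()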

test.py

import const
const.magic = 23
print const.magic
const.magic = 33  # raises const.ConstError: the name is already bound
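The write-once behavior can also be tried in a single file, without the sys.modules registration. The sketch below is a simplified variant of the idea rather than the const.py module itself; `Const` is a hypothetical class name, and the syntax is chosen so that it also runs on Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch: write-once attributes in one file, no sys.modules registration.
# `Const` is a hypothetical class name.

class Const(object):
    class ConstError(TypeError):
        pass

    def __setattr__(self, name, value):
        # Refuse to rebind a name that is already stored on the instance
        if name in self.__dict__:
            raise self.ConstError("Can't rebind const (%s)" % name)
        self.__dict__[name] = value

const = Const()
const.magic = 23

try:
    const.magic = 33          # second assignment to the same name
    rebind_failed = False
except Const.ConstError:
    rebind_failed = True

print(const.magic, rebind_failed)
```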

Exercises
Exercise 1: generate all lowercase letters, all uppercase letters, and the letters in mixed case

>>> lower_cases = ""
>>> for i in range(97,97+26):
...     lower_cases+=chr(i)
...
>>> print lower_cases
abcdefghijklmnopqrstuvwxyz
>>> upper_cases=""
>>> for i in range(65,65+26):
...     upper_cases+=chr(i)
...
>>> print upper_cases
ABCDEFGHIJKLMNOPQRSTUVWXYZ
>>> letters=""
>>> for s in range(65,65+26):
...     letters+=chr(s)+chr(s+32)
...
>>> print letters
AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz
>>>
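The standard `string` module already provides these alphabets, so the loops over ASCII codes are not strictly needed. A short alternative sketch; the variable names simply mirror the ones used in the exercise.

```python
import string

# Ready-made alphabets from the standard library
lower_cases = string.ascii_lowercase
upper_cases = string.ascii_uppercase

# Interleave upper and lower case: Aa, Bb, ...
letters = "".join(up + low for up, low in zip(upper_cases, lower_cases))

print(lower_cases)
print(upper_cases)
print(letters)
```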

Exercise 2: count how many letters a sentence contains

# Method 1
>>> letters=""
>>> for s in range(65,65+26):
...     letters+=chr(s)+chr(s+32)
...
>>> print letters
>>> content=raw_input("please input a sentence:")
please input a sentence:I am a boy!
>>> letter_count=0
>>> for s in content:
...     if s in letters:
...         letter_count+=1
...
>>> print letter_count

# Method 2
>>> import string
>>> letter_count=0
>>> a=raw_input("请输入：")
请输入：ad123 9e
>>> for i in a:
...     if i in string.letters:
...         letter_count+=1
...
>>> print letter_count
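Both methods can be wrapped in a small function so the count can be checked without typing at a prompt. A sketch, with `count_letters` as a hypothetical name; it uses `string.ascii_letters`, which is available on both Python 2 and Python 3.

```python
import string

def count_letters(sentence):
    """Count the ASCII letters (a-z, A-Z) in `sentence`."""
    return sum(1 for ch in sentence if ch in string.ascii_letters)

# The two inputs used in the exercise
first = count_letters("I am a boy!")
second = count_letters("ad123 9e")
print(first, second)
```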

Exercise 3: encryption and decryption

letters = raw_input("please input some letter to encode:")
encoded_letters=""
for s in letters:
    # a-v and A-V shift forward by 4; w-z and W-Z wrap around to a-d and A-D
    if (s >= 'a' and s <"w") or  (s >= 'A' and s <"W"):
        encoded_letters+=chr(ord(s)+4)
    elif s>="w" and s<="z":
        encoded_letters+=chr(ord(s)-ord("w")+97)
    elif s>="W" and s<="Z":
        encoded_letters+=chr(ord(s)-ord("W")+65)
    else:
        print "some content is not a letter! please try again!"
        continue
print encoded_letters
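The exercise only shows the encoding direction. Decoding is the same shift applied backwards, which modular arithmetic handles without separate wrap-around branches. The sketch below is one way to write both directions; `shift_letters`, `encode`, and `decode` are hypothetical names, and passing non-letters through unchanged is an assumption that differs from the exercise, which prints a warning and skips them.

```python
def shift_letters(text, shift):
    """Shift each ASCII letter by `shift` positions, wrapping within a-z / A-Z."""
    result = []
    for ch in text:
        if "a" <= ch <= "z":
            base = ord("a")
        elif "A" <= ch <= "Z":
            base = ord("A")
        else:
            result.append(ch)   # assumption: non-letters pass through unchanged
            continue
        result.append(chr((ord(ch) - base + shift) % 26 + base))
    return "".join(result)

def encode(text):
    return shift_letters(text, 4)

def decode(text):
    return shift_letters(text, -4)

secret = encode("Hello, World!")
print(secret)
print(decode(secret))
```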

References
ASCII table: http://tool.oschina.net/commons?type=4
Notes on Python encoding problems: https://www.cnblogs.com/fnng/p/5008884.html
