Detecting encodings with Python: Universal Encoding Detector
Source: Internet  Editor: 程序博客网  Date: 2024/06/03 21:27
Detecting a file's encoding with Python
Universal Encoding Detector is a very good tool for this. Its website is: http://chardet.feedparser.org/
It is very convenient to use.
Usage
Basic usage
The easiest way to use the Universal Encoding Detector library is with the detect function.
Example: Using the detect function
The detect function takes one argument, a non-Unicode (byte) string. It returns a dictionary containing the auto-detected character encoding and a confidence level from 0 to 1.
>>> import urllib.request  # Python 3; in Python 2 this was urllib.urlopen
>>> rawdata = urllib.request.urlopen('http://yahoo.co.jp/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'encoding': 'EUC-JP', 'confidence': 0.99}
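Once detect has returned its dictionary, the usual next step is to decode the bytes with it. The following is only a minimal sketch of that step, not part of the library: the helper name decode_with_result, the min_confidence threshold and the UTF-8 fallback are all assumptions made for illustration. It takes the result dictionary as an argument, so it runs without chardet installed.

```python
def decode_with_result(raw, result, min_confidence=0.5, fallback="utf-8"):
    """Decode raw bytes using a chardet-style result dictionary.

    result is expected to look like {'encoding': 'EUC-JP', 'confidence': 0.99}.
    min_confidence and fallback are illustrative choices, not chardet settings.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if not encoding or confidence < min_confidence:
        # No encoding reported, or the guess is too uncertain: use the
        # fallback and replace undecodable bytes instead of raising.
        return raw.decode(fallback, errors="replace")
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:
        # The reported name is not a codec Python knows about.
        return raw.decode(fallback, errors="replace")


# Stand-in data so the sketch runs offline: EUC-JP bytes plus a result
# dictionary shaped like the one in the example above.
raw = "日本語のテキスト".encode("euc_jp")
text = decode_with_result(raw, {"encoding": "EUC-JP", "confidence": 0.99})
```

With the library installed, the call would be decode_with_result(rawdata, chardet.detect(rawdata)). Replacing undecodable bytes rather than raising is a design choice that suits display or logging; strict decoding may be preferable when corrupted text must not pass silently.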
Advanced usage
If you’re dealing with a large amount of text, you can call the Universal Encoding Detector library incrementally, and it will stop as soon as it is confident enough to report its results.
Create a UniversalDetector object, then call its feed method repeatedly with each block of text. If the detector reaches a minimum threshold of confidence, it will set detector.done to True.
Once you’ve exhausted the source text, call detector.close(), which will do some final calculations in case the detector didn’t hit its minimum confidence threshold earlier. Then detector.result will be a dictionary containing the auto-detected character encoding and confidence level (the same as the chardet.detect function returns).
Example: Detecting encoding incrementally
import urllib.request  # Python 3; in Python 2 this was urllib.urlopen
from chardet.universaldetector import UniversalDetector

usock = urllib.request.urlopen('http://yahoo.co.jp/')
detector = UniversalDetector()
for line in usock.readlines():
    detector.feed(line)
    # Stop reading as soon as the detector is confident enough.
    if detector.done:
        break
detector.close()
usock.close()
print(detector.result)
{'encoding': 'EUC-JP', 'confidence': 0.99}
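The feed / done / close sequence can also be wrapped in a small helper that reads fixed-size chunks instead of lines, which may suit binary data with few or no newlines. This is a sketch under assumptions: the name detect_stream and the 4096-byte default are invented here, and the detector is passed in as a parameter, so any object with the interface described above (for example a UniversalDetector instance) should work.

```python
def detect_stream(stream, detector, chunk_size=4096):
    """Feed a binary stream to a detector in fixed-size chunks.

    detector is any object exposing feed(), done, close() and result,
    such as a UniversalDetector. chunk_size is an arbitrary default.
    """
    while not detector.done:
        chunk = stream.read(chunk_size)
        if not chunk:
            # End of stream reached before the detector became confident.
            break
        detector.feed(chunk)
    # Let the detector do its final calculations, then report.
    detector.close()
    return detector.result
```

A call might look like detect_stream(open('page.html', 'rb'), UniversalDetector()), where page.html is a hypothetical file name. The caller remains responsible for closing the stream.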
If you want to detect the encoding of multiple texts (such as separate files), you can re-use a single UniversalDetector object. Just call detector.reset() at the start of each file, call detector.feed as many times as you like, and then call detector.close() and check the detector.result dictionary for the file’s results.
Example: Detecting encodings of multiple files
import glob
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
for filename in glob.glob('*.xml'):
    print(filename.ljust(60), end=' ')
    # Clear any state left over from the previous file.
    detector.reset()
    with open(filename, 'rb') as f:
        for line in f:
            detector.feed(line)
            if detector.done:
                break
    detector.close()
    print(detector.result)