Automatically detecting file encodings with the chardet module

Source: Internet  Editor: 程序博客网  Date: 2024/05/21 09:04

http://chardet.feedparser.org/
chardet returns the detected encoding together with a confidence value.
I tried it and it works well.

Example: Using the detect function

The detect function takes one argument, a non-Unicode string. It returns a dictionary containing the auto-detected character encoding and a confidence level from 0 to 1.

>>> import urllib
>>> rawdata = urllib.urlopen('http://yahoo.co.jp/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'encoding': 'EUC-JP', 'confidence': 0.99}
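The dictionary returned by detect can be used directly to decode the raw bytes. The following is a minimal Python 3 sketch and is not part of chardet: the helper name decode_detected, the 0.5 confidence threshold, and the UTF-8 fallback are all illustrative assumptions. The result argument is expected to have the shape shown above, for instance the value of chardet.detect(raw).

```python
def decode_detected(raw, result, min_confidence=0.5, fallback="utf-8"):
    """Decode bytes using a chardet-style result dictionary.

    `result` is assumed to look like {'encoding': 'EUC-JP', 'confidence': 0.99}.
    The threshold and the fallback encoding are arbitrary illustrative choices.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    # The encoding is None when no guess could be made (e.g. empty input).
    if encoding is None or confidence < min_confidence:
        encoding = fallback
    # errors="replace" avoids an exception if the guess turns out to be wrong.
    return raw.decode(encoding, errors="replace")


raw = "日本語のテキスト".encode("euc_jp")
text = decode_detected(raw, {"encoding": "EUC-JP", "confidence": 0.99})
print(text)
```

Falling back to a fixed encoding when the confidence is low is only one possible policy; raising an error or asking the user may suit other programs better.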

Example: Detecting encoding incrementally

import urllib
from chardet.universaldetector import UniversalDetector

usock = urllib.urlopen('http://yahoo.co.jp/')
detector = UniversalDetector()
for line in usock.readlines():
    detector.feed(line)
    if detector.done: break
detector.close()
usock.close()
print detector.result
{'encoding': 'EUC-JP', 'confidence': 0.99}
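The incremental example above is written in Python 2 syntax (print statement, urllib.urlopen). As a hedged Python 3 sketch of the same feed / done / close pattern, the function below accepts any iterable of byte chunks. The name detect_incrementally and the optional detector parameter are assumptions made for illustration; when no detector is passed, it creates a UniversalDetector as in the original example.

```python
def detect_incrementally(chunks, detector=None):
    """Feed byte chunks to a detector until it is confident or input ends."""
    if detector is None:
        # Imported lazily so a stand-in object can be supplied instead.
        from chardet.universaldetector import UniversalDetector
        detector = UniversalDetector()
    for chunk in chunks:
        detector.feed(chunk)
        if detector.done:
            # The detector has seen enough data; stop reading early.
            break
    # close() finalizes the guess; result is only reliable after it.
    detector.close()
    return detector.result
```

Passing a file object opened in binary mode, such as open(path, 'rb'), feeds the file line by line, so a large file does not have to be read completely when the detector reaches a conclusion early.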

Example: Detecting encodings of multiple files

import glob
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
for filename in glob.glob('*.xml'):
    print filename.ljust(60),
    detector.reset()
    for line in file(filename, 'rb'):
        detector.feed(line)
        if detector.done: break
    detector.close()
    print detector.result
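The multi-file example relies on Python 2's file() builtin and print statement. A possible Python 3 rendering, offered only as a sketch, collects the results in a dictionary instead of printing them. The function name detect_files and the optional detector parameter are hypothetical; the reset / feed / close sequence is the one used in the example above.

```python
import glob


def detect_files(pattern, detector=None):
    """Return {filename: result} for every file matching a glob pattern."""
    if detector is None:
        # Imported lazily so a stand-in object can be supplied instead.
        from chardet.universaldetector import UniversalDetector
        detector = UniversalDetector()
    results = {}
    for filename in sorted(glob.glob(pattern)):
        detector.reset()  # clear state left over from the previous file
        with open(filename, "rb") as handle:  # read bytes, not text
            for line in handle:
                detector.feed(line)
                if detector.done:
                    break
        detector.close()
        # Copy the result so later files cannot alter stored entries.
        results[filename] = dict(detector.result)
    return results
```

Reusing one detector and calling reset() before each file follows the original example; creating a new detector per file would also work, at the cost of repeated setup.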