Automatically detecting file encodings with the chardet module

Source: Internet. Editor: 程序博客网. Date: 2024/05/21 09:04

http://chardet.feedparser.org/
It returns the detected encoding together with a confidence value.
I tried it and it works very well.

Example: Using the `detect` function

The `detect` function takes one argument, a non-Unicode string (a `bytes` object in Python 3). It returns a dictionary containing the auto-detected character encoding and a confidence level from 0 to 1; recent chardet releases also include a `language` key.

    >>> import urllib.request
    >>> rawdata = urllib.request.urlopen('http://yahoo.co.jp/').read()
    >>> import chardet
    >>> chardet.detect(rawdata)
    {'encoding': 'EUC-JP', 'confidence': 0.99}
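Once `detect` has returned its dictionary, the usual next step is to decode the bytes. The following is a minimal sketch of that step, not part of chardet itself: the helper name `decode_bytes`, the `min_confidence` threshold of 0.5, and the UTF-8 fallback are all assumptions chosen for illustration. The `detect` parameter accepts any callable that returns a dictionary shaped like the one above; when it is omitted, `chardet.detect` is used.

```python
def decode_bytes(raw, detect=None, min_confidence=0.5, fallback="utf-8"):
    """Decode raw bytes with a detected encoding, or with a fallback.

    decode_bytes, min_confidence and fallback are hypothetical names
    and defaults used for illustration; they are not part of chardet.
    """
    if detect is None:
        import chardet  # third-party package: pip install chardet
        detect = chardet.detect

    result = detect(raw)
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0

    # chardet reports encoding=None when it cannot make a guess.
    if encoding is None or confidence < min_confidence:
        encoding = fallback

    try:
        # errors="replace" avoids an exception if the guess is wrong.
        return raw.decode(encoding, errors="replace"), encoding
    except LookupError:
        # The detected name is not a codec known to Python.
        return raw.decode(fallback, errors="replace"), fallback
```

Calling `decode_bytes(rawdata)` on the bytes from the example above would return the decoded text together with the name of the encoding that was actually used. The right threshold depends on the data, so treat 0.5 as a starting point only.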

Example: Detecting encoding incrementally

    import urllib.request
    from chardet.universaldetector import UniversalDetector

    usock = urllib.request.urlopen('http://yahoo.co.jp/')
    detector = UniversalDetector()
    for line in usock.readlines():
        detector.feed(line)
        # Stop as soon as the detector is confident enough.
        if detector.done:
            break
    detector.close()  # finalizes detector.result
    usock.close()
    print(detector.result)

{'encoding': 'EUC-JP', 'confidence': 0.99}
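Line-based feeding works for text-like input, but a binary stream may contain very long "lines". As a hedged variation on the same idea, the sketch below feeds fixed-size chunks instead. The function name `detect_stream` and the 4096-byte chunk size are assumptions for illustration; the detector argument is expected to offer the `reset`, `feed`, `done`, `close` and `result` members used in the example above.

```python
def detect_stream(stream, detector, chunk_size=4096):
    """Feed a binary stream to the detector in fixed-size chunks.

    detect_stream and chunk_size are hypothetical; `detector` is any
    object with the UniversalDetector interface used in the example.
    """
    detector.reset()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:        # end of stream reached
            break
        detector.feed(chunk)
        if detector.done:    # confident already, stop reading early
            break
    detector.close()         # finalizes detector.result
    return detector.result


# Possible usage ("page.html" is a placeholder file name):
#
#     from chardet.universaldetector import UniversalDetector
#     with open("page.html", "rb") as handle:
#         print(detect_stream(handle, UniversalDetector()))
```

Because reading stops as soon as `detector.done` is set, a large input is usually not read in full.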

Example: Detecting encodings of multiple files

    import glob
    from chardet.universaldetector import UniversalDetector

    detector = UniversalDetector()
    for filename in glob.glob('*.xml'):
        print(filename.ljust(60), end=' ')
        # Reuse one detector; reset() clears state from the previous file.
        detector.reset()
        with open(filename, 'rb') as f:
            for line in f:
                detector.feed(line)
                if detector.done:
                    break
        detector.close()
        print(detector.result)
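Printing is fine for a quick look, but a script often needs the results as data. The sketch below, under the same assumptions about the detector interface, collects one result per file into a dictionary. The name `detect_files` is hypothetical and not part of chardet.

```python
def detect_files(paths, detector):
    """Return {path: result}, reusing one detector for all files.

    detect_files is a hypothetical helper; `detector` is any object
    with the UniversalDetector interface (reset/feed/done/close/result).
    """
    results = {}
    for path in paths:
        detector.reset()                 # clear state from the last file
        with open(path, "rb") as handle:
            for line in handle:
                detector.feed(line)
                if detector.done:
                    break
        detector.close()
        # Store a copy so later files cannot alter this entry.
        results[str(path)] = dict(detector.result)
    return results


# Possible usage with the same '*.xml' pattern as above:
#
#     import glob
#     from chardet.universaldetector import UniversalDetector
#     for name, result in detect_files(glob.glob("*.xml"),
#                                      UniversalDetector()).items():
#         print(name.ljust(60), result)
```

Returning a dictionary makes it easy to filter afterwards, for example to list only the files whose confidence is low.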
