File Encoding Conversion (ASCII <-> UTF-8)

Source: Internet | Published by: 联合国创意城市网络 | Editor: 程序博客网 | Time: 2024/05/15 15:31


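While using Source Insight [version 3.50.0080] to read code written for Linux, I found that it handles Chinese comments poorly. Some posts online suggest changing the comment font to "新宋体" (or "宋体"), but I could not get that to work. So I decided to simply re-save the UTF-8 files as ASCII (in practice the local ANSI code page, which the Python script further down treats as GBK). The first thing that came to mind was "Save As" in Notepad, but that is clearly not workable when there are many files.
After a lot of searching I found a fairly well-written post, "sourceinsight中文显示乱码问题彻底解决办法". Its approach is short and clear, but it seems to have a problem: it corrupts files that were already ASCII. Note also that when Notepad saves a file as UTF-8 it actually writes "UTF-8 with BOM", which causes quite a few problems of its own.
My slightly improved version is posted below. The only change is that the target folder is given on the command line; it does not fix the ASCII problem -.-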

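@echo off
rem Usage: pass the target directory as the first argument.
set "DIR=%~1"
if "%DIR%"=="" (
    echo "Should input the directory name"
) else (
    for /R "%DIR%" %%i in (*.h *.c *.cpp *.cs *.mak *.java) do (
        echo %%i
        rem Pass 1: read the file as UTF-8 and write non-ASCII characters as \uXXXX escapes.
        native2ascii -encoding UTF-8 "%%i" "%DIR%\temp"
        rem Pass 2: turn the escapes back into native characters (no -encoding given, so the default is used).
        native2ascii -reverse "%DIR%\temp" "%%i"
    )
    rem Remove the intermediate file once every source file has been processed.
    if exist "%DIR%\temp" del "%DIR%\temp"
    echo ALL DONE
    pause
)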
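Both problems mentioned above can be looked at on the byte level. A likely reason the batch script damages files that are not UTF-8 is that it always reads them as UTF-8. The following is a minimal Python 3 sketch of that effect and of the BOM that Notepad adds. It assumes the legacy encoding is GBK; the sample text and the helper name `is_valid_utf8` are made up for illustration.

```python
import codecs

text = u"\u4e2d\u6587"  # the two characters of "Chinese", used as sample comment text

gbk_bytes = text.encode("gbk")                 # how a GBK editor stores the text
utf8_bytes = text.encode("utf-8")              # plain UTF-8, no BOM
notepad_bytes = codecs.BOM_UTF8 + utf8_bytes   # "UTF-8 with BOM" as written by Notepad


def is_valid_utf8(raw):
    """Return True if the bytes decode as strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# GBK bytes are not valid UTF-8 here, so forcing a UTF-8 decode
# replaces the undecodable bytes with U+FFFD and the text is lost.
damaged = gbk_bytes.decode("utf-8", "replace")

# The BOM is three extra bytes (EF BB BF) in front of the real content.
has_bom = notepad_bytes.startswith(codecs.BOM_UTF8)
without_bom = notepad_bytes[len(codecs.BOM_UTF8):]
```

A tool that is not aware of the BOM sees those three bytes as part of the file content, which is one way the "UTF-8 with BOM" variant can cause trouble.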

Some references on native2ascii:

1. native2ascii命令 (the native2ascii command)

2. native2ascii命令详解 (the native2ascii command in detail)
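To make the two native2ascii passes in the batch script easier to follow, here is a rough Python 3 emulation of the round trip. It is a sketch under assumptions, not the tool's actual implementation: the function names `to_escapes` and `from_escapes` are hypothetical, and the target encoding of the reverse pass is assumed to be GBK.

```python
import re
import struct

# Matches one \uXXXX escape (a backslash, "u", four hex digits).
ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def to_escapes(raw, encoding="utf-8"):
    """Pass 1: decode the bytes, then write every non-ASCII UTF-16 unit as \\uXXXX."""
    text = raw.decode(encoding)
    utf16 = text.encode("utf-16-be")
    units = struct.unpack(">%dH" % (len(utf16) // 2), utf16)
    return "".join(chr(u) if u < 0x80 else "\\u%04x" % u for u in units)


def from_escapes(escaped, encoding="gbk"):
    """Pass 2: turn \\uXXXX escapes back into characters and encode the result."""
    text = ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), escaped)
    # Re-join surrogate pairs that were written as two separate escapes.
    text = text.encode("utf-16-be", "surrogatepass").decode("utf-16-be")
    return text.encode(encoding)
```

Only `\uXXXX` sequences are touched in the reverse pass, so ordinary C escapes such as `\n` inside string literals stay as they are. A source file that already contains a literal `\uXXXX` sequence would be changed by this sketch, though.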

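So I wrote a Python program myself to do what I needed, converting between ASCII and UTF-8 in both directions:

Note: the chardet module has to be installed separately, and my Python environment is 2.7.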

Usage: python transformFormat.py fileOrDirName toUTF_8(True/False) fileExtensions(c,cpp,h,cs,mak)[optional]

For example: python transformFormat.py H:\test True c cpp h

This converts every file under H:\test with a .c, .cpp or .h extension to UTF-8 (the original format does not matter).

"""transformFormat.py: convert the encoding of source files, mainly between
ASCII (GBK) and UTF-8."""

import codecs
import os
import sys

import chardet  # third-party module, must be installed separately


class Transform(object):
    def listFiles(self, root=''):
        """Return [root] if root is a file, otherwise every file below it."""
        allFiles = []
        if os.path.isfile(root):  # root is just a file
            allFiles.append(root)
            return allFiles
        for i in os.listdir(root):  # root is a directory
            f = os.path.join(root, i)
            if os.path.isdir(f):
                allFiles += self.listFiles(root=f)
            elif os.path.isfile(f):
                allFiles.append(f)
        return allFiles

    def transform(self, fileName, toUTF_8):
        # Binary mode keeps the bytes (including line endings) exactly as stored.
        with open(fileName, 'rb') as f:
            data = f.read()
        if data[:3] == codecs.BOM_UTF8:  # in case of UTF-8 with BOM
            data = data[3:]
        try:
            print('Transform begin, file: ' + fileName + ';toUTF_8: ' + str(toUTF_8))
            detected = chardet.detect(data)['encoding']
            if detected is None:  # chardet could not guess, e.g. an empty file
                print(fileName + ' encoding not detected, skipped')
                return
            encodeType = detected.upper()
            print(fileName + ' ' + encodeType)
            alreadyUTF_8 = (encodeType.find('UTF') != -1)  # already utf-8
            if toUTF_8 == alreadyUTF_8:  # no need to transform, already OK
                print(fileName + ' Already')
                return
            if toUTF_8:  # requested: change to utf-8
                data = data.decode('gbk', 'ignore').encode('utf-8')
            else:
                data = data.decode('utf-8', 'ignore').encode('gbk')
            # write back the content
            with open(fileName, 'wb') as f:
                f.write(data)
            print(fileName + ' OK')
        except Exception as e:
            print('WRONG with ' + fileName)
            print(e)

    def main(self, root='', toUTF_8=True, fileExtensions=()):
        allFiles = self.listFiles(root=root)
        # keep only the files whose extension was asked for
        allFiles2 = []
        for f in allFiles:
            fends = f.split('.')[-1]
            if fends in fileExtensions:
                allFiles2.append(f)
        if len(allFiles2) == 0:
            print('No file to transform')
            return
        for f in allFiles2:
            self.transform(f, toUTF_8)


# Example of calling the class directly instead of from the command line:
# t = Transform()
# t.main(root=r'H:\leetcode\wingide\he', toUTF_8=False,
#        fileExtensions=['c', 'cpp', 'h', 'cs', 'mak', 'txt'])

if __name__ == '__main__':
    print('Usage: python transformFormat.py fileOrDirName toUTF_8(True/False)  fileExtensions(c,cpp,h,cs,mak)[optional]')
    if len(sys.argv) < 2:
        print('No file name!')
        sys.exit(1)
    if len(sys.argv) == 2:
        print('Should give toUTF_8')
        sys.exit(1)
    root = sys.argv[1]
    if sys.argv[2] == 'True':
        toUTF_8 = True
    elif sys.argv[2] == 'False':
        toUTF_8 = False
    else:
        print('toUTF_8 should be True or False')
        sys.exit(1)
    fileExtensions = ['c', 'cpp', 'h', 'cs', 'mak']
    if len(sys.argv) > 3:
        fileExtensions = sys.argv[3:]
    print('Transform begin, root: ' + root + ';toUTF_8: ' + str(toUTF_8) + ';fileExtensions:' + str(fileExtensions))
    t = Transform()
    t.main(root=root, toUTF_8=toUTF_8, fileExtensions=fileExtensions)
    print('Transform Over')
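If installing chardet is inconvenient, a narrower check may be enough when every file is known to be either UTF-8 or GBK: UTF-8 can be validated by a strict decode, and anything that fails is treated as GBK. The sketch below swaps that check in for chardet; it is not the script above, the function names are hypothetical, and it is written for Python 3 using only the standard library.

```python
def detect_utf8_or_gbk(raw):
    """Guess 'utf-8' if the bytes are strictly valid UTF-8 (BOM allowed), else 'gbk'."""
    try:
        raw.decode("utf-8-sig")
        return "utf-8"
    except UnicodeDecodeError:
        return "gbk"


def convert_bytes(raw, to_utf8):
    """Return the content re-encoded as UTF-8 (no BOM) or as GBK."""
    if detect_utf8_or_gbk(raw) == "utf-8":
        text = raw.decode("utf-8-sig")  # "utf-8-sig" drops a leading BOM if present
    else:
        text = raw.decode("gbk")        # raises if the bytes are not valid GBK either
    # Encoding to GBK raises if the text has characters GBK cannot represent.
    return text.encode("utf-8" if to_utf8 else "gbk")


def convert_file(path, to_utf8):
    """Convert one file in place; return True if the file was rewritten."""
    with open(path, "rb") as f:         # binary mode: line endings stay untouched
        raw = f.read()
    converted = convert_bytes(raw, to_utf8)
    if converted == raw:
        return False                    # already in the requested form
    with open(path, "wb") as f:
        f.write(converted)
    return True
```

Pure ASCII content is valid UTF-8 and encodes to the same bytes in GBK, so such files are left alone in both directions. Unlike the `'ignore'` error handler used in the script above, this sketch raises on bytes it cannot decode instead of silently dropping them, so a caller would need to catch the exception per file.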

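References:

1. python 中文乱码问题深入分析 (an in-depth look at garbled Chinese text in Python)

2. 字符编码笔记：ASCII，Unicode和UTF-8 (notes on character encoding: ASCII, Unicode and UTF-8)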

0 0