Processing Text with Python

Source: Internet | Published by: access数据库实验心得 | Editor: 程序博客网 | Date: 2024/05/29 09:11

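I have recently been running some experiments that require text processing: extracting key fields from text output and turning them into a table for analysis. This post is a brief record of how I did it.

I. The requirement:

I have the text output files produced by programs run under GPGPU-Sim, and I need to extract the values for certain target keys. For example, suppose the text contains the following: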

A1 = B1

A2 = B2

A3 = B3

.....

A5 = B5

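I need to extract the values B2 and B5 that correspond to A2 and A5, and write them to a text file in the format "B2 B5". How can this be done in Python?

The fields to extract are: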

'gpu_sim_insn',  
'gpu_ipc',  
'L1I_total_cache_accesses',  
'L1D_total_cache_accesses',  
'gpgpu_n_tot_thrd_icount',  
'gpgpu_n_tot_w_icount',  
'gpgpu_n_mem_read_local',  
'gpgpu_n_mem_write_local',  
'gpgpu_n_mem_read_global',  
'gpgpu_n_mem_write_global',  
'gpgpu_n_mem_texture',  
'gpgpu_n_mem_const',  
'gpgpu_n_load_insn',  
'gpgpu_n_store_insn',  
'gpgpu_n_shmem_insn',  
'gpgpu_n_tex_insn',  
'gpgpu_n_const_mem_insn',  
'gpgpu_n_param_mem_insn'  

The code is as follows:

import os

# Directory that contains the files to process
path = 'D:\\GPUClusters\\Stargazer-master\\EXP_RESULT'

# Fields to extract
x = [
     'gpu_sim_insn',
     'gpu_ipc',
     'L1I_total_cache_accesses',
     'L1D_total_cache_accesses',
     'gpgpu_n_tot_thrd_icount',
     'gpgpu_n_tot_w_icount',
     'gpgpu_n_mem_read_local',
     'gpgpu_n_mem_write_local',
     'gpgpu_n_mem_read_global',
     'gpgpu_n_mem_write_global',
     'gpgpu_n_mem_texture',
     'gpgpu_n_mem_const',
     'gpgpu_n_load_insn',
     'gpgpu_n_store_insn',
     'gpgpu_n_shmem_insn',
     'gpgpu_n_tex_insn',
     'gpgpu_n_const_mem_insn',
     'gpgpu_n_param_mem_insn'
     ]

# Output file: opened before chdir, so it is created in the original
# working directory rather than inside the directory being scanned
with open("res.txt", 'w') as fout:
    # Change into the data directory
    os.chdir(path)

    # Walk over every file in the directory
    for filename in os.listdir():
        with open(filename, 'r') as fs:
            # Process each line of the file
            for line in fs:
                a = line.split()
                if a and a[0] in x:
                    # The value is the last token on the line
                    fout.write(a[-1] + '\t')
                    # The last field ends the row for this file
                    if a[0] == 'gpgpu_n_param_mem_insn':
                        fout.write('\n')
                        break

    fout.write('\n')
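The same extraction can also be sketched as a small function that works on any iterable of lines, which makes it easy to try without the simulator output at hand. The name `extract_fields` and the sample lines are assumptions made for illustration only. Unlike the script above, this sketch returns the values in the order of the key list rather than the order in which they appear in the file.

```python
def extract_fields(lines, keys):
    """Return the last token of the first line that starts with each key.

    Values come back in the order of `keys`; a key that never appears
    yields an empty string. Hypothetical helper, not part of the script above.
    """
    found = {}
    for line in lines:
        tokens = line.split()
        # Keep only the first occurrence of each wanted key
        if tokens and tokens[0] in keys and tokens[0] not in found:
            found[tokens[0]] = tokens[-1]
    return [found.get(key, '') for key in keys]


# Made-up sample in the "A = B" shape described in the requirement
sample = [
    "A1 = B1",
    "A2 = B2",
    "",
    "A5 = B5",
]
row = extract_fields(sample, ["A2", "A5"])
print(' '.join(row))
```

Returning a list per file keeps the columns aligned even when a field is missing from one of the files.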

A few points about the code above:

1. A directory contains multiple files, each of which must be read once and processed. How can this be done?

# Suppose d:\work holds the files you want to read; the code can be written like this:
import os
path = 'd:\\work'  # or path = r'd:\work'
os.chdir(path)
for filename in os.listdir():
    with open(filename, 'r') as f:
        for eachline in f:
            pass  # process eachline
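A variant that avoids changing the working directory is sketched below. It joins the directory and file name with `os.path.join` and skips subdirectories, which `open()` would otherwise fail on. The generator name `iter_lines_in_dir` is hypothetical and chosen only for this example.

```python
import os


def iter_lines_in_dir(path):
    """Yield (filename, line) for every regular file directly under `path`."""
    for filename in sorted(os.listdir(path)):
        full_path = os.path.join(path, filename)
        if not os.path.isfile(full_path):
            continue  # skip subdirectories and other non-files
        with open(full_path, 'r') as f:
            for line in f:
                yield filename, line
```

A caller would then write `for filename, line in iter_lines_in_dir(path):` and process each line in the loop body.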

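2. What is the difference between .read(), .readline(), and .readlines() in Python?

Python makes it very easy to read the contents of a text file into a string variable that you can work with. File objects provide three "read" methods: .read(), .readline(), and .readlines(). Each method accepts an optional argument that limits how much data is read per call, but they are usually called without one. .read() reads the whole file at once and is typically used to put the file contents into a single string variable. Although .read() gives the most direct string representation of the file contents, it is unnecessary for sequential, line-oriented processing, and that kind of processing becomes impossible if the file is larger than the available memory.

.readline() and .readlines() are very similar. Both are used in constructs like the following:

Python .readlines() example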

fh = open('c:\\autoexec.bat')
for line in fh.readlines():
    print(line)
fh.close()

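The difference between .readline() and .readlines() is that the latter reads the entire file at once, as .read() does. .readlines() automatically splits the file contents into a list of lines, which a Python for ... in ... loop can then process. .readline(), by contrast, reads only one line per call, and a loop of repeated .readline() calls is usually slower than a single .readlines(). Use .readline() only when there is not enough memory to read the whole file at once. In current Python, iterating over the file object itself (for line in fh) also reads one line at a time without loading the whole file.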
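The differences are easiest to see side by side. The sketch below writes a small throwaway file with `tempfile` (the file contents are made up for the example) and reads it back with each method.

```python
import os
import tempfile

# Write a small throwaway file so the example is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("first\nsecond\nthird\n")
    tmp_path = tmp.name

with open(tmp_path) as fh:
    whole = fh.read()                 # the entire file as one string

with open(tmp_path) as fh:
    first_line = fh.readline()        # only the first line, newline included

with open(tmp_path) as fh:
    all_lines = fh.readlines()        # a list with one string per line

with open(tmp_path) as fh:
    iterated = [line for line in fh]  # line-by-line iteration over the file object

os.remove(tmp_path)

print(repr(whole))
print(repr(first_line))
print(all_lines)
```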

3. The split method: http://www.w3cschool.cc/python/att-string-split.html
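As a quick illustration of how `split` behaves in the two ways this post relies on, here is a minimal sketch. The sample strings and the value `123.45` are made up for the example.

```python
# With no argument, split() breaks on any run of whitespace
# and drops leading/trailing whitespace, including the newline.
line = "gpu_ipc =     123.45\n"   # made-up value
tokens = line.split()
key, value = tokens[0], tokens[-1]

# A whitespace-only line splits into an empty list,
# which is why the scripts check for that before indexing.
empty = "   \n".split()

# With a separator and maxsplit=1, only the first separator is used,
# so later colons stay inside the text.
speaker, text = "boy:It is 9:30 now.".split(":", 1)

print(tokens)
print(empty)
print(speaker, '|', text)
```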

II. Another simple example:

Given the following text file "record.txt":

boy:what's your name?  
girl:my name is lebaishi,what about you?  
boy:my name is wahaha.  
girl:i like your name.  
==============================================  
girl:how old are you?  
boy:I'm 16 years old,and you?  
girl:I'm 14.what is your favorite color?  
boy:My favorite is orange.  
girl:I like orange too!  
==============================================  
boy:where do you come from?  
girl:I come from SH.  
boy:My home is not far from you,I live in Jiangsu province.  
girl:Let's be good friends.  
boy:OK!  

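Requirement: split the data in the file (record.txt) and save it according to the following rules:

--Save the boy's lines separately in boy_*.txt files (with the "boy:" prefix removed)

--Save the girl's lines separately in girl_*.txt files (with the "girl:" prefix removed)

--The file contains three conversations in total, to be saved as boy_1.txt, girl_1.txt, boy_2.txt, girl_2.txt, boy_3.txt, and girl_3.txt, six files in all (the conversations in the file are already separated by "=======").

Code: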

boy_log = []
girl_log = []
version = 1

def save_to_file(boy_log, girl_log, version):
    # Write one conversation to boy_<version>.txt and girl_<version>.txt
    filename_boy = 'boy_' + str(version) + ".txt"
    filename_girl = 'girl_' + str(version) + ".txt"
    with open(filename_boy, "w") as fb, open(filename_girl, "w") as fg:
        fb.writelines(boy_log)
        fg.writelines(girl_log)

def process(filename):
    # Declare the module-level state once, before any use
    global boy_log, girl_log, version
    with open(filename, "r") as f:
        for eachline in f:
            if eachline[:6] != "======":
                # Split on the first colon only, so colons in the dialogue survive
                mylist = eachline.split(":", 1)
                if mylist[0] == "boy":
                    boy_log.append(mylist[-1])
                else:
                    girl_log.append(mylist[-1])
            else:
                # Separator line: save the current conversation and start the next
                save_to_file(boy_log, girl_log, version)
                version += 1
                boy_log = []
                girl_log = []

    # The last conversation has no trailing separator, so save it here
    save_to_file(boy_log, girl_log, version)

if __name__ == "__main__":
    fn = "record.txt"
    process(fn)
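The grouping logic can also be sketched without global variables or file output, which makes it easier to check in isolation. The function name `split_conversations` is hypothetical. Note one deliberate difference from the script above: lines that start with neither "boy" nor "girl" are ignored here instead of being counted as the girl's.

```python
def split_conversations(lines):
    """Group dialogue lines into conversations separated by '======' lines.

    Returns a list of (boy_lines, girl_lines) tuples, one per conversation.
    """
    conversations = []
    boy, girl = [], []
    for line in lines:
        if line.startswith("======"):
            # Separator: close the current conversation and start a new one
            conversations.append((boy, girl))
            boy, girl = [], []
            continue
        # partition splits on the first colon only
        speaker, _, text = line.partition(":")
        if speaker == "boy":
            boy.append(text)
        elif speaker == "girl":
            girl.append(text)
    # The last conversation has no trailing separator
    conversations.append((boy, girl))
    return conversations


# A shortened excerpt of record.txt
sample = [
    "boy:what's your name?\n",
    "girl:my name is lebaishi,what about you?\n",
    "==============\n",
    "girl:how old are you?\n",
    "boy:I'm 16 years old,and you?\n",
]
result = split_conversations(sample)
print(len(result))
```

Each tuple in the result could then be passed to a writer such as `save_to_file`, together with its 1-based position in the list as the version number.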

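Both examples are very basic and very practical, so I am recording them here for future reference.

One more simple requirement: I need to get the IPv4 address of eth0 on Linux. The code is as follows: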

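#!/usr/bin/python

import os

# Dump the ifconfig output to a file
os.system("ifconfig > ip.info")

def get_ip():
    # flag is 1 while we are inside the eth0 block. It must be initialized
    # inside the function: assigning to it makes it a local variable, so
    # reading it before assignment would raise UnboundLocalError.
    flag = 0
    with open("ip.info", 'r') as fs:
        for line in fs:
            a = line.split()
            if a and a[0] == "eth0":
                flag = 1
            if a and a[0] == "lo":
                flag = 0

            if flag == 0:
                continue
            # Inside the eth0 block: look for "inet addr:<ip>"
            for item in a:
                if a[0] == "inet" and item[0:5] == "addr:":
                    return item[5:]

ip = get_ip()
print(ip)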
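The parsing step can be sketched separately from the `ifconfig` call, so that it can be tried on a string. The function name `parse_inet_addr` and the sample text are assumptions made for illustration. The sample imitates the older `ifconfig` layout that the script above expects ("inet addr:" on an indented line); many newer systems print a different layout, which neither the script nor this sketch handles.

```python
def parse_inet_addr(text, interface="eth0"):
    """Return the 'inet addr:' value in the given interface's block, or None."""
    in_block = False
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        # A line that starts in column 0 opens a new interface block
        if not line[0].isspace():
            in_block = (tokens[0] == interface)
        if in_block:
            for i, item in enumerate(tokens):
                # Require the preceding token to be exactly "inet" (not "inet6")
                if item.startswith("addr:") and i > 0 and tokens[i - 1] == "inet":
                    return item[5:]
    return None


# Made-up sample in the older ifconfig layout (addresses are illustrative)
SAMPLE = """\
eth0      Link encap:Ethernet  HWaddr 00:11:22:33:44:55
          inet addr:192.168.1.10  Bcast:192.168.1.255  Mask:255.255.255.0

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
"""

print(parse_inet_addr(SAMPLE))
```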

Original source: http://blog.csdn.net/lavorange/article/details/41647091

0 0