Processing Text with Python
Source: Internet  Editor: 程序博客网  Time: 2024/05/29 09:11
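Recently I have been running some experiments that require text processing: extracting key fields from text files, building a table from them, and analyzing the result. This is a brief record of how I did it.
I. The requirement:
I have text output files from programs run under GPGPU-Sim, and I need to extract the values that belong to certain target keys. For example, suppose the text contains: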
A1 = B1
A2 = B2
A3 = B3
.....
A5 = B5
I need to extract B2 and B5, the values that correspond to A2 and A5, and write them to a text file in the format "B2 B5". How can this be done in Python?
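As a minimal sketch of the toy requirement above, the snippet below parses `key = value` lines and joins the values of the wanted keys with a space. The function name `pick_values` and the sample text are illustrative assumptions, not part of the original script.

```python
def pick_values(text, wanted):
    """Return the values of the wanted keys, in the order given by `wanted`."""
    table = {}
    for line in text.splitlines():
        # Split on the first '=' only; skip lines that are not key = value pairs
        key, sep, value = line.partition("=")
        if sep:
            table[key.strip()] = value.strip()
    return " ".join(table[k] for k in wanted if k in table)


sample = "A1 = B1\nA2 = B2\nA3 = B3\nA5 = B5\n"
print(pick_values(sample, ["A2", "A5"]))  # prints: B2 B5
```

Looking the keys up in a dictionary means the output order follows `wanted`, not the order of the lines in the file.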
The fields to extract are:
- 'gpu_sim_insn',
- 'gpu_ipc',
- 'L1I_total_cache_accesses',
- 'L1D_total_cache_accesses',
- 'gpgpu_n_tot_thrd_icount',
- 'gpgpu_n_tot_w_icount',
- 'gpgpu_n_mem_read_local',
- 'gpgpu_n_mem_write_local',
- 'gpgpu_n_mem_read_global',
- 'gpgpu_n_mem_write_global',
- 'gpgpu_n_mem_texture',
- 'gpgpu_n_mem_const',
- 'gpgpu_n_load_insn',
- 'gpgpu_n_store_insn',
- 'gpgpu_n_shmem_insn',
- 'gpgpu_n_tex_insn',
- 'gpgpu_n_const_mem_insn',
- 'gpgpu_n_param_mem_insn'
The code is as follows:
    import os

    # Directory that holds the GPGPU-Sim output files
    path = 'D:\\GPUClusters\\Stargazer-master\\EXP_RESULT'

    # Keys whose values should be extracted
    keys = [
        'gpu_sim_insn',
        'gpu_ipc',
        'L1I_total_cache_accesses',
        'L1D_total_cache_accesses',
        'gpgpu_n_tot_thrd_icount',
        'gpgpu_n_tot_w_icount',
        'gpgpu_n_mem_read_local',
        'gpgpu_n_mem_write_local',
        'gpgpu_n_mem_read_global',
        'gpgpu_n_mem_write_global',
        'gpgpu_n_mem_texture',
        'gpgpu_n_mem_const',
        'gpgpu_n_load_insn',
        'gpgpu_n_store_insn',
        'gpgpu_n_shmem_insn',
        'gpgpu_n_tex_insn',
        'gpgpu_n_const_mem_insn',
        'gpgpu_n_param_mem_insn'
    ]

    # Open the output first, so res.txt lands in the directory the script starts in
    with open("res.txt", 'w') as fout:
        os.chdir(path)
        for filename in os.listdir():
            if not os.path.isfile(filename):
                continue  # skip subdirectories
            with open(filename, 'r') as fs:
                for line in fs:
                    a = line.split()
                    # The key is the first token, the value is the last one
                    if a and a[0] in keys:
                        fout.write(a[-1] + '\t')
                        if a[0] == 'gpgpu_n_param_mem_insn':
                            # Last key of interest: end this file's row and stop reading
                            fout.write('\n')
                            break
        fout.write('\n')
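The script above writes values in the order they appear in each file. A hedged alternative sketch: collect each file's values into a dictionary first, then emit them in a fixed column order with the standard `csv` module, leaving a blank cell when a key is missing. The helper names `parse_stats` and `write_table`, the shortened `KEYS` list, and the sample lines are assumptions for illustration.

```python
import csv
import io

KEYS = ['gpu_sim_insn', 'gpu_ipc', 'gpgpu_n_load_insn']  # shortened for the example


def parse_stats(lines, keys=KEYS):
    """Map each wanted key to the last token on its line (last occurrence wins)."""
    stats = {}
    for line in lines:
        tokens = line.split()
        if tokens and tokens[0] in keys:
            stats[tokens[0]] = tokens[-1]
    return stats


def write_table(rows, out, keys=KEYS):
    """Write a tab-separated table: one header row, then one row per file."""
    writer = csv.writer(out, delimiter='\t', lineterminator='\n')
    writer.writerow(keys)
    for stats in rows:
        writer.writerow([stats.get(k, '') for k in keys])


# Made-up sample lines in the "key = value" layout described above
sample = ["gpu_sim_insn = 1000", "gpu_ipc = 2.5", "unrelated = 7"]
buf = io.StringIO()
write_table([parse_stats(sample)], buf)
print(buf.getvalue())
```

A header row plus fixed columns keeps the table aligned even when one file lacks a key, which makes it easier to load into a spreadsheet afterwards.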
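Some notes on the code:
1. A directory contains several files, and each one has to be read once and processed. How can this be done?
    import os

    path = 'd:\\work'
    os.chdir(path)
    for filename in os.listdir():
        with open(filename, 'r') as file:
            for eachline in file.readlines():
                pass  # process eachline here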
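A small sketch of the same loop without changing the working directory: `os.path.join` builds the full path, and `os.path.isfile` skips subdirectories, which `open` would otherwise fail on. The function name `count_lines_per_file` is hypothetical, and the demo runs on a temporary directory rather than a real working folder.

```python
import os
import tempfile


def count_lines_per_file(directory):
    """Return {filename: number of lines} for every regular file in directory."""
    counts = {}
    for filename in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, filename)
        if not os.path.isfile(full_path):
            continue  # skip subdirectories
        with open(full_path, 'r') as f:
            counts[filename] = sum(1 for _ in f)
    return counts


# Demo on a throwaway directory with two files and one subdirectory
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("a.txt", "x\ny\n"), ("b.txt", "z\n")]:
        with open(os.path.join(tmp, name), 'w') as f:
            f.write(text)
    os.mkdir(os.path.join(tmp, "sub"))
    demo_counts = count_lines_per_file(tmp)

print(demo_counts)
```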
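2. What is the difference between .read(), .readline(), and .readlines() in Python?
Reading the contents of a text file into a string variable is easy in Python. File objects provide three "read" methods: .read(), .readline(), and .readlines(). Each accepts an optional argument that limits how much data is read per call, but they are usually called without it. .read() reads the whole file at once and is typically used to put the file's contents into a single string. Although .read() gives the most direct string representation of the contents, it is unnecessary for sequential, line-oriented processing, and it is not feasible when the file is larger than available memory.
.readline() and .readlines() are very similar. Both are used in constructs like the following:
Python .readlines() example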
    fh = open('c:\\autoexec.bat')
    for line in fh.readlines():
        print(line)
    fh.close()
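The difference between .readline() and .readlines() is that the latter reads the whole file at once, like .read(). .readlines() automatically splits the contents into a list of lines, which Python's for ... in ... construct can then process. .readline(), on the other hand, reads only one line per call, so calling it in a loop can be slower than a single .readlines() call; it is mainly worth using when there is not enough memory to read the whole file at once. In current Python, iterating over the file object itself (for line in fh) also reads one line at a time and is the usual choice for large files.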
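The three methods can be compared on an in-memory file. This sketch uses `io.StringIO` as a stand-in for a real file object (an assumption made so the example needs no file on disk); a file returned by `open` in text mode exposes the same methods.

```python
import io

text = "first\nsecond\nthird\n"

whole = io.StringIO(text).read()        # one string with the full contents
lines = io.StringIO(text).readlines()   # list of lines, newlines kept

fh = io.StringIO(text)
one = fh.readline()                     # just the first line
rest = [line for line in fh]            # iteration continues where readline stopped

print(repr(whole))
print(lines)
print(repr(one), rest)
```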
3. The split method: http://www.w3cschool.cc/python/att-string-split.html
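A few quick cases show how `split` behaves in the situations that matter for the scripts in this post. The sample strings are made up for illustration.

```python
line = "gpu_ipc =     123.45\n"

# No argument: split on any run of whitespace and drop the trailing newline
tokens = line.split()
print(tokens)        # ['gpu_ipc', '=', '123.45']

# With a separator, every occurrence splits the string
parts = "boy:it is 9:30".split(":")
print(parts)         # ['boy', 'it is 9', '30']

# maxsplit=1 keeps later colons inside the text
first_only = "boy:it is 9:30".split(":", 1)
print(first_only)    # ['boy', 'it is 9:30']

# An empty line gives [] without an argument, but [''] with a separator
empty_default = "".split()
empty_comma = "".split(",")
print(empty_default, empty_comma)  # [] ['']
```

The last case is why the extraction script checks that the token list is non-empty before reading its first element.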
II. Another simple example:
Given the following text file "record.txt":
    boy:what's your name?
    girl:my name is lebaishi,what about you?
    boy:my name is wahaha.
    girl:i like your name.
    ==============================================
    girl:how old are you?
    boy:I'm 16 years old,and you?
    girl:I'm 14.what is your favorite color?
    boy:My favorite is orange.
    girl:I like orange too!
    ==============================================
    boy:where do you come from?
    girl:I come from SH.
    boy:My home is not far from you,I live in Jiangsu province.
    girl:Let's be good friends.
    boy:OK!
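Requirement: split the data in record.txt and save it according to the following rules:
--save the boy's lines separately in boy_*.txt files (with the "boy:" prefix removed)
--save the girl's lines separately in girl_*.txt files (with the "girl:" prefix removed)
--the file contains three conversations in total, to be saved as boy_1.txt, girl_1.txt, boy_2.txt, girl_2.txt, boy_3.txt, girl_3.txt, six files in all (the conversations are already separated by "=======" lines).
Code: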
    def save_to_file(boy_log, girl_log, version):
        """Write one conversation to boy_<version>.txt and girl_<version>.txt."""
        filename_boy = 'boy_' + str(version) + ".txt"
        filename_girl = 'girl_' + str(version) + ".txt"
        with open(filename_boy, "w") as fb:
            fb.writelines(boy_log)
        with open(filename_girl, "w") as fg:
            fg.writelines(girl_log)

    def process(filename):
        boy_log = []
        girl_log = []
        version = 1
        with open(filename, "r") as file:
            for eachline in file:
                if eachline[:6] != "======":
                    # Split on the first colon only, so colons in the text survive
                    speaker, sep, text = eachline.partition(":")
                    if speaker == "boy":
                        boy_log.append(text)
                    elif speaker == "girl":
                        girl_log.append(text)
                else:
                    # Separator line: save this conversation and start the next
                    save_to_file(boy_log, girl_log, version)
                    version += 1
                    boy_log = []
                    girl_log = []

        # The last conversation has no separator after it
        save_to_file(boy_log, girl_log, version)

    if __name__ == "__main__":
        fn = "record.txt"
        process(fn)
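Separating the parsing from the file writing makes the splitting logic easy to check without touching the disk. The sketch below is one way to do that; the function name `split_dialogue` is hypothetical, and the sample is a shortened copy of record.txt.

```python
def split_dialogue(lines, separator="======"):
    """Return a list of (boy_lines, girl_lines) tuples, one per conversation."""
    segments = []
    boy, girl = [], []
    for line in lines:
        if line.startswith(separator):
            segments.append((boy, girl))
            boy, girl = [], []
            continue
        # Split on the first colon only; text after it is kept unchanged
        speaker, sep, text = line.partition(":")
        if speaker == "boy":
            boy.append(text)
        elif speaker == "girl":
            girl.append(text)
    segments.append((boy, girl))  # the last conversation has no trailing separator
    return segments


sample = [
    "boy:what's your name?\n",
    "girl:my name is lebaishi,what about you?\n",
    "==============\n",
    "girl:how old are you?\n",
    "boy:I'm 16 years old,and you?\n",
]
for number, (boy, girl) in enumerate(split_dialogue(sample), start=1):
    print(number, boy, girl)
```

Each tuple can then be passed to a writer such as `save_to_file` together with its number.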
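Both examples are very basic but also very practical, so I am recording them here for future reference.
One more simple requirement: I need to get the IPv4 address of eth0 on Linux. The code is as follows: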
    import os

    # Dump the ifconfig output to a file, then parse that file
    os.system("ifconfig > ip.info")

    def get_ip():
        in_eth0 = False  # True while reading the eth0 block
        with open("ip.info", 'r') as fs:
            for line in fs:
                a = line.split()
                if not a:
                    continue
                if a[0] == "eth0":
                    in_eth0 = True
                elif a[0] == "lo":
                    in_eth0 = False

                # Inside the eth0 block, look for "inet addr:<ip>"
                if in_eth0 and a[0] == "inet":
                    for item in a:
                        if item[0:5] == "addr:":
                            return item[5:]
        return None  # no eth0 address found

    ip = get_ip()
    print(ip)
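A hedged variant of the same idea: capture the command output with `subprocess.run` instead of a temporary file, and keep the parsing in a function that can be tried on a sample string. It assumes the same "inet addr:" layout that the code above looks for; an ifconfig that prints a different layout would need a different pattern. The names `parse_inet_addr` and `eth0_ip` and the sample text are made up for illustration, and `capture_output` needs Python 3.7 or later.

```python
import re
import subprocess


def parse_inet_addr(ifconfig_text, interface="eth0"):
    """Return the first 'inet addr:' value in the interface's block, or None."""
    in_block = False
    for line in ifconfig_text.splitlines():
        if line and not line[0].isspace():
            # A non-indented line starts a new interface block
            in_block = line.split()[0] == interface
        elif in_block:
            match = re.search(r"inet addr:(\S+)", line)
            if match:
                return match.group(1)
    return None


def eth0_ip():
    """Run ifconfig and parse its output (needs ifconfig on the PATH)."""
    result = subprocess.run(["ifconfig"], capture_output=True, text=True)
    return parse_inet_addr(result.stdout)


# Made-up sample in the "inet addr:" layout assumed above
SAMPLE = (
    "eth0      Link encap:Ethernet  HWaddr 00:11:22:33:44:55\n"
    "          inet addr:192.168.1.10  Bcast:192.168.1.255  Mask:255.255.255.0\n"
    "\n"
    "lo        Link encap:Local Loopback\n"
    "          inet addr:127.0.0.1  Mask:255.0.0.0\n"
)
print(parse_inet_addr(SAMPLE))        # prints: 192.168.1.10
print(parse_inet_addr(SAMPLE, "lo"))  # prints: 127.0.0.1
```

On a machine with ifconfig installed, `print(eth0_ip())` would run the full pipeline.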
Original source: http://blog.csdn.net/lavorange/article/details/41647091