PSSM Features: From Generation to Processing
Source: Internet. Editor: 程序博客网. Time: 2024/06/07 06:37
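All code below is my own original work. If you have questions, feel free to get in touch. Sina Weibo: 拾毅者
In this section:
- Generating the PSSM
- Simplifying the PSSM
- Building the standard PSSM
- Generating the sliding-window PSSM
In predictions based on protein sequences, using a PSSM (position-specific scoring matrix) can greatly improve prediction performance, and using a sliding-window PSSM improves it further. This post is mainly about sharing code; below I walk through the whole PSSM workflow, from generation to processing.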
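1. Generating the PSSM
There are several ways to generate a PSSM. Here I use the psiblast program, ncbi-blast-2.2.28+/bin/psiblast. Download: http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastNews#1 Usage: give it one sequence plus the relevant parameters, and it writes the PSSM file directly.
Code: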
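    import os

    # Command wrapper: run psiblast on the sequence file of one PDB chain
    # and write its PSSM file.
    def command_pssm(content, output_file, pssm_file):
        # content is the path of the query file. The trailing '&' runs psiblast
        # in the background, so this call returns before the PSSM is written.
        os.system('/ifs/share/lib/blast/ncbi-blast-2.2.28+/bin/psiblast \
                    -query %s \
                    -db /ifs/data/database/blast_data/nr \
                    -num_iterations 3 \
                    -out %s \
                    -out_ascii_pssm %s &' % (content, output_file, pssm_file))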
The function above wraps the command that gets executed; the actual driver code follows:
    # Run the PSSM scoring for every sequence in a multi-sequence file.
    def pssm(proseq, outdir):
        def run_chain(chain_name, content):
            # Write one record to outdir/fasta/<chain_name>, then launch psiblast.
            input_file = outdir + '/fasta/' + chain_name
            output_file = outdir + '/' + chain_name + '.out'
            pssm_file = outdir + '/' + chain_name + '.pssm'
            # The file must be closed (flushed) before psiblast reads it.
            with open(input_file, 'w') as temp_file:
                temp_file.write(content)
            command_pssm(input_file, output_file, pssm_file)

        content = ''
        chain_name = ''
        with open(proseq, 'r') as inputfile:
            for eachline in inputfile:
                if '>' in eachline:
                    if len(content):
                        run_chain(chain_name, content)
                    content = ''
                    # Characters 1-4 and 6 of the header line,
                    # e.g. '>1ABC:A' gives '1ABCA'.
                    chain_name = eachline[1:5] + eachline[6:7]
                content += eachline
        # Do not forget the last record in the file.
        if len(content):
            run_chain(chain_name, content)
Test case:
    '''
    # Generate the PSSM files with 3 iterations
    proseq = '/ifs/home/liudiwei/experiment/step2/data/protein.seq'
    outdir = '/ifs/home/liudiwei/experiment/step2/pssm'
    pssm(proseq, outdir)
    '''
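The same two steps, splitting the multi-sequence file into records and assembling the psiblast call, can be sketched with `subprocess` instead of a shell string. This is only a sketch under stated assumptions: the helper names `iter_fasta_records`, `build_psiblast_cmd`, and `run_psiblast` are hypothetical, and `psiblast` is assumed to be on the `PATH` (the post itself uses an absolute path on the author's cluster). The flags are the ones used in `command_pssm`.

```python
import subprocess

# Assumption: psiblast is on PATH; replace with the absolute path if it is not.
PSIBLAST = "psiblast"


def iter_fasta_records(lines):
    """Yield (header, record_text) for each '>' record in an iterable of lines."""
    header, chunk = None, []
    for line in lines:
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunk)
            header, chunk = line[1:].strip(), [line]
        elif header is not None:
            chunk.append(line)
    if header is not None:
        yield header, "".join(chunk)


def build_psiblast_cmd(query, db, out, pssm, iterations=3, exe=PSIBLAST):
    """Return the psiblast argument list (same flags as command_pssm)."""
    return [exe, "-query", query, "-db", db,
            "-num_iterations", str(iterations),
            "-out", out, "-out_ascii_pssm", pssm]


def run_psiblast(query, db, out, pssm):
    """Run psiblast and wait for it; raises CalledProcessError on failure."""
    subprocess.run(build_psiblast_cmd(query, db, out, pssm), check=True)
```

Unlike the `&` in the shell string, `subprocess.run` waits for each search to finish, so the PSSM file exists when the call returns. That is slower for many sequences but avoids starting hundreds of searches at once.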
Sample PSSM output: [figure not included]
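2. Simplifying the PSSM data
Usually we only need the first 20 score columns (one score per amino acid), which follow the position and residue columns.
The code below does this: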
    # Format one row of the raw PSSM file as tab-separated fields.
    def formateachline(eachline):
        col = eachline[0:5].strip()            # residue position
        col += '\t' + eachline[5:8].strip()    # residue letter
        begin = 9
        for i in range(20):
            # each score is read from a fixed 3-character field
            end = begin + 3
            col += '\t' + eachline[begin:end].strip()
            begin = end
        col += '\n'
        return col
Simplify the PSSM, keeping only the scores for the first 20 amino-acid columns:
    def simplifypssm(pssmdir, newdir):
        listfile = os.listdir(pssmdir)
        for eachfile in listfile:
            with open(pssmdir + '/' + eachfile, 'r') as inputpssm:
                with open(newdir + '/' + eachfile, 'w') as outfile:
                    count = 0
                    for eachline in inputpssm:
                        count += 1
                        if count <= 3:    # skip the three header lines
                            continue
                        if not len(eachline.strip()):    # a blank line ends the matrix
                            break
                        oneline = formateachline(eachline)
                        outfile.write(oneline)

    '''
    Test example
    pssmdir = '/ifs/home/liudiwei/experiment/step2/pssm/oldpssm'
    newdir = '/ifs/home/liudiwei/experiment/step2/pssm/newpssm'
    simplifypssm(pssmdir, newdir)
    '''
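The fixed character offsets in `formateachline` depend on the exact column widths of the PSSM file. As a hedged alternative, a row can also be split on whitespace. The function name `parse_pssm_row` and the sample row below are made up for illustration; the row is not real psiblast output. Splitting on whitespace only works while neighbouring score fields are separated by at least one space, which is an assumption about the file and not a guarantee.

```python
def parse_pssm_row(line, n_scores=20):
    """Split one matrix row into (position, residue, scores)."""
    tokens = line.split()
    if len(tokens) < 2 + n_scores:
        raise ValueError("not a PSSM matrix row: %r" % line)
    position = int(tokens[0])
    residue = tokens[1]
    # keep only the first n_scores integer columns after position and residue
    scores = [int(t) for t in tokens[2:2 + n_scores]]
    return position, residue, scores


# Made-up row for illustration only: position 1, residue M, 20 scores,
# followed by two extra columns that are ignored.
row = "    1 M  " + "".join("%4d" % v for v in range(-10, 10)) + "    0   0"
position, residue, scores = parse_pssm_row(row)
```

If a file can contain scores that fill their whole field, so that two numbers touch, the fixed-width slicing is the safer choice.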
3. Obtaining the standard PSSM
Starting from the PSSM extracted above, the code below produces a sliding-window PSSM.
    # Standard PSSM: slide a window directly over the simplified PSSM rows.
    def standardPSSM(window_size, pssmdir, outdir):
        half = window_size // 2    # integer division, also valid on Python 3
        for eachfile in os.listdir(pssmdir):
            with open(pssmdir + '/' + eachfile, 'r') as inputf:
                rows = [line.split('\t') for line in inputf if line.strip()]
            with open(outdir + '/' + eachfile, 'w') as outfile:
                for linenum in range(len(rows)):
                    # position and residue letter of the central row
                    content = [rows[linenum][0].strip(), rows[linenum][1].strip()]
                    # rows linenum-half .. linenum-half+window_size-1;
                    # rows that fall outside the file contribute 20 zeros
                    for k in range(linenum - half, linenum - half + window_size):
                        if 0 <= k < len(rows):
                            content += [v.strip() for v in rows[k][2:]]
                        else:
                            content += ['0'] * 20
                    outfile.write('\t'.join(content) + '\n')

    '''
    Test example
    pssmdir = '/ifs/home/liudiwei/experiment/step2/pssm/newpssm'
    newdir = '/ifs/home/liudiwei/experiment/step2/pssm/standardpssm'
    window_size = 5
    standardPSSM(window_size, pssmdir, newdir)
    '''
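The windowing that `standardPSSM` performs is easier to see on a toy matrix held in memory. The sketch below uses a hypothetical helper, `window_concat`, and rows with 2 scores instead of 20: for each row it concatenates the rows inside a centered window and fills positions beyond either end of the sequence with zeros.

```python
def window_concat(scores, window_size):
    """Concatenate the rows inside a centered window, zero-padding the ends."""
    half = window_size // 2
    width = len(scores[0]) if scores else 0
    result = []
    for i in range(len(scores)):
        features = []
        for k in range(i - half, i - half + window_size):
            if 0 <= k < len(scores):
                features.extend(scores[k])
            else:
                features.extend([0] * width)   # outside the sequence
        result.append(features)
    return result


toy = [[1, 2], [3, 4], [5, 6]]      # 3 residues, 2 scores each
windows = window_concat(toy, 3)
# windows[0] == [0, 0, 1, 2, 3, 4]
# windows[1] == [1, 2, 3, 4, 5, 6]
# windows[2] == [3, 4, 5, 6, 0, 0]
```

Every output row has `window_size * width` values, so with the real 20-column PSSM and a window of 5 each residue gets 100 features.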
4. Computing the sliding PSSM from the sliding window
    # For a given window size, compute the windowed sum of the 20 amino-acid scores.
    def computedPSSM(window_size, pssmdir, outdir):
        half = window_size // 2    # integer division, also valid on Python 3
        for eachfile in os.listdir(pssmdir):
            with open(pssmdir + '/' + eachfile, 'r') as inputf:
                rows = [line.split('\t') for line in inputf if line.strip()]
            scores = [[int(v) for v in row[2:]] for row in rows]
            with open(outdir + '/' + eachfile, 'w') as outfile:
                for linenum in range(len(rows)):
                    # the window is clipped at both ends of the sequence
                    lo = max(0, linenum - half)
                    hi = min(len(rows), linenum - half + window_size)
                    # column-wise sum over the rows inside the window
                    second = [sum(col) for col in zip(*scores[lo:hi])]
                    content = [rows[linenum][0].strip(), rows[linenum][1].strip()]
                    content += [str(v) for v in second]
                    outfile.write('\t'.join(content) + '\n')

    '''
    pssmdir = '/ifs/home/liudiwei/experiment/step2/pssm/newpssm'
    newdir = '/ifs/home/liudiwei/experiment/step2/pssm/computedpssm'
    window_size = 5
    computedPSSM(window_size, pssmdir, newdir)
    '''
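What `computedPSSM` calculates can be checked on a small in-memory example. The helper `window_sum` below is a hypothetical name and the toy rows have 2 scores instead of 20: each output row is the column-wise sum of the rows inside the window, and rows beyond either end of the sequence simply do not contribute.

```python
def window_sum(scores, window_size):
    """Column-wise sum of the rows inside a centered, clipped window."""
    half = window_size // 2
    result = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i - half + window_size)
        result.append([sum(col) for col in zip(*scores[lo:hi])])
    return result


toy = [[1, 2], [3, 4], [5, 6]]      # 3 residues, 2 scores each
summed = window_sum(toy, 3)
# summed == [[4, 6], [9, 12], [8, 10]]
```

The number of columns stays the same as in the input; only the values change.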
The smoothed PSSM differs only in pssmdir, so it simply calls standardPSSM:
    def smoothedPSSM(window_size, pssmdir, outdir):
        standardPSSM(window_size, pssmdir, outdir)

    '''
    Test example
    pssmdir = '/ifs/home/liudiwei/experiment/step2/pssm/computedpssm'
    newdir = '/ifs/home/liudiwei/experiment/step2/pssm/smoothedpssm'
    window_size = 5
    smoothedPSSM(window_size, pssmdir, newdir)
    '''
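The final result is a sliding-window PSSM matrix whose feature dimension grows or shrinks with the window size: each residue gets 20 × window_size score columns.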
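To use the result as features, the windowed file can be read back into one vector per residue. The loader below is a minimal sketch: `load_window_features` is a hypothetical name, and it assumes the tab-separated layout written by standardPSSM (position, residue letter, then the scores). It also checks that every row has the expected 20 × window_size values.

```python
def load_window_features(path, window_size, n_scores=20):
    """Read a windowed PSSM file into (position, residue, vector) tuples."""
    expected = n_scores * window_size
    rows = []
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            vector = [int(v) for v in fields[2:]]
            if len(vector) != expected:
                raise ValueError("expected %d scores, found %d"
                                 % (expected, len(vector)))
            rows.append((int(fields[0]), fields[1], vector))
    return rows
```

With window_size = 5, each returned vector has 100 entries.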