A small BLAST script based on NCBI

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 03:09

This is my first attempt at writing a blog post in markdown.
It feels quite fresh.

Task

First, my boss gave me a task:
he handed me a batch of sequences to align and asked me to find the ones most closely related to cilia.

The sequences are in FASTA format.

Solution 1

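  • Manual "high throughput"
    That is, upload the sequences to the NCBI website and run the alignment there.
    [Image: fasta]
    Then open the results one by one; on a slow connection this is maddening.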

Solution 2

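  • Local BLAST
    This needs a decent computer, and my own SSD is only 240 GB in total. I could have used our server, but it had run out of space at the time, so I would have had to ask the administrator. I couldn't find him, and he didn't reply to my email either. So I went with a third approach.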

Solution 3

  • Step 1

    I used the code from this blog post directly:
    http://www.yelinsky.com/blog/archives/298.html

    import os
    import time

    from Bio import SeqIO
    from Bio.Blast import NCBIWWW

    os.chdir("c:/Users/**/Documents")

    SeqNumber = 0
    for record in SeqIO.parse("cilia.seq", "fasta"):
        SeqNumber += 1
        try:
            # Submit one sequence to NCBI's online BLAST and save the raw XML.
            result_handle = NCBIWWW.qblast("blastn", "nr", record.seq)
            with open('xml\\' + str(SeqNumber) + '.xml', 'w') as save_file:
                save_file.write(result_handle.read())
            print(SeqNumber, ' OK!')
        except Exception:
            print(SeqNumber, ' Error! Will try again later!')
            time.sleep(600)
            # NOTE: the for loop still moves on to the next record after this,
            # so only the number is reused; the failed sequence is not re-submitted.
            SeqNumber -= 1
    print("Done!")

The output looks like this:
1 OK!
2 OK!
3 OK!
4 OK!
5 OK!
6 OK!
7 OK!
8 OK!
9 OK!
10 OK!
11 OK!
12 OK!
13 OK!
14 OK!
15 Error! Will try again later!
15 OK!
16 OK!
17 OK!
18 OK!
19 OK!
20 OK!
21 OK!
22 OK!
23 OK!
24 OK!
25 OK!
26 OK!
27 OK!
28 OK!
29 OK!
30 OK!
31 Error! Will try again later!
31 Error! Will try again later!
31 OK!
32 OK!
33 OK!
34 OK!
35 OK!
36 OK!
Done!
Not bad.
But there are a few problems:
1. The resulting files are
['1.xml', '10.xml', '11.xml', '12.xml', '13.xml', '14.xml', '15.xml', '16.xml', '17.xml', '18.xml', '19.xml', '2.xml', '20.xml', '21.xml', '22.xml', '23.xml', '24.xml', '25.xml', '26.xml', '27.xml', '28.xml', '29.xml', '3.xml', '30.xml', '31.xml', '32.xml', '33.xml', '34.xml', '35.xml', '36.xml', '37.xml', '38.xml', '4.xml', '5.xml', '6.xml', '7.xml', '8.xml', '9.xml']
that is, the file names are just numbers.
I would like to name each file after the label of its sequence in the fasta file, such as
>NODE_567_length_6110_cov_150.381.g1148. How can I do that?

2. The number of xml files does not seem to match the number of sequences in the fasta file. Why is that?
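
A possible way to approach both questions, as a rough sketch rather than a tested fix. One likely cause of the mismatch is visible in the script above: when a request fails, the for loop still advances to the next record, so the failed sequence is never submitted again and its number is taken over by the following one. The sketch below retries the same query inside a loop, and derives the file name from the record id. The helper names `safe_name` and `fetch_with_retry`, the character whitelist, and the retry limit are my own assumptions, not part of the original script.

```python
import re
import time


def safe_name(record_id):
    """Turn a FASTA record id into a string that is safe to use as a file name."""
    # Keep letters, digits, dot, dash and underscore; replace everything else.
    return re.sub(r"[^A-Za-z0-9._-]", "_", record_id)


def fetch_with_retry(fetch, query, max_tries=3, wait_seconds=600):
    """Call fetch(query) until it succeeds or max_tries attempts have failed.

    The retry happens here, on the same query, so a failed sequence
    is not silently replaced by the next one.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return fetch(query)
        except Exception as err:
            print("attempt", attempt, "failed:", err)
            if attempt == max_tries:
                raise  # give up and let the caller decide what to do
            time.sleep(wait_seconds)


# Hypothetical usage with Biopython (needs network access, not run here):
#   import os
#   from Bio import SeqIO
#   from Bio.Blast import NCBIWWW
#   for record in SeqIO.parse("cilia.seq", "fasta"):
#       handle = fetch_with_retry(
#           lambda seq: NCBIWWW.qblast("blastn", "nr", seq), record.seq)
#       with open(os.path.join("xml", safe_name(record.id) + ".xml"), "w") as out:
#           out.write(handle.read())
```

Passing the BLAST call in as a function keeps the retry logic independent of the network, which also makes it easy to try out with a fake function first.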

  • Step 2
    For reading the xml files I referred to this blog post:
    http://blog.csdn.net/hm11290219/article/details/52325690
    The rest of the code I wrote after reading the ElementTree documentation.

    # -*- coding: UTF-8 -*-
    from xml.etree import ElementTree as ET
    import os

    os.chdir("c:/Users/**/Documents/ift")

    # Collect the BLAST result files in the current directory.
    xml_files = sorted(f for f in os.listdir("./") if f.endswith(".xml"))
    print(xml_files)

    def hit():
        for name in xml_files:
            # per = ET.parse("30.xml")
            per = ET.parse(name)
            root = per.getroot()
            # Each <Hit_def> element holds the description line of one hit.
            for x in root.findall(".//Hit_def"):
                print(name, x.text)

    if __name__ == "__main__":
        hit()

This gives:
1.xml Paramecium tetraurelia hypothetical protein (GSPATT00038674001) partial mRNA
1.xml Paramecium tetraurelia hypothetical protein (GSPATT00029951001) partial mRNA
1.xml Tetrahymena thermophila SB210 dynein heavy chain 2, putative partial mRNA
1.xml PREDICTED: Nicrophorus vespilloides dynein heavy chain 2, axonemal (LOC108559906), mRNA
1.xml Trichomonas vaginalis G3 Dynein heavy chain family protein (TVAG_162980) partial mRNA
1.xml PREDICTED: Jaculus jaculus dynein, axonemal, heavy chain 3 (Dnah3), mRNA
1.xml Paramecium tetraurelia hypothetical protein (GSPATT00004520001) partial mRNA
1.xml PREDICTED: Ciona intestinalis dynein heavy chain 3, axonemal-like (LOC100178770), mRNA
1.xml Paramecium tetraurelia hypothetical protein (GSPATT00019219001) partial mRNA
1.xml PREDICTED: Mus musculus dynein, axonemal, heavy chain 3 (Dnah3), mRNA
1.xml PREDICTED: Nestor notabilis dynein, axonemal, heavy chain 3 (DNAH3), partial mRNA
1.xml PREDICTED: Picoides pubescens dynein, axonemal, heavy chain 3 (DNAH3), mRNA
1.xml PREDICTED: Parus major dynein axonemal heavy chain 3 (DNAH3), mRNA
1.xml PREDICTED: Python bivittatus dynein axonemal heavy chain 3 (DNAH3), mRNA
1.xml PREDICTED: Sturnus vulgaris dynein, axonemal, heavy chain 3 (DNAH3), mRNA
1.xml PREDICTED: Tauraco erythrolophus dynein, axonemal, heavy chain 3 (DNAH3), mRNA
1.xml PREDICTED: Merops nubicus dynein, axonemal, heavy chain 3 (DNAH3), mRNA
1.xml M.musculus mRNA for axonemal dynein heavy chain (partial, ID mdhc8)
1.xml PREDICTED: Fulmarus glacialis dynein, axonemal, heavy chain 3 (DNAH3), partial mRNA
1.xml PREDICTED: Phalacrocorax carbo dynein, axonemal, heavy chain 3 (DNAH3), partial mRNA
1.xml PREDICTED: Crocodylus porosus dynein axonemal heavy chain 3 (DNAH3), mRNA
1.xml PREDICTED: Lepidothrix coronata dynein axonemal heavy chain 3 (DNAH3), mRNA
1.xml PREDICTED: Pseudopodoces humilis dynein, axonemal, heavy chain 3 (DNAH3), mRNA
1.xml PREDICTED: Gallus gallus dynein, axonemal, heavy chain 3 (DNAH3), mRNA
1.xml PREDICTED: Loxodonta africana dynein, axonemal, heavy chain 3 (DNAH3), mRNA
1.xml PREDICTED: Caprimulgus carolinensis dynein heavy chain 3, axonemal-like (LOC104529236), partial mRNA
1.xml Ichthyophthirius multifiliis hypothetical protein (IMG5_109350) mRNA, partial cds
1.xml Physcomitrella patens subsp. patens predicted protein (PHYPADRAFT_83783) mRNA, complete cds
1.xml PREDICTED: Gavialis gangeticus dynein axonemal heavy chain 3 (DNAH3), mRNA
1.xml PREDICTED: Aptenodytes forsteri dynein axonemal heavy chain 3 (DNAH3), mRNA
1.xml PREDICTED: Ochotona princeps dynein, axonemal, heavy chain 3 (DNAH3), mRNA
1.xml PREDICTED: Pygoscelis adeliae dynein, axonemal, heavy chain 3 (DNAH3), mRNA
1.xml PREDICTED: Calidris pugnax dynein, axonemal, heavy chain 3 (DNAH3), mRNA
1.xml PREDICTED: Pterocles gutturalis dynein, axonemal, heavy chain 3 (DNAH3), mRNA
1.xml PREDICTED: Gavia stellata dynein, axonemal, heavy chain 3 (DNAH3), mRNA
1.xml PREDICTED: Melopsittacus undulatus dynein, axonemal, heavy chain 3 (LOC101880524), mRNA
1.xml PREDICTED: Phaethon lepturus dynein, axonemal, heavy chain 3 (DNAH3), mRNA
1.xml PREDICTED: Apis mellifera dynein heavy chain 7, axonemal-like (LOC725593), mRNA
1.xml PREDICTED: Tupaia chinensis dynein, axonemal, heavy chain 3 (DNAH3), mRNA
1.xml PREDICTED: Tinamus guttatus dynein, axonemal, heavy chain 7 (DNAH7), mRNA
1.xml PREDICTED: Anoplophora glabripennis dynein heavy chain 2, axonemal (LOC108913077), mRNA
1.xml PREDICTED: Trachymyrmex septentrionalis dynein heavy chain 1, axonemal-like (LOC108746866), transcript variant X3, mRNA
1.xml PREDICTED: Trachymyrmex septentrionalis dynein heavy chain 1, axonemal-like (LOC108746866), transcript variant X2, mRNA
1.xml PREDICTED: Trachymyrmex septentrionalis dynein heavy chain 1, axonemal-like (LOC108746866), transcript variant X1, mRNA
1.xml PREDICTED: Corvus brachyrhynchos dynein axonemal heavy chain 3 (DNAH3), transcript variant X2, mRNA
1.xml PREDICTED: Corvus brachyrhynchos dynein axonemal heavy chain 3 (DNAH3), transcript variant X1, mRNA
1.xml PREDICTED: Apis cerana dynein heavy chain 7, axonemal-like (LOC107999697), mRNA
1.xml PREDICTED: Pelodiscus sinensis dynein, axonemal, heavy chain 3 (DNAH3), mRNA
1.xml PREDICTED: Halyomorpha halys dynein heavy chain 3, axonemal-like (LOC106689148), transcript variant X3, mRNA
1.xml PREDICTED: Halyomorpha halys dynein heavy chain 3, axonemal-like (LOC106689148), transcript variant X2, mRNA
10.xml PREDICTED: Amphimedon queenslandica dynein heavy chain 1, axonemal (LOC100635199), mRNA
10.xml Guillardia theta CCMP2712 hypothetical protein (GUITHDRAFT_102137) mRNA, complete cds
10.xml PREDICTED: Acanthisitta chloris dynein, axonemal, heavy chain 3 (DNAH3), mRNA
10.xml Marsilea vestita axonemal inner arm dynein heavy chain 2 (DHC) mRNA, complete cds
10.xml PREDICTED: Calidris pugnax dynein, axonemal, heavy chain 3 (DNAH3), mRNA
10.xml PREDICTED: Pterocles gutturalis dynein, axonemal, heavy chain 3 (DNAH3), mRNA
10.xml PREDICTED: Lepidothrix coronata dynein axonemal heavy chain 3 (DNAH3), mRNA
10.xml PREDICTED: Poecilia reticulata dynein axonemal heavy chain 3 (dnah3), transcript variant X2, mRNA
10.xml PREDICTED: Poecilia formosa dynein axonemal heavy chain 3 (dnah3), mRNA
10.xml PREDICTED: Poecilia mexicana dynein, axonemal, heavy chain 3 (dnah3), mRNA
10.xml PREDICTED: Ochotona princeps dynein, axonemal, heavy chain 7 (DNAH7), mRNA
10.xml PREDICTED: Caprimulgus carolinensis dynein heavy chain 3, axonemal-like (LOC104529236), partial mRNA
10.xml PREDICTED: Picoides pubescens dynein, axonemal, heavy chain 3 (DNAH3), mRNA
10.xml PREDICTED: Poecilia reticulata dynein axonemal heavy chain 3 (dnah3), transcript variant X1, mRNA
10.xml PREDICTED: Sinocyclocheilus rhinocerous dynein axonemal heavy chain 1 (dnah1), mRNA
10.xml Eimeria maxima Dynein heavy chain 3, axonemal, related partial mRNA
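
To get from these hit lists back to the original question of which sequences are most related to cilia, one rough option is to count, for each query file, how many hit descriptions mention a cilia-related term. The sketch below does only that. The keyword list is my own guess and would need adjusting by someone who knows the biology, and `count_cilia_hits` is a hypothetical helper name.

```python
from collections import Counter

# Assumed keyword stems; adjust to whatever should count as "cilia-related".
CILIA_KEYWORDS = ("cilia", "ciliary", "axonem", "dynein", "flagell", "intraflagellar")


def count_cilia_hits(hits, keywords=CILIA_KEYWORDS):
    """Count, per query file, how many hit descriptions mention a keyword.

    hits is an iterable of (file_name, hit_description) pairs, i.e. the
    same two values the script above prints on each line.
    """
    counts = Counter()
    for file_name, description in hits:
        text = description.lower()
        if any(word in text for word in keywords):
            counts[file_name] += 1
    return counts


example = [
    ("1.xml", "Tetrahymena thermophila SB210 dynein heavy chain 2, putative partial mRNA"),
    ("1.xml", "Paramecium tetraurelia hypothetical protein (GSPATT00038674001) partial mRNA"),
    ("10.xml", "Marsilea vestita axonemal inner arm dynein heavy chain 2 (DHC) mRNA, complete cds"),
]
# most_common() lists the files with the most keyword matches first.
print(count_cilia_hits(example).most_common())
```

A plain count like this ignores how good each alignment is, so it is better treated as a first filter than as a final ranking.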

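There are still a few things left to do:
1. In the end the results need to be written to an Excel file and sorted automatically, to get a clean final result.
2. Package all of the above as an exe so that beginners can run it easily.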
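
For the first item, a minimal sketch of one way it could go. It writes a CSV file instead of a real Excel workbook, on the assumption that opening a CSV in Excel is good enough here. It relies on the `Hit`, `Hit_def` and `Hsp_evalue` elements of NCBI's BLAST XML output. The function names and the choice to sort by query name and then by e-value are my own.

```python
import csv
from xml.etree import ElementTree as ET


def read_hits(xml_source, query_name):
    """Return one row per hit: (query name, hit description, best e-value)."""
    root = ET.parse(xml_source).getroot()
    rows = []
    for hit in root.iter("Hit"):
        description = hit.findtext("Hit_def")
        # A hit can contain several HSPs; keep the smallest e-value.
        evalues = [float(e.text) for e in hit.iter("Hsp_evalue")]
        if evalues:
            rows.append((query_name, description, min(evalues)))
    return rows


def write_sorted_csv(rows, csv_path):
    """Write rows to csv_path, sorted by query name and then by e-value."""
    ordered = sorted(rows, key=lambda row: (row[0], row[2]))
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["query", "hit_def", "evalue"])
        writer.writerows(ordered)
    return ordered
```

Calling `read_hits` once per xml file, concatenating the rows, and passing them to `write_sorted_csv` would give a single table with the strongest hit of each query at the top of its group.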
