Merging Two Files

Source: Internet. Editor: 程序博客网. Time: 2024/05/17 06:47

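There are two files, file one and file two, and both contain a column holding the same values. Using that column as a key, the records of the two files can be joined.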
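The idea of joining on a shared column can be sketched in a few lines. This is only an illustration, not the script used in this post: `join_on_key` and the toy rows are hypothetical names and data made up for the example.

```python
def join_on_key(left_rows, right_rows, left_key=0, right_key=0):
    """Join two lists of rows (lists of strings) on a shared column."""
    index = {}
    for row in left_rows:
        # Keep only the first row seen for each key.
        index.setdefault(row[left_key], row)

    joined = []
    for row in right_rows:
        match = index.get(row[right_key])
        if match is not None:
            # Append every column of the matching row except the key itself.
            extra = [v for i, v in enumerate(match) if i != left_key]
            joined.append(row + extra)
    return joined


# Toy data (hypothetical): the key is column 0 on the left, column 1 on the right.
left = [["geneA", "kinase"], ["geneB", "transporter"]]
right = [["chr1", "geneA"], ["chr2", "geneC"]]
result = join_on_key(left, right, left_key=0, right_key=1)
print(result)
```

Building a dictionary from one file first keeps the lookup for each row of the other file at constant time, which matters when both files are large.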

File one has the following format:

AEGTA04333 LOC_Os01g33520|LOC_Os01g33520.1 2.70E-25 LOC_Os01g33520|LOC_Os01g33520.1 PF02902 Ulp1 protease family, C-terminal catalytic domain0
Correction_xiaoyan_191014 Bradi5g13350|Bradi5g13350.2 0.00E+00 Bradi5g13350|Bradi5g13350.2 PF01179 Copper amine oxidase, enzyme domain0 1.4.3.21 Primary-amine oxidase. primary-amine oxidase [EC:1.4.3.21] K00276
gi|1027548131|dbj|IAAQ01032233.1| LOC_Os08g23710|LOC_Os08g23710.1 7.40E-96 LOC_Os08g23710|LOC_Os08g23710.1 PF01248 Ribosomal protein L7Ae/L30e/S12e/Gadd45 family0 60S ribosomal protein L7A KOG3166 large subunit ribosomal protein L7Ae K02936
TRIUR3_19210 ChrSy.fgenesh.gene.7|ChrSy.fgenesh.mRNA.7 8.40E-140 ChrSy.fgenesh.gene.7|ChrSy.fgenesh.mRNA.7 PF07727 Reverse transcriptase (RNA-dependent DNA polymerase)0 2.7.7.49 RNA-directed DNA polymerase.
gb|GEDQ01040680.1| Bradi3g35480|Bradi3g35480.1 9.30E-28 Bradi3g35480|Bradi3g35480.1 0
gi|1027446095|dbj|IAAV01016872.1| Bradi3g26410|Bradi3g26410.1 4.90E-305 Bradi3g26410|Bradi3g26410.1 PF03372 Endonuclease/Exonuclease/phosphatase family0 Glucose-repressible alcohol dehydrogenase transcriptional effector CCR4 and related proteins 3.1.13.4 Poly(A)-specific ribonuclease. KOG0620 CCR4-NOT transcription complex subunit 6 [EC:3.1.-.-] K12603
Traes_1BL_164A89B5B.3 Bradi4g11400|Bradi4g11400.3 1.10E-106 Bradi4g11400|Bradi4g11400.3 PF00106 short chain dehydrogenase0 Reductases with broad range of substrate specificities 1.1.1.206 Tropinone reductase I. KOG0725 tropine dehydrogenase [EC:1.1.1.206] K08081
Traes_5BL_39FF2F90F.1 GRMZM2G121189|GRMZM2G121189_T01 3.10E-32 GRMZM2G121189|GRMZM2G121189_T01 Copper chaperone KOG1603
gi|1027460211|dbj|IAAV01007594.1| Bradi5g23600|Bradi5g23600.1 0.00E+00 Bradi5g23600|Bradi5g23600.1 PF00664 ABC transporter transmembrane region0 3.6.3.44 Xenobiotic-transporting ATPase. ATP-binding cassette, subfamily B (MDR/TAP), member 1 K05658
gi|359633748|gb|JP900344.1| Bradi2g05030|Bradi2g05030.1 5.50E-46 Bradi2g05030|Bradi2g05030.1 PF10513 Enhancer of polycomb-like0 2.1.1.43 Histone-lysine N-methyltransferase.

File two is a GFF3 file (its nine columns are tab-separated in the actual file):

chr1A IWGSCv1.0_gmap gene 11740 12074 . + . ID=TRIAE_CS42_1AS_TGACv1_023354_AA0082670.1.path1;Name=TRIAE_CS42_1AS_TGACv1_023354_AA0082670.1
chr1A IWGSCv1.0_gmap mRNA 11740 12074 . + . ID=TRIAE_CS42_1AS_TGACv1_023354_AA0082670.1.mrna1;Name=TRIAE_CS42_1AS_TGACv1_023354_AA0082670.1;Parent=TRIAE_CS42_1AS_TGACv1_023354_AA0082670.1.path1;coverage=100.0;identity=100.0;matches=335;mismatches=0;indels=0;unknowns=0
chr1A IWGSCv1.0_gmap exon 11740 12074 100 + . ID=TRIAE_CS42_1AS_TGACv1_023354_AA0082670.1.mrna1.exon1;Parent=TRIAE_CS42_1AS_TGACv1_023354_AA0082670.1.mrna1;Parent=TRIAE_CS42_1AS_TGACv1_023354_AA0082670.1.mrna1
chr1A IWGSCv1.0_gmap gene 22427 24851 . - . ID=TRIAE_CS42_1AS_TGACv1_024449_AA0082770.1.path1;Name=TRIAE_CS42_1AS_TGACv1_024449_AA0082770.1
chr1A IWGSCv1.0_gmap mRNA 22427 24851 . - . ID=TRIAE_CS42_1AS_TGACv1_024449_AA0082770.1.mrna1;Name=TRIAE_CS42_1AS_TGACv1_024449_AA0082770.1;Parent=TRIAE_CS42_1AS_TGACv1_024449_AA0082770.1.path1;coverage=100.0;identity=100.0;matches=2425;mismatches=0;indels=0;unknowns=0
chr1A IWGSCv1.0_gmap exon 22427 24851 100 - . ID=TRIAE_CS42_1AS_TGACv1_024449_AA0082770.1.mrna1.exon1;Parent=TRIAE_CS42_1AS_TGACv1_024449_AA0082770.1.mrna1;Parent=TRIAE_CS42_1AS_TGACv1_024449_AA0082770.1.mrna1
chr1A IWGSCv1.0_gmap gene 28794 39054 . + . ID=TRIAE_CS42_1AS_TGACv1_021338_AA0081570.1.path1;Name=TRIAE_CS42_1AS_TGACv1_021338_AA0081570.1
chr1A IWGSCv1.0_gmap mRNA 28794 39054 . + . ID=TRIAE_CS42_1AS_TGACv1_021338_AA0081570.1.mrna1;Name=TRIAE_CS42_1AS_TGACv1_021338_AA0081570.1;Parent=TRIAE_CS42_1AS_TGACv1_021338_AA0081570.1.path1;coverage=100.0;identity=100.0;matches=1624;mismatches=0;indels=0;unknowns=0
chr1A IWGSCv1.0_gmap exon 28794 28929 100 + . ID=TRIAE_CS42_1AS_TGACv1_021338_AA0081570.1.mrna1.exon1;Parent=TRIAE_CS42_1AS_TGACv1_021338_AA0081570.1.mrna1;Parent=TRIAE_CS42_1AS_TGACv1_021338_AA0081570.1.mrna1
chr1A IWGSCv1.0_gmap exon 37567 39054 100 + . ID=TRIAE_CS42_1AS_TGACv1_021338_AA0081570.1.mrna1.exon2;Parent=TRIAE_CS42_1AS_TGACv1_021338_AA0081570.1.mrna1;Parent=TRIAE_CS42_1AS_TGACv1_021338_AA0081570.1.mrna1
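In the mRNA records above, `Name` happens to be the second attribute in column 9. A slightly more defensive way to read it is to look the attribute up by key instead of by position. The following is a minimal sketch under that assumption: `parse_attributes` is a hypothetical helper, and the tab-separated line is rebuilt by hand from the first mRNA record (shortened to its first four attributes).

```python
def parse_attributes(column9):
    """Turn a GFF3 attribute string such as 'ID=x;Name=y' into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" not in field:
            continue  # ignore empty or malformed fields
        key, value = field.split("=", 1)
        # Keep the first value if a key is repeated (e.g. the doubled Parent).
        attrs.setdefault(key, value)
    return attrs


# One mRNA record from the sample, with real tabs between the nine columns.
line = "\t".join([
    "chr1A", "IWGSCv1.0_gmap", "mRNA", "11740", "12074", ".", "+", ".",
    "ID=TRIAE_CS42_1AS_TGACv1_023354_AA0082670.1.mrna1;"
    "Name=TRIAE_CS42_1AS_TGACv1_023354_AA0082670.1;"
    "Parent=TRIAE_CS42_1AS_TGACv1_023354_AA0082670.1.path1;"
    "coverage=100.0",
])

attrs = parse_attributes(line.split("\t")[8])
name = attrs["Name"]
print(name)
```

Looking up `Name` by key keeps working if the attribute order changes, whereas a positional split would silently pick up a different field.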

Code

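#!/usr/bin/env python3
# -*- coding: utf-8 -*-
__author__ = 'Sheng-Wei Ma'
__author_email__ = 'shengweima@icloud.com'

# Load file one: the first column is the key, the remaining columns are the annotation.
dic = {}
with open('2.txt', 'r') as f:
    for line in f:
        lin = line.strip().split()
        # Skip blank lines; keep only the first record seen for each key.
        if lin and lin[0] not in dic:
            dic[lin[0]] = lin[1:]

# Walk the GFF3 file and append the annotation to matching mRNA lines.
with open('RNA_Seq.trans.sorted.gff3', 'r') as f2:
    for line in f2:
        lin = line.strip().split('\t')
        # Header/comment lines have fewer than 9 columns; print them unchanged.
        if len(lin) >= 9 and lin[2] == 'mRNA':
            # The Name attribute is assumed to be the second field of column 9.
            name = lin[8].split(";")[1].split('=')[1]
            # mRNA lines without a match in file one are not printed.
            if name in dic:
                print(line.strip() + ";" + str(dic[name]))
        else:
            print(line, end='')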
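The script appends the annotation as the text of a Python list, which is not a `key=value` attribute. One possible variant, shown here only as a sketch, writes it as a GFF3 `Note` attribute and percent-encodes characters such as `,`, `;` and `=`, which have a special meaning in column 9. The function name `annotate_mrna` and the toy identifiers are hypothetical. Unlike the script, this version returns unmatched mRNA lines unchanged instead of dropping them.

```python
from urllib.parse import quote


def annotate_mrna(line, annotations):
    """Return the GFF3 line, with a Note attribute added to matching mRNA rows."""
    text = line.rstrip("\n")
    fields = text.split("\t")
    # Comment lines and non-mRNA features pass through untouched.
    if len(fields) != 9 or fields[2] != "mRNA":
        return text
    attrs = dict(f.split("=", 1) for f in fields[8].split(";") if "=" in f)
    values = annotations.get(attrs.get("Name"))
    if values is None:
        return text  # assumption: keep mRNA rows that have no annotation
    # Percent-encode everything except a small set of harmless characters.
    note = quote(" ".join(values), safe=" |.:()[]/-")
    fields[8] += ";Note=" + note
    return "\t".join(fields)


# Toy input (hypothetical identifiers, annotation text taken from file one).
annotations = {
    "T1": ["LOC_Os01g33520|LOC_Os01g33520.1", "2.70E-25",
           "Ulp1 protease family, C-terminal"],
}
line = "chr1A\tsrc\tmRNA\t1\t10\t.\t+\t.\tID=T1.mrna1;Name=T1\n"
out = annotate_mrna(line, annotations)
print(out)
```

Whether unmatched mRNA rows should be kept or dropped depends on what the merged file is used for, so that choice is left explicit in the code.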