生物信息-002-人类基因组本地化及简单分析

来源：互联网发布：数据库笛卡尔积图解编辑：程序博客网时间：2024/05/20 18:54

在NCBI上下载 GRCh38

wget ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.fna.gz

解压文件（.fasta, .fa, .fna, .fsa, .mpfa）

gzip  -d GRCh38_latest_genomic.fna.gz#人的h38基因组是3G的大小，一个英文字符是一个字节，所以30亿bp的碱基就是3G左右

head GRCh38_latest_genomic.fna

这里写图片描述
查看该文件可以看到，里面有很多的N，这是基因组里面未知的序列，用N占位，但是觉得部分都是A.T.C.G这样的字符，大小写都有，分别代表不同的意思

统计了一下里面这个文件的行数

time wc -l GRCh38_latest_genomic.fna

这里写图片描述

用awk统计行数（效率相比wc –l 慢）

time awk 'END { print NR }' GRCh38_latest_genomic.fna

这里写图片描述

看一下标题行

grep '>' GRCh38_latest_genomic.fna | sed -n 'p'grep '>' GRCh38_latest_genomic.fna | sed -n 'p' >> list.txt

统计每个标题下基因片段的长度，提取标题和长度写入一个新文件

time python GECh38_title_length.py

fasta_file=open('/home/sunchengquan/GRCh38_latest_genomic.fna','r')out_file = open('GRCh38_title_length.txt','w')seq = ''i = 0for line in fasta_file:    if line[0] == '>' and seq == '':        header = line.strip()    elif line[0] != '>':        seq =seq + line.strip()    elif line[0] == '>' and seq != '':        num = len(seq)        out_file.write(header +'\n'+ str(num)+ '\n')        i += 1        print('writing:',i)        seq = ''        header = line.strip() out_file.close()

看一下GRCh38_title_length.txt里面的内容

这里写图片描述

提取标题行，添加到列表，并打印

time python GECh38_title.py

input_file=open("/home/sunchengquan/GRCh38_latest_genomic.fna","r")title_list = []for line in input_file:    if line[0] == '>':        field = line        title_list.append(field)        print(field)类似于grep '>' GRCh38_latest_genomic.fna | sed -n 'p' > list.txt

这里写图片描述

阅读全文

0 0