Sequence Cleaner
来源:互联网 发布:华为s7703 mac绑定 编辑:程序博客网 时间:2024/04/27 13:52
Sequence Cleaner
Description
I want to share my script using Biopython to clean sequences up. You should know that analyzing poor data takes CPU time and interpreting the results from poor data takes people time, so it’s always important to make a preprocessing.
Let me call my script as “Sequence_cleaner” and the big idea is to remove duplicate sequences, remove too short sequences (the user defines the minimum length) and remove sequences which have too many unknown nucleotides (N) (the user defines the % of N it allows ) and in the end the user can choose if he/she wants to have a file as output or print the result.
Script
import sysfrom Bio import SeqIOdef sequence_cleaner(fasta_file, min_length=0, por_n=100): # Create our hash table to add the sequences sequences={} # Using the Biopython fasta parse we can read our fasta input for seq_record in SeqIO.parse(fasta_file, "fasta"): # Take the current sequence sequence = str(seq_record.seq).upper() # Check if the current sequence is according to the user parameters if (len(sequence) >= min_length and (float(sequence.count("N"))/float(len(sequence)))*100 <= por_n): # If the sequence passed in the test "is it clean?" and it isn't in the # hash table, the sequence and its id are going to be in the hash if sequence not in sequences: sequences[sequence] = seq_record.id # If it is already in the hash table, we're just gonna concatenate the ID # of the current sequence to another one that is already in the hash table else: sequences[sequence] += "_" + seq_record.id # Write the clean sequences # Create a file in the same directory where you ran this script output_file = open("clear_" + fasta_file, "w+") # Just read the hash table and write on the file as a fasta format for sequence in sequences: output_file.write(">" + sequences[sequence] + "\n" + sequence + "\n") output_file.close() print("CLEAN!!!\nPlease check clear_" + fasta_file)userParameters = sys.argv[1:]try: if len(userParameters) == 1: sequence_cleaner(userParameters[0]) elif len(userParameters) == 2: sequence_cleaner(userParameters[0], float(userParameters[1])) elif len(userParameters) == 3: sequence_cleaner(userParameters[0], float(userParameters[1]), float(userParameters[2])) else: print("There is a problem!")except: print("There is a problem!")
Using the command line, you should run:
python sequence_cleaner.py input[1] min_length[2] min[3]
python sequence_cleaner.py Aip_coral.fasta 10 10
0 0
- Sequence Cleaner
- Landmine Cleaner
- Visual Studio History Cleaner
- thinkpad system update cleaner
- Remove System Cleaner Trojan
- RegVac Registry Cleaner 4.02.35
- Cleaner通用文件处理模块
- Cleaner View Casting with Generics
- Cleaner View Casting with Generics
- Regcure - #1 Converting Registry Cleaner.
- CF Cleaner Robot (BFS)
- Codeforces 589J. Cleaner Robot
- CodeForces 589J Cleaner Robot
- CodeForces 589 J Cleaner Robot
- Cleaner Robot(CF 589J)
- CF-589J Cleaner Robot
- sequence
- SEQUENCE
- 面试笔试---数据库知识点
- Java中涉及的编码知识
- linun发展史
- VS(Visual Studio)自动创建的文件格式
- Android Studio中找到项目的密匙数字证书(MD5和SHA1)配置高德地图Android地图SDK【高德地图使用】
- Sequence Cleaner
- Servlet 生命周期、工作原理
- 2.Add Two Numbers
- Oracle知识点巩固 (安装与卸载)和cmd命令
- js--面向对象--创建对象
- 设置mysql5.7编码格式为UTF8
- React Native中的=>是什么运算?
- jquery隐藏显示元素
- 获取指定位置序列