NCBI SRA Format Conversion

Source: Internet  Time: 2024/05/17 04:41
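Recently, to save storage space, NCBI converted its data to the *.sra format and no longer provides *.fastq.gz files, so a dedicated conversion tool is needed to convert downloaded *.sra data files.
Download address of the conversion tool: http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software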

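Format conversion on Windows:

The version I used is the MS Windows 32 bit architecture build.

1. cd into your sratoolkit.2.0.1rc1-win32 folder. Alternatively, create a separate temporary folder, put the fastq-dump program in it, cd to that folder, and put the data there as well.

2. Then run the following command: fastq-dump ERR022480.lite.sra The generated FASTQ files appear in the same folder, usually as three fastq files. This makes it easy to convert the sra format to fastq.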
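When a folder holds several *.sra files, the single command above can be repeated in a loop. The following is a minimal sketch for a POSIX shell (for example on the Linux build described below); the function name convert_all and the DRY_RUN switch are illustrative assumptions and not part of the SRA toolkit. It uses only the plain "fastq-dump <file>" form shown in step 2.

```shell
#!/bin/sh
# convert_all DIR: run fastq-dump on every *.sra file found in DIR.
# DRY_RUN=1 (an illustrative switch, not a toolkit feature) prints the
# commands instead of running them.
convert_all() {
    for f in "$1"/*.sra; do
        [ -e "$f" ] || continue        # skip the literal pattern when nothing matches
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "fastq-dump $f"
        else
            fastq-dump "$f"            # output location follows fastq-dump's defaults
        fi
    done
}

# Preview the commands for the current folder without running anything.
DRY_RUN=1
convert_all .
```

Running with DRY_RUN=1 first makes it easy to check which files would be converted before starting a long job.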


Ubuntu Linux 32 bit architecture:

 
Conversion command:
 $ fastq-dump -A <SRR_accession> -D <Path_to_SRR_Directory> -O <Output_Path>
Basic command options:
Command and description:

‘-A’ or ‘--accession’
    Enables modification of the output name used for the fastq files. For example:
        fastq-dump -A foo SRR000001
    will produce files named ‘foo.fastq’, ‘foo_1.fastq’, and ‘foo_2.fastq’.

‘-D’ or ‘--table-path’
    Makes the archive path more explicitly specified, thus preventing confusion when more than one option is specified. These two commands produce the same files:
        fastq-dump ~/SRR000001
        fastq-dump -D ~/SRR000001
    However, the first command below will fail while the second will succeed:
        fastq-dump -C ~/SRR000001
        fastq-dump -C -D ~/SRR000001
    (The ‘-C’ option is explained further below.)

‘-N’ or ‘--minSpotId’
    Minimum spot number at which to start the dump process.

‘-X’ or ‘--maxSpotId’
    Maximum spot number at which to stop the dump process. For example:
        fastq-dump -N 5 -X 10 SRR000001
    This command will dump six spots starting from spot ‘SRR000001.5’ and ending in spot ‘SRR000001.10’. Filtered spots can result in fewer than (maxSpotId - minSpotId + 1) total spots output.

‘-G’ or ‘--spot-group’
    Boolean option that results in fastq files divided into spot groups as defined in the Experiment (or eventually Run) xml. This command:
        fastq-dump -G SRR051894
    produces these five fragment files:
        SRR051894.fastq
        SRR051894_GDSX2KN04_PSORIASISMDA-POOL-738_CB028-01WG.fastq
        SRR051894_GDSX2KN04_PSORIASISMDA-POOL-738_CB036-01WG.fastq
        SRR051894_GDSX2KN04_PSORIASISMDA-POOL-738_CD021-01WG.fastq
        SRR051894_GDSX2KN04_PSORIASISMDA-POOL-738_CD036-01WG.fastq
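
As a quick check of the ‘-N’/‘-X’ arithmetic described above, the sketch below lists the spot identifiers covered by an inclusive range. The function name list_spots is hypothetical; the count it yields is only the upper bound (maxSpotId - minSpotId + 1), since filtered spots can reduce the real output.

```shell
#!/bin/sh
# list_spots ACCESSION MIN MAX: print every spot identifier in the
# inclusive range MIN..MAX, one per line (hypothetical helper).
list_spots() {
    i=$2
    while [ "$i" -le "$3" ]; do
        echo "$1.$i"
        i=$(( i + 1 ))
    done
}

# Range requested by: fastq-dump -N 5 -X 10 SRR000001
list_spots SRR000001 5 10
```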

‘-T’ or ‘--group-in-dirs’
    Boolean option directing the utility to produce fastq files in sub-directories rather than producing files within the same directory.

‘-O’ or ‘--outdir’
    Indicates the directory where the fastq result should be placed. For example:
        fastq-dump -O /tmp -T SRR000001
    will create a directory, SRR000001, in /tmp with this tree structure:
        >tree /tmp/SRR000001
        /tmp/SRR000001
        |-- 1
        |   `-- fastq
        |-- 2
        |   `-- fastq
        `-- fastq

‘-K’ or ‘--keep-empty-files’
    Has no effect. At one time this option would represent all three possible files even if one or two were empty.

‘-M’ or ‘--minReadLen’
    Allows specification of the desired minimum read length to output (default is 25). The command ‘fastq-dump -M 0 SRR000001’ prevents any filtering based on read length.

‘-W’ or ‘--noclip’
    Prevents clipping of a spot sequence based on the right clip information. Toggling ‘show-clipped’ in the ‘customize’ area for reads in the SRA Run Browser enables observing the effect of this option (e.g. see SRR000001).

‘-F’ or ‘--origfmt’
    Results in fastq containing only the original identifier on the defline (i.e. no length or SRR identifier is present).

‘-C’ or ‘--dumpcs’
    Forces color space sequence to be dumped instead of base space. If the optional ‘cskey’ is provided (i.e. A, C, T, or G), then all fastq files produced will use that key at the start of each color space sequence.

‘-B’ or ‘--dumpbase’
    Forces base space sequence to be dumped instead of color space.

‘-Q’ or ‘--offset’
    Allows using a different offset value to represent a different offset character in the fastq output. For example, using an offset of 64 represents using ‘@’ as the offset character.

‘-I’ or ‘--readids’
    Appends a read index to the run identifier starting with ‘1’ as the first index. Note that this differs from the spot descriptor in the Experiment xml where the read indices start with ‘0’. In the case of SRR000001, the first spot in each file would have the identifiers ‘SRR000001.5.4’, ‘SRR000001.1.2’, and ‘SRR000001.1.4’. Note that the first spot sequence in SRR000001.fastq, the fragment file, comes from the second biological/application read which has an index of ‘4’.

‘-E’ or ‘--no_qual_filter’
    This option turns off quality filtering based on leading/trailing low quality values. As reads have become longer this option has become a more viable alternative.

‘-SF’ or ‘--complete’
    Outputs the separated reads into a single file. For example, the command:
        fastq-dump -SF SRR029338
    results in the first eight lines of the file, SRR029338.fastq, containing:
        @SRR029338.1 080115_EAS112_0034:8:1:615:780 length=36
        GGTTGAGTAAAGTGTCTAAAGGCATAGCCTGATTAT
        +SRR029338.1 080115_EAS112_0034:8:1:615:780 length=36
        IIIIIIIIIIIIIIIIIIIAIIA<I8I+7I9+II2I
        @SRR029338.1 080115_EAS112_0034:8:1:615:780 length=36
        AAAGTCAAATTTGAATTGTTGTCAGCTTGTCAAAAT
        +SRR029338.1 080115_EAS112_0034:8:1:615:780 length=36
        IIIIIIIIDIIIIIIIIIIIII.1F2II=8*2+//I
    In the case of 454 pair submissions, the second technical read (i.e. linker) is included in this single output file.
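
The ‘-Q’ description above says that an offset of 64 corresponds to ‘@’ as the offset character. The sketch below illustrates that relationship: a quality character is the offset plus the score, written as an ASCII character. The helper names phred_score and phred_char are hypothetical, and the use of offset 33 for the sample ‘I’ character is an assumption made for illustration, not something the text above states.

```shell
#!/bin/sh
# phred_score CHAR OFFSET: numeric quality encoded by CHAR under OFFSET.
phred_score() {
    code=$(printf '%d' "'$1")          # ASCII code of CHAR (POSIX printf idiom)
    echo $(( code - $2 ))
}

# phred_char SCORE OFFSET: character that encodes SCORE under OFFSET.
phred_char() {
    printf "\\$(printf '%03o' $(( $1 + $2 )))\n"   # emit the code as an octal escape
}

phred_char 0 64       # score 0 with offset 64 is '@'
phred_score I 33      # 'I' is ASCII 73, so offset 33 (assumed) gives 40
```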

‘-DB’ or ‘--defline-seq’
    Allows specification of the sequence defline format. For example:
        -DB "@$ac.$si $sn length=$rl"
    This specification produces the same output as the default output. See Appendix D for a more in-depth explanation. Note that submission of a ‘fastq-dump’ command to a compute farm (e.g. Sun Grid Engine) can require preceding a number of the characters with backslash characters when using this option. The above example might require this version:
        -DB "@\\\$ac.\\\$si \\\$sn length=\\\$rl"

‘-DQ’ or ‘--defline-qual’
    Allows specification of the quality defline format. For example:
        -DQ "+$ac.$si $sn length=$rl"

‘-alt [n]’
    Provides alternative output formats without having to indicate the individual options. Alternate ‘1’, the only option, results in this format for SRR029338_1.fastq:
        @SRR029338.1 080115_EAS112_0034:8:1:615:780/1
        GGTTGAGTAAAGTGTCTAAAGGCATAGCCTGATTAT
        +
        IIIIIIIIIIIIIIIIIIIAIIA<I8I+7I9+II2I
    And this format for SRR029338_2.fastq:
        @SRR029338.1 080115_EAS112_0034:8:1:615:780/2
        AAAGTCAAATTTGAATTGTTGTCAGCTTGTCAAAAT
        +
        IIIIIIIIDIIIIIIIIIIIII.1F2II=8*2+//I
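
To make the difference between the default deflines and the ‘-alt 1’ layout concrete, here is a rough sketch that rewrites already-dumped records from the first style into the second with awk. The function name to_alt1 is hypothetical, and this is a post-processing illustration under the assumption of strict four-line records; it is not how fastq-dump itself produces the format.

```shell
#!/bin/sh
# to_alt1 READ_INDEX: read default-style FASTQ on stdin and write records
# in the '-alt 1' style: defline ends in /READ_INDEX, quality defline is '+'.
to_alt1() {
    awk -v idx="$1" '
        # Lines are selected by position, not by their first character,
        # because a quality string may itself begin with "@" or "+".
        NR % 4 == 1 { sub(/ length=[0-9]+$/, ""); print $0 "/" idx; next }
        NR % 4 == 3 { print "+"; next }
        { print }
    '
}

# First record of the SRR029338 example, rewritten as read 1.
printf '%s\n' \
    '@SRR029338.1 080115_EAS112_0034:8:1:615:780 length=36' \
    'GGTTGAGTAAAGTGTCTAAAGGCATAGCCTGATTAT' \
    '+SRR029338.1 080115_EAS112_0034:8:1:615:780 length=36' \
    'IIIIIIIIIIIIIIIIIIIAIIA<I8I+7I9+II2I' | to_alt1 1
```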

2. Converting *.sra files to SFF format
$ sff-dump -A <SRR_accession> -D <Path_to_SRR_Directory> -O <Output_Path>
 
Options:

Command   Description
-O        Allows user to specify an output directory. If not used, output will default to the current directory.
-N        Minimum spot ID to output. The first spot in the output will be the number given for this option.
-X        Maximum spot ID to output. The last spot in the output will be the number given. Min and Max spot options can be combined to output subsections of an SRR.
-G        spotgroup-file: Split into files by SPOT_GROUP
-T        spotgroup-dir: Split into subdirectories (of -O) by SPOT_GROUP
-L        Log level: 0-13 or fatal|sys|int|err|warn|info|debug[1-10]. (default: info) Set to ‘4’ to mimic the unix standard of no messages for a successful operation.
-H        Prints this help message and version information.
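
Since the -N and -X options can be combined to output subsections of an SRR, one possible use is to plan a dump in fixed-size spot ranges. The sketch below only prints candidate sff-dump commands and does not run them. The function name print_chunk_cmds, the total spot count, and the chunk size are all hypothetical inputs, and whether successive runs overwrite each other's output in the same directory has not been verified here.

```shell
#!/bin/sh
# print_chunk_cmds ACC SRR_DIR OUT_DIR TOTAL CHUNK
# Print one sff-dump command per inclusive spot range of at most CHUNK spots.
print_chunk_cmds() {
    start=1
    while [ "$start" -le "$4" ]; do
        end=$(( start + $5 - 1 ))
        if [ "$end" -gt "$4" ]; then
            end=$4                      # last range may be shorter
        fi
        echo "sff-dump -N $start -X $end -A $1 -D $2 -O $3"
        start=$(( end + 1 ))
    done
}

# Example with made-up values: 10 spots in ranges of 4.
print_chunk_cmds SRR000001 /data/SRR000001 /tmp/out 10 4
```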

3. Converting *.sra files to Illumina native format
$ illumina-dump [options] -path <directory_containing_the_accession> <acces

Command                  Description
-D, --table-path         Path to accession data.
-O, --outdir             Output directory. Default: '.'
-N, --minSpotId          Minimum spot id to output.
-X, --maxSpotId          Maximum spot id to output.
-G, --spot-group         Split into files by SPOT_GROUP (member).
-T, --group-in-dirs      Split into subdirectories instead of files.
-K, --keep-empty-files   Do not delete empty files.
-L, --log-level          Logging level: 0-13 or fatal|sys|int|err|warn|info|debug[1-10]. Default: info
-H, --help               Prints this message

Format options:

Command           Description
-r, --read        Output READ: "seq". Default: on
-q, --qual1       Output QUALITY, into single (1) or multiple (2) files: "qcal". Default: 1
-p, --qual4       Output full QUALITY: "prb". Default: off
-i, --intensity   Output INTENSITY, if present: "int". Default: off
-n, --noise       Output NOISE, if present: "nse". Default: off
-s, --signal      Output SIGNAL, if present: "sig2". Default: off
-qseq             Output QSEQ format: "qseq". Default: off
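
For scripting, the format options above can be kept as a small lookup from option to the quoted output label. The sketch below simply mirrors the table; the function name illumina_output_label is hypothetical, and treating the quoted labels as names for the output types is an interpretation of the table rather than documented behavior.

```shell
#!/bin/sh
# illumina_output_label OPTION: print the output label that the table
# above associates with an illumina-dump format option.
illumina_output_label() {
    case $1 in
        -r|--read)      echo seq  ;;
        -q|--qual1)     echo qcal ;;
        -p|--qual4)     echo prb  ;;
        -i|--intensity) echo int  ;;
        -n|--noise)     echo nse  ;;
        -s|--signal)    echo sig2 ;;
        -qseq)          echo qseq ;;
        *) echo "unknown format option: $1" >&2; return 1 ;;
    esac
}

illumina_output_label --qual4
illumina_output_label -s
```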