Genomic Data Processing 75: Reading a VCF File from HDFS and Saving It as an ADAM Parquet File (Success)


1. Reference:

package org.bdgenomics.adam.cli

class FlattenSuite extends ADAMFunSuite {
  val loader = Thread.currentThread().getContextClassLoader
  val inputPath = loader.getResource("small.vcf").getPath
  val outputFile = File.createTempFile("adam-cli.FlattenSuite", ".adam")
  val outputPath = outputFile.getAbsolutePath
  val flatFile = File.createTempFile("adam-cli.FlattenSuite", ".adam-flat")
  val flatPath = flatFile.getAbsolutePath
  assert(outputFile.delete(), "Couldn't delete (empty) temp file")
  assert(flatFile.delete(), "Couldn't delete (empty) temp file")

  val argLine = "%s %s".format(inputPath, outputPath).split("\\s+")
  val args: Vcf2ADAMArgs = Args4j.apply[Vcf2ADAMArgs](argLine)
  val vcf2Adam = new Vcf2ADAM(args)
  vcf2Adam.run(sc)

2. Code:

package org.gcdss.cli.load

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.bdgenomics.adam.cli.{Vcf2ADAMArgs, Vcf2ADAM}
import org.bdgenomics.adam.rdd.ADAMContext
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.utils.cli.Args4j

//import org.bdgenomics.avocado.AvocadoFunSuite

object Callvcf2Adam {
  //  def resourcePath(path: String) = ClassLoader.getSystemClassLoader.getResource(path).getFile
  //  def tmpFile(path: String) = Files.createTempDirectory("").toAbsolutePath.toString + "/" + path

  //  def apply(local: Boolean, fqFile: String, faFile: String, configFile: String, output: String) {
  def main(args: Array[String]) {
    println("start:")
    var conf = new SparkConf().setAppName(this.getClass().getSimpleName().filter(!_.equals('$'))).setMaster("spark://219.219.220.149:7077")
    //    var conf = new SparkConf().setAppName("AvocadoSuite").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val startTime = System.currentTimeMillis()
    val path = "hdfs://219.219.220.149:9000/xubo/callVariant/vcf/All_20160407.vcf"
    val output = "/xubo/callVariant/vcf/All_20160407.adam"
    val argLine = "%s %s".format(path, output).split("\\s+")
    val args: Vcf2ADAMArgs = Args4j.apply[Vcf2ADAMArgs](argLine)
    //    val arr=Array(argLine)
    val vcf2Adam = new Vcf2ADAM(args)
    vcf2Adam.run(sc)
    val saveTime = System.currentTimeMillis()
    println("run time:" + (saveTime - startTime) + " ms")
    println("*************end*************")
    sc.stop()
  }
}
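The listing above builds the argument array by joining the input and output paths with a space and then splitting on whitespace. The following is a minimal sketch in plain Scala (no Spark or ADAM needed) of what that construction yields, and of how it would behave if a path contained a space. The names ArgLineSketch, viaFormat and direct, and the path "/tmp/my data.vcf", are hypothetical and used only for illustration.

```scala
object ArgLineSketch {
  // Same construction as in the listing: join with a space, then split on whitespace.
  def viaFormat(input: String, output: String): Array[String] =
    "%s %s".format(input, output).split("\\s+")

  // Alternative: build the array directly, so whitespace inside a path is preserved.
  def direct(input: String, output: String): Array[String] =
    Array(input, output)

  def main(args: Array[String]): Unit = {
    val in = "hdfs://219.219.220.149:9000/xubo/callVariant/vcf/All_20160407.vcf"
    val out = "/xubo/callVariant/vcf/All_20160407.adam"

    // For the paths used in this post both variants give the same two tokens.
    println(viaFormat(in, out).mkString("[", ", ", "]"))
    println(direct(in, out).mkString("[", ", ", "]"))

    // A hypothetical path with a space is broken into extra tokens by viaFormat.
    println(viaFormat("/tmp/my data.vcf", out).length) // 3
    println(direct("/tmp/my data.vcf", out).length)    // 2
  }
}
```

For the HDFS paths used here the two approaches are equivalent; building the array directly is only a possible safeguard for paths that may contain whitespace.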

3. Result:
The output consists of 202 small files.

Time:
211898 ms
That seems a bit fast.
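To put the reported run time in more readable units, here is a small sketch in plain Scala that converts a millisecond count to minutes and seconds and wraps the start/stop timing pattern from the code in a helper. The names ElapsedSketch, timed and pretty are hypothetical and not part of Spark or ADAM.

```scala
object ElapsedSketch {
  // Run a block and return its result together with the elapsed wall-clock time in ms,
  // using the same System.currentTimeMillis() pattern as the job code.
  def timed[A](block: => A): (A, Long) = {
    val start = System.currentTimeMillis()
    val result = block
    (result, System.currentTimeMillis() - start)
  }

  // Render a millisecond count as "M min S.mmm s" using integer arithmetic only.
  def pretty(ms: Long): String = {
    val minutes = ms / 60000
    val seconds = (ms % 60000) / 1000
    val millis = ms % 1000
    "%d min %d.%03d s".format(minutes, seconds, millis)
  }

  def main(args: Array[String]): Unit = {
    println(pretty(211898L)) // 3 min 31.898 s

    // Example use of timed on an arbitrary computation.
    val (sum, elapsed) = timed { (1L to 1000000L).sum }
    println("sum=" + sum + ", run time: " + elapsed + " ms")
  }
}
```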

Reading the output back through adam-shell gives a record count of 0:

scala> val rdd = sc.loadVariantAnnotations("/xubo/callVariant/vcf/All_20160407.adam")
print(rdd.count)


Research results:

【1】 [BIBM] Bo Xu, Changlong Li, Hang Zhuang, Jiali Wang, Qingfeng Wang, Chao Wang, and Xuehai Zhou, "Distributed Gene Clinical Decision Support System Based on Cloud Computing", in IEEE International Conference on Bioinformatics and Biomedicine. (BIBM 2017, CCF B)
【2】 [IEEE CLOUD] Bo Xu, Changlong Li, Hang Zhuang, Jiali Wang, Qingfeng Wang, Xuehai Zhou. Efficient Distributed Smith-Waterman Algorithm Based on Apache Spark (CLOUD 2017, CCF-C).
【3】 [CCGrid] Bo Xu, Changlong Li, Hang Zhuang, Jiali Wang, Qingfeng Wang, Jinhong Zhou, Xuehai Zhou. DSA: Scalable Distributed Sequence Alignment System Using SIMD Instructions. (CCGrid 2017, CCF-C).
【4】 more: https://github.com/xubo245/Publications

Help

If you have any questions or suggestions, please open an issue in this project or send me an e-mail: xubo245@mail.ustc.edu.cn
Wechat: xu601450868
QQ: 601450868
