Genomic Data Processing 76: Reading FASTA from HDFS and Counting Records

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 21:17
  1. Reading FASTA-format data:
    First run:

    hadoop@Master:~/xubo/project/load/loadfastqFromHDFSfastaAndCount$ ./load.sh
    start:
    1
    run time:25101 ms
    *************end*************
    hadoop@Master:~/xubo/project/load/loadfastqFromHDFSfastaAndCount$ mv load.sh loadGRCH38chr14.sh
    hadoop@Master:~/xubo/project/load/loadfastqFromHDFSfastaAndCount$ cp loadGRCH38chr14.sh loadGRCH38.sh
    hadoop@Master:~/xubo/project/load/loadfastqFromHDFSfastaAndCount$ vi loadGRCH38.sh
    hadoop@Master:~/xubo/project/load/loadfastqFromHDFSfastaAndCount$ ./loadGRCH38.sh
    start:
    [Stage 2:======================================================>  (24 + 1) / 25]
    16/06/08 13:47:37 ERROR TaskSchedulerImpl: Lost executor 1 on 219.219.220.215: remote Rpc client disassociated
    456
    run time:585513 ms
    *************end*************

    Second run:

    hadoop@Master:~/xubo/project/load/loadfastaFromHDFSfastaAndCount$ ./loadGRCH38chr14.sh
    start:
    1
    run time:30775 ms
    *************end*************
    hadoop@Master:~/xubo/project/load/loadfastaFromHDFSfastaAndCount$ ./loadGRCH38.sh
    start:
    456
    run time:262677 ms
    *************end*************
    16/06/08 14:01:59 WARN QueuedThreadPool: 8 threads could not be stopped
    16/06/08 14:02:04 WARN QueuedThreadPool: 1 threads could not be stopped
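
As a rough cross-check of the counts printed above (1 for chr14, 456 for the full GRCh38 file), the records in a FASTA file can be counted locally by counting header lines. This is only a minimal sketch and not the ADAM code path. It assumes the file is available on the local filesystem (the path argument is hypothetical) and that each sequence is loaded as a single fragment, which the very large fragment length in the commented-out sc.loadFasta(path, 1000000000L) call in the code section suggests but does not guarantee.

```scala
import scala.io.Source

object FastaHeaderCount {
  // Counts FASTA records by counting header lines, i.e. lines starting with '>'.
  def countRecords(lines: Iterator[String]): Long =
    lines.count(_.startsWith(">")).toLong

  def main(args: Array[String]): Unit = {
    // args(0): a local FASTA path (hypothetical; not the HDFS path used in this post).
    val source = Source.fromFile(args(0))
    try println(countRecords(source.getLines()))
    finally source.close()
  }
}
```

If the local count and the Spark count disagree, the fragment length used when loading is the first thing to check, since a smaller fragment length would split each sequence into several records.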

2. Reading ADAM format:

The read method was wrong:

    //    val rdd = sc.loadParquetContigFragments(args(0))

Fix:

    val rdd = sc.loadSequence(args(0))

Run log:

    hadoop@Master:~/xubo/project/load/loadfastaFromHDFSAdamAndCount$ ./loadGRCH38chr14.sh
    start:
    1
    run time:25802 ms
    *************end*************
    Jun 8, 2016 2:08:11 PM INFO: org.apache.parquet.hadoop.ParquetInputFormat: Total input paths to process : 1
    hadoop@Master:~/xubo/project/load/loadfastaFromHDFSAdamAndCount$ ./loadGRCH38.sh
    start:
    456
    run time:40620 ms
    *************end*************
    Jun 8, 2016 2:08:56 PM INFO: org.apache.parquet.hadoop.ParquetInputFormat: Total input paths to process : 25

For the error log, see "Spark issue 1".
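
The run times logged in sections 1 and 2 can be compared with a small helper. The sketch below is an illustration only: the object and function names are made up for this example, and the numbers are the single-run values copied from the logs above, so the resulting ratios are indicative rather than a benchmark. Note also that the first full-GRCh38 FASTA run lost an executor, so the second FASTA run (262677 ms) is the fairer comparison against the ADAM run (40620 ms).

```scala
object RunTimes {
  // Matches the "run time:<n> ms" lines printed by the job.
  private val RunTime = """run time:(\d+) ms""".r

  // Extracts every logged run time, in milliseconds, from a block of console output.
  def extractMillis(log: String): List[Long] =
    RunTime.findAllMatchIn(log).map(_.group(1).toLong).toList

  // Ratio of a baseline run time to a candidate run time.
  def speedup(baselineMs: Long, candidateMs: Long): Double =
    baselineMs.toDouble / candidateMs.toDouble

  def main(args: Array[String]): Unit = {
    val fastaMs = 262677L // full GRCh38, FASTA input, second run
    val adamMs = 40620L   // full GRCh38, ADAM (Parquet) input
    println("FASTA / ADAM run time ratio: " + speedup(fastaMs, adamMs))
  }
}
```

For the single-chromosome chr14 input the two formats are close (30775 ms for FASTA in the second run versus 25802 ms for ADAM), so the gap appears mainly on the full genome.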

3. File sizes:


4. Code:

    package org.gcdss.cli.load

    import org.apache.spark.{SparkConf, SparkContext}
    import org.bdgenomics.adam.rdd.ADAMContext._
    //import org.bdgenomics.avocado.AvocadoFunSuite

    object loadfastaFromHDFSAdamAndCount {
      def main(args: Array[String]) {
        println("start:")
        var conf = new SparkConf().setAppName(this.getClass().getSimpleName().filter(!_.equals('$'))).setMaster("spark://219.219.220.149:7077")
        //    var conf = new SparkConf().setAppName(this.getClass().getSimpleName().filter(!_.equals('$'))).setMaster("local[4]")
        val sc = new SparkContext(conf)
        val startTime = System.currentTimeMillis()
        //    val path = "hdfs://219.219.220.149:9000/xubo/ref/GRCH38Index/GCA_000001405.15_GRCh38_full_analysis_set.fna"
        //    val path = "hdfs://219.219.220.149:9000/xubo/ref/GRCH38chr14/GRCH38chr14.fasta"
        //    val rdd = sc.loadFasta(path, 1000000000L)
        //    val rdd = sc.loadParquetContigFragments(args(0))
        val rdd = sc.loadSequence(args(0))
        println(rdd.count())
        val saveTime = System.currentTimeMillis()
        println("run time:" + (saveTime - startTime) + " ms")
        println("*************end*************")
        sc.stop()
      }
    }
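
The manual startTime / saveTime bookkeeping in the program above can be factored into a reusable helper. The following is a minimal sketch under stated assumptions: the Timing object and the timed function are hypothetical names introduced here, and the workload is a stand-in that needs no Spark cluster. In the job above, the timed block would wrap the load and rdd.count() calls, which excludes SparkContext start-up just as the original measurement does.

```scala
object Timing {
  // Runs `block` once and returns its result together with the elapsed wall-clock milliseconds.
  def timed[A](block: => A): (A, Long) = {
    val start = System.currentTimeMillis()
    val result = block
    (result, System.currentTimeMillis() - start)
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workload; in the Spark job this would be sc.loadSequence(args(0)).count().
    val (count, elapsedMs) = timed((1L to 1000000L).count(_ % 2 == 0))
    println(count)
    println("run time:" + elapsedMs + " ms")
  }
}
```

Because the block is passed by name, it is evaluated inside timed, so the measured interval covers exactly the wrapped expression.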

References

【1】https://github.com/xubo245/AdamLearning
【2】https://github.com/bigdatagenomics/adam/
【3】https://github.com/xubo245/SparkLearning
【4】http://spark.apache.org
【5】http://stackoverflow.com/questions/28166667/how-to-pass-d-parameter-or-environment-variable-to-spark-job
【6】http://stackoverflow.com/questions/28840438/how-to-override-sparks-log4j-properties-per-driver

Publications:

【1】 [BIBM] Bo Xu, Changlong Li, Hang Zhuang, Jiali Wang, Qingfeng Wang, Chao Wang, and Xuehai Zhou, "Distributed Gene Clinical Decision Support System Based on Cloud Computing", in IEEE International Conference on Bioinformatics and Biomedicine. (BIBM 2017, CCF B)
【2】 [IEEE CLOUD] Bo Xu, Changlong Li, Hang Zhuang, Jiali Wang, Qingfeng Wang, Xuehai Zhou. Efficient Distributed Smith-Waterman Algorithm Based on Apache Spark (CLOUD 2017, CCF-C).
【3】 [CCGrid] Bo Xu, Changlong Li, Hang Zhuang, Jiali Wang, Qingfeng Wang, Jinhong Zhou, Xuehai Zhou. DSA: Scalable Distributed Sequence Alignment System Using SIMD Instructions. (CCGrid 2017, CCF-C).
【4】 more: https://github.com/xubo245/Publications

Help

If you have any questions or suggestions, please open an issue in this project or send me an e-mail: xubo245@mail.ustc.edu.cn
Wechat: xu601450868
QQ: 601450868