spark读写hbase
来源:互联网 发布:游族网络 大皇帝 手游 编辑:程序博客网 时间:2024/05/16 10:39
1 配置
1.1 开发环境:
- HBase:hbase-1.0.0-cdh5.4.5.tar.gz
- Hadoop:hadoop-2.6.0-cdh5.4.5.tar.gz
- ZooKeeper:zookeeper-3.4.5-cdh5.4.5.tar.gz
- Spark:spark-2.1.0-bin-hadoop2.6
1.2 Spark的配置
- Jar包:需要HBase的Jar如下(经过测试,正常运行,但是是否存在冗余的Jar并未证实,若发现多余的jar可自行进行删除)
- spark-env.sh
添加以下配置:export SPARK_CLASSPATH=/home/hadoop/data/lib1/*
注:如果使用spark-shell的yarn模式进行测试的话,那么最好每个NodeManager节点都有配置jars和hbase-site.xml - spark-default.sh
spark.yarn.historyServer.address=slave11:18080spark.history.ui.port=18080spark.eventLog.enabled=truespark.eventLog.dir=hdfs:///tmp/spark/eventsspark.history.fs.logDirectory=hdfs:///tmp/spark/eventsspark.driver.memory=1gspark.serializer=org.apache.spark.serializer.KryoSerializer
1.3 数据
1)格式: barCode@item@value@standardValue@upperLimit@lowerLimit
01055HAXMTXG10100001@KEY_VOLTAGE_TEC_PWR@1.60@1.62@1.75@1.55
01055HAXMTXG10100001@KEY_VOLTAGE_T_C_PWR@1.22@1.24@1.45@0.8
01055HAXMTXG10100001@KEY_VOLTAGE_T_BC_PWR@1.16@1.25@1.45@0.8
01055HAXMTXG10100001@KEY_VOLTAGE_11@1.32@1.25@1.45@0.8
01055HAXMTXG10100001@KEY_VOLTAGE_T_RC_PWR@1.24@1.25@1.45@0.8
01055HAXMTXG10100001@KEY_VOLTAGE_T_VCC_5V@1.93@1.90@1.95@1.65
01055HAXMTXG10100001@KEY_VOLTAGE_T_VDD3V3@1.59@1.62@1.75@1.55
2 代码演示
2.1 准备动作
1)既然是与HBase相关,那么首先需要使用hbase shell来创建一个表
创建表格:create ‘data’,’v’,create ‘data1’,’v’
2)使用spark-shell进行操作,命令如下:
bin/spark-shell --master yarn --deploy-mode client --num-executors 5 --executor-memory 1g --executor-cores 2
3)import 各种类
import org.apache.spark._import org.apache.spark.rdd.NewHadoopRDDimport org.apache.hadoop.mapred.JobConfimport org.apache.hadoop.conf.Configurationimport org.apache.hadoop.mapreduce.Jobimport org.apache.hadoop.mapreduce.lib.input.FileInputFormatimport org.apache.hadoop.mapreduce.lib.output.FileOutputFormatimport org.apache.hadoop.fs.Pathimport org.apache.hadoop.hbase.client.Putimport org.apache.hadoop.hbase.io.ImmutableBytesWritableimport org.apache.hadoop.hbase.mapred.TableOutputFormatimport org.apache.hadoop.hbase.mapreduce.TableInputFormatimport org.apache.hadoop.hbase.HBaseConfigurationimport org.apache.hadoop.hbase.client.HBaseAdminimport org.apache.hadoop.hbase.client.HTableimport org.apache.hadoop.hbase.client.Scanimport org.apache.hadoop.hbase.client.Getimport org.apache.hadoop.hbase.protobuf.ProtobufUtilimport org.apache.hadoop.hbase.util.{Base64,Bytes}import org.apache.hadoop.hbase.KeyValueimport org.apache.hadoop.hbase.mapreduce.HFileOutputFormatimport org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFilesimport org.apache.hadoop.hbase.HColumnDescriptorimport org.apache.commons.codec.digest.DigestUtils
2.2 代码实战
创建conf和table
val conf= HBaseConfiguration.create()conf.set(TableInputFormat.INPUT_TABLE,"data1")val table = new HTable(conf,"data1")
2.2.1 数据写入
格式:
val put = new Put(Bytes.toBytes("rowKey"))put.add("cf","q","value")
使用for来插入5条数据
for(i <- 1 to 5){ var put= new Put(Bytes.toBytes("row"+i));put.add(Bytes.toBytes("v"),Bytes.toBytes("value"),Bytes.toBytes("value"+i));table.put(put)}
到hbase shell中查看结果
2.2.2 数据读取
val hbaseRdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],classOf[org.apache.hadoop.hbase.client.Result])
1)take
hbaseRdd take 1
2)scan
var scan = new Scan();scan.addFamily(Bytes.toBytes(“v”));var proto = ProtobufUtil.toScan(scan)var scanToString = Base64.encodeBytes(proto.toByteArray());conf.set(TableInputFormat.SCAN,scanToString)val datas = hbaseRdd.map( x=>x._2).map{result => (result.getRow,result.getValue(Bytes.toBytes("v"),Bytes.toBytes("value")))}.map(row => (new String(row._1),new String(row._2))).collect.foreach(r => (println(r._1+":"+r._2)))
2.3 批量插入
2.3.1 普通插入
1)代码
val rdd = sc.textFile("/data/produce/2015/2015-03-01.log")val data = rdd.map(_.split("@")).map{x=>(x(0)+x(1),x(2))}val result = data.foreachPartition{x => {val conf= HBaseConfiguration.create();conf.set(TableInputFormat.INPUT_TABLE,"data");conf.set("hbase.zookeeper.quorum","slave5,slave6,slave7");conf.set("hbase.zookeeper.property.clientPort","2181");conf.addResource("/home/hadoop/data/lib/hbase-site.xml");val table = new HTable(conf,"data");table.setAutoFlush(false,false);table.setWriteBufferSize(3*1024*1024); x.foreach{y => {var put= new Put(Bytes.toBytes(y._1));put.add(Bytes.toBytes("v"),Bytes.toBytes("value"),Bytes.toBytes(y._2));table.put(put)};table.flushCommits}}}
2)执行时间如下:7.6 min
2.3.2 Bulkload
1) 代码:
val conf = HBaseConfiguration.create();val tableName = "data1"val table = new HTable(conf,tableName)conf.set(TableOutputFormat.OUTPUT_TABLE,tableName)lazy val job = Job.getInstance(conf)job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])job.setMapOutputValueClass(classOf[KeyValue])HFileOutputFormat.configureIncrementalLoad(job,table)val rdd = sc.textFile("/data/produce/2015/2015-03-01.log").map(_.split("@")).map{x => (DigestUtils.md5Hex(x(0)+x(1)).substring(0,3)+x(0)+x(1),x(2))}.sortBy(x =>x._1).map{x=>{val kv:KeyValue = new KeyValue(Bytes.toBytes(x._1),Bytes.toBytes("v"),Bytes.toBytes("value"),Bytes.toBytes(x._2+""));(new ImmutableBytesWritable(kv.getKey),kv)}}rdd.saveAsNewAPIHadoopFile("/tmp/data1",classOf[ImmutableBytesWritable],classOf[KeyValue],classOf[HFileOutputFormat],job.getConfiguration())val bulkLoader = new LoadIncrementalHFiles(conf)bulkLoader.doBulkLoad(new Path("/tmp/data1"),table)
2) 执行时间:7s
3)执行结果:
到hbase shell 中查看 list “data1”
通过对比我们可以发现bulkload批量导入所用时间远远少于普通导入,速度提升了60多倍,当然我没有使用更大的数据量测试,但是我相信导入速度的提升是非常显著的,强烈建议使用BulkLoad批量导入数据到HBase中。
- Spark读写HBASE
- spark读写hbase
- spark hbase读写
- Spark读写Hbase示例代码
- Spark读写Hbase示例代码
- Spark实战之读写HBase
- hbase-spark全新的spark读写hbase的方式
- spark-shell 读写hdfs 读写hbase 读写redis
- 如何使用scala+spark读写hbase?
- spark源码阅读一-spark读写hbase代码分析
- Spark读写Hbase的二种方式对比
- Spark读写Hbase的二种方式对比
- Spark连接HBase进行读写相关操作【CDH5.7.X】
- spark hbase
- spark hbase
- Spark&hbase
- spark hbase hbase-rdd
- Spark操作hbase
- 大作业2(Sobel算子)
- 单片机——按键扫描
- 自定义函数实现strcpy,strcat,strcmp的功能
- C语言——实例015 条件运算符,成绩等级
- Detecting Visual Relationships with Deep Relational Networks(阅读笔记)
- spark读写hbase
- 大作业3(验证码处理)
- POJ 3186 Treats for the Cows(区间DP)【区间最大值模板】
- JSP之EL表达式之遍历
- LoactionManager Hook原理
- Moving tables
- HDU2019-数列有序!
- Android面试题目之常见的填空题和简答题
- Linux 中ps 和 top命令详解