HBase: Calling the Java API and Integrating with MapReduce (Importing Data from HDFS into HBase via MapReduce)
This article is divided into two parts: the first covers the HBase Java API, and the second covers integrating HBase with MapReduce.
Part 1: The HBase Java API
HBase is written in Java, so it can of course be driven from Java code through its Java API: querying an HBase table, inserting a single record into a table, and so on. First we need the basic HBase Maven dependencies. I work in the IDEA IDE; the dependencies to add are the following:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.sunwangdong.hadoop.test</groupId>
    <artifactId>jkxy</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.7.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-common</artifactId>
            <version>2.7.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>2.7.1</version>
        </dependency>
        <!-- HBase dependencies: hbase, hbase-client and hbase-server -->
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase</artifactId>
            <version>1.2.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>1.2.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>1.2.6</version>
        </dependency>
    </dependencies>
</project>
```

With the dependencies in place, let's walk through the basic operations one by one, starting with creating an HBase table:
```java
private static void createTable(HBaseAdmin hBaseAdmin) throws IOException {
    if (!hBaseAdmin.tableExists(TABLE_NAME)) {                                 // does a table named "hello" already exist?
        HTableDescriptor hTableDescriptor = new HTableDescriptor(TABLE_NAME);    // table descriptor (table name)
        HColumnDescriptor hColumnDescriptor = new HColumnDescriptor(FAMILY_NAME); // column family
        hTableDescriptor.addFamily(hColumnDescriptor);                          // attach the family to the table
        hBaseAdmin.createTable(hTableDescriptor);                               // create the table
    }
}
```

Two descriptor classes are used here. HTableDescriptor describes a table; its constructor takes the table name. HColumnDescriptor describes a column family, and HTableDescriptor.addFamily(HColumnDescriptor) attaches the family to the table. In addition, HBaseAdmin manages HBase's table metadata and provides operations such as creating tables, deleting tables and listing tables; HBaseAdmin.createTable(HTableDescriptor) creates the table.
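As a side note (my addition, not from the original post): HBase 1.x deprecates the `new HBaseAdmin(conf)` constructor in favor of obtaining an Admin from a Connection. A minimal sketch of the same table-creation logic with that API, reusing the TABLE_NAME and FAMILY_NAME constants defined in the full class further below (it additionally needs imports for org.apache.hadoop.hbase.TableName and org.apache.hadoop.hbase.client.Connection, ConnectionFactory and Admin):

```java
// Sketch only: the same createTable logic using the non-deprecated Connection/Admin API.
private static void createTableWithAdmin(Configuration conf) throws IOException {
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Admin admin = connection.getAdmin()) {
        TableName tableName = TableName.valueOf(TABLE_NAME);
        if (!admin.tableExists(tableName)) {                          // only create if missing
            HTableDescriptor tableDescriptor = new HTableDescriptor(tableName);
            tableDescriptor.addFamily(new HColumnDescriptor(FAMILY_NAME));
            admin.createTable(tableDescriptor);
        }
    }
}
```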
Next, deleting a table:
```java
private static void dropTable(HBaseAdmin hBaseAdmin) throws IOException {
    if (hBaseAdmin.tableExists(TABLE_NAME)) {          // does the table exist?
        hBaseAdmin.disableTable(TABLE_NAME);           // a table must be disabled first
        hBaseAdmin.deleteTable(TABLE_NAME);            // then it can be deleted
    }
}
```

We first check whether the table exists in HBase. If it does, we call deleteTable to remove it; before deleting, the table must be disabled with disableTable, which takes it offline, and only then can it be deleted.
Next is reading data. There are two ways, scan and get, corresponding to HBase's scan and get operations.
```java
private static void scanTable(HTable hTable) throws IOException {
    System.out.println("Scan results:");
    Scan scan = new Scan();
    ResultScanner results = hTable.getScanner(scan);
    for (Result result : results) {
        byte[] value = result.getValue(FAMILY_NAME.getBytes(), COLUMN_NAME.getBytes());
        System.out.println(new String(value));
    }
}
```

Because a scan walks the whole table, it returns many rows, so the results come back as a ResultScanner, obtained through HTable.getScanner(Scan). Note that HTable is the class used to talk to a single HBase table, but it is not thread-safe: if several threads share one HTable instance, its write buffer can be corrupted. Also note that every value comes back as a byte array, so it must be converted to a String before printing.
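When the table is large, scanning every column of every row is wasteful. As a hedged aside (not part of the original code), a Scan can be narrowed to one column and a row-key range, and the ResultScanner should be closed when done; the start/stop keys below are placeholder values:

```java
// Sketch: restrict a Scan to a single column and a row-key range, and close the scanner.
private static void scanRange(HTable hTable) throws IOException {
    Scan scan = new Scan();
    scan.addColumn(FAMILY_NAME.getBytes(), COLUMN_NAME.getBytes()); // only fetch info:age
    scan.setStartRow("a".getBytes());                               // start row key (inclusive), example value
    scan.setStopRow("z".getBytes());                                // stop row key (exclusive), example value
    scan.setCaching(100);                                           // rows fetched per RPC
    ResultScanner scanner = hTable.getScanner(scan);
    try {
        for (Result result : scanner) {
            byte[] value = result.getValue(FAMILY_NAME.getBytes(), COLUMN_NAME.getBytes());
            if (value != null) {
                System.out.println(new String(value));
            }
        }
    } finally {
        scanner.close();                                            // release server-side resources
    }
}
```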
Then the get method. It is similar to scan, except that get only returns the data under one row key rather than the whole table.
```java
private static void getRecord(HTable hTable) throws IOException {
    Get get = new Get(ROW_KEY.getBytes());
    Result result = hTable.get(get);
    byte[] value = result.getValue(FAMILY_NAME.getBytes(), COLUMN_NAME.getBytes());
    System.out.println("Query result: " + new String(value));
}
```

Here we read one column of one row key; a timestamp can also be specified when needed, because a cell in HBase is addressed by four coordinates: row key, column family, column qualifier and timestamp. The query returns a Result; as before, the value is a byte array and must be converted to a String for output.
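As another hedged aside (not in the original code), a Get can likewise be narrowed to a single column, and it is worth checking Result.isEmpty() before converting the bytes, since a missing row would otherwise cause a NullPointerException:

```java
// Sketch: fetch only one column for one row key and guard against a missing row/cell.
private static void getSingleColumn(HTable hTable) throws IOException {
    Get get = new Get(ROW_KEY.getBytes());
    get.addColumn(FAMILY_NAME.getBytes(), COLUMN_NAME.getBytes());  // only fetch info:age
    Result result = hTable.get(get);
    if (result.isEmpty()) {                                         // row or cell not found
        System.out.println("no record for row key " + ROW_KEY);
        return;
    }
    byte[] value = result.getValue(FAMILY_NAME.getBytes(), COLUMN_NAME.getBytes());
    System.out.println("Query result: " + new String(value));
}
```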
Finally, put, which inserts or updates a single record in HBase.
```java
private static void putRecord(HTable hTable) throws IOException {
    Put put = new Put(ROW_KEY.getBytes());
    put.add(FAMILY_NAME.getBytes(), COLUMN_NAME.getBytes(), "25".getBytes());
    hTable.put(put);
    System.out.println("insert a record!");
}
```

The Put class performs an add/update on a single row. Writing a record again requires the same pieces of information: row key, column family, column qualifier and the value itself; a timestamp can also be supplied explicitly when needed.
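One more hedged sketch (my addition, not in the original code): a Put can carry an explicit timestamp via the addColumn overload that accepts one, and many Puts can be submitted in a single batch, which cuts down round trips when inserting lots of rows. The row keys and values below are made up, and java.util.List/ArrayList imports are needed:

```java
// Sketch: a Put with an explicit timestamp, plus a batched insert of several rows.
private static void putWithTimestampAndBatch(HTable hTable) throws IOException {
    long ts = System.currentTimeMillis();
    Put put = new Put(ROW_KEY.getBytes());
    put.addColumn(FAMILY_NAME.getBytes(), COLUMN_NAME.getBytes(), ts, "26".getBytes()); // explicit timestamp
    hTable.put(put);

    List<Put> batch = new ArrayList<Put>();
    for (int i = 0; i < 3; i++) {                                   // made-up row keys: user0, user1, user2
        Put p = new Put(("user" + i).getBytes());
        p.addColumn(FAMILY_NAME.getBytes(), COLUMN_NAME.getBytes(), String.valueOf(20 + i).getBytes());
        batch.add(p);
    }
    hTable.put(batch);                                              // one call for the whole list
}
```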
The complete code is as follows:
```java
package com.hbase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.*;

import java.io.IOException;

/**
 * Created by sunwangdong on 2017/7/20.
 */
public class HbaseTest {
    //public HBaseAdmin admin = null;
    public static final String TABLE_NAME  = "hello";    // table name
    public static final String FAMILY_NAME = "info";     // column family
    public static final String COLUMN_NAME = "age";      // column qualifier
    public static final String ROW_KEY     = "xiaoming"; // row key

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();           // create a Configuration
        conf.set("hbase.rootdir", "hdfs://localhost:9000/hbase");
        //conf.set("hbase.zookeeper.quorum", "");
        HBaseAdmin hBaseAdmin = new HBaseAdmin(conf);
        //createTable(hBaseAdmin);                                  // create the table
        HTable hTable = new HTable(conf, TABLE_NAME.getBytes());
        //putRecord(hTable);                                        // insert a record
        putRecord2(hTable, "88");
        getRecord(hTable);                                          // read a record
        scanTable(hTable);                                          // scan the whole table
        //dropTable(hBaseAdmin);                                    // delete the table
    }

    private static void scanTable(HTable hTable) throws IOException {
        System.out.println("Scan results:");
        Scan scan = new Scan();
        ResultScanner results = hTable.getScanner(scan);
        for (Result result : results) {
            byte[] value = result.getValue(FAMILY_NAME.getBytes(), COLUMN_NAME.getBytes());
            System.out.println(new String(value));
        }
    }

    private static void getRecord(HTable hTable) throws IOException {
        Get get = new Get(ROW_KEY.getBytes());
        Result result = hTable.get(get);
        byte[] value = result.getValue(FAMILY_NAME.getBytes(), COLUMN_NAME.getBytes());
        System.out.println("Query result: " + new String(value));
    }

    private static void putRecord(HTable hTable) throws IOException {
        Put put = new Put(ROW_KEY.getBytes());
        put.add(FAMILY_NAME.getBytes(), COLUMN_NAME.getBytes(), "25".getBytes());
        hTable.put(put);
        System.out.println("insert a record!");
    }

    private static void putRecord2(HTable hTable, String value) throws IOException {
        Put put = new Put(ROW_KEY.getBytes());
        put.add(FAMILY_NAME.getBytes(), COLUMN_NAME.getBytes(), value.getBytes());
        hTable.put(put);
        System.out.println("insert or update a record!");
    }

    private static void createTable(HBaseAdmin hBaseAdmin) throws IOException {
        if (!hBaseAdmin.tableExists(TABLE_NAME)) {                                 // does the "hello" table exist?
            HTableDescriptor hTableDescriptor = new HTableDescriptor(TABLE_NAME);    // table descriptor
            HColumnDescriptor hColumnDescriptor = new HColumnDescriptor(FAMILY_NAME); // column family
            hTableDescriptor.addFamily(hColumnDescriptor);                          // attach the family
            hBaseAdmin.createTable(hTableDescriptor);                               // create the table
        }
    }

    private static void dropTable(HBaseAdmin hBaseAdmin) throws IOException {
        if (hBaseAdmin.tableExists(TABLE_NAME)) {          // does the table exist?
            hBaseAdmin.disableTable(TABLE_NAME);           // disable first
            hBaseAdmin.deleteTable(TABLE_NAME);            // then delete
        }
    }
}
```

Note that main builds a Configuration and must set the hbase.rootdir path; this path has to match the one configured in hbase-site.xml, otherwise the program will not run.
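One closing note for Part 1 (my addition, not from the original post): in many deployments the client mainly needs to know where ZooKeeper runs, so a client configuration along the following lines is common; the host name and port are placeholders that must match your own hbase-site.xml:

```java
// Sketch: typical client-side settings; "localhost" and 2181 are assumed values.
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "localhost");                    // ZooKeeper host(s)
conf.set("hbase.zookeeper.property.clientPort", "2181");            // ZooKeeper client port
conf.set("hbase.rootdir", "hdfs://localhost:9000/hbase");           // must match hbase-site.xml
```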
Part 2: Integrating HBase with MapReduce
Integrating HBase with MapReduce here means importing data stored in HDFS into an HBase table. There are two main ways to do this; the one used below is a MapReduce job that splits each HDFS line into columns and writes them into HBase. Consider the following data table, stored in HDFS at the path /t1/t1:
```
1	zhangsan	10	male	NULL
2	lisi	NULL	NULL	NULL
3	wangwu	NULL	NULL	NULL
4	zhaoliu	NULL	NULL	1993
```

The columns within each row are separated by a tab ("\t"). We implement the import with a custom MapReduce job.
First, the map function:
```java
public static class HdfsToHBaseMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Text outKey = new Text();
    private Text outValue = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] splits = value.toString().split("\t");
        outKey.set(splits[0]);
        outValue.set(splits[1] + "\t" + splits[2] + "\t" + splits[3] + "\t" + splits[4]);
        context.write(outKey, outValue);
    }
}
```
The mapper splits each input line on "\t" and emits the first field (which becomes the row key) as the key, with the remaining fields joined by tabs as the value.
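One caveat (my addition): the mapper assumes every line has exactly five tab-separated fields; a defensive variant might skip malformed lines instead of throwing an ArrayIndexOutOfBoundsException. A minimal sketch of such a map method, reusing the outKey/outValue fields above:

```java
// Sketch: ignore lines that do not have the expected five fields.
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String[] splits = value.toString().split("\t");
    if (splits.length < 5) {          // malformed line: skip it (or count it with a Counter)
        return;
    }
    outKey.set(splits[0]);
    outValue.set(splits[1] + "\t" + splits[2] + "\t" + splits[3] + "\t" + splits[4]);
    context.write(outKey, outValue);
}
```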
Then the reduce function that assembles the rows:
```java
public static class HdfsToHBaseReducer extends TableReducer<Text, Text, NullWritable> {
    public void reduce(Text k2, Iterable<Text> v2s, Context context) throws IOException, InterruptedException {
        Put put = new Put(k2.getBytes());
        for (Text v2 : v2s) {
            String[] splits = v2.toString().split("\t");
            if (splits[0] != null && !"NULL".equals(splits[0])) {
                put.addColumn("f1".getBytes(), "name".getBytes(), splits[0].getBytes());
            }
            if (splits[1] != null && !"NULL".equals(splits[1])) {
                put.addColumn("f1".getBytes(), "age".getBytes(), splits[1].getBytes());
            }
            if (splits[2] != null && !"NULL".equals(splits[2])) {
                put.addColumn("f1".getBytes(), "gender".getBytes(), splits[2].getBytes());
            }
            if (splits[3] != null && !"NULL".equals(splits[3])) {
                put.addColumn("f1".getBytes(), "birthday".getBytes(), splits[3].getBytes());
            }
        }
        context.write(NullWritable.get(), put);
    }
}
```
Note that this reducer extends TableReducer, which comes from HBase's MapReduce support (the org.apache.hadoop.hbase.mapreduce package, pulled in by the HBase server jar); at first I got errors simply because I had not added that jar through Maven! Its generic signature also differs from an ordinary Reducer: it takes only three type parameters, the last being the output key type (NullWritable here), while the first two are the mapper's output key and value types as usual. The context.write call therefore looks different too: the key is NullWritable.get() and the value is the Put.
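As an aside (not the author's approach), TableMapReduceUtil can also wire up the reducer, the target table and the TableOutputFormat in a single call, instead of setting OUTPUT_TABLE and the output format class by hand as the driver below does. A hedged sketch of that alternative driver fragment, with args[0] as the HDFS input path and args[1] as the table name, as in the article:

```java
// Sketch: let TableMapReduceUtil configure the TableReducer, output table and output format.
Job job = Job.getInstance(conf, HdfsToHBase.class.getSimpleName());
job.setJarByClass(HdfsToHBase.class);
job.setMapperClass(HdfsToHBaseMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
TableMapReduceUtil.initTableReducerJob(args[1], HdfsToHBaseReducer.class, job);  // sets reducer + TableOutputFormat + output table
System.exit(job.waitForCompletion(true) ? 0 : 1);
```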
Finally, the complete code:
```java
package com.hbase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapred.TableOutputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import java.io.IOException;

/**
 * Created by sunwangdong on 2017/7/23.
 */
public class HdfsToHBase {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.rootdir", "hdfs://localhost:9000/hbase");   // hbase.rootdir must match the Hadoop/HBase configuration
        conf.set(TableOutputFormat.OUTPUT_TABLE, args[1]);          // name of the target HBase table

        Job job = Job.getInstance(conf, HdfsToHBase.class.getSimpleName());   // job setup
        TableMapReduceUtil.addDependencyJars(job);
        job.setJarByClass(HdfsToHBase.class);

        job.setMapperClass(HdfsToHBaseMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setReducerClass(HdfsToHBaseReducer.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        job.setOutputFormatClass(org.apache.hadoop.hbase.mapreduce.TableOutputFormat.class);   // note the fully qualified output format class

        Boolean b = job.waitForCompletion(true);
        if (!b) {
            System.err.println("failed");
        } else {
            System.out.println("finished!");
        }
    }

    public static class HdfsToHBaseMapper extends Mapper<LongWritable, Text, Text, Text> {
        private Text outKey = new Text();
        private Text outValue = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] splits = value.toString().split("\t");
            outKey.set(splits[0]);
            outValue.set(splits[1] + "\t" + splits[2] + "\t" + splits[3] + "\t" + splits[4]);
            context.write(outKey, outValue);
        }
    }

    public static class HdfsToHBaseReducer extends TableReducer<Text, Text, NullWritable> {
        public void reduce(Text k2, Iterable<Text> v2s, Context context) throws IOException, InterruptedException {
            Put put = new Put(k2.getBytes());
            for (Text v2 : v2s) {
                String[] splits = v2.toString().split("\t");
                if (splits[0] != null && !"NULL".equals(splits[0])) {
                    put.addColumn("f1".getBytes(), "name".getBytes(), splits[0].getBytes());
                }
                if (splits[1] != null && !"NULL".equals(splits[1])) {
                    put.addColumn("f1".getBytes(), "age".getBytes(), splits[1].getBytes());
                }
                if (splits[2] != null && !"NULL".equals(splits[2])) {
                    put.addColumn("f1".getBytes(), "gender".getBytes(), splits[2].getBytes());
                }
                if (splits[3] != null && !"NULL".equals(splits[3])) {
                    put.addColumn("f1".getBytes(), "birthday".getBytes(), splits[3].getBytes());
                }
            }
            context.write(NullWritable.get(), put);
        }
    }
}
```

This code then has to be packaged into a jar. Note that the job must be launched with hadoop jar. Since I build in IDEA, the jar can be generated directly from IDEA; before running, the META-INF/LICENSE entry must be removed from the jar:
zip -d ****.jar META-INF/LICENSE
Since the code above does not create the target table itself, we have to create it in the HBase shell beforehand:
```
hbase(main):006:0> create 'table1','f1'
0 row(s) in 1.4270 seconds

=> Hbase::Table - table1
```
This creates a table named "table1" in HBase with a single column family, "f1".
Now the jar can be run with the hadoop command:
hadoop jar ./***.jar com.hbase.HdfsToHBase /t1/t1 table1
You can then watch the MapReduce job run on Hadoop:
```
localhost:jkxy_jar12 sunwangdong$ hadoop jar ./jkxy.jar com.hbase.HdfsToHBase /t1/t1 table1
17/07/23 11:40:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/07/23 11:40:06 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/07/23 11:40:07 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/07/23 11:40:08 INFO input.FileInputFormat: Total input paths to process : 1
17/07/23 11:40:08 INFO mapreduce.JobSubmitter: number of splits:1
17/07/23 11:40:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1500776717808_0003
17/07/23 11:40:08 INFO impl.YarnClientImpl: Submitted application application_1500776717808_0003
17/07/23 11:40:08 INFO mapreduce.Job: The url to track the job: http://sunwangdongMacBook-Pro.local:8088/proxy/application_1500776717808_0003/
17/07/23 11:40:08 INFO mapreduce.Job: Running job: job_1500776717808_0003
17/07/23 11:40:16 INFO mapreduce.Job: Job job_1500776717808_0003 running in uber mode : false
17/07/23 11:40:16 INFO mapreduce.Job: map 0% reduce 0%
17/07/23 11:40:22 INFO mapreduce.Job: map 100% reduce 0%
17/07/23 11:40:29 INFO mapreduce.Job: map 100% reduce 100%
17/07/23 11:40:29 INFO mapreduce.Job: Job job_1500776717808_0003 completed successfully
17/07/23 11:40:29 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=109
		FILE: Number of bytes written=250471
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=187
		HDFS: Number of bytes written=0
		HDFS: Number of read operations=2
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=0
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=3069
		Total time spent by all reduces in occupied slots (ms)=3801
		Total time spent by all map tasks (ms)=3069
		Total time spent by all reduce tasks (ms)=3801
		Total vcore-seconds taken by all map tasks=3069
		Total vcore-seconds taken by all reduce tasks=3801
		Total megabyte-seconds taken by all map tasks=3142656
		Total megabyte-seconds taken by all reduce tasks=3892224
	Map-Reduce Framework
		Map input records=4
		Map output records=4
		Map output bytes=95
		Map output materialized bytes=109
		Input split bytes=92
		Combine input records=0
		Combine output records=0
		Reduce input groups=4
		Reduce shuffle bytes=109
		Reduce input records=4
		Reduce output records=4
		Spilled Records=8
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=130
		CPU time spent (ms)=0
		Physical memory (bytes) snapshot=0
		Virtual memory (bytes) snapshot=0
		Total committed heap usage (bytes)=347602944
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=95
	File Output Format Counters
		Bytes Written=0
finished!
```
The job completes successfully, and we can see in HBase that the "table1" table we just created now contains data:
```
hbase(main):010:0> scan 'table1'
ROW        COLUMN+CELL
 1         column=f1:age, timestamp=1500781227480, value=10
 1         column=f1:gender, timestamp=1500781227480, value=male
 1         column=f1:name, timestamp=1500781227480, value=zhangsan
 2         column=f1:name, timestamp=1500781227480, value=lisi
 3         column=f1:name, timestamp=1500781227480, value=wangwu
 4         column=f1:birthday, timestamp=1500781227480, value=1993
 4         column=f1:name, timestamp=1500781227480, value=zhaoliu
4 row(s) in 0.0750 seconds
```
With that, importing the table from HDFS into HBase via MapReduce has succeeded!