Loading Hive Data into HBase
Source: Internet · Editor: 程序博客网 · Date: 2024/05/01 14:51
Test data: http://www.nber.org/patents/apat63_99.zip
Test environment: hadoop-2.3 + hive-0.13.1 + hbase-0.98.4
Measured throughput: a Hive table with 6 columns and 600 million rows loads in about half an hour
- Create hfile.hql
drop table hbase_splits;
CREATE EXTERNAL TABLE IF NOT EXISTS hbase_splits(partition STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
LOCATION '/tmp/hbase_splits_out';

-- create a location to store the resulting HFiles
drop table hbase_hfiles;
CREATE TABLE IF NOT EXISTS hbase_hfiles(rowkey STRING, pageviews STRING, bytes STRING)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
TBLPROPERTIES('hfile.family.path' = '/tmp/hbase_hfiles/w');

ADD JAR /root/hive-0.13.1/lib/hive-contrib-0.13.1.jar;
SET mapred.reduce.tasks=1;
CREATE TEMPORARY FUNCTION row_seq AS 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';

-- input file contains ~4mm records. Sample it so as to produce 5 input splits.
INSERT OVERWRITE TABLE hbase_splits
SELECT PATENT FROM (
  SELECT PATENT, row_seq() AS seq
  FROM apat TABLESAMPLE(BUCKET 1 OUT OF 1000 ON PATENT) s
  ORDER BY PATENT
  LIMIT 10000000
) x
WHERE (seq % 300) = 0
ORDER BY PATENT
LIMIT 4;

-- after this is finished, combine the splits file:
dfs -rmr /tmp/hbase_splits;
dfs -cp /tmp/hbase_splits_out/* /tmp/hbase_splits;

ADD JAR /root/hbase-0.98.4-hadoop2/lib/hbase-client-0.98.4-hadoop2.jar;
ADD JAR /root/hbase-0.98.4-hadoop2/lib/hbase-common-0.98.4-hadoop2.jar;
ADD JAR /root/hbase-0.98.4-hadoop2/lib/hbase-examples-0.98.4-hadoop2.jar;
ADD JAR /root/hbase-0.98.4-hadoop2/lib/hbase-hadoop2-compat-0.98.4-hadoop2.jar;
ADD JAR /root/hbase-0.98.4-hadoop2/lib/hbase-hadoop-compat-0.98.4-hadoop2.jar;
ADD JAR /root/hbase-0.98.4-hadoop2/lib/hbase-it-0.98.4-hadoop2.jar;
ADD JAR /root/hbase-0.98.4-hadoop2/lib/hbase-prefix-tree-0.98.4-hadoop2.jar;
ADD JAR /root/hbase-0.98.4-hadoop2/lib/hbase-protocol-0.98.4-hadoop2.jar;
ADD JAR /root/hbase-0.98.4-hadoop2/lib/hbase-server-0.98.4-hadoop2.jar;
ADD JAR /root/hbase-0.98.4-hadoop2/lib/hbase-shell-0.98.4-hadoop2.jar;
ADD JAR /root/hbase-0.98.4-hadoop2/lib/hbase-testing-util-0.98.4-hadoop2.jar;
ADD JAR /root/hbase-0.98.4-hadoop2/lib/hbase-thrift-0.98.4-hadoop2.jar;
ADD JAR /root/hive-0.13.1/lib/hive-hbase-handler-0.13.1.jar;

SET mapred.reduce.tasks=5;
-- SET total.order.partitioner.path=/tmp/hbase_splits;  (Hadoop 1 property name)
SET mapreduce.totalorderpartitioner.path=/tmp/hbase_splits;
SET hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

-- generate hfiles using the splits ranges
INSERT OVERWRITE TABLE hbase_hfiles
SELECT PATENT, GYEAR, GDATE FROM apat
CLUSTER BY PATENT;
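The script above reads from a source table apat that it never creates. A minimal sketch of setting it up from the downloaded test data follows; this is hypothetical, since the real apat63_99 file has more columns than the three used here, so adjust the schema and delimiter to match the actual file:

```sql
-- Hypothetical source-table setup; extend the column list to match apat63_99.txt.
CREATE TABLE IF NOT EXISTS apat(PATENT STRING, GYEAR STRING, GDATE STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- after unzipping apat63_99.zip and copying the txt file to HDFS:
LOAD DATA INPATH '/tmp/apat63_99.txt' OVERWRITE INTO TABLE apat;
```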
- Create the HBase table
#echo "create 'apat','w'" | hbase shell
- Load the data
#hive -f hfile.hql
#hadoop jar /root/hbase-0.98.4-hadoop2/lib/hbase-server-0.98.4-hadoop2.jar completebulkload /tmp/hbase_hfiles apat
# echo "scan 'apat', { LIMIT => 2 }" | hbase shell
- Problems you may run into:
- Can't read partitions file: replace SET total.order.partitioner.path=/tmp/hbase_splits; with SET mapreduce.totalorderpartitioner.path=/tmp/hbase_splits; (the former is the Hadoop 1 property name). The error looks like this:
Diagnostic Messages for this Task:
Error: java.lang.IllegalArgumentException: Can't read partitions file
    at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:116)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.hadoop.mapred.MapTask$OldOutputCollector.<init>(MapTask.java:569)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.io.FileNotFoundException: File file:/home/hadoop/hadoopData/tmp/nm-local-dir/usercache/root/appcache/application_1412175430384_0093/container_1412175430384_0093_01_000008/_partition.lst does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:722)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1749)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1773)
    at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.readPartitions(TotalOrderPartitioner.java:301)
    at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:88)
    ... 10 more
- java.io.IOException: Added a key not lexically larger than previous key: the rowkey is probably set incorrectly, or there are duplicate rowkeys.
- Split points are out of order: the splits file is not sorted.
- No files found... in the reduce phase: there is no data in the key range covered by hbase_splits.
- Wrong number of partitions in keyset: the number of entries in hbase_splits does not match the later SET mapred.reduce.tasks=5; hbase_splits must contain exactly mapred.reduce.tasks - 1 entries.
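The off-by-one above is the TotalOrderPartitioner invariant: N split points cut the rowkey space into N + 1 ranges, one per reducer. A trivial shell check of the arithmetic used in this post (4 rows in hbase_splits, so 5 reducers):

```shell
# 4 split points (the LIMIT 4 in hfile.hql) define 5 key ranges,
# so mapred.reduce.tasks must be set to 5
split_points=4
reducers=$((split_points + 1))
echo "mapred.reduce.tasks=$reducers"
```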
- Class not found: org.apache.hadoop.hbase.filter.Filter during the bulk load: link (or copy) the jars HBase needs into /root/hadoop-2.3.0/share/hadoop/common/lib, as follows:
cd /root/hadoop-2.3.0/share/hadoop/common/lib
ln -s /root/hbase-0.98.4-hadoop2/lib/hbase-client-0.98.4-hadoop2.jar hbase-client-0.98.4-hadoop2.jar
ln -s /root/hbase-0.98.4-hadoop2/lib/hbase-common-0.98.4-hadoop2.jar hbase-common-0.98.4-hadoop2.jar
ln -s /root/hbase-0.98.4-hadoop2/lib/hbase-examples-0.98.4-hadoop2.jar hbase-examples-0.98.4-hadoop2.jar
ln -s /root/hbase-0.98.4-hadoop2/lib/hbase-hadoop2-compat-0.98.4-hadoop2.jar hbase-hadoop2-compat-0.98.4-hadoop2.jar
ln -s /root/hbase-0.98.4-hadoop2/lib/hbase-hadoop-compat-0.98.4-hadoop2.jar hbase-hadoop-compat-0.98.4-hadoop2.jar
ln -s /root/hbase-0.98.4-hadoop2/lib/hbase-it-0.98.4-hadoop2.jar hbase-it-0.98.4-hadoop2.jar
ln -s /root/hbase-0.98.4-hadoop2/lib/hbase-prefix-tree-0.98.4-hadoop2.jar hbase-prefix-tree-0.98.4-hadoop2.jar
ln -s /root/hbase-0.98.4-hadoop2/lib/hbase-protocol-0.98.4-hadoop2.jar hbase-protocol-0.98.4-hadoop2.jar
ln -s /root/hbase-0.98.4-hadoop2/lib/hbase-server-0.98.4-hadoop2.jar hbase-server-0.98.4-hadoop2.jar
ln -s /root/hbase-0.98.4-hadoop2/lib/hbase-shell-0.98.4-hadoop2.jar hbase-shell-0.98.4-hadoop2.jar
ln -s /root/hbase-0.98.4-hadoop2/lib/hbase-thrift-0.98.4-hadoop2.jar hbase-thrift-0.98.4-hadoop2.jar
ln -s /root/hbase-0.98.4-hadoop2/lib/htrace-core-2.04.jar htrace-core-2.04.jar
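The one-by-one ln -s commands can be collapsed into a loop. A sandboxed sketch follows; the mktemp directories stand in for the real /root/hbase-0.98.4-hadoop2/lib and /root/hadoop-2.3.0/share/hadoop/common/lib paths, which are specific to this post's layout:

```shell
# Sandbox demo of the symlink step; on a real node replace the mktemp dirs
# with /root/hbase-0.98.4-hadoop2/lib and /root/hadoop-2.3.0/share/hadoop/common/lib.
hbase_lib=$(mktemp -d)     # stands in for the HBase lib directory
hadoop_lib=$(mktemp -d)    # stands in for hadoop's common/lib directory
touch "$hbase_lib/hbase-client-0.98.4-hadoop2.jar" \
      "$hbase_lib/htrace-core-2.04.jar"          # stand-in jar files
cd "$hadoop_lib"
for jar in "$hbase_lib"/*.jar; do                # link every jar by its basename
  ln -s "$jar" "$(basename "$jar")"
done
ls "$hadoop_lib"
```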
References:
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.2/bk_user-guide/content/user-guide-hbase-import-1.html
https://cwiki.apache.org/confluence/display/Hive/HBaseBulkLoad