Importing Hive Data into HBase


There are roughly three ways to import Hive data into HBase:

  1. Create a Hive table whose data is stored directly in HBase. This approach is simple, but once the volume exceeds ten million rows, data skew becomes obvious and throughput drops.
  2. Write a Hive UDF that puts rows into HBase. If the HBase table's regions are pre-split in advance, this direct-put approach performs reasonably well (see the pre-split sketch after this list).
  3. Use MapReduce to generate HFiles and then bulk-load them into HBase. This approach involves many more steps, but it is fast: 30 million rows per minute is easily achievable.
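
For method 2, the regions can be pre-split from the HBase shell when the table is created. A minimal sketch in the same echo-to-hbase-shell style used later in this post; the split keys below are made-up placeholders and should be chosen from the real rowkey distribution:

echo "create 'user_info', {NAME => 'f'}, SPLITS => ['200000000', '400000000', '600000000', '800000000']" | hbase shell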


Below I walk through the MapReduce-generated-HFile approach, which consists of six steps:

1. Generate the range partitioning


This step determines the number of reducers used later when generating the HFiles. For example, the SQL below yields 137556197 / 100 / 22000 ≈ 62 split records, so 63 reducers (split records + 1) are used when generating the HFiles.


select list_no,row_sequence() from ( 
        select 
            list_no, 
            row_sequence() as row 
        from (
            select 
                list_no 
            from user_info  tablesample(bucket 1 out of 100 on list_no) s order by list_no
        ) t order by list_no 
    ) x where (row % 22000)=0 order by list_no ;
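
Note that row_sequence() is not a Hive built-in; it has to be registered from the hive-contrib jar first, and the split keys produced above must then be written into a TotalOrderPartitioner-readable partition file (the hbase_splits / hbase_splits_file tables in the full script at the end of this post handle that part). A minimal registration sketch; the jar path is an assumption, the script locates it with ls:

add jar /software/hive/lib/hive-contrib.jar;   -- exact path/version is an assumption
create temporary function row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';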




2. Create the HFile staging table
This external table writes its data through HiveHFileOutputFormat.


CREATE EXTERNAL TABLE IF NOT EXISTS hbase_user_info(
  list_no string,
  asset_id double,
  ...
)
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
TBLPROPERTIES('hfile.family.path' = '/tmp-data/user/hbase/hbase_hfiles/user_info/f');
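
The last path component of hfile.family.path (f here) is interpreted as the HBase column family, so it must match the family created in step 4. After step 3 runs, the generated HFiles can be inspected under that directory, for example:

hadoop fs -ls /tmp-data/user/hbase/hbase_hfiles/user_info/f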


3. Generate the HFiles


INSERT OVERWRITE TABLE hbase_user_info  
SELECT * FROM user_info CLUSTER BY list_no;
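
If this INSERT is run on its own rather than through the script at the end of this post, the bulk-load settings from step 3 of that script must be applied in the same Hive session first. A sketch: the jar paths are assumptions, 63 is the reducer count from step 1, and the partition-file path is the one the script writes out:

ADD JAR /software/hbase/hbase-0.94.11.jar;                 -- hbase jar
ADD JAR /software/hive/lib/hive-hbase-handler.jar;         -- exact path/version is an assumption
SET mapred.reduce.tasks=63;                                -- number of split records + 1
SET total.order.partitioner.path=/tmp-data/user/hbase/hbase_splits_file/table=user_info/000000_0;
SET hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;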


4. Create the HBase table


echo "create 'user_info', {NAME => 'f', COMPRESSION => 'GZ'}" | hbase shell  


5. Bulk-load the HFiles into HBase


hadoop jar /appcom/hbase/hbase-0.94.11.jar  completebulkload  /tmp-data/user/hbase/hbase_hfiles/user_info user_info
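
The completebulkload driver is a wrapper around LoadIncrementalHFiles, so the load can equally be invoked through the hbase launcher. Keep in mind that the user running the load needs HDFS permissions on the HFile directory, since the files are moved rather than copied into HBase:

hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp-data/user/hbase/hbase_hfiles/user_info user_info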


6. Verify that the data arrived


echo "scan 'user_info', { LIMIT => 1 }" | hbase shell


Below is a shell script that strings the six steps above together:

#!/bin/bash
hive_table=user_info
rowkey=id
hfile_path=/tmp-data/user/hbase/hbase_hfiles/
hbase_jar=$(echo "$(ls -l /software/hbase/*.jar | grep -v test)"|awk '{print $9}')
hive_contrib_jar=$(echo "$(ls -l /software/hive/lib/hive-contrib*.jar | grep -v test)"|awk '{print $9}')
hive_hbase_handler_jar=$(echo "$(ls -l /software/hive/lib/hive-hbase-handler*.jar | grep -v test)"|awk '{print $9}')

echo "##################################[step 1 generate splites]#####################################"
hive -e "
    use user_db;
    CREATE EXTERNAL TABLE IF NOT EXISTS hbase_splits(partition STRING, count int)
    PARTITIONED BY (table STRING);

    add jar ${hive_contrib_jar};
    create temporary function row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';

    INSERT OVERWRITE TABLE hbase_splits
    PARTITION (table='${hive_table}')
    select ${rowkey},row_sequence() from (
        select
            ${rowkey},
            row_sequence() as row
        from (
            select
                ${rowkey}
            from ${hive_table} tablesample(bucket 1 out of 100 on ${rowkey}) s order by ${rowkey}
        ) t order by ${rowkey}
    ) x where (row % 22000)=0 order by ${rowkey};

    CREATE EXTERNAL TABLE IF NOT EXISTS hbase_splits_file(partition STRING)
    PARTITIONED BY (table STRING)
    ROW FORMAT
      SERDE 'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
    STORED AS
      INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
    LOCATION '/tmp-data/user/hbase/hbase_splits_file';

    INSERT OVERWRITE TABLE hbase_splits_file
    PARTITION (table='${hive_table}')
    select partition from hbase_splits where table='${hive_table}';"

echo "##################################[step 2 create hfile table ]#####################################"
echo "DEBUG: table name is: "${hive_table}
sql_select_col="CREATE EXTERNAL TABLE IF NOT EXISTS hbase_${hive_table}("
desc_table_cols=$(hive -e "
    use user_db;
    desc ${hive_table};")
desc_table_cols=$(echo "${desc_table_cols}" | grep -Ev "^$|^#.*$|^col_name.*$")
temp_file=`mktemp -u temp.user.XXXXXX.$$`
echo "$desc_table_cols" > ${temp_file}
while read line
do
    lgt=$(expr length "${line}")
    if [ ${lgt} -eq 0 ];
    then
        break
    fi;
    col_name=$(echo "${line}"|awk -F ' ' '{print $1}')
    col_type=$(echo "${line}"|awk -F ' ' '{print $2}')
    sql_select_col="${sql_select_col}${col_name} ${col_type},";
done < ${temp_file}
rm -rf ${temp_file}
len=$(expr length "${sql_select_col}")
let "len = len - 1"
sql_select_col=$(echo ${sql_select_col}|cut -c1-$len)
sql_select_col=${sql_select_col}") STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat' TBLPROPERTIES('hfile.family.path' = '${hfile_path}${hive_table}/f');"
echo "DEBUG: cols:"${sql_select_col}
hive -e "use user_db;
    ${sql_select_col};"

echo "##################################[step 3 create hfile ]#####################################"
task_num=$(
    hive -e "
        use user_db;
        select max(count) + 1 from hbase_splits where table='${hive_table}';
    ")
task_num_str=$(echo ${task_num})
num=$(echo "${task_num_str}" | awk '{print $2}')
echo ${num}
hive -e "
ADD JAR ${hbase_jar};
ADD JAR ${hive_hbase_handler_jar};
USE user_db;
SET mapred.reduce.tasks=${num};
SET total.order.partitioner.path=/tmp-data/user/hbase/hbase_splits_file/table=${hive_table}/000000_0;
SET hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
set hive.optimize.sampling.orderby=true;
set hive.optimize.sampling.orderby.number=10000000;
set hive.optimize.sampling.orderby.percent=0.1f;
INSERT OVERWRITE TABLE hbase_${hive_table}
SELECT * FROM ${hive_table} CLUSTER BY ${rowkey};"
status=$?
echo status=${status}
if [ ${status} -eq 0 ];
then
    echo "##################################[step 4 create hbase table ]#####################################"
    # create 'testtable', { NAME => 'colfam1', COMPRESSION => 'GZ' }
    echo "create '${hive_table}', {NAME => 'f', COMPRESSION => 'GZ'}" | hbase shell

    echo "##################################[step 5 move hfile to hbase ]#####################################"
    hadoop jar ${hbase_jar} completebulkload ${hfile_path}${hive_table} ${hive_table}

    echo "##################################[step 6 test ]#####################################"
    echo "scan '${hive_table}', { LIMIT => 1 }" | hbase shell
else
    echo "ERROR:@@@@@@     generate hfile error       @@@@@@";
    exit -1;
fi


As a rule of thumb, throughput is best when each reduce handles roughly 1-2 GB of data. The modulo value used in step 1 can be derived as follows:

Taking user_info as an example and assuming 1.5 GB per reduce: (591999730 / 100) / (63208973969 / 1024 / 1024 / 1024 / 1.5) ≈ 150846, and this value replaces 22000 in step 1, i.e. (row % 150846).

Here 63208973969 is the table size in bytes obtained with hadoop fs -du, 591999730 is the total record count, and the division by 100 matches the 1-out-of-100 tablesample in step 1.
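
A throwaway shell sketch of that arithmetic; row_count and table_bytes are the two measured inputs, gb_per_reduce is the 1.5 GB target:

row_count=591999730                     # total records in the Hive table
table_bytes=63208973969                 # from: hadoop fs -du on the table's HDFS directory
gb_per_reduce=1.5                       # target data size per reduce
sample_fraction=100                     # matches "bucket 1 out of 100" in step 1
modulo=$(awk -v r=$row_count -v b=$table_bytes -v g=$gb_per_reduce -v s=$sample_fraction \
    'BEGIN { printf "%d\n", (r/s) / (b/1024/1024/1024/g) }')
echo $modulo                            # ~150846 for the numbers above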



Problems you may run into:

a)  Can't read partitions file: set SET mapreduce.totalorderpartitioner.path=/tmp/hbase_splits; or SET total.order.partitioner.path=/tmp/hbase_splits; the former is the property name used on Hadoop 2.

b)  java.io.IOException: Added a key not lexically larger than previous key: the rowkey probably contains duplicates.

c)  "No files found..." in the reduce phase: the region covered by that range partition contains no data.

d)  Wrong number of partitions in keyset: the number of entries in the range partitioning does not match the later SET mapred.reduce.tasks=43; the number of range-partitioning entries should be mapred.reduce.tasks minus 1.
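
To avoid (d), derive the reducer count from the split table instead of hard-coding it, as the full script above does. A sketch, assuming the hbase_splits table from the script; the awk field index depends on what else hive prints around the result:

task_num=$(hive -e "use user_db; select max(count) + 1 from hbase_splits where table='user_info';")
num=$(echo ${task_num} | awk '{print $2}')    # the script picks the second whitespace-separated token
echo "will use SET mapred.reduce.tasks=${num};"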

