Learning Hadoop, Lesson 25 (Single-Node HBase: Creating a Table, Inserting Data, and Querying)


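       In the previous lesson we covered some HBase theory and set up a single-node HBase. In this lesson we will learn how to create a table in HBase, how to insert data, and how to query it. If you have not read the previous lesson, you can find it in this blog post: http://blog.csdn.net/u012453843/article/details/52970967.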

Part 1: Creating a Table

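       We will create the table, insert data, and query it from the command line. Before creating a table, make sure the HBase process is running. We check with the jps command. As shown below, there is no HMaster process, so HBase has not been started.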

[root@itcast03 ~]# jps
5408 Jps
[root@itcast03 ~]#

       Since HBase is not running, we start it first and then check the processes again. As shown below, HBase is now running.

[root@itcast03 ~]# cd /itcast/hbase-0.98.23-hadoop2/
[root@itcast03 hbase-0.98.23-hadoop2]# ls
bin  CHANGES.txt  conf  docs  hbase-webapps  LEGAL  lib  LICENSE.txt  logs  NOTICE.txt  README.txt
[root@itcast03 hbase-0.98.23-hadoop2]# cd bin
[root@itcast03 bin]# ls
get-active-master.rb  hbase-cleanup.sh  hbase-config.cmd  hbase-daemons.sh  local-master-backup.sh  region_mover.rb   replication   start-hbase.cmd  stop-hbase.sh   zookeepers.sh    graceful_stop.sh      hbase.cmd         hbase-config.sh   hbase-jruby       local-regionservers.sh  regionservers.sh  rolling-restart.sh        start-hbase.sh   test
hbase   hbase-common.sh   hbase-daemon.sh   hirb.rb           master-backup.sh        region_status.rb  shutdown_regionserver.rb  stop-hbase.cmd   thread-pool.rb
[root@itcast03 bin]# ./start-hbase.sh
starting master, logging to /itcast/hbase-0.98.23-hadoop2/bin/../logs/hbase-root-master-itcast03.out
[root@itcast03 bin]# jps
5537 HMaster
5623 Jps

       With HBase running, we operate it through its shell. As shown below, we have entered shell mode.

[root@itcast03 bin]# ./hbase shell
2016-10-30 16:53:34,888 INFO  [main] Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.98.23-hadoop2, r44c724b56dc1431209f561cb997fce805f9f45f9, Wed Oct  5 01:05:05 UTC 2016
hbase(main):001:0>
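       So what commands does HBase provide? Type help and the output is as shown below. The commands are organized into groups such as general, ddl, and namespace. The groups we use most often are ddl and dml.
       So what do ddl and dml stand for?
       DDL (Data Definition Language) is used to define the three-level structure of a database, including the external schema, the conceptual schema, the internal schema, and the mappings between them, and to define constraints such as data integrity and security controls. In relational databases, DDL does not require a commit. Commonly used commands are alter (modify a table), create (create a table), describe (show a table's structure), drop (delete a table), and list (list all tables). All of them operate on tables.
       DML (Data Manipulation Language) is used by users or programmers to operate on the data in a database. DML is divided into interactive DML and embedded DML, and by language level it can also be divided into procedural and non-procedural DML. In relational databases, DML requires a commit; the HBase shell has no commit command, and each put or delete takes effect as soon as it runs. Commonly used commands are scan (full table scan, equivalent to select *), get (fetch one record), put (insert data into a table), delete (delete data from a table), and so on. All of them operate on data.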
hbase(main):001:0> help
HBase Shell, version 0.98.23-hadoop2, r44c724b56dc1431209f561cb997fce805f9f45f9, Wed Oct  5 01:05:05 UTC 2016
Type 'help "COMMAND"', (e.g. 'help "get"' -- the quotes are necessary) for help on a specific command.
Commands are grouped. Type 'help "COMMAND_GROUP"', (e.g. 'help "general"') for help on a command group.
COMMAND GROUPS:
  Group name: general
  Commands: processlist, status, table_help, version, whoami
  Group name: ddl
  Commands: alter, alter_async, alter_status, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, get_table, is_disabled, is_enabled, list, show_filters
  Group name: namespace
  Commands: alter_namespace, create_namespace, describe_namespace, drop_namespace, list_namespace, list_namespace_tables
  Group name: dml
  Commands: append, count, delete, deleteall, get, get_counter, get_splits, incr, put, scan, truncate, truncate_preserve
  Group name: tools
  Commands: assign, balance_switch, balancer, balancer_enabled, catalogjanitor_enabled, catalogjanitor_run, catalogjanitor_switch, close_region, compact, compact_rs, flush, hlog_roll, major_compact, merge_region, move, split, trace, unassign, zk_dump
  Group name: replication
  Commands: add_peer, disable_peer, disable_table_replication, enable_peer, enable_table_replication, get_peer_config, list_peer_configs, list_peers, list_replicated_tables, remove_peer, set_peer_tableCFs, show_peer_tableCFs, update_peer_config
  Group name: snapshots
  Commands: clone_snapshot, delete_all_snapshot, delete_snapshot, delete_table_snapshots, list_snapshots, list_table_snapshots, restore_snapshot, snapshot
  Group name: security
  Commands: grant, list_security_capabilities, revoke, user_permission
  Group name: visibility labels
  Commands: add_labels, clear_auths, get_auths, list_labels, set_auths, set_visibility
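       Now we want to create a new table. How do we do that? If we are not sure, we can use the built-in help. The command for creating a table is create, so we run help 'create' to see its help text, shown below. The Examples in the output are the create statements we can follow.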
hbase(main):001:0> help 'create'
Creates a table. Pass a table name, and a set of column family
specifications (at least one), and, optionally, table configuration.
Column specification can be a simple string (name), or a dictionary
(dictionaries are described below in main help output), necessarily
including NAME attribute.
Examples:
Create a table with namespace=ns1 and table qualifier=t1
  hbase> create 'ns1:t1', {NAME => 'f1', VERSIONS => 5}
Create a table with namespace=default and table qualifier=t1
  hbase> create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}
  hbase> # The above in shorthand would be the following:
  hbase> create 't1', 'f1', 'f2', 'f3'
  hbase> create 't1', {NAME => 'f1', VERSIONS => 1, TTL => 2592000, BLOCKCACHE => true}
  hbase> create 't1', {NAME => 'f1', CONFIGURATION => {'hbase.hstore.blockingStoreFiles' => '10'}}

 
Table configuration options can be put at the end.
Examples:
  hbase> create 'ns1:t1', 'f1', SPLITS => ['10', '20', '30', '40']
  hbase> create 't1', 'f1', SPLITS => ['10', '20', '30', '40']
  hbase> create 't1', 'f1', SPLITS_FILE => 'splits.txt', OWNER => 'johndoe'
  hbase> create 't1', {NAME => 'f1', VERSIONS => 5}, METADATA => { 'mykey' => 'myvalue' }
  hbase> # Optionally pre-split the table into NUMREGIONS, using
  hbase> # SPLITALGO ("HexStringSplit", "UniformSplit" or classname)
  hbase> create 't1', 'f1', {NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}
  hbase> create 't1', 'f1', {NUMREGIONS => 15, SPLITALGO => 'HexStringSplit', CONFIGURATION => {'hbase.hregion.scan.loadColumnFamiliesOnDemand' => 'true'}}
  hbase> create 't1', {NAME => 'f1', DFS_REPLICATION => 1}
You can also keep around a reference to the created table:
  hbase> t1 = create 't1', 'f1'
Which gives you a reference to the table named 't1', on which you can then
call methods.
hbase(main):002:0>
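       With the help text above, we can now create a table, as shown below. Here is what the statement means. create creates the table, and 'student' is the table name. {NAME => 'info', VERSIONS => 3} defines a column family; a table must have at least one column family and may have several. NAME => 'info' names the column family, and VERSIONS => 3 means this column family keeps up to three versions of a value; when there are more than three, the oldest version is removed (more on this later). Likewise, {NAME => 'data', VERSIONS => 1} defines a second column family named 'data' that keeps only one version.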
hbase(main):002:0> create 'student', {NAME => 'info', VERSIONS => 3}, {NAME => 'data', VERSIONS =>1}
       After running the create statement, we use the list command to check whether the student table was created. As shown below, it was.
hbase(main):001:0> list
TABLE                                                                                                                                                                                                      
student                                                                                                                                                                                                    
1 row(s) in 1.0400 seconds
=> ["student"]
hbase(main):002:0>
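       Now we want to look at the structure of this table, so we run describe 'student', as shown below. The table does have two column families, data and info, with VERSIONS of 1 and 3 respectively, so the table was created as intended.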
hbase(main):003:0> describe 'student'
Table student is ENABLED                                                                                                                                                                                   
student                                                                                                                                                                                                    
COLUMN FAMILIES DESCRIPTION                                                                                                                                                                                
{NAME => 'data', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
{NAME => 'info', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
2 row(s) in 0.0980 seconds
hbase(main):004:0>
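The relationship between a table, its column families, and each family's VERSIONS setting can be sketched in a few lines of Python. This is only a toy model written for illustration, not the HBase API: the functions create_table and describe are hypothetical names that mirror the shell commands, and the default of 1 version for a family that omits VERSIONS is an assumption of this sketch.

```python
# Toy model of a table schema: a table name plus column families,
# each with its own VERSIONS limit. This is not the HBase API.

def create_table(name, *families):
    """Build a schema from family specs such as {'NAME': 'info', 'VERSIONS': 3}."""
    if not families:
        # Mirrors the rule that a table needs at least one column family.
        raise ValueError("a table needs at least one column family")
    schema = {"name": name, "families": {}}
    for family in families:
        # Assumption of this sketch: VERSIONS defaults to 1 when omitted.
        schema["families"][family["NAME"]] = family.get("VERSIONS", 1)
    return schema


def describe(schema):
    """Return (family, versions) pairs sorted by family name."""
    return sorted(schema["families"].items())


student = create_table(
    "student",
    {"NAME": "info", "VERSIONS": 3},
    {"NAME": "data", "VERSIONS": 1},
)
print(describe(student))  # [('data', 1), ('info', 3)]
```

As in the describe output above, the families come back sorted by name, so data is listed before info.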
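Part 2: Inserting Data into the Table and Querying
       To insert data into a table we use put, one of the dml commands. So what goes after the put command? We can look at the help for put. The help output lists the forms put accepts; the one we use throughout this lesson is put 't1', 'r1', 'c1', 'value'.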
hbase(main):001:0> help 'put'
Put a cell 'value' at specified table/row/column and optionally
timestamp coordinates.  To put a cell value into table 'ns1:t1' or 't1'
at row 'r1' under column 'c1' marked with the time 'ts1', do:
  hbase> put 'ns1:t1', 'r1', 'c1', 'value'
  hbase> put 't1', 'r1', 'c1', 'value'
  hbase> put 't1', 'r1', 'c1', 'value', ts1
  hbase> put 't1', 'r1', 'c1', 'value', {ATTRIBUTES=>{'mykey'=>'myvalue'}}
  hbase> put 't1', 'r1', 'c1', 'value', ts1, {ATTRIBUTES=>{'mykey'=>'myvalue'}}
  hbase> put 't1', 'r1', 'c1', 'value', ts1, {VISIBILITY=>'PRIVATE|SECRET'}
The same commands also can be run on a table reference. Suppose you had a reference
t to table 't1', the corresponding command would be:
  hbase> t.put 'r1', 'c1', 'value', ts1, {ATTRIBUTES=>{'mykey'=>'myvalue'}}
hbase(main):002:0>
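       Now that we know the insert statement, we insert one piece of data into the student table, as shown below. Here is what the statement means. put inserts data. 'student' is the table name, so we are inserting into the student table. 'rk0001' is the row key, which can be thought of as the unique identifier of a row. 'info:name' identifies a column, which is made up of a column family and a column name: info is the column family and name is the column name. Together with the row key, it locates a cell. 'tom' is the value of name. We could also specify a timestamp; since we did not, the system generates one for us.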
hbase(main):002:0> put 'student', 'rk0001', 'info:name', 'tom'
0 row(s) in 0.3440 seconds
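       After the insert succeeds, we look at the contents of the student table. scan performs a full table scan, equivalent to select * in a relational database, as shown below. The data we inserted is there, so the put statement worked.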
hbase(main):003:0> scan 'student'
ROW                                        COLUMN+CELL
 rk0001                                    column=info:name, timestamp=1477832018237, value=tom
1 row(s) in 0.1460 seconds
hbase(main):004:0>
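One way to picture what put and scan are doing is to treat the table as a map from (row key, column) to a set of timestamped values. The sketch below is a toy model under that assumption, not HBase code; put and scan are hypothetical Python functions named after the shell commands, and the timestamp is passed in explicitly (the value used is the one from the scan output above) so the result is reproducible.

```python
# Toy model: a table maps (row key, "family:qualifier") to {timestamp: value}.
# This is an illustration of the data model, not how HBase stores data.

def put(table, row, column, value, ts):
    """Store one value at the coordinates (row, column, timestamp)."""
    if ":" not in column:
        raise ValueError("column must look like 'family:qualifier'")
    table.setdefault((row, column), {})[ts] = value


def scan(table):
    """Return the newest version of every column, ordered by row key, then column."""
    result = []
    for row, column in sorted(table):
        versions = table[(row, column)]
        newest = max(versions)
        result.append((row, column, newest, versions[newest]))
    return result


student = {}
put(student, "rk0001", "info:name", "tom", ts=1477832018237)
for row, column, ts, value in scan(student):
    print(row, "column=" + column, "timestamp=%d" % ts, "value=" + value)
```

Sorting by the plain column string is a simplification that is good enough for the column names used in this lesson.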
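       Now we want to add more attributes to the student, for example a gender column. As shown below, inserting more data under the row key rk0001 works without any problem, which is one way HBase differs from a relational database. A column in HBase is made up of a column family and a column name; here we add a column named gender under the info column family, with the value male. After the insert we scan the student table again. There are two rk0001 entries under ROW, but what we see is a single row of data, as the final line 1 row(s) in 0.0450 seconds makes clear. A simple way to remember this: entries with the same row key belong to the same row. Both entries in the scan output have the row key rk0001, so they are one row.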
hbase(main):004:0> put 'student', 'rk0001', 'info:gender', 'male'
0 row(s) in 0.3860 seconds
hbase(main):005:0> scan 'student'
ROW                                      COLUMN+CELL
 rk0001                                  column=info:gender, timestamp=1477832942394, value=male
 rk0001                                  column=info:name, timestamp=1477832018237, value=tom
1 row(s) in 0.0450 seconds
hbase(main):006:0>
       We add another attribute to the student, this time age, as shown below.

hbase(main):002:0> put 'student', 'rk0001', 'info:age', '20'
0 row(s) in 0.2780 seconds
hbase(main):003:0> scan 'student'
ROW                                        COLUMN+CELL
 rk0001                                     column=info:age, timestamp=1477833920512, value=20
 rk0001                                     column=info:gender, timestamp=1477832942394, value=male
 rk0001                                     column=info:name, timestamp=1477832018237, value=tom
1 row(s) in 0.0390 seconds
hbase(main):004:0>
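The reason three output lines are reported as 1 row(s) can be shown with a short sketch: the scan prints one line per cell, while the row count is the number of distinct row keys. The code below is an illustration only; count_rows and group_by_row are hypothetical helpers, and the tuples simply copy the scan output above.

```python
# Each tuple is one line of scan output: (row key, column, timestamp, value).
cells = [
    ("rk0001", "info:age", 1477833920512, "20"),
    ("rk0001", "info:gender", 1477832942394, "male"),
    ("rk0001", "info:name", 1477832018237, "tom"),
]


def count_rows(cells):
    """A row is identified by its row key, so count distinct row keys."""
    return len({row for row, _column, _ts, _value in cells})


def group_by_row(cells):
    """Collect the cells of each row into one dict keyed by column."""
    rows = {}
    for row, column, _ts, value in cells:
        rows.setdefault(row, {})[column] = value
    return rows


print(count_rows(cells))  # 1
print(group_by_row(cells))
```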
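       We have now added three pieces of data to rk0001. Next we add one to the other column family, data. This time we insert under the row key rk0002, with column family data, column name score, and value 99. After the insert we look at the current data in the student table and see that there are now 2 rows (2 row(s) in 0.0580 seconds), because the row key is different.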
hbase(main):001:0> put 'student', 'rk0002', 'data:score', '99'
0 row(s) in 0.2980 seconds
hbase(main):002:0> scan 'student'
ROW                                            COLUMN+CELL
 rk0001                                         column=info:age, timestamp=1477833920512, value=20
 rk0001                                         column=info:gender, timestamp=1477832942394, value=male
 rk0001                                         column=info:name, timestamp=1477832018237, value=tom
 rk0002                                         column=data:score, timestamp=1477834144805, value=99
2 row(s) in 0.0580 seconds
hbase(main):003:0>
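       Now suppose we realize that the data we just inserted under the row key rk0002 is wrong and we want to delete it. How do we do that? Again we turn to the help, shown below. We can use a statement of the form delete 't1', 'r1', 'c1', ts1.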
hbase(main):001:0> help 'delete'
Put a delete cell value at specified table/row/column and optionally
timestamp coordinates.  Deletes must match the deleted cell's
coordinates exactly.  When scanning, a delete cell suppresses older
versions. To delete a cell from  't1' at row 'r1' under column 'c1'
marked with the time 'ts1', do:
  hbase> delete 'ns1:t1', 'r1', 'c1', ts1
  hbase> delete 't1', 'r1', 'c1', ts1
  hbase> delete 't1', 'r1', 'c1', ts1, {VISIBILITY=>'PRIVATE|SECRET'}
The same command can also be run on a table reference. Suppose you had a reference
t to table 't1', the corresponding command would be:
  hbase> t.delete 'r1', 'c1',  ts1
  hbase> t.delete 'r1', 'c1',  ts1, {VISIBILITY=>'PRIVATE|SECRET'}
hbase(main):002:0>
       Now that we know the command, we run the delete (note: the final timestamp argument must not be enclosed in quotes).
hbase(main):001:0> delete 'student', 'rk0002', 'data:score', 1477834144805
0 row(s) in 0.3230 seconds
       After the delete, we look at the contents of the student table again. As shown below, the data we added under the row key rk0002 is gone.
hbase(main):002:0> scan 'student'
ROW                                                  COLUMN+CELL
 rk0001                                              column=info:age, timestamp=1477833920512, value=20
 rk0001                                              column=info:gender, timestamp=1477832942394, value=male 
 rk0001                                              column=info:name, timestamp=1477832018237, value=tom
1 row(s) in 0.0480 seconds
hbase(main):003:0>
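The help text for delete says that a delete cell "suppresses older versions" when scanning. A minimal sketch of that idea follows. It assumes that the delete marker hides every version of the column whose timestamp is less than or equal to the marker's timestamp; visible is a hypothetical helper, and the second timestamp (1477834144806) and the value 95 are made up for illustration.

```python
def visible(versions, delete_ts=None):
    """Return the versions a normal read can still see.

    versions:  dict mapping timestamp -> value for one column of one row.
    delete_ts: timestamp of a delete marker, or None if there is none.
    Assumption of this sketch: the marker hides every version whose
    timestamp is less than or equal to delete_ts.
    """
    if delete_ts is None:
        return dict(versions)
    return {ts: value for ts, value in versions.items() if ts > delete_ts}


# The cell we wrote and then deleted with the same timestamp.
score = {1477834144805: "99"}
print(visible(score, delete_ts=1477834144805))  # {}

# A hypothetical later write (timestamp and value invented for the example)
# is newer than the marker, so it is not hidden.
score[1477834144806] = "95"
print(visible(score, delete_ts=1477834144805))  # {1477834144806: '95'}
```

In this model the delete does not erase anything immediately; it only changes what a normal read returns.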
       We now insert information about another student, jerry, into the student table. As shown below, we add only info:name and info:gender and give no value for age.
hbase(main):001:0> put 'student', 'rk0002', 'info:name', 'jerry'
0 row(s) in 0.2790 seconds
hbase(main):002:0> put 'student', 'rk0002', 'info:gender','male'
0 row(s) in 0.0150 seconds
hbase(main):003:0> scan 'student'
ROW                                                  COLUMN+CELL
 rk0001                                              column=info:age, timestamp=1477833920512, value=20
 rk0001                                              column=info:gender, timestamp=1477832942394, value=male
 rk0001                                              column=info:name, timestamp=1477832018237, value=tom
 rk0002                                              column=info:gender, timestamp=1477836207768, value=male
 rk0002                                              column=info:name, timestamp=1477836194267, value=jerry
2 row(s) in 0.0380 seconds
hbase(main):004:0>
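       Now let us verify that the VERSIONS => 3 we set on the column family when creating the table takes effect. We write to the info:age column of rk0001 two more times, with the values 21 and 22.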
hbase(main):001:0> put 'student', 'rk0001', 'info:age','21'
0 row(s) in 0.2790 seconds
       After inserting info:age with the value 21, we look at the table again. The displayed value of info:age is now 21 rather than the earlier 20.
hbase(main):002:0> scan 'student'
ROW                                                  COLUMN+CELL
 rk0001                                              column=info:age, timestamp=1477836580124, value=21
 rk0001                                              column=info:gender, timestamp=1477832942394, value=male
 rk0001                                              column=info:name, timestamp=1477832018237, value=tom
 rk0002                                              column=info:gender, timestamp=1477836207768, value=male
 rk0002                                              column=info:name, timestamp=1477836194267, value=jerry
2 row(s) in 0.0510 seconds
hbase(main):003:0> put 'student', 'rk0001', 'info:age','22'
0 row(s) in 0.0090 seconds
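       After inserting info:age with the value 22, we look at the table again. The displayed value of info:age is now 22 rather than 21. So by default scan shows the most recently inserted value.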
hbase(main):004:0> scan 'student'
ROW                                                  COLUMN+CELL
 rk0001                                              column=info:age, timestamp=1477836634700, value=22
 rk0001                                              column=info:gender, timestamp=1477832942394, value=male
 rk0001                                              column=info:name, timestamp=1477832018237, value=tom
 rk0002                                              column=info:gender, timestamp=1477836207768, value=male
 rk0002                                              column=info:name, timestamp=1477836194267, value=jerry
2 row(s) in 0.0300 seconds
hbase(main):005:0>
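       This raises a question: were the earlier info:age values 20 and 21 deleted? They were not. We can see them with scan 'student', {COLUMNS => 'info', VERSIONS => 3}. COLUMNS => 'info' selects the column family, and VERSIONS => 3 asks the scan to return up to 3 versions of each cell, which matches the number of versions this column family was created to hold. The result is shown below: all the values of info:age are returned.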
hbase(main):001:0> scan 'student', {COLUMNS => 'info', VERSIONS => 3}
ROW                                                  COLUMN+CELL
 rk0001                                              column=info:age, timestamp=1477836634700, value=22 
 rk0001                                              column=info:age, timestamp=1477836580124, value=21
 rk0001                                              column=info:age, timestamp=1477833920512, value=20
 rk0001                                              column=info:gender, timestamp=1477832942394, value=male
 rk0001                                              column=info:name, timestamp=1477832018237, value=tom
 rk0002                                              column=info:gender, timestamp=1477836207768, value=male
 rk0002                                              column=info:name, timestamp=1477836194267, value=jerry
2 row(s) in 0.2680 seconds
hbase(main):002:0>
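       The info column family is limited to 3 versions and info:age already has 3, so let us see what happens when we write to it again. As shown below, after adding info:age with the value 23, the versions we see are 21, 22, and 23; the original 20 is gone. The value 20 has in fact only been marked for deletion at this point, and it is physically removed when the in-memory data is flushed. No flush has happened yet, so we can still look up the record that is marked for deletion.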
hbase(main):001:0> put 'student', 'rk0001', 'info:age', '23'
0 row(s) in 0.3200 seconds
hbase(main):003:0> scan 'student', {COLUMNS => 'info',VERSIONS => 3}
ROW                                                  COLUMN+CELL
 rk0001                                              column=info:age, timestamp=1477837982394, value=23
 rk0001                                              column=info:age, timestamp=1477836634700, value=22
 rk0001                                              column=info:age, timestamp=1477836580124, value=21
 rk0001                                              column=info:gender, timestamp=1477832942394, value=male
 rk0001                                              column=info:name, timestamp=1477832018237, value=tom
 rk0002                                              column=info:gender, timestamp=1477836207768, value=male
 rk0002                                              column=info:name, timestamp=1477836194267, value=jerry
2 row(s) in 0.0300 seconds
hbase(main):004:0>
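The behaviour seen in the last few scans can be summarized in a small sketch: a read returns versions newest first, and the number returned is capped both by what the scan asks for and by the column family's VERSIONS limit. This is a toy model, not HBase code; read_versions is a hypothetical helper, and the timestamps are copied from the outputs above.

```python
def read_versions(versions, requested, family_limit):
    """Return up to min(requested, family_limit) (timestamp, value) pairs, newest first."""
    newest_first = sorted(versions.items(), reverse=True)
    return newest_first[: min(requested, family_limit)]


# All four values written to rk0001 / info:age so far.
age = {
    1477833920512: "20",
    1477836580124: "21",
    1477836634700: "22",
    1477837982394: "23",
}

# A plain scan asks for one version and gets the newest value.
print(read_versions(age, requested=1, family_limit=3))
# [(1477837982394, '23')]

# Asking for 3 versions returns 23, 22, 21; the value 20 is over the limit.
print(read_versions(age, requested=3, family_limit=3))
# [(1477837982394, '23'), (1477836634700, '22'), (1477836580124, '21')]
```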
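       We use scan 'student', {RAW => true, VERSIONS => 10} to query everything, including records still held in memory that have been marked for deletion, as shown below. They stop appearing only after the in-memory data has been flushed.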
hbase(main):001:0> scan 'student', {RAW => true, VERSIONS => 10}
ROW                                                  COLUMN+CELL                                                                                                                                           
 rk0001                                              column=info:age, timestamp=1477837982394, value=23
 rk0001                                              column=info:age, timestamp=1477836634700, value=22
 rk0001                                              column=info:age, timestamp=1477836580124, value=21
 rk0001                                              column=info:age, timestamp=1477833920512, value=20
 rk0001                                              column=info:gender, timestamp=1477832942394, value=male
 rk0001                                              column=info:name, timestamp=1477832018237, value=tom
 rk0002                                              column=data:score, timestamp=1477834144805, type=DeleteColumn
 rk0002                                              column=info:gender, timestamp=1477836207768, value=male
 rk0002                                              column=info:name, timestamp=1477836194267, value=jerry
2 row(s) in 0.2830 seconds
hbase(main):002:0>
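The difference between a normal scan, a raw scan, and the state after a flush can be sketched as follows. The model follows the description in this lesson (extra versions stay visible to a raw scan until the in-memory data is flushed) and is only an illustration; raw_scan, normal_scan, and flush are hypothetical helpers, not HBase functions.

```python
def raw_scan(stored):
    """Return every stored version, newest first, ignoring the version limit."""
    return sorted(stored.items(), reverse=True)


def normal_scan(stored, max_versions):
    """Return only the newest max_versions versions."""
    return raw_scan(stored)[:max_versions]


def flush(stored, max_versions):
    """Model of a flush: keep the newest max_versions versions, drop the rest."""
    return dict(raw_scan(stored)[:max_versions])


# rk0001 / info:age as it looks in the raw scan above.
age = {
    1477833920512: "20",
    1477836580124: "21",
    1477836634700: "22",
    1477837982394: "23",
}

print(len(normal_scan(age, 3)))  # 3 versions: 23, 22, 21
print(len(raw_scan(age)))        # 4 versions: the value 20 is still stored

age = flush(age, 3)
print(len(raw_scan(age)))        # 3 versions: the value 20 is gone after the flush
```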
       We have not yet added any data to the data column family of rk0001. Let us add one value now, as shown below.
hbase(main):003:0> put 'student', 'rk0001', 'data:score', '90'
0 row(s) in 0.0740 seconds
hbase(main):004:0> scan 'student'
ROW                                                  COLUMN+CELL
 rk0001                                              column=data:score, timestamp=1477840556318, value=90
 rk0001                                              column=info:age, timestamp=1477837982394, value=23
 rk0001                                              column=info:gender, timestamp=1477832942394, value=male
 rk0001                                              column=info:name, timestamp=1477832018237, value=tom
 rk0002                                              column=info:gender, timestamp=1477836207768, value=male
 rk0002                                              column=info:name, timestamp=1477836194267, value=jerry
2 row(s) in 0.0360 seconds
hbase(main):005:0>
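Part 3: Analysis of the HBase Table
       Let us draw the table we have just been working with, as shown in the figure below. It is an irregular table, which is a distinctive feature of HBase: we can freely add columns to a column family, and we choose the column names ourselves. The figure shows that some columns have no value in some rows. Do these empty values take up space? In HBase they do not, which is a clear advantage over a typical relational database (where a declared column generally takes up space even if you never assign it a value).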
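The final state of the student table can be rebuilt from the last scan output, and a short sketch makes the "irregular table" idea concrete. Each row is modelled as a dict that holds only the columns that were actually written, so a missing value is simply an absent key. This is an illustration of the idea, not of HBase's storage format; render is a hypothetical helper.

```python
# Final contents of the student table, taken from the last scan above.
rows = {
    "rk0001": {
        "data:score": "90",
        "info:age": "23",
        "info:gender": "male",
        "info:name": "tom",
    },
    "rk0002": {
        "info:gender": "male",
        "info:name": "jerry",
    },
}

# Every column that appears in at least one row.
columns = sorted({column for cells in rows.values() for column in cells})


def render(rows, columns, width=11):
    """Draw the sparse rows as a grid; absent cells are shown as blanks."""
    lines = [" | ".join(name.ljust(width) for name in ["row key"] + columns)]
    for row_key in sorted(rows):
        cells = rows[row_key]
        values = [cells.get(column, "") for column in columns]
        lines.append(" | ".join(v.ljust(width) for v in [row_key] + values))
    return lines


for line in render(rows, columns):
    print(line)

stored_cells = sum(len(cells) for cells in rows.values())
grid_cells = len(rows) * len(columns)
print(stored_cells, "cells stored out of", grid_cells, "grid positions")
# 6 cells stored out of 8 grid positions
```

The two blank positions for rk0002 (data:score and info:age) exist only in the drawing; nothing is stored for them.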
     
       That is all for this lesson. In the next lesson we will learn how to set up an HBase cluster.

