Hive Quick Start Notes (Part 1)


These are my quick-start notes on Hive.

Installing Hive is quite simple: download the .tar.gz from the official site and unpack it on a Linux server. You will see the usual directories such as conf and bin. In the conf directory, copy hive-default.xml.template to hive-default.xml, create a new empty file named hive-site.xml, and likewise copy hive-env.sh.template to hive-env.sh.
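In shell form, the steps above look like this (assuming the tarball was unpacked under /usr/local/hive, a hypothetical path):

  $ cd /usr/local/hive/conf
  $ cp hive-default.xml.template hive-default.xml
  $ touch hive-site.xml
  $ cp hive-env.sh.template hive-env.sh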

Hive depends on Hadoop, so make sure Hadoop is installed correctly before using Hive. Before starting Hive, create two directories in HDFS as follows:

  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /tmp

  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse

  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp

  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse

Of course, it is best to add both the Hadoop and Hive bin directories to the PATH environment variable, which makes them more convenient to use.
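For example, the following lines could go into ~/.bashrc (the install paths below match the ones that appear in the job logs later in this post; adjust them to your own layout):

  export HADOOP_HOME=/usr/local/hadoop-1.0.0
  export HIVE_HOME=/usr/local/hive-0.9.0
  export PATH=$PATH:$HADOOP_HOME/bin:$HIVE_HOME/bin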

Configuration

Hive's defaults all live in <install-dir>/conf/hive-default.xml. To change a default, put the override in hive-site.xml; settings there take precedence over hive-default.xml. The conf directory also holds the log4j logging configuration, and the log level can be set on the command line, e.g.: bin/hive -hiveconf hive.root.logger=INFO,console
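As a minimal sketch, a hive-site.xml override might look like this (hive.metastore.warehouse.dir is a standard Hive property; the value shown is simply the default warehouse path created above):

<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>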

Besides the configuration files, parameters can also be set directly at the prompt, for example:
hive> SET mapred.job.tracker=myhost.mycompany.com:50030;
hive> SET -v;
The SET command can change not only Hive parameters but Hadoop parameters as well.
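SET with a parameter name and no value prints the parameter's current value, which is handy for checking a single setting:

hive> SET hive.exec.reducers.max;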

DDL (Data Definition Language) Operations
1. Create a table named pokes:
hive> CREATE TABLE pokes (foo INT, bar STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
The table has two columns: the first is an INT and the second a STRING.

ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' means the fields are separated by '|'. Any text file loaded into this table must use the same delimiter; for example, a file to be imported might look like:

1|a
2|b
3|c
4|d
5|e
6|f
7|g
8|h
9|i
10|j
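Assuming the ten lines above were saved locally as /tmp/sample.txt (a hypothetical path), they could be loaded into pokes like this (LOAD DATA is covered in the DML section below):

hive> LOAD DATA LOCAL INPATH '/tmp/sample.txt' INTO TABLE pokes;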

2. Create a table named invites:
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
This table has two columns plus a partition column named ds. Partitioning narrows the scope of a query and speeds up data retrieval.
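Once data has been loaded into a partition (see the DML section below), restricting a query to that partition lets Hive skip all the others, e.g.:

hive> SELECT * FROM invites WHERE ds='2008-08-15';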
3. List the existing tables
hive> SHOW TABLES;
OK
invites
pokes
Time taken: 3.376 seconds
A regular-expression pattern is also supported, for example:
hive> SHOW TABLES '.*s';
OK
invites
pokes
Time taken: 0.034 seconds
4. Show a table's structure
hive> DESCRIBE invites;
OK
foo     int
bar     string
ds      string
Time taken: 0.24 seconds
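For more detail, such as the table's storage location and partition columns, DESCRIBE EXTENDED can be used:

hive> DESCRIBE EXTENDED invites;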
5. Alter a table
As in a relational database, a table's structure can be modified, for example by adding columns:
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
Running DESCRIBE pokes again shows that the table has gained a column.
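A column can also be renamed or retyped with CHANGE; as an illustration (new_col2 is just a made-up name):

hive> ALTER TABLE pokes CHANGE new_col new_col2 INT;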
A table can be renamed as well, e.g.:
hive> ALTER TABLE pokes RENAME TO pokes2;
This renames the table pokes to pokes2.
6. Drop a table
hive> DROP TABLE pokes2;
DML (Data Manipulation Language) Operations
1. Load data from a local text file into a Hive table
hive> LOAD DATA LOCAL INPATH '../examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Output:
Copying data from file:/usr/local/hive-0.9.0/examples/files/kv1.txt
Copying file: file:/usr/local/hive-0.9.0/examples/files/kv1.txt
Loading data to table default.pokes
Deleted hdfs://master:9000/user/hive/warehouse/pokes
OK
Time taken: 0.247 seconds
In the command above, OVERWRITE means the table's existing data is replaced; without OVERWRITE, the new records are appended to the existing ones.
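For example, loading the same file again without OVERWRITE would append a second copy of the records:

hive> LOAD DATA LOCAL INPATH '../examples/files/kv1.txt' INTO TABLE pokes;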
2. Count the records in pokes
hive> select count(*) from pokes;
The output is as follows:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201303271417_0002, Tracking URL = http://master:50030/jobdetails.jsp?jobid=job_201303271417_0002
Kill Command = /usr/local/hadoop-1.0.0/libexec/../bin/hadoop job  -Dmapred.job.tracker=master:9001 -kill job_201303271417_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2013-04-02 16:52:49,857 Stage-1 map = 0%,  reduce = 0%
2013-04-02 16:52:55,892 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.64 sec
...
2013-04-02 16:53:07,964 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.38 sec
...
2013-04-02 16:53:17,012 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.38 sec
MapReduce Total cumulative CPU time: 3 seconds 380 msec
Ended Job = job_201303271417_0002
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 3.38 sec   HDFS Read: 6018 HDFS Write: 4 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 380 msec
OK
500
Time taken: 35.347 seconds
As the output shows, the query is actually executed as a MapReduce job, and the table holds 500 records in total.
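Other queries work the same way; for instance, an aggregation with GROUP BY is likewise compiled into a MapReduce job:

hive> SELECT bar, count(*) FROM pokes GROUP BY bar;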
3. Load data into table partitions
hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');
4. Load data from HDFS into a Hive table
hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
Note that without the LOCAL keyword the path is taken to be in HDFS, and the file is moved (not copied) into the table's warehouse directory.

If the table has multiple partition columns, say two, the partition spec would look like: (ds='2008-08-15', node='bjuni')
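As a sketch, a table with two partition columns and a matching load might look like this (invites2 is a hypothetical table name; node and 'bjuni' follow the example above):

hive> CREATE TABLE invites2 (foo INT, bar STRING) PARTITIONED BY (ds STRING, node STRING);
hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites2 PARTITION (ds='2008-08-15', node='bjuni');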

The above covers one part of getting started with Hive; the next part will be written up later.

