Hive Knowledge Summary

Linux command cheat sheet:
tar -zxvf *.tar.gz -C /test  ---- extract *.tar.gz into /test
rpm -ivh MySQL-server-5.1.73-1.glibc23.i386.rpm  ---- install an *.rpm package
uname -a   ---- show information about this machine


Why Hive:
1. Hive is a tool that translates SQL statements into MapReduce programs. It does not need its own cluster; just install it wherever you need it.
2. You can also create "databases" and "tables" in Hive.
3. These "databases" and "tables" do not hold any data inside Hive itself; Hive only stores their descriptive information.
4. The actual data of the "databases" and "tables" is stored on HDFS.
5. Hive therefore has to record the mapping between each "database"/"table" and its storage path on HDFS; this mapping is kept in Hive's "metastore" (metadata database).
6. By default the metastore uses Derby, which has a drawback: if you start hive from a different directory, the metadata of the databases and tables created earlier is gone. So Hive should be configured with a more capable database to manage the metadata, and MySQL is the usual choice (small, free, simple).
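As a hedged illustration of this mapping (assuming MySQL has already been configured as the metastore as described below, with the metastore database named hive), the mapping can be inspected directly in MySQL; DBS, TBLS and SDS are standard Hive metastore tables:
mysql> use hive;
mysql> select d.NAME, t.TBL_NAME, s.LOCATION    -- database name, table name, HDFS path of the table's data
    ->   from TBLS t
    ->   join DBS d on t.DB_ID = d.DB_ID        -- DBS: one row per Hive database
    ->   join SDS s on t.SD_ID = s.SD_ID;       -- SDS: storage descriptors, including the LOCATION column
Each result row shows a Hive database, one of its tables, and the HDFS directory holding that table's data.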






Installing MySQL on Linux:
Hive just connects to MySQL, so MySQL can be installed on any host.
Install MySQL:
1. Copy MySQL-server-5.1.73-1.glibc23.i386.rpm and MySQL-client-5.1.73-1.glibc23.i386.rpm to the Linux machine.
2. rpm -ivh MySQL-server-5.1.73-1.glibc23.i386.rpm
 CentOS ships with a built-in mysql-libs package that conflicts with this one; remove it first, then install the server again:
 rpm -e mysql-libs-5.1.66-2.el6_3.i686 --nodeps
 rpm -ivh MySQL-server-5.1.73-1.glibc23.i386.rpm
3. Then install the client:
 rpm -ivh MySQL-client-5.1.73-1.glibc23.i386.rpm
 Then set the root password: run /usr/bin/mysql_secure_installation and follow the prompts, making sure to keep remote login enabled for root.
4. Log in to MySQL:
 mysql -uroot -proot
 Then grant privileges:
 grant all privileges on *.* to 'root'@'%' identified by 'root' with grant option;    ---- allows any table in any database to be accessed from any host
 flush privileges;    ---- reload the grant tables




Installing Hive:
Copy apache-hive-1.2.1-bin.tar.gz to the Linux machine, then extract it:
tar -zxvf apache-hive-1.2.1-bin.tar.gz

Start hive: /home/hadoop/apache-hive-1.2.1-bin/bin/hive
Running SQL in hive:
show databases;   ---- prints OK and the initial default database
create database testdb;
use testdb;
create table test_t3(id int,name string) row format delimited fields terminated by ',' stored as textfile;
At this point a metastore_db directory appears under /home/hadoop/apache-hive-1.2.1-bin/bin/.
(Note: by default, Hive generates a set of metadata, i.e. a metastore_db, in whatever directory it is started from. If you later start hive from another directory, it cannot see that metastore_db; instead a new metastore_db is generated in the new directory. That is not workable, so a third-party database should be installed to store the metadata;
MySQL is chosen here.)

Configure Hive to use MySQL as the metadata management system:
1. Since Hive has to access MySQL, it needs the MySQL JDBC driver. /home/hadoop/apache-hive-1.2.1-bin/lib does not contain one,
so copy one in (e.g. mysql-connector-java-5.1.28.jar).
2. Create hive-site.xml in /home/hadoop/apache-hive-1.2.1-bin/conf with the following content:
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hadoop-03-server:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>root</value>
  </property>
</configuration>
3. Start /home/hadoop/apache-hive-1.2.1-bin/bin/hive
Create a table, e.g. create table test_t(id int); then run show tables;
At this point the hive database configured above is created in MySQL, and /user/hive/warehouse/test_t appears in the HDFS filesystem.
4. Note: when creating a table with SQL in Hive, besides the usual SQL syntax you must also declare the table's data format, fields, and so on;
data placed into HDFS must then follow this declared format, otherwise Hive cannot query it.
For example:
hive> create table test_t3(id int,name string)
   > row format delimited        ---- each line is one record
   > fields terminated by ','    ---- fields are separated by ','
   > stored as textfile;         ---- store as plain text (this is the default if omitted)


hadoop fs -put hivetest.txt /user/hive/warehouse/test_t3  ---- put the data file into the table's directory on HDFS (if several files are put there, they are all treated as data of the same table)
(contents of hivetest.txt:
001,张三
002,lisi
003,wangwu)


Now that the table's data format has been declared and data has been added in that format, the table can be queried:
hive>select * from test_t3;
hive>select count(*) from test_t3;
hive>select * from test_t3 order by id desc;


Interim summary: analyzing data with Hive is just like storing tables in a database and querying them with SQL; the only difference is that some data-format rules must be declared when the table is created.
5. Three ways to load table data into HDFS:
(1) Besides the hadoop fs -put method shown above, the following two methods can also be used:
(2) // load data from the local filesystem into a Hive table (this effectively uploads the file into the HDFS directory managed by Hive)
load data local inpath '/home/hadoop/ip.txt' into table tab_ext;
(3) // load data already on HDFS into a Hive table (this effectively moves the file from its original directory into the Hive-managed directory)
load data inpath 'hdfs://ns1/aa/bb/data.log' into table tab_user;


6. External tables:
All tables above are internal (managed) tables, whose data must be placed in the designated HDFS directory (here: /user/hive/warehouse/test_t3).
An external table has no such restriction: when creating it you simply specify in the SQL an HDFS directory where the table's data lives.
Creating an external table only adds the EXTERNAL keyword and a LOCATION clause compared with an internal table:
hive>CREATE EXTERNAL TABLE tab_ip_ext(id int, name string,ip STRING,country STRING)
   >ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
   >STORED AS TEXTFILE
   >LOCATION '/external/user';
// External table: dropping an external table only removes its metadata, while the table's data on HDFS is left untouched; by contrast, dropping an internal table removes both the metadata and the data. A minimal sketch follows.
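A hedged sketch of the difference, reusing the tables created above (paths assume the default warehouse location):
hive> drop table tab_ip_ext;   ---- external table: only the metadata is removed; the files under /external/user remain on HDFS
hive> drop table test_t3;      ---- internal table: the metadata and the directory /user/hive/warehouse/test_t3 are both deleted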




 


Hive SQL partitioned tables:
hive> create table lpx_partition_test(global_id int, company_name string) partitioned by (stat_date string, province string) row format delimited fields terminated by ',';
hive> alter table lpx_partition_test add  PARTITION(stat_date='2011-06-08', province='ZheJiang');
hive> alter table lpx_partition_test add  PARTITION(stat_date='2011-06-08', province='GuangDong');
hive> alter table lpx_partition_test add  PARTITION(stat_date='2011-06-09', province='ZheJiang');
hadoop fs -ls /user/hive/warehouse/lpx_partition_test
(Found 2 items
drwxr-xr-x   - root supergroup          0 2011-08-01 16:42 /user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08
drwxr-xr-x   - root supergroup          0 2011-08-01 16:42 /user/hive/warehouse/lpx_partition_test/stat_date=2011-06-09)
hadoop dfs -ls /user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08
(drwxr-xr-x   - root supergroup          0 2011-08-01 16:42 /user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08/province=GuangDong
drwxr-xr-x   - root supergroup          0 2011-08-01 16:37 /user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08/province=ZheJiang)
As shown above, each partition has its own directory, with stat_date at the parent level and province at the child level.
Inserting data into a partition:
hive> create table lpx_partition_test_in(global_id int, company_name string, province string)row format delimited fields terminated by ' ';
hive> LOAD DATA LOCAL INPATH '/opt/hadoop/mytest/lpx_partition_test.txt' OVERWRITE INTO TABLE lpx_partition_test_in;
hive> insert overwrite table lpx_partition_test PARTITION(stat_date='2011-06-08', province='ZheJiang') select global_id, company_name from lpx_partition_test_in where province='ZheJiang';
hive> select * from lpx_partition_test;


hive> from lpx_partition_test_in 
    > insert overwrite table lpx_partition_test PARTITION(stat_date='2011-06-08', province='ZheJiang') select global_id, company_name where province='ZheJiang'
    > insert overwrite table lpx_partition_test PARTITION(stat_date='2011-06-08', province='GuangDong') select global_id, company_name where province='GuangDong'
    > insert overwrite table lpx_partition_test PARTITION(stat_date='2011-06-09', province='ZheJiang') select global_id, company_name where province='ZheJiang'
    > insert overwrite table lpx_partition_test PARTITION(stat_date='2011-06-09', province='GuangDong') select global_id, company_name where province='GuangDong';
hive> select * from lpx_partition_test;


Dynamic partitions:
If a large amount of data has to go into many different partitions, a separate insert statement is needed for every partition, so a large number of insert statements must be written.
For example, to load one day's data for all provinces you would have to write one insert statement per province, which is very inconvenient; to load another day's data you would have to change both the DML and the DDL, and each insert statement also turns into its own MapReduce job.
Dynamic partitioning solves this problem: it decides from the input data which partitions to create and which partition each row goes into. The feature has been available since version 0.6.0. During a dynamic insert, the partition-column values of each input row are evaluated and the target partition is chosen dynamically; if that partition does not exist yet, it is created automatically. With this feature a single insert statement can create and populate all the necessary partitions, and because there is only one SQL statement there is only one corresponding MapReduce job, which greatly improves performance and reduces the load on the Hadoop cluster.


Dynamic partition parameters:
hive.exec.max.dynamic.partitions.pernode: maximum number of dynamic partitions each mapper or reducer may create (default 100).
hive.exec.max.dynamic.partitions: maximum number of dynamic partitions one DML statement may create (default 1000).
hive.exec.max.created.files: maximum number of files all mappers and reducers may create (default 100000).
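If a large dynamic-partition load hits these limits, they can be raised for the current session; the numbers below are only illustrative:
hive> set hive.exec.max.dynamic.partitions.pernode=200;
hive> set hive.exec.max.dynamic.partitions=2000;
hive> set hive.exec.max.created.files=200000;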


hive>  set hive.exec.dynamic.partition;
hive.exec.dynamic.partition=false
hive>  set hive.exec.dynamic.partition = true;
hive>  set hive.exec.dynamic.partition;
hive.exec.dynamic.partition=true


hive> from lpx_partition_test_in 
    > insert overwrite table lpx_partition_test PARTITION(stat_date='2011-06-08', province) select global_id, company_name,province;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Execution log at: /tmp/root/root_20110801183737_64ce8cf1-a068-4fbf-9d8e-561118569b2c.log
Job running in-process (local Hadoop)
2011-08-01 18:37:57,566 null map = 100%,  reduce = 0%
Ended Job = job_local_0001
Ended Job = -1141443727, job is filtered out (removed at runtime).
Moving data to: hdfs://hadoop1:9000/tmp/hive-root/hive_2011-08-01_18-37-51_921_8609501383674778354/-ext-10000
Loading data to table default.lpx_partition_test partition (stat_date=2011-06-08, province=null)
Deleted hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08/province=GuangDong
Deleted hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08/province=ZheJiang
    Loading partition {stat_date=2011-06-08, province=GuangDong}
    Loading partition {stat_date=2011-06-08, province=ZheJiang}
Partition default.lpx_partition_test{stat_date=2011-06-08, province=GuangDong} stats: [num_files: 1, num_rows: 1, total_size: 23]
Partition default.lpx_partition_test{stat_date=2011-06-08, province=ZheJiang} stats: [num_files: 1, num_rows: 3, total_size: 60]
Table default.lpx_partition_test stats: [num_partitions: 4, num_files: 4, num_rows: 8, total_size: 166]
OK
Time taken: 6.683 seconds


hive> from lpx_partition_test_in 
    > insert overwrite table lpx_partition_test PARTITION(stat_date, province) select global_id, company_name,stat_date,province;
FAILED: Error in semantic analysis: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict


hive>  set hive.exec.dynamic.partition.mode=nonstrict;


hive> from lpx_partition_test_in t
    > insert overwrite table lpx_partition_test PARTITION(stat_date, province) select t.global_id, t.company_name, t.stat_date, t.province DISTRIBUTE BY t.stat_date, t.province;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Execution log at: /tmp/root/root_20110802131616_02744950-1c88-4073-8aae-07c964073c1a.log
Job running in-process (local Hadoop)
2011-08-02 13:16:30,765 null map = 0%,  reduce = 0%
2011-08-02 13:16:37,776 null map = 100%,  reduce = 0%
2011-08-02 13:16:40,915 null map = 100%,  reduce = 100%
Ended Job = job_local_0001
Loading data to table default.lpx_partition_test partition (stat_date=null, province=null)
    Loading partition {stat_date=20110608, province=GuangDong}
    Loading partition {stat_date=20110608, province=ZheJiang}
    Loading partition {stat_date=20110609, province=ZheJiang}
Partition default.lpx_partition_test{stat_date=20110608, province=GuangDong} stats: [num_files: 1, num_rows: 1, total_size: 23]
Partition default.lpx_partition_test{stat_date=20110608, province=ZheJiang} stats: [num_files: 1, num_rows: 1, total_size: 23]
Partition default.lpx_partition_test{stat_date=20110609, province=ZheJiang} stats: [num_files: 1, num_rows: 2, total_size: 37]
Table default.lpx_partition_test stats: [num_partitions: 7, num_files: 7, num_rows: 12, total_size: 249]
OK
Time taken: 26.672 seconds


hive> select * from lpx_partition_test;
OK
99001    xxxcompany_name1    2011-06-08    GuangDong
99001    xxxcompany_name1    2011-06-08    ZheJiang
99002    xxxcompany_name1    2011-06-08    ZheJiang
99003    xxxcom2    2011-06-08    ZheJiang
99001    xxxcompany_name1    2011-06-09    GuangDong
99001    xxxcompany_name1    2011-06-09    ZheJiang
99002    xxxcompany_name1    2011-06-09    ZheJiang
99003    xxxcom2    2011-06-09    ZheJiang
99001    xxxcompany_name1    20110608    GuangDong
99001    xxxcompany_name1    20110608    ZheJiang
99002    xxxcompany_name1    20110609    ZheJiang
99003    xxxcom2    20110609    ZheJiang
Time taken: 1.179 seconds


To keep rows with the same partition-column values in the same MapReduce task as far as possible, so that each task creates as few new partition directories as possible, you can use DISTRIBUTE BY (as in the insert above) to group rows with identical partition-column values together.