(2) Hive Data Import Methods


Overview: Hive has no row-level insert, update, or delete operations. The only way to get data into a table is through bulk load operations. Hive supports four data import methods:

  1. Importing data from the local file system
  2. Importing data from the Hadoop Distributed File System (HDFS)
  3. Creating a table and loading data in a single query statement
  4. Inserting data into a table from a query

1  Importing Data from the Local File System

1.1 The Local File

Suppose there is a file /home/zhangjiangbin/student_grade.txt in the user's home directory on the local file system, with columns separated by the tab character (\t). Each row contains a student ID, name, course, and score:

[zhangjiangbin@tnamenode2 ~]$ cat student_grade.txt
2013331013      张三    英语    96
2013331013      张三    语文    89
2013331014      王五    数学    99
2013331014      王五    政治    79
2013331014      王五    化学    91
2013331014      王五    物理    78
2013331016      李四    数学    86
2013331016      李四    语文    76
2013331016      李四    政治    89
2013331016      李四    化学    85

1.2 Creating the Table

 


create table if not exists student_grade (
  student_id string,
  name string,
  course string,
  score int
)
row format delimited fields terminated by '\t' stored as textfile;

 

1.3 Loading the Data

Use the following statement to load the data:


LOAD DATA LOCAL INPATH '/home/zhangjiangbin/student_grade.txt'
OVERWRITE INTO TABLE student_grade;



 

The console output of the load is:


hive> LOAD DATA LOCAL INPATH '/home/zhangjiangbin/student_grade.txt'
    > OVERWRITE INTO TABLE student_grade;
Loading data to table default.student_grade
Table default.student_grade stats: [numFiles=1, numRows=0, totalSize=290, rawDataSize=0]
OK
Time taken: 1.702 seconds

 

This loads the file's contents into the Hive table, as a query confirms:

 


hive> SELECT * FROM student_grade;
OK
2013331013      张三    英语    96
2013331013      张三    语文    89
2013331014      王五    数学    99
2013331014      王五    政治    79
2013331014      王五    化学    91
2013331014      王五    物理    78
2013331016      李四    数学    86
2013331016      李四    语文    76
2013331016      李四    政治    89
2013331016      李四    化学    85
Time taken: 0.216 seconds, Fetched: 10 row(s)

 

2  Importing Data from the Hadoop Distributed File System (HDFS)

In the previous section, the LOCAL keyword was used when loading the file into the Hive table, so the path referred to the local file system. Without the LOCAL keyword, the path is interpreted as a path in the distributed file system.

When data is loaded from the local file system into a Hive table, it is actually first copied to a temporary directory on HDFS (typically under the uploading user's HDFS home directory), and then moved (note: moved, not copied!) from that temporary directory into the Hive table's data directory. It follows that Hive also supports moving data directly from an HDFS directory into the corresponding table's data directory.
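As a minimal sketch of the HDFS variant (the path /user/zhangjiangbin/student_grade.txt is a hypothetical upload location, not from the original example):

-- Upload the file to HDFS first (a shell step, run outside hive):
--   hdfs dfs -put /home/zhangjiangbin/student_grade.txt /user/zhangjiangbin/
-- Without LOCAL, the path below is an HDFS path, and the file is
-- MOVED into the table's data directory rather than copied.
LOAD DATA INPATH '/user/zhangjiangbin/student_grade.txt'
OVERWRITE INTO TABLE student_grade;

After this statement runs, the source file no longer exists at /user/zhangjiangbin/student_grade.txt, which makes the move (rather than copy) behavior easy to observe.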

 

 

3  Creating a Table and Loading Data in a Single Query Statement

We can create a table and load it with the results of a query in a single statement (CREATE TABLE ... AS SELECT):

 

 


CREATE TABLE student_grade_test AS
SELECT *
FROM student_grade;

 

The output is as follows:


hive> CREATE TABLE student_grade_test AS
    > SELECT *
    > FROM student_grade
    > ;
Query ID = zhangjiangbin_20160402173434_2989ede3-27ae-4138-99cb-1d0287998ef5
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1459502490104_0003, Tracking URL = http://tnamenode2:8088/proxy/application_1459502490104_0003/
Kill Command = /opt/data/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/hadoop/bin/hadoop job  -kill job_1459502490104_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-04-02 17:34:42,543 Stage-1 map = 0%,  reduce = 0%
2016-04-02 17:34:51,394 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.42 sec
MapReduce Total cumulative CPU time: 2 seconds 420 msec
Ended Job = job_1459502490104_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://tnameservice1/user/hive/warehouse/.hive-staging_hive_2016-04-02_17-34-29_637_1462779194903195470-1/-ext-10001
Moving data to: hdfs://tnameservice1/user/hive/warehouse/student_grade_test
Table default.student_grade_test stats: [numFiles=1, numRows=10, totalSize=280, rawDataSize=270]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 2.42 sec   HDFS Read: 3434 HDFS Write: 363 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 420 msec
OK
Time taken: 24.479 seconds

 


hive> SELECT * FROM student_grade_test;
OK
2013331013      张三    英语    96
2013331013      张三    语文    89
2013331014      王五    数学    99
2013331014      王五    政治    79
2013331014      王五    化学    91
2013331014      王五    物理    78
2013331016      李四    数学    86
2013331016      李四    语文    76
2013331016      李四    政治    89
2013331016      李四    化学    85
Time taken: 0.174 seconds, Fetched: 10 row(s)

 

A common use of this feature is to pull a needed subset of data out of a large, wide table; a sketch of this follows. Note that this feature cannot be used to create external tables.
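As a sketch of that subset use case (the table name math_scores is hypothetical, not part of the original example):

-- Pull only the math rows, dropping the course column, into a
-- new, narrower table.
CREATE TABLE math_scores AS
SELECT student_id, name, score
FROM student_grade
WHERE course = '数学';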

 

4  Inserting Data into a Table from a Query

The INSERT statement lets users insert the results of a query into a target table. Suppose Hive has a table student_grade_test whose columns match those of the student_grade table created in Section 1.
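Here that table already exists, created by the CTAS in Section 3. If it did not, one way to create an empty table with a matching schema is CREATE TABLE ... LIKE (a sketch, assuming the source table from Section 1):

CREATE TABLE IF NOT EXISTS student_grade_test
LIKE student_grade;

With the table in place, the insert is: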

 


INSERT INTO TABLE student_grade_test
SELECT * FROM student_grade;

 

The result is as follows:

 


hive> INSERT INTO TABLE student_grade_test
    > SELECT * FROM student_grade;
Query ID = zhangjiangbin_20160402174141_0b9e43e1-b0b1-484b-8229-806cfdb03547
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1459502490104_0004, Tracking URL = http://tnamenode2:8088/proxy/application_1459502490104_0004/
Kill Command = /opt/data/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/hadoop/bin/hadoop job  -kill job_1459502490104_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-04-02 17:41:58,114 Stage-1 map = 0%,  reduce = 0%
2016-04-02 17:42:07,707 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.83 sec
MapReduce Total cumulative CPU time: 2 seconds 830 msec
Ended Job = job_1459502490104_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://tnameservice1/user/hive/warehouse/student_grade_test/.hive-staging_hive_2016-04-02_17-41-46_083_4588923498662617977-1/-ext-10000
Loading data to table default.student_grade_test
Table default.student_grade_test stats: [numFiles=2, numRows=20, totalSize=560, rawDataSize=540]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 2.83 sec   HDFS Read: 3913 HDFS Write: 363 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 830 msec
OK
Time taken: 23.385 seconds

 


hive> SELECT * FROM student_grade_test;
OK
2013331013      张三    英语    96
2013331013      张三    语文    89
2013331014      王五    数学    99
2013331014      王五    政治    79
2013331014      王五    化学    91
2013331014      王五    物理    78
2013331016      李四    数学    86
2013331016      李四    语文    76
2013331016      李四    政治    89
2013331016      李四    化学    85
2013331013      张三    英语    96
2013331013      张三    语文    89
2013331014      王五    数学    99
2013331014      王五    政治    79
2013331014      王五    化学    91
2013331014      王五    物理    78
2013331016      李四    数学    86
2013331016      李四    语文    76
2013331016      李四    政治    89
2013331016      李四    化学    85
Time taken: 0.193 seconds, Fetched: 20 row(s)

 

Because OVERWRITE was not used here, Hive appended the data rather than overwriting the existing contents, which is why every row now appears twice. With OVERWRITE, the existing contents of the target table are replaced. The statement and its result are:


INSERT OVERWRITE TABLE student_grade_test
SELECT * FROM student_grade;

 


hive> SELECT * FROM student_grade_test;
OK
2013331013      张三    英语    96
2013331013      张三    语文    89
2013331014      王五    数学    99
2013331014      王五    政治    79
2013331014      王五    化学    91
2013331014      王五    物理    78
2013331016      李四    数学    86
2013331016      李四    语文    76
2013331016      李四    政治    89
2013331016      李四    化学    85
Time taken: 0.183 seconds, Fetched: 10 row(s)

 

Note that when the target table is partitioned, OVERWRITE replaces only the partitions being written to; other partitions are not affected. A sketch of this follows.
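Since the tables in this walkthrough are unpartitioned, the sketch uses a hypothetical partitioned table; none of these names come from the original example:

-- Hypothetical partitioned table, for illustration only.
CREATE TABLE IF NOT EXISTS student_grade_by_course (
  student_id string,
  name string,
  score int
)
PARTITIONED BY (course string);

-- Only the course='数学' partition is rewritten; rows in every
-- other course partition are left intact.
INSERT OVERWRITE TABLE student_grade_by_course PARTITION (course='数学')
SELECT student_id, name, score
FROM student_grade
WHERE course = '数学';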

 

   

 

 
