Hadoop Notes, Part 6: The Hive Data Warehouse
1. Introduction to Hive
Hive is a big-data analysis tool built on the Hadoop file system. It makes data summarization, ad-hoc queries, and analysis of large data sets straightforward, uses familiar SQL syntax, and supports UDFs (user-defined functions) for custom statistics.
Hive organizes data as follows:
Databases: namespaces that keep tables separate, similar to databases in MySQL;
Tables: DDL predefines the column names and data formats, and DML operates on the resulting row/column data sets;
Partitions: data is split by the values of designated columns and written to separate partition directories in the file system;
Buckets (or Clusters): a bucket count and a clustering column are declared; on insert, hash(column value) % bucket count decides which file a row is written to.
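The partition-and-bucket layout above can be sketched in a few lines of Python. This is an illustration, not Hive's actual code: the `key=value` directory naming matches Hive's convention, and Python's `hash()` stands in for Hive's internal hash function (for integer keys both reduce to the value itself).

```python
# Sketch of how Hive maps a row to a partition directory and a bucket file.
# Illustrative only: Python's hash() stands in for Hive's hash function.

def partition_path(table_dir, partition_cols):
    """Build the partition directory, e.g. .../logs/dt=2024-01-01."""
    parts = ["%s=%s" % (col, val) for col, val in partition_cols]
    return "/".join([table_dir] + parts)

def bucket_index(cluster_value, num_buckets):
    """A row lands in bucket hash(value) % num_buckets."""
    return hash(cluster_value) % num_buckets

path = partition_path("/user/hive/warehouse/logs", [("dt", "2024-01-01")])
print(path)                 # /user/hive/warehouse/logs/dt=2024-01-01
print(bucket_index(11, 4))  # integer key 11 with 4 buckets -> bucket 3
```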
Hive architecture:
Hive stores table definitions (the metadata) in Derby or another relational database and stores the data rows in the HDFS file system. When a client issues a query over JDBC or a native client, Hive parses the SQL, optimizes it against the metastore (table definitions), compiles it into MapReduce jobs, runs them, and returns the result. (Diagram taken from the web.)
Hive suits a small number of users running analytical queries; it is not designed for high-concurrency request loads.
2. Installing Hive
1) Standalone installation (see the official wiki: https://cwiki.apache.org/confluence/display/Hive/Home)
Hive stores its data in Hadoop, so a working Hadoop environment is required. Table definitions are kept in a relational database; Hive bundles the Derby database and uses it by default. A Hadoop cluster must be set up first.
For the cluster environment, see http://blog.csdn.net/liaomin416100569/article/details/78360734.
Host information (/etc/hosts):
192.168.58.147 node1
192.168.58.149 node2
192.168.58.150 node3
192.168.58.151 node4

Node roles:
NameNode: node1 (nameserviceid node1), node2 (nameserviceid node2)
SecondaryNameNode: node1, node2
DataNodes: node2, node3, node4
ResourceManager: node1
NodeManagers: node2, node3, node4

My cluster uses HDFS federation, so on node1 change the default file system in core-site.xml from viewfs to the local nameservice:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://node1</value>
</property>

The standalone Hive is installed on node1.
Install JDK 1.7 or newer (already present from the Hadoop installation).
Download Hive: apache-hive-2.2.0-bin.tar.gz (hive.apache.org).
Unpack the archive to /soft/hive-2.2.0.
Add to /etc/profile:

HIVE_HOME=/soft/hive-2.2.0
export HIVE_HOME
PATH=$PATH:${HIVE_HOME}/bin
export PATH

Apply the change immediately:

source /etc/profile

Create the directories Hive needs on node1's HDFS:

hdfs dfs -mkdir /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod g+w /user/hive/warehouse

In standalone (embedded) mode the metadata files allow only one user at a time. No port is opened; a connection accesses the metadata and the file system directly. (Hive reads the Hadoop configuration to find the default HDFS address, so HADOOP_HOME must be set.)
Managing metadata with the embedded Derby database

Delete $HIVE_HOME/conf/metastore_db, or back it up under another name:

mv $HIVE_HOME/conf/metastore_db $HIVE_HOME/conf/metastore_db_tmp

In conf/hive-default.xml.template you can find the following four options: the JDBC URL, the driver class, the user name, and the password.

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  <!-- Derby manages the metadata by default -->
  <description>
    JDBC connect string for a JDBC metastore.
    To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
    For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
  </description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>mine</value>
  <description>password to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>APP</value>
  <description>Username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

To keep the metadata in another database instead, place that database's driver jar under ${HIVE_HOME}/lib and change these four values accordingly.
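Since only those four connection properties ever change when swapping the metastore database, a small helper can generate a matching `hive-site.xml`. This is a convenience sketch, not part of Hive; the MySQL URL and credentials below are placeholder values.

```python
# Generate a minimal hive-site.xml containing the four metastore connection
# properties. The MySQL URL/credentials here are placeholders.
import xml.etree.ElementTree as ET

def hive_site(url, driver, user, password):
    props = {
        "javax.jdo.option.ConnectionURL": url,
        "javax.jdo.option.ConnectionDriverName": driver,
        "javax.jdo.option.ConnectionUserName": user,
        "javax.jdo.option.ConnectionPassword": password,
    }
    conf = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(conf, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(conf, encoding="unicode")

xml_text = hive_site("jdbc:mysql://192.168.58.1:3306/metadb",
                     "com.mysql.jdbc.Driver", "root", "123456")
print(xml_text)
```

Write the returned string to $HIVE_HOME/conf/hive-site.xml to switch metastores.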
The databases Hive supports can be listed as follows:

[root@node1 upgrade]# pwd
/soft/hive-2.2.0/scripts/metastore/upgrade
[root@node1 upgrade]# ll
total 24
drwxr-xr-x 2 root root   89 Oct 27 01:39 azuredb
drwxr-xr-x 2 root root 4096 Oct 27 01:39 derby
drwxr-xr-x 2 root root 4096 Oct 27 01:39 mssql
drwxr-xr-x 2 root root 4096 Oct 27 01:39 mysql
drwxr-xr-x 2 root root 4096 Oct 27 01:39 oracle
drwxr-xr-x 2 root root 4096 Oct 27 01:39 postgres

Initialize Hive with the built-in Derby metastore (which holds the table definitions):

schematool -dbType derby -initSchema

After successful initialization, start the client console with the hive command.
Inside the hive client you can create databases, work with tables, and so on with MySQL-like syntax.
If you run into connection problems, check hive-log4j2.properties.template to find where the log file is written.
The default location is property.hive.log.dir = ${sys:java.io.tmpdir}/${sys:user.name},
i.e. the /tmp/root directory, which contains the file hive.log.
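That default expands to the JVM's temp directory plus the current user name; a Python equivalent of the two system-property lookups makes the resolution concrete:

```python
# Resolve the default Hive log path: ${sys:java.io.tmpdir}/${sys:user.name}.
# tempfile.gettempdir() and getpass.getuser() mirror the two JVM properties.
import getpass
import os
import tempfile

log_dir = os.path.join(tempfile.gettempdir(), getpass.getuser())
log_file = os.path.join(log_dir, "hive.log")
print(log_file)  # e.g. /tmp/root/hive.log when running as root
```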
2) Server installation (hiveserver2 + beeline)

In server mode Hive opens a port, and multiple clients connect to that port through the JDBC API to run queries and statistics; this mode suits multi-user scenarios.
This is also a good opportunity to demonstrate keeping the metadata in MySQL.
Assume a MySQL instance on the local Windows machine: 192.168.58.1, port 3306, user root, password 123456.
In $HIVE_HOME/conf create the file hive-site.xml, copy in the four connection properties, and point them at MySQL:
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://192.168.58.1:3306/metadb</value>
    <description>
      JDBC connect string for a JDBC metastore.
      To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
      For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
    </description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
    <description>password to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>Username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
</configuration>

Copy the MySQL driver jar into ${HIVE_HOME}/lib.
On the MySQL server (192.168.58.1) create the metadb database:

create database metadb;

On node1, initialize metadb:
[root@node1 lib]# schematool -dbType mysql -initSchema
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/soft/hive-2.2.0/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL:        jdbc:mysql://192.168.58.1:3306/metadb
Metastore Connection Driver :    com.mysql.jdbc.Driver
Metastore connection User:       root
Starting metastore schema initialization to 2.1.0
Initialization script hive-schema-2.1.0.mysql.sql
Initialization script completed

On the Windows machine, check that the metadb database now contains the schema tables:

mysql> use metadb
Database changed
mysql> show tables;
+---------------------------+
| Tables_in_metadb          |
+---------------------------+
| aux_table                 |
| bucketing_cols            |
| cds                       |
| columns_v2                |
| compaction_queue          |
| completed_compactions     |
| completed_txn_components  |
| database_params           |
| db_privs                  |
| dbs                       |
| delegation_tokens         |
| func_ru                   |
| funcs                     |
| global_privs              |
| hive_locks                |
| idxs                      |
| index_params              |
| key_constraints           |
| master_keys               |
| next_compaction_queue_id  |
| next_lock_id              |
| next_txn_id               |
| notification_log          |
| notification_sequence     |
| nucleus_tables            |
| part_col_privs            |
| part_col_stats            |
| part_privs                |
| partition_events          |
| partition_key_vals        |
| partition_keys            |
| partition_params          |
| partitions                |
| role_map                  |
| roles                     |
| sd_params                 |
| sds                       |
| sequence_table            |
| serde_params              |
| serdes                    |
| skewed_col_names          |
| skewed_col_value_loc_map  |
| skewed_string_list        |
| skewed_string_list_values |
| skewed_values             |
| sort_cols                 |
| tab_col_stats             |
| table_params              |
| tbl_col_privs             |
| tbl_privs                 |
| tbls                      |
| txn_components            |
| txns                      |
| type_fields               |
| types                     |
| version                   |
| write_set                 |
+---------------------------+
57 rows in set (0.01 sec)

Make sure Hadoop is installed on this machine and that HDFS and YARN are fully started. The configuration below uses Hadoop's proxy-user mechanism.
Edit Hadoop's core-site.xml:

<property>
  <name>hadoop.proxyuser.USERNAME.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.USERNAME.groups</name>
  <value>*</value>
</property>

USERNAME can be any name you choose; * means users from every host and group in the cluster may be impersonated through this proxy user. Connecting as the configured proxy user is like wearing a staff badge: you no longer have to announce your own name.
By default Hive connects with an anonymous account that has no permissions, so configure a proxy user and pass it at connect time with -n; the session can then access all files under Hadoop. For details on proxy users see http://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/Superusers.html
Assuming Hive connects as root:

<property>
  <name>hadoop.proxyuser.root.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.root.groups</name>
  <value>*</value>
</property>

Start the server on node1 (58.147):
hiveserver2

By default hiveserver2 listens on port 10000 (for JDBC connections); port 10002 serves the web monitoring UI.
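Whether those two ports are actually accepting connections can be checked with a small TCP probe. This is a generic socket sketch, not part of Hive; the host below is the node1 address from this setup.

```python
# Probe a TCP port to check whether hiveserver2 (10000) or its web UI (10002)
# is accepting connections. Generic socket code, not Hive-specific.
import socket

def port_open(host, port, timeout=2.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (result depends on whether your hiveserver2 is running):
# port_open("192.168.58.147", 10000)
# port_open("192.168.58.147", 10002)
```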
Install Hive on another machine as well, then connect as the root proxy user and test creating tables and querying:

beeline -u jdbc:hive2://192.168.58.147:10000 -n root

Note: HDFS and YARN must be fully started (check with jps), because inserting data runs MapReduce jobs.
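For scripted use, the connect command above can be assembled programmatically; this sketch only builds the argument list (pass it to `subprocess.run()` to actually connect), with the host and user taken from this setup:

```python
# Assemble the beeline command used above: connect to hiveserver2 over JDBC
# as a given proxy user. Only builds the argv; run it with subprocess.run().
def beeline_cmd(host, user, port=10000):
    url = "jdbc:hive2://%s:%d" % (host, port)
    return ["beeline", "-u", url, "-n", user]

print(" ".join(beeline_cmd("192.168.58.147", "root")))
# beeline -u jdbc:hive2://192.168.58.147:10000 -n root
```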
Client test:

[root@node2 bin]# beeline -u jdbc:hive2://192.168.58.147:10000 -n root
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/soft/hive-2.2.0/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://192.168.58.147:10000
Connected to: Apache Hive (version 2.2.0)
Driver: Hive JDBC (version 2.2.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.2.0 by Apache Hive
0: jdbc:hive2://192.168.58.147:10000> show databases;
+----------------+--+
| database_name  |
+----------------+--+
| default        |
| test           |
+----------------+--+
2 rows selected (4.615 seconds)
0: jdbc:hive2://192.168.58.147:10000> drop database test;
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database test is not empty. One or more tables exist.) (state=08S01,code=1)
0: jdbc:hive2://192.168.58.147:10000> create database hello;
No rows affected (0.96 seconds)
0: jdbc:hive2://192.168.58.147:10000> use hello;
No rows affected (0.252 seconds)
0: jdbc:hive2://192.168.58.147:10000> create table tt(id int,name string);
No rows affected (1.403 seconds)
0: jdbc:hive2://192.168.58.147:10000> insert into tt values(1,'zs');
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (45.633 seconds)
0: jdbc:hive2://192.168.58.147:10000> select * from tt;
+--------+----------+--+
| tt.id  | tt.name  |
+--------+----------+--+
| 1      | zs       |
+--------+----------+--+
1 row selected (1.515 seconds)

On the hiveserver2 console you can see that the insert launches a MapReduce job.
Test successful.
Use help to list all beeline commands:
Beeline version 2.2.0 by Apache Hive
beeline> help
!addlocaldriverjar   Add driver jar file in the beeline client side.
!addlocaldrivername  Add driver name that needs to be supported in the beeline client side.
!all                 Execute the specified SQL against all the current connections
!autocommit          Set autocommit mode on or off
!batch               Start or execute a batch of statements
!brief               Set verbose mode off
!call                Execute a callable statement
!close               Close the current connection to the database
!closeall            Close all current open connections
!columns             List all the columns for the specified table
!commit              Commit the current transaction (if autocommit is off)
!connect             Open a new connection to the database.
!dbinfo              Give metadata information about the database
!describe            Describe a table
!dropall             Drop all tables in the current database
!exportedkeys        List all the exported keys for the specified table
!go                  Select the current connection
!help                Print a summary of command usage
!history             Display the command history
!importedkeys        List all the imported keys for the specified table
!indexes             List all the indexes for the specified table
!isolation           Set the transaction isolation for this connection
!list                List the current connections
!manual              Display the BeeLine manual
!metadata            Obtain metadata information
!nativesql           Show the native SQL for the specified statement
!nullemptystring     Set to true to get historic behavior of printing null as empty string. Default is false.
!outputformat        Set the output format for displaying results (table,vertical,csv2,dsv,tsv2,xmlattrs,xmlelements, and deprecated formats(csv, tsv))
!primarykeys         List all the primary keys for the specified table
!procedures          List all the procedures
!properties          Connect to the database specified in the properties file(s)
!quit                Exits the program
!reconnect           Reconnect to the database
!record              Record all output to the specified file
!rehash              Fetch table and column names for command completion
!rollback            Roll back the current transaction (if autocommit is off)
!run                 Run a script from the specified file
!save                Save the current variabes and aliases
!scan                Scan for installed JDBC drivers
!script              Start saving a script to a file
!set                 Set a beeline variable
!sh                  Execute a shell command
!sql                 Execute a SQL command
!tables              List all the tables in the database
!typeinfo            Display the type map for the current connection
!verbose             Set verbose mode on
Comments, bug reports, and patches go to ???
Exit with !quit.
3) The hcatalog non-interactive client
This continues with the hiveserver2 metastore from section 2.
hcatalog executes SQL directly from the command line and is mainly used for DDL statements.
Create the log directory:

mkdir -p /soft/hive-2.2.0/hcatalog/var/log

Start the service:

[root@node1 sbin]# ./hcat_server.sh start
Started metastore server init, testing if initialized correctly...
Metastore initialized successfully on port[9083].

On the same machine run:

[root@node1 bin]# ./hcat -e "create table ggg(id int,name string)"

Then use the hive client to check that the table was created:

[root@node1 log]# hive
which: no hbase in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/root/mongodb-linux-x86_64-rhel70-3.4.9/bin:/soft/hadoop-2.7.4/bin:/soft/hadoop-2.7.4/sbin:/soft/hive-2.2.0/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/soft/hive-2.2.0/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/soft/hive-2.2.0/lib/hive-common-2.2.0.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> show databases;
OK
default
hello
test
Time taken: 3.325 seconds, Fetched: 3 row(s)
hive> show tables;
OK
ggg
Time taken: 0.209 seconds, Fetched: 1 row(s)
hive> desc ggg;
OK
id      int
name    string
Time taken: 0.926 seconds, Fetched: 2 row(s)

For other usage see https://cwiki.apache.org/confluence/display/Hive/HCatalog+UsingHCat
hcatalog supports only local access; for remote access it must be paired with the webhcat service, which exposes Hive through REST-style URLs.
For details see:
Installation: https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ConfigurationManagementOverview
Introduction: https://cwiki.apache.org/confluence/display/Hive/WebHCat+UsingWebHCat