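Setting Up a Big Data Cluster Environment

Source: Internet. Published: linux查看文件目录命令. Editor: 程序博客网. Time: 2024/05/16 19:07

1 Overview
This manual covers every step of setting up a big data cluster environment: environment preparation (operating system, Java environment, and so on), Hadoop, Spark, NoSQL databases, and more.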
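2 Environment Preparation
2.1 Operating System

Install the operating system: all nodes use CentOS 6.7 x64.

2.2 Setting Up the Java Environment

Download jdk1.8.0_102.
Upload it to the server and extract it, for example to /usr/local/jdk1.8.0_102.
vi /etc/profile
Append the following at the end:
export JAVA_HOME=/usr/local/jdk1.8.0_102
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
source /etc/profile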
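
After editing /etc/profile it can help to confirm that JAVA_HOME really points at a JDK before moving on. The following is only a minimal sketch: check_java_home is a hypothetical helper name, not part of the JDK or of this manual, and it checks nothing beyond the presence of an executable bin/java.

```shell
# Hypothetical helper: succeeds only if the given directory looks like a
# JDK install, i.e. it contains an executable bin/java.
check_java_home() {
    dir="$1"
    if [ -z "$dir" ]; then
        echo "JAVA_HOME is empty" >&2
        return 1
    fi
    if [ ! -x "$dir/bin/java" ]; then
        echo "no executable $dir/bin/java" >&2
        return 1
    fi
    echo "ok: $dir"
}

# Possible usage after reloading /etc/profile:
#   check_java_home "$JAVA_HOME" && java -version
```

Every node needs the same check, since a wrong JAVA_HOME on a single node is enough to keep its Hadoop daemons from starting.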

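2.3 Installing ssh
(1) Ubuntu:
Install: apt-get install openssh-server
Start: sudo service ssh start
Stop: sudo service ssh stop
Restart: sudo service ssh restart
Check whether the ssh service is running: ps -e | grep ssh
vim /etc/ssh/sshd_config => PermitRootLogin yes
(2) CentOS: yum install openssh-server ---- run ssh -V first to check whether it is already installed
2.4 *Installing rsync
 Optional.
 Check whether rsync is installed: dpkg --list | grep rsync (Ubuntu) or rpm -qa | grep rsync (CentOS)
 Install rsync: apt-get install rsync / yum install rsync
2.5 *Installing Maven
 Needed only if you compile from source; skip it otherwise.
 Download Maven, upload it to the server, and extract it ---- for example under /usr/local

 vi /etc/profile ---- configure the Maven environment variables
 source /etc/profile ---- apply the configuration
 mvn -version ---- print the version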
3 Setting Up Hadoop
3.1 Single Node
3.1.1 Installing Hadoop
 Download hadoop-2.7.2
 cd /usr/local
 sudo tar zxvf {PATH}/hadoop-2.7.2.tar.gz
 Configure the JDK:
cd /usr/local/hadoop-2.7.2
vim etc/hadoop/hadoop-env.sh ---- set export JAVA_HOME={absolute path of the JDK}
 ./bin/hadoop version
 Start Hadoop: sudo ./sbin/start-all.sh
3.1.2 Running a Test Job
 sudo mkdir input
 cp etc/hadoop/*.xml input
 sudo ./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+' ---- count the occurrences of each string in the input that starts with dfs
 $ cat output/* ---- view the result, for example: 1 dfsadmin
 To rerun the example, delete the output directory first; otherwise the job fails.
3.2 Pseudo-Distributed Mode
3.2.1 Configuring core-site.xml
 sudo vim etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
3.2.2 Configuring hdfs-site.xml
 sudo vim etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
3.2.3 Configuring Passwordless ssh Login
 su root ---- switch to the root user
 ssh-keygen -t rsa ---- press Enter at every prompt
 cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys ---- authorize the key
 scp /root/.ssh/authorized_keys root@192.168.72.101:/root/.ssh
......
---- copy /root/.ssh/authorized_keys to the other nodes
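
When more than one node generates a key, the public keys have to end up in a single authorized_keys file. The sketch below assumes each node's id_rsa.pub has already been collected into one directory under a distinct *.pub name; merge_pubkeys and that directory layout are illustrative assumptions, not something the original steps prescribe.

```shell
# Hypothetical helper: merge every *.pub file in a collection directory
# into one authorized_keys file, dropping blank lines and duplicate keys.
merge_pubkeys() {
    src_dir="$1"      # directory holding one public key file per node
    out_file="$2"     # authorized_keys file to write
    cat "$src_dir"/*.pub | grep -v '^[[:space:]]*$' | sort -u > "$out_file"
    chmod 600 "$out_file"   # keep the file private to its owner
}

# Possible usage (paths are examples only):
#   scp root@192.168.72.101:/root/.ssh/id_rsa.pub keys/node101.pub
#   merge_pubkeys keys /root/.ssh/authorized_keys
```

The merged file can then be copied to each node with scp, as in the step above.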

3.2.4 Running on HDFS
 cd /usr/local/hadoop-2.7.2
 bin/hdfs namenode -format
 sudo sbin/start-dfs.sh
 NameNode - http://localhost:50070/
3.2.5 Running on YARN
 vim etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
 vim etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
 sudo sbin/start-yarn.sh
 http://localhost:8088/
3.2.6 Running a Test Job
 bin/hdfs dfs -mkdir /user
 bin/hdfs dfs -mkdir /user/<username>
 bin/hdfs dfs -put etc/hadoop input
 bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
 bin/hdfs dfs -get output output
 cat output/*
 (bin/hdfs dfs -cat output/*)
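3.3 Fully Distributed Mode
3.3.1 Setting Up the Java Environment
 See section 2.2.
 The Java environment variables must be configured on every node.
3.3.2 Configuring hosts
 vi /etc/hosts
 Add the following entries on every node:
192.168.72.100 namenode.domain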
20.0.2.74 namenode2.domain
192.168.72.101 datanode1.domain
192.168.72.102 datanode2.domain
192.168.72.103 datanode3.domain
192.168.72.104 datanode4.domain
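
A misspelled or missing hosts entry tends to surface only later, as a daemon that cannot reach its peers. As a rough sketch, the hypothetical helper below (missing_hosts is not a system command) lists the cluster hostnames that have no entry in a given hosts file; it compares names only and does not validate the IP addresses.

```shell
# Hypothetical helper: print each given hostname that has no entry in the
# given hosts file. Comment lines are ignored; the first field of each
# line is treated as the IP address and the rest as names.
missing_hosts() {
    hosts_file="$1"
    shift
    for name in "$@"; do
        if ! awk -v h="$name" '
            /^[[:space:]]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
            END { exit found ? 0 : 1 }
        ' "$hosts_file"; then
            echo "$name"
        fi
    done
}

# Possible usage on each node:
#   missing_hosts /etc/hosts namenode.domain namenode2.domain datanode1.domain
```

Empty output means every listed name is present.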
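3.3.3 Configuring Passwordless ssh Login
 su - root ---- switch to the root user
 ssh-keygen -t rsa ---- press Enter at every prompt
 cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys ---- authorize the key
 Copy authorized_keys to the other nodes.
 ssh localhost ---- test whether a password is still required
 Run ssh-keygen -t rsa once on every node.
 Collect the public keys from id_rsa.pub on every node into a single authorized_keys file, then copy that file to every node.
3.3.4 Disabling the Firewall
 Permanent (takes effect after a reboot):
Enable: chkconfig iptables on
Disable: chkconfig iptables off
 Immediate (lost after a reboot):
Enable: service iptables start
Disable: service iptables stop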
3.3.5 Installing Hadoop
 Download hadoop-2.7.2
 Upload it to the server and extract it, for example to /usr/local/bigdata/hadoop-2.7.2
 vi /etc/profile
export HADOOP_HOME=/usr/local/bigdata/hadoop-2.7.2
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
source /etc/profile
3.3.6 Configuring hadoop-env.sh
 cd /usr/local/bigdata/hadoop-2.7.2
 vi etc/hadoop/hadoop-env.sh ---- append at the end of the file:
 export JAVA_HOME=/usr/local/jdk1.8.0_102 (must be the absolute path of the JDK)
3.3.7 Configuring masters
 cd /usr/local/bigdata/hadoop-2.7.2
 vi etc/hadoop/masters ---- with the following content:
namenode2.domain
3.3.8 Configuring slaves
 cd /usr/local/bigdata/hadoop-2.7.2
 vi etc/hadoop/slaves ---- with the following content:
namenode2.domain
datanode1.domain
datanode2.domain
datanode3.domain
datanode4.domain
3.3.9 Configuring core-site.xml
 cp etc/hadoop/core-site.xml etc/hadoop/core-site.xml.template (back up first)
 vi etc/hadoop/core-site.xml ---- add the following configuration:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://namenode.domain:9000</value>
</property>
<property>

<name>io.file.buffer.size</name>
<value>65536</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/usr/local/bigdata/hadoop-2.7.2/tmp</value>
</property>
</configuration>
3.3.10 Configuring hdfs-site.xml
 cp etc/hadoop/hdfs-site.xml etc/hadoop/hdfs-site.xml.template (back up first)
 vi etc/hadoop/hdfs-site.xml ---- add the following configuration:
<configuration>
<property>
<name>dfs.http.address</name>
<value>namenode.domain:50070</value>
<description>
The address and the base port where the dfs namenode web ui will listen on.
If the port is 0 then the server will start on a free port.
</description>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>namenode2.domain:50090</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/bigdata/hadoop-2.7.2/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/bigdata/hadoop-2.7.2/dfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
</configuration>
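
Because every node must carry the same settings, it is handy to read a single property back out of a site file and compare it across nodes. This is a line-oriented sketch, not a real XML parser: get_site_prop is a hypothetical helper that assumes the one-tag-per-line layout used in the listings above.

```shell
# Hypothetical helper: print the <value> that follows a given <name> in a
# Hadoop *-site.xml file written with one tag per line.
get_site_prop() {
    file="$1"
    prop="$2"
    awk -v p="<name>$prop</name>" '
        index($0, p) { want = 1; next }   # found the property name
        want && /<value>/ {
            sub(/.*<value>/, "")          # strip everything up to the value
            sub(/<\/value>.*/, "")        # strip the closing tag
            print
            exit
        }
    ' "$file"
}

# Possible usage:
#   get_site_prop etc/hadoop/hdfs-site.xml dfs.replication
```

On a node with Hadoop installed, bin/hdfs getconf -confKey dfs.replication reports the value Hadoop itself resolves, which is the more reliable check once the files are in place.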
3.3.11 Configuring yarn-site.xml
 cp etc/hadoop/yarn-site.xml etc/hadoop/yarn-site.xml.template (back up first)
 vi etc/hadoop/yarn-site.xml ---- add the following configuration:
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>namenode.domain</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
3.3.12 Configuring mapred-site.xml
 cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
 vi etc/hadoop/mapred-site.xml ---- add the following configuration:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
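
Once the configuration files above are complete on one node, every other node needs an identical copy. One way to keep that manageable is to generate the copy commands from the slaves file and review them before running anything. The sketch below only prints rsync commands; plan_sync is a hypothetical helper, and it assumes rsync is installed (section 2.4) and that root can ssh to each host without a password.

```shell
# Hypothetical helper: print one rsync command per host listed in a
# slaves-style file (one hostname per line; blank lines and lines starting
# with # are skipped). Nothing is copied; the commands are only printed.
plan_sync() {
    hosts_file="$1"   # e.g. etc/hadoop/slaves
    src_dir="$2"      # e.g. /usr/local/bigdata/hadoop-2.7.2
    self="$3"         # hostname of the node doing the copy; it is skipped
    while IFS= read -r host; do
        case "$host" in
            ''|'#'*) continue ;;
        esac
        if [ "$host" = "$self" ]; then
            continue
        fi
        echo "rsync -az $src_dir/ root@$host:$src_dir/"
    done < "$hosts_file"
}

# Possible usage, run from the Hadoop directory on namenode.domain:
#   plan_sync etc/hadoop/slaves /usr/local/bigdata/hadoop-2.7.2 namenode.domain
```

Piping the reviewed output to sh would execute the copies; keeping the print step separate makes it harder to overwrite the wrong directory by accident.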
3.3.13 Copying the JDK and hadoop-2.7.2 to the Other Nodes
 Copy the JDK and hadoop-2.7.2 directories under /usr/local to the other nodes.
 Configure the Java environment variables (see section 2.2).
3.3.14 Running a Test Job
 bin/hdfs namenode -format (run only once)
 sbin/start-all.sh ---- start the cluster
 bin/hdfs dfs -mkdir /user
 bin/hdfs dfs -mkdir /user/<username>
 bin/hdfs dfs -put etc/hadoop input
 bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
The console output at this point shows whether the job ran successfully.

 bin/hdfs dfs -get output output
 cat output/* ---- view the result

 (bin/hdfs dfs -cat output/*)
3.3.15 Operating Hadoop
 cd /usr/local/bigdata/hadoop-2.7.2
3.3.15.1 Formatting the NameNode
 bin/hdfs namenode -format
 The format succeeded if the output contains "successfully formatted".
 Run this only once.
3.3.15.2 Starting and Stopping Hadoop
 sbin/start-dfs.sh; sbin/start-yarn.sh ---- start
(sbin/start-all.sh is deprecated and not recommended)
 sbin/stop-dfs.sh; sbin/stop-yarn.sh ---- stop
(sbin/stop-all.sh is deprecated and not recommended)
 NameNode: http://namenode.domain:50070
 ResourceManager: http://namenode.domain:8088
 MapReduce JobHistory Server: http://namenode.domain:19888
3.3.15.3 Hadoop Commands
 bin/hadoop ---- prints the description for hadoop commands
Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
CLASSNAME run the class named CLASSNAME
or
where COMMAND is one of:
fs run a generic filesystem user client
version print the version
jar <jar> run a jar file
note: please use "yarn jar" to launch
YARN applications, not this command.
checknative [-a|-h] check native hadoop and compression libraries availability
distcp <srcurl> <desturl> copy file or directories recursively
archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
classpath prints the class path needed to get the
credential interact with credential providers
Hadoop jar and the required libraries
daemonlog get/set the log level for each daemon
trace view and modify Hadoop tracing settings
Most commands print help when invoked w/o parameters.
 bin/hdfs ---- prints the description for hdfs commands
Usage: hdfs [--config confdir] [--loglevel loglevel] COMMAND
where COMMAND is one of:
dfs run a filesystem command on the file systems supported in Hadoop.
classpath prints the classpath
namenode -format format the DFS filesystem
secondarynamenode run the DFS secondary namenode
namenode run the DFS namenode
journalnode run the DFS journalnode
zkfc run the ZK Failover Controller daemon
datanode run a DFS datanode
dfsadmin run a DFS admin client
haadmin run a DFS HA admin client
fsck run a DFS filesystem checking utility
balancer run a cluster balancing utility
jmxget get JMX exported values from NameNode or DataNode.
mover run a utility to move block replicas across
storage types
oiv apply the offline fsimage viewer to an fsimage
oiv_legacy apply the offline fsimage viewer to an legacy fsimage
oev apply the offline edits viewer to an edits file
fetchdt fetch a delegation token from the NameNode
getconf get config values from configuration
groups get the groups which users belong to
snapshotDiff diff two snapshots of a directory or diff the
current directory contents with a snapshot
lsSnapshottableDir list all snapshottable dirs owned by the current user
Use -help to see options
portmap run a portmap service
nfs3 run an NFS version 3 gateway
cacheadmin configure the HDFS cache
crypto configure HDFS encryption zones
storagepolicies list/get/set block storage policies
version print the version
Most commands print help when invoked w/o parameters.
 bin/mapred ---- prints the description for mapred commands
Usage: mapred [--config confdir] [--loglevel loglevel] COMMAND
where COMMAND is one of:
pipes run a Pipes job
job manipulate MapReduce jobs
queue get information regarding JobQueues
classpath prints the class path needed for running
mapreduce subcommands
historyserver run job history servers as a standalone daemon
distcp <srcurl> <desturl> copy file or directories recursively
archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
hsadmin job history server admin interface
Most commands print help when invoked w/o parameters.
 bin/yarn ---- prints the description for yarn commands
Usage: yarn [--config confdir] [COMMAND | CLASSNAME]
CLASSNAME run the class named CLASSNAME
or
where COMMAND is one of:
resourcemanager -format-state-store deletes the RMStateStore
resourcemanager run the ResourceManager
nodemanager run a nodemanager on each slave
timelineserver run the timeline server
rmadmin admin tools
sharedcachemanager run the SharedCacheManager daemon
scmadmin SharedCacheManager admin tools
version print the version
jar <jar> run a jar file
application prints application(s)
report/kill application
applicationattempt prints applicationattempt(s)
report
container prints container(s) report
node prints node report(s)
queue prints queue information
logs dump container logs
classpath prints the class path needed to
get the Hadoop jar and the
required libraries
cluster prints cluster information
daemonlog get/set the log level for each
daemon
Most commands print help when invoked w/o parameters.
3.3.15.4 Hadoop Compatibility
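3.3.16 Important Notes
 Keep the hosts file identical on all nodes.
 Keep the Hadoop configuration files identical on all nodes.
 Every node must generate its own ssh key so that all nodes in the cluster can log in to each other without a password.
 Disable the firewall on all nodes.
4 Installing and Configuring Hive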
4.1 Requirements
 Java 1.7+
Note: Hive versions 1.2 onward require Java 1.7 or newer. Hive versions 0.14 to 1.1 work with Java 1.6 as well. Users are strongly advised to start moving to Java 1.8 (see HIVE-8607).
 Hadoop 2.x (preferred), 1.x (not supported by Hive 2.0.0 onward).
Hive versions up to 0.13 also supported Hadoop 0.20.x, 0.23.x.
 Hive is commonly used in production Linux and Windows environment. Mac is a commonly used development environment. The instructions in this document are applicable to Linux and Mac. Using it on Windows would require slightly different steps.
4.2 Prerequisites
4.2.1 Setting Up the Java Environment
 See section 2.2, Setting Up the Java Environment.
4.2.2 Setting Up Hadoop
 See chapter 3, Setting Up Hadoop.
4.3 Installing and Configuring Hive
4.3.1 Downloading the Latest Stable Release
 Hive-2.1.0
Source download:
https://mirrors.tuna.tsinghua.edu.cn/apache/hive/hive-2.1.0/apache-hive-2.1.0-src.tar.gz
Binary download:
https://mirrors.tuna.tsinghua.edu.cn/apache/hive/hive-2.1.0/apache-hive-2.1.0-bin.tar.gz
4.3.2 Installation and Configuration
 cd /usr/local/bigdata
---- this installation is on the namenode; a later one goes on the secondarynamenode
 tar zxvf apache-hive-2.1.0-bin.tar.gz
---- extract into the bigdata directory and rename the result to apache-hive-2.1.0
 vi /etc/profile ---- configure the Hive environment variables
export HIVE_HOME=/usr/local/bigdata/apache-hive-2.1.0 ---- the Hive installation directory
export PATH=$HIVE_HOME/bin:$PATH
source /etc/profile
4.4 Running Apache Hive
4.4.1 Ready
 $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
 $HADOOP_HOME/bin/hadoop fs -mkdir -p /user/hive/warehouse
 $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
 $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
 Do not use the embedded Derby database: it was verified not to support connections from multiple clients, so switch to MySQL or PostgreSQL.
 For custom settings, copy hive-default.xml.template to hive-site.xml.
4.4.2 Using MySQL to Manage the Hive Metadata
 Install and configure MySQL ---- see chapter 5. In this example Hive runs on the namenode and MySQL on the secondarynamenode.
 Copy the MySQL JDBC driver jar to {HIVE_HOME}/lib.
 Create a new database in MySQL
---- for example metastore_db; its character set must be latin1.
 vi {HIVE_HOME}/conf/hive-site.xml ---- user-defined configuration

4.4.3 Running Hive CLI (command line interface)
 $HIVE_HOME/bin/hive ---- start Hive; the hive> prompt indicates a successful start

 HiveCLI is now deprecated in favor of Beeline, as it lacks the multi-user, security, and other capabilities of HiveServer2.
 Test it with a simple statement such as show tables;

4.4.4 Running Hive Server/Client
 $HIVE_HOME/bin/hiveserver2 ---- 启动HiveServer2
 $HIVE_HOME/bin/beeline -u jdbc:hive2://192.168.72.100:10000
---- Beeline is started with the JDBC URL of the HiveServer2, which depends on the address and port where HiveServer2 was started.
By default, it will be (localhost:10000), so the address will look like jdbc:hive2://localhost:10000. Or to start Beeline and HiveServer2 in the same process for testing purpose, for a similar user experience to HiveCLI:
$HIVE_HOME/bin/beeline -u jdbc:hive2://
 vi /usr/local/bigdata/hadoop-2.7.2/etc/hadoop/core-site.xml
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
---- these proxy-user permissions must be set before connecting to the Hive server remotely over JDBC
4.4.5 Running Hive Web UI
 $HIVE_HOME/hcatalog/sbin/hcat_server.sh ---- running HCatalog
 $HIVE_HOME/hcatalog/sbin/webhcat_server.sh ---- running WebHCat
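4.5 Apache Hive Pros and Cons
 Pros:
(1) Hive uses an SQL-like query syntax that stays as compatible with standard SQL as possible, which greatly flattens the learning curve for traditional data analysts;
(2) JDBC and ODBC interfaces make application development easier;
(3) with MapReduce as the compute engine and HDFS as the storage system, it offers compute and scaling capacity designed for very large data sets;
(4) unified metadata management (Derby, MySQL, and so on) that can be shared with Pig, Presto, and others.
 Cons:
(1) HQL has limited expressive power, and some complex computations are hard to express in it;
(2) because Hive generates the MapReduce jobs automatically, HQL is hard to tune;
(3) control is coarse-grained and limited;
(4) Hive does not support operations on individual rows: data can only be overwritten or appended, and transactions are not supported.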
5 Installing and Configuring MySQL
5.1 Removing a MySQL Installed with yum
 yum remove mysql mysql-server mysql-libs compat-mysql51
 rm -rf /var/lib/mysql
 rm /etc/my.cnf
 rpm -qa | grep mysql ---- check for remaining MySQL packages and remove any that are left
5.2 Removing a MySQL Installed with rpm
 rpm -qa | grep mysql ---- list the installed MySQL packages

 rpm -e --nodeps mysql-libs-5.1.73-5.el6_6.x86_64
 rpm -qa | grep mysql ---- check for remaining MySQL packages and remove any that are left
5.3 Installing MySQL with rpm
 Download and upload MySQL-5.6.29-1.el6.x86_64.rpm-bundle.tar
 tar xvf MySQL-5.6.29-1.el6.x86_64.rpm-bundle.tar
 rpm -ivh MySQL-server-5.6.29-1.el6.x86_64.rpm

 rpm -ivh MySQL-client-5.6.29-1.el6.x86_64.rpm

 rpm -ivh MySQL-devel-5.6.29-1.el6.x86_64.rpm

 Start, stop, and restart the service:
service mysql start ---- start the service
service mysql stop ---- stop the service
service mysql restart ---- restart the service
 cat /root/.mysql_secret ---- view the generated initial password
 mysql -uroot -p<password> ---- log in with the initial password
 SET PASSWORD=PASSWORD('<new password>'); ---- set a new password from the mysql prompt
 show variables like 'char%'; ---- view the current character sets

 set character_set_database=utf8; ---- set the database character set to utf8
 set character_set_server=utf8; ---- set the server character set to utf8
 Restart the database service and check whether the character set change persisted; if it did not, edit the configuration file:
 vi /usr/my.cnf
 GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'password' WITH GRANT OPTION; ---- authorize remote client connections
 FLUSH PRIVILEGES; ---- apply the grants
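6 Installing and Configuring Apache Zookeeper
6.1 Download and Installation
 http://mirrors.cnnic.cn/apache/zookeeper/stable/zookeeper-3.4.8.tar.gz/
---- download URL; upload the archive to the server after downloading

 tar zxvf zookeeper-3.4.8.tar.gz ---- extract
6.2 Configuring Environment Variables
 vi /etc/profile ---- configure the Zookeeper environment variables
export ZOOKEEPER_HOME=/usr/local/bigdata/zookeeper-3.4.8
export PATH=$ZOOKEEPER_HOME/bin:$PATH
 source /etc/profile
 Configure this on every node in the cluster.
6.3 Configuring zoo.cfg
 vi conf/zoo.cfg ---- create the configuration file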
tickTime=2000
dataDir=/home/root/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=namenode.domain:2888:3888
server.2=namenode2.domain:2888:3888
server.3=datanode1.domain:2888:3888
server.4=datanode2.domain:2888:3888
server.5=datanode4.domain:2888:3888
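
Each server.N line ties a number N to a host, and that number is what the host's myid file must contain. To avoid typing it by hand on every node, the id can be looked up from zoo.cfg. The following is a sketch under assumptions: myid_for_host is a hypothetical helper, and it only works if the hostname passed in is spelled exactly as in zoo.cfg.

```shell
# Hypothetical helper: print the N of the "server.N=host:port:port" line
# in zoo.cfg whose host part equals the given hostname.
myid_for_host() {
    cfg="$1"
    host="$2"
    awk -v h="$host" '
        /^server\.[0-9]+=/ {
            split($0, kv, "=")        # kv[1] = "server.N", kv[2] = "host:port:port"
            split(kv[2], addr, ":")   # addr[1] = host
            if (addr[1] == h) {
                sub(/^server\./, "", kv[1])
                print kv[1]
                exit
            }
        }
    ' "$cfg"
}

# Possible usage on each node, assuming hostname -f returns the name used
# in zoo.cfg and <dataDir> is the dataDir path from zoo.cfg:
#   myid_for_host conf/zoo.cfg "$(hostname -f)" > <dataDir>/myid
```

An empty result means the host is not listed, which is worth catching before the server is started.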
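6.4 Creating the Data Directory
 cd {ZOOKEEPER_HOME} ---- enter the Zookeeper installation directory
 mkdir data ---- create the data directory; it must be the path that dataDir in zoo.cfg points to
 cd data
 touch myid
 echo 1 > myid ---- this 1 matches the 1 of server.1 in zoo.cfg
6.5 Copying the Zookeeper Home Directory to the Other Nodes
 On each node, change the number in myid to that server's id in zoo.cfg; for example, namenode2 is server.2, so myid on that node must contain 2.
6.6 Starting and Stopping
 zkServer.sh start ---- start; run on every node
 zkServer.sh stop ---- stop; run on every node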
7 Installing and Configuring Apache HBase
7.1 Features
 Linear and modular scalability.
 Strictly consistent reads and writes.
 Automatic and configurable sharding of tables.
 Automatic failover support between RegionServers.
 Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
 Easy to use Java API for client access.
 Block cache and Bloom Filters for real-time queries.
 Query predicate push down via server side Filters.
 Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options.
 Extensible jruby-based (JIRB) shell.
 Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX.
7.2 Download and Installation
 http://apache.fayea.com/hbase/stable/hbase-1.2.2-bin.tar.gz
---- download URL; upload the archive to the server after downloading

 tar zxvf hbase-1.2.2-bin.tar.gz ---- extract
7.3 Configuration
7.3.1 Basic Configuration
 vi hbase-1.2.2/conf/hbase-env.sh ---- set JAVA_HOME

 vi /etc/profile ---- configure the HBase environment variables
export HBASE_HOME=/usr/local/bigdata/hbase-1.2.2
export PATH=$HBASE_HOME/bin:$PATH
source /etc/profile
 Configure this on every node in the cluster.
7.3.2 Standalone Mode Configuration
 vi hbase-1.2.2/conf/hbase-site.xml
---- by default HBase stores its data under /tmp/hbase-${user.name}; /tmp may be cleared when the system reboots, so set the HBase and Zookeeper data directories explicitly
<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///home/root/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/root/zookeeper</value>
</property>
</configuration>
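7.3.3 Distributed Mode Configuration
 vi hbase-1.2.2/conf/hbase-site.xml ---- configure zookeeper and rootdir

Also set hbase.master.maxclockskew, because differing system clocks across nodes can cause problems.
 vi hbase-1.2.2/conf/hbase-env.sh ---- use the existing Zookeeper cluster
export HBASE_MANAGES_ZK=false
 Package HBase, copy it to the same path on the other nodes, and configure the environment variables there.
 vi hbase-1.2.2/conf/backup-masters ---- backup masters that start together with the master
 vi hbase-1.2.2/conf/regionservers ---- region servers that start together with the master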
namenode.domain
namenode2.domain
datanode1.domain
datanode2.domain
datanode4.domain
 hadoop fs -ls /hbase ---- Check the HBase directory in HDFS

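7.4 Starting and Stopping
 zkServer.sh start ---- start Zookeeper; run on every node
 start-hbase.sh ---- start HBase
stop-hbase.sh ---- stop HBase
 http://master:16010/ ---- master web UI

 http://backup-master:16010/ ---- backup master web UI

 http://region-server:16030/ ---- region server web UI

7.5 Connect to HBase With Shell
 hbase shell ---- connect to HBase with the shell

7.5.1 Shell Exercises
 start-hbase.sh ---- start HBase
 create '<table>','<column family>' ---- for example, create a table named testhbase with a single column family mycf
 list ---- list all tables
 describe 'testhbase' ---- show the table structure
 put '<table>','<row key>','<column family>:<column>','<value>'
---- for example
put 'testhbase','row1','mycf:a','aaaa'
put 'testhbase','row2','mycf:a','aaaa2'
put 'testhbase','row1','mycf:a','aaaa'
put 'testhbase','row2','mycf:c','cccc'
 scan '<table>' ---- scan the table
7.6 Notes
 HBase currently does not handle more than two or three column families well, so keep the number of column families small;
 avoid monotonically increasing row keys such as timestamps or sequences (e.g. 1, 2, 3);
 keep rows and columns as small as possible;
 keep column family names short, ideally a single character (such as "d" for data/default);
 the number of row versions is set through HColumnDescriptor, per column family; the default is 1 since HBase 0.96 (3 in earlier releases). The minimum number of versions defaults to 0, which disables that feature;
 supported data types: input can be strings, numbers, complex objects, or even images, as long as they can be converted to bytes.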
8 Installing and Configuring Apache Storm
8.1 Features
 Storm is simple, can be used with any programming language.
 Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more.
 Storm is fast: a benchmark clocked it at over a million tuples processed per second per node.
 Users: Yahoo, Twitter, Spotify, Yelp, Flipboard, Ooyala, Alibaba, Baidu, iQIYI, and others.
8.2 Download and Installation
 https://storm.apache.org/downloads.html
---- download page; upload the archive to the server after downloading

 tar xvf apache-storm-1.0.1.tar ---- extract

8.3 Configuring Environment Variables
 vi /etc/profile ---- configure the Apache Storm environment variables
export STORM_HOME=/usr/local/bigdata/apache-storm-1.0.1
export PATH=$STORM_HOME/bin:$PATH
 source /etc/profile
 Configure this on every node in the cluster.
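8.4 Configuring storm.yaml
 cd {STORM_HOME} ---- enter the Storm directory
 vi conf/storm.yaml
storm.zookeeper.servers: ---- the Zookeeper cluster
- "namenode.domain"
- "namenode2.domain"
- "datanode1.domain"
- "datanode2.domain"
- "datanode4.domain"
storm.local.dir: "/usr/local/apache-storm-1.0.1/data" ---- The Nimbus and Supervisor daemons require a directory on the local disk to store small amounts of state (like jars, confs, and things like that)
nimbus.seeds: ["namenode2.domain"] ---- the Storm master (Nimbus) nodes; more than one can be listed
ui.port: 8888 ---- UI port; must be an integer, otherwise startup fails
 Copy the Storm directory to the other nodes.
8.5 Starting and Stopping the Storm Cluster
A Storm cluster consists of master nodes and worker nodes. Following your storm.yaml, start nimbus on the master nodes and supervisor on the worker nodes.
 zkServer.sh start ---- start the Zookeeper cluster; skip this step if it is already running
 storm nimbus & ---- start the master node in the background

 storm supervisor & ---- start a worker node in the background

 storm logviewer & ---- start logviewer in the background, preferably on every node

 storm ui & ---- start the UI in the background; running it on the master node alone is enough

 http://<ip>:8888 ---- open the UI to view worker node logs and the cluster configuration
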

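9 Installing and Configuring Spark
9.1 Download and Installation
9.1.1 Downloading and Installing JDK 1.8+
 See section 2.2.
9.1.2 Downloading and Installing Scala-2.11.8
 Download it and extract it into a directory on the server.

 Install it on every node.
9.1.3 Downloading and Installing spark-1.6.2-bin-hadoop2.6
 Download it and extract it into a directory on the server.

 Install it on every node.
9.2 Configuring Environment Variables
9.2.1 Configuring the Scala Environment Variables
 vi /etc/profile ---- configure the environment variables
export SCALA_HOME=/usr/local/scala-2.11.8
export PATH=$SCALA_HOME/bin:$PATH
 source /etc/profile ---- apply the configuration
 scala -version ---- print the Scala version; if the version is shown, the configuration works

 scala ---- enter the Scala command-line tool

 Configure this on every node.
9.2.2 Configuring the Spark Environment Variables
 cd {SPARK_HOME} ---- enter the Spark installation directory
 cp conf/spark-env.sh.template conf/spark-env.sh ---- copy the configuration template
 vi conf/spark-env.sh
export SPARK_DIST_CLASSPATH=$(/usr/local/bigdata/hadoop-2.7.2/bin/hadoop classpath)
 Configure this on every node; you can finish it on the master node and then copy it to the others.
 bin/run-example SparkPi ---- run an example as a test; it computes Pi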
9.2.3 Fully Distributed Spark Configuration
 cd {SPARK_HOME} ---- enter the Spark installation directory
 cp conf/slaves.template conf/slaves ---- copy the configuration template
 vi conf/slaves ---- add the worker nodes
 vi conf/spark-env.sh ---- configure the master host and port
export SPARK_MASTER_IP=namenode.domain
export SPARK_MASTER_PORT=7077
 {SPARK_HOME}/sbin/start-all.sh ---- start the cluster
 http://master:8080 ---- cluster UI

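9.3 *Integrating Hive
This rebuild step is needed if you use spark-1.6.2-bin-without-hadoop. Spark releases after 1.3 do not bundle Hive, so the Hive module has to be compiled yourself.
 Install and configure Maven ---- see section 2.5
 Download the Spark source and extract it into a directory on the server, for example /usr/local/bigdata/spark-1.6.2
 cd /usr/local/bigdata/spark-1.6.2 ---- enter the Spark source directory
 vi pom.xml ---- by default Spark builds against hive-1.2.1 and hadoop 2.6; change these to your own versions in pom.xml if needed
 mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.2 -Phive -Phive-thriftserver -Dscala-2.11 -DskipTests clean package
---- run the build, which takes a long time. To build with scala-2.11, first run dev/change-scala-version.sh 2.11 and then add -Dscala-2.11 to the build command
 Replace the spark-assembly*.jar file under SPARK_HOME/lib with the newly built assembly jar, for example SPARK_HOME_SRC/assembly/target/scala-2.10/spark-assembly-1.4.0-hadoop2.3.0-cdh5.0.0.jar
---- SPARK_HOME_SRC is the Spark source directory and SPARK_HOME is the Spark installation directory
 cp HIVE_HOME/conf/hive-site.xml SPARK_HOME/conf/
---- copy hive-site.xml from the Hive installation directory into Spark's conf directory
 SPARK_HOME/bin/spark-sql
 select * from {tablename}; ---- test query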
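10 Installing and Configuring Spark-2.0.x
10.1 Download and Installation
10.1.1 Downloading and Installing JDK 1.8+
 See section 2.2.
10.1.2 Downloading and Installing Scala-2.11.8
 See section 9.1.2.
10.1.3 Downloading and Installing spark-2.0.0-bin-hadoop2.7
 Download it and extract it into a directory on the server.

 Install it on every node; the recommended approach is to finish the configuration first and then copy the directory to the other nodes.
10.2 Configuring Environment Variables
10.2.1 Configuring the Scala Environment Variables
 See section 9.2.1.
10.2.2 Configuring the Spark Environment Variables
 See section 9.2.2.
10.2.3 Fully Distributed Spark Configuration
 See section 9.2.3.
10.2.4 Hive on Spark Configuration
 Pick a worker node and copy {HIVE_HOME}/conf/hive-site.xml
to the {SPARK_HOME}/conf/ directory.
 On every Spark node, add the JDBC driver for the Hive metastore database to {SPARK_HOME}/jars; for example, Hive commonly uses MySQL as its metastore, in which case the MySQL driver must be copied to {SPARK_HOME}/jars/.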

0 0