HBase Installation


Source: https://ccp.cloudera.com/display/CDHDOC/HBase+Installation


Contents

  • Upgrading HBase to the Latest CDH3 Release
    • Step 1: Perform a Graceful Cluster Shutdown
    • Step 2: Stop the ZooKeeper Server
    • Step 3: Install the new version of HBase
  • Installing HBase
  • Host Configuration Settings for HBase
    • Configuring the REST Port
    • Using DNS with HBase
    • Using the Network Time Protocol (NTP) with HBase
    • Setting User Limits for HBase
    • Using dfs.datanode.max.xcievers with HBase
  • Starting HBase in Standalone Mode
    • Installing the HBase Master for Standalone Operation
    • Starting the HBase Master
    • Accessing HBase by using the HBase Shell
  • Using MapReduce with HBase
  • Configuring HBase in Pseudo-distributed Mode
    • Modifying the HBase Configuration
    • Creating the /hbase Directory in HDFS
    • Enabling Servers for Pseudo-distributed Operation
    • Installing the HBase Thrift Server
  • Deploying HBase in a Distributed Cluster
    • Choosing where to Deploy the Processes
    • Configuring for Distributed Operation
  • Troubleshooting
  • Viewing the HBase Documentation

Apache HBase provides large-scale tabular storage for Hadoop using the Hadoop Distributed File System (HDFS). Cloudera recommends installing HBase in a standalone mode before you try to run it on a whole cluster.

Important
If you have not already done so, install Cloudera's yum, zypper/YaST, or apt repository before using the following commands to install or upgrade HBase. For instructions, see CDH3 Installation.

Upgrading HBase to the Latest CDH3 Release

Note
To see which version of HBase is shipping in the latest CDH3 release, check the Version and Packaging Information. For important information on new and changed components, see the Release Notes.

The instructions that follow assume that you are upgrading HBase as part of an upgrade to the latest CDH3 release, and have already performed the steps under Upgrading CDH3.

To upgrade HBase to the latest CDH3 release, proceed as follows.

Warning
You must shut down the HBase, Thrift, and ZooKeeper processes as shown below. If these processes are running during the upgrade, the new version will not work correctly.

Step 1: Perform a Graceful Cluster Shutdown

To shut HBase down gracefully, stop the Thrift server and clients, then stop the cluster.

  1. Stop the Thrift server and clients
    sudo service hadoop-hbase-thrift stop
  2. Stop the cluster.
    1. Use the following command on the master node:
      sudo service hadoop-hbase-master stop
    2. Use the following command on each node hosting a region server:
      sudo service hadoop-hbase-regionserver stop

This shuts down the master and the region servers gracefully.

Step 2: Stop the ZooKeeper Server

$ sudo service hadoop-zookeeper-server stop
Note
Depending on your platform and release, you may need to use
$ sudo /sbin/service hadoop-zookeeper-server stop

or

$ sudo /sbin/service hadoop-zookeeper stop

Step 3: Install the new version of HBase

Note
You may want to take this opportunity to upgrade ZooKeeper, but you do not have to upgrade ZooKeeper before upgrading HBase; the new version of HBase will run with the older version of ZooKeeper. For instructions on upgrading ZooKeeper, see Upgrading ZooKeeper to the Latest CDH3 Release.

Follow directions in the next section, Installing HBase.

Installing HBase

To install HBase on Ubuntu and other Debian systems:

$ sudo apt-get install hadoop-hbase

To install HBase On Red Hat-compatible systems:

$ sudo yum install hadoop-hbase

To install HBase on SUSE systems:

$ sudo zypper install hadoop-hbase

To list the installed files on Ubuntu and other Debian systems:

$ dpkg -L hadoop-hbase

To list the installed files on Red Hat and SUSE systems:

$ rpm -ql hadoop-hbase

You can see that the HBase package has been configured to conform to the Linux Filesystem Hierarchy Standard. (To learn more, run man hier). 

HBase wrapper script             /usr/bin/hbase
HBase Configuration Files        /etc/hbase/conf
HBase Jar and Library Files      /usr/lib/hbase
HBase Log Files                  /var/log/hbase
HBase service scripts            /etc/init.d/hadoop-hbase-*


You are now ready to enable the server daemons you want to use with Hadoop. Java-based client access is also available by adding the jars in /usr/lib/hbase/ and /usr/lib/hbase/lib/ to your Java class path.
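A minimal sketch of building that Java classpath from the locations listed above (assumption: the package laid the jars out under /usr/lib/hbase, as in the table; the loop is safe to run even if no jars are present):

```shell
# Collect every jar under the HBase install locations into one classpath string.
CP=""
for jar in /usr/lib/hbase/*.jar /usr/lib/hbase/lib/*.jar; do
  if [ -e "$jar" ]; then
    CP="$CP:$jar"
  fi
done
# Strip the leading ':' and print the result for use with e.g. java -cp "$CP".
echo "CLASSPATH=${CP#:}"
```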

Host Configuration Settings for HBase

Configuring the REST Port

You can use an init.d script, /etc/init.d/hadoop-hbase-rest, to start the REST server; for example:

/etc/init.d/hadoop-hbase-rest start

The script starts the server by default on port 8080. This is a commonly used port and so may conflict with other applications running on the same host.

If you need to change the port for the REST server, configure it in hbase-site.xml, for example:

<property>
  <name>hbase.rest.port</name>
  <value>60050</value>
</property>
Note
You can use HBASE_REST_OPTS in hbase-env.sh to pass other settings (such as heap size and GC parameters) to the REST server JVM.
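A quick way to confirm the REST server answers on the configured port (an assumption-laden sketch: it requires a running REST server on this host, and 60050 matches the hbase.rest.port value shown above; /version is the standard REST status endpoint):

```shell
# Probe the REST server's /version endpoint on the configured port.
PORT=60050
if command -v curl >/dev/null 2>&1; then
  curl -s "http://localhost:${PORT}/version" || echo "REST server not reachable on port ${PORT}"
else
  echo "curl not installed"
fi
```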

Using DNS with HBase

HBase uses the local hostname to report its IP address, so both forward and reverse DNS resolution must work. If your machine has multiple interfaces, HBase uses the interface that the primary hostname resolves to. If this is insufficient, you can set hbase.regionserver.dns.interface in the hbase-site.xml file to indicate the primary interface. For this setting to work properly, your cluster configuration must be consistent and every host must have the same network interface configuration. As an alternative, you can set hbase.regionserver.dns.nameserver in the hbase-site.xml file to choose a name server other than the system-wide default.

Using the Network Time Protocol (NTP) with HBase

The clocks on cluster members should be in basic alignment. Some skew is tolerable, but excessive skew can cause odd behavior. Run NTP, or an equivalent, on your cluster. If you are having problems querying data or see unusual cluster operations, verify the system time.
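A simple way to verify the system time on a node (a sketch, assuming the ntpq utility from the ntp package is installed; if it is not, only the current time is printed):

```shell
# Print the current system time in UTC for manual comparison across nodes.
date -u
# If ntpq is available, show the peer table; the "offset" column is skew in ms.
if command -v ntpq >/dev/null 2>&1; then
  ntpq -pn 2>&1 | head -20
else
  echo "ntpq not found; install and run the ntp package on every node"
fi
```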

Setting User Limits for HBase

Because HBase is a database, it opens many files at the same time. The default ulimit of 1024 for the maximum number of open files on Unix systems is insufficient. Under any significant load, HBase will fail in strange ways, and the error message java.io.IOException...(Too many open files) will be logged in the HBase or HDFS log files. For more information about this issue, see the Apache HBase Book. You may also notice errors such as:

2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6935524980745310745_1391901

Configuring ulimit for HBase

Cloudera recommends increasing the maximum number of file handles to more than 10,000. Note that increasing the file handles for the user who is running the HBase process is an operating system configuration, not an HBase configuration. A common mistake is to increase the limit for one user while HBase is actually running as a different user. HBase prints the ulimit it is using on the first line of its logs; make sure that it is correct.

If you are using ulimit, you must make the following configuration changes:

  1. In the /etc/security/limits.conf file, add the following lines:
    Note
    Only the root user can edit this file.
    hdfs  -       nofile  32768
    hbase  -       nofile  32768
  2. To apply the changes in /etc/security/limits.conf on Ubuntu and other Debian systems, add the following line in the /etc/pam.d/common-session file:
    session required  pam_limits.so
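After making the changes above, you can confirm the limit a session actually inherits (a minimal check; run it as the user that HBase runs as — the 10,000 floor is the recommendation above, 32768 the value set in limits.conf):

```shell
# Print the open-file limit the current shell (and any process it spawns) gets.
current=$(ulimit -n)
echo "open file limit: $current"
# Warn if it is below the recommended floor; "unlimited" is also acceptable.
if [ "$current" != "unlimited" ] && [ "$current" -lt 10000 ]; then
  echo "WARNING: below recommended floor; check /etc/security/limits.conf and pam_limits"
fi
```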

Using dfs.datanode.max.xcievers with HBase

A Hadoop HDFS DataNode has an upper bound on the number of files that it can serve at any one time. The upper bound property is called dfs.datanode.max.xcievers (the property is spelled in the code exactly as shown here). Before loading, make sure you have configured the value for dfs.datanode.max.xcievers in the conf/hdfs-site.xml file to at least 4096 as shown below:

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>

Be sure to restart HDFS after changing the value for dfs.datanode.max.xcievers. If you don't change that value as described, strange failures can occur and an error message about exceeding the number of xcievers will be added to the DataNode logs. Other error messages about missing blocks are also logged, such as:

10/12/08 20:10:31 INFO hdfs.DFSClient: Could not obtain block blk_XXXXXXXXXXXXXXXXXXXXXX_YYYYYYYY from any node:
java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...

Starting HBase in Standalone Mode

By default, HBase ships configured for standalone mode. In this mode of operation, a single JVM hosts the HBase Master, an HBase Region Server, and a ZooKeeper quorum peer. In order to run HBase in standalone mode, you must install the HBase Master package:

Installing the HBase Master for Standalone Operation

To install the HBase Master on Ubuntu and other Debian systems:

$ sudo apt-get install hadoop-hbase-master

To install the HBase Master On Red Hat-compatible systems:

$ sudo yum install hadoop-hbase-master

To install the HBase Master on SUSE systems:

$ sudo zypper install hadoop-hbase-master

Starting the HBase Master

On Red Hat and SUSE systems (using .rpm packages) you can now start the HBase Master by using the included service script:

$ sudo /etc/init.d/hadoop-hbase-master start

On Ubuntu systems (using Debian packages) the HBase Master starts when the HBase package is installed.

To verify that the standalone installation is operational, visit http://localhost:60010. The list of Region Servers at the bottom of the page should include one entry for your local machine.

Note
Although you just started the master process, in standalone mode this same process is also internally running a region server and a ZooKeeper peer. In the next section, you will break out these components into separate JVMs.

If you see this message when you start the HBase standalone master:

Starting Hadoop HBase master daemon: starting master, logging to /usr/lib/hbase/logs/hbase-hbase-master/cloudera-vm.out
Couldnt start ZK at requested address of 2181, instead got: 2182.  Aborting. Why? Because clients (eg shell) wont be able to find this ZK quorum
hbase-master.

you will need to stop the hadoop-zookeeper-server or uninstall the hadoop-zookeeper-server package.

Accessing HBase by using the HBase Shell

After you have started the standalone installation, you can access the database by using the HBase Shell:

$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version: 0.89.20100621+17, r, Mon Jun 28 10:13:32 PDT 2010
 
hbase(main):001:0> status 'detailed'
version 0.89.20100621+17
0 regionsInTransition
1 live servers
    my-machine:59719 1277750189913
        requests=0, regions=2, usedHeap=24, maxHeap=995
        .META.,,1
            stores=2, storefiles=0, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0
        -ROOT-,,0
            stores=1, storefiles=1, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0
0 dead servers
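As a further smoke test, you can create, write to, and drop a throwaway table from the same shell session ('testtable', 'cf', and the row and value names below are examples, not part of the installation):

```
hbase(main):002:0> create 'testtable', 'cf'
hbase(main):003:0> put 'testtable', 'row1', 'cf:a', 'value1'
hbase(main):004:0> get 'testtable', 'row1'
hbase(main):005:0> disable 'testtable'
hbase(main):006:0> drop 'testtable'
```

If the get returns the value you put, reads and writes are working end to end.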

Using MapReduce with HBase

To run MapReduce jobs that use HBase, you need to add the HBase and Zookeeper JAR files to the Hadoop Java classpath. You can do this by adding the following statement to each job:

TableMapReduceUtil.addDependencyJars(job);

This distributes the JAR files to the cluster along with your job and adds them to the job's classpath, so that you do not need to edit the MapReduce configuration.

You can find more information about addDependencyJars in the TableMapReduceUtil API documentation.

When getting a Configuration object for an HBase MapReduce job, instantiate it using the HBaseConfiguration.create() method.
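An alternative to calling addDependencyJars is to put the HBase jars on the Hadoop classpath when launching the job (a sketch, assuming the /usr/bin/hbase wrapper script is installed and supports the `classpath` subcommand; if it is not installed, the block only prints a notice):

```shell
# Export the HBase jars into Hadoop's classpath for job submission.
if command -v hbase >/dev/null 2>&1; then
  export HADOOP_CLASSPATH="$(hbase classpath)"
  echo "HADOOP_CLASSPATH set from hbase classpath"
else
  echo "hbase wrapper not found; install the hadoop-hbase package first"
fi
```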

Configuring HBase in Pseudo-distributed Mode

Pseudo-distributed mode differs from standalone mode in that each of the component processes runs in a separate JVM.

Note
If the HBase master is already running in standalone mode, stop it by running /etc/init.d/hadoop-hbase-master stop before continuing with pseudo-distributed configuration.

Modifying the HBase Configuration

To enable pseudo-distributed mode, you must first make some configuration changes. Open /etc/hbase/conf/hbase-site.xml in your editor of choice, and insert the following XML properties between the <configuration> and </configuration> tags. Be sure to replace localhost with the host name of your HDFS Name Node if it is not running locally.

<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://localhost/hbase</value>
</property>

Creating the /hbase Directory in HDFS

Before starting the HBase Master, you need to create the /hbase directory in HDFS. The HBase master runs as hbase:hbase so it does not have the required permissions to create a top level directory.

To create the /hbase directory in HDFS:

$ sudo -u hdfs hadoop fs -mkdir /hbase
$ sudo -u hdfs hadoop fs -chown hbase /hbase
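You can then confirm the directory exists and is owned by hbase (a sketch, assuming a running HDFS and the hadoop client on this host; if the client is missing, the block only prints a notice):

```shell
# List the HDFS root and look for the /hbase entry and its owner.
if command -v hadoop >/dev/null 2>&1; then
  sudo -u hdfs hadoop fs -ls / | grep hbase || echo "/hbase not found"
else
  echo "hadoop client not installed"
fi
```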

Enabling Servers for Pseudo-distributed Operation

After you have configured HBase, you must enable the various servers that make up a distributed HBase cluster. HBase uses three required types of servers:

  • ZooKeeper Quorum Peers
  • HBase Master
  • HBase Region Server

Installing and Starting ZooKeeper Server

HBase uses ZooKeeper Server as a highly available, central location for cluster management. For example, it allows clients to locate the servers, and ensures that only one master is active at a time. For a small cluster, running a ZooKeeper node colocated with the NameNode is recommended. For larger clusters, contact Cloudera Support for configuration help.

Install and start the ZooKeeper Server in standalone mode by running the commands shown in the "Installing the ZooKeeper Server Package on a Single Server" section of ZooKeeper Installation.

Starting the HBase Master

After ZooKeeper is running, you can start the HBase master in standalone mode.

$ sudo /etc/init.d/hadoop-hbase-master start

Starting an HBase Region Server

The Region Server is the part of HBase that actually hosts data and processes requests. The region server typically runs on all of the slave nodes in a cluster, but not the master node.

To enable the HBase Region Server on Ubuntu and other Debian systems:

$ sudo apt-get install hadoop-hbase-regionserver

To enable the HBase Region Server On Red Hat-compatible systems:

$ sudo yum install hadoop-hbase-regionserver

To enable the HBase Region Server on SUSE systems:

$ sudo zypper install hadoop-hbase-regionserver

To start the Region Server:

$ sudo /etc/init.d/hadoop-hbase-regionserver start

Verifying the Pseudo-Distributed Operation

After you have started ZooKeeper, the Master, and a Region Server, the pseudo-distributed cluster should be up and running. You can verify that each of the daemons is running using the jps tool from the Oracle JDK. If you are running a pseudo-distributed HDFS installation and a pseudo-distributed HBase installation on one machine, jps shows the following output:

$ sudo jps
32694 Jps
30674 HRegionServer
29496 HMaster
28781 DataNode
28422 NameNode
30348 QuorumPeerMain

You should also be able to navigate to http://localhost:60010 and verify that the local region server has registered with the master.

Installing the HBase Thrift Server

The HBase Thrift Server is an alternative gateway for accessing the HBase server. Thrift mirrors most of the HBase client APIs while enabling popular programming languages to interact with HBase. The Thrift Server is multi-platform and performs better than REST in many situations. Thrift can be colocated with the region servers, but should not be colocated with the NameNode or the JobTracker. For more information about Thrift, visit http://incubator.apache.org/thrift/.

To enable the HBase Thrift Server on Ubuntu and other Debian systems:

$ sudo apt-get install hadoop-hbase-thrift

To enable the HBase Thrift Server On Red Hat-compatible systems:

$ sudo yum install hadoop-hbase-thrift

To enable the HBase Thrift Server on SUSE systems:

$ sudo zypper install hadoop-hbase-thrift

Deploying HBase in a Distributed Cluster

After you have HBase running in pseudo-distributed mode, the same configuration can be extended to running on a distributed cluster.

Choosing where to Deploy the Processes

For small clusters, Cloudera recommends designating one node in your cluster as the master node. On this node, you will typically run the HBase Master and a ZooKeeper quorum peer. These master processes may be collocated with the Hadoop NameNode and JobTracker for small clusters.

Designate the remaining nodes as slave nodes. On each node, Cloudera recommends running a Region Server, which may be collocated with a Hadoop TaskTracker and a DataNode. When collocating with TaskTrackers, be sure that the resources of the machine are not oversubscribed – it's safest to start with a small number of MapReduce slots and work up slowly.

Configuring for Distributed Operation

After you have decided which machines will run each process, you can edit the configuration so that the nodes may locate each other. In order to do so, you should make sure that the configuration files are synchronized across the cluster. Cloudera strongly recommends the use of a configuration management system to synchronize the configuration files, though you can use a simpler solution such as rsync to get started quickly.

The only configuration change necessary to move from pseudo-distributed operation to fully-distributed operation is the addition of the ZooKeeper Quorum address in hbase-site.xml. Insert the following XML property to configure the nodes with the address of the node where the ZooKeeper quorum peer is running:

<property>
  <name>hbase.zookeeper.quorum</name>
  <value>mymasternode</value>
</property>
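For larger clusters, hbase.zookeeper.quorum accepts a comma-separated list of hosts, and an odd number of quorum peers is typical so the ensemble can maintain a majority. For example (the host names below are placeholders):

```
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
</property>
```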

To start the cluster, start the services in the following order:

  1. The ZooKeeper Quorum Peer
  2. The HBase Master
  3. Each of the HBase Region Servers

After the cluster is fully started, you can view the HBase Master web interface on port 60010 and verify that each of the slave nodes has registered properly with the master.

Troubleshooting

The Cloudera packages of HBase have been configured to place logs in /var/log/hbase. While getting started, Cloudera recommends tailing these logs to note any error messages or failures.

Viewing the HBase Documentation

For additional HBase documentation, see http://archive.cloudera.com/cdh/3/hbase/.