Hadoop Big Data Development 1: Configuring a Distributed Hadoop Cluster



1. Add a user

adduser hadoop

 

Use a dedicated user to manage Hadoop; here the hadoop user is added.
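A sketch of the follow-up commands on each node (the password is set interactively; su switches to the new user for the remaining steps):

# passwd hadoop     (set a password for the new hadoop user)

# su - hadoop       (switch to the hadoop user for the remaining steps)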

 

2. Modify /etc/hosts

Add the hostname and IP address of every machine in the cluster to /etc/hosts.

 

127.0.0.1               localhost.localdomain localhost

192.168.80.129 hadoop1

192.168.80.131 hadoop2

 

(See the appendix for how to change the hostname.)

3. Install the JDK

Add the environment variables to /etc/profile.

 

Configuration:

export JAVA_HOME=/usr/java/jdk1.6.0_31/

export CLASSPATH=.:$JAVA_HOME/lib:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

export PATH=$PATH:$JAVA_HOME/bin
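After editing /etc/profile, a quick sanity check; a sketch, assuming the JDK path used above:

$ source /etc/profile

$ echo $JAVA_HOME

$ java -version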

 

4. Set up SSH

 

1. cd ~/.ssh     (enter the hidden .ssh directory under the user's home directory)

2. ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa    (generate a DSA key pair)

3. cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys    (append id_dsa.pub to authorized_keys; once this is done, ssh localhost should log in to the local machine without a password)

4. scp authorized_keys hadoop@hadoop_1:/home/hadoop/.ssh    (use scp to copy the authorized_keys file to the slave hadoop_1; repeat for every machine)

5. chmod 600 authorized_keys    (restrict the permissions of authorized_keys)

6. ssh hadoop_1    (you can now log in to hadoop_1 remotely without a password)
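On most distributions, steps 3 and 4 can also be collapsed into one command; a sketch, assuming the hadoop2 host from the /etc/hosts example above:

$ ssh-copy-id -i ~/.ssh/id_dsa.pub hadoop@hadoop2    (appends the local public key to hadoop2's authorized_keys)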

 

 

********************************

********************************

Setup passphraseless ssh

Now check that you can ssh to the localhost without a passphrase:
$ ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

********************************

********************************

 

5. Set environment variables

 

Add the following environment variables in hadoop-env.sh:

HADOOP_COMMON_HOME

HADOOP_HDFS_HOME

HADOOP_MAPRED_HOME

YARN_HOME  (the same as  $HADOOP_MAPRED_HOME)

HADOOP_LOG_DIR / YARN_LOG_DIR

HADOOP_HEAPSIZE / YARN_HEAPSIZE

 

Configure them according to your actual paths, for example:

export HADOOP_HOME=/opt/hadoop

export HADOOP_COMMON_HOME=$HADOOP_HOME

export HADOOP_HDFS_HOME=$HADOOP_HOME

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
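The remaining variables from the list above can be set in the same way; a sketch with placeholder values for the log directory and heap sizes (adjust to your environment):

export HADOOP_MAPRED_HOME=$HADOOP_HOME

export YARN_HOME=$HADOOP_MAPRED_HOME

export HADOOP_LOG_DIR=/var/log/hadoop

export YARN_LOG_DIR=$HADOOP_LOG_DIR

export HADOOP_HEAPSIZE=1024

export YARN_HEAPSIZE=1024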

6. Hadoop configuration

1. Configure core-site.xml (in the $HADOOP_HOME/etc/hadoop directory)

 

 <property>

   <name>fs.trash.interval</name>

   <value>360</value>

   <description>Number of minutes between trash checkpoints.

   </description>

 </property>

 

 

 <property>

   <name>hadoop.tmp.dir</name>

   <value>/opt/dbrg/hadoop</value>

   <description>A base for other temporary directories.</description>

 </property>

 

<!-- Note: with multiple NameNodes (federation), core-site.xml does not need fs.defaultFS; just set the corresponding entries in hdfs-site.xml below. -->
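For a single-NameNode cluster, by contrast, fs.defaultFS would normally be set here; a minimal sketch, assuming the hadoop1 host from the /etc/hosts example above and port 9000:

 <property>

   <name>fs.defaultFS</name>

   <value>hdfs://hadoop1:9000</value>

 </property>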

 

2. Configure hdfs-site.xml

 

 

 <property>

   <name>dfs.federation.nameservices</name>

   <value>ns1,ns2</value>

 </property>

 

 

 <property>

   <name>dfs.namenode.rpc-address.ns1</name>

   <value>Myhost1:9000</value>  <!-- host:port only; the hdfs:// scheme is not needed here -->

 </property>

 

 

 <property>

   <name>dfs.namenode.http-address.ns1</name>

   <value>Myhost1:50070</value>

 </property>

 

 <property>

   <name>dfs.namenode.rpc-address.ns2</name>

   <value>Myhost2:9000</value>

 </property>

 <property>

   <name>dfs.namenode.http-address.ns2</name>

   <value>Myhost2:50070</value>

 </property>

 

<property>
<name>dfs.namenode.name.dir</name>
<value>/home/cluster-data</value>
</property>

 

 <property>

   <name>dfs.datanode.data.dir</name>

   <value>/home/yuling.sh/datanode</value>

 </property> 

 

3. Configure mapred-site.xml

<property>

    <name>mapreduce.framework.name</name>

<value>yarn</value>

<description>Execution framework set to Hadoop YARN.</description>

  </property>

 

<property>

    <name>mapreduce.map.memory.mb</name>

<value>1536</value>

<description>Larger resource limit for maps.</description>

  </property>

 

<property>

    <name>mapreduce.map.java.opts</name>

<value>-Xmx1024M</value>

<description> Larger heap-size for child jvms of maps.</description>

  </property>

 

 

<property>

    <name>mapreduce.reduce.memory.mb</name>

<value>3072</value>

<description>Larger resource limit for reduces.</description>

  </property>

 

<property>

   <name>mapreduce.reduce.java.opts</name>

<value>-Xmx2560M</value>

<description>Larger heap-size for child jvms of reduces.</description>

  </property>

 

<property>

   <name>mapreduce.task.io.sort.mb</name>

<value>512</value>

<description>Higher memory-limit while sorting data for efficiency.</description>

  </property>

 

<property>

    <name>mapreduce.task.io.sort.factor</name>

<value>100</value>

<description>More streams merged at once while sorting files.</description>

  </property>

 

<property>

   <name>mapreduce.reduce.shuffle.parallelcopies</name>

<value>50</value>

<description>Higher number of parallel copies run by reduces to fetch outputs from very large number of maps.</description>

  </property>

 

 

 

<property>

   <name>mapreduce.cluster.temp.dir</name>

   <value></value>

    <description>No description</description>

    <final>true</final>

  </property>

 

  <property>

   <name>mapreduce.cluster.local.dir</name>

   <value></value>

    <description>No description</description>

   <final>true</final>

  </property>

4. Configure yarn-site.xml

  <property>

   <name>yarn.resourcemanager.resource-tracker.address</name>

   <value>host:port</value>

    <description>host is the hostname of the resource manager and

    port is the port on which the NodeManagers contact the Resource Manager.

    </description>

  </property>

 

  <property>

    <name>yarn.resourcemanager.scheduler.address</name>

   <value>host:port</value>

    <description>host is the hostname of the resourcemanager and port is the port

    on which the Applications in the cluster talk to the Resource Manager.

    </description>

  </property>

 

  <property>

   <name>yarn.resourcemanager.scheduler.class</name>

   <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>

    <description>In case you do not want to use the default scheduler</description>

  </property>

 

  <property>

   <name>yarn.resourcemanager.address</name>

   <value>host:port</value>

    <description>the host is the hostname of the ResourceManager and the port is the port on

    which the clients can talk to the Resource Manager. </description>

  </property>

 

  <property>

   <name>yarn.nodemanager.local-dirs</name>

   <value></value>

    <description>the local directories used by the nodemanager</description>

  </property>

 

  <property>

   <name>yarn.nodemanager.address</name>

   <value>0.0.0.0:port</value>

    <description>the nodemanagers bind to this port</description>

  </property> 

 

  <property>

   <name>yarn.nodemanager.resource.memory-mb</name>

   <value>10240</value>

    <description>the amount of memory on the NodeManager in MB</description>

  </property>

 

  <property>

   <name>yarn.nodemanager.remote-app-log-dir</name>

   <value>/app-logs</value>

   <description>directory on hdfs where the application logs are moved to </description>

  </property>

 

   <property>

   <name>yarn.nodemanager.log-dirs</name>

   <value></value>

    <description>the directories used by Nodemanagers as log directories</description>

  </property>

 

  <property>

   <name>yarn.nodemanager.aux-services</name>

   <value>mapreduce.shuffle</value>

    <description>shuffle service that needs to be set for Map Reduce to run </description>

  </property>
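The value mapreduce.shuffle above follows the Hadoop 0.23.x documentation. In Hadoop 2.2 and later, the auxiliary service is named with an underscore; a sketch of what the equivalent configuration would look like on those versions:

  <property>

   <name>yarn.nodemanager.aux-services</name>

   <value>mapreduce_shuffle</value>

  </property>

  <property>

   <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>

   <value>org.apache.hadoop.mapred.ShuffleHandler</value>

  </property>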

5. Configure capacity-scheduler.xml

  <property>

   <name>yarn.scheduler.capacity.root.queues</name>

    <value>unfunded,default</value>

  </property>

 

  <property>

   <name>yarn.scheduler.capacity.root.capacity</name>

   <value>100</value>

  </property>

 

  <property>

   <name>yarn.scheduler.capacity.root.unfunded.capacity</name>

   <value>50</value>

  </property>

 

  <property>

   <name>yarn.scheduler.capacity.root.default.capacity</name>

   <value>50</value>

  </property>

6. Edit the etc/hadoop/slaves file

Edit the etc/hadoop/slaves file under the Hadoop directory on both namenodes and add the hostnames of the three slave machines, as shown below.
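A sketch of what the slaves file might contain, one hostname per line (the slave hostnames here are placeholders for your own machines):

slave1

slave2

slave3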

7. Starting and stopping the cluster

 Hadoop Startup

To start a Hadoop cluster you will need to start both the HDFS and YARN cluster.

Format a new distributed filesystem:

$ $HADOOP_PREFIX/bin/hdfs namenode -format <cluster_name>

Start the HDFS with the following command, run on the designated NameNode:

$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode

Run a script to start DataNodes on all slaves:

$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode

Start the YARN with the following command, run on the designated ResourceManager:

$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager

Run a script to start NodeManagers on all slaves:

$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager

Start a standalone WebAppProxy server. If multiple servers are used with load balancing it should be run on each of them:

$ $HADOOP_YARN_HOME/bin/yarn start proxyserver --config $HADOOP_CONF_DIR

Start the MapReduce JobHistory Server with the following command, run on the designated server:

$ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh start historyserver --config $HADOOP_CONF_DIR

Hadoop Shutdown

Stop the NameNode with the following command, run on the designated NameNode:

$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop namenode

Run a script to stop DataNodes on all slaves:

$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop datanode

Stop the ResourceManager with the following command, run on the designated ResourceManager:

$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop resourcemanager

Run a script to stop NodeManagers on all slaves:

$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop nodemanager

Stop the WebAppProxy server. If multiple servers are used with load balancing it should be run on each of them:

$ $HADOOP_YARN_HOME/bin/yarn stop proxyserver --config $HADOOP_CONF_DIR

Stop the MapReduce JobHistory Server with the following command, run on the designated server:

$ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh stop historyserver --config $HADOOP_CONF_DIR
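With passwordless ssh configured between the nodes as above, the per-daemon commands can also be replaced by the bundled cluster scripts; a sketch, run on the designated NameNode and ResourceManager respectively:

$ $HADOOP_PREFIX/sbin/start-dfs.sh

$ $HADOOP_YARN_HOME/sbin/start-yarn.sh

The matching stop-dfs.sh and stop-yarn.sh scripts shut the daemons down again.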

 

################################

 

Hadoop 0.23.x - similar to 2.X.X but missing NN HA (NameNode HA).

Hbase 0.94.x

Zookeeper 3.4.x

################################

Changing the default JDK in Linux

Suppose you have successfully installed jdk1.6.0_39 under /home/hadoop/jdk1.6.0_39 and configured the system environment variables, but running # java -version still shows jdk1.4.3. This is because your Linux system ships with a default JDK. To fix it, run:

1.

# cd /usr/bin

# ln -s -f /home/hadoop/jdk1.6.0_39/bin/java

# ln -s -f /home/hadoop/jdk1.6.0_39/bin/javac

 

2. Remove the old empty java environment

# rm -f /usr/bin/java

# rm -f /usr/bin/javac

# rm -f /etc/alternatives/java

# rm -f /etc/alternatives/javac

After finishing step 1, check java -version under /usr/bin. If it has already been switched over, step 2 is not necessary. At the command line, type:

# java -version

and you should see that the JDK version is now correct.
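On RHEL/CentOS-style systems the same switch can also be made with the alternatives tool rather than raw symlinks; a sketch, assuming the same JDK path as above:

# alternatives --install /usr/bin/java java /home/hadoop/jdk1.6.0_39/bin/java 100

# alternatives --config java    (pick the newly registered JDK from the menu)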

################################

Passwordless SSH authentication on Linux

The namenode acts as the client. To achieve passwordless public-key authentication when connecting to a datanode (the server side), a key pair (a public key and a private key) is generated on the namenode, and the public key is then copied to the datanode. When the namenode connects to the datanode over ssh, the datanode generates a random number, encrypts it with the namenode's public key, and sends it to the namenode. The namenode decrypts it with its private key and sends the decrypted value back; once the datanode confirms the value is correct, it allows the namenode to connect. That is the public-key authentication process, and no password has to be typed by hand. The key step is copying the namenode's public key to the datanodes.

(1) Generate a key pair on every machine

Run the following command on every node:

root@hadoop-namenode# ssh-keygen

Then just press Enter through all the prompts.

This generates a private key id_rsa and a public key id_rsa.pub in /root/.ssh/.

Copy the id_rsa.pub from the namenode to /root/.ssh/ on all datanodes.

(Note: the original text does not spell this out; it means first copying id_rsa.pub to authorized_keys, then copying authorized_keys to the other datanodes.)

root@hadoop-namenode# cp id_rsa.pub authorized_keys

(this authorized_keys file is the namenode's public key)

root@hadoop-namenode# chmod 644 authorized_keys

Use scp over SSH to copy the namenode's public-key file authorized_keys to the .ssh directory of every DataNode (.ssh initially has no authorized_keys; if one already exists you need to append rather than overwrite, which is covered below).

root@hadoop-namenode# scp authorized_keys <datanode-ip>:/root/.ssh

 

After this configuration the namenode can log in to every datanode without a password, which you can verify with the command "ssh <node-ip>".

* Configure each Datanode to log in to the Namenode without a password

(0) How it works

When the namenode connects to a datanode, the namenode is the client, so the namenode's public key has to be copied to the datanode. Conversely, if a datanode initiates a connection to the namenode, the datanode is the client, and the datanode's public key has to be appended to the authorized_keys on the namenode. (Since authorized_keys already exists on the namenode at this point, it is appended to rather than replaced.)

If, further, the datanodes need passwordless public-key authentication between themselves, they likewise have to append each other's public keys.

(1) Append the id_rsa.pub from each datanode to the namenode's authorized_keys

 

Run the following command on every datanode in turn:

root@hadoop-datanode1# scp id_rsa.pub <namenode-ip>:/root/.ssh/<datanode-ip>.id_rsa.pub

This copies the id_rsa.pub previously generated on the datanode into the .ssh directory on the namenode and renames it <datanode-ip>.id_rsa.pub, so that the public keys uploaded from the different datanodes can be told apart.

Once the copies are done, run the following command on the namenode to append each datanode's public key:

root@hadoop-namenode# cat <datanode-ip>.id_rsa.pub >> authorized_keys

With that, the namenode and datanodes can ssh to each other without passwords.

Note: the whole process only involves creating keys, copying public keys, and appending public-key contents; no configuration file is changed. In practice, public-key authentication is already enabled in /etc/ssh/sshd_config:

{

RSAAuthentication yes

PubkeyAuthentication yes

}
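If you do need to change sshd_config, reload the ssh daemon afterwards; a sketch for a SysV-style system (the exact service command may differ by distribution):

# service sshd restart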

################################

Fixing "XXX is not in the sudoers file"

When using sudo you get the message "xxx is not in the sudoers file. This incident will be reported.", where xxx is your username. It means your user is not allowed to use sudo; all you have to do is modify the /etc/sudoers file.

The fix is as follows:

1) Enter superuser mode: type "su -", enter the root password when prompted, and you are in superuser mode. (Of course, you can also log in as root directly.)

(Note the "-": this is different from plain su. "su" only switches to root while keeping the current user's environment variables; "su -" brings over root's environment as well, just like logging in as root.)

2) Add write permission to the file: run "chmod u+w /etc/sudoers".

3) Edit /etc/sudoers: run "gedit /etc/sudoers", find the line "root ALL=(ALL) ALL" and add "xxx ALL=(ALL) ALL" below it (where xxx is your username), then save and exit.

 

4) Remove the write permission again: run "chmod u-w /etc/sudoers".
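A safer way to make the same edit is visudo, which validates the syntax of /etc/sudoers before saving; a sketch, using the hadoop user created at the start of this article:

# visudo    (opens /etc/sudoers and checks the syntax on save)

hadoop ALL=(ALL) ALL    (the line to add below "root ALL=(ALL) ALL")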

 

################################

Changing the hostname in Linux

Modify it in both /etc/sysconfig/network and /etc/hosts.
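A sketch for a RHEL/CentOS-style system, using the hadoop1 host from the /etc/hosts example earlier (the change is picked up fully after a re-login or reboot):

# vi /etc/sysconfig/network    (set HOSTNAME=hadoop1)

# hostname hadoop1             (apply the new hostname to the current session)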