Spark Setup and Build

If you run into permission problems during installation:
After creating a directory, use chown and chgrp to set its owner and group to the current user.
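For example (a minimal sketch; the directory /usr/local/spark and the user name hadoop are placeholders, substitute your own):
sudo chown -R hadoop /usr/local/spark   #make the current user the owner
sudo chgrp -R hadoop /usr/local/spark   #and set the group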
Setup:
1. Enable root login
sudo -s
vim /etc/lightdm/lightdm.conf
Press i to edit and enter:
[SeatDefaults]
user-session=ubuntu
greeter-session=unity-greeter
greeter-show-manual-login=true
allow-guest=false
Press [Esc], then type :wq to save and quit.
Set the root password:
sudo passwd root
reboot

2. Install Java
mkdir /usr/lib/java
tar -zxvf jdk-7u79-linux-x64.tar.gz -C /usr/lib/java
vim /etc/profile
Press i and add:
export JAVA_HOME=/usr/lib/java/jdk1.7.0_79
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
Press [Esc], then type :wq to save and quit.
source /etc/profile
Configure the default JDK (not needed if this is the only JDK installed):
sudo update-alternatives --install /usr/bin/java java /usr/lib/java/jdk1.7.0_79/bin/java 300
sudo update-alternatives --install /usr/bin/javac javac /usr/lib/java/jdk1.7.0_79/bin/javac 300
sudo update-alternatives --install /usr/bin/jar jar /usr/lib/java/jdk1.7.0_79/bin/jar 300
sudo update-alternatives --install /usr/bin/javah javah /usr/lib/java/jdk1.7.0_79/bin/javah 300
sudo update-alternatives --install /usr/bin/javap javap /usr/lib/java/jdk1.7.0_79/bin/javap 300
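If more than one JDK is registered, you can pick the default interactively with:
sudo update-alternatives --config java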
Run java -version to verify.

3. Install Scala
mkdir /usr/lib/scala
tar zxvf scala-2.10.5.tgz -C /usr/lib/scala
vim /etc/profile
Press i and add:
export SCALA_HOME=/usr/lib/scala/scala-2.10.5
export PATH=${SCALA_HOME}/bin:$PATH
Press [Esc], then type :wq to save and quit.
source /etc/profile
Run scala -version to verify.

4. Install Hadoop
For details see: http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-common/SingleCluster.html
apt-get install ssh
apt-get install rsync
It is best to create a dedicated hadoop group and user (this step is optional):
addgroup hadoop
adduser --ingroup hadoop hadoop
Open the sudoers file:
gedit /etc/sudoers
Under the line # User privilege specification, add:
hadoop ALL=(ALL:ALL) ALL
This line gives the hadoop user root privileges via sudo.
Once the user is created, log in as hadoop and run the following:
apt-get install openssh-server
/etc/init.d/ssh start
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Test: ssh localhost
Press Ctrl+D to exit.

mkdir /usr/local/hadoop
tar -zxvf hadoop-2.6.0.tar.gz -C /usr/local/hadoop
cd /usr/local/hadoop/hadoop-2.6.0/etc/hadoop
vim hadoop-env.sh
Press i and add:
export JAVA_HOME=/usr/lib/java/jdk1.7.0_79
Press [Esc], then type :wq to save and quit.
source hadoop-env.sh
Then add Hadoop's bin directory to PATH in /etc/profile, for example:
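A minimal sketch of the lines to add (the install path matches the one used above; step 5 below shows the full set of variables):
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.6.0
export PATH=${HADOOP_HOME}/bin:$PATH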

Test:
  • Format the filesystem:
      $ bin/hdfs namenode -format
  • Start NameNode daemon and DataNode daemon:
      $ sbin/start-dfs.sh

    The hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).

  • Browse the web interface for the NameNode; by default it is available at:
    • NameNode - http://localhost:50070/
  • Make the HDFS directories required to execute MapReduce jobs:
      $ bin/hdfs dfs -mkdir /user
      $ bin/hdfs dfs -mkdir /user/<username>
  • Copy the input files into the distributed filesystem:
      $ bin/hdfs dfs -put etc/hadoop input
  • Run some of the examples provided:
      $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output 'dfs[a-z.]+'
  • Examine the output files:

    Copy the output files from the distributed filesystem to the local filesystem and examine them:

      $ bin/hdfs dfs -get output output
      $ cat output/*

    or

    View the output files on the distributed filesystem:

      $ bin/hdfs dfs -cat output/*
  • When you're done, stop the daemons with:
      $ sbin/stop-dfs.sh



5. Configure pseudo-distributed Hadoop and YARN (files to modify: core-site.xml, mapred-site.xml, hdfs-site.xml, yarn-site.xml; hadoop-env.sh and yarn-env.sh are usually already configured)
vim /etc/profile
Add:
export JAVA_HOME=/usr/lib/java/jdk1.7.0_79
export HADOOP_PREFIX=/usr/local/hadoop/hadoop-2.6.0
export HADOOP_HOME=${HADOOP_PREFIX}
#sometimes the following is also needed
#export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${HADOOP_PREFIX}/bin:${SCALA_HOME}/bin:${JAVA_HOME}/bin:$PATH
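Reload the profile so the new variables take effect:
source /etc/profile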

cd /usr/local/hadoop/hadoop-2.6.0/etc/hadoop
#hdfs-site.xml and core-site.xml configure the pseudo-distributed HDFS
vim hdfs-site.xml
Add between <configuration> and </configuration>:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
vim core-site.xml
Add between <configuration> and </configuration>:
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/hadoop-2.6.0/tmp</value>
</property>
#yarn-site.xml and mapred-site.xml configure YARN
vim yarn-site.xml
Add between <configuration> and </configuration>:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

cp mapred-site.xml.template mapred-site.xml
vim mapred-site.xml
Add between <configuration> and </configuration>:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

If you get the error "JAVA_HOME is not set and could not be found.", set JAVA_HOME in hadoop-env.sh and yarn-env.sh, as shown below.
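For example, add the same JDK path used above to both files under /usr/local/hadoop/hadoop-2.6.0/etc/hadoop:
#in hadoop-env.sh and yarn-env.sh
export JAVA_HOME=/usr/lib/java/jdk1.7.0_79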

Format the namenode first:
hdfs namenode -format
Then start the daemons with:
/usr/local/hadoop/hadoop-2.6.0/sbin/start-dfs.sh
/usr/local/hadoop/hadoop-2.6.0/sbin/start-yarn.sh
or simply:
/usr/local/hadoop/hadoop-2.6.0/sbin/start-all.sh
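To check that everything came up, jps should list the HDFS and YARN daemons:
jps   #expect NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager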
Test it:
  1. Start ResourceManager daemon and NodeManager daemon:
      $ sbin/start-yarn.sh
  2. Browse the web interface for the ResourceManager; by default it is available at:
    • ResourceManager - http://localhost:8088/
  3. Run a MapReduce job.
  4. When you're done, stop the daemons with:
      $ sbin/stop-yarn.sh
Hadoop web UI ports: 50075 is the DataNode entry point (log files), 50070 is the cluster file system (NameNode) entry point for browsing files, 8088 is the port for viewing jobs, and 9000 (configured in core-site.xml) is the port the API uses to transfer data.

6. Configure Spark
mkdir /usr/local/spark
tar zxvf spark-1.4.1-bin-hadoop2.6.tgz -C /usr/local/spark
Add the environment variables as follows:
export SPARK_HOME=/usr/local/spark/spark-1.4.1-bin-hadoop2.6
export PATH=${SPARK_HOME}/bin:$PATH
Next configure spark-env.sh.
Go into the conf directory; the following is for standalone mode:
cd conf
cp spark-env.sh.template spark-env.sh
vim spark-env.sh
Add at least the following:
#export SCALA_HOME=/usr/lib/scala/scala-2.10.5
export SPARK_MASTER_IP=localhost
export SPARK_WORKER_MEMORY=2G
export JAVA_HOME=/usr/lib/java/jdk1.7.0_79
Then configure slaves:
cp slaves.template slaves
vim slaves
Edit it and add localhost, so the master node serves as both Master and Worker (see below).
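With a single machine, the only worker entry needed in slaves is:
localhost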

Now you can run the example code directly:
cd /usr/local/spark/spark-1.4.1-bin-hadoop2.6
./bin/run-example org.apache.spark.examples.SparkPi

Typing spark-shell starts Spark's interactive shell.
To start the Spark standalone cluster, go into Spark's sbin directory and run start-all.sh, as shown below.
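For example (assuming SPARK_HOME is set as above):
cd ${SPARK_HOME}/sbin
./start-all.sh
jps   #a Master and a Worker process should now appear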

Spark UI address: port 8080, for monitoring jobs.

7. Install IDEA
mkdir /usr/local/intellij
tar -zxvf ideaIC-14.1.4.tar.gz -C /usr/local/intellij
#export DISPLAY=:0.0 #this line is not always needed
#after opening IntelliJ, set the theme under File->Settings
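To launch IDEA from a terminal (the extracted directory name includes the build number, so the wildcard below is an assumption; adjust it to the actual folder):
/usr/local/intellij/idea-IC-*/bin/idea.sh &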

8. Test:
Start Hadoop from its sbin directory and Spark from its sbin directory (using the start scripts from steps 5 and 6).
Find the README.txt file in the Hadoop directory and upload it:
hadoop fs -mkdir /input
hadoop fs -copyFromLocal README.txt /input
Open the shell with spark-shell:
val file = sc.textFile("hdfs://localhost:9000/input/README.txt")   //load the file from HDFS as an RDD of lines
val sparks = file.filter(line=>line.contains("hadoop"))   //keep only the lines containing "hadoop"
sparks.count   //count the matching lines

9. Running a jar on Spark
The general spark-submit format:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]

The simplest local-mode example:
spark-submit --class SparkPi --master local[2] /root/IdeaProjects/SparkPi/out/artifacts/SparkPi/SparkPi.jar

You can start Spark and then run the following command to test:
$SPARK_HOME/bin/spark-submit --master spark://localhost:7077 --class org.apache.spark.examples.SparkPi --total-executor-cores 2 --executor-memory 500m $SPARK_HOME/lib/spark-examples-1.4.1-hadoop2.6.0.jar 2


