Installing GraphFrames on Spark

Environment
Java: 1.8
CentOS: 6
Spark: 2.1.0
GraphFrames: 0.5


1. Install and test GraphFrames (as root)
a. Download the latest GraphFrames jar into Spark's python/lib directory
cd /usr/hdp/2.6.0.3-8/spark2/python/lib
wget http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.5.0-spark2.1-s_2.11/graphframes-0.5.0-spark2.1-s_2.11.jar


b. Edit /etc/profile and append the line below. The jar contains the graphframes Python sources, so Python can import it straight from PYTHONPATH; export the variable so child processes inherit it:
export PYTHONPATH=/usr/hdp/2.6.0.3-8/spark2/python/lib/graphframes-0.5.0-spark2.1-s_2.11.jar:$PYTHONPATH


c. Run the following command to apply the change
source /etc/profile
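
To confirm the setting took effect, open a new shell and check that the jar is on PYTHONPATH (the command should print the path added above):
echo $PYTHONPATH | tr ':' '\n' | grep graphframes

Note: importing pyspark itself from a plain Python or Jupyter session also needs $SPARK_HOME/python and the bundled py4j zip on PYTHONPATH; this guide assumes the HDP installation already provides that.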


d. Install GraphFrames (spark-shell resolves the package and its dependencies via Ivy)
spark-shell --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11


This leaves five jar files in ~/.ivy2/jars/: graphframes itself plus its dependencies.
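
To verify, count the resolved jars (the exact file names depend on the dependency versions Ivy picked):
ls ~/.ivy2/jars/*.jar | wc -l    # expect 5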


e. Copy the five jars into a sharelib subdirectory of the Spark home
mkdir /usr/hdp/2.6.0.3-8/spark2/sharelib
cp -r ~/.ivy2/jars/*.jar /usr/hdp/2.6.0.3-8/spark2/sharelib/


f. Copy the sharelib directory to every node in the cluster
scp -r /usr/hdp/2.6.0.3-8/spark2/sharelib/ cloud102:/usr/hdp/2.6.0.3-8/spark2/sharelib
scp -r /usr/hdp/2.6.0.3-8/spark2/sharelib/ cloud103:/usr/hdp/2.6.0.3-8/spark2/sharelib
...
scp -r /usr/hdp/2.6.0.3-8/spark2/sharelib/ cloud10*:/usr/hdp/2.6.0.3-8/spark2/sharelib


g. Edit spark-env.sh
vim /usr/hdp/2.6.0.3-8/spark2/conf/spark-env.sh


Add the following, which builds a colon-separated classpath from all jars in sharelib:
# Build a colon-separated classpath from every jar in sharelib
jarsPath=.
for i in "$SPARK_HOME"/sharelib/*.jar; do
  jarsPath=$i:$jarsPath
done
# Strip the trailing ":." seed value
jarsPath=${jarsPath%:.}
export SPARK_CLASSPATH=$jarsPath
echo $SPARK_CLASSPATH


This lets Spark find the GraphFrames classes.
If jobs are launched through Oozie, spark-env.sh must also be copied to every node.


2. Start Jupyter and test
(Jupyter's built-in New Terminal feature opens a shell as the user who launched the server, so for safety we run it under a dedicated account.)


a. Create a jupyter group and a jupyter user
groupadd jupyter
useradd -g jupyter jupyter
Running the server as the jupyter user rather than root limits the damage if it is compromised.


b. Create jupyter's home directory on HDFS and set its ownership and permissions
hdfs dfs -mkdir /user/jupyter
hdfs dfs -chown -R jupyter:jupyter /user/jupyter
hdfs dfs -chmod 755 /user/jupyter


c. Start Jupyter as the jupyter user
nohup jupyter-notebook --notebook-dir=./ipython-code/  --no-browser --ip='172.16.11.92' --port=8889  >jupyter_nohup.log 2>&1 &


d. Log in to Jupyter by opening the server address in a browser, using the --ip and --port values from the previous step, e.g.
http://172.16.11.92:8889/
If a token is requested, look it up on the server with jupyter notebook list.


e. Create a new Python notebook in Jupyter and run:
from graphframes import *
If no error is raised, graphframes was imported successfully.


f. Complete example
from pyspark import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SQLContext
from graphframes import *


conf = SparkConf().setAppName("jupyter_xuyufei").setMaster("yarn").set("spark.submit.deployMode", "client")
# conf.set("spark.executor.instances", "6").set("spark.executor.cores", "1").set("spark.executor.memory", "3g").set("spark.driver.memory", "1g")


sc=SparkContext(conf=conf)


sqlContext = SQLContext(sc)
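# Note: in Spark 2.x, SparkSession is the preferred entry point; SQLContext
# still works and is used here.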


v = sqlContext.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
], ["id", "name", "age"])


e = sqlContext.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
], ["src", "dst", "relationship"])


g = GraphFrame(v, e)
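# GraphFrame requires an "id" column in the vertex DataFrame and "src"/"dst"
# columns in the edge DataFrame, as defined above.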


g.inDegrees.show()
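# For this graph the in-degree table is (row order may vary):
# +---+--------+
# | id|inDegree|
# +---+--------+
# |  b|       2|
# |  c|       1|
# +---+--------+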


# Label propagation (LPA) for community detection; convergence is not
# guaranteed, so the labels may differ between runs
result = g.labelPropagation(maxIter=2)


result.select("id", "label").show()


3. Test with spark-submit
Save the complete example above as demo.py and run:
spark-submit --master yarn --deploy-mode client --num-executors 6 --driver-memory 1g --executor-memory 1g --executor-cores 1 demo.py
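
If SPARK_CLASSPATH was not configured on the cluster (step 1g), a variant is to let spark-submit resolve the package at launch time, assuming the node can reach the package repository:
spark-submit --master yarn --deploy-mode client --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11 demo.py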
