Using pyspark with Jupyter Notebook


IPython Notebook, now renamed Jupyter Notebook, is an interactive notebook: a web application for creating and sharing documents that combine live code, equations, visualizations, and explanatory text. Spark ships with a Python shell, pyspark, and through Jupyter Notebook you can work with pyspark in this much more interactive, notebook-style way.

Environment

  • CentOS 6.5
  • Python 3.5.2 (installed via Anaconda 4.2.0)
  • spark-2.0.0-bin-hadoop2.7
  • Hive 1.2.1 (on an HDP 2.5 cluster)

Before Spark 2.0, the launch command was IPYTHON_OPTS="notebook --ip=1.2.3.4" pyspark.
IPython must be installed beforehand; installing it through Anaconda is recommended.
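
If Jupyter is not installed yet, one way to get it through Anaconda is sketched below (assuming a working conda installation; the jupyter package pulls in IPython and the notebook frontend):

conda install jupyter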

[root@master ~]# IPYTHON_OPTS="notebook --ip=1.2.3.4" pyspark
SPARK_MAJOR_VERSION is set to 1, using Spark
[TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated and will be removed in future versions.
[TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook` in the future
[I 22:28:23.204 NotebookApp] [nb_conda_kernels] enabled, 2 kernels found
[I 22:28:23.269 NotebookApp] ✓ nbpresent HTML export ENABLED
[W 22:28:23.269 NotebookApp] ✗ nbpresent PDF export DISABLED: No module named nbbrowserpdf.exporters.pdf
[I 22:28:23.271 NotebookApp] [nb_conda] enabled
[I 22:28:23.333 NotebookApp] [nb_anacondacloud] enabled
[I 22:28:23.337 NotebookApp] Serving notebooks from local directory: /root
[I 22:28:23.337 NotebookApp] 0 active kernels
[I 22:28:23.337 NotebookApp] The Jupyter Notebook is running at: http://1.2.3.4:8888/
[I 22:28:23.337 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 22:28:23.337 NotebookApp] No web browser found: could not locate runnable browser.
[I 22:28:34.702 NotebookApp] 302 GET / (172.31.64.222) 0.86ms
[I 22:32:28.357 NotebookApp] Creating new notebook in /Documents
[I 22:32:28.382 NotebookApp] Writing notebook-signing key to /root/.local/share/jupyter/notebook_secret
[I 22:32:36.049 NotebookApp] Kernel started: 4d304d11-f29f-456e-a9c2-c7dc30204cfd

With Spark 2.0 and later, running the same command fails with an error:

[xdwang@dell bin]$ IPYTHON_OPTS="notebook --ip=211.71.76.25" ./pyspark
Error in pyspark startup:
IPYTHON and IPYTHON_OPTS are removed in Spark 2.0+. Remove these from the environment and set PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS instead.

Instead, add the environment variables to .bashrc:

vi .bashrc
Add the following two lines:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=211.71.76.25"
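
To make the new variables visible in the current shell, reload the file before launching pyspark (a minimal sketch, assuming the default per-user ~/.bashrc):

source ~/.bashrc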

With that in place, relaunching pyspark now starts the notebook server:

[xdwang@dell ~]$ pyspark
[I 14:33:18.032 NotebookApp] [nb_conda_kernels] enabled, 2 kernels found
[I 14:33:18.045 NotebookApp] Writing notebook server cookie secret to /home/xdwang/.local/share/jupyter/runtime/notebook_cookie_secret
[I 14:33:18.491 NotebookApp] ✓ nbpresent HTML export ENABLED
[W 14:33:18.491 NotebookApp] ✗ nbpresent PDF export DISABLED: No module named 'nbbrowserpdf'
[I 14:33:18.922 NotebookApp] [nb_anacondacloud] enabled
[I 14:33:18.933 NotebookApp] [nb_conda] enabled
[I 14:33:18.962 NotebookApp] Serving notebooks from local directory: /home/xdwang
[I 14:33:18.962 NotebookApp] 0 active kernels
[I 14:33:18.962 NotebookApp] The Jupyter Notebook is running at: http://211.71.76.25:8888/
[I 14:33:18.963 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 14:33:18.964 NotebookApp] No web browser found: could not locate runnable browser.
[I 14:33:44.921 NotebookApp] 302 GET / (202.205.97.62) 1.95ms
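
Opening http://211.71.76.25:8888/ and creating a new notebook now gives a kernel with the pyspark driver attached. As a quick sanity check (a minimal sketch; the pyspark shell pre-creates the SparkContext sc and, in Spark 2.0, the SparkSession spark), you could run in a cell:

# Distribute a small range of numbers and sum them on the cluster.
sc.parallelize(range(100)).sum()   # expected result: 4950

# Use the Spark 2.0 SparkSession entry point.
spark.range(10).count()            # expected result: 10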

Adding Hive support to Spark SQL

According to the official Spark documentation, three changes are needed:

  1. Copy the three configuration files core-site.xml, hdfs-site.xml, and
    hive-site.xml into the conf directory under the Spark home directory. On HDP 2.5 these files are located at /etc/hadoop/2.5.0.0-1245/0/core-site.xml, /etc/hadoop/2.5.0.0-1245/0/hdfs-site.xml, and /etc/hive/2.5.0.0-1245/0/conf.server/hive-site.xml; a copy-command sketch follows right below.
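
A minimal sketch of that copy step, assuming Spark is unpacked at /opt/spark-2.0.0-bin-hadoop2.7 (adjust the destination to your own Spark home):

cp /etc/hadoop/2.5.0.0-1245/0/core-site.xml /opt/spark-2.0.0-bin-hadoop2.7/conf/
cp /etc/hadoop/2.5.0.0-1245/0/hdfs-site.xml /opt/spark-2.0.0-bin-hadoop2.7/conf/
cp /etc/hive/2.5.0.0-1245/0/conf.server/hive-site.xml /opt/spark-2.0.0-bin-hadoop2.7/conf/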

  2. Add the following properties to the spark-defaults.conf file:

spark.sql.hive.metastore.version        1.2.1
spark.sql.hive.metastore.jars           maven
spark.sql.hive.metastore.sharedPrefixes com.mysql.jdbc
  3. Specify spark.sql.warehouse.dir when creating the SparkSession; a SparkSession configuration sketch follows below.
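
A minimal sketch of such a SparkSession configuration in a notebook cell (the application name and the warehouse path /user/hive/warehouse are assumptions; adapt them to your cluster):

from pyspark.sql import SparkSession

# Build a session with Hive support; spark.sql.warehouse.dir points at the
# location that Hive-managed tables should use.
spark = SparkSession.builder \
    .appName("pyspark-notebook-hive") \
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse") \
    .enableHiveSupport() \
    .getOrCreate()

# Quick check that the Hive metastore is reachable.
spark.sql("show databases").show()

Note that when pyspark has already created a session for the notebook, getOrCreate() returns that existing session, so settings such as the warehouse directory may be better placed in spark-defaults.conf as well.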