Win7: error running a local Spark program file: JavaSparkContext. : java.lang.NullPointerException


System: Windows 7 x64

Spark version: spark-1.3.0-bin-hadoop2.4


I wrote a locally-executed Spark script named "SimpleApp.py" with the following contents:

""SimpleApp.py"""from pyspark import SparkContextlogFile = "D:/ProgramFiles/spark-1.3.0-bin-hadoop2.4/README.md"  # 这里是Spark解压目录也是spark的主目录下的README.md文件sc = SparkContext("local", "Simple App")logData = sc.textFile(logFile).cache()numAs = logData.filter(lambda s: 'a' in s).count()numBs = logData.filter(lambda s: 'b' in s).count()print("Lines with a: %i, lines with b: %i" % (numAs, numBs))</span>

Run the following from the Spark directory:

bin\spark-submit --master local[4] D:\Files\Python\SimpleApp.py

Submitting fails. The tail of the error output looks like this:

[Screenshot: stack trace ending in "JavaSparkContext. : java.lang.NullPointerException"]
But this is not actually the main error. Scroll up a little and you will see this message:

[Screenshot: the real error, complaining that winutils.exe cannot be found]
This is where things really go wrong. It is not a problem with the code, but with running the pre-built Spark package without a Hadoop installation. Below I quote JamCon's answer on Stack Overflow to the question

submit .py script on Spark without Hadoop installation

which solves the problem. Question link: http://stackoverflow.com/questions/29746395/submit-py-script-on-spark-without-hadoop-installation.

The good news is you're not doing anything wrong, and your code will run after the error is mitigated.

Despite the statement that Spark will run on Windows without Hadoop, it still looks for some Hadoop components. The bug has a JIRA ticket (SPARK-2356), and a patch is available. As of Spark 1.3.1, the patch hasn't been committed to the main branch yet.

Fortunately, there's a fairly easy work around.

  1. Create a bin directory for winutils under your Spark installation directory. In my case, Spark is installed in D:\Languages\Spark, so I created the following path: D:\Languages\Spark\winutils\bin

  2. Download the winutils.exe from Hortonworks and put it into the bin directory created in the first step. Download link for Win64: http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe

  3. Create a "HADOOP_HOME" environment variable that points to the winutils directory (not the bin subdirectory). You can do this in a couple of ways:

    • a. Establish a permanent environment variable via the Control Panel -> System -> Advanced System Settings -> Advanced Tab -> Environment variables. You can create either a user variable or a system variable with the following parameters:

      Variable Name = HADOOP_HOME
      Variable Value = D:\Languages\Spark\winutils\

    • b. Set a temporary environment variable inside your command shell before executing your script

      set HADOOP_HOME=d:\Languages\Spark\winutils

  4. Run your code. It should work without error now.
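
For convenience, steps 1 and 2 above can also be scripted. Here is a minimal Python sketch: the download URL is the Hortonworks link from step 2, while SPARK_HOME follows this article's setup and should be adjusted to wherever you unpacked Spark. Step 3 still has to be done by hand:

import os
import urllib.request  # Python 3; on Python 2, use urllib.urlretrieve instead

# Adjust to the directory you unpacked Spark into.
SPARK_HOME = r"D:\ProgramFiles\spark-1.3.0-bin-hadoop2.4"
WINUTILS_DIR = os.path.join(SPARK_HOME, "winutils")
WINUTILS_BIN = os.path.join(WINUTILS_DIR, "bin")
WINUTILS_URL = "http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe"

# Step 1: create <Spark home>\winutils\bin
if not os.path.isdir(WINUTILS_BIN):
    os.makedirs(WINUTILS_BIN)

# Step 2: download winutils.exe into the bin folder
exe_path = os.path.join(WINUTILS_BIN, "winutils.exe")
if not os.path.isfile(exe_path):
    urllib.request.urlretrieve(WINUTILS_URL, exe_path)

# Step 3 still has to be done in the shell (or via the Control Panel):
print("Now run:  set HADOOP_HOME=" + WINUTILS_DIR)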


In short: even the Hadoop-free Spark package still goes looking for %HADOOP_HOME%\bin\winutils.exe at submit time, and with only Spark installed, never mind that file, not even the winutils path exists. So the fix is: create a winutils folder, create a bin folder inside it, download the 64-bit winutils.exe into bin, set the path of the winutils folder (note: NOT the bin subfolder) as the HADOOP_HOME environment variable, and re-submit the application. Problem solved.
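
Before re-submitting, it is worth double-checking the layout, since pointing HADOOP_HOME at the bin subfolder instead of the winutils folder is an easy mistake to make. A quick sanity check (this helper is my own, not part of Spark; it only verifies the directory layout described above):

import os

hadoop_home = os.environ.get("HADOOP_HOME", "")
exe_path = os.path.join(hadoop_home, "bin", "winutils.exe")

if hadoop_home and os.path.isfile(exe_path):
    print("OK: found " + exe_path)
else:
    # Spark expects %HADOOP_HOME%\bin\winutils.exe, so HADOOP_HOME must
    # point at the winutils folder itself, not at its bin subfolder.
    print("Problem: HADOOP_HOME=%r does not contain bin\\winutils.exe" % hadoop_home)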

After the fix, the output looks like this:

[Screenshot: the script completes and prints "Lines with a: …, lines with b: …"]