Spark throwing com.esotericsoftware.kryo.KryoException when running the FP-growth algorithm

Source: Internet · Editor: 程序博客网 · Date: 2024/05/21 18:28

Spark FP-growth error report

On Spark 1.4, running the frequent itemset mining algorithm with the official Python example code copied verbatim throws the following error: com.esotericsoftware.kryo.KryoException (java.lang.IllegalArgumentException: Can not set final scala.collection.mutable.ListBuffer field org.apache.spark.mllib.fpm.FPTree$Summary.nodes to scala.collection.mutable.ArrayBuffer)

There is not much material online about this PySpark issue; in the end, two valuable posts turned up.
First, a report of a similar error on the Apache site: https://www.baidu.com/link?url=pChluzupXXYP1bZZGcYLu63HP2GKNFopp-XQnptShCGcDOhswe-7HSmm54rpjXeU7Mh00E0nTtqo7S8xtofy6_&wd=&eqid=e3291c23000d56de0000000556b3fa68

Second, a Stack Overflow answer: http://stackoverflow.com/questions/32126007/fpgrowth-algorithm-in-spark/32820883

Solution

The second post points out that the error is likely caused by serialization: the Kryo serializer that Spark uses is faster than JavaSerializer, but it triggers this bug on Spark 1.4. The fix is therefore to swap the serializer, either in spark-defaults.conf or directly at runtime. Here is the working script, testfp.py:

from pyspark import SparkContext, SparkConf
from pyspark.mllib.fpm import FPGrowth

if __name__ == "__main__":
    # Replace Kryo with JavaSerializer to work around the FPTree bug on Spark 1.4
    conf = SparkConf().setAppName("pythonFP").set(
        "spark.serializer", "org.apache.spark.serializer.JavaSerializer")
    sc = SparkContext(conf=conf)
    data = sc.textFile("data/mllib/sample_fpgrowth.txt")
    transactions = data.map(lambda line: line.strip().split(' '))
    model = FPGrowth.train(transactions, minSupport=0.5, numPartitions=10)
    result = model.freqItemsets().collect()
    for fi in result:
        print(fi)
    sc.stop()
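Alternatively, the same switch can be made permanent in spark-defaults.conf (a sketch; the file normally lives under $SPARK_HOME/conf, and entries are whitespace-separated key/value pairs):

```
spark.serializer        org.apache.spark.serializer.JavaSerializer
```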

The key point is redefining the serializer class in the SparkConf; after that, the script runs normally when launched with Spark's submit command (PS: upgrading to the latest version of Spark also resolves this problem):

spark-submit --master=spark://namenode1-sit..com:7077,namenode2-sit..com:7077 testfp.py
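To illustrate what the script computes, here is a minimal pure-Python sketch of frequent-itemset mining. It uses naive subset enumeration rather than the FP-tree algorithm, so it is only suitable for tiny inputs; the transactions below mirror Spark's data/mllib/sample_fpgrowth.txt, and minSupport matches the 0.5 used above.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Naive frequent-itemset mining: count every subset of each
    transaction and keep those meeting the support threshold."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))  # dedupe and order so subsets match up
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    return {itemset: c for itemset, c in counts.items()
            if c / n >= min_support}

# Same six transactions as sample_fpgrowth.txt
transactions = [
    ["r", "z", "h", "k", "p"],
    ["z", "y", "x", "w", "v", "u", "t", "s"],
    ["s", "x", "o", "n", "r"],
    ["x", "z", "y", "m", "t", "s", "q", "e"],
    ["z"],
    ["x", "z", "y", "r", "q", "t", "p"],
]
result = frequent_itemsets(transactions, 0.5)
for itemset, count in sorted(result.items()):
    print(itemset, count)
# ('z',) appears in 5 of 6 transactions, so it is frequent at support 0.5
```

FPGrowth.train returns the same itemset/frequency pairs, but builds a compact FP-tree instead of enumerating all subsets, which is what makes it scale to large transaction sets.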
