pyspark + mongodb


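I: Spark

1. How to debug:

1. Run the job in local mode first, then submit it to the server; this filters out simple Python syntax errors.

2. A local run can also pull data down from HDFS, but it cannot perform operations such as saveAsTextFile to HDFS.

3. Use yarn-client for debugging: the driver runs on the submitting machine, so detailed error messages like the one below are printed to the console. yarn-cluster does not print them there, because the driver runs inside the cluster and its output goes to the YARN application logs.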

Traceback (most recent call last):
  File "/home/work/work/test/stability_analysis.py", line 473, in <module>
    stability_analysis()
  File "/home/work/work/test/stability_analysis.py", line 464, in stability_analysis
    number.reduced_then_store_into_file()
  File "/home/work/work/test/stability_analysis.py", line 248, in reduced_then_store_into_file
    rdd_formatter.saveAsTextFile(NumberAbnormalReboot.number_output)
  File "/home/work/tars/infra-client-1.1/bin/current/c3prc-hadoop-spark-pack/python/lib/pyspark.zip/pyspark/rdd.py", line 1506, in saveAsTextFile
  File "/home/work/tars/infra-client-1.1/bin/current/c3prc-hadoop-spark-pack/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/home/work/tars/infra-client-1.1/bin/current/c3prc-hadoop-spark-pack/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o152.saveAsTextFile.
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://c3prc-hadoop/user/s_miui_whetstone/Statistics/Stability/number already exists
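
Points 1 and 3 can be combined by letting the script choose its master at start-up. The following is only a minimal sketch under assumptions: the helper names pick_master and make_context and the default app name are made up for illustration, and "yarn-client" is the older Spark 1.x master string used in these notes (newer releases spell it --master yarn --deploy-mode client).

```python
def pick_master(argv):
    """Return the master given as the first script argument, or local mode by default."""
    return argv[1] if len(argv) > 1 else "local[*]"


def make_context(master, app_name="stability_analysis"):
    """Create a SparkContext for the chosen master (app_name is a placeholder)."""
    # Imported lazily so pick_master() also works where pyspark is not installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster(master).setAppName(app_name)
    return SparkContext(conf=conf)


# Typical use inside the job script:
#   import sys
#   sc = make_context(pick_master(sys.argv))
#   ... build and run the RDD pipeline ...
#   sc.stop()
```

With this layout, running the script without arguments exercises the code locally, and passing yarn-client as the first argument runs the same code on the cluster while keeping the traceback on the console.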

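4. An output path on the cluster (HDFS) must be deleted before the job writes to it, so that the path does not exist; otherwise saveAsTextFile fails with the FileAlreadyExistsException shown above.

If you need to write to the same output several times, first union the RDDs and then write them out in one go.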
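
A rough sketch of point 4, not the original job's code: delete the old output, union the RDDs, write once. It assumes the hdfs command-line tool is on the PATH; the function names hdfs_rm_command and union_and_save and the run parameter are invented here for illustration.

```python
import subprocess


def hdfs_rm_command(path):
    """Build the CLI call that removes path recursively; -f ignores a missing path."""
    return ["hdfs", "dfs", "-rm", "-r", "-f", path]


def union_and_save(sc, rdds, output_path, run=subprocess.call):
    """Delete the old output, then write all RDDs with one saveAsTextFile call."""
    # Remove the previous output so saveAsTextFile does not raise
    # FileAlreadyExistsException.
    run(hdfs_rm_command(output_path))
    # SparkContext.union takes a list of RDDs and returns a single RDD.
    combined = sc.union(rdds)
    combined.saveAsTextFile(output_path)
    return combined
```

Called as union_and_save(sc, [rdd_a, rdd_b], output_path), this writes the output directory exactly once, however many RDDs contribute to it.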


II: MongoDB

1. How to start MongoDB:

sudo systemctl start mongodb
Do not launch MongoDB directly by hand; start it as a service, otherwise you run into a pile of write-permission problems.


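2. Common tools:

Client applications: mongo for interactive queries and the like; mongostat for watching the database's write rate and similar statistics.

For the common commands, see help.


3. Slow MongoDB writes: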

3.1 MongoClient("mongodb://" + ip + ":27017", maxPoolSize=200)

Pass the maxPoolSize parameter when creating the MongoClient.

3.2 Use an index

self.collection.create_index([('model', DESCENDING), ('version', DESCENDING), ('bn', DESCENDING), ('imei', DESCENDING)],
                                    name=index_name, unique=True, background=True)


3.3 Use bulk writes
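
A minimal sketch of batched writing. The notes do not show how the bulk object is built, so this swaps in pymongo's Collection.insert_many as a simpler stand-in for the bulk-operation builder; the helper name insert_in_batches and the batch size of 1000 are assumptions, not tuned values.

```python
def insert_in_batches(collection, docs, batch_size=1000):
    """Send docs to MongoDB in batches; return the number of documents sent."""
    sent = 0
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            # ordered=False lets the server attempt every document in the
            # batch instead of stopping at the first failure.
            collection.insert_many(batch, ordered=False)
            sent += len(batch)
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        collection.insert_many(batch, ordered=False)
        sent += len(batch)
    return sent
```

One round trip per batch instead of one per document is where the speed-up comes from. With a unique index in place, duplicates in a batch surface as a BulkWriteError.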


4. Debugging MongoDB:

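        # needs: import traceback; from pymongo import errors
        try:
            bulk.execute()
        except errors.BulkWriteError as bwe:
            # bwe.details describes each failed write (e.g. duplicate keys)
            print(bwe.details)
            print(traceback.format_exc())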


