Some basic usage notes on Hadoop and Spark



A new project will use Python + Hadoop + Spark:

  • HDFS has no notion of pwd (a current working directory), so always keep track of the full paths you are working with
  • Environment: ipython3, Python 3.5, spark-2.1.1-bin-hadoop2.7
# Import the module and create the connection
In [1]: from hdfs import Client
In [2]: client = Client('http://master:50070')

# List files under a path
In [6]: client.list("/")
Out[6]: ['ligq', 'test', 'tmp', 'user']

# Inspect file status
In [7]: client.status('/tmp')
Out[7]:
{'pathSuffix': '', 'fileId': 16391, 'storagePolicy': 0, 'childrenNum': 0,
 'permission': '644', 'accessTime': 1495003819798, 'length': 22628,
 'blockSize': 134217728, 'type': 'FILE', 'modificationTime': 1494923718027,
 'group': 'supergroup', 'owner': 'cpda', 'replication': 2}

# Get the file's checksum; the algorithm appears to differ from Python's default MD5, **to be verified**
In [43]: client.checksum('/admin/iris_noheader.csv')
Out[43]:
{'algorithm': 'MD5-of-0MD5-of-512CRC32C',
 'bytes': '0000020000000000000000005918d3777bd0fc6ef00741feb23fe79d00000000',
 'length': 28}
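As a small extension of the session above, here is a minimal sketch of uploading a local file and reading it back with the same client. The NameNode URL and the /admin path come from these notes; the local file name is an assumption, so adjust it to your environment.

# Sketch: upload a local file to HDFS and stream it back (hypothetical local path).
from hdfs import Client

client = Client('http://master:50070')          # WebHDFS endpoint of the NameNode
client.makedirs('/admin')                       # no-op if the directory already exists
client.upload('/admin/iris_noheader.csv',       # HDFS destination
              'iris_noheader.csv',              # local source file (assumed name)
              overwrite=True)

with client.read('/admin/iris_noheader.csv') as reader:  # read() is a context manager
    data = reader.read()
print(len(data))

Note that Client talks to the NameNode over WebHDFS (port 50070 is the default HTTP port in Hadoop 2.x), so WebHDFS has to be enabled on the cluster.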

Spark SQL

In [1]: from pyspark import SparkConf, SparkContext
In [2]: from pyspark.sql import SQLContext
In [3]: conf = SparkConf().setAppName("spark_sql_test")
In [4]: sc = SparkContext(conf=conf)
In [5]: sqlContext = SQLContext(sc)
In [6]: from pyspark.sql import HiveContext
In [7]: hc = HiveContext(sc)
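Building on that setup, here is a minimal, self-contained sketch of an actual Spark SQL query; the sample rows are made up purely for illustration:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("spark_sql_test")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Made-up sample rows, just to have something to query.
df = sqlContext.createDataFrame([('alice', 1), ('bob', 2)], ['name', 'value'])
df.createOrReplaceTempView('people')            # register the DataFrame for SQL
sqlContext.sql('SELECT name FROM people WHERE value > 1').show()

In Spark 2.x, SQLContext and HiveContext are kept for backward compatibility; SparkSession is the newer unified entry point.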

Apache Spark: how to use pyspark with Python 3?

1. Edit the profile: vim ~/.profile
2. Add the following line to the file: export PYSPARK_PYTHON=python3
3. Reload it: source ~/.profile
4. Start the shell: ./bin/pyspark
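Once the shell is up, a quick sanity check (not from the original notes) confirms that Python 3 was actually picked up:

# Run inside the pyspark shell, where `sc` is already defined:
import sys
print(sys.version)      # the driver's Python version
print(sc.pythonVer)     # the Python version pyspark was launched with, e.g. '3.5'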