Spark Basics: How to Iterate over a DataFrame

The script below reads one partition of a Hive table into a DataFrame, collects the rows to the driver, walks them column by column to rebuild one tab-separated line per user, and writes the result back to HDFS:

# coding:utf-8
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext
import datetime
import sys

# Python 2: make UTF-8 the default encoding so the Chinese strings below survive joining/writing
reload(sys)
sys.setdefaultencoding('utf-8')

# Initialization
conf = SparkConf().setAppName("genUserDescWords").setMaster("local")
sc = SparkContext(conf=conf)
hc = HiveContext(sc)

# Query the partition from two days ago (the limit keeps the collect() below small)
prt_dt = (datetime.datetime.now() - datetime.timedelta(days=2)).strftime('%Y-%m-%d')
sql = "select * from test.user_likes_info where prt_dt='" + prt_dt + "' limit 100"

# Generate the DataFrame
df = hc.sql(sql)

# DataFrame -> list of Rows on the driver
rows = df.collect()
cols = df.columns
cols_len = len(cols)

all_list = []
for row in rows:
    user_info = []
    likes_info = []
    prt_info = []
    most_like = []
    for idx, col in enumerate(cols):
        if idx < 3:                    # first three columns: user info
            user_info.append(row[col])
        elif idx == cols_len - 1:      # last column: partition info
            prt_info.append(row[col])  # (enumerate is 0-based, so the original test "idx == cols_len" never matched)
        else:
            if idx == 3 and row[col]:  # fourth column: the user's strongest interest
                most_like = row[col]   # assumed to hold a list, since most_like[0] is read below
            else:
                likes_info.append(row[col])
    # Replace the duplicate entry for the strongest interest with an annotated one
    if most_like and most_like[0] in likes_info:
        likes_info.pop(likes_info.index(most_like[0]))
        likes_info += [most_like[0] + ",您大多数时间都花在这上面"]  # "this is where you spend most of your time"
    likes_info = list(set(likes_info))
    all_list.append('\t'.join(user_info + likes_info + prt_info))

# list -> RDD
rdd = sc.parallelize(all_list)

# Save the data to HDFS through the RDD
rdd.saveAsTextFile("hdfs://192.168.2.61:8020/user/hhc/spark_test")
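The collect() call above pulls every row onto the driver, which is fine with limit 100 but will not scale to a full table. A more Spark-idiomatic way to traverse a DataFrame is to push the per-row work onto the executors with df.rdd.map. The sketch below is a simplified stand-in for the column loop above; the format_row helper and the output path are illustrative, not part of the original script:

def format_row(row):
    # A Row behaves like a tuple, so we can join all column values
    # into one tab-separated line; unicode() keeps Chinese text intact (Python 2).
    return '\t'.join(unicode(v) for v in row)

# The mapping runs on the executors; nothing is collected on the driver.
lines = df.rdd.map(format_row)
lines.saveAsTextFile("hdfs://192.168.2.61:8020/user/hhc/spark_test_distributed")  # illustrative path

If you genuinely need driver-side iteration over a large result, df.rdd.toLocalIterator() (available in reasonably recent Spark versions) streams rows one partition at a time instead of materializing everything at once.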

Save the code above as a .py file, then run it with the command /usr/bin/spark-submit file_name.py.
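When the job graduates from local testing to a cluster, spark-submit accepts the usual resource flags. The values below are illustrative, not tuned recommendations:

/usr/bin/spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 4 \
  --executor-memory 2g \
  file_name.py

Note that the script above hard-codes .setMaster("local") on its SparkConf, and properties set in application code take precedence over spark-submit flags, so remove that call if you want --master on the command line to take effect.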
