spark | Machine Learning with Spark, Chapter 3: Obtaining, Processing, and Preparing Data
Source: 程序博客网, 2024/05/08
These are reading notes from the book Machine Learning with Spark, on doing machine learning on top of Spark.
Note: the dataset is the MovieLens 100k movie-rating data, download link: http://files.grouplens.org/datasets/movielens/ml-100k.zip
The dataset consists of three files: user attributes, movie metadata, and users' ratings of movies.
1. Unzip the data into a directory and switch into it:

unzip ml-100k.zip
cd ml-100k
2. Inspect the three kinds of data: users (u.user), movies (u.item), and ratings (u.data).
3. Start the pyspark shell to analyze the data:

/home/hadoop/spark/bin/pyspark
4. Read the user data (the pyspark shell creates the SparkContext sc for you):

user_data = sc.textFile("u.user")
user_data.first()

u'1|24|M|technician|85711'
5. Basic analysis:

# Split each line on "|"
user_fields = user_data.map(lambda line: line.split("|"))
# Number of users
num_users = user_fields.map(lambda fields: fields[0]).count()
# Number of distinct genders
num_genders = user_fields.map(lambda fields: fields[2]).distinct().count()
# Number of distinct occupations
num_occupations = user_fields.map(lambda fields: fields[3]).distinct().count()
# Number of distinct ZIP codes
num_zipcodes = user_fields.map(lambda fields: fields[4]).distinct().count()
# Print the results
print "Users: %d, genders: %d, occupations: %d, ZIP codes: %d" % (num_users, num_genders, num_occupations, num_zipcodes)
Users:943 ,genders:2 ,occupations:21 ,ZIP codes:795
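Without a Spark cluster at hand, the same split/distinct/count logic can be sketched in plain Python. The sample lines below are made-up u.user records in the same pipe-delimited format (id|age|gender|occupation|zip), not real rows from the dataset:

```python
# Plain-Python sketch of the Spark distinct()/count() pattern above.
sample = [
    "1|24|M|technician|85711",
    "2|53|F|other|94043",
    "3|23|M|writer|32067",
]

fields = [line.split("|") for line in sample]
num_users = len(fields)
num_genders = len({f[2] for f in fields})       # set = distinct genders
num_occupations = len({f[3] for f in fields})   # set = distinct occupations

print("Users: %d, genders: %d, occupations: %d" % (num_users, num_genders, num_occupations))
```

The set comprehensions play the role of distinct(): each column is reduced to its unique values before counting.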
6. Plotting

Draw a histogram of the age attribute. Plotting is not possible in a plain terminal session, so the code is only shown here; it assumes an IPython shell started in matplotlib's pylab mode, which puts hist and matplotlib in scope:

ages = user_fields.map(lambda x: int(x[1])).collect()
hist(ages, bins=20, color='lightblue', normed=True)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16, 10)
7. Count the occupations and how often each occurs:

import numpy as np
count_by_occupation = user_fields.map(lambda fields: (fields[3], 1)).reduceByKey(lambda x, y: x + y).collect()
x_axis1 = np.array([c[0] for c in count_by_occupation])
y_axis1 = np.array([c[1] for c in count_by_occupation])
Print the results:

print x_axis1
[u'administrator' u'retired' u'lawyer' u'none' u'student' u'technician'
 u'programmer' u'salesman' u'homemaker' u'executive' u'doctor'
 u'entertainment' u'marketing' u'writer' u'scientist' u'educator'
 u'healthcare' u'librarian' u'artist' u'other' u'engineer']

print y_axis1
[ 79  14  12   9 196  27  66  12   7  32   7  18  26  45  31  95  16  51
  28 105  67]
Sort both axes by rating count so the bar chart below comes out ordered:

x_axis = x_axis1[np.argsort(y_axis1)]
y_axis = y_axis1[np.argsort(y_axis1)]

y_axis is now in ascending order:

array([  7,   7,   9,  12,  12,  14,  16,  18,  26,  27,  28,  31,  32,
        45,  51,  66,  67,  79,  95, 105, 196])

np.argsort() returns the indices that would sort the array in ascending order.
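A tiny standalone example (unrelated to the MovieLens data) makes the np.argsort behavior concrete:

```python
import numpy as np

a = np.array([30, 10, 20])
idx = np.argsort(a)  # indices that would sort a ascending
print(idx)           # [1 2 0]
print(a[idx])        # [10 20 30] -- fancy indexing with idx yields the sorted array
```

Indexing any same-length array with idx reorders it the same way, which is exactly how x_axis is kept aligned with y_axis above.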
Plot the bar chart (plt is matplotlib.pyplot):

pos = np.arange(len(x_axis))
width = 1.0
ax = plt.axes()
ax.set_xticks(pos + (width / 2))
ax.set_xticklabels(x_axis)
plt.bar(pos, y_axis, width, color='lightblue')
plt.xticks(rotation=30)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16, 10)
countByValue() is another way to count how often each value occurs; the following compares it with the map-reduce (reduceByKey) approach above:

count_by_occupation2 = user_fields.map(lambda fields: fields[3]).countByValue()
print "Map-reduce approach:"
print dict(count_by_occupation)
print ""
print "countByValue approach:"
print dict(count_by_occupation2)
Map-reduce approach:
{u'administrator': 79, u'retired': 14, u'lawyer': 12, u'healthcare': 16, u'marketing': 26, u'executive': 32, u'scientist': 31, u'student': 196, u'technician': 27, u'librarian': 51, u'programmer': 66, u'salesman': 12, u'homemaker': 7, u'engineer': 67, u'none': 9, u'doctor': 7, u'writer': 45, u'entertainment': 18, u'other': 105, u'educator': 95, u'artist': 28}

countByValue approach:
{u'administrator': 79, u'writer': 45, u'retired': 14, u'lawyer': 12, u'doctor': 7, u'marketing': 26, u'executive': 32, u'none': 9, u'entertainment': 18, u'healthcare': 16, u'scientist': 31, u'student': 196, u'educator': 95, u'technician': 27, u'librarian': 51, u'programmer': 66, u'artist': 28, u'salesman': 12, u'other': 105, u'homemaker': 7, u'engineer': 67}
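Outside Spark, countByValue is essentially collections.Counter, and reduceByKey with + is a manual frequency table. A small sketch with made-up occupation values (not the real column):

```python
from collections import Counter

# Hypothetical occupation column extracted from user records
occupations = ["student", "engineer", "student", "writer", "student"]

# Equivalent of the reduceByKey map-reduce pattern: accumulate (key, 1) pairs
counts_mr = {}
for occ in occupations:
    counts_mr[occ] = counts_mr.get(occ, 0) + 1

# Equivalent of countByValue: count each distinct value in one call
counts_cv = Counter(occupations)

print(counts_mr == dict(counts_cv))  # True: both give the same frequency table
```

As in the Spark output above, the two approaches agree on every count; only the iteration order of the resulting dicts may differ.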
Exploring the movie data

Parse the features of the movie metadata. Read and inspect the data:

movie_data = sc.textFile("u.item")

# First line
print movie_data.first()

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
Count the movies (num_movies has to be computed first):

num_movies = movie_data.count()
print "Movies: %d" % num_movies

Movies: 1682
Process the movie release years.

The third field is the release date, in the format 01-Jan-1995, so the slice [-4:] extracts the year. First handle missing values by defining a function that falls back to 1900:

def convert_year(x):
    try:
        return int(x[-4:])
    except:
        return 1900
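convert_year can be checked on a couple of inputs; the empty string below stands in for the blank date fields that occur in u.item:

```python
def convert_year(x):
    # Last four characters of "01-Jan-1995" are the year;
    # anything unparseable (e.g. an empty date field) maps to 1900.
    try:
        return int(x[-4:])
    except ValueError:
        return 1900

print(convert_year("01-Jan-1995"))  # 1995
print(convert_year(""))             # 1900
```

Catching ValueError rather than a bare except is slightly safer: it only swallows the "not an integer" failure the fallback is meant for.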
Extract and convert the release years:

movie_fields = movie_data.map(lambda lines: lines.split("|"))
years = movie_fields.map(lambda fields: fields[2]).map(lambda x: convert_year(x))
Filter out the 1900 placeholder values:

years_filtered = years.filter(lambda x: x != 1900)
This dataset was collected in 1998, so a movie's age is 1998 minus its release year:

movie_ages = years_filtered.map(lambda yr: 1998 - yr).countByValue()
values = movie_ages.values()
bins = movie_ages.keys()
print values
print bins
[65, 286, 355, 219, 214, 126, 37, 22, 24, 15, 11, 13, 15, 7, 8, 5, 13, 12, 8, 9, 4, 4, 5, 6, 8, 4, 3, 7, 3, 4, 6, 5, 2, 5, 2, 6, 5, 3, 5, 4, 9, 8, 4, 5, 7, 2, 3, 5, 7, 4, 3, 5, 5, 4, 5, 4, 2, 5, 8, 7, 3, 4, 2, 4, 4, 2, 1, 1, 1, 1, 1]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 72, 76]
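Relying on dict iteration order for bins and values is fragile; sorting the (age, count) pairs explicitly keeps the two lists aligned in ascending age order. A sketch with a toy stand-in dict (made-up subset, not the full countByValue result):

```python
# Toy stand-in for the movie_ages dict produced by countByValue()
movie_ages = {2: 355, 0: 65, 3: 219, 1: 286}

# Sort by age so bins and values line up in ascending order
pairs = sorted(movie_ages.items())
bins = [age for age, _ in pairs]
values = [count for _, count in pairs]

print(bins)    # [0, 1, 2, 3]
print(values)  # [65, 286, 355, 219]
```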
Plot the age histogram:

hist(values, bins=bins, color='lightblue', normed=True)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16, 10)
Exploring the rating data

Read the data and count the records:

# Rating data
rating_data = sc.textFile("u.data")
print rating_data.first()
num_ratings = rating_data.count()
print "Ratings: %d" % num_ratings

196	242	3	881250949
Ratings: 100000

100,000 records in total.
Some basic statistics:

# Split each line on tabs
rating_data1 = rating_data.map(lambda line: line.split("\t"))
# The rating values
ratings = rating_data1.map(lambda fields: int(fields[2]))
# Highest rating
max_rating = ratings.reduce(lambda x, y: max(x, y))
# Lowest rating
min_rating = ratings.reduce(lambda x, y: min(x, y))
# Mean rating -- note: Python 2's "/" on ints truncates, hence 3.00 below
# (stats() further down reports the true mean, 3.53)
mean_rating = ratings.reduce(lambda x, y: x + y) / num_ratings
# Median rating
median_rating = np.median(ratings.collect())
# Average number of ratings per user (also truncated by integer division)
ratings_per_user = num_ratings / num_users
# Average number of ratings per movie
ratings_per_movie = num_ratings / num_movies
print "Min rating: %d" % min_rating
print "Max rating: %d" % max_rating
print "Average rating: %2.2f" % mean_rating
print "Median rating: %d" % median_rating
print "Average # of ratings per user: %2.2f" % ratings_per_user
print "Average # of ratings per movie: %2.2f" % ratings_per_movie
Min rating: 1
Max rating: 5
Average rating: 3.00
Median rating: 4
Average # of ratings per user: 106.00
Average # of ratings per movie: 59.00
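The truncated averages above come from Python 2 integer division; forcing float division recovers the kind of value stats() reports below. A standalone sketch with a tiny made-up ratings list:

```python
# Toy ratings list standing in for ratings.collect()
ratings = [3, 5, 4, 1, 5, 4]
num_ratings = len(ratings)

mean_int = sum(ratings) // num_ratings          # truncated, like Python 2's int "/"
mean_float = sum(ratings) / float(num_ratings)  # exact mean

print(mean_int)    # 3
print(mean_float)  # 3.666...
```

The same float() fix applies to ratings_per_user and ratings_per_movie, whose true values are 106.04 and 59.45 rather than the 106.00 and 59.00 printed above.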
The stats() method computes the same summary statistics in a single call:
ratings.stats()
(count: 100000, mean: 3.52986, stdev: 1.12566797076, max: 5.0, min: 1.0)
Count the number of ratings per user. The original snippet mapped over the unsplit rating_data and used a Python 2 tuple-parameter lambda (lambda (k, v): ...), which Spark 2.x on Python 3 rejects; the version below fixes both:

user_ratings_grouped = rating_data1.map(lambda fields: (int(fields[0]), int(fields[2]))).groupByKey()
user_ratings_byuser = user_ratings_grouped.map(lambda kv: (kv[0], len(kv[1])))
user_ratings_byuser.take(5)
Plot the distribution:

user_ratings_byuser_local = user_ratings_byuser.map(lambda kv: kv[1]).collect()
hist(user_ratings_byuser_local, bins=200, color='lightblue', normed=True)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16, 10)