Advanced Analytics with Spark series, Chapter 3: Music Recommendation and the Audioscrobbler Data Set

Source: Internet. Editor: 程序博客网. Time: 2024/04/30 20:06

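3.1 Data set and overall approach

Data set
This chapter implements music recommendation with the ALS (alternating least squares) algorithm. ALS is the recommendation algorithm that spark.mllib provides, and it suits Spark well because its computation can be parallelized.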

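The data set contains three files:
Table 1: user_artist_data.txt   contains (user ID, artist ID, number of times the user played the artist)

Table 2: artist_data.txt   contains (artist ID, artist name)

Table 3: artist_alias.txt   because of typos or other causes, the same artist may appear under several IDs; this file is the correction table (bad_id, good_id)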

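Program structure
Step 1: clean the data
ALS expects input of the form (user, product, value). In this experiment that is (user ID, artist ID, play count), which is the data in the first file, user_artist_data.txt. Because of typos or other causes, the same artist can have several IDs, and these have to be merged into a single ID using the third file, artist_alias.txt. In that file the first column is the wrong ID and the second column is the canonical ID, so the artist IDs in Table 1 are corrected through Table 3. The files also contain missing or malformed entries, which have to be handled. Finally the data are structured as (user ID, artist ID, play count).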
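Step 2: pass the data to ALS for training, then make predictions
Conceptually, ALS arranges the (user ID, artist ID, play count) records into a table:
each row is a user, each column is an artist, and each cell holds the user's play count. A user listens to only a small fraction of all artists, so this table is a sparse matrix A. ALS approximates A as the product of two much smaller matrices, X and the transpose of Y.
X is the (user, feature) matrix and Y is the (artist, feature) matrix; both have k columns. k is the number of latent features and is chosen by the user; this section uses k = 10. In this way the large sparse matrix is represented by two small dense ones.
The remaining question is how to obtain X and Y, and this is where alternating least squares comes in. Determining X and Y at the same time is hard, but if one of them is fixed, solving for the other is easy. So Y is initialized randomly, the best X for that Y is computed, then the best Y for that X, and the process repeats. Once Y is fixed, the optimal X can be computed from A and Y.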
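The closed-form solution that belongs at this point can be sketched as follows. The notation is an assumption, since no formula survives in the text here: A_i and X_i denote row i of A and of X. It is the standard least-squares solution of A_i ≈ X_i Yᵀ for a fixed Y.

```latex
A \approx X Y^{T}, \qquad X_i = A_i \, Y \, (Y^{T} Y)^{-1}
```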
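In fact every row of X can be computed independently of the others, so the computation can be parallelized, which is a major advantage for large-scale problems.

Exact equality between A and the product of X and Y is not achievable, so the actual goal is to minimize the error between them. In practice the matrix inverse that appears in the closed-form solution is never computed explicitly; the solution is obtained with methods such as QR decomposition.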
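The alternation described above can be illustrated with a minimal plain-Scala sketch. It makes strong simplifying assumptions that are not in the original: k = 1 (a single feature), a small dense matrix with every entry observed, and no regularization or implicit-feedback weighting. The object name Rank1Als and its methods are hypothetical, and this is not how Spark implements ALS internally.

```scala
object Rank1Als {
  // Dot product of two equally sized vectors.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (p, q) => p * q }.sum

  // With y fixed, x(i) depends only on row i of A, so every row is solved independently.
  // For k = 1 the least-squares solution is x(i) = (A_i . y) / (y . y).
  def solveRows(a: Array[Array[Double]], y: Array[Double]): Array[Double] = {
    val yy = dot(y, y)
    a.map(row => dot(row, y) / yy)
  }

  // Start from a random y, then alternate: solve for x given y, solve for y given x.
  def factorize(a: Array[Array[Double]], iterations: Int, seed: Long): (Array[Double], Array[Double]) = {
    val random = new scala.util.Random(seed)
    var y = Array.fill(a.head.length)(0.1 + random.nextDouble())
    var x = Array.fill(a.length)(0.0)
    for (_ <- 1 to iterations) {
      x = solveRows(a, y)
      y = solveRows(a.transpose, x)
    }
    (x, y)
  }

  // Largest absolute difference between A and the rank-1 product x * y^T.
  def maxError(a: Array[Array[Double]], x: Array[Double], y: Array[Double]): Double =
    (for (i <- a.indices; j <- a(i).indices) yield math.abs(a(i)(j) - x(i) * y(j))).max
}
```

Because solveRows maps over rows without sharing state between them, the same step could be distributed across machines, which is the property the text relies on.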

3.2 Code walkthrough

Preparing the data

To make sure there is enough memory, start spark-shell with the option --driver-memory 6g.

Reading the data

val rawUserArtistData = sc.textFile("/home/sam/下载/profiledata_06-May-2005/user_artist_data.txt")
val rawArtistData = sc.textFile("/home/sam/下载/profiledata_06-May-2005/artist_data.txt")
val rawArtistAlias = sc.textFile("/home/sam/下载/profiledata_06-May-2005/artist_alias.txt")

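The ALS implementation has one small drawback: user and product IDs must be numeric, specifically non-negative 32-bit integers, so the data need a range check. The maximum user ID and artist ID turn out to be 2443548 and 10794401, both far below Int.MaxValue (2147483647), so the requirement is met.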

// stats() returns the count, mean, standard deviation, max and min of the values
rawUserArtistData.map(_.split(' ')(0).toDouble).stats()   // user IDs
rawUserArtistData.map(_.split(' ')(1).toDouble).stats()   // artist IDs
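The condition being checked can be written down explicitly. The following is a small plain-Scala sketch, with a hypothetical helper name fitsInInt, that tests whether a value reported by stats() is a non-negative whole number that fits in a 32-bit Int.

```scala
object IdRangeCheck {
  // True when the value is a non-negative whole number no larger than Int.MaxValue.
  def fitsInInt(value: Double): Boolean =
    value >= 0 && value <= Int.MaxValue && value == math.floor(value)
}
```

Applied to the two maxima reported above, 2443548 and 10794401, the check passes for both.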

Handle missing values: lines with no name or with a non-numeric ID produce None, and flatMap then drops them.

val artistByID = rawArtistData.flatMap { line =>
  val (id, name) = line.span(_ != '\t')
  if (name.isEmpty) {
    None
  } else {
    try {
      Some((id.toInt, name.trim))
    } catch {
      case e: NumberFormatException => None
    }
  }
}
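The same parsing logic can be tried without Spark. Below is a minimal sketch on plain strings; the object name ArtistLineParser is hypothetical, and Try is used in place of the explicit try/catch, which behaves the same for a non-numeric ID but also swallows other non-fatal exceptions.

```scala
import scala.util.Try

object ArtistLineParser {
  // Split at the first tab; None when the name part is missing or the ID is not an integer.
  def parse(line: String): Option[(Int, String)] = {
    val (id, name) = line.span(_ != '\t')
    if (name.isEmpty) None
    else Try(id.toInt).toOption.map(parsedId => (parsedId, name.trim))
  }
}
```

Calling flatMap with this function over a list of lines keeps only the well-formed (id, name) pairs, which is what the RDD version does at scale.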

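For the alias table, lines with an empty first field produce None, and at the same time the string IDs are converted to Int.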

val artistAlias = rawArtistAlias.flatMap { line =>
  val tokens = line.split('\t')
  if (tokens(0).isEmpty) {
    None
  } else {
    Some((tokens(0).toInt, tokens(1).toInt))
  }
}.collectAsMap()

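Building the model

Import the required packages.
Use the (bad_id, good_id) pairs from Table 3 as a broadcast variable. A broadcast variable is cached once per machine rather than once per task, and each machine runs many tasks. Every task needs to look up artistAlias; if it were simply captured in the closure, every task would carry its own copy, which increases memory use. Spark also distributes broadcast variables with efficient broadcast algorithms, which reduces communication cost.
Then convert the Table 1 data into the Rating type that the ALS model expects, correcting artist IDs against Table 3 along the way.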

import org.apache.spark.mllib.recommendation._

val bArtistAlias = sc.broadcast(artistAlias)

// Assemble the training data
val trainData = rawUserArtistData.map { line =>
  val Array(userID, artistID, count) = line.split(' ').map(_.toInt)
  // Replace a bad_id with its good_id; keep the ID unchanged if it has no alias
  val finalArtistID = bArtistAlias.value.getOrElse(artistID, artistID)
  Rating(userID, finalArtistID, count)
}.cache()
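The alias substitution itself does not depend on Spark. A minimal sketch on a plain Map, with a hypothetical object name and made-up IDs, shows what getOrElse(artistID, artistID) does to each record.

```scala
object AliasResolver {
  // Parse "user artist count" and replace a known bad artist ID with its canonical ID.
  // IDs without an entry in the alias map are returned unchanged.
  def resolve(line: String, aliases: Map[Int, Int]): (Int, Int, Int) = {
    val Array(userID, artistID, count) = line.split(' ').map(_.toInt)
    (userID, aliases.getOrElse(artistID, artistID), count)
  }
}
```

With an alias map containing 10 -> 1, the line "7 10 3" resolves to (7, 1, 3), while "7 2 5" stays (7, 2, 5).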
Build the model
val model = ALS.trainImplicit(trainData, 10, 5, 0.01, 1.0)
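Meaning of the model parameters
• rank = 10
The number of latent factors k in the model, i.e. the number of columns in the "user-feature" and "product-feature" matrices; in general it is also the rank of the factorization.
• iterations = 5
The number of iterations of the factorization; more iterations take longer but may give a better factorization.
• lambda = 0.01
The standard overfitting (regularization) parameter; larger values make overfitting less likely, but values that are too large reduce the accuracy of the factorization.
• alpha = 1.0
Controls the relative weight of observed "user-product" interactions versus unobserved ones in the factorization.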
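To make the role of alpha more concrete, here is a sketch of the weighting commonly described for implicit-feedback ALS. It is an assumption about the general formulation rather than something stated in this post: the play count r is turned into a binary preference and a confidence of 1 + alpha * r. The object and method names are hypothetical.

```scala
object ImplicitFeedback {
  // Binary preference: 1 if the user played the artist at all, otherwise 0.
  def preference(playCount: Double): Double = if (playCount > 0) 1.0 else 0.0

  // Confidence grows linearly with the play count, scaled by alpha.
  // An unobserved pair (play count 0) keeps the baseline confidence 1.0.
  def confidence(playCount: Double, alpha: Double): Double = 1.0 + alpha * playCount
}
```

Under this reading, a larger alpha makes observed plays count for more relative to the unobserved user-artist pairs.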

Inspecting the results

First look at the artists that user 2093760 has listened to
// Select the records for user 2093760
val rawArtistsForUser = rawUserArtistData.map(_.split(' ')).
  filter { case Array(user, _, _) => user.toInt == 2093760 }

// Collect the artist IDs as a set of Int
val existingProducts = rawArtistsForUser.map {
  case Array(_, artist, _) => artist.toInt
}.collect().toSet

// Print the artist names using Table 2
artistByID.filter { case (id, name) =>
  existingProducts.contains(id)
}.values.collect().foreach(println)
Use the model just trained to recommend 5 artists to user 2093760
val recommendations = model.recommendProducts(2093760, 5)
recommendations.foreach(println)
The output is
Rating(2093760,1300642,0.02833118412903932)
Rating(2093760,2814,0.027832682960168387)
Rating(2093760,1037970,0.02726611004625264)
Rating(2093760,1001819,0.02716011293509426)
Rating(2093760,4605,0.027118271894797333)

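The last value in each result is a score, not a probability; a higher score means the user is more likely to like the artist.
Then map the artist IDs to the corresponding artist names and print them
// IDs of the recommended artists, taken from the recommendations above
val recommendedProductIDs = recommendations.map(_.product).toSet
artistByID.filter { case (id, name) =>
  recommendedProductIDs.contains(id)
}.values.collect().foreach(println)
The output is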
Green Day
Linkin Park
Metallica
My Chemical Romance
System of a Down

Evaluating the model

The model is evaluated mainly with AUC (area under the ROC curve); the details of AUC are not covered here.
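As a rough illustration of the quantity that the appendix function computes, the sketch below treats a user's AUC as the fraction of (positive, negative) score pairs in which the held-out positive item scores strictly higher, and then averages over users. It runs on plain Scala collections with made-up scores; the object name PairwiseAuc is hypothetical.

```scala
object PairwiseAuc {
  // Fraction of (positive, negative) pairs where the positive score is strictly higher.
  // Both sequences are assumed to be non-empty.
  def auc(positive: Seq[Double], negative: Seq[Double]): Double = {
    val pairs = for (p <- positive; n <- negative) yield (p, n)
    pairs.count { case (p, n) => p > n }.toDouble / pairs.size
  }

  // Average of the per-user AUC values, i.e. the "mean AUC".
  def meanAuc(perUser: Seq[(Seq[Double], Seq[Double])]): Double =
    perUser.map { case (pos, neg) => auc(pos, neg) }.sum / perUser.size
}
```

A value near 1.0 means held-out artists are almost always ranked above randomly chosen ones, while 0.5 is what random scores would give.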

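First split the data set into training data and test data. Here allData denotes the full RDD[Rating] built in the model-building step (the value named trainData there).

val Array(trainData, cvData) = allData.randomSplit(Array(0.9, 0.1))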

Train the model

val allItemIDs = allData.map(_.product).distinct().collect()
val bAllItemIDs = sc.broadcast(allItemIDs)
val model = ALS.trainImplicit(trainData, 10, 5, 0.01, 1.0)

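Evaluate the model. Note that the call below passes an RDD[Rating] and the RDD-based model.predict, whereas the listing in the appendix is written against DataFrames (DataFrame input and a DataFrame => DataFrame prediction function), so the two do not line up as printed and one side has to be adapted.

val auc = areaUnderCurve(cvData, bAllItemIDs, model.predict)   // areaUnderCurve is given in the appendix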

Appendix:

import scala.collection.mutable.ArrayBuffer
import scala.util.Random
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import spark.implicits._   // assumes a SparkSession named spark, as in spark-shell

def areaUnderCurve(
    positiveData: DataFrame,
    bAllArtistIDs: Broadcast[Array[Int]],
    predictFunction: (DataFrame => DataFrame)): Double = {

  // What this actually computes is AUC, per user. The result is actually something
  // that might be called "mean AUC".

  // Take held-out data as the "positive".
  // Make predictions for each of them, including a numeric score
  val positivePredictions = predictFunction(positiveData.select("user", "artist")).
    withColumnRenamed("prediction", "positivePrediction")

  // BinaryClassificationMetrics.areaUnderROC is not used here since there are really lots of
  // small AUC problems, and it would be inefficient, when a direct computation is available.

  // Create a set of "negative" products for each user. These are randomly chosen
  // from among all of the other artists, excluding those that are "positive" for the user.
  val negativeData = positiveData.select("user", "artist").as[(Int,Int)].
    groupByKey { case (user, _) => user }.
    flatMapGroups { case (userID, userIDAndPosArtistIDs) =>
      val random = new Random()
      val posItemIDSet = userIDAndPosArtistIDs.map { case (_, artist) => artist }.toSet
      val negative = new ArrayBuffer[Int]()
      val allArtistIDs = bAllArtistIDs.value
      var i = 0
      // Make at most one pass over all artists to avoid an infinite loop.
      // Also stop when number of negative equals positive set size
      while (i < allArtistIDs.length && negative.size < posItemIDSet.size) {
        val artistID = allArtistIDs(random.nextInt(allArtistIDs.length))
        // Only add new distinct IDs
        if (!posItemIDSet.contains(artistID)) {
          negative += artistID
        }
        i += 1
      }
      // Return the set with user ID added back
      negative.map(artistID => (userID, artistID))
    }.toDF("user", "artist")

  // Make predictions on the rest:
  val negativePredictions = predictFunction(negativeData).
    withColumnRenamed("prediction", "negativePrediction")

  // Join positive predictions to negative predictions by user, only.
  // This will result in a row for every possible pairing of positive and negative
  // predictions within each user.
  val joinedPredictions = positivePredictions.join(negativePredictions, "user").
    select("user", "positivePrediction", "negativePrediction").cache()

  // Count the number of pairs per user
  val allCounts = joinedPredictions.
    groupBy("user").agg(count(lit("1")).as("total")).
    select("user", "total")
  // Count the number of correctly ordered pairs per user
  val correctCounts = joinedPredictions.
    filter($"positivePrediction" > $"negativePrediction").
    groupBy("user").agg(count("user").as("correct")).
    select("user", "correct")

  // Combine these, compute their ratio, and average over all users
  val meanAUC = allCounts.join(correctCounts, "user").
    select($"user", ($"correct" / $"total").as("auc")).
    agg(mean("auc")).
    as[Double].first()

  joinedPredictions.unpersist()

  meanAUC
}
