Recommender Systems: Evaluating Collaborative Filtering Recommender Systems


Paper Title

Evaluating Collaborative Filtering Recommender Systems 
JONATHAN L. HERLOCKER
School of Electrical Engineering & Computer Science, Oregon State University 
and 
JOSEPH A. KONSTAN, LOREN G. TERVEEN, AND JOHN T. RIEDL 

GroupLens Research Group, University of Minnesota

Abstract

Recommender systems have been evaluated in many, often incomparable, ways. In this paper we review the key decisions in evaluating collaborative filtering recommender systems: the user tasks being evaluated, the types of analysis and datasets being used, the ways in which prediction quality is measured, the evaluation of prediction attributes other than quality, and the user-based evaluation of the system as a whole. In addition to reviewing the evaluation strategies used by prior researchers, we present empirical results from the analysis of various accuracy metrics on one content domain where all the tested metrics collapsed roughly into three equivalence classes. Metrics within each equivalency class were strongly correlated, while metrics from different equivalency classes were uncorrelated.

Categories and Subject Descriptors: H.3.4 [Information Storage and Retrieval]: Systems and Software – Performance Evaluation (efficiency and effectiveness)
General Terms: Experimentation, Measurement, Performance 
Additional Key Words and Phrases: Collaborative filtering, recommender systems, metrics, evaluation


INTRODUCTION 

Recommender systems use the opinions of a community of users to help individuals in that community more effectively identify content of interest from a potentially overwhelming set of choices [Resnick and Varian 1997]. One of the most successful technologies for recommender systems, called collaborative filtering, has been developed and improved over the past decade to the point where a wide variety of algorithms exist for generating recommendations. Each algorithmic approach has adherents who claim it to be superior for some purpose. Clearly identifying the best algorithm for a given purpose has proven challenging, in part because researchers disagree on which attributes should be measured, and on which metrics should be used for each attribute. Researchers who survey the literature will find over a dozen quantitative metrics and additional qualitative evaluation techniques.
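
For readers new to the area, here is a minimal sketch of one classic collaborative filtering approach: user-based nearest neighbors with Pearson-correlation weighting, in the spirit of the GroupLens line of work. The tiny ratings matrix and all numbers are illustrative assumptions, not data from the paper.

```python
import numpy as np

# Toy user-item rating matrix (0 = unrated); values are illustrative only.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)

def pearson(u, v):
    """Pearson correlation over the items both users have rated."""
    mask = (u > 0) & (v > 0)
    if mask.sum() < 2:
        return 0.0
    cu, cv = u[mask] - u[mask].mean(), v[mask] - v[mask].mean()
    denom = np.sqrt((cu ** 2).sum() * (cv ** 2).sum())
    return 0.0 if denom == 0 else (cu * cv).sum() / denom

def predict(R, user, item):
    """Predict a rating as the user's mean rating plus the
    similarity-weighted rating deviations of neighbors who rated the item."""
    mu = R[user][R[user] > 0].mean()
    num = den = 0.0
    for v in range(R.shape[0]):
        if v == user or R[v, item] == 0:
            continue
        w = pearson(R[user], R[v])
        num += w * (R[v, item] - R[v][R[v] > 0].mean())
        den += abs(w)
    return mu if den == 0 else mu + num / den

print(round(predict(R, user=0, item=2), 2))  # predicted rating for an unrated item
```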


Evaluating recommender systems and their algorithms is inherently difficult for several reasons. First, different algorithms may be better or worse on different data sets. Many collaborative filtering algorithms have been designed specifically for data sets where there are many more users than items (e.g., the MovieLens data set has 65,000 users and 5,000 movies). Such algorithms may be entirely inappropriate in a domain where there are many more items than users (e.g., a research paper recommender with thousands of users but tens or hundreds of thousands of articles to recommend). Similar differences exist for ratings density, ratings scale, and other properties of data sets. 
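
Because these shape and density properties drive algorithm suitability, it is worth characterizing a ratings dataset before evaluating on it. The sketch below computes the user/item ratio and rating density for an illustrative list of (user, item, rating) triples; the triples are placeholders, not MovieLens data.

```python
# Characterize a ratings dataset before choosing algorithms to evaluate.
# The triples below are illustrative placeholders, not real data.
ratings = [(1, 10, 4.0), (1, 11, 3.0), (2, 10, 5.0), (3, 12, 2.0)]

users = {u for u, _, _ in ratings}
items = {i for _, i, _ in ratings}

n_users, n_items = len(users), len(items)
density = len(ratings) / (n_users * n_items)  # fraction of the matrix that is filled

print(f"users={n_users} items={n_items} user/item ratio={n_users / n_items:.2f}")
print(f"density={density:.2%}")  # real CF datasets are typically only a few percent dense
```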


The second reason that evaluation is difficult is that the goals for which an evaluation is performed may differ. Much early evaluation work focused specifically on the "accuracy" of collaborative filtering algorithms in "predicting" withheld ratings. Even early researchers recognized, however, that when recommenders are used to support decisions, it can be more valuable to measure how often the system leads its users to wrong choices. Shardanand and Maes measured "reversals" – large errors between the predicted and actual rating [1995]; we have used the signal-processing measure of the Receiver Operating Characteristic curve [Swets 1963] to measure a recommender's potential as a filter [Konstan et al. 1997]. Other work has speculated that there are properties different from accuracy that have a larger effect on user satisfaction and performance. A range of research and systems have looked at measures including the degree to which the recommendations cover the entire set of items [Mobasher et al. 2001], the degree to which recommendations made are non-obvious [McNee et al. 2002], and the ability of recommenders to explain their recommendations to users [Sinha and Swearingen 2002]. A few researchers have argued that these issues are all details, and that the bottom-line measure of recommender system success should be user satisfaction. Commercial systems measure user satisfaction by the number of products purchased (and not returned!), while non-commercial systems may just ask users how satisfied they are.
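
As a hedged illustration of the two decision-support measures just mentioned, the sketch below counts "reversals" (here assumed to mean predictions at least 3 points off on a 5-point scale; Shardanand and Maes's exact threshold may differ) and traces an ROC curve by treating ratings of 4 or higher as "relevant" (again an assumed cutoff). All ratings are invented for illustration.

```python
import numpy as np

# Illustrative predicted and actual ratings on a 5-point scale (made up).
actual = np.array([5, 4, 2, 1, 3, 5, 2, 4])
predicted = np.array([4.5, 2.0, 2.5, 4.0, 3.0, 4.8, 1.5, 3.5])

# "Reversals": large prediction errors; the >= 3 threshold is an assumption.
reversals = np.abs(predicted - actual) >= 3
print(f"reversal rate: {reversals.mean():.2f}")

# ROC: treat actual >= 4 as relevant (assumed cutoff) and sweep a
# prediction threshold, recording true/false positive rates.
relevant = actual >= 4
for t in np.arange(1.0, 5.0, 0.5):
    flagged = predicted >= t
    tpr = (flagged & relevant).sum() / relevant.sum()
    fpr = (flagged & ~relevant).sum() / (~relevant).sum()
    print(f"threshold={t:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```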


Finally, there is a significant challenge in deciding what combination of measures to use in comparative evaluation. We have noticed a trend recently -- many researchers find that their newest algorithms yield a mean absolute error of 0.73 (on a five-point rating scale) on movie rating data sets. Though the new algorithms often appear to do better than the older algorithms they are compared to, we find that when each algorithm is tuned to its optimum, they all produce similar measures of quality. We – and others – have speculated that we may be reaching some "magic barrier" where natural variability may prevent us from getting much more accurate. In support of this, Hill et al. [1995] have shown that users provide inconsistent ratings when asked to rate the same movie at different times. They suggest that an algorithm cannot be more accurate than the variance in a user's ratings for the same item.
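
For reference, mean absolute error is simply the average of |predicted − actual| over the withheld ratings. The sketch below computes it for two hypothetical algorithms whose outputs are invented for illustration; it shows how two tuned algorithms can land at nearly indistinguishable values.

```python
import numpy as np

# Withheld (actual) ratings and two hypothetical algorithms' predictions;
# all numbers are invented for illustration.
actual = np.array([4, 3, 5, 2, 4, 1])
algo_a = np.array([3.6, 3.2, 4.1, 2.5, 3.8, 1.9])
algo_b = np.array([4.3, 2.5, 4.4, 2.2, 4.6, 1.6])

def mae(pred, truth):
    """Mean absolute error: average magnitude of prediction error."""
    return np.abs(pred - truth).mean()

# Differences this small may sit within the noise of users' own
# rating inconsistency -- the "magic barrier" discussed above.
print(f"MAE A = {mae(algo_a, actual):.2f}")  # ~0.52
print(f"MAE B = {mae(algo_b, actual):.2f}")  # ~0.47
```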

Even when accuracy differences are measurable, they are usually tiny. On a five-point rating scale, are users sensitive to a change in mean absolute error of 0.01? These observations suggest that algorithmic improvements in collaborative filtering systems may come from different directions than just continued improvements in mean absolute error. Perhaps the best algorithms should be measured according to how well they can communicate their reasoning to users, or with how little data they can yield accurate recommendations. If this is true, new metrics will be needed to evaluate these new algorithms. 

This paper presents six specific contributions towards evaluation of recommender systems. 
1.  We introduce a set of recommender tasks that categorize the user goals for a particular recommender system. 
2.  We discuss the selection of appropriate datasets for evaluation. We explore when evaluation can be completed off-line using existing datasets and when it requires on-line experimentation. We briefly discuss synthetic data sets and more extensively review the properties of datasets that should be considered in selecting them for evaluation. 
3.  We survey evaluation metrics that have been used to evaluate recommender systems in the past, conceptually analyzing their strengths and weaknesses. 
4.  We report on experimental results comparing the outcomes of a set of different accuracy evaluation metrics on one data set. We show that the metrics collapse roughly into three equivalence classes. 
5.  By evaluating a wide set of metrics on a dataset, we show that for some datasets, while many different metrics are strongly correlated, there are classes of metrics that are uncorrelated (a minimal sketch of this kind of comparison follows this list). 
6.  We review a wide range of non-accuracy metrics, including measures of the degree to which recommendations cover the set of items, the novelty and serendipity of recommendations, and user satisfaction and behavior in the recommender system. 
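
As a rough illustration of how such metric-to-metric comparisons can be made, the sketch below computes per-user scores for two metrics (MAE and RMSE, chosen here as plausible examples, not the paper's full metric set) and correlates them across users. The data is synthetic, and the strong correlation it produces simply reflects that both metrics measure the same kind of error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-user prediction errors: 100 users, 20 withheld ratings each.
errors = rng.normal(0.0, 0.8, size=(100, 20))

# Two accuracy metrics computed per user.
mae_per_user = np.abs(errors).mean(axis=1)
rmse_per_user = np.sqrt((errors ** 2).mean(axis=1))

# Pearson correlation between the two metrics across users; metrics in the
# same "equivalence class" rank users (or algorithms) nearly identically.
r = np.corrcoef(mae_per_user, rmse_per_user)[0, 1]
print(f"correlation(MAE, RMSE) = {r:.2f}")  # typically close to 1
```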

Throughout our discussion, we separate out our review of what has been done before in the literature from the introduction of new tasks and methods.


We expect that the primary audience of this article will be collaborative filtering researchers who are looking to evaluate new algorithms against previous research and collaborative filtering practitioners who are evaluating algorithms before deploying them in recommender systems. 

There are certain aspects of recommender systems that we have specifically left out of the scope of this paper. In particular, we have decided to avoid the large area of marketing-inspired evaluation. There is extensive work on evaluating marketing campaigns based on such measures as offer acceptance and sales lift [Rogers 2001]. While recommenders are widely used in this area, we cannot add much to existing coverage of this topic. We also do not address general usability evaluation of the interfaces. That topic is well covered in the research and practitioner literature (e.g., [Helander 1988, Nielsen 1994]). We have chosen not to discuss the computational performance of recommender algorithms. Such performance is certainly important, and in the future we expect there to be work on the quality of time-limited and memory-limited recommendations. This area is just emerging, however (see, for example, Miller et al.'s recent work on recommendation on handheld devices [Miller et al. 2003]), and there is not yet enough research to survey and synthesize. Finally, we do not address the emerging question of the robustness and transparency of recommender algorithms. We recognize that recommender system robustness to manipulation by attacks (and transparency that discloses manipulation by system operators) is important, but substantially more work needs to occur in this area before there will be accepted metrics for evaluating such robustness and transparency.


The remainder of the article is arranged as follows: 

  • Section 2 – We identify the key user tasks from which evaluation methods have been determined and suggest new tasks that have not been evaluated extensively. 
  • Section 3 – A discussion regarding the factors that can affect selection of a data set on which to perform evaluation. 
  • Section 4 – An investigation of metrics that have been used in evaluating the accuracy of collaborative filtering predictions and recommendations. Accuracy has been by far the most commonly published evaluation method for collaborative filtering systems. This section also includes the results from an empirical study of the correlations between metrics. 
  • Section 5 – A discussion of metrics that evaluate dimensions other than accuracy. In addition to covering the dimensions and methods that have been used in the literature, we introduce new dimensions on which we believe evaluation should be done. 
  • Section 6 – Final conclusions, including a list of areas where we feel future work is particularly warranted. 

(To be continued)
