Installing and Using Surprise, an Open-Source Recommender System Project on GitHub


Please credit the source when reposting: http://blog.csdn.net/chen19920219/article/details/76905381

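I recently came across a very good open-source recommender system on GitHub, with more than 700 stars. It includes the commonly used matrix factorization algorithms, such as SVD, SVD++, and NMF. GitHub repository: https://github.com/NicolasHug/Surprise. I ran into quite a few pitfalls while installing and using it, so I am recording them here.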

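Installing Surprise

The official documentation lists Python 2.7 or 3.5 as the supported environment. I used 3.5 and have not tried other versions.

The documentation describes two installation methods; I used the first one: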

$ pip install numpy

$ pip install scikit-surprise

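Before installing Surprise, first make sure the numpy module is installed. When I then installed Surprise, pip kept failing with the error "unable to find vcvarsall.bat", which on Windows indicates that the Microsoft Visual C++ compiler needed to build the package's C extensions could not be found. I searched online and found a fix at this link: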

http://jingyan.baidu.com/article/adc815138162e8f723bf7387.html

After that, running pip install scikit-surprise again worked.
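A quick way to confirm that the installation succeeded is to check that Python can locate the packages. The sketch below uses only the standard library. The helper name is_installed is made up for this illustration, and it assumes the package is imported under the name surprise, which is the import name used by scikit-surprise.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # scikit-surprise is imported under the name "surprise".
    for name in ("numpy", "surprise"):
        status = "found" if is_installed(name) else "missing"
        print(name, status)
```

If surprise is reported as missing, the pip install step did not complete and is worth rerunning.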

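Using Surprise

Surprise ships with built-in datasets, and loading a built-in dataset works differently from loading your own. I will not say much about the bundled datasets; the focus here is on how to load your own local dataset and on the methods you will use most often.

The official API documentation describes how to load a local dataset: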

Load a custom dataset

You can of course use a custom dataset. Surprise offers two ways of loading a custom dataset:

· you can either specify a single file with all the ratings and use the split() method to perform cross-validation;

· or if your dataset is already split into predefined folds, you can specify a list of files for training and testing.

Either way, you will need to define a Reader object for Surprise to be able to parse the file(s).

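The passage above explains how to load your own dataset. There are two ways to do it:

1. Use the library's split() method to set up k-fold cross-validation.

2. If you have already split your dataset into k folds yourself, specify a list of files for training and testing.

In practice the first method is preferable: the library creates the k folds for you, so you do not have to split the dataset by hand, which is simple and convenient.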
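For the second method, the list of files is a list of (training file, test file) pairs, one pair per fold. The sketch below only builds that list with the standard library. build_fold_files is a hypothetical helper, and the u1.base / u1.test naming follows the pre-split files shipped with MovieLens-100k, so adjust the pattern for your own files. In Surprise, such a list is what Dataset.load_from_folds expects together with a Reader; that call is left as a comment so the sketch runs without the library.

```python
import os


def build_fold_files(files_dir, n_folds=5):
    """Return [(train_path, test_path), ...], one pair per predefined fold."""
    return [
        (
            os.path.join(files_dir, "u%d.base" % i),
            os.path.join(files_dir, "u%d.test" % i),
        )
        for i in range(1, n_folds + 1)
    ]


if __name__ == "__main__":
    files_dir = os.path.expanduser("~/.surprise_data/ml-100k/ml-100k/")
    folds_files = build_fold_files(files_dir)
    for train_path, test_path in folds_files:
        print(train_path, test_path)
    # With Surprise installed, the list would then be handed on, e.g.:
    # data = Dataset.load_from_folds(folds_files, reader=reader)
```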

Load an entire dataset

From file examples/load_custom_dataset.py

import os

from surprise import Dataset, Reader

# path to dataset file
file_path = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.data')

# As we're loading a custom dataset, we need to define a reader. In the
# movielens-100k dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.
reader = Reader(line_format='user item rating timestamp', sep='\t')

# Load every rating from the single file, then split it into 5 folds.
data = Dataset.load_from_file(file_path, reader=reader)
data.split(n_folds=5)

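The official API documentation also provides a demo of loading a dataset, shown above. Before loading the dataset you need to create a Reader, because the method for loading a local file takes two arguments: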

classmethod load_from_file(file_path, reader)

Load a dataset from a (custom) file.

Use this if you want to use a custom dataset and all of the ratings are stored in one file. You will have to split your dataset using the split method. See an example in the User Guide.

Parameters:

· file_path (string) – The path to the file containing ratings.

· reader (Reader) – A reader to read the file.

One argument is the path to your dataset file; the other is a Reader object that you create. The Reader class is documented as follows:

class surprise.dataset.Reader(name=None, line_format=None, sep=None, rating_scale=(1, 5), skip_lines=0)

The Reader class is used to parse a file containing ratings.

Such a file is assumed to specify only one rating per line, and each line needs to respect the following structure:

user ; item ; rating ; [timestamp]

where the order of the fields and the separator (here ‘;’) may be arbitrarily defined (see below). Brackets indicate that the timestamp field is optional.

Parameters:

· name (string, optional) – If specified, a Reader for one of the built-in datasets is returned and any other parameter is ignored. Accepted values are ‘ml-100k’, ‘ml-1m’, and ‘jester’. Default is None.

· line_format (string) – The fields names, in the order at which they are encountered on a line. Example: 'item user rating'.

· sep (char) – the separator between fields. Example : ';'.

· rating_scale (tuple, optional) – The rating scale used for every rating. Default is (1, 5).

· skip_lines (int, optional) – Number of lines to skip at the beginning of the file. Default is 0.

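As described above, the Reader class parses a file whose lines must have the following structure:

user ; item ; rating ; [timestamp]

The timestamp may be omitted. user is the user's id, item is the item's id, and rating is the score that user gave to that item.

You can also define your own line structure; see the API documentation for details.

Of the Reader parameters, we generally set only line_format and sep and leave the rest at their defaults, although you can set the others as your situation requires. line_format is the format of each line of data, i.e. the user ; item ; rating structure above, and sep specifies how each line is split into fields, for example on spaces or commas.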
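To make the roles of line_format and sep concrete, here is a simplified stand-in written in plain Python. It is not Surprise's own parser: parse_ratings and its return format are invented for illustration, and it only mimics how the field order and the separator decide which column is the user, the item, and the rating.

```python
def parse_ratings(lines, line_format="user item rating", sep=",", skip_lines=0):
    """Parse rating lines into (user, item, rating) tuples.

    line_format lists the field names in the order they appear on a line;
    sep is the separator between fields. A timestamp field, if present,
    is read but ignored.
    """
    fields = line_format.split()
    ratings = []
    for line in lines[skip_lines:]:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        values = dict(zip(fields, line.split(sep)))
        ratings.append((values["user"], values["item"], float(values["rating"])))
    return ratings


if __name__ == "__main__":
    sample = [
        "user;item;rating",  # header line, skipped via skip_lines=1
        "196;242;3",
        "186;302;4",
    ]
    for row in parse_ratings(sample, sep=";", skip_lines=1):
        print(row)
```

Changing line_format to 'item user rating' swaps which column is read as the user and which as the item, which is the same idea behind the Reader parameter of that name.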

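data.split(n_folds=3) sets up 3-fold cross-validation; if you leave this call out, the default is 5 folds.

In the next part we will go through how to run experiments on your own dataset, along with the evaluation methods.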
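Newer Surprise releases no longer provide data.split() (it appears to have been removed in version 1.1); folds are produced by the iterators in surprise.model_selection instead. The sketch below shows an equivalent 3-fold split with KFold. The function name fold_sizes is made up for this example, and the file is assumed to be tab-separated in 'user item rating timestamp' order, like the MovieLens u.data file.

```python
import os


def fold_sizes(file_path, n_splits=3):
    """Split a ratings file into folds; return (train size, test size) per fold."""
    # Imported inside the function so this module still loads when
    # scikit-surprise is not installed yet.
    from surprise import Dataset, Reader
    from surprise.model_selection import KFold

    reader = Reader(line_format="user item rating timestamp", sep="\t")
    data = Dataset.load_from_file(file_path, reader=reader)

    sizes = []
    kf = KFold(n_splits=n_splits)
    for trainset, testset in kf.split(data):
        # trainset is a Trainset object; testset is a list of
        # (user, item, rating) tuples.
        sizes.append((trainset.n_ratings, len(testset)))
    return sizes


if __name__ == "__main__":
    path = os.path.expanduser("~/.surprise_data/ml-100k/ml-100k/u.data")
    for n_train, n_test in fold_sizes(path):
        print(n_train, n_test)
```

Each (trainset, testset) pair can then be passed to an algorithm's fit and test methods, one fold at a time.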
