Dealing with relational data using libFM with blocks

Original English post: https://thierrysilbermann.wordpress.com/2015/09/17/deal-with-relational-data-using-libfm-with-blocks/

train.libfm

5 0:1 2:1 6:1 9:12.5
5 0:1 3:1 6:1 9:20
4 0:1 4:1 6:1 9:78
1 1:1 2:1 8:1 9:12.5
1 1:1 3:1 8:1 9:20

and test.libfm

0 1:1 4:1 8:1 9:78
0 0:1 5:1 6:1

I'll merge them, which makes the whole process easier.

dataset.libfm

5 0:1 2:1 6:1 9:12.5
5 0:1 3:1 6:1 9:20
4 0:1 4:1 6:1 9:78
1 1:1 2:1 8:1 9:12.5
1 1:1 3:1 8:1 9:20
0 1:1 4:1 8:1 9:78
0 0:1 5:1 6:1

Now suppose we want to use the block structure.

We will have 5 files:

  • rel_user.libfm (features 0, 1 and 6-8 are user features)

0 0:1 6:1
0 1:1 8:1

In fact, you can avoid a broken feature id numbering like (0-1, 6-8) by compressing it, so that 0-1 -> 0-1 and 6-8 -> 2-4:

0 0:1 2:1
0 1:1 4:1

  • rel_product.libfm (features 2-5 and 9 are product features). The same compression applies: the original ids are shown first, then the compressed ids (2-5 -> 0-3 and 9 -> 4):

0 2:1 9:12.5
0 3:1 9:20
0 4:1 9:78
0 5:1

0 0:1 4:12.5
0 1:1 4:20
0 2:1 4:78
0 3:1

  • rel_user.train (this is now the mapping: the first 3 lines point to the first line of rel_user.libfm | /!\ indexing starts at 0)

0
0
0
1
1

  • rel_product.train (the mapping)

0
1
2
0
1

  • y.train, which contains only the ratings

5
5
4
1
1
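The block files and mappings above can be derived mechanically from the merged data. Here is a minimal sketch, assuming the user and product feature ids stated above; `build_block` is a hypothetical helper of mine, not something shipped with libFM.

```python
# Merged rows from dataset.libfm (target followed by id:value pairs).
DATASET = [
    "5 0:1 2:1 6:1 9:12.5",
    "5 0:1 3:1 6:1 9:20",
    "4 0:1 4:1 6:1 9:78",
    "1 1:1 2:1 8:1 9:12.5",
    "1 1:1 3:1 8:1 9:20",
    "0 1:1 4:1 8:1 9:78",
    "0 0:1 5:1 6:1",
]

USER_IDS = [0, 1, 6, 7, 8]      # features 0, 1 and 6-8
PRODUCT_IDS = [2, 3, 4, 5, 9]   # features 2-5 and 9


def build_block(rows, feature_ids):
    """Return (block_lines, mapping) for one relation block.

    block_lines holds each distinct entity once, with compressed feature ids.
    mapping[i] is the 0-based line of block_lines that row i refers to.
    """
    # Compress the ids: the k-th smallest original id becomes k.
    new_id = {old: new for new, old in enumerate(sorted(feature_ids))}
    block_lines, index_of, mapping = [], {}, []
    for row in rows:
        pairs = [p.split(":") for p in row.split()[1:]]
        kept = [(new_id[int(f)], v) for f, v in pairs if int(f) in new_id]
        # The target of a block line is unused, so it is written as 0.
        line = "0 " + " ".join("{0}:{1}".format(f, v) for f, v in kept)
        if line not in index_of:
            index_of[line] = len(block_lines)
            block_lines.append(line)
        mapping.append(index_of[line])
    return block_lines, mapping


user_lines, user_map = build_block(DATASET, USER_IDS)
product_lines, product_map = build_block(DATASET, PRODUCT_IDS)
targets = [row.split()[0] for row in DATASET]
```

The first five entries of `user_map`, `product_map` and `targets` correspond to rel_user.train, rel_product.train and y.train; the last two entries belong to the test set.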

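Almost done...

Now you need to create the .x and .xt files for the user block and the product block. For this you need the tools that are available in libFM's bin/ directory once you have compiled them.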

./bin/convert --ifile rel_user.libfm --ofilex rel_user.x --ofiley rel_user.y

You are forced to pass the --ofiley flag even though rel_user.y will never be used. You can delete that file each time.

Then

./bin/transpose --ifile rel_user.x --ofile rel_user.xt
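Since the same two commands are needed for every block, a small wrapper can save some typing. This is only a sketch, assuming the compiled tools live in ./bin and the blocks are named rel_user and rel_product; `block_commands` and `run_block` are hypothetical helpers.

```python
import subprocess


def block_commands(block, bin_dir="./bin"):
    """Build the convert and transpose command lines for one block."""
    convert = [bin_dir + "/convert",
               "--ifile", block + ".libfm",
               "--ofilex", block + ".x",
               "--ofiley", block + ".y"]
    transpose = [bin_dir + "/transpose",
                 "--ifile", block + ".x",
                 "--ofile", block + ".xt"]
    return [convert, transpose]


def run_block(block, bin_dir="./bin"):
    """Run both tools for one block; raises if a tool exits with an error."""
    for cmd in block_commands(block, bin_dir):
        subprocess.run(cmd, check=True)


# Only print the commands here; call run_block(name) to execute them.
for name in ("rel_user", "rel_product"):
    for cmd in block_commands(name):
        print(" ".join(cmd))
```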

Now you can do the same thing for the test set. Because we merged the train and test datasets at the beginning, the block files already cover both, so we only need to generate rel_user.test, rel_product.test and y.test.

At this point you will have a lot of files: rel_user.train, rel_user.test, rel_user.x, rel_user.xt, rel_product.train, rel_product.test, rel_product.x, rel_product.xt, y.train, y.test
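The .train and .test files are simply the head and the tail of the mappings computed on the merged data. A minimal sketch, assuming the first 5 merged rows are the training set and the last 2 are the test set; `write_split` is a hypothetical helper and the files go to a temporary directory.

```python
import os
import tempfile

# Mappings for the 7 merged rows (5 training rows, then 2 test rows).
user_map = [0, 0, 0, 1, 1, 1, 0]
product_map = [0, 1, 2, 0, 1, 2, 3]
targets = [5, 5, 4, 1, 1, 0, 0]
N_TRAIN = 5


def write_split(values, prefix, n_train, out_dir):
    """Write values[:n_train] to <prefix>.train and the rest to <prefix>.test."""
    parts = {"train": values[:n_train], "test": values[n_train:]}
    for suffix, part in parts.items():
        path = os.path.join(out_dir, prefix + "." + suffix)
        with open(path, "w") as handle:
            handle.writelines("{0}\n".format(v) for v in part)
    return parts


out_dir = tempfile.mkdtemp()
user_parts = write_split(user_map, "rel_user", N_TRAIN, out_dir)
product_parts = write_split(product_map, "rel_product", N_TRAIN, out_dir)
y_parts = write_split(targets, "y", N_TRAIN, out_dir)
```

The test mappings point into the same rel_user and rel_product block files as the training mappings, which is why the blocks only need to be converted once.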

And run:

./bin/libFM -task r -train y.train -test y.test --relation rel_user,rel_product -out output

It's a bit overkill for this problem, but I hope you get the point.


Now a real example

For this example I'll use ml-1m.zip, which you can get from the MovieLens dataset (1 million ratings).

ratings.dat (sample) / Format: UserID::MovieID::Rating::Timestamp

1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291

movies.dat (sample) / Format: MovieID::Title::Genres

1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama

users.dat (sample) / Format: UserID::Gender::Age::Occupation::Zip-code

1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
4::M::45::7::02460
5::M::25::20::55455

I'll create three different models.

  1. The simplest libFM files, trained without blocks. I'll use these features: UserID, MovieID
  2. Regular libFM files, trained without blocks. I'll use these features: UserID, MovieID, Gender, Age, Occupation, Genre (of movie)
  3. libFM files trained with blocks. I'll use the same features: UserID, MovieID, Gender, Age, Occupation, Genre (of movie)

Models 1 and 2 can be created using the following code:

# -*- coding: utf-8 -*-
__author__ = 'Silbermann Thierry'
__license__ = 'WTFPL'

import pandas as pd
import numpy as np


def create_libfm(w_filename, model_lvl=1):

    # Load the data
    file_ratings = 'ratings.dat'
    data_ratings = pd.read_csv(file_ratings, delimiter='::', engine='python',
                names=['UserID', 'MovieID', 'Ratings', 'Timestamp'])

    # movies.dat contains non-ASCII titles, hence the explicit encoding
    file_movies = 'movies.dat'
    data_movies = pd.read_csv(file_movies, delimiter='::', engine='python',
                encoding='latin-1',
                names=['MovieID', 'Name', 'Genre_list'])

    file_users = 'users.dat'
    data_users = pd.read_csv(file_users, delimiter='::', engine='python',
                names=['UserID', 'Gender', 'Age', 'Occupation', 'ZipCode'])

    # Transform data
    data_ratings = data_ratings.drop(['Timestamp'], axis=1)
    data_movies = data_movies.drop(['Name'], axis=1)
    data_users = data_users.drop(['ZipCode'], axis=1)

    # Attach the user and movie attributes to every rating, so that
    # row i of `data` fully describes rating i (left joins keep the order)
    data = data_ratings.merge(data_users, on='UserID', how='left')
    data = data.merge(data_movies, on='MovieID', how='left')
    ratings = data['Ratings']

    print('Data loaded')

    # Map the data: every (feature, value) pair gets its own column index
    offset_array = [0]
    dict_array = []

    feat = ['UserID', 'MovieID']
    if model_lvl > 1:
        # Genre_list is used as one categorical value per movie
        feat.extend(['Gender', 'Age', 'Occupation', 'Genre_list'])

    for feature_name in feat:
        uniq = np.unique(data[feature_name])
        offset_array.append(len(uniq) + offset_array[-1])
        # The offset of the feature is applied here, and only here
        dict_array.append({key: value + offset_array[-2]
            for value, key in enumerate(uniq)})

    print('Mapping done')

    # Create libFM file
    with open(w_filename, 'w') as w:
        for i in range(data.shape[0]):
            s = "{0}".format(ratings[i])
            for index_feat, feature_name in enumerate(feat):
                s += " {0}:1".format(
                        dict_array[index_feat][data[feature_name][i]])
            s += '\n'
            w.write(s)


if __name__ == '__main__':
    create_libfm('model1.libfm', 1)
    create_libfm('model2.libfm', 2)
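The core idea of the script is the offset mapping: each feature gets its own range of column indices, placed after the ranges of the features before it. Here is a toy illustration with made-up values; it uses `sorted(set(...))` as a plain-Python stand-in for `np.unique`, and the column names are only there for the example.

```python
# Three toy ratings: (UserID, MovieID) pairs, stored column-wise.
columns = {
    "UserID": [1, 1, 2],
    "MovieID": [1193, 661, 1193],
}

offset = 0
index_of = {}
for name, values in columns.items():
    uniq = sorted(set(values))
    # Value number k of this feature is stored in column offset + k.
    index_of[name] = {key: pos + offset for pos, key in enumerate(uniq)}
    offset += len(uniq)

lines = []
for row in range(3):
    feats = ["{0}:1".format(index_of[name][columns[name][row]])
             for name in columns]
    lines.append(" ".join(feats))

for line in lines:
    print(line)
```

The two users take columns 0-1 and the two movies take columns 2-3, so no index is shared between features.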


So you end up with two files, model1.libfm and model2.libfm. You just need to split each of them in two to create a training set and a test set, which I'll call train_m1.libfm and test_m1.libfm (and likewise train_m2.libfm and test_m2.libfm for model 2).
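One possible way to do the split is a shuffled hold-out. The sketch below assumes an 80/20 split and a fixed seed, neither of which is prescribed anywhere in this post; `split_libfm` and `split_file` are hypothetical helpers.

```python
import random


def split_libfm(lines, test_fraction=0.2, seed=0):
    """Shuffle the lines and return (train_lines, test_lines)."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def split_file(src, train_path, test_path, test_fraction=0.2, seed=0):
    """Split the libFM file `src` into a training file and a test file."""
    with open(src) as handle:
        lines = handle.readlines()
    train, test = split_libfm(lines, test_fraction, seed)
    with open(train_path, "w") as handle:
        handle.writelines(train)
    with open(test_path, "w") as handle:
        handle.writelines(test)


# On the real data: split_file('model1.libfm', 'train_m1.libfm', 'test_m1.libfm')
sample = ["5 0:1 3:1\n", "3 0:1 2:1\n", "4 1:1 3:1\n", "1 1:1 2:1\n", "2 0:1 4:1\n"]
train, test = split_libfm(sample)
```

Using the same seed for model1.libfm and model2.libfm keeps the same ratings in the test set of both models, which makes their scores comparable.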

Then you run libFM like this:

./libFM -train train_m1.libfm -test test_m1.libfm -task r -iter 20 -method mcmc -dim '1,1,8' -out output_m1
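To compare the models you can score the predictions yourself. A small sketch, assuming the prediction file (output_m1 above) holds one predicted rating per line, in the same order as the test file; `read_targets` and `rmse` are hypothetical helpers and the numbers are made up.

```python
import math


def read_targets(libfm_lines):
    """Extract the target (first token) of every non-empty libFM line."""
    return [float(line.split()[0]) for line in libfm_lines if line.strip()]


def rmse(targets, predictions):
    """Root mean squared error between two equally long sequences."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions differ in length")
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared) / len(squared))


# On the real files you would read test_m1.libfm and output_m1 instead.
test_lines = ["5 0:1 3:1", "3 0:1 2:1"]
predictions = [4.0, 4.0]
score = rmse(read_targets(test_lines), predictions)
print(score)
```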

But I guess you already know how to do that.

