Deal with relational data using libFM with blocks

Source: Internet. Posted by: animage软件. Editor: 程序博客网. Time: 2024/06/05 04:07

Original English post: https://thierrysilbermann.wordpress.com/2015/09/17/deal-with-relational-data-using-libfm-with-blocks/

train.libfm

5 0:1 2:1 6:1 9:12.5
5 0:1 3:1 6:1 9:20
4 0:1 4:1 6:1 9:78
1 1:1 2:1 8:1 9:12.5
1 1:1 3:1 8:1 9:20

and test.libfm

0 1:1 4:1 8:1 9:78
0 0:1 5:1 6:1

I will merge them, which makes the whole process easier.

dataset.libfm

5 0:1 2:1 6:1 9:12.5
5 0:1 3:1 6:1 9:20
4 0:1 4:1 6:1 9:78
1 1:1 2:1 8:1 9:12.5
1 1:1 3:1 8:1 9:20
0 1:1 4:1 8:1 9:78
0 0:1 5:1 6:1
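
Here is a minimal sketch of that merge step, assuming both files are plain text with one sample per line. The helper name merge_libfm is made up for this illustration. It also returns the number of training rows, because that count is needed later to separate train and test again.

```python
def merge_libfm(train_lines, test_lines):
    """Concatenate train and test rows; return (merged_rows, n_train)."""
    train = [line.strip() for line in train_lines if line.strip()]
    test = [line.strip() for line in test_lines if line.strip()]
    return train + test, len(train)


# The toy data from above (hypothetical in-memory copies of the two files).
train = [
    "5 0:1 2:1 6:1 9:12.5",
    "5 0:1 3:1 6:1 9:20",
    "4 0:1 4:1 6:1 9:78",
    "1 1:1 2:1 8:1 9:12.5",
    "1 1:1 3:1 8:1 9:20",
]
test = [
    "0 1:1 4:1 8:1 9:78",
    "0 0:1 5:1 6:1",
]

dataset, n_train = merge_libfm(train, test)
print(n_train, len(dataset))
```

With real files you would pass open("train.libfm") and open("test.libfm") instead of the lists.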

So if we want to use the block structure,

we will have 5 files:

rel_user.libfm (features 0, 1 and 6-8 are user features)

~~0 0:1 6:1~~
~~0 1:1 8:1~~

But in fact you can avoid having the feature ids broken up like this (0-1, 6-8). We can compress them, so that 0-1 -> 0-1 and 6-8 -> 2-4:

0 0:1 2:1
0 1:1 4:1

rel_product.libfm (features 2-5 and 9 are product features). Same thing here, we can compress:

~~0 2:1 9:12.5~~
~~0 3:1 9:20~~
~~0 4:1 9:78~~
~~0 5:1~~

to

0 0:1 4:12.5
0 1:1 4:20
0 2:1 4:78
0 3:1
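
The renumbering can be sketched in a few lines of Python. This is only an illustration under the assumption that you know which global feature ids belong to the block; compress_ids is a name I made up for it.

```python
def compress_ids(rows, block_features):
    """Renumber the given global feature ids to 0..k-1, in ascending order."""
    new_id = {old: new for new, old in enumerate(sorted(block_features))}
    out = []
    for row in rows:
        target, *pairs = row.split()
        kept = []
        for pair in pairs:
            idx, value = pair.split(":")
            if int(idx) in new_id:  # ignore features of other blocks
                kept.append("{}:{}".format(new_id[int(idx)], value))
        out.append(" ".join([target] + kept))
    return out


user_rows = ["0 0:1 6:1", "0 1:1 8:1"]
print(compress_ids(user_rows, [0, 1, 6, 7, 8]))
# ['0 0:1 2:1', '0 1:1 4:1']

product_rows = ["0 2:1 9:12.5", "0 3:1 9:20", "0 4:1 9:78", "0 5:1"]
print(compress_ids(product_rows, [2, 3, 4, 5, 9]))
# ['0 0:1 4:12.5', '0 1:1 4:20', '0 2:1 4:78', '0 3:1']
```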

rel_user.train (this is now the mapping: the first 3 lines point to the first line of rel_user.libfm | /!\ we are using 0-based indexing)

0
0
0
1
1

rel_product.train (the mapping)

0
1
2
0
1

y.train (contains the ratings only)

5
5
4
1
1
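
To make the link between the merged file and these files concrete, here is a rough sketch that derives one block and its mapping from the merged rows. It assumes the block is defined by a set of global feature ids and that block rows are numbered in order of first appearance; split_into_block is a hypothetical helper, not part of libFM. The feature ids are left uncompressed here.

```python
def split_into_block(rows, block_features):
    """Return (unique block rows, per-sample mapping) for one relation."""
    block_features = set(block_features)
    seen = {}        # feature string -> line number in the block file
    block_rows = []
    mapping = []
    for row in rows:
        pairs = row.split()[1:]  # drop the target
        key = " ".join(p for p in pairs
                       if int(p.split(":")[0]) in block_features)
        if key not in seen:
            seen[key] = len(block_rows)
            block_rows.append("0 " + key)  # dummy target 0, as in the block files
        mapping.append(seen[key])
    return block_rows, mapping


dataset = [
    "5 0:1 2:1 6:1 9:12.5",
    "5 0:1 3:1 6:1 9:20",
    "4 0:1 4:1 6:1 9:78",
    "1 1:1 2:1 8:1 9:12.5",
    "1 1:1 3:1 8:1 9:20",
    "0 1:1 4:1 8:1 9:78",
    "0 0:1 5:1 6:1",
]

user_rows, user_map = split_into_block(dataset, [0, 1, 6, 7, 8])
product_rows, product_map = split_into_block(dataset, [2, 3, 4, 5, 9])
y = [row.split()[0] for row in dataset]

print(user_rows)         # ['0 0:1 6:1', '0 1:1 8:1']
print(user_map[:5])      # [0, 0, 0, 1, 1]
print(product_map[:5])   # [0, 1, 2, 0, 1]
print(y[:5])             # ['5', '5', '4', '1', '1']
```

The first five entries of each list are the training part; the last two belong to the test rows.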

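Almost done…

Now you need to create the .x and .xt files for the user block and the product block. For that you need the tools that are available in libFM's bin/ directory once you have compiled them.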

./bin/convert --ifile rel_user.libfm --ofilex rel_user.x --ofiley rel_user.y

You are forced to use the --ofiley flag even though rel_user.y will never be used. You can delete that file every time.

Then

./bin/transpose --ifile rel_user.x --ofile rel_user.xt
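
Since the same two commands have to be run for every block, a small wrapper can help. This is just a sketch: it assumes the compiled tools live in ./bin and reuses the flags shown above, written with a plain double hyphen. block_commands and build_block are names I invented.

```python
import subprocess


def block_commands(block, bin_dir="./bin"):
    """Command lines that turn <block>.libfm into <block>.x and <block>.xt."""
    return [
        [bin_dir + "/convert", "--ifile", block + ".libfm",
         "--ofilex", block + ".x", "--ofiley", block + ".y"],
        [bin_dir + "/transpose", "--ifile", block + ".x",
         "--ofile", block + ".xt"],
    ]


def build_block(block, bin_dir="./bin"):
    """Run convert, then transpose; raises if one of the tools fails."""
    for cmd in block_commands(block, bin_dir):
        subprocess.run(cmd, check=True)


# Only print the commands here, so the sketch runs without libFM installed.
for block in ("rel_user", "rel_product"):
    for cmd in block_commands(block):
        print(" ".join(cmd))
```

Calling build_block("rel_user") and build_block("rel_product") would actually execute the tools.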

Now you can do the same thing for the test set. Because we merged the train and test datasets at the beginning, the block files are shared, so we only need to generate rel_user.test, rel_product.test and y.test.
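
As a sketch of this step: if the mappings and targets were computed on the merged dataset, the .train and .test files are simply the first n_train lines and the remaining lines. The helper write_split and the scratch directory are assumptions made for this example.

```python
import os
import tempfile


def write_split(prefix, values, n_train):
    """Write values[:n_train] to <prefix>.train and the rest to <prefix>.test."""
    parts = ((".train", values[:n_train]), (".test", values[n_train:]))
    for suffix, part in parts:
        with open(prefix + suffix, "w") as f:
            f.write("".join("{}\n".format(v) for v in part))


# Mappings and targets for the 7 merged toy rows (5 train + 2 test).
rel_user = [0, 0, 0, 1, 1, 1, 0]
rel_product = [0, 1, 2, 0, 1, 2, 3]
y = [5, 5, 4, 1, 1, 0, 0]

out_dir = tempfile.mkdtemp()  # scratch directory, only for this demo
for name, values in (("rel_user", rel_user),
                     ("rel_product", rel_product),
                     ("y", y)):
    write_split(os.path.join(out_dir, name), values, n_train=5)

with open(os.path.join(out_dir, "rel_product.test")) as f:
    print(f.read().split())  # ['2', '3']
```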

At this point, you will have a lot of files: (rel_user.train, rel_user.test, rel_user.x, rel_user.xt, rel_product.train, rel_product.test, rel_product.x, rel_product.xt, y.train, y.test)

And run:

./bin/libFM -task r -train y.train -test y.test --relation rel_user,rel_product -out output

It's a bit overkill for this problem, but I hope you get the point.
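
If it helps to see what the block files stand for, the sketch below rebuilds the flat rows from the compressed blocks and the mappings. It assumes the blocks are laid side by side in the order given to --relation, each one shifted by the number of features of the blocks before it. That is my understanding of the layout, not something checked against the libFM source, and expand is a made-up helper.

```python
def expand(y, blocks):
    """blocks: list of (block_rows, mapping, n_features), in --relation order."""
    rows = []
    for i, target in enumerate(y):
        parts, offset = [str(target)], 0
        for block_rows, mapping, n_features in blocks:
            # Copy the block row this sample points to, shifting its ids.
            for pair in block_rows[mapping[i]].split()[1:]:
                idx, value = pair.split(":")
                parts.append("{}:{}".format(int(idx) + offset, value))
            offset += n_features
        rows.append(" ".join(parts))
    return rows


rel_user = ["0 0:1 2:1", "0 1:1 4:1"]
rel_product = ["0 0:1 4:12.5", "0 1:1 4:20", "0 2:1 4:78", "0 3:1"]
flat = expand(
    [5, 5, 4, 1, 1],
    [(rel_user, [0, 0, 0, 1, 1], 5), (rel_product, [0, 1, 2, 0, 1], 5)],
)
print(flat[0])  # 5 0:1 2:1 5:1 9:12.5
```

The point of the block structure is that libFM never has to materialize these repeated rows.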

Now a real example

For this example, I will use ml-1m.zip, which you can get from the MovieLens dataset page (1 million ratings).

ratings.dat (sample) / Format: UserID::MovieID::Rating::Timestamp

1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291

movies.dat (sample) / Format: MovieID::Title::Genres

1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama

users.dat (sample) / Format: UserID::Gender::Age::Occupation::Zip-code

1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
4::M::45::7::02460
5::M::25::20::55455

I will create three different models.

1. Easiest libFM files to train, without blocks. I'll use these features: UserID, MovieID
2. Regular libFM files to train, without blocks. I'll use these features: UserID, MovieID, Gender, Age, Occupation, Genre (of movie)
3. libFM files to train with blocks. I'll use the same features: UserID, MovieID, Gender, Age, Occupation, Genre (of movie)

Model 1 and 2 can be created using the following code:

# -*- coding: utf-8 -*-
__author__ = 'Silbermann Thierry'
__license__ = 'WTFPL'

import pandas as pd


def create_libfm(w_filename, model_lvl=1):
    # Load the data ('::' is a multi-character separator, hence engine='python')
    data_ratings = pd.read_csv('ratings.dat', delimiter='::', engine='python',
                               names=['UserID', 'MovieID', 'Ratings', 'Timestamp'])
    # movies.dat is not UTF-8 encoded, so read it as latin-1
    data_movies = pd.read_csv('movies.dat', delimiter='::', engine='python',
                              names=['MovieID', 'Name', 'Genre_list'],
                              encoding='latin-1')
    data_users = pd.read_csv('users.dat', delimiter='::', engine='python',
                             names=['UserID', 'Gender', 'Age', 'Occupation', 'ZipCode'])

    # Transform data: join user and movie attributes onto each rating,
    # so that every row of `data` holds all the features of one rating
    data = data_ratings.drop(columns=['Timestamp'])
    data = data.merge(data_users.drop(columns=['ZipCode']), on='UserID', how='left')
    data = data.merge(data_movies.drop(columns=['Name']), on='MovieID', how='left')
    print('Data loaded')

    # Map the data: each feature gets its own contiguous range of ids
    feat = ['UserID', 'MovieID']
    if model_lvl > 1:
        feat.extend(['Gender', 'Age', 'Occupation'])
    offset = 0
    dict_array = []
    for feature_name in feat:
        uniq = sorted(data[feature_name].unique())
        dict_array.append({key: idx + offset for idx, key in enumerate(uniq)})
        offset += len(uniq)
    genre_ids = {}
    if model_lvl > 1:
        # A movie can have several genres, so every genre is its own feature
        genres = sorted({genre for genre_list in data_movies['Genre_list']
                         for genre in genre_list.split('|')})
        genre_ids = {genre: idx + offset for idx, genre in enumerate(genres)}
    print('Mapping done')

    # Create libFM file: "<rating> <id>:1 <id>:1 ..."
    with open(w_filename, 'w') as w:
        for row in data.itertuples(index=False):
            s = "{0}".format(row.Ratings)
            for feature_name, mapping in zip(feat, dict_array):
                s += " {0}:1".format(mapping[getattr(row, feature_name)])
            if model_lvl > 1:
                for genre in row.Genre_list.split('|'):
                    s += " {0}:1".format(genre_ids[genre])
            w.write(s + '\n')


if __name__ == '__main__':
    create_libfm('model1.libfm', 1)
    create_libfm('model2.libfm', 2)

So you end up with two files, model1.libfm and model2.libfm. You just need to split each of them in two to create a training set file and a test set file, which I'll call train_m1.libfm and test_m1.libfm (same thing for model 2: train_m2.libfm and test_m2.libfm).
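
One possible way to do that split is sketched below. The 20% test fraction and the fixed seed are my own assumptions, and split_rows is a hypothetical helper name.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and return (train_rows, test_rows)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded, so the split is reproducible
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


rows = ["{} 0:1".format(i) for i in range(10)]  # stand-in for model1.libfm lines
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))  # 8 2
```

For the real files you would read model1.libfm line by line, call split_rows, and write the two lists to train_m1.libfm and test_m1.libfm.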

Then you run libFM like this:

./libFM -train train_m1.libfm -test test_m1.libfm -task r -iter 20 -method mcmc -dim '1,1,8' -out output_m1

But I guess you already know how to do that.
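
To compare the models afterwards, you can score the prediction file yourself. The sketch below assumes the output file holds one prediction per line, in the same order as the test file, which is how I understand libFM's -out option. The sample rows and predictions are made up.

```python
import math


def read_targets(libfm_lines):
    """The target is the first token of every libFM row."""
    return [float(line.split()[0]) for line in libfm_lines if line.strip()]


def rmse(targets, predictions):
    """Root mean squared error between two equally long sequences."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions differ in length")
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared) / len(squared))


test_lines = ["5 0:1 2:1", "3 0:1 3:1"]  # made-up test rows
predictions = [4.0, 4.0]                 # made-up predictions
print(rmse(read_targets(test_lines), predictions))  # 1.0
```

With real files, read_targets(open("test_m1.libfm")) and [float(line) for line in open("output_m1")] would provide the two lists.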

0 0