SVD Recommendation System in Ruby


One day, a bunch of friends, who happened to be big Family Guy fans, decided to put together a site to rank and share their thoughts on the show. Soon thereafter they had a Rails site up and running, and all was well, and other fans joined in hordes. A web 2.0 success! Then one day they realized that they could no longer track everyone's ratings, their user-base was too large, and so it occurred to one of the developers: "Wouldn't it be cool if we could use the collective knowledge of our whole community to recommend and rank episodes for each user individually?"

Sounds familiar, right? In fact, recommendation systems are a billion-dollar industry, and growing. In academic jargon this problem is known as Collaborative Filtering, and a lot of ink has been spilled on the matter. Netflix, for one, announced a $1 million competition last year for a system that beats their algorithm by 10%. It goes without saying that a lot of different systems have been proposed and explored in theory and practice. However, one of the most successful and widely used approaches to this day also happens to be one of the simplest: Singular Value Decomposition (SVD), also affectionately referred to in the literature as LSI (Latent Semantic Indexing), dimensionality reduction, or projection.

Linear Algebra Refresher

SVD methods are a direct consequence of a theorem in linear algebra:

Any MxN matrix A whose number of rows M is greater than or equal to its number of columns N can be written as the product of an MxM column-orthogonal matrix U, an MxN diagonal matrix W with positive or zero elements (the singular values), and the transpose of an NxN orthogonal matrix V.
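Written out, with the diagonal matrix relabelled S (as it is in the rest of this post), the decomposition reads:

  A (MxN) = U (MxM) * S (MxN) * V' (NxN)

where V' denotes the transpose of V.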

More intuitively, assume that we have a matrix where every column represents a user, and every row represents a product (or a Family Guy season, in our case). Thus, with N users and M products, we are looking at an MxN matrix. The theorem simply states that we can decompose such a matrix into three components: (MxM) call it U, (MxN) call it S, and (NxN) call it V. More importantly, we can use this decomposition to approximate the original MxN matrix. By taking the first k singular values of the matrix S, we can effectively obtain a compressed representation of the data. So why do we care? (The math-inclined can consult any linear algebra text on the SVD; we'll wait.)
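In symbols, the rank-k approximation keeps the first k columns of U and V and the top-left kxk block of S (Uk, Sk and Vk below are just labels for these truncated matrices):

  A ≈ Uk (Mxk) * Sk (kxk) * Vk' (kxN)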

Machine Learning & Information Retrieval

One of the most fundamental, and fun, properties of Machine Learning is its close connection to the concept of data compression - if we can identify significant concepts (clusters of users, for example) then we can represent a large dataset with fewer bits. However, this logic also works in reverse! If we can represent our data with fewer bits (compress our data), then we have identified 'significant' concepts! I bet you see where we're headed - SVDs allow us to compress a large matrix by approximating it in a lower-dimensional space.

SVDs have found wide application in the field of Information Retrieval (IR), where this process is often referred to as Latent Semantic Indexing (LSI). In these applications the columns of the matrix are the documents, and the rows are the individual words. Running SVD allows us to collapse this matrix into a lower-dimensional space where highly correlated items (for example, words that often occur together) are captured as a single feature. Essentially, we are discarding the noise and keeping the signal. In practice, the IR guys usually collapse their ginormous matrices to 100, 200, or 300 dimensions (from the original 10,000+) and then perform similarity calculations. In case you're curious, this same method has also found many uses in image compression and computer vision applications.

Dimensionality Reduction

Back to our Family Guy developers. For the sake of brevity we will use a very simple example with only 4 users and 6 seasons (the 6x4 season-by-user rating matrix A shown further below). Cranking this matrix through the SVD yields three different components: matrix U (6x6), matrix S (6x4), and matrix V (4x4). Now, we will collapse this data from a 6x4 space into a 2-dimensional one. To do this, we simply take the first two columns of U and V, and the top-left 2x2 block of S. The end result is the truncated matrices U2 (6x2), S2 (2x2), and V2 (4x2).

Now, because we are working with a 2-dimensional space, we can plot our results (below). We can treat the first column of U as x, and the second column as y - these are the seasons. The same process is repeated for matrix V - these are the users.

Do you see what happened? Because we are working with a small example it's hard to call two users a 'cluster', but you will nonetheless notice that Ben and Fred are located very close to each other - now compare their respective ratings in our original matrix. Very cool, huh! The same pattern recurs for Seasons 5 and 6. Our dimensionality reduction technique effectively captured the fact that Ben and Fred seem to have similar taste - we're halfway there!

Finding Similar Users

Next, Bob joins the site and shares with us a few of his season ratings ([5,5,0,0,0,5] for seasons 1-6) - it's our goal to give him a recommendation based on this data. Intuitively, we want to find users similar to Bob, thus if we can 'embed' Bob into our 2-Dimensional space and look where he is located, we will be able to answer this question. To do this, we perform the following calculation:
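This is the same computation the Ruby code below performs, where U2 is the 6x2 truncated U and S2 is the 2x2 matrix of the two largest singular values:

  Bob2D = Bob * U2 * S2^(-1)

with Bob given as the 1x6 row vector [5, 5, 0, 0, 0, 5].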

This general formula projects a new user into our 2-D space - I won't motivate the math behind it, but any linear algebra reference on the SVD will cover it. The important result is that we now have the x and y coordinates for Bob. Let's add them to our earlier graph:

The green triangle represents Bob. It's not immediately evident which user is closest, but if we extend the vector (from the origin - green line), we can see that Ben's and Fred's vectors are, in fact, very similar. A common way to judge similarity between any two vectors is to look at the angles separating them: cosine similarity. From our graph we can intuitively tell that the angle between Ben and Bob is smaller than the one between Ben and Fred. To determine this, let's iterate over all users and compute their cosine similarities. Furthermore, let's discard anyone whose similarity is less than 0.90 (outside of the shaded region). We get: Ben (0.987), Fred (0.955). Hence, we conclude that Ben and Bob have the most similar tastes, though Fred is pretty close also!
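For reference, the cosine similarity between two vectors a and b is simply the cosine of the angle between them:

  cos(theta) = (a . b) / (|a| * |b|)

and this is exactly what the Ruby code below computes between Bob's 2-D vector and each row of V2.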

What happens now is up to you. Here is one very simple strategy: find the most similar user and compare his/her items against that of the new user; take the items that the similar user has rated and the new user has not and return them in decreasing order of ratings. Thus, Ben rated every season except 4, and Bob rated seasons 1,2 and 6. We take the set difference ([1,2,3,5,6] - [1,2,6] = [3,5]) which are the seasons Ben rated but Bob hasn't seen and return them in the decreasing order of Ben's ratings: Season 5 (5 stars), Season 3 (3 stars).

Will you just give me the code already?

For the brave ones who made it this far, below is the equivalent of what we just did on paper... in Ruby. First, install the linalg library, and you're ready to roll:

require 'linalg'

users = { 1 => "Ben", 2 => "Tom", 3 => "John", 4 => "Fred" }

m = Linalg::DMatrix[
           #Ben, Tom, John, Fred
            [5,5,0,5], # season 1
            [5,0,3,4], # season 2
            [3,4,0,3], # season 3
            [0,0,5,3], # season 4
            [5,4,4,5], # season 5
            [5,4,5,5]  # season 6
            ]

# Compute the SVD Decomposition
u, s, vt = m.singular_value_decomposition
vt = vt.transpose

# Take the 2-rank approximation of the Matrix
#   - Take first and second columns of u  (6x2)
#   - Take first and second columns of vt (4x2)
#   - Take the first two eigen-values (2x2)
u2 = Linalg::DMatrix.join_columns [u.column(0), u.column(1)]
v2 = Linalg::DMatrix.join_columns [vt.column(0), vt.column(1)]
eig2 = Linalg::DMatrix.columns [s.column(0).to_a.flatten[0,2], s.column(1).to_a.flatten[0,2]]

# Here comes Bob, our new user
bob = Linalg::DMatrix[[5,5,0,0,0,5]]
bobEmbed = bob * u2 * eig2.inverse

# Compute the cosine similarity between Bob and every other User in our 2-D space
user_sim, count = {}, 1
v2.rows.each { |x|
  user_sim[count] = (bobEmbed.transpose.dot(x.transpose)) / (x.norm * bobEmbed.norm)
  count += 1
}

# Remove all users who fall below the 0.90 cosine similarity cutoff and sort by similarity
similar_users = user_sim.delete_if {|k,sim| sim < 0.9 }.sort {|a,b| b[1] <=> a[1] }
similar_users.each { |u| printf "%s (ID: %d, Similarity: %0.3f) \n", users[u[0]], u[0], u[1] }

# We'll use a simple strategy in this case:
#   1) Select the most similar user
#   2) Compare all items rated by this user against your own and select items that you have not yet rated
#   3) Return the ratings for items I have not yet seen, but the most similar user has rated
similarUsersItems = m.column(similar_users[0][0]-1).transpose.to_a.flatten
myItems = bob.transpose.to_a.flatten

not_seen_yet = {}
myItems.each_index { |i|
  not_seen_yet[i+1] = similarUsersItems[i] if myItems[i] == 0 and similarUsersItems[i] != 0
}

printf "\n %s recommends: \n", users[similar_users[0][0]]

not_seen_yet.sort {|a,b| b[1] <=> a[1] }.each { |item|
  printf "\tSeason %d .. I gave it a rating of %d \n", item[0], item[1]
}

print "We've seen all the same seasons, bugger!" if not_seen_yet.size == 0

svd-recommender-gsl.rb - Ruby/GSL version, courtesy of Joshua Bassett

Running our algorithm produces:

Ben (ID: 1, Similarity: 0.987)
Fred (ID: 4, Similarity: 0.955)

Ben recommends:
  Season 5 .. I gave it a rating of 5
  Season 3 .. I gave it a rating of 3

That's it! A 50-line SVD recommendation / collaborative filtering system for a Rails app, with the help of some simple linear algebra.

In other iterations: Decision Tree Learning, Bayes Classification, Support Vector Machines


Linear algebra background:

Any M*N matrix A (M rows by N columns, M >= N) can be written as the product of three matrices:

1. U: an M*M column-orthogonal matrix

2. S: an M*N diagonal matrix with non-negative elements

3. V': the transpose of an N*N orthogonal matrix V

That is, A = U*S*V' (note that V must be transposed).


More intuitively:

Suppose we have a matrix in which every column represents a user and every row represents an item.

In the rating matrix A below, Ben, Tom and the others are the users, and the Season n rows are the items.

The matrix entries are ratings (0 means not yet rated):

For example, Ben rated Season 1 a 5, Tom rated Season 1 a 5, and Tom has not rated Season 2.


Machine learning and information retrieval:

One of the most fundamental and most interesting properties of machine learning is its connection to the concept of data compression.

If we can extract meaningful concepts from the data, then we can describe the data with fewer bits.

From an information-theoretic point of view, correlation in the data means the data is compressible.

SVD is a way to lossily compress a large matrix by reducing its dimensionality.


Dimensionality reduction:

Below we walk through the SVD step by step with a concrete example.

First, the matrix A.


A =

     5     5     0     5
     5     0     3     4
     3     4     0     3
     0     0     5     3
     5     4     4     5
     5     4     5     5
(this is the rating matrix from the example above)

Calling MATLAB's svd function:

[U,S,V]=svd(A)

U =
   -0.4472   -0.5373   -0.0064   -0.5037   -0.3857   -0.3298
   -0.3586    0.2461    0.8622   -0.1458    0.0780    0.2002
   -0.2925   -0.4033   -0.2275   -0.1038    0.4360    0.7065
   -0.2078    0.6700   -0.3951   -0.5888    0.0260    0.0667
   -0.5099    0.0597   -0.1097    0.2869    0.5946   -0.5371
   -0.5316    0.1887   -0.1914    0.5341   -0.5485    0.2429

S =
   17.7139         0         0         0
         0    6.3917         0         0
         0         0    3.0980         0
         0         0         0    1.3290
         0         0         0         0
         0         0         0         0

V =
   -0.5710   -0.2228    0.6749    0.4109
   -0.4275   -0.5172   -0.6929    0.2637
   -0.3846    0.8246   -0.2532    0.3286
   -0.5859    0.0532    0.0140   -0.8085


After decomposing the matrix, we first need to understand the meaning of S.

Notice that S is special: it is a diagonal matrix.

Its elements are non-negative and decrease along the diagonal; explaining precisely what these values mean involves the eigenvectors and eigenvalues of linear algebra.

But roughly, you can think of it this way:

In a linear space, every vector represents a direction.

Each singular value is the weight of the matrix's variation along the direction of the corresponding singular vector.

So we can keep just the first k elements on the diagonal of S.

When k=2, S (6x4) is reduced to S (2x2),

and correspondingly U (6x6) and V (4x4) become U (6x2) and V (4x2).

(Note: the signs of some U, S, V entries in the original article's figures differ from my own MATLAB output, but the two are essentially equivalent.)



Now we multiply the reduced U, S, and V to get A2:

 A2 = U(1:6,1:2)*S(1:2,1:2)*(V(1:4,1:2))'   % MATLAB statement

A2 =

    5.2885    5.1627    0.2149    4.4591
    3.2768    1.9021    3.7400    3.8058
    3.5324    3.5479   -0.1332    2.8984
    1.1475   -0.6417    4.9472    2.3846
    5.0727    3.6640    3.7887    5.3130
    5.1086    3.4019    4.6166    5.5822


We can see right away that A2 is very close to A; this is what we meant earlier by dimensionality reduction being a form of lossy compression of the data.

Next, let's analyze the correlations among the data in this matrix.

We treat the first column of U as the x values and the second column as the y values, so each row of U becomes a 2-D point; in the same way, each row of V becomes a 2-D point.

Plotting these points:

From the plot we can see:

Season 5 and Season 6 are very close together, and so are Ben and Fred.

Looking carefully at A confirms this: the 5th and 6th row vectors of A are very similar, and the column vectors for Ben and Fred are also very similar.

So, intuitively, the U and V matrices can approximately stand in for A; in other words, A has been compressed into U and V, and the compression ratio depends on how many leading values k we keep from S.

At this point, we are halfway there.


Finding similar users:

Again, we use the example to illustrate.

Suppose a new user named Bob arrives, and we already know his rating vector over the seasons: [5 5 0 0 0 5]. (This is a column vector.)

Our task is to make a personalized recommendation for him.

The idea is to first use the new user's rating vector to find the users most similar to him.
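The projection uses the same formula as the Ruby code earlier in the post (U2 is the 6x2 truncated U and S2 is the 2x2 matrix of the two largest singular values):

  Bob2D = Bob' * U2 * S2^(-1)

where Bob', the transpose of Bob's column vector, is the 1x6 row vector [5 5 0 0 0 5].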


(Note: in the figure accompanying the original article, the second line of this calculation contains an error - Bob's transpose should be a row vector.)

We won't prove the formula here; the important thing is the result: a 2-D vector for Bob, i.e. Bob's coordinates.

Adding Bob's coordinates to the earlier plot:

We then look for the user most similar to Bob in the plot.

Note that "most similar" does not mean the user at the smallest distance; here similarity is measured with cosine similarity. (There are many other ways to compute similarity, each with its own strengths and weaknesses.)

That is, we want the user whose vector makes the smallest angle with Bob's.
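A minimal plain-Ruby sketch of that angle computation (cosine_similarity here is a hypothetical helper written for illustration, not part of the linalg code above):

# Cosine of the angle between two numeric vectors given as plain arrays;
# 1.0 means they point in exactly the same direction.
def cosine_similarity(a, b)
  dot    = a.zip(b).map { |x, y| x * y }.reduce(0, :+)
  norm_a = Math.sqrt(a.map { |x| x * x }.reduce(0, :+))
  norm_b = Math.sqrt(b.map { |x| x * x }.reduce(0, :+))
  dot / (norm_a * norm_b)
end

# Example with two arbitrary 2-D points (not the actual coordinates from the plot):
puts cosine_similarity([0.5, 0.2], [0.45, 0.3])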

The computation shows that the most similar user is Ben.


The recommendation strategy from here on is entirely a matter of personal choice.

Here is a very simple one:

Find the most similar user, namely Ben.

Ben's rating vector is [5 5 3 0 5 5].

Compare it with Bob's rating vector, [5 5 0 0 0 5].

Then take the items Ben has rated but Bob has not, sorted by Ben's rating: [Season 5: 5, Season 3: 3].

So the items recommended to Bob are, in order, Season 5 and Season 3.



Finally, a few places where the overall recommendation approach could be improved:

1.

SVD itself is computationally expensive; with a large dataset, the running time may become unbearable.

However, we can use gradient descent and related machine-learning methods to compute an approximation and cut the running time (see the sketch after this list).

2.

Choice of similarity measure: there are many, each with its own strengths and weaknesses, so the one best suited to the scenario at hand should be used.

3.

Recommendation strategy: for a start, several similar users could be used instead of one, with each user's similarity acting as a weight that jointly influences the scores of the recommended items.
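As a rough illustration of point 1, here is a minimal plain-Ruby sketch of approximating the factorization with stochastic gradient descent over the observed ratings (the "Funk SVD" style of update, not the exact SVD used above; the latent-feature count, learning rate, regularization and epoch count are arbitrary values chosen for the example):

# Approximate the ratings matrix as the product of two low-rank factors,
# training only on the observed (non-zero) entries.
ratings = [
  [5, 5, 0, 5],
  [5, 0, 3, 4],
  [3, 4, 0, 3],
  [0, 0, 5, 3],
  [5, 4, 4, 5],
  [5, 4, 5, 5]
]

k     = 2      # number of latent features
lrate = 0.01   # learning rate
reg   = 0.02   # regularization strength

rows = ratings.size
cols = ratings.first.size

# Small random starting factors: item_f is 6xk (seasons), user_f is 4xk (users)
item_f = Array.new(rows) { Array.new(k) { rand * 0.1 } }
user_f = Array.new(cols) { Array.new(k) { rand * 0.1 } }

predict = lambda do |i, j|
  (0...k).map { |f| item_f[i][f] * user_f[j][f] }.reduce(0, :+)
end

1000.times do
  rows.times do |i|
    cols.times do |j|
      next if ratings[i][j] == 0                 # skip unrated entries
      err = ratings[i][j] - predict.call(i, j)
      (0...k).each do |f|
        item_val = item_f[i][f]
        item_f[i][f] += lrate * (err * user_f[j][f] - reg * item_f[i][f])
        user_f[j][f] += lrate * (err * item_val    - reg * user_f[j][f])
      end
    end
  end
end

# The rows of user_f now play the same role as the rows of V2 above:
# a new user can be compared against them with cosine similarity as before.
puts predict.call(0, 0).round(2)   # should come out close to Ben's rating of 5 for Season 1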

