Cosine Similarity and Term Weight Tutorial

来源:互联网 发布:淘宝分装是什么意思 编辑:程序博客网 时间:2024/05/01 19:25

An Information Retrieval Tutorial on Cosine Similarity Measures, Dot Products and Term Weight Calculations.

Dr. E. Garcia
Mi Islita.com
Email | Last Update: 10/26/06

Article 2 of the series Information Retrieval Tutorials

Topics

On Term Weight and Vector Tools

On Incorrect Term Weight and Vector Information

Demystifying Complexity

What's the Point of a Point?

What's the Product, DOT?

When C says: "How far are you from me, A and B?"

Meet my Uncle Vector

Meet my COSIM: Document Length Normalization

Term Vector Fast Track Tutorials

Tutorial Review

Feedback Questions

References

On Term Weight and Vector Tools

I started the SEWF's thread: Term Vector Theory & Keyword Weights several years ago to introduce search engine marketers to basic information retrieval concepts. In this way they wouldn't have to resource to keyword density speculations and similar myths promoted by SEOs with vested interests. The fact that many search marketers are paying attention to IR satisfies me.

Unfortunately, some SEOs oversimplify term weight and vector theory (TVT), for instance by producing articles and tools that reduce term weights to mere w = tf*idf calculations (w = weight, tf = term frequency, idf = inverse document frequency). In my opinion anything you get from such TVT tools is just garbage in and garbage out.

Those of us involved in IR research know that plain tf*idf-like models are used in computer science grad schools to introduce students to vector analysis and advanced term weight methods. More likely, no current commercial search engine implements plain tf*idf and for good reasons. One is that a raw tf*idf model is easy to deceive via the tf term. A keyword spammer only needs to repeat a keyword many times to increase its weight. This is known as keyword spam. Another reason is that term vector models assume term independence. Often, this is not the case.

On Incorrect Term Weight and Vector Information

Many search engine marketers do not understand term vector theory or have been exposed to misinformation regarding term weight models. As expected, wrong knowledge has been used to construct software "tools" or to provide incorrect advice. To illustrate, some posts at SEOCHAT claim in good faith that term vector theory can be used to determine the importance of terms (2). Fortunately, the promoter of such idea, which is my friend, was mature enough to later on realize that this was an incorrect thesis. At the mentioned forum I explained with examples why neither term weights nor keyword density values are good estimates of the semantic importance of terms.

Before developing term weight tools, writing articles on term vector theory or even discussing the subject in public forums, I feel the marketer should understand first the theory behind TVT and its current state of the art. In this way, others are not misleaded. Take, for instance, the idf term. Ask to these marketers why IDF is used at all? Why some IR researchers define this using logs? Why some use log10 while others use log2? Why some use derivative versions? Keep asking questions to these "toy makers" and soon you will realize this: they don't know what they're talking about.

And how about DOT Products? At the mentioned SEWF thread some have tried to explain DOT Products and ended confusing others. Even one SEWF member that hides behind the name of "Xan Porter" and that has claimed in the past to be an IR practitioner/scientist/educator ("I still teach postgrads at universities" - Xan Porter) has incorrectly stated that

  • the dot product is defined as "dot product = term counts/documents x query term counts".
  • cosine similarity is given by "dot product / (document + query magnitudes) = cosine".
  • term vector theory makes no provision for document normalization: "there's no concept of normalizing the length of the documents".

What a bunch of non sense. All these claims by this alleged "post grad teacher" not only are incorrect, but are flat false --as we will see. My advice to IR students and search engine marketers: don't pay attention to these agents of misinformation.

This is my take on the whole issue of scoring term importance: we need to understand that information can behave as having a

  1. local nature; i.e. at the level of documents
  2. global nature; i.e. at the level of database collections
  3. scaling nature; i.e., through length scales

At least "as is" today, term vector theory, is just one way of trying to accomplish a and b, but not c. However, when you think thoroughly, there are no real reasons for representing documents, queries or snippets as vectors. We just use vector frameworks because these provide a crude way of representing a problem that depends on local and global information. In the process and unfortunately, we overlook the scaling (fractal) nature of the problem.

Demystifying Complexity

At the recent Search Engine Strategies 2005 Conference, San Jose, I presented on Patents on Duplicated Content and Re-Ranking Methods (3) and discussed a bit on localized term vectors. When the session ended one search marketer approached me and expressed that "this thing about vectors is too mystifying and complicated" to be implemented by humans or machines. My answer was

"Hey, DOT. Meet my COSIM and my Uncle Vector."

To me, complexity depends on the scale of observation utilized. Here I use the expression "scale of observation", not to mean scaling but to mean "standpoint", "point of view", "a way to get it", etc... Sorry for the redundancy. Did you get it?

After returning from San Jose, I was thinking about putting together a brief tutorial to explain some basic ideas embedded in term vector theory. The end result was this article. I hope you like it. The tutorial is organized as follows. First I discuss the notion of DOT Products. This is followed by some graphs describing the notion of vectors in layman's terms. Next, cosine similarities are computed. I use a graph approach since I want to "show" you rather than "telling" you.

To simplify the discussion, concepts are presented in non technical terms. However, if you are interested in delving into advanced term weighting theory, I invite you to read my advanced series on Term Vector Theory and Keyword Weights (4, 5). The series is an ongoing project which has been referenced by the MySQL AB Corporation in their authority database development MySQL Internals Manual and in their zillion of educational mirrors from around the world (6).

What's the Point of a Point?

The point of a point can be found in its coordinates, sort of speak. Let's define a point at the origin of the x-y plane as a reference point C with coordinates (x0, y0). We can refer to this point as C(x0, y0). Unless stated otherwise, x0 = 0 and y0 = 0. Similarly, we can refer to any two points, A and B, on this plane as A(x1, y1) and B(x2, y2). This scenario is illustrated in Figure 1.

 Points

Figure 1. Coordinates of points, A, B, and C in a two-dimensional plane.

What's the Product, DOT?

If we multiply the coordinates of A and B and add the products together we get the "mythical" DOT Product, also known as the inner product and scalar product. So the A•B DOT Product is given by

Equation 1: A•B = x1*x2 + y1*y2

It is that simple.

By the way, the little bullet in "A•B" is used to indicate -you guessed right- the dot product between A and B. DOT...!

If points A and B are defined in three dimensions then their coordinates are (x1, y1, z1) and (x2, y2, z2) and these points can be referred to as A(x1, y1, z1) and B(x2, y2, z2). The A•B DOT Product is now given by

Equation 2: A•B = x1*x2 + y1*y2 + z1*z2

For n number of dimensions, we just need to keep adding product terms to Equation 1 and 2. It cannot get any easier than this.

When C says: "How far are you from me, A and B?"

To define a straight line we need at least two points. So, if we draw a straight line from C to either A or B, we can define the distance, d, between the points. This is the so-called Euclidean Distance, which can be computed in four easy steps. For any two points defining a straight line:

  1. take the difference between the coordinates of the points
  2. square all differences
  3. add all squared differences
  4. square root the final result

That's all.

Since we have defined x0 = 0 and y0 = 0, then to find out how far A is from C we write the Euclidean Distance as

Equation 3: dAC = ((x1 - x0)2 + (y1 - y0)2)1/2 = (x12 + y12)1/2

Similarly, to find out how far B is from C we write

Equation 3: dBC = ((x2 - x0)2 + (y2 - y0)2)1/2 = (x22 + y22)1/2. Figure 2 provides a good description of Euclidean Distances.

 Points

Figure 2. Straight lines representing Euclidean Distances between points A and B, with point C.

Meet my Uncle Vector

The straight lines shown in Figure 2 can be replaced by vectors -represented as arrows. A vector is a quantity with direction and magnitude. The head and angle of the arrow indicates the direction of the vector, while its magnitude is defined by the usual Euclidean Distance. Since in our example x0 = 0 and y0 = 0, we can simplify and express the magnitudes of the A and B vectors as dAC = |A| and dBC = |B|. The pipe symbol is used to indicate that we are dealing with absolute magnitudes. Thus, the length of the arrows represents the magnitudes of the vectors and the angle described by the vectors represents their orientation in the two-dimensional space. This is illustrated in Figure 3.

 Points

Figure 3. A and B Vectors.

Meet my COSIM: Document Length Normalization

To normalize the A•B DOT Product we divide this by the Euclidean Distance between A and B; i.e., A•B/(|A||B|). This ratio defines the cosine angle between the vectors, with values between 0 and 1.

In information retrieval applications this ratio is calculated to normalize the length of documents since long documents tend to have large term frequencies. A more advanced document normalization method, known as Pivoted Unique Normalization, is described in MySQL Internals Manual :: 4.7 Full-text Search. This method is based on Singhal, Buckley and Mitra's old method known as Pivoted Document Length Normalization. (6 - 8).

Let's go back to the normalized DOT Product (cosine angle). This ratio is also used as a similarity measure between any two vectors representing documents, queries, snippets or combination of these. The expressions cosine similarity, Sim(A, B), or COSIM are commonly used.

It is now time to meet my COSIM:

 Points

Figure 4. The cosine angle between A and B.

As the angle between the vectors shortens, the cosine angle approaches 1, meaning that the two vectors are getting closer, meaning that the similarity of whatever is represented by the vectors increases.

This is a convenient way of ranking documents; i.e., by measuring how close their vectors are to a query vector. For instance, let say that point A(x1, y1) represents a query and points B(x2, y2), D(x3, y3), E(x4, y4), F(x5, y5), etc represent documents. We should be able to compute the cosine angle between A (the query) and each document and sort these in decreasing order of cosine angles (cosine similarites). This treatment can be extended to entire collection of documents.

To do this we need to construct a term space. The term space is defined by a list (index) of terms. These terms are extracted from the collection of documents to be queried. The coordinates of the points representing documents and queries are defined according to the weighting scheme used.

If weights are defined as mere term counts (w = tf) then point coordinates are given by term frequencies; however, we don't have to define term weights in this manner. As a matter of fact, and as previously mentioned, most commercial search engines do not define term weights in this way, not even in terms of keyword density values.

On and on, it is time now to show you the true IR colors of my COSIM:

 Points

Figure 5. The cosine similarity (cosine angle) between query and documents.

where the sigma symbol means "the sum of", Q is a query, D is a document relevant to Q and w are weights (see reference 4). How these weights are defined determines the significance and usefulness of the cosine similarity measure. By defining tmax as maximum term frequency in a document, N as number of documents in a collection and n as number of documents containing a query term, we can redefine term weights as

  • w = tf/tfmax
  • w = IDF = log(N/n)
  • w = tf*IDF = tf*log(N/n)
  • w = tf*IDF = tf*log((N - n)/n)

or even in terms of variants of tf and IDF, each one with their own customized definition and theoretical interpretation.

This tutorial illustrates that term vector theory is the result of applying an ancillary technique, Vector Analysis, to an information retrieval problem. Vector Analysis itself is independent from the nature of the underlying system.

Term Vector Fast Track Tutorials

Due to many requests, I have created the following fast tracks:

Term Vector Fast Track Tutorial - A Linear Algebra Approach
The working example covers a vector space consisting of many dimensions.

Term Vector Fast Track Tutorial
The working example is limited to a vector space consisting of 2 dimensions only.

These fast tracks walk you through five simple steps, showing you how to compute term frequencies, inverse document frequencies, term weights, dot products, vector magnitudes, and cosine similarity values. With this information, a particular term weighting definition, and given any two documents containing query terms and the query (of course), you should be able to determine which document is the best match for the query. Enjoy it.

Tutorial Review

Here are some nice exercises.

  1. Calculate the A•B DOT Product, Euclidean distances and cosine angle for the points at A(2, 1) and B(3, 4) from the origin of a two-dimensional plane.
  2. Assume that A represents a query and B represents a document and that there is a third document X represented by a point at X(2, 3). Which document, X or B is more relevant to A? Show calculations.
  3. How "far" is X from B? Show calculations.
  4. A database collection consists of 1 million documents, of which 200,000 contain the term holiday while 250,000 contain the term season. A document repeats holiday 7 times and season 5 times. It is known that holiday is repeated more than any other term in the document. Calculate the weight of both terms in this document using three different term weight methods. Try with

    w = tf/tfmax
    w = (tf/tfmax)*IDF = (tf/tfmax)*log(N/n)
    w = (tf/tfmax)*IDF = (tf/tfmax)*log((N - n)/n)

  5. Repeat exercise 4, this time without including the tfmax term in the calculations.

Feedback Questions

Here is a list of feedback questions. I'm trying to answer each one by quoting relevant passages from this tutorial. Additional material is also included.

  1. Q: Are the points A & B terms or phrases in the document and C is the query?

    A: "For instance, let say that point A(x1, y1) represents a query and points B(x2, y2), D(x3, y3), E(x4, y4), F(x5, y5), etc represent documents."

    What this means is that in the figures A is a point representing a query and B is a point representing a document. C is just the origin of the graph starting at 0, 0.

    In the figures, a graph represents the term space and the axes are given in term weight units, w. If weights are defined as w = tf/tfmax, then the units are normalized term frequencies (normalized term occurrences), but as mentioned before, no major search engine defines term weights using only number of term occurrences in a document. If a system defines term weights using w = (tf/tfmax)*log((N - n)/n) or a derivative of this, then the units of the axes must be redefined in that way, too.

    One more thing, the term spaces (the graphs) are created according to number of terms in the index term. We add one axis (dimension) per term. So if the index term consists of 3 terms, we represent docs and queries as vectors in 3 dimensions. If the index consists of n terms, we then need n dimensions to treat docs and queries as vectors. In this case a graphical representation is not possible.

  2. Q: What is the distance and angle representing?

    A: "Thus, the length of the arrows represents the magnitudes of the vectors and the angle described by the vectors represents their orientation in the two-dimensional space".... "As the angle between the vectors shortens, the cosine angle approaches 1, meaning that the two vectors are getting closer, meaning that the similarity of whatever is represented by the vectors increases. This is a convenient way of ranking documents by measuring how close their vectors are to a query vector."

    What this means is that more likely, terms in long docs tend to have more occurrences (high tf values) than in shorter docs. In addition, long documents tend to have more different terms, increasing the number of matches between a query and a long document (thus, the chances of retrieval over shorter documents).

    The dot product is normalized by dividing by the magnitudes to compute a cosine angle. This is done to normalize for the document lengths. Thus, we can visualize the magnitudes and dividing by these as accounting for document lengths and document normalization. This is also known as cosine normalization.

    However, in a strict sense, this approach (cosine normalization) is not entirely correct and has some drawbacks. Let me explain.

    In classic vector models cosine normalization is used to force the magnitude of the weighted document vector and to allow one to compare the angle between the weighted vectors. So far, this makes sense. Fine.

    However, this creates another problem: longer documents are given smaller term weights and smaller documents are favored over longer ones. Pivoted Unique Normalization tries to correct for discrepancies based on document length between the probability that a document is relevant and the probability that the document will be retrieved. This is why the pivot normalization method was introduced by Singhal, et al while at Cornell. I believe Singhal is now at Google.

    Regarding the angle described by the vectors, the cosine angle is a measure of similarity. It represents just that: similarity, how similar or alike a doc is to a query. It does not represent term importance, which is assessed through semantics, contextuality and topic analysis. The angle itself is an "estimate of closeness".

    If documents and queries are very alike their corresponding doc-query angle should be very small and approaching zero (cosine angle approaching 1). On the other hand, if the angle is high, let say, 90 degrees, the vectors would be perpendicular (orthogonal) and the cosine angle would be 0. In such case docs and queries are not related (In IR we actually say that "docs and queries are orthogonal").

    Thus,

    cosine (90) = 0 (completely unrelated)
    cosine (0) = 1 (completely related)

    If the cosine angle is between 0 and 1 this means that documents and queries are some way related.

  3. Q: Are terms/phrases in the document measured against each other or against the whole of the SE's indices?

    A: "Those of us involved in IR research know that plain tf*IDF-like models are used in computer science grad schools to introduce students to vector analysis and advanced term weighting methods."

    The answer here is too obvious to be rediscussed: against the queried index collection, via contributions from the IDF term, in addition to individual terms via the tf term. So, you need both contributions.

    At grad schools students are first exposed to the classic term count model that scores tf only. Then they learn about models that use tf combined with IDF to grasp the basic ideas of term vector models.

    Once they learn why we need to include global information, then more advanced methods are introduced (examples: pivoted normalization, Robertson's BM25 Model, Extended Boolean Model, the Generalized Vector Model, etc). In all cases one needs global weights via the IDF term, not just tf weights.

Next: EF-Ratios Tutorial

Prev: Document Indexing Tutorial

References
  1. SEWF's thread: Term Vector Theory & Keyword Weights; E. Garcia (2004).
  2. Method for Calculating the importance of a term in a document SEOCHAT.com (2005).
  3. Patents on Duplicated Content and Re-Ranking Methods; E. Garcia, SES, San Jose; Advanced Track Issues: The Patent Files, August 8 - 11 (2005).
  4. Term Vector Theory and Keyword Weights; E. Garcia (2005).
  5. The Classic Vector Space Model; E. Garcia (2004).
  6. MySQL Internals Manual :: 4.7 Full-text Search MySQL AB Corporation (2005).
  7. Implementation and Application of Term Weights in a MySQL Environment; E. Garcia (2005).
  8. Pivoted Document Length Normalization; Amit Singhal, Chris Buckley, and Mandar Mitra, Cornell University Ithaca NY.
  9. Term Vector Fast Track; E. Garcia (2005).
 
原创粉丝点击