Cosine Similarity and Term Weight Tutorial

An Information Retrieval Tutorial on Cosine Similarity Measures, Dot Products and Term Weight Calculations.

Dr. E. Garcia

Mi Islita.com
Last Update: 10/26/06

Article 2 of the series Information Retrieval Tutorials

Topics

On Term Weight and Vector Tools

On Incorrect Term Weight and Vector Information

Demystifying Complexity

What's the Point of a Point?

What's the Product, DOT?

When C says: "How far are you from me, A and B?"

Meet my Uncle Vector

Meet my COSIM: Document Length Normalization

Term Vector Fast Track Tutorials

Tutorial Review

Feedback Questions

References

On Term Weight and Vector Tools

I started the SEWF thread Term Vector Theory & Keyword Weights several years ago to introduce search engine marketers to basic information retrieval concepts, so that they would not have to resort to keyword density speculations and similar myths promoted by SEOs with vested interests. The fact that many search marketers are now paying attention to IR satisfies me.

Unfortunately, some SEOs oversimplify term weight and vector theory (TVT), for instance by producing articles and tools that reduce term weights to mere w = tf*idf calculations (w = weight, tf = term frequency, idf = inverse document frequency). In my opinion anything you get from such TVT tools is just garbage in and garbage out.

Those of us involved in IR research know that plain tf*idf-like models are used in computer science grad schools to introduce students to vector analysis and advanced term weight methods. Most likely, no current commercial search engine implements plain tf*idf, and for good reasons. One is that a raw tf*idf model is easy to deceive via the tf term: a keyword spammer only needs to repeat a keyword many times to increase its weight. This is known as keyword spam. Another reason is that term vector models assume term independence, which is often not the case.

On Incorrect Term Weight and Vector Information

Many search engine marketers do not understand term vector theory or have been exposed to misinformation regarding term weight models. As expected, this wrong knowledge has been used to construct software "tools" or to provide incorrect advice. To illustrate, some posts at SEOCHAT claim in good faith that term vector theory can be used to determine the importance of terms (2). Fortunately, the promoter of that idea, who is a friend of mine, was mature enough to realize later on that this was an incorrect thesis. In the mentioned forum I explained with examples why neither term weights nor keyword density values are good estimates of the semantic importance of terms.

Before developing term weight tools, writing articles on term vector theory, or even discussing the subject in public forums, I feel the marketer should first understand the theory behind TVT and its current state of the art. In this way, others are not misled. Take, for instance, the idf term. Ask these marketers why idf is used at all. Why do some IR researchers define it using logs? Why do some use log10 while others use log2? Why do some use derivative versions? Keep asking these "toy makers" questions and soon you will realize this: they don't know what they are talking about.

And how about DOT Products? In the mentioned SEWF thread, some have tried to explain DOT Products and ended up confusing others. One SEWF member who hides behind the name "Xan Porter", and who has claimed in the past to be an IR practitioner/scientist/educator ("I still teach postgrads at universities" - Xan Porter), has incorrectly stated that

  • the dot product is defined as "dot product = term counts/documents x query term counts".
  • cosine similarity is given by "dot product / (document + query magnitudes) = cosine".
  • term vector theory makes no provision for document normalization: "there's no concept of normalizing the length of the documents".

What nonsense. All of these claims by this alleged "postgrad teacher" are flatly false, as we will see. My advice to IR students and search engine marketers: do not pay attention to these agents of misinformation.

This is my take on the whole issue of scoring term importance: we need to understand that information can behave as having a

  1. local nature; i.e. at the level of documents
  2. global nature; i.e. at the level of database collections
  3. scaling nature; i.e., through length scales

At least "as is" today, term vector theory, is just one way of trying to accomplish a and b, but not c. However, when you think thoroughly, there are no real reasons for representing documents, queries or snippets as vectors. We just use vector frameworks because these provide a crude way of representing a problem that depends on local and global information. In the process and unfortunately, we overlook the scaling (fractal) nature of the problem.

Demystifying Complexity

At the recent Search Engine Strategies 2005 Conference in San Jose, I presented on Patents on Duplicated Content and Re-Ranking Methods (3) and briefly discussed localized term vectors. When the session ended, one search marketer approached me and said that "this thing about vectors is too mystifying and complicated" to be implemented by humans or machines. My answer was

"Hey, DOT. Meet my COSIM and my Uncle Vector."

To me, complexity depends on the scale of observation utilized. Here I use the expression "scale of observation", not to mean scaling but to mean "standpoint", "point of view", "a way to get it", etc... Sorry for the redundancy. Did you get it?

After returning from San Jose, I was thinking about putting together a brief tutorial to explain some basic ideas embedded in term vector theory. The end result was this article. I hope you like it. The tutorial is organized as follows. First I discuss the notion of DOT Products. This is followed by some graphs describing the notion of vectors in layman's terms. Next, cosine similarities are computed. I use a graph approach since I want to "show" you rather than "tell" you.

To simplify the discussion, concepts are presented in non-technical terms. However, if you are interested in delving into advanced term weighting theory, I invite you to read my advanced series on Term Vector Theory and Keyword Weights (4, 5). The series is an ongoing project that has been referenced by MySQL AB in its MySQL Internals Manual, an authoritative database development resource, and in its many educational mirrors around the world (6).

What's the Point of a Point?

The point of a point can be found in its coordinates, so to speak. Let's define a point at the origin of the x-y plane as a reference point C with coordinates (x0, y0). We can refer to this point as C(x0, y0). Unless stated otherwise, x0 = 0 and y0 = 0. Similarly, we can refer to any two points, A and B, on this plane as A(x1, y1) and B(x2, y2). This scenario is illustrated in Figure 1.

Figure 1. Coordinates of points, A, B, and C in a two-dimensional plane.

What's the Product, DOT?

If we multiply the corresponding coordinates of A and B and add the products together we get the "mythical" DOT Product, also known as the inner product or scalar product. So the A•B DOT Product is given by

Equation 1: A•B = x1*x2 + y1*y2

It is that simple.

By the way, the little bullet in "A•B" is used to indicate -you guessed right- the dot product between A and B. DOT...!

If points A and B are defined in three dimensions then their coordinates are (x1, y1, z1) and (x2, y2, z2) and these points can be referred to as A(x1, y1, z1) and B(x2, y2, z2). The A•B DOT Product is now given by

Equation 2: A•B = x1*x2 + y1*y2 + z1*z2

For n dimensions, we just need to keep adding product terms to Equations 1 and 2. It cannot get any easier than this.
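
As a minimal illustration, the n-dimensional dot product described by Equations 1 and 2 can be computed with a few lines of Python; the function name and the sample coordinates below are only illustrative.

```python
def dot_product(a, b):
    """Dot product of two points/vectors given as coordinate sequences."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return sum(x * y for x, y in zip(a, b))

# Two-dimensional case (Equation 1)
A = (2.0, 3.0)
B = (4.0, 1.0)
print(dot_product(A, B))  # 2*4 + 3*1 = 11.0

# Three-dimensional case (Equation 2)
print(dot_product((1.0, 2.0, 3.0), (4.0, 5.0, 6.0)))  # 4 + 10 + 18 = 32.0
```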

When C says: "How far are you from me, A and B?"

To define a straight line we need at least two points. So, if we draw a straight line from C to either A or B, we can define the distance, d, between the points. This is the so-called Euclidean Distance, which can be computed in four easy steps. For any two points defining a straight line:

  1. take the difference between the coordinates of the points
  2. square all differences
  3. add all squared differences
  4. square root the final result

That's all.

Since we have defined x0 = 0 and y0 = 0, then to find out how far A is from C we write the Euclidean Distance as

Equation 3: dAC = ((x1 - x0)^2 + (y1 - y0)^2)^(1/2) = (x1^2 + y1^2)^(1/2)

Similarly, to find out how far B is from C we write

Equation 4: dBC = ((x2 - x0)^2 + (y2 - y0)^2)^(1/2) = (x2^2 + y2^2)^(1/2)

Figure 2 provides a good description of Euclidean Distances.

Figure 2. Straight lines representing Euclidean Distances between points A and B, with point C.
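
These four steps translate directly into code. Here is a minimal Python sketch, assuming points are given as coordinate tuples and the reference point C defaults to the origin; names and sample values are illustrative.

```python
import math

def euclidean_distance(p, q=None):
    """Euclidean distance between two points; q defaults to the origin C."""
    if q is None:
        q = (0.0,) * len(p)
    # Steps 1-4: take differences, square them, add them, take the square root
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

A = (2.0, 3.0)
B = (4.0, 1.0)
print(euclidean_distance(A))  # dAC, Equation 3: sqrt(2^2 + 3^2) ~ 3.61
print(euclidean_distance(B))  # dBC, Equation 4: sqrt(4^2 + 1^2) ~ 4.12
```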

Meet my Uncle Vector

The straight lines shown in Figure 2 can be replaced by vectors, represented as arrows. A vector is a quantity with direction and magnitude. The head and angle of the arrow indicate the direction of the vector, while its magnitude is given by the usual Euclidean Distance. Since in our example x0 = 0 and y0 = 0, we can simplify and express the magnitudes of the A and B vectors as dAC = |A| and dBC = |B|. The pipe symbol is used to indicate that we are dealing with absolute magnitudes. Thus, the length of the arrows represents the magnitudes of the vectors and the angle described by the vectors represents their orientation in the two-dimensional space. This is illustrated in Figure 3.

Figure 3. A and B Vectors.

Meet my COSIM: Document Length Normalization

To normalize the A•B DOT Product we divide it by the product of the magnitudes of A and B (their Euclidean lengths); i.e., A•B/(|A||B|). This ratio defines the cosine of the angle between the vectors and, for the non-negative term weights used in IR, takes values between 0 and 1.
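
As a quick sketch, the normalized DOT Product can be computed by combining the pieces shown so far; the function name is illustrative and the sample vectors are the same A and B used earlier.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: A.B / (|A| |B|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for an all-zero vector
    return dot / (norm_a * norm_b)

print(cosine_similarity((2.0, 3.0), (4.0, 1.0)))  # about 0.74
```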

In information retrieval applications this ratio is calculated to normalize the length of documents, since long documents tend to have large term frequencies. A more advanced document normalization method, known as Pivoted Unique Normalization, is described in MySQL Internals Manual :: 4.7 Full-text Search. This method is based on Singhal, Buckley and Mitra's earlier method known as Pivoted Document Length Normalization (6 - 8).
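
For the curious, here is a rough sketch of the general pivoted-normalization idea from the cited papers, not the exact MySQL implementation: the normalization factor is rotated around a pivot so that documents longer than the pivot are penalized somewhat more and shorter ones somewhat less. The function name, the use of unique-term counts, and the slope value are assumptions chosen for illustration.

```python
def pivoted_unique_norm(unique_terms, pivot, slope=0.2):
    """Pivoted (unique) normalization factor, after Singhal, Buckley and Mitra:
    (1 - slope) * pivot + slope * (number of unique terms in the document).
    Dividing raw term weights by this factor replaces plain cosine length
    normalization; slope and pivot are tuning constants, with the pivot
    often set to the average number of unique terms across the collection."""
    return (1.0 - slope) * pivot + slope * unique_terms
```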

Let's go back to the normalized DOT Product (cosine angle). This ratio is also used as a similarity measure between any two vectors representing documents, queries, snippets or combinations of these. The expressions cosine similarity, Sim(A, B), or COSIM are commonly used.

It is now time to meet my COSIM:

Figure 4. The cosine angle between A and B.

As the angle between the vectors narrows, its cosine approaches 1, meaning that the two vectors are getting closer and that the similarity of whatever they represent increases.

This is a convenient way of ranking documents; i.e., by measuring how close their vectors are to a query vector. For instance, let's say that point A(x1, y1) represents a query and points B(x2, y2), D(x3, y3), E(x4, y4), F(x5, y5), etc. represent documents. We should be able to compute the cosine angle between A (the query) and each document and sort the documents in decreasing order of cosine similarity. This treatment can be extended to an entire collection of documents.

To do this we need to construct a term space. The term space is defined by a list (index) of terms. These terms are extracted from the collection of documents to be queried. The coordinates of the points representing documents and queries are defined according to the weighting scheme used.

If weights are defined as mere term counts (w = tf) then point coordinates are given by term frequencies; however, we don't have to define term weights in this manner. As a matter of fact, and as previously mentioned, most commercial search engines do not define term weights in this way, not even in terms of keyword density values.
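
To make the ranking procedure concrete, here is a small Python sketch that builds a term space from a toy collection, uses raw term counts (w = tf) as coordinates purely for simplicity, and sorts the documents by cosine similarity to the query. The documents, the query and the function names are made up for illustration.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def to_vector(text, terms):
    """Map a text onto the term space using raw term frequencies (w = tf)."""
    counts = Counter(text.lower().split())
    return [counts[t] for t in terms]

docs = {
    "D1": "gold silver truck",
    "D2": "shipment of gold damaged in a fire",
    "D3": "delivery of silver arrived in a silver truck",
}
query = "gold silver truck"

# The term space (index) is the list of terms extracted from the collection.
terms = sorted({t for text in docs.values() for t in text.lower().split()})

q_vec = to_vector(query, terms)
ranking = sorted(
    ((name, cosine_similarity(q_vec, to_vector(text, terms)))
     for name, text in docs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")  # expected order: D1, D3, D2
```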

Moving on, it is now time to show you the true IR colors of my COSIM:

Figure 5. The cosine similarity (cosine angle) between query and documents.
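
In plain-text form, the formula shown in Figure 5 amounts to Sim(Q, D) = Σ(wQ,i * wD,i) / ((Σ wQ,i^2)^(1/2) * (Σ wD,i^2)^(1/2)),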

where the sigma symbol means "the sum of", Q is a query, D is a document relevant to Q and the w's are term weights (see reference 4). How these weights are defined determines the significance and usefulness of the cosine similarity measure. By defining tfmax as the maximum term frequency in a document, N as the number of documents in a collection and n as the number of documents containing a query term, we can redefine term weights as

  • w = tf/tfmax
  • w = IDF = log(N/n)
  • w = tf*IDF = tf*log(N/n)
  • w = tf*IDF = tf*log((N - n)/n)

or even in terms of other variants of tf and idf, each with its own customized definition and theoretical interpretation.
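
As a final sketch, the weighting schemes listed above can be computed in a few lines of Python; the values of N, n, tf and tfmax below, as well as the choice of log10, are purely illustrative.

```python
import math

def normalized_tf(tf, tf_max):
    """w = tf / tfmax"""
    return tf / tf_max if tf_max else 0.0

def idf(N, n):
    """w = IDF = log(N / n)"""
    return math.log10(N / n)

def probabilistic_idf(N, n):
    """Variant: w = log((N - n) / n)"""
    return math.log10((N - n) / n)

def tf_idf(tf, N, n):
    """w = tf * IDF"""
    return tf * idf(N, n)

# Illustrative values: a collection of 1,000 documents, a term found in 100 of them
print(idf(1000, 100))                # 1.0
print(tf_idf(3, 1000, 100))          # 3.0
print(probabilistic_idf(1000, 100))  # log10(9) ~ 0.954
print(normalized_tf(3, 7))           # ~ 0.429
```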

This tutorial illustrates that term vector theory is the result of applying an ancillary technique, Vector Analysis, to an information retrieval problem. Vector Analysis itself is independent of the nature of the underlying system.
