Suppose we have the following set of five documents
- d1 : Romeo and Juliet.
- d2 : Juliet: O happy dagger!
- d3 : Romeo died by dagger.
- d4 : “Live free or die”, that’s the New-Hampshire’s motto.
- d5 : Did you know, New-Hampshire is in New-England.
and a search query: dies, dagger.
Clearly, d3 should be ranked at the top of the list, since it contains both query words, dies and dagger (its died matches dies after stemming). Then d2 and d4 should follow, each containing one word of the query. But what about d1 and d5? Should they be returned as possibly interesting results for this query? As humans we know that d1 is quite related to the query, whereas d5 is not. Thus, we would like d1 but not d5 to be returned, or, put differently, we want d1 to be ranked higher than d5.
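To see why plain keyword matching cannot make this distinction, consider the term-document matrix for these five documents. The sketch below assumes a particular preprocessing (stopwords dropped, died/dies stemmed to die, and an eight-term vocabulary); scoring by raw term overlap leaves d1 and d5 tied at zero:

```python
import numpy as np

# Term-document matrix for d1..d5 (columns) over the stemmed content
# terms (rows) -- an assumed vocabulary; stopwords dropped, died/dies -> die.
A = np.array([
    [1, 0, 1, 0, 0],  # romeo
    [1, 1, 0, 0, 0],  # juliet
    [0, 1, 0, 0, 0],  # happy
    [0, 1, 1, 0, 0],  # dagger
    [0, 0, 0, 1, 0],  # live
    [0, 0, 1, 1, 0],  # die
    [0, 0, 0, 1, 0],  # free
    [0, 0, 0, 1, 1],  # new-hampshire
])

# The query "dies, dagger" as a term vector over the same vocabulary.
q = np.array([0, 0, 0, 1, 0, 1, 0, 0])

# Raw keyword overlap per document: d3 scores 2, d2 and d4 score 1,
# and d1 and d5 are indistinguishable at 0.
print(q @ A)  # -> [0 1 2 1 0]
```

Keyword overlap alone ranks d1 and d5 identically; separating them requires the co-occurrence information that LSI extracts.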
The question is: can the machine deduce this? The answer is yes; LSI does exactly that. In this example, LSI will be able to see that the term dagger is related to d1 because it occurs together with d1's terms Juliet and Romeo, in d2 and d3, respectively. Likewise, the term die is related to both d1 and d5, because it occurs together with d1's term Romeo in d3 and with d5's term New-Hampshire in d4. LSI will also weigh the discovered connections properly: d1 is more related to the query than d5, since d1 is "doubly" connected to dagger (through Juliet and through Romeo) and also connected to die through Romeo, whereas d5 has only a single connection to the query, through New-Hampshire.
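Concretely, LSI makes this deduction by taking the singular value decomposition of the term-document matrix and keeping only the top k singular values (here k = 2). The sketch below assumes the same preprocessing as in the text (stopwords dropped, died/dies stemmed to die, an eight-term vocabulary) and scores documents by cosine similarity between the query and the columns of the rank-2 approximation; this is one common formulation rather than the only one. On this tiny collection, d3 still comes out on top, and d1 now scores well above d5.

```python
import numpy as np

# Term-document matrix over the stemmed content terms (assumed vocabulary:
# romeo, juliet, happy, dagger, live, die, free, new-hampshire).
A = np.array([
    [1., 0., 1., 0., 0.],  # romeo
    [1., 1., 0., 0., 0.],  # juliet
    [0., 1., 0., 0., 0.],  # happy
    [0., 1., 1., 0., 0.],  # dagger
    [0., 0., 0., 1., 0.],  # live
    [0., 0., 1., 1., 0.],  # die
    [0., 0., 0., 1., 0.],  # free
    [0., 0., 0., 1., 1.],  # new-hampshire
])
q = np.array([0., 0., 0., 1., 0., 1., 0., 0.])  # query: dies, dagger

# Rank-2 approximation A2 = U_2 S_2 V_2^T keeps the two strongest
# "concepts" and smears terms across co-occurring documents.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]

# Cosine similarity of the query with each smoothed document column.
sims = (q @ A2) / (np.linalg.norm(q) * np.linalg.norm(A2, axis=0))
for i, sim in enumerate(sims, start=1):
    print(f"d{i}: {sim:+.3f}")
```

In the smoothed matrix A2, d1's column has picked up weight on dagger and die (via its co-occurring terms Romeo and Juliet), while d5's column gains only a weak die component, so the ranking separates them exactly as the argument above predicts.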
Reference: Latent Semantic Analysis (Alex Thomo)