Latent Semantic Analysis concepts

I've read about using Singular Value Decomposition (SVD) to do Latent Semantic Analysis (LSA) on a corpus of texts. I understand how to do that, and I also understand the mathematical concepts behind SVD.

But I don't understand why it works when applied to corpora of texts (I believe there must be a linguistic explanation). Could anybody explain this from a linguistic point of view?

Thanks

Addend answered 14/8, 2011 at 21:49 Comment(3)
This might be a better fit at cstheory.stackexchange.com.Cassation
Have you read the introductory paragraph of en.wikipedia.org/wiki/Latent_semantic_analysis?Dean
Hi, I have also had the same doubt! Is it mandatory to reduce the dimensions? Why can't we just use the V matrix to find the similarity between documents and the U matrix to find the similarity between terms?Volcanology

There is no linguistic interpretation: there is no syntax involved, no handling of equivalence classes, synonyms, homonyms, stemming, etc. Nor are any semantics involved; it is just words occurring together. Consider a "document" as a shopping cart: it contains a combination of words (purchases), and words tend to occur together with "related" words.

For instance, the word "drug" can occur together with any of {love, doctor, medicine, sports, crime}, and each points you in a different direction. But combined with many other words in the document, your query will probably find documents from a similar field.
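
To make the words-occurring-together point concrete, here is a minimal numpy sketch on invented data: a five-term, four-document count matrix in which the ambiguous word "drug" appears everywhere, so the other word in the query decides which documents rank highest. The vocabulary and counts are made up purely for illustration, and the query is folded into the latent space with the standard LSI projection q̂ = qᵀ·U_k·Σ_k⁻¹.

    import numpy as np

    terms = ["drug", "doctor", "medicine", "crime", "police"]
    # Toy term-document counts: columns 0-1 are "medical" documents,
    # columns 2-3 are "crime" documents; "drug" occurs in all of them.
    A = np.array([[1, 1, 1, 1],   # drug
                  [2, 1, 0, 0],   # doctor
                  [1, 2, 0, 0],   # medicine
                  [0, 0, 2, 1],   # crime
                  [0, 0, 1, 2]],  # police
                 dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2                      # keep the two largest singular values
    docs = Vt[:k].T            # each row: one document in the latent space

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    for query in (["drug", "doctor"], ["drug", "crime"]):
        q = np.zeros(len(terms))
        for w in query:
            q[terms.index(w)] = 1.0
        q_hat = q @ U[:, :k] / s[:k]   # fold the query into the latent space
        sims = [round(cosine(q_hat, d), 2) for d in docs]
        print(query, "->", sims)

The companion word disambiguates "drug": the first query should rank the two medical documents highest, the second the two crime documents.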

Oospore answered 4/10, 2011 at 13:51 Comment(1)
Your answer is a lot better than mine. And the drug example was a home run!Espouse

Words occurring together (i.e. nearby or in the same document in a corpus) contribute to context. Latent Semantic Analysis essentially groups the documents in a corpus by how similar they are to each other in terms of that context.

I think the example and the word-document plot on this page will help in understanding.
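
As a sketch of that grouping, here is a short pipeline using scikit-learn (assumed available; the four toy sentences are invented for illustration): tf-idf weighting followed by a rank-2 truncated SVD, then cosine similarity between documents in the latent space.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["the pitcher threw the ball to the catcher",
            "the batter hit the ball over the fence",
            "the court reviewed the contract dispute",
            "the judge ruled on the contract case"]

    tfidf = TfidfVectorizer().fit_transform(docs)            # term weights
    lsa = TruncatedSVD(n_components=2).fit_transform(tfidf)  # latent space

    # Pairwise document similarities after the projection.
    print(cosine_similarity(lsa).round(2))

The two baseball sentences should come out close to each other, as should the two legal ones, which is the grouping-by-context behaviour described above.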

Espouse answered 4/10, 2011 at 10:34 Comment(0)

Suppose we have the following set of five documents:

  • d1 : Romeo and Juliet.
  • d2 : Juliet: O happy dagger!
  • d3 : Romeo died by dagger.
  • d4 : “Live free or die”, that’s the New-Hampshire’s motto.
  • d5 : Did you know, New-Hampshire is in New-England.

and a search query: dies, dagger.

Clearly, d3 should be ranked at the top of the list, since it contains both dies and dagger. Then d2 and d4 should follow, each containing one word of the query. But what about d1 and d5? Should they be returned as possibly interesting results for this query? As humans we know that d1 is quite related to the query, whereas d5 is not. Thus we would like d1 but not d5, or, put differently, we want d1 to be ranked higher than d5.

The question is: can the machine deduce this? The answer is yes; LSI does exactly that. In this example, LSI will be able to see that the term dagger is related to d1 because it occurs together with d1's terms Juliet and Romeo, in d2 and d3 respectively. Also, the term dies is related to d1 and to d5 because it occurs together with d1's term Romeo in d3, and with d5's term New-Hampshire in d4. LSI also weighs the discovered connections properly: d1 is more related to the query than d5, since d1 is "doubly" connected to dagger through Romeo and Juliet and is also connected to die through Romeo, whereas d5 has only a single connection to the query, through New-Hampshire.
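
Here is a numpy sketch of this exact example (numpy assumed; "dies" is stemmed to die, stop words are dropped, and the term-document matrix is transcribed from the five documents above, following Thomo's notes). The query is folded into the rank-2 latent space with the standard LSI projection q̂ = qᵀ·U_k·Σ_k⁻¹.

    import numpy as np

    terms = ["romeo", "juliet", "happy", "dagger",
             "live", "die", "free", "new-hampshire"]
    #              d1 d2 d3 d4 d5
    A = np.array([[1, 0, 1, 0, 0],   # romeo
                  [1, 1, 0, 0, 0],   # juliet
                  [0, 1, 0, 0, 0],   # happy
                  [0, 1, 1, 0, 0],   # dagger
                  [0, 0, 0, 1, 0],   # live
                  [0, 0, 1, 1, 0],   # die
                  [0, 0, 0, 1, 0],   # free
                  [0, 0, 0, 1, 1]],  # new-hampshire
                 dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2                          # rank-2 approximation
    docs = Vt[:k].T                # documents in the latent space

    q = np.zeros(len(terms))
    q[terms.index("die")] = q[terms.index("dagger")] = 1.0
    q_hat = q @ U[:, :k] / s[:k]   # fold the query into the latent space

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    for j, d in enumerate(docs, start=1):
        print(f"d{j}: {cosine(q_hat, d):+.2f}")

The printed similarities should reproduce the argument: d3 on top, and d1 ranked clearly above d5 even though d1 contains neither query word.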

Reference: Latent Semantic Analysis (Alex Thomo)

Synchroscope answered 2/12, 2014 at 6:31 Comment(0)
