Need help in latent semantic indexing

I am sorry if my question sounds stupid :) Can you please recommend any pseudocode or a good algorithm for an LSI implementation in Java? I am not a math expert. I tried to read some articles about LSI (latent semantic indexing) on Wikipedia and other websites, but they were full of math. I know LSI is full of math, but if I see some source code or an algorithm, I understand things more easily. That's why I asked here, because so many gurus are here! Thanks in advance.

Inessential asked 7/1, 2010 at 2:00 Comment(3)
Duplicate: #1747068 – Janeth
Thanks Amit, but if you read my question, it is different. Even if you think it's the same, you can't find a good answer there :) – Inessential
Do we always have to reduce the dimensions in LSA? Can't we just use the V matrix to find the similarity between the documents and the U matrix to find the similarity between the terms? – Bordelon

The idea of LSA is based on one assumption: the more often two words occur in the same documents, the more similar they are. Indeed, we can expect that the words "programming" and "algorithm" will occur in the same documents much more often than, say, "programming" and "dog-breeding".

The same holds for documents: the more common/similar words two documents share, the more similar the documents themselves are. So you can express the similarity of documents through the frequencies of words, and vice versa.

Knowing this, we can construct a co-occurrence (term-document) matrix, where columns represent documents, rows represent words, and each cell cells[i][j] holds the frequency of word words[i] in document documents[j]. Frequency may be computed in many ways; IIRC, the original LSA uses tf-idf weighting.
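For illustration, here is a minimal Java sketch of this step. It uses raw counts instead of tf-idf weighting, and the class and method names are just placeholders:

    import java.util.*;

    // Minimal sketch: build a term-document matrix from tokenized documents.
    // Row i corresponds to vocab.get(i), column j to documents.get(j); each cell
    // holds the raw count of that word in that document (a real implementation
    // would apply tf-idf weighting here instead).
    public class TermDocumentMatrix {
        public static double[][] build(List<List<String>> documents, List<String> vocab) {
            Map<String, Integer> rowOf = new HashMap<>();
            for (int i = 0; i < vocab.size(); i++) {
                rowOf.put(vocab.get(i), i);
            }

            double[][] cells = new double[vocab.size()][documents.size()];
            for (int j = 0; j < documents.size(); j++) {
                for (String word : documents.get(j)) {
                    Integer i = rowOf.get(word);
                    if (i != null) {
                        cells[i][j] += 1.0;  // frequency of words[i] in documents[j]
                    }
                }
            }
            return cells;
        }
    }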

Having such a matrix, you can find the similarity of two documents by comparing the corresponding columns. How to compare them? Again, there are several ways; the most popular is cosine distance. You may remember from school maths that a matrix may be treated as a bunch of vectors, so each column is just a vector in some multidimensional space. That's why this model is called the "Vector Space Model". More on VSM and cosine distance here.
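As a sketch, the cosine similarity between two document columns of that matrix can be computed in plain Java like this (a value close to 1 means the two documents have very similar word distributions):

    // Cosine similarity between columns docA and docB of the term-document matrix.
    static double cosineSimilarity(double[][] cells, int docA, int docB) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < cells.length; i++) {
            dot   += cells[i][docA] * cells[i][docB];
            normA += cells[i][docA] * cells[i][docA];
            normB += cells[i][docB] * cells[i][docB];
        }
        if (normA == 0 || normB == 0) {
            return 0;  // an empty document is not similar to anything
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }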

But we have one problem with such a matrix: it is big. Very, very big. Working with it is too computationally expensive, so we have to reduce it somehow. LSA uses the SVD technique to keep only the most "important" vectors. After reduction, the matrix is ready to use.
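Here is a minimal sketch of the reduction step, assuming the Jama library (just one possible choice; the Parallel Colt or UJMP implementations mentioned below would work as well) and an arbitrarily chosen k. Jama's SVD expects at least as many rows (terms) as columns (documents), which is the usual shape of a term-document matrix:

    import Jama.Matrix;
    import Jama.SingularValueDecomposition;

    public class LsaReduce {
        // cells: terms x documents matrix; k: number of dimensions to keep.
        // Returns a documents x k matrix whose rows are the documents in the
        // reduced "concept" space; compare them with cosine similarity as before.
        public static Matrix reduceDocuments(double[][] cells, int k) {
            Matrix a = new Matrix(cells);
            SingularValueDecomposition svd = a.svd();  // A = U * S * V^T

            int docs = cells[0].length;
            Matrix vk = svd.getV().getMatrix(0, docs - 1, 0, k - 1);  // first k columns of V
            Matrix sk = svd.getS().getMatrix(0, k - 1, 0, k - 1);     // top-k singular values

            return vk.times(sk);  // scale document coordinates by the singular values
        }
    }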

So, the algorithm for LSA will look something like this:

  1. Collect all documents and all unique words from them.
  2. Extract frequency information and build the co-occurrence matrix.
  3. Reduce the matrix with SVD.

If you're going to write an LSA library yourself, a good starting point is the Lucene search engine, which will make steps 1 and 2 much easier, plus some implementation of high-dimensional matrices with SVD capability, like Parallel Colt or UJMP.

Also pay attention to other techniques that grew out of LSA, like Random Indexing. RI uses the same idea and shows approximately the same results, but doesn't build the full matrix and is completely incremental, which makes it much more computationally efficient.

Stela answered 14/12, 2010 at 17:59 Comment(9)
Hi, I have been working on LSA lately. Do we always have to reduce the dimensions of the diagonal matrix? My need is to find the similarity between two documents. Is it OK if I take the V matrix alone and use it to find the similarity? I read this approach in this paper: miislita.com/information-retrieval-tutorial/… – Bordelon
@CTsiddharth: if you don't want to reduce your dimensionality, there's no need for SVD, and thus no V matrix at all - you can use the original term-document matrix as is. However, dimensionality reduction gives you 2 big advantages: 1) it reduces noise; 2) it makes document similarity computation much faster (fewer elements to compare). So it makes sense to use the full matrix for relatively small corpora (say, < 5000 documents); for larger corpora you need proper dimensionality reduction (SVD + selecting the first k vectors). – Stela
In my case, I have 35 documents and around 1500 words. If I do an SVD, the V matrix becomes 35*35. So can I use this 35*35 matrix to find the similarity instead of the 1500*35 one? This has been one question I have been thinking about for quite a while. – Bordelon
And also please help me answer this question: #9582791 – Bordelon
@CTsiddharth: when you reduce dimensions, you filter noise, but you also lose some information. So there's no single answer to your question - it all depends on your dataset. Make a test set and check results for both the full and the reduced dataset. Same thing for your other question - try different values for the reduced dimensions and take the best. – Stela
Good description of LSA! But I have a question: if we have a word and we want to find which part of which document this word is related to (assuming we have several - not too many - documents), is it possible to use this technique to do so? – Bee
@user1064929: "document" is a pretty abstract concept. For example, if you try to analyse a web site, you can consider the whole site as a single document, each web page as a document, or even each paragraph as a document. In the latter case, the co-occurrence matrix will show how often each word occurs in each paragraph. Does that answer your question? – Stela
So we don't have to have many words to compare with many documents? Is comparing one word with many documents, to see the relevancy of each document to the single word, still possible in this case? – Bee
@user1064929: using multiple words is recommended, since together they uncover context, but even a single word (a "document" consisting of a single word) should do the job in most cases. However, if your ultimate goal is to retrieve documents (or document parts) from the word, it makes sense to take a look at information retrieval methods. If you have a concrete task and need a recommendation, you can always post detailed questions here, on Cross Validated, or on Data Science (depending on which it is more about). – Stela

This may be a bit late, but I always liked Sujit Pal's blog http://sujitpal.blogspot.com/2008/09/ir-math-with-java-tf-idf-and-lsi.html and I have written a bit on my site if you are interested.

The process is way less complicated than it is often written up to be. Really, all you need is a library that can do singular value decomposition of a matrix.

If you are interested, here it is in a couple of short takeaway bits:

1) You create a matrix/dataset/etc. with the word counts of your various documents - the different documents will be your columns and the rows the distinct words.

2) Once you've created the matrix, you use a library like Jama (for Java) or SmartMathLibrary (for C#) and run the singular value decomposition. All this does is take your original matrix and break it up into three different parts/matrices that essentially represent your documents, your words, and a kind of multiplier (sigma); these are called the vectors.

3) Once you have your word, document, and sigma vectors, you shrink them equally (to some rank k) by just copying smaller parts of each vector/matrix and then multiplying them back together. Shrinking them kind of normalizes your data, and this is LSI (see the sketch below).
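Here is a rough sketch of step 3 with Jama; the class name and the choice of k are illustrative, not part of any library:

    import Jama.Matrix;
    import Jama.SingularValueDecomposition;

    public class RankKApproximation {
        // termDocCounts: terms x documents word-count matrix from step 1.
        // Returns the rank-k approximation of that matrix - the LSI view of the data.
        public static Matrix approximate(double[][] termDocCounts, int k) {
            Matrix a = new Matrix(termDocCounts);
            SingularValueDecomposition svd = a.svd();  // step 2: A = U * S * V^T

            int terms = a.getRowDimension();
            int docs  = a.getColumnDimension();

            // Step 3: copy only the first k columns/values of each factor ...
            Matrix uk = svd.getU().getMatrix(0, terms - 1, 0, k - 1);  // word vectors
            Matrix sk = svd.getS().getMatrix(0, k - 1, 0, k - 1);      // sigma
            Matrix vk = svd.getV().getMatrix(0, docs - 1, 0, k - 1);   // document vectors

            // ... and multiply them back together.
            return uk.times(sk).times(vk.transpose());  // terms x documents, rank k
        }
    }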

Here are some fairly clear resources:

http://puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html

http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf

http://www.soe.ucsc.edu/classes/cmps290c/Spring07/proj/Flynn_talk.pdf

Hope this helps you out a bit.

Eric

Biagio answered 11/2, 2010 at 16:13 Comment(1)
Hi, I have learned a lot in the meantime, but your answer is still very helpful, +1. I saw Sujit Pal's blog too. It is good, but I don't agree with his results. I asked him why it reports two documents as 100% similar when there is no contextual similarity between them; he couldn't answer it. Now I am looking at how I can use LDA instead of LSI. Is it possible to use LDA for this purpose? – Inessential