Regarding the choice of the number of dimensions:
1) http://en.wikipedia.org/wiki/Latent_semantic_indexing:
Another challenge to LSI has been the alleged difficulty in
determining the optimal number of dimensions to use for performing the
SVD. As a general rule, fewer dimensions allow for broader comparisons
of the concepts contained in a collection of text, while a higher
number of dimensions enables more specific (or more relevant)
comparisons of concepts. The actual number of dimensions that can be
used is limited by the number of documents in the collection. Research
has demonstrated that around 300 dimensions will usually provide the
best results with moderate-sized document collections (hundreds of
thousands of documents) and perhaps 400 dimensions for larger document
collections (millions of documents). However, recent studies indicate
that 50-1000 dimensions are suitable depending on the size and nature
of the document collection.
Checking the amount of variance in the data after computing the SVD
can help determine the optimal number of dimensions to retain.
The variance contained in the data can be viewed by plotting the
singular values (S) in a scree plot. Some LSI practitioners select the
dimensionality associated with the knee of the curve as the cut-off
point for the number of dimensions to retain. Others argue that some
quantity of the variance must be retained, and the amount of variance
in the data should dictate the proper dimensionality to retain.
Seventy percent is often mentioned as the amount of variance in the
data that should be used to select the optimal dimensionality for
recomputing the SVD.
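As a rough sketch of that 70%-of-variance heuristic (the toy
term-document matrix and variable names below are my own illustration,
not taken from the article), the squared singular values can be treated
as the variance captured by each dimension, and k chosen as the
smallest count that retains the desired share:

    import numpy as np

    # Toy term-document matrix (rows = terms, columns = documents);
    # the counts are purely illustrative.
    A = np.array([[2., 0., 1., 0.],
                  [1., 1., 0., 0.],
                  [0., 2., 1., 1.],
                  [0., 0., 1., 2.]])

    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # Squared singular values act as the "variance" captured per dimension.
    variance = s ** 2
    cumulative_share = np.cumsum(variance) / variance.sum()

    # Smallest k whose leading dimensions retain at least 70% of the variance.
    k = int(np.searchsorted(cumulative_share, 0.70)) + 1
    print(cumulative_share, k)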
2) http://www.puffinwarellc.com/index.php/news-and-articles/articles/33-latent-semantic-analysis-tutorial.html?showall=1:
The trick in using SVD is in figuring out how many dimensions or
"concepts" to use when approximating the matrix. Too few dimensions
and important patterns are left out, too many and noise caused by
random word choices will creep back in.
The SVD algorithm is a little involved, but fortunately Python has a
library function that makes it simple to use. By adding the one-line
method below to our LSA class, we can factor our matrix into three other
matrices. The U matrix gives us the coordinates of each word on our
“concept” space, the Vt matrix gives us the coordinates of each
document in our “concept” space, and the S matrix of singular values
gives us a clue as to how many dimensions or “concepts” we need to
include.
    def calc(self):
        self.U, self.S, self.Vt = svd(self.A)
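Filled out with the imports it needs, that one-liner might look roughly
like this; the minimal class skeleton and toy matrix are mine, not the
tutorial's full LSA class:

    import numpy as np
    from scipy.linalg import svd

    class LSA:
        def __init__(self, A):
            # A is the word-by-document count matrix.
            self.A = np.asarray(A, dtype=float)

        def calc(self):
            # U: word coordinates, S: singular values, Vt: document
            # coordinates in the "concept" space.
            self.U, self.S, self.Vt = svd(self.A)

    lsa = LSA([[1, 0, 1], [0, 1, 1], [1, 1, 0]])
    lsa.calc()
    print(lsa.S)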
In order to choose the right number of dimensions to use, we can make a
histogram of the squares of the singular values. This shows how much
each singular value contributes to approximating our matrix. (The
tutorial plots this histogram for its example.)
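A minimal sketch of that check, assuming matplotlib is available (the
matrix is again an illustrative stand-in):

    import numpy as np
    import matplotlib.pyplot as plt

    A = np.array([[1., 0., 1., 0.],     # toy word-by-document counts
                  [0., 1., 1., 1.],
                  [1., 1., 0., 0.],
                  [0., 0., 1., 1.]])

    U, S, Vt = np.linalg.svd(A, full_matrices=False)

    # Squared singular values indicate how much each "concept" contributes
    # to approximating A.
    importance = S ** 2
    plt.bar(range(1, len(importance) + 1), importance)
    plt.xlabel("singular value index")
    plt.ylabel("squared singular value")
    plt.show()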
For large collections of documents, the number of dimensions used is
in the 100 to 500 range. In our little example, since we want to graph
it, we’ll use 3 dimensions, throw out the first dimension, and graph
the second and third dimensions.
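In code, skipping the first dimension and plotting the next two might
look like this (the array names and scatter plot are my own sketch, not
the tutorial's exact plotting code):

    import numpy as np
    import matplotlib.pyplot as plt

    A = np.array([[1., 0., 1.],     # toy word-by-document counts
                  [0., 1., 1.],
                  [1., 1., 0.],
                  [0., 1., 0.]])

    U, S, Vt = np.linalg.svd(A, full_matrices=False)

    # Drop dimension 1 (index 0); keep dimensions 2 and 3 for plotting.
    word_xy = U[:, 1:3]       # one row per word
    doc_xy = Vt[1:3, :].T     # one row per document

    plt.scatter(word_xy[:, 0], word_xy[:, 1], marker="o", label="words")
    plt.scatter(doc_xy[:, 0], doc_xy[:, 1], marker="s", label="documents")
    plt.legend()
    plt.show()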
The reason we throw out the first dimension is interesting. For
documents, the first dimension correlates with the length of the
document. For words, it correlates with the number of times that word
has been used in all documents. If we had centered our matrix, by
subtracting the average column value from each column, then we would
use the first dimension. As an analogy, consider golf scores. We don't
want to know the actual score; we want to know the score relative to
par. That tells us whether the player made a birdie, bogey, etc.
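If you do want to keep the first dimension, the centering the tutorial
mentions is just a column-mean subtraction; a minimal sketch (again
with an illustrative matrix, not the tutorial's data):

    import numpy as np

    A = np.array([[2., 0., 1.],     # toy word-by-document counts
                  [1., 1., 0.],
                  [0., 2., 1.]])

    # Subtract each column's mean from that column (the "par" in the golf
    # analogy), so the first dimension no longer just tracks document length.
    A_centered = A - A.mean(axis=0, keepdims=True)

    U, S, Vt = np.linalg.svd(A_centered, full_matrices=False)
    print(S)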
3) Landauer, T.K., Foltz, P.W., & Laham, D. (1998), 'An Introduction to
Latent Semantic Analysis', Discourse Processes, 25, 259-284: