In Latent Semantic Analysis, how do you recombine the decomposed matrices after truncating the singular values?

I'm reading Matrix decompositions and latent semantic indexing (Online edition © 2009 Cambridge UP)

I'm trying to understand how you reduce the number of dimensions in a matrix. There's an example on page 13 which I'm trying to replicate using Python's numpy.

Let's call the original occurrence matrix "a" and the three SVD (Singular Value Decomposition) decomposed matrices "U", "S" and "V".

The trouble I'm having is that after I zero out the smaller singular values in "S" and then multiply "U", "S" and "V" back together with numpy, the answer is not the one given in the pdf: the bottom 3 rows are not all zeros. The funny thing is that when I multiply just "S" and "V" I get the right answer.
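Roughly, this is what I'm doing (a minimal sketch using a small made-up occurrence matrix, not the actual example from page 13):

import numpy as np

# small made-up occurrence matrix standing in for the one on page 13
a = np.array([[1., 0., 1., 0., 0., 0.],
              [0., 1., 0., 0., 0., 0.],
              [1., 1., 0., 0., 0., 0.],
              [1., 0., 0., 1., 1., 0.],
              [0., 0., 0., 1., 0., 1.]])

U, S, V = np.linalg.svd(a, full_matrices=False)  # numpy returns V already transposed
S[2:] = 0                                        # zero out the smaller singular values

print(np.dot(U, np.dot(np.diag(S), V)))  # U*S*V: the bottom 3 rows are NOT all zeros
print(np.dot(np.diag(S), V))             # S*V: the bottom 3 rows ARE all zeros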

This is sort of surprising, but multiplying "S" and "V" is actually what Manning and Schütze's book Foundations of Statistical Natural Language Processing says you have to do. However, this is not what the pdf says you have to do on page 10.

So what's going on here?

Blunderbuss answered 2/1, 2014 at 20:33 Comment(16)
Reducing the number of dimensions is a common applied mathematics problem, so you might get a better answer on one of the maths or programming sites if you can get an answer in plain English. I personally could never grok matrices.Chairwoman
Are you using Numpy's SVD function? Perhaps it is actually computing one of these reduced SVD's? en.wikipedia.org/wiki/Singular_value_decomposition#Reduced_SVDsDoubleton
@RussellRichie the trouble is not with the SVD function but with the matrix multiplication.Blunderbuss
@Chairwoman after seeing Russell's link I guess that truncated SVD is a mathematical operation so it should be relevant in a maths stack exchange.Blunderbuss
This question appears to be off-topic because it is about a math problem encountered as an adjunct to NLP. It should be moved to one of the math or programming SE sites. I've voted to close but I'm not sure which site is best to migrate it to.Chairwoman
At first glance Stack Overflow seems to have the most "dimension reduction" questions.Chairwoman
@hippietrail: NLP questions are on-topic here. Please note the text-book that the OP is referring to when asking the question.Popelka
Yes NLP is on-topic but this is a math question.Chairwoman
@mtanti: While I think the question is relevant here, the content is vague. At the very least, you should write down the equation here, and explain what you're looking for.Popelka
@hippietrail: I can't tell at the moment if this is a math question. I have the book, and have read the relevant chapters a while ago, but now refuse to have to read it to get a sense of the question :-).Popelka
The question title is "How to perform Latent Semantic Analysis", which sounds nice and linguisticky, but the question itself is all about "I'm trying to understand how you reduce the number of dimensions in a matrix.", which is not linguisticky at all. You should at least try to bring these into alignment.Chairwoman
Does this question on compsci.SE help? Understanding how Numpy does SVD Also, here's a Stack-Exchange-wide search for related terms that might help: stackexchange.com/…Chairwoman
@Chairwoman it's not really the SVD part that is the problem. I'm multiplying the already decomposed matrices and the answer is not the same as the pdf's.Blunderbuss
@mtanti: It sounds like a debugging problem if not a matrix multiplication/decomposition problem. On SO I saw some other questions about different tools giving different results for SVD. Sadly, even though I'm a former pro programmer and current hobby linguist, I can't follow that PDF )-: I'm just trying to get you the best answer whether it's on-topic here or not (-:Chairwoman
Your edits have definitely made it much more something that's on-topic here. But I still think your chances of getting an answer will improve somewhere more people who know this type of math will see it, rather than here, where this math likely has only one application and most of us who do linguistic programming are not familiar with it. Also, if we migrate it to SO it will still be findable from here too.Chairwoman
I have no problem with that @ChairwomanBlunderbuss

Multiplying S and V is exactly what you have to do to perform dimensionality reduction with SVD/LSA.

>>> import numpy as np
>>> C = np.array([[1, 0, 1, 0, 0, 0],
...               [0, 1, 0, 0, 0, 0],
...               [1, 1, 0, 0, 0, 0],
...               [1, 0, 0, 1, 1, 0],
...               [0, 0, 0, 1, 0, 1]])
>>> from scipy.linalg import svd
>>> U, s, VT = svd(C, full_matrices=False)   # thin SVD: C == U @ diag(s) @ VT
>>> s[2:] = 0                                 # keep only the two largest singular values
>>> np.dot(np.diag(s), VT)
array([[ 1.61889806,  0.60487661,  0.44034748,  0.96569316,  0.70302032,
         0.26267284],
       [-0.45671719, -0.84256593, -0.29617436,  0.99731918,  0.35057241,
         0.64674677],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ]])

This gives a matrix where all but the first few rows are zeros, so the zero rows can be removed, and in practice this is the matrix you would use in applications:

>>> np.dot(np.diag(s[:2]), VT[:2])
array([[ 1.61889806,  0.60487661,  0.44034748,  0.96569316,  0.70302032,
         0.26267284],
       [-0.45671719, -0.84256593, -0.29617436,  0.99731918,  0.35057241,
         0.64674677]])

What the PDF describes on page 10 is the recipe to get a low-rank reconstruction of the input C. Rank != dimensionality, and the sheer size and density of the reconstruction matrix make it impractical to use in LSA; its purpose is mostly mathematical. One thing you can do with it is check how good the reconstruction is for various values of k:

>>> U, s, VT = svd(C, full_matrices=False)
>>> C2 = np.dot(U[:, :2], np.dot(np.diag(s[:2]), VT[:2]))
>>> from scipy.spatial.distance import euclidean
>>> euclidean(C2.ravel(), C.ravel())   # Frobenius norm of C2 - C
1.6677932876555255
>>> C3 = np.dot(U[:, :3], np.dot(np.diag(s[:3]), VT[:3]))
>>> euclidean(C3.ravel(), C.ravel())
1.0747879905228703

Sanity check against scikit-learn's TruncatedSVD (full disclosure: I wrote that):

>>> from sklearn.decomposition import TruncatedSVD
>>> TruncatedSVD(n_components=2).fit_transform(C.T)
array([[ 1.61889806, -0.45671719],
       [ 0.60487661, -0.84256593],
       [ 0.44034748, -0.29617436],
       [ 0.96569316,  0.99731918],
       [ 0.70302032,  0.35057241],
       [ 0.26267284,  0.64674677]])
Jamikajamil answered 4/1, 2014 at 12:44 Comment(3)
@mtanti: which one, UΣVᵀ?Jamikajamil
the part where they find what C2 is and the bottom rows are all zeros. That's not UΣVᵀ but ΣVᵀ right? It's a mistake in the pdf right?Blunderbuss
@mtanti: Example 18.4, right? Yes, that seems to be a mistake. What they compute is not C₂, the low-rank reconstruction, but the compressed matrix Σ₂Vᵀ. The true reconstruction does not have these zero rows.Jamikajamil
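For reference, a quick sketch (reusing C and the SVD from the answer above) of the difference: the true rank-2 reconstruction C₂ keeps the full shape of C and is generally dense, while Σ₂Vᵀ has only the two reduced dimensions as rows.

>>> U, s, VT = svd(C, full_matrices=False)
>>> C2 = np.dot(U[:, :2], np.dot(np.diag(s[:2]), VT[:2]))   # low-rank reconstruction of C
>>> SV2 = np.dot(np.diag(s[:2]), VT[:2])                    # compressed 2-dimensional representation
>>> C2.shape, SV2.shape
((5, 6), (2, 6))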
