In Latent Semantic Analysis, how do you recombine the decomposed matrices after truncating the singular values?

I'm reading Matrix decompositions and latent semantic indexing (Online edition © 2009 Cambridge UP)

I'm trying to understand how you reduce the number of dimensions in a matrix. There's an example on page 13 which I'm trying to replicate using Python's numpy.

Let's call the original occurrence matrix "a" and the three SVD (Singular Value Decomposition) decomposed matrices "U", "S" and "V".

The trouble I'm having is that after I zero out the smaller singular values in "S" and then multiply "U", "S" and "V" back together with numpy, the answer is not the one given in the pdf: the bottom 3 rows are not all zeros. The funny thing is that when I multiply just "S" and "V" I get the right answer.
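Roughly, this is what I'm doing (a minimal sketch using a small made-up occurrence matrix, not the actual example from page 13):

import numpy as np

# small made-up occurrence matrix standing in for the one on page 13
a = np.array([[1., 0., 1., 0., 0., 0.],
              [0., 1., 0., 0., 0., 0.],
              [1., 1., 0., 0., 0., 0.],
              [1., 0., 0., 1., 1., 0.],
              [0., 0., 0., 1., 0., 1.]])

U, S, V = np.linalg.svd(a, full_matrices=False)  # numpy returns V already transposed
S[2:] = 0                                        # zero out the smaller singular values

print(np.dot(U, np.dot(np.diag(S), V)))  # U*S*V: the bottom 3 rows are NOT all zeros
print(np.dot(np.diag(S), V))             # S*V: the bottom 3 rows ARE all zeros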

This is sort of surprising, but multiplying "S" and "V" is actually what Manning and Schütze's book Foundations of Statistical Natural Language Processing says you have to do. However, this is not what the pdf says you have to do on page 10.

So what's going on here?

Blunderbuss answered 2/1, 2014 at 20:33 Comment(16)
Reducing the number of dimensions is a common applied mathematics problem, so you might get a better answer on one of the maths or programming sites if you can get an answer in plain English. I personally could never grok matrices.Chairwoman
Are you using Numpy's SVD function? Perhaps it is actually computing one of these reduced SVD's? en.wikipedia.org/wiki/Singular_value_decomposition#Reduced_SVDsDoubleton
@RussellRichie the trouble is not with the SVD function but with the matrix multiplication.Blunderbuss
@Chairwoman after seeing Russell's link I guess that truncated SVD is a mathematical operation so it should be relevant in a maths stack exchange.Blunderbuss
This question appears to be off-topic because it is about a math problem encountered as an adjunct to NLP. It should be moved to one of the math or programming SE sites. I've voted to close but I'm not sure which site is best to migrate it to.Chairwoman
At first glance Stack Overflow seems to have the most "dimension reduction" questions.Chairwoman
@hippietrail: NLP questions are on-topic here. Please note the text-book that the OP is referring to when asking the question.Popelka
Yes NLP is on-topic but this is a math question.Chairwoman
@mtanti: While I think the question is relevant here, the content is vague. At the very least, you should write down the equation here, and explain what you're looking for.Popelka
@hippietrail: I can't tell at the moment if this is a math question. I have the book, and have read the relevant chapters a while ago, but now refuse to have to read it to get a sense of the question :-).Popelka
The question title is "How to perform Latent Semantic Analysis", which sounds nice and linguisticky, but the question itself is all about "I'm trying to understand how you reduce the number of dimensions in a matrix.", which is not linguisticky at all. You should at least try to bring these into alignment.Chairwoman
Does this question on compsci.SE help? Understanding how Numpy does SVD Also, here's a Stack-Exchange-wide search for related terms that might help: stackexchange.com/…Chairwoman
@Chairwoman it's not really the SVD part that is the problem. I'm multiplying the already decomposed matrices and the answer is not the same as the pdf's.Blunderbuss
@mtanti: It sounds like a debugging problem if not a matrix multiplication/decomposition problem. On SO I saw some other questions about different tools giving different results for SVD. Sadly, even though I'm a former pro programmer and current hobby linguist, I can't follow that PDF )-: I'm just trying to get you the best answer whether it's on-topic here or not (-:Chairwoman
Your edits have definitely made it much more something that's on-topic here. But I still think your chances of getting an answer will improve somewhere more people who know this type of math will see it, rather than here, where this math likely has only one application and most of us who do linguistic programming are not familiar with it. Also, if we migrate it to SO it will still be findable from here too.Chairwoman
I have no problem with that @ChairwomanBlunderbuss

Multiplying S and V is exactly what you have to do to perform dimensionality reduction with SVD/LSA.

>>> import numpy as np
>>> C = np.array([[1, 0, 1, 0, 0, 0],
...               [0, 1, 0, 0, 0, 0],
...               [1, 1, 0, 0, 0, 0],
...               [1, 0, 0, 1, 1, 0],
...               [0, 0, 0, 1, 0, 1]])
>>> from scipy.linalg import svd
>>> U, s, VT = svd(C, full_matrices=False)   # thin SVD: C == U @ diag(s) @ VT
>>> s[2:] = 0                                 # keep only the two largest singular values
>>> np.dot(np.diag(s), VT)
array([[ 1.61889806,  0.60487661,  0.44034748,  0.96569316,  0.70302032,
         0.26267284],
       [-0.45671719, -0.84256593, -0.29617436,  0.99731918,  0.35057241,
         0.64674677],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ]])

This gives a matrix where all but the first few rows are zeros, so the zero rows can be removed, and in practice this is the matrix you would use in applications:

>>> np.dot(np.diag(s[:2]), VT[:2])
array([[ 1.61889806,  0.60487661,  0.44034748,  0.96569316,  0.70302032,
         0.26267284],
       [-0.45671719, -0.84256593, -0.29617436,  0.99731918,  0.35057241,
         0.64674677]])

What the PDF describes on page 10 is the recipe to get a low-rank reconstruction of the input C. Rank != dimensionality, and the sheer size and density of the reconstruction matrix make it impractical to use in LSA; its purpose is mostly mathematical. One thing you can do with it is check how good the reconstruction is for various values of k:

>>> U, s, VT = svd(C, full_matrices=False)
>>> C2 = np.dot(U[:, :2], np.dot(np.diag(s[:2]), VT[:2]))
>>> from scipy.spatial.distance import euclidean
>>> euclidean(C2.ravel(), C.ravel())   # Frobenius norm of C2 - C
1.6677932876555255
>>> C3 = np.dot(U[:, :3], np.dot(np.diag(s[:3]), VT[:3]))
>>> euclidean(C3.ravel(), C.ravel())
1.0747879905228703

Sanity check against scikit-learn's TruncatedSVD (full disclosure: I wrote that):

>>> from sklearn.decomposition import TruncatedSVD
>>> TruncatedSVD(n_components=2).fit_transform(C.T)
array([[ 1.61889806, -0.45671719],
       [ 0.60487661, -0.84256593],
       [ 0.44034748, -0.29617436],
       [ 0.96569316,  0.99731918],
       [ 0.70302032,  0.35057241],
       [ 0.26267284,  0.64674677]])
Jamikajamil answered 4/1, 2014 at 12:44 Comment(3)
@mtanti: which one, UΣVᵀ?Jamikajamil
the part where they find what C2 is and the bottom rows are all zeros. That's not UΣVᵀ but ΣVᵀ right? It's a mistake in the pdf right?Blunderbuss
@mtanti: Example 18.4, right? Yes, that seems to be a mistake. What they compute is not C₂, the low-rank reconstruction, but the compressed matrix Σ₂Vᵀ. The true reconstruction does not have these zero rows.Jamikajamil
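For reference, a quick sketch (reusing C and the SVD from the answer above) of the difference: the true rank-2 reconstruction C₂ keeps the full shape of C and is generally dense, while Σ₂Vᵀ has only the two reduced dimensions as rows.

>>> U, s, VT = svd(C, full_matrices=False)
>>> C2 = np.dot(U[:, :2], np.dot(np.diag(s[:2]), VT[:2]))   # low-rank reconstruction of C
>>> SV2 = np.dot(np.diag(s[:2]), VT[:2])                    # compressed 2-dimensional representation
>>> C2.shape, SV2.shape
((5, 6), (2, 6))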
