Multiplying S and Vᵀ is exactly what you have to do to perform dimensionality reduction with SVD/LSA.
>>> import numpy as np
>>> C = np.array([[1, 0, 1, 0, 0, 0],
...               [0, 1, 0, 0, 0, 0],
...               [1, 1, 0, 0, 0, 0],
...               [1, 0, 0, 1, 1, 0],
...               [0, 0, 0, 1, 0, 1]])
>>> from scipy.linalg import svd
>>> U, s, VT = svd(C, full_matrices=False)
>>> s[2:] = 0
>>> np.dot(np.diag(s), VT)
array([[ 1.61889806,  0.60487661,  0.44034748,  0.96569316,  0.70302032,
         0.26267284],
       [-0.45671719, -0.84256593, -0.29617436,  0.99731918,  0.35057241,
         0.64674677],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ]])
This gives a matrix in which all but the first two rows are zeros (we kept only two singular values), so the zero rows can be dropped; in practice this is the matrix you would use in applications:
>>> np.dot(np.diag(s[:2]), VT[:2])
array([[ 1.61889806,  0.60487661,  0.44034748,  0.96569316,  0.70302032,
         0.26267284],
       [-0.45671719, -0.84256593, -0.29617436,  0.99731918,  0.35057241,
         0.64674677]])
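Each column of this reduced matrix is a two-dimensional representation of one document, so you can compare documents directly in the reduced space. A minimal sketch of one such use (not part of the original recipe; output omitted), using cosine similarity:
>>> docs = np.dot(np.diag(s[:2]), VT[:2]).T       # shape (6, 2): one row per document
>>> from sklearn.metrics.pairwise import cosine_similarity
>>> cosine_similarity(docs[:1], docs)             # similarity of document 0 to every document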
What the PDF describes on page 10 is the recipe to get a low-rank reconstruction of the input C. Rank != dimensionality, and the sheer size and density of the reconstruction matrix make it impractical to use in LSA; its purpose is mostly mathematical. One thing you can do with it is check how good the reconstruction is for various values of k:
>>> U, s, VT = svd(C, full_matrices=False)  # recompute; s was zeroed in place above
>>> C2 = np.dot(U[:, :2], np.dot(np.diag(s[:2]), VT[:2]))
>>> from scipy.spatial.distance import euclidean
>>> euclidean(C2.ravel(), C.ravel()) # Frobenius norm of C2 - C
1.6677932876555255
>>> C3 = np.dot(U[:, :3], np.dot(np.diag(s[:3]), VT[:3]))
>>> euclidean(C3.ravel(), C.ravel())
1.0747879905228703
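To see the full trade-off, here is a small sketch (assuming the U, s, VT just computed; output omitted) that prints the Frobenius error for every possible rank k; the values for k=2 and k=3 should match the numbers above:
>>> for k in range(1, len(s) + 1):
...     Ck = np.dot(U[:, :k], np.dot(np.diag(s[:k]), VT[:k]))
...     print(k, np.linalg.norm(Ck - C))   # Frobenius norm, same quantity as euclidean above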
Sanity check against scikit-learn's TruncatedSVD (full disclosure: I wrote that):
>>> from sklearn.decomposition import TruncatedSVD
>>> TruncatedSVD(n_components=2).fit_transform(C.T)
array([[ 1.61889806, -0.45671719],
       [ 0.60487661, -0.84256593],
       [ 0.44034748, -0.29617436],
       [ 0.96569316,  0.99731918],
       [ 0.70302032,  0.35057241],
       [ 0.26267284,  0.64674677]])
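Up to sign (an SVD only determines singular vectors up to sign), that's the same matrix as np.dot(np.diag(s[:2]), VT[:2]).T from before. A quick check, assuming the scipy results above are still in scope, comparing absolute values to sidestep the sign ambiguity and using the deterministic ARPACK solver so the result is reproducible:
>>> manual = np.dot(np.diag(s[:2]), VT[:2]).T
>>> reduced = TruncatedSVD(n_components=2, algorithm="arpack").fit_transform(C.T)
>>> np.allclose(np.abs(reduced), np.abs(manual))
True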