If you have an MxN matrix (in your case 1000000x3), numpy.linalg.svd does not require M == N. In fact, this is precisely where the SVD comes in: to compute things like rank and pseudoinverse. Methods such as linalg.inv require a square (and full-rank) matrix to have a defined result.
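As a minimal sketch of that point (the matrix A and the rank tolerance here are my own illustrative choices, not from the original question):

import numpy as np

A = np.random.randn(1000, 3)   # tall, non-square; linalg.inv(A) is undefined

# rank: count singular values above a tolerance
s = np.linalg.svd(A, compute_uv=False)
rank = np.sum(s > s.max() * max(A.shape) * np.finfo(A.dtype).eps)

# pseudoinverse from the reduced SVD: pinv(A) = V * diag(1/s) * U^H
U, s, Vh = np.linalg.svd(A, full_matrices=False)
A_pinv = np.dot(Vh.conj().T, np.dot(np.diag(1.0 / s), U.conj().T))

print(rank)                                     # 3 (full column rank here)
print(np.allclose(A_pinv, np.linalg.pinv(A)))   # True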
@Saullo Castro is right. Passing full_matrices=False can turn an intractable computation into a tractable one, because instead of the U matrix being 1M x 1M elements, it is 1M x 3, a huge savings. I am not sure which reduced SVD algorithm numpy uses (it might be the compact SVD, or the thin one); brief descriptions of three widely used variants are in the Reduced SVDs section of Wikipedia: http://en.wikipedia.org/wiki/Singular_value_decomposition. They all center on replacing the computation of the full U matrix with a reduced form, which is sufficient for some, and perhaps even many, problems. The savings are greatest when numberObservations >> numberFeatures.
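To see the size difference directly, here is a rough sketch with a smaller 5000x3 stand-in (the 1Mx3 case would not fit in memory with full_matrices=True):

import numpy as np

a = np.random.randn(5000, 3)   # stand-in for the 1000000x3 case

U_full, s_full, V_full = np.linalg.svd(a, full_matrices=True)   # U: 5000 x 5000
U_thin, s_thin, V_thin = np.linalg.svd(a, full_matrices=False)  # U: 5000 x 3

print(U_full.shape, U_thin.shape)       # (5000, 5000) (5000, 3)
print(U_full.nbytes // U_thin.nbytes)   # full U uses ~1666x the memory here
# at 1000000x3, the full U would hold 1e12 float64 elements, about 8 TB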
Regarding whether you get the same result: the short answer is yes, depending on exactly what you do with the SVD results. For example, you get the original matrix back (to within floating-point tolerance) from the reduced form just as from the full form, as can be seen in the code below. Note that in the full case the size of U is numberObservations x numberObservations, whereas with full_matrices=False the size of U is numberObservations x numberFeatures.
This code was adapted from the numpy.linalg.svd docs so that one can experiment with arbitrary numbers of rows/columns and with how many singular values to keep.
One can always reduce the size of the U matrix to M x min(M, N). Further reductions may be possible depending on the structure of your data and how much noise is present. Just because np.allclose returns False does not mean the computed singular values are bad for all contexts. You can experiment with this using the mostSignificantSingularValues variable, which keeps only the top singular values from the full SVD.
import numpy as np

numberObservations = 900
numberFeatures = 600
mostSignificantSingularValues = 600
a = np.random.randn(numberObservations, numberFeatures) + 1j*np.random.randn(numberObservations, numberFeatures)
#Reconstruction based on full SVD:
U, s, V = np.linalg.svd(a, full_matrices=True)
print(U.shape, V.shape, s.shape)  # (900, 900) (600, 600) (600,)
S = np.zeros((numberObservations, numberFeatures), dtype=complex)
S[:mostSignificantSingularValues, :mostSignificantSingularValues] = np.diag(s[:mostSignificantSingularValues])
print(np.allclose(a, np.dot(U, np.dot(S, V))))  # True
d1 = a - np.dot(U, np.dot(S, V))  # residual of the full reconstruction
#Reconstruction based on reduced SVD:
U, s, V = np.linalg.svd(a, full_matrices=False)
print(U.shape, V.shape, s.shape)  # (900, 600) (600, 600) (600,)
S = np.diag(s)
print(np.allclose(a, np.dot(U, np.dot(S, V))))  # True
d2 = a - np.dot(U, np.dot(S, V))  # residual of the reduced reconstruction
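As one such experiment (the choice of 300 is arbitrary, my own illustration): if you lower mostSignificantSingularValues below numberFeatures in the full-SVD branch, allclose returns False, yet the 2-norm of the residual equals the largest discarded singular value, so the quality of the truncated result is quantifiable:

# truncation experiment: keep only the top 300 singular values
mostSignificantSingularValues = 300
U, s, V = np.linalg.svd(a, full_matrices=True)
S = np.zeros((numberObservations, numberFeatures), dtype=complex)
S[:mostSignificantSingularValues, :mostSignificantSingularValues] = np.diag(s[:mostSignificantSingularValues])
approx = np.dot(U, np.dot(S, V))
print(np.allclose(a, approx))  # False: the reconstruction is no longer exact
# spectral norm of the residual equals the first discarded singular value
print(np.isclose(np.linalg.norm(a - approx, 2), s[mostSignificantSingularValues]))  # True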