In NLP, the dimension of the feature space is usually very large. For example, in one project at hand, the dimension of the features is almost 20,000 (p = 20,000), and each feature is a 0-1 indicator of whether a specific word or bi-gram is present in a paper (one paper is a data point $x \in R^{p}$).
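For concreteness, this is roughly how I build the binary feature matrix (a minimal sketch assuming scikit-learn's CountVectorizer; the `papers` list is just a placeholder for my real corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder list of raw paper texts; in my case there are ~10,000 of them.
papers = [
    "support vector machines for text classification",
    "dimension reduction of sparse binary features",
]

# binary=True gives 0/1 presence indicators instead of counts;
# ngram_range=(1, 2) includes both single words and bi-grams.
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X = vectorizer.fit_transform(papers)  # scipy sparse CSR matrix of shape (n, p)
```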
I know there is a lot of redundancy among the features, so dimension reduction is necessary. I have three questions:
1) I have 10,000 data points (n = 10,000), and each data point has 10,000 features (p = 10,000). What is an efficient way to conduct dimension reduction? The matrix $X \in R^{n \times p}$ is so huge that both PCA (or SVD; truncated SVD is OK, but I don't think SVD is a good way to reduce the dimension of binary features) and bag-of-words clustering (e.g., K-means) are hard to run directly on $X$ (sure, it is sparse). I don't have a server, I just use my PC :-(.
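To show what I mean by truncated SVD on the sparse matrix, here is the kind of thing I can run on my PC (a sketch using scikit-learn's TruncatedSVD; the matrix below is a random placeholder for my real $X$, and 200 components is an arbitrary choice):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Random sparse 0-1 matrix standing in for my real data (n = 10,000, p = 10,000).
X = sparse_random(10000, 10000, density=0.001, format="csr", random_state=0)
X.data[:] = 1.0

# TruncatedSVD accepts scipy sparse input, so X is never densified.
svd = TruncatedSVD(n_components=200, random_state=0)
X_reduced = svd.fit_transform(X)  # dense array of shape (n, 200)
print(X_reduced.shape, svd.explained_variance_ratio_.sum())
```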
2) How should I judge the similarity or distance between two data points? I think the Euclidean distance may not work well for binary features. How about the L0 norm? What do you use?
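For reference, this is what I mean by the two candidate distances on binary vectors (a toy sketch using scipy.spatial.distance; the two vectors are just placeholders for rows of $X$):

```python
import numpy as np
from scipy.spatial.distance import euclidean, hamming

# Two toy 0-1 feature vectors standing in for rows of X.
a = np.array([1, 0, 1, 1, 0, 0])
b = np.array([1, 1, 0, 1, 0, 0])

print(euclidean(a, b))         # for 0-1 vectors: sqrt(number of differing positions)
print(hamming(a, b) * len(a))  # L0-style distance: count of positions where a and b differ
```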
3) If I want to use an SVM (or other kernel methods) for classification, which kernel should I use?
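For context, this is how I would set up the candidates I have in mind (a sketch with random placeholder data, using scikit-learn's LinearSVC and SVC; nothing here is tuned):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.svm import SVC, LinearSVC

# Small random placeholder for the real sparse binary features and paper labels.
X = sparse_random(200, 1000, density=0.01, format="csr", random_state=0)
X.data[:] = 1.0
y = np.random.RandomState(0).randint(0, 2, size=200)

# Candidate 1: linear kernel (LinearSVC handles sparse high-dimensional input directly).
clf_linear = LinearSVC(C=1.0).fit(X, y)

# Candidate 2: RBF kernel via the kernelized SVC.
clf_rbf = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, y)
```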
Many Thanks!