Supervised Latent Dirichlet Allocation for Document Classification?

I have a set of documents that have already been classified into groups by humans.

Is there a modified version of LDA that I can use to train a model on them and then classify unknown documents later?

Gesticulation answered 25/11, 2012 at 20:12

For what it's worth, LDA as a classifier is going to be fairly weak, because it's a generative model while classification is a discriminative problem. There is a variant of LDA called supervised LDA which uses a more discriminative criterion to form the topics (you can get source code for this in various places), and there's also a paper with a max-margin formulation whose source-code status I don't know. I would avoid the Labelled LDA formulation unless you're sure that's what you want, because it makes a strong assumption about the correspondence between topics and categories in the classification problem.

However, it's worth pointing out that none of these methods use the topic model directly to do the classification. Instead, they take documents and, in place of word-based features, use the posterior over the topics (the topic-proportion vector produced by inference on the document) as the feature representation before feeding it to a classifier, usually a linear SVM. This gets you a topic-model-based dimensionality reduction, followed by a strong discriminative classifier, which is probably what you're after. This pipeline is available in most languages via popular toolkits.
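
For a concrete illustration, here is a minimal sketch of that pipeline, using gensim for the topic model and scikit-learn for the SVM. The toy corpus, labels, and parameter values below are placeholders of mine, not anything from the answer:

import numpy as np
from gensim import corpora, models
from sklearn.svm import LinearSVC

#placeholder tokenized documents and human-assigned labels
docs = [["cat", "dog", "pet", "fur"], ["stock", "market", "trade", "price"]]
labels = [0, 1]

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(doc) for doc in docs]
num_topics = 2
lda = models.LdaModel(bow, num_topics=num_topics, id2word=dictionary)

def topic_features(doc_bow):
    #posterior topic proportions for one document, as a dense vector
    vec = np.zeros(num_topics)
    for topic, prob in lda.get_document_topics(doc_bow, minimum_probability=0.0):
        vec[topic] = prob
    return vec

#topic posteriors replace word features; a linear SVM does the classification
X = np.array([topic_features(b) for b in bow])
clf = LinearSVC().fit(X, labels)

new_doc = dictionary.doc2bow(["dog", "fur"])
print(clf.predict([topic_features(new_doc)]))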

Dislocate answered 26/11, 2012 at 10:26
The other, and newer, approach that might be worth looking into is Partially Labelled LDA (link). It relaxes the requirement that every document in the training set must have a label. - Margeret
Hey, the first link does not work. Is this the paper I should be looking at: arxiv.org/pdf/1003.0783.pdf? - Moonlighting

You can implement supervised LDA with PyMC, using a Metropolis sampler to learn the latent variables in the following graphical model:

sLDA graphical model

The training corpus consists of 10 movie reviews (5 positive and 5 negative), along with the star rating associated with each document. The star rating is the response variable: a quantity of interest associated with each document. The documents and response variables are modeled jointly in order to find latent topics that best predict the response variables of future unlabeled documents. For more information, check out the original paper. Consider the following code:

import pymc as pm
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer

train_corpus = ["exploitative and largely devoid of the depth or sophistication ",
                "simplistic silly and tedious",
                "it's so laddish and juvenile only teenage boys could possibly find it funny",
                "it shows that some studios firmly believe that people have lost the ability to think",
                "our culture is headed down the toilet with the ferocity of a frozen burrito",
                "offers that rare combination of entertainment and education",
                "the film provides some great insight",
                "this is a film well worth seeing",
                "a masterpiece four years in the making",
                "offers a breath of the fresh air of true sophistication"]
test_corpus =  ["this is a really positive review, great film"]
train_response = np.array([3, 1, 3, 2, 1, 5, 4, 4, 5, 5]) - 3  #star ratings centered at 0 (range -2..2)

#LDA parameters
num_features = 1000  #vocabulary size
num_topics = 4       #fixed for LDA

tfidf = TfidfVectorizer(max_features = num_features, max_df=0.95, min_df=0, stop_words = 'english')

#generate tf-idf term-document matrix
A_tfidf_sp = tfidf.fit_transform(train_corpus)  #size D x V

print("number of docs: %d" % A_tfidf_sp.shape[0])
print("dictionary size: %d" % A_tfidf_sp.shape[1])

#tf-idf dictionary    
tfidf_dict = tfidf.get_feature_names()  #use get_feature_names_out() on newer scikit-learn

K = num_topics # number of topics
V = A_tfidf_sp.shape[1] # number of words
D = A_tfidf_sp.shape[0] # number of documents

data = A_tfidf_sp.toarray()

#Supervised LDA Graphical Model
Wd = [len(doc) for doc in data]  #each "document" here is a dense tf-idf row, so Wd[d] == V
alpha = np.ones(K)
beta = np.ones(V)

#per-document topic proportions (theta) and per-topic word distributions (phi)
theta = pm.Container([pm.CompletedDirichlet("theta_%s" % i, pm.Dirichlet("ptheta_%s" % i, theta=alpha)) for i in range(D)])
phi = pm.Container([pm.CompletedDirichlet("phi_%s" % k, pm.Dirichlet("pphi_%s" % k, theta=beta)) for k in range(K)])

#topic assignment for each word position in each document
z = pm.Container([pm.Categorical('z_%s' % d, p = theta[d], size=Wd[d], value=np.random.randint(K, size=Wd[d])) for d in range(D)])

#empirical topic frequencies per document (the z-bar covariate in the sLDA paper)
@pm.deterministic
def zbar(z=z):
    zbar_list = []
    for i in range(len(z)):
        hist, bin_edges = np.histogram(z[i], bins=K)
        zbar_list.append(hist / float(np.sum(hist)))                
    return pm.Container(zbar_list)

#regression coefficients and noise precision for the response variable
eta = pm.Container([pm.Normal("eta_%s" % k, mu=0, tau=1.0/10**2) for k in range(K)])
y_tau = pm.Gamma("tau", alpha=0.1, beta=0.1)

#response mean: linear model eta . zbar for each document
@pm.deterministic
def y_mu(eta=eta, zbar=zbar):
    y_mu_list = []
    for i in range(len(zbar)):
        y_mu_list.append(np.dot(eta, zbar[i]))
    return pm.Container(y_mu_list)

#response likelihood
y = pm.Container([pm.Normal("y_%s" % d, mu=y_mu[d], tau=y_tau, value=train_response[d], observed=True) for d in range(D)])

# cannot use p=phi[z[d][i]] here since phi is an ordinary list while z[d][i] is stochastic
w = pm.Container([pm.Categorical("w_%i_%i" % (d,i), p = pm.Lambda('phi_z_%i_%i' % (d,i), lambda z=z[d][i], phi=phi: phi[z]),
                  value=data[d][i], observed=True) for d in range(D) for i in range(Wd[d])])

model = pm.Model([theta, phi, z, eta, y_tau, y, w])
mcmc = pm.MCMC(model)
mcmc.sample(iter=1000, burn=100, thin=2)

#visualize topics    
phi0_samples = np.squeeze(mcmc.trace('phi_0')[:])
phi1_samples = np.squeeze(mcmc.trace('phi_1')[:])
phi2_samples = np.squeeze(mcmc.trace('phi_2')[:])
phi3_samples = np.squeeze(mcmc.trace('phi_3')[:])
plt.subplot(221)
plt.bar(np.arange(V), phi0_samples[-1,:])
plt.subplot(222)
plt.bar(np.arange(V), phi1_samples[-1,:])
plt.subplot(223)
plt.bar(np.arange(V), phi2_samples[-1,:])
plt.subplot(224)
plt.bar(np.arange(V), phi3_samples[-1,:])
plt.show()

Given the training data (observed words and response variables), we can learn the global topics (phi; beta in the paper's notation) and the regression coefficients (eta) for predicting the response variable (Y), in addition to the topic proportions for each document (theta). To make predictions of Y given the learned phi and eta, we can define a new model in which Y is not observed, plug in the previously learned phi and eta, and obtain the following result:

sLDA prediction

Here we predict a positive review (approximately 2, given the review rating range of -2 to 2) for the test corpus consisting of the single sentence "this is a really positive review, great film", as shown by the mode of the posterior histogram on the right. See the IPython notebook for a complete implementation.
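
For reference, the prediction step might be set up as in the following sketch, which reuses the PyMC pattern above. It assumes mcmc, tfidf, test_corpus, alpha, and K from the training code are still in scope, and it plugs in posterior means as one simple way to fix the learned parameters:

#posterior means of the learned topics, regression coefficients, and precision
phi_mean = np.array([mcmc.trace('phi_%i' % k)[:].mean(axis=0).ravel() for k in range(K)])
eta_mean = np.array([mcmc.trace('eta_%i' % k)[:].mean() for k in range(K)])
tau_mean = mcmc.trace('tau')[:].mean()

test_data = tfidf.transform(test_corpus).toarray()
Wd_test = test_data.shape[1]

#fresh topic proportions and assignments for the unseen document
theta_t = pm.CompletedDirichlet("theta_t", pm.Dirichlet("ptheta_t", theta=alpha))
z_t = pm.Categorical("z_t", p=theta_t, size=Wd_test, value=np.random.randint(K, size=Wd_test))

@pm.deterministic
def zbar_t(z=z_t):
    hist, bin_edges = np.histogram(z, bins=K)
    return hist / float(np.sum(hist))

@pm.deterministic
def y_mu_t(zbar=zbar_t):
    return np.dot(eta_mean, zbar)

#Y is NOT observed here; its posterior is the predicted rating
y_t = pm.Normal("y_t", mu=y_mu_t, tau=tau_mean)

w_t = pm.Container([pm.Categorical("w_t_%i" % i, p = pm.Lambda('phi_z_t_%i' % i, lambda z=z_t[i]: phi_mean[z]),
                    value=test_data[0][i], observed=True) for i in range(Wd_test)])

pred = pm.MCMC(pm.Model([theta_t, z_t, y_t, w_t]))
pred.sample(iter=1000, burn=100, thin=2)
print("predicted rating: %.2f" % pred.trace('y_t')[:].mean())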

Chemar answered 25/7, 2017 at 19:28
Hi @vadim-smolyakov, is that different from Multinomial Naive Bayes? - Banas
Yes, the purpose of sLDA is to simultaneously learn global topics and a per-document score (e.g. a movie rating), while Multinomial Naive Bayes focuses on classification alone. Both models need supervision (a score for sLDA, a class label for MNB). I did some analysis for Bernoulli NB, which may be helpful here: github.com/vsmolyakov/experiments_with_python/blob/master/chp01/… - Chemar
@VadimSmolyakov, how can we change the code if Y is not numerical but a text label? - Odel

Yes, you can try Labelled LDA in the Stanford Topic Modeling Toolbox at http://nlp.stanford.edu/software/tmt/tmt-0.4/

Birgitbirgitta answered 25/11, 2012 at 21:59
Thanks, I will take a look at that! Do you know if there is a C/C++/Python implementation of L-LDA? - Incident
Sorry, I didn't see your message initially. I'm not aware of a C or Python implementation, but I haven't looked before. I know Blei (the LDA author) usually publishes his code (C/C++) on his personal website, so I'd check that out. - Birgitbirgitta
The problem with this approach is that it requires labels to match 1-to-1 with topics, so it is very restrictive. - Margeret
