LSA - Latent Semantic Analysis - How to code it in PHP?

I would like to implement Latent Semantic Analysis (LSA) in PHP in order to find out topics/tags for texts.

Here is what I think I have to do (a rough PHP sketch of the first steps follows the list below). Is this correct? How can I code it in PHP? How do I determine which words to choose?

I don't want to use any external libraries. I already have an implementation of the Singular Value Decomposition (SVD).

  1. Extract all words from the given text.
  2. Weight the words/phrases, e.g. with tf–idf. If weighting is too complex, just take the number of occurrences.
  3. Build up a matrix: The columns are some documents from the database (the more the better?), the rows are all unique words, the values are the numbers of occurrences or the weight.
  4. Do the Singular Value Decomposition (SVD).
  5. Use the values in the matrix S (SVD) to do the dimension reduction (how?).
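
To make steps 1-3 concrete, here is a rough PHP sketch of what I have in mind (the function names are just placeholders, and I'm using raw term counts instead of tf-idf for now):

<?php
// Step 1: split a text into lowercase words.
function tokenize($text) {
    preg_match_all('/\pL+/u', mb_strtolower($text), $matches);
    return $matches[0];
}

// Steps 2-3: build the word-by-document matrix.
// Rows are unique words, columns are documents, values are raw term counts
// (tf-idf weights could replace the counts later).
function buildTermDocumentMatrix(array $documents) {
    $counts = array();                       // word => docIndex => count
    foreach ($documents as $d => $text) {
        foreach (tokenize($text) as $word) {
            if (!isset($counts[$word][$d])) {
                $counts[$word][$d] = 0;
            }
            $counts[$word][$d]++;
        }
    }
    $matrix = array();                       // word => row of counts, one per document
    foreach ($counts as $word => $perDoc) {
        $row = array();
        foreach (array_keys($documents) as $d) {
            $row[] = isset($perDoc[$d]) ? $perDoc[$d] : 0;
        }
        $matrix[$word] = $row;
    }
    return $matrix;
}

Step 4 would then run the SVD on array_values($matrix), with a separate array of the words so that row numbers can be mapped back to them.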

I hope you can help me. Thank you very much in advance!

Breakfast answered 18/6, 2009 at 20:10 Comment(6)
"I've already an implementation for the Singular Value Decomposition" #960560Leftwards
Sorry, I've added the link now.Breakfast
What does this have to do with PHP?Casease
@Novelocrat: I would like to know how the LSA exactly works and how to code this in PHP. That's what it has to do with PHP. ;)Breakfast
how big is your corpus? talk about the size of documents, the number of documents, and the distribution of them. Are they on similar topics? Is it just random blog posts / webpages? Freetext? More context will determine whether or not LSA is the right approach for you.Ave
The documents in my corpus are on different topics. I would say random news articles. The size of each document is 100-200 characters (very short news). The number of documents should be about 600 new ones per day. I don't know how many old documents I have to consider. But is the extraction of tags/entities/names possible with LSA at all? If LSA isn't adequate for named entity recognition, I can forget this question.Breakfast

LSA links:

Here is the complete algorithm. If you have SVD, you are most of the way there. The papers above explain it better than I do.

Assumptions:

  • your SVD function will give the singular values and singular vectors in descending order. If not, you have to do more acrobatics.

M: corpus matrix, w (words) by d (documents) (w rows, d columns). These can be raw counts, tf-idf weights, or whatever. Stopwords may or may not be eliminated, and stemming may happen (Landauer says keep stopwords and don't stem, but yes to tf-idf).

U,Sigma,V = singular_value_decomposition(M)

U:  w x w
Sigma:  min(w,d) length vector, or a w x d matrix with the diagonal filled in the first min(w,d) spots with the singular values
V:  d x d matrix

Thus U * Sigma * V = M  
#  you might have to do some transposes depending on how your SVD code 
#  returns U and V.  verify this so that you don't go crazy :)
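
If it helps, here is a minimal PHP sketch of that sanity check; it assumes your SVD hands back $U (w x w) and $V (d x d) as arrays of row arrays and $sigma as a plain array of singular values (all names here are my own placeholders):

<?php
// Multiply two matrices stored as arrays of row arrays.
function matmul(array $a, array $b) {
    $rows  = count($a);
    $cols  = count($b[0]);
    $inner = count($b);
    $c = array_fill(0, $rows, array_fill(0, $cols, 0.0));
    for ($i = 0; $i < $rows; $i++) {
        for ($p = 0; $p < $inner; $p++) {
            for ($j = 0; $j < $cols; $j++) {
                $c[$i][$j] += $a[$i][$p] * $b[$p][$j];
            }
        }
    }
    return $c;
}

// Rebuild M from U, Sigma, V: scale the first r columns of U by the singular
// values (that is U * Sigma), then multiply by the first r rows of V.
function reconstruct(array $U, array $sigma, array $V) {
    $r = count($sigma);
    $uScaled = array();
    foreach ($U as $row) {
        $scaledRow = array();
        for ($j = 0; $j < $r; $j++) {
            $scaledRow[$j] = $row[$j] * $sigma[$j];
        }
        $uScaled[] = $scaledRow;
    }
    return matmul($uScaled, array_slice($V, 0, $r));
}

Compare reconstruct($U, $sigma, $V) against M entry by entry, allowing for small floating-point error; if the two differ wildly, transpose V (or U) and check again.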

Then the dimensionality reduction.... the original LSA paper suggests that a good approximation for the basis is to keep enough vectors such that their singular values are more than 50% of the total of the singular values.

More succinctly... (pseudocode)

def reduced_rank(Sigma):
    s1 = sum(Sigma)
    total = 0
    for ii in range(len(Sigma)):
        total += Sigma[ii]
        if total > 0.5 * s1:
            return ii + 1   # number of singular values/vectors to keep
    return len(Sigma)

This returns the rank of the new basis (call it k). The rank was min(d,w) before; we now approximate it with k dimensions.

(here, ' -> prime, not transpose)

We create new matrices U', Sigma', V', with sizes w x k, k x k, and k x d.
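
In PHP, the truncation might look something like this (a sketch only, with the same assumptions about $U, $sigma, $V as above, and $k the reduced rank from the heuristic):

<?php
// U': keep the first $k columns of U (w x w  ->  w x k).
function truncateU(array $U, $k) {
    $Uk = array();
    foreach ($U as $row) {
        $Uk[] = array_slice($row, 0, $k);
    }
    return $Uk;
}

// Sigma': keep the first $k singular values (the k x k diagonal, stored as a vector).
function truncateSigma(array $sigma, $k) {
    return array_slice($sigma, 0, $k);
}

// V': keep the first $k rows of V (d x d  ->  k x d).
function truncateV(array $V, $k) {
    return array_slice($V, 0, $k);
}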

That's the essence of the LSA algorithm.

The resulting matrix U' * Sigma' * V' can be used for 'improved' cosine similarity searching, or you can pick the top 3 words for each document in it, for example. Whether this yields more than simple tf-idf is a matter of some debate.

In my experience, LSA performs poorly on real-world data sets because of polysemy and because of data sets with too many topics. Its mathematical/probabilistic basis is unsound (it assumes roughly normal (Gaussian) distributions, which don't make sense for word counts).

Your mileage will definitely vary.

Tagging using LSA (one method!)

  1. Construct the U' Sigma' V' dimensionally reduced matrices using SVD and a reduction heuristic

  2. By hand, look over the U' matrix and come up with terms that describe each "topic". For example, if the biggest components of a topic vector were "Bronx, Yankees, Manhattan," then "New York City" might be a good term for it. Keep these in an associative array or list. This step should be manageable, since the number of vectors is finite.

  3. Assuming you have a vector (v1) of words for a document, then v1 * t(U') will give the strongest 'topics' for that document. Select the 3 highest, then give their "topics" as computed in the previous step.
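
A possible PHP sketch of step 3 (names are my own placeholders; $Uk is the reduced w x k matrix U', $docVector has one count per vocabulary word in the same row order as U', and $topicLabels is the hand-built list from step 2):

<?php
// Project a document's word vector into the reduced space and pick the
// strongest topics.
function topTopics(array $docVector, array $Uk, array $topicLabels, $n = 3) {
    $k = count($Uk[0]);
    $scores = array_fill(0, $k, 0.0);
    foreach ($Uk as $wordIndex => $row) {
        if ($docVector[$wordIndex] == 0) {
            continue;                        // word not in this document
        }
        foreach ($row as $topic => $weight) {
            $scores[$topic] += $docVector[$wordIndex] * $weight;
        }
    }
    arsort($scores);                         // strongest topics first
    $labels = array();
    foreach (array_slice(array_keys($scores), 0, $n) as $topic) {
        $labels[] = $topicLabels[$topic];
    }
    return $labels;
}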

Ave answered 24/6, 2009 at 15:17 Comment(4)
Definitely, this is about what I wanted to know. But I still have some questions: Do I need V or VT (transpose)? I use stitchpanorama.sourceforge.net/Python/svd.py which gives you V. As you can see there, the singular values are not in descending order. Is this your pseudo-code function in PHP? paste.bradleygill.com/index.php?paste_id=10532 What does it do?Breakfast
The easy test for whether you need V or Vt is to figure out whether USV = M or USVt = M. That function is a heuristic way of figuring out how much dimensionality to reduce by. In this function, it says, "reduce the basis such that the vectors have 50% or more of the total of the singular values". You could also just say "keep the k biggest, for some value of k, like 50".... basically, determine how many categories there really are, which is the whole point of the LSA.Ave
Was there ever a solution to this LSA in PHP question. I understand the algorithm but have also been struggling to implement it in PHP.Comical
Not by me, alas! I'm a python/R hacker :)Ave

This answer isn't directly to the poster's question, but to the meta question of how to autotag news items. The OP mentions Named Entity Recognition, but I believe they mean something more along the lines of autotagging. If they really mean NER, then this response is hogwash :)

Given these constraints (600 items / day, 100-200 characters / item) with divergent sources, here are some tagging options:

  1. By hand. An analyst could easily do 600 of these per day, probably in a couple of hours. Something like Amazon's Mechanical Turk, or making users do it, might also be feasible. Having some number of hand-tagged items, even if it's only 50 or 100, will give you a good baseline for comparing whatever the auto-generated methods below produce.

  2. Dimensionality reduction, using LSA, topic models (Latent Dirichlet Allocation), and the like.... I've had really poor luck with LSA on real-world data sets, and I'm unsatisfied with its statistical basis. I find LDA much better, and it has an incredible mailing list with the best thinking on how to assign topics to texts.

  3. Simple heuristics... if you have actual news items, then exploit the structure of the news item. Focus on the first sentence, toss out the common words (stop words), and select the best 3 nouns from the first two sentences. Or heck, take all the nouns in the first sentence and see where that gets you. If the texts are all in English, then do part-of-speech analysis on the whole shebang and see what that gets you. With structured items like news reports, LSA and other order-independent methods (tf-idf) throw out a lot of information.
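
For what it's worth, here is a toy PHP sketch of that heuristic; without part-of-speech tagging it just keeps the longest non-stop-words of the first sentence, so treat it purely as an illustration (the stop word list is deliberately tiny):

<?php
// Take the first sentence, drop stop words, and keep the $n longest remaining
// words as candidate tags.
function heuristicTags($text, $n = 3) {
    $stopWords = array('the', 'a', 'an', 'of', 'in', 'on', 'and', 'to', 'is', 'for');
    $sentences = preg_split('/(?<=[.!?])\s+/', trim($text));
    preg_match_all('/\pL+/u', mb_strtolower($sentences[0]), $m);
    $candidates = array_values(array_diff($m[0], $stopWords));
    usort($candidates, function ($a, $b) {
        return mb_strlen($b) - mb_strlen($a);    // longest words first
    });
    return array_slice($candidates, 0, $n);
}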

Good luck!

(if you like this answer, maybe retag the question to fit it)

Ave answered 23/6, 2009 at 13:40 Comment(2)
Thank you very much. You're right, I meant autotagging. But I definitely don't want to manually tag articles (1). Approach 3 is too simple and gives too poor results (already tried this out). But approach 2 sounds good and this is what my question is about. ;) I want to autotag (I didn't use this word, but other words which are wrong, maybe) news articles with LSA. LDA sounds good, too, but it's a method for classification, not for tagging I think.Breakfast
LDA works for tagging too. All of these techniques are attempts to reduce the dimensionality (the basis) of the document space.Ave

That all looks right, up to the last step. The usual notation for SVD is that it returns three matrices, A = USV*. S is a diagonal matrix (all zeros off the diagonal) that, in this case, basically measures how much of the original data each dimension captures. The numbers ("singular values") decrease, and you can look for a drop-off to decide how many dimensions are useful. Otherwise, just choose an arbitrary number N of dimensions to take.

Here I get a little fuzzy. The coordinates of the terms (words) in the reduced-dimension space are either in U or V, depending, I think, on whether the words are in the rows or columns of the input matrix. Offhand, I think the coordinates for the words will be the rows of U, i.e. the first row of U corresponds to the first row of the input matrix, i.e. the first word. Then you just take the first N columns of that row as the word's coordinates in the reduced space.
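
In PHP terms, that would be something like the following sketch (assuming $U is an array of row arrays in the same row order as the input matrix; the function name is my own):

<?php
// Reduced-space coordinates of a word: its row in U, truncated to the first $N entries.
function wordCoordinates(array $U, $wordRow, $N) {
    return array_slice($U[$wordRow], 0, $N);
}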

HTH

Update:

This process so far doesn't tell you exactly how to pick out tags. I've never heard of anyone using LSI to choose tags (a machine learning algorithm might be more suited to the task, like, say, decision trees). LSI tells you whether two words are similar. That's a long way from assigning tags.

There are two tasks: (a) what set of tags to use, and (b) how to choose the best three tags. I don't have much of a sense of how LSI is going to help you answer (a). You can choose the set of tags by hand, but if you're using LSI, the tags probably should be words that occur in the documents. Then for (b), you want to pick out the tags that are closest to words found in the document. You could experiment with a few ways of implementing that, e.g. choose the three tags that are closest to any word in the document, where closeness is measured by the cosine similarity (see Wikipedia) between the tag's coordinates (its row in U) and the word's coordinates (its row in U).
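
A minimal PHP cosine similarity for that comparison might look like this (assuming $a and $b are two reduced-space coordinate vectors, e.g. rows of U; the function name is my own):

<?php
// Cosine similarity between two coordinate vectors of equal length.
function cosineSimilarity(array $a, array $b) {
    $dot = 0.0;
    $na  = 0.0;
    $nb  = 0.0;
    foreach ($a as $i => $x) {
        $dot += $x * $b[$i];
        $na  += $x * $x;
        $nb  += $b[$i] * $b[$i];
    }
    if ($na == 0.0 || $nb == 0.0) {
        return 0.0;
    }
    return $dot / (sqrt($na) * sqrt($nb));
}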

Belshazzar answered 19/6, 2009 at 21:55 Comment(3)
Thank you. My main problem is: How can I determine which words I should choose? Assuming that I always want to have 3 tags: What do I have to do?Breakfast
Thanks. Maybe I've misunderstood something and LSA isn't used to find tags. But if I have a set of tags, e.g. "Sports, Politics, World", then you can surely use LSA to find the best matching tag, right?Breakfast
"But if I have a set of tags, e.g. "Sports, Politics, World","... No. That's not what LSA is for at all really. If you had those tags, and a corpus of articles about those topics, it would make more sense to use a Bayesian classfier. What LSA is for is to say, "the words: baseball, yankees, A-Rod tend to co-occur, and probably reflect some underlying structure, therefore other articles having baseball in them might be related to the same underlying topics". LSA is just factor analysis.Ave

There is an additional SO thread on the perils of doing this all in PHP at link text.

Specifically, there is a link there to this paper on Latent Semantic Mapping, which describes how to get the resultant "topics" for a text.

Ave answered 23/6, 2009 at 13:51 Comment(2)
The question you linked (the first link) is one of my questions. ;) I've linked it in my question at the top of this page, too. But that one is about SVD, this one here is about LSA ...Breakfast
SVD is part of LSA, and it's covered in that SO discussion. Look at Blackkettle's answer. You do the SVD, reduce the singular value matrix, then recombine. Read the LSM paper, it has the steps. I think you place a lot more faith in LSM for solving this than is really warranted for your autotagging project, though.Ave
