Binarization in Natural Language Processing

Asked 23/2, 2009 at 20:31 Answered 2/3, 2009 at 1:19

Solved machine-learning nlp classification libsvm

Binarization is the act of transforming colorful features of of an entity into vectors of numbers, most often binary vectors, to make good examples for classifier algorithms.

If we where to binarize the sentence "The cat ate the dog", we could start by assigning every word an ID (for example cat-1, ate-2, the-3, dog-4) and then simply replace the word by it's ID giving the vector <3,1,2,3,4>.

Given these IDs we could also create a binary vector by giving each word four possible slots, and setting the slot corresponding to a specific word with to one, giving the vector <0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1>. The latter method is, as far as I know, is commonly referred to as the bag-of-words-method.

Now for my question, what is the best binarization method when it comes to describe features for natural language processing in general, and transition-based dependency parsing (with Nivres algorithm) in particular?

In this context, we do not want to encode the whole sentence, but rather the current state of the parse, for example the top word on the stack en the first word in the input queue. Since order is highly relevant, this rules out the bag-of-words-method.

With best, I am referring to the method that makes the data the most intelligible for the classifier, without using up unnecessary memory. For example I don't want a word bigram to use 400 million features for 20000 unique words, if only 2% the bigrams actually exist.

Since the answer is also depending on the particular classifier, I am mostly interested in maximum entropy models (liblinear), support vector machines (libsvm) and perceptrons, but answers that apply to other models are also welcome.

Isthmian answered 23/2, 2009 at 20:31 Comment(4)

I don't know what binarization is, and I'm sure many other people are in the same boat, so it would be nice if you could give some explanation of what you mean for those of us who are unfamiliar with NLP (if not to help us answer, at least to help with understanding the subject). – Helfand 23/2, 2009 at 20:41

Same here - can you define binarization please? – Stadler 23/2, 2009 at 21:12

Perhaps you could define what you mean by 'best' i.e., the most space efficient, the most processing efficient, the most descriptive. – Gault 24/2, 2009 at 15:35

I think it is clear that the most space efficient will probably be to obscure for most classifiers and as long as you don't truncate the data, which I don't want to do, all representations will be just as descriptive. As for processing efficiency, this is not an issue at the moment... – Isthmian 24/2, 2009 at 17:47

This is actually a really complex question. The first decision you have to make is whether to lemmatize your input tokens (your words). If you do this, you dramatically decrease your type count, and your syntax parsing gets a lot less complicated. However, it takes a lot of work to lemmatize a token. Now, in a computer language, this task gets greatly reduced, as most languages separate keywords or variable names with a well defined set of symbols, like whitespace or a period or whatnot.

The second crucial decision is what you're going to do with the data post-facto. The "bag-of-words" method, in the binary form you've presented, ignores word order, which is completely fine if you're doing summarization of a text or maybe a Google-style search where you don't care where the words appear, as long as they appear. If, on the other hand, you're building something like a compiler or parser, order is very much important. You can use the token-vector approach (as in your second paragraph), or you can extend the bag-of-words approach such that each non-zero entry in the bag-of-words vector contains the linear index position of the token in the phrase.

Finally, if you're going to be building parse trees, there are obvious reasons why you'd want to go with the token-vector approach, as it's a big hassle to maintain sub-phrase ids for every word in the bag-of-words vector, but very easy to make "sub-vectors" in a token-vector. In fact, Eric Brill used a token-id sequence for his part-of-speech tagger, which is really neat.

Do you mind if I ask what specific task you're working on?

Gayton answered 25/2, 2009 at 18:32 Comment(3)

Thank you for a good start of an answer! : ) I will certainly check out the details of Brills token-id sequence. About using the BOW-representation with an integer to represent the tokens linear index, do you really think this would work (give good performance) with an SVM classifier? – Isthmian 26/2, 2009 at 13:24

The specific task is an implementation of Nivres linear time, transition-based parsing algorithm together with the maximum entropy classifier of liblinear. – Isthmian 26/2, 2009 at 13:30

@sganslandt: for SVM classifiers, you might think about using n-grams (bigrams, trigrams etc) instead of tokens - this preserves local contextual order, but ignores global order. You can then use a regular old bag-of-words and still maintain some context information. – Gayton 26/2, 2009 at 16:24

Binarization is the act of transforming colorful features of an entity into vectors of numbers, most often binary vectors, to make good examples for classifier algorithms.

I have mostly come across numeric features that take values between 0 and 1 (not binary as you describe), representing the relevance of the particular feature in the vector (between 0% and 100%, where 1 represents 100%). A common example for this are tf-idf vectors: in the vector representing a document (or sentence), you have a value for each term in the entire vocabulary that indicates the relevance of that term for the represented document.

As Mike already said in his reply, this is a complex problem in a wide field. In addition to his pointers, you might find it useful to look into some information retrieval techniques like the vector space model, vector space classification and latent semantic indexing as starting points. Also, the field of word sense disambiguation deals a lot with feature representation issues in NLP.

Vaticination answered 2/3, 2009 at 1:19 Comment(0)

[Not a direct answer] It all depends on what you are try to parse and then process, but for general short human phrase processing (e.g. IVT) another method is to use neural networks to learn the patterns. This can be very acurate for smallish vocubularies

Haven answered 26/2, 2009 at 12:30 Comment(0)

Recommended topics

Hot tags