Named entities as a feature in text categorization?

It depends a lot on the domain you are working in, because you have to define your features to suit that domain. Say you are working on a learning-to-rank problem in a search engine, generating a dynamic rank: named entities (NEs) won't give you any benefit there. It also depends largely on the output category labels you have defined (this being supervised learning).

Now say you are classifying documents as Soccer, Movie, Politics and so on. In this case named entities can work. Here is an example: suppose you are using a neural network that categorizes documents into Soccer, Movie, Politics, etc., and this document comes in: "Lionel Messi was invited to attend the premiere of "The Social Network"; also present were the cast and crew, including Jesse Eisenberg, Andrew Garfield and Justin Timberlake." Here the connection between the named entities (the input features) and Movie (the output label) will be stronger, because most of the entities are film-related, and hence the document will be classified as Movie.

Another example: say our document is "Tom Cruise is portraying the character of Lionel Messi in the movie "The last soccer game"". Here is where the benefit comes in: suppose your neural network has learnt that when an actor and a footballer appear together in one document, there is a high probability of it being about a movie. Again, this depends on the data and the training; it may turn out the other way round (but that is what learning is all about: looking at past data).
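To make the two examples above concrete, here is a minimal sketch of how entity types could be turned into features. Everything in it is an assumption for illustration: the `GAZETTEER` lookup table, the type labels (ACTOR, FOOTBALLER, FILM) and the `entity_features` helper are hypothetical, and a real system would get the entities from an NER tagger rather than from string matching.

```python
from collections import Counter
from itertools import combinations

# Hypothetical gazetteer standing in for a real NER tagger (assumption).
GAZETTEER = {
    "Lionel Messi": "FOOTBALLER",
    "Tom Cruise": "ACTOR",
    "Jesse Eisenberg": "ACTOR",
    "Andrew Garfield": "ACTOR",
    "Justin Timberlake": "ACTOR",
    "The Social Network": "FILM",
}

def entity_features(text):
    """Return a sparse dict of entity-type counts plus type co-occurrence flags."""
    counts = Counter()
    for name, etype in GAZETTEER.items():
        counts[etype] += text.count(name)

    # One count feature per entity type that actually occurs in the text.
    features = {f"count_{t}": n for t, n in counts.items() if n > 0}

    # One binary feature per pair of entity types that occur together,
    # e.g. "an actor and a footballer in the same document".
    present = sorted(t for t, n in counts.items() if n > 0)
    for a, b in combinations(present, 2):
        features[f"both_{a}_{b}"] = 1
    return features

doc1 = ('Lionel Messi was invited to attend the premiere of "The Social Network"; '
        'also present were the cast and crew, including Jesse Eisenberg, '
        'Andrew Garfield and Justin Timberlake.')
doc2 = ('Tom Cruise is portraying the character of Lionel Messi '
        'in the movie "The last soccer game".')

print(entity_features(doc1))
print(entity_features(doc2))
```

The resulting dicts can be appended to whatever features you already use (bag of words, for instance) before training. Whether the classifier then learns "actor plus footballer means Movie" is still up to the training data.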

So my answer would be: try it out. Nothing stops you from using named entities as features, and they might help in the domain you are working in.
