How do I classify words in a text into categories like names, numbers, money, dates, etc.?
I asked some questions about text mining a week ago and was a bit confused, but now I know what I want to do.

The situation: I have a lot of downloaded pages with HTML content. Some of them may be text from a blog, for example. They are not structured and come from different sites.

What I want to do: I will split all the words on whitespace, and I want to classify each one (or a group of them) into some predefined items like names, numbers, phone, email, URL, date, money, temperature, etc.

What I know: I know the concepts of (or have at least heard about) Natural Language Processing, Named Entity Recognition, POS tagging, Naive Bayes, HMMs, training, and many other things used for classification. But there are several NLP libraries with different classifiers and ways to do this, and I don't know which to use or what to do.

WHAT I NEED: I need a code example from a classifier, NLP library, whatever, that can classify each word from a text separately, not an entire text. Something like this:

// This is pseudo-code for what I want, not an implementation

classifier.trainFromFile("file-with-train-words.txt");
words = text.split(" ");
for(String word: words){
    classifiedWord = classifier.classify(word);
    System.out.println(classifiedWord.getType());
}

Can somebody help me? I'm confused by the various APIs, classifiers, and algorithms.

Hilton answered 1/8, 2011 at 2:55 Comment(0)
You should try Apache OpenNLP. It is easy to use and customize.

If you are doing this for Portuguese, there is information in the project documentation on how to do it using the Amazonia Corpus. The supported types are:

Person, Organization, Group, Place, Event, ArtProd, Abstract, Thing, Time and Numeric.

  1. Download OpenNLP and the Amazonia Corpus. Extract both and copy the file amazonia.ad into the apache-opennlp-1.5.1-incubating folder.

  2. Execute the TokenNameFinderConverter tool to convert the Amazonia corpus to the OpenNLP format:

    bin/opennlp TokenNameFinderConverter ad -encoding ISO-8859-1 -data amazonia.ad -lang pt > corpus.txt
    
  3. Train your model (change the encoding to the encoding of the corpus.txt file, which should be your system default encoding; this command can take several minutes):

    bin/opennlp TokenNameFinderTrainer -lang pt -encoding UTF-8 -data corpus.txt -model pt-ner.bin -cutoff 20
    
  4. Executing it from the command line (enter only one sentence, with the tokens separated by spaces):

    $ bin/opennlp TokenNameFinder pt-ner.bin 
    Loading Token Name Finder model ... done (1,112s)
    Meu nome é João da Silva , moro no Brasil . Trabalho na Petrobras e tenho 50 anos .
    Meu nome é <START:person> João da Silva <END> , moro no <START:place> Brasil <END> . <START:abstract> Trabalho <END> na <START:abstract> Petrobras <END> e tenho <START:numeric> 50 anos <END> .
    
  5. Executing it using the API:

    // load the trained model created in step 3
    TokenNameFinderModel model = null;
    try (InputStream modelIn = new FileInputStream("pt-ner.bin")) {
      model = new TokenNameFinderModel(modelIn);
    }
    catch (IOException e) {
      e.printStackTrace();
    }
    
    // create the name finder from the model
    NameFinderME nameFinder = new NameFinderME(model);
    
    // pass the token array to the name finder
    String[] toks = {"Meu","nome","é","João","da","Silva",",","moro","no","Brasil",".","Trabalho","na","Petrobras","e","tenho","50","anos","."};
    
    // each Span holds the start and end token indexes of a name, plus its type
    Span[] nameSpans = nameFinder.find(toks);
    
  6. To evaluate your model you can use 10-fold cross-validation (only available in 1.5.2-incubating; to use it today you need the SVN trunk; it can take several hours):

    bin/opennlp TokenNameFinderCrossValidator -lang pt -encoding UTF-8 -data corpus.txt -cutoff 20
    
  7. Improve the precision/recall by using custom feature generation (check the documentation), for example by adding a name dictionary.
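As a supplement to step 5: the Span objects returned by NameFinderME.find hold (start, end, type) triples over the token array, with the end index exclusive. The following self-contained sketch (no OpenNLP dependency; NamedSpan is a hypothetical stand-in for opennlp.tools.util.Span) shows how such spans map back onto the tokens to rebuild the <START:type> ... <END> markup seen in step 4:

```java
import java.util.ArrayList;
import java.util.List;

public class SpanFormatter {
    // Hypothetical stand-in for opennlp.tools.util.Span: start inclusive,
    // end exclusive, plus the entity type.
    record NamedSpan(int start, int end, String type) {}

    // Rebuild the <START:type> ... <END> markup from tokens and spans.
    public static String format(String[] tokens, List<NamedSpan> spans) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            for (NamedSpan s : spans) {
                if (s.start() == i) out.add("<START:" + s.type() + ">");
            }
            out.add(tokens[i]);
            for (NamedSpan s : spans) {
                if (s.end() == i + 1) out.add("<END>");
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        String[] toks = {"Meu", "nome", "é", "João", "da", "Silva", "."};
        List<NamedSpan> spans = List.of(new NamedSpan(3, 6, "person"));
        // prints: Meu nome é <START:person> João da Silva <END> .
        System.out.println(format(toks, spans));
    }
}
```

With the real API you would build the list from nameFinder.find(toks) instead of hard-coding it; the mapping logic is the same.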

Aret answered 3/8, 2011 at 18:53 Comment(0)
You can use a Named Entity Recognition (NER) approach for this task. I would highly recommend taking a look at the Stanford CoreNLP page and using the NER functionality in its modules. You can break your sentences into tokens and then pass them to the Stanford NER system. The Stanford CoreNLP page has a lot of examples that can help you; otherwise, let me know if you need toy code.

Here is the sample code; this is just a snippet of the whole thing:

    // creates a StanfordCoreNLP object with the annotators that NER depends on
    Properties props = new Properties();
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    
    // annotate the text, then print each token with its NER tag
    Annotation document = new Annotation(text);
    pipeline.annotate(document);
    for (CoreLabel token : document.get(CoreAnnotations.TokensAnnotation.class)) {
        System.out.println(token.word() + " -> "
            + token.get(CoreAnnotations.NamedEntityTagAnnotation.class));
    }
Duodecimal answered 1/8, 2011 at 16:31 Comment(5)
Oh, thanks. Can you explain how I can train StanfordCoreNLP to use some definitions that I want? I see that it only comes with 7 predefined entities.Hilton
Can you check nlp.stanford.edu/software/crf-faq.shtml for details about custom NER-based classifiers?Duodecimal
If you go through the FAQ it has steps for training a custom classifier.Duodecimal
-1, you cannot classify a word alone. That way you miss the word's context (the other words), and you will not be able to distinguish, e.g., a date from a phone number.Hypnosis
Agreed. But if a good model is already trained will it matter if I submit whole string or just the tokens?Duodecimal
This problem falls at the intersection of several ideas from different areas. You mention named entity recognition; that is one of them. However, you are probably looking at a mixture of part-of-speech tagging (for nouns, names, and the like) and information extraction (for numbers, phone numbers, emails).

Unfortunately, doing this and making it work on real-world data will require some effort, and it is not as simple as using this or that API.
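The information-extraction side of that mixture can at least be sketched with plain regular expressions. The patterns and type names below are simplified illustrations of the idea, not taken from any library:

```java
import java.util.regex.Pattern;

// Hypothetical pattern-based token classifier for the structured types
// (email, URL, money, number); everything else falls through to WORD.
// Real-world text needs far more robust patterns than these.
public class TokenClassifier {
    private static final Pattern EMAIL  = Pattern.compile("[\\w.+-]+@[\\w-]+\\.[\\w.]+");
    private static final Pattern URL    = Pattern.compile("https?://\\S+");
    private static final Pattern MONEY  = Pattern.compile("\\$\\d+([.,]\\d+)?");
    private static final Pattern NUMBER = Pattern.compile("\\d+([.,]\\d+)?");

    public static String classify(String token) {
        if (EMAIL.matcher(token).matches())  return "EMAIL";
        if (URL.matcher(token).matches())    return "URL";
        if (MONEY.matcher(token).matches())  return "MONEY";
        if (NUMBER.matcher(token).matches()) return "NUMBER";
        return "WORD";
    }

    public static void main(String[] args) {
        for (String tok : "Contact john@example.com about the $30 invoice".split(" ")) {
            System.out.println(tok + " -> " + classify(tok));
        }
    }
}
```

For names and other context-dependent types, this per-token approach breaks down (as the comments above point out), which is where a trained NER model comes in.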

Bischoff answered 1/8, 2011 at 4:49 Comment(0)
You have to create specific functions for extracting and detecting each data type and its errors.

In other words, handle each type in its own object-oriented way. For example, to detect currency we check for a dollar sign at the beginning or end of the token, and if non-numeric characters are attached, that signals an error.

You should write down what you already do in your mind. It's not that hard if you follow the rules. There are three golden rules in Robotics/AI:

  1. Analyse it.
  2. Simplify it.
  3. Digitalize it.

That way you can talk with computers.
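The currency rule described above can be sketched in a few lines; the class and method names here are hypothetical, and the rule is exactly the one stated (dollar sign at the start or end, anything non-numeric attached counts as an error):

```java
// Hypothetical rule-based currency detector: a token is currency if it has
// a dollar sign at the beginning or end and only digits (with an optional
// decimal part) attached; any other attached characters mean an error.
public class CurrencyDetector {
    public static String detect(String token) {
        boolean leading = token.startsWith("$");
        boolean trailing = token.endsWith("$");
        if (!leading && !trailing) return "NOT_CURRENCY";
        String body = leading
            ? token.substring(1)
            : token.substring(0, token.length() - 1);
        return body.matches("\\d+([.,]\\d+)?") ? "CURRENCY" : "ERROR";
    }

    public static void main(String[] args) {
        System.out.println(detect("$19.99")); // CURRENCY
        System.out.println(detect("19$"));    // CURRENCY
        System.out.println(detect("$19abc")); // ERROR
        System.out.println(detect("hello"));  // NOT_CURRENCY
    }
}
```

Each data type from the question (date, phone, temperature, ...) would get its own analyse/simplify/digitalize function along these lines.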

Bernardinabernardine answered 21/10, 2017 at 16:56 Comment(1)
Please try to give clear answers. The OP may not understand you if the rest of us can't.Sander

© 2022 - 2024 — McMap. All rights reserved.