Split string into sentences

4

30

I have written this piece of code that splits a string and stores the pieces in a string array:

String[] sSentence = sResult.split("[a-z]\\.\\s+");

I added the [a-z] to deal with part of the abbreviation problem, but now my result looks like this:

Furthermore when Everett tried to instruct them in basic mathematics they proved unresponsiv

I see that I lose the pattern specified in the split function. It's okay for me to lose the period, but losing the last letter of the word disturbs its meaning.

Could someone help me with this, and in addition, could someone help me with dealing with abbreviations? For example, because I split the string based on periods, I do not want to lose the abbreviations.

Signpost answered 21/4, 2010 at 22:29 Comment(0)
65

Parsing sentences is far from a trivial task, even for languages written in the Latin alphabet, such as English. A naive approach like the one you outline in your question will fail often enough that it will prove useless in practice.

A better approach is to use a BreakIterator configured with the right Locale.

BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
String source = "This is a test. This is a T.L.A. test. Now with a Dr. in it.";
iterator.setText(source);
int start = iterator.first();
for (int end = iterator.next();
    end != BreakIterator.DONE;
    start = end, end = iterator.next()) {
  System.out.println(source.substring(start,end));
}

Yields the following result:

  1. This is a test.
  2. This is a T.L.A. test.
  3. Now with a Dr. in it.
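
A limitation reported in the comments on this answer is that the iterator may break after a title such as "Mr." when the next word is capitalized. One possible workaround, shown here only as a minimal sketch, is to glue a segment back onto the following one whenever it ends with a known abbreviation. The class name AbbreviationAwareSplitter and the short abbreviation list are hypothetical and not part of the original answer; a real list would need to be much longer, and a sentence that genuinely ends with "Dr." would be merged incorrectly.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AbbreviationAwareSplitter {

    // Hypothetical, deliberately short list; extend it for real text.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("Mr.", "Mrs.", "Ms.", "Dr.", "Prof."));

    public static List<String> split(String source) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
        iterator.setText(source);

        List<String> sentences = new ArrayList<>();
        StringBuilder pending = new StringBuilder();

        int start = iterator.first();
        for (int end = iterator.next();
             end != BreakIterator.DONE;
             start = end, end = iterator.next()) {
            pending.append(source, start, end);
            // Only emit the buffered text when it does not end with an abbreviation.
            if (!endsWithAbbreviation(pending)) {
                addIfNotBlank(sentences, pending.toString());
                pending.setLength(0);
            }
        }
        // Flush whatever is left if the text itself ended with an abbreviation.
        addIfNotBlank(sentences, pending.toString());
        return sentences;
    }

    private static boolean endsWithAbbreviation(CharSequence segment) {
        String trimmed = segment.toString().trim();
        // The last whitespace-delimited token, e.g. "Mr." in "My friend, Mr."
        String lastToken = trimmed.substring(trimmed.lastIndexOf(' ') + 1);
        return ABBREVIATIONS.contains(lastToken);
    }

    private static void addIfNotBlank(List<String> target, String text) {
        String trimmed = text.trim();
        if (!trimmed.isEmpty()) {
            target.add(trimmed);
        }
    }

    public static void main(String[] args) {
        String source = "My friend, Mr. Jones, has a new dog. This is a test.";
        for (String sentence : split(source)) {
            System.out.println(sentence);
        }
    }
}
```

The merge step is independent of where the iterator actually breaks: if it does not split after "Mr.", the buffered text simply passes through unchanged.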
Inhibit answered 22/4, 2010 at 2:42 Comment(2)
When I use this sentence - "My friend, Mr. Jones, has a new dog." - it breaks after Mr. This happens because of the capitalization of Jones. Do you know a way around it? Otherwise the BreakIterator is great! - Tarweed
This works only when Dr., Mr., etc. are followed by a lowercase letter. An uppercase letter after them results in a wrong separation. - Guillaume
13

It will be difficult to get a regular expression to work in all cases, but to fix your immediate problem you can use a lookbehind:

String sResult = "This is a test. This is a T.L.A. test.";
String[] sSentence = sResult.split("(?<=[a-z])\\.\\s+");

Result:

This is a test
This is a T.L.A. test.

Note that there are abbreviations that do not end with capital letters, such as abbrev., Mr., etc. And there are also sentences that don't end in periods!
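
As a side-by-side illustration, here is a small sketch comparing the pattern from the question with the lookbehind above, plus a variant that moves the period into the lookbehind so that it is kept as well. The class and method names are made up for this example.

```java
import java.util.Arrays;

public class LookbehindSplitDemo {

    // Pattern from the question: the letter before the period is part of the
    // match, so split() throws it away together with the period.
    static String[] splitConsumingLetter(String text) {
        return text.split("[a-z]\\.\\s+");
    }

    // Lookbehind on the letter: only the period and whitespace are consumed.
    static String[] splitKeepingLetter(String text) {
        return text.split("(?<=[a-z])\\.\\s+");
    }

    // Lookbehind on letter and period: only the whitespace is consumed.
    static String[] splitKeepingPeriod(String text) {
        return text.split("(?<=[a-z]\\.)\\s+");
    }

    public static void main(String[] args) {
        String sResult = "This is a test. This is a T.L.A. test.";
        System.out.println(Arrays.toString(splitConsumingLetter(sResult)));
        System.out.println(Arrays.toString(splitKeepingLetter(sResult)));
        System.out.println(Arrays.toString(splitKeepingPeriod(sResult)));
    }
}
```

A lookbehind is zero-width: it checks what precedes the split point without making it part of the delimiter, which is why the characters inside it survive the split.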

Kirmess answered 21/4, 2010 at 22:32 Comment(1)
This will fail in 9.3% of sentences. And sentences that ... use ellipsis. And sentences with typo.s in them. And so on. Whatever you do, your code will make mistakes, viewed from the human perspective. - Kwok
4

If you can, use a natural language processing tool, such as LingPipe. There are many subtleties which will be very hard to catch using regular expressions, e.g., (e.g. :-)), Mr., abbreviations, ellipsis (...), et cetera.

There is a very easy-to-follow tutorial on Sentence Detection on the LingPipe website.

Rhett answered 21/4, 2010 at 22:43 Comment(1)
Hi, I checked out the tutorial. It seemed perfect, however I can't seem to figure out how to use it with Eclipse. Could you help me out please? - Signpost
2

A late response, but for future visitors who, like me, arrive here after a long search: use an OpenNLP model. It was the best option in my case, and it worked with all the text samples here, including the crucial one mentioned by @nbz in the comments:

My friend, Mr. Jones, has a new dog. This is a test. This is a T.L.A. test. Now with a Dr. in it.

Split into one sentence per line:

My friend, Mr. Jones, has a new dog.
This is a test.
This is a T.L.A. test.
Now with a Dr. in it.

You need to import the OpenNLP .jar libraries into your project, along with the trained model en-sent.bin.

This tutorial gets you up and running quickly:

https://www.tutorialkart.com/opennlp/sentence-detection-example-in-opennlp/

And one for setting it up in Eclipse:

https://www.tutorialkart.com/opennlp/how-to-setup-opennlp-java-project/

This is what the code looks like:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
 
import opennlp.tools.util.InvalidFormatException;
 
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
 
/**
* Sentence Detection Example in openNLP using Java
* @author tutorialkart
*/
public class SentenceDetectExample {
 
    public static void main(String[] args) {
        try {
            new SentenceDetectExample().sentenceDetect();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
 
    /**
     * This method is used to detect sentences in a paragraph/string
     * @throws InvalidFormatException
     * @throws IOException
     */
    public void sentenceDetect() throws InvalidFormatException, IOException {
        String paragraph = "This is a statement. This is another statement. Now is an abstract word for time, that is always flying.";
 
        // refer to model file "en-sent.bin", available at link http://opennlp.sourceforge.net/models-1.5/
        // try-with-resources closes the stream even if loading the model fails
        try (InputStream is = new FileInputStream("en-sent.bin")) {
            SentenceModel model = new SentenceModel(is);

            // feed the model to SentenceDetectorME class
            SentenceDetectorME sdetector = new SentenceDetectorME(model);

            // detect sentences in the paragraph
            String[] sentences = sdetector.sentDetect(paragraph);

            // print the detected sentences to the console
            for (String sentence : sentences) {
                System.out.println(sentence);
            }
        }
    }
}

Since the libraries are bundled with your project, it also works offline, which is a big plus. As the correct answer by @Julien Silland says, splitting sentences is not a straightforward process, and having a trained model do it for you is the best option.

Rascon answered 22/1, 2021 at 13:50 Comment(0)
