How to parse text into sentences

I'm trying to break up a paragraph into sentences. Here is my code so far:

import java.util.*;

public class StringSplit {
    public static void main(String[] args) throws Exception {
        String testString = "The outcome of the negotiations is vital, because the current tax levels signed into law by President George W. Bush expire on Dec. 31. Unless Congress acts, tax rates on virtually all Americans who pay income taxes will rise on Jan. 1. That could affect economic growth and even holiday sales.";
        String[] sentences = testString.split("[\\.\\!\\?]");
        for (int i = 0; i < sentences.length; i++) {
            System.out.println(i);
            System.out.println(sentences[i]);
        }
    }
}

Two problems were found:

  1. The code splits anytime it comes to a period (".") symbol, even when it's actually one sentence. How do I prevent this?
  2. Each sentence that is split starts with a space. How do I delete the redundant space?
Impress answered 7/12, 2010 at 5:13 Comment(0)
E
14

The problem you mentioned is an NLP (Natural Language Processing) problem. It is fine to write a crude rule engine, but it might not scale up to support full English text.

For deeper insight, and for a Java library, see http://nlp.stanford.edu/software/lex-parser.shtml and http://nlp.stanford.edu:8080/parser/index.jsp , as well as the similar question for Ruby: How do you parse a paragraph of text into sentences? (perferrably in Ruby)

For example, the text:

The outcome of the negotiations is vital, because the current tax levels signed into law by President George W. Bush expire on Dec. 31. Unless Congress acts, tax rates on virtually all Americans who pay income taxes will rise on Jan. 1. That could affect economic growth and even holiday sales.

after tagging becomes:

The/DT outcome/NN of/IN the/DT negotiations/NNS is/VBZ vital/JJ ,/, because/IN the/DT current/JJ tax/NN levels/NNS signed/VBN into/IN law/NN by/IN President/NNP George/NNP W./NNP Bush/NNP expire/VBP on/RP Dec./NNP 31/CD ./. Unless/IN Congress/NNP acts/VBZ ,/, tax/NN rates/NNS on/IN virtually/RB all/RB Americans/NNPS who/WP pay/VBP income/NN taxes/NNS will/MD rise/VB on/IN Jan./NNP 1/CD ./. That/DT could/MD affect/VB economic/JJ growth/NN and/CC even/RB holiday/NN sales/NNS ./.

Notice how it distinguishes the sentence-ending full stop (tagged ./.) from the period that is part of an abbreviation, as in Dec./NNP before 31.
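
The "crude rule engine" mentioned at the start of this answer could look roughly like the sketch below. It is not the Stanford library's method, only a hand-written illustration: the class name CrudeSentenceSplitter, the tiny ABBREVIATIONS list, and the "single capital letter plus period is an initial" rule are all assumptions made for this example, and they cover the sample text rather than English in general.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class CrudeSentenceSplitter {
    // Hypothetical, deliberately tiny abbreviation list; real text needs far more entries.
    private static final Set<String> ABBREVIATIONS =
            Set.of("Dec.", "Jan.", "Mr.", "Mrs.", "Dr.");

    // Walks the whitespace-separated tokens and closes a sentence
    // whenever a token looks like a genuine sentence end.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(token);
            if (endsSentence(token)) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            sentences.add(current.toString()); // trailing text without end punctuation
        }
        return sentences;
    }

    private static boolean endsSentence(String token) {
        if (token.endsWith("!") || token.endsWith("?")) {
            return true;
        }
        if (!token.endsWith(".")) {
            return false;
        }
        if (ABBREVIATIONS.contains(token)) {
            return false; // known abbreviation such as "Dec."
        }
        // Treat a single capital letter plus period (e.g. "W.") as an initial.
        boolean isInitial = token.length() == 2 && Character.isUpperCase(token.charAt(0));
        return !isInitial;
    }

    public static void main(String[] args) {
        String text = "The outcome of the negotiations is vital, because the current tax levels "
                + "signed into law by President George W. Bush expire on Dec. 31. Unless Congress "
                + "acts, tax rates on virtually all Americans who pay income taxes will rise on "
                + "Jan. 1. That could affect economic growth and even holiday sales.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

Because the pieces are rebuilt from tokens, no sentence starts with a space. The obvious weakness is that a sentence which really does end in a listed abbreviation or an initial is not split, which is the kind of case where a trained NLP tool does better.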

Elapse answered 7/12, 2010 at 5:30 Comment(0)
G
3

You can try to use the java.text.BreakIterator class for parsing sentences. For example:

import java.text.BreakIterator;
import java.util.Locale;

// text is the String you want to split into sentences
BreakIterator border = BreakIterator.getSentenceInstance(Locale.US);
border.setText(text);
int start = border.first();
// Iterate, printing the substring between each pair of consecutive boundaries
for (int end = border.next(); end != BreakIterator.DONE; start = end, end = border.next()) {
    System.out.println(text.substring(start, end));
}
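
As a rough, self-contained version of the same idea, the sketch below wraps the loop in a method that collects the sentences into a list and trims each one, which also takes care of the leading-space problem from the question. The class name BreakIteratorSentences and the method name sentences are made up for this example; only BreakIterator itself comes from the JDK. No claim is made here about how it handles abbreviations such as "Dec." or initials such as "W.".

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakIteratorSentences {
    // Hypothetical helper: returns each detected sentence without surrounding whitespace.
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        BreakIterator border = BreakIterator.getSentenceInstance(Locale.US);
        border.setText(text);
        int start = border.first();
        for (int end = border.next(); end != BreakIterator.DONE; start = end, end = border.next()) {
            // substring(start, end) includes the whitespace after the punctuation, so trim it
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String sentence : sentences("Hello world. How are you? Fine!")) {
            System.out.println(sentence);
        }
    }
}
```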
Gilbreath answered 1/8, 2013 at 1:24 Comment(1)
BreakIterator is a good idea, but it suffers from many of these same types of problems. See this question: #17160013 - Twostep
C
2

The first one is a pretty hard problem to do properly, since you'd have to implement sentence detection. I suggest you don't do that, and instead separate sentences with two spaces after the punctuation mark. For example:

"The outcome of the negotiations is vital, because the current tax levels signed into law by President George W. Bush expire on Dec. 31.  Unless Congress acts, tax rates on virtually all Americans who pay income taxes will rise on Jan. 1.  That could affect economic growth and even holiday sales."

The second one can be solved using String.trim().

Example:

String one = "   and now...    ";
String two = one.trim();
System.out.println(two);          // output: "and now..."
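
Putting both suggestions together, a minimal sketch of the two-space convention might look like this. It assumes the input really does use two or more spaces after every sentence-ending punctuation mark; the class name TwoSpaceSplit is invented for the example.

```java
class TwoSpaceSplit {
    // Splits wherever '.', '!' or '?' is followed by two or more whitespace characters.
    // The lookbehind keeps the punctuation attached to the sentence, and the
    // whitespace itself is consumed as the delimiter, so no piece starts with a space.
    static String[] split(String text) {
        return text.trim().split("(?<=[.!?])\\s{2,}");
    }

    public static void main(String[] args) {
        String text = "The outcome of the negotiations is vital, because the current tax levels "
                + "signed into law by President George W. Bush expire on Dec. 31.  Unless Congress "
                + "acts, tax rates on virtually all Americans who pay income taxes will rise on "
                + "Jan. 1.  That could affect economic growth and even holiday sales.";
        String[] sentences = split(text);
        for (int i = 0; i < sentences.length; i++) {
            System.out.println(i + ": " + sentences[i]);
        }
    }
}
```

A single space after "W." or "Dec." is not a split point under this rule, so the abbreviations stay inside their sentences.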
Cassiopeia answered 7/12, 2010 at 5:20 Comment(1)
The problem with your first solution is that in the last decade or so there has been a shift from inserting two spaces between sentences to inserting only one. For writing done with this newer style, your solution won't work. :( - Twostep
H
0

Trim it...

Harpy answered 7/12, 2010 at 5:19 Comment(0)
M
0

Given the current input format, it will be difficult to split the text into sentences. You have to impose some additional rule to identify the end of a sentence, beyond the period alone. For instance, the rule could be "a sentence ends with a period (.) followed by two spaces". (This is how some UNIX text tools identify sentences, for example Emacs with its default sentence-end-double-space setting.)

Morsel answered 7/12, 2010 at 5:20 Comment(0)
G
0

You can use the SentenceSplitter class provided by this open source library here.

SentenceSplitter sp = new SentenceSplitter("filename");
String str = null;
while((str = sp.next().toString()) != null)
{
    //Your code here.
}
Grabowski answered 22/2, 2015 at 15:29 Comment(1)
Nothing to download at this URL. It returns "You don't have permission to access /page/download_view/ on this server." - Twostep
C
-1

First trim() your string, then see these links:

http://www.java-examples.com/java-string-split-example and http://www.rgagnon.com/javadetails/java-0438.html

You can also use the StringBuffer class. I hope this helps.

Curlpaper answered 7/12, 2010 at 5:35 Comment(0)
