How do you parse a paragraph of text into sentences? (preferably in Ruby)

How do you take a paragraph or a large amount of text and break it into sentences (preferably using Ruby), taking into account cases such as Mr. and Dr. and U.S.A.? (Assume you just put the sentences into an array of arrays.)

UPDATE: One possible solution I thought of involves using a parts-of-speech tagger (POST) and a classifier to determine the end of a sentence:

Input text: Mr. Jones felt the warm sun on his face as he stepped out onto the balcony of his summer home in Italy. He was happy to be alive.

CLASSIFIER Mr./PERSON Jones/PERSON felt/O the/O warm/O sun/O on/O his/O face/O as/O he/O stepped/O out/O onto/O the/O balcony/O of/O his/O summer/O home/O in/O Italy/LOCATION ./O He/O was/O happy/O to/O be/O alive/O ./O

POST Mr./NNP Jones/NNP felt/VBD the/DT warm/JJ sun/NN on/IN his/PRP$ face/NN as/IN he/PRP stepped/VBD out/RP onto/IN the/DT balcony/NN of/IN his/PRP$ summer/NN home/NN in/IN Italy./NNP He/PRP was/VBD happy/JJ to/TO be/VB alive./IN

Can we assume that, since Italy is a location, the period after it is a valid end of the sentence? Since a sentence ending on "Mr." would leave no other parts of speech to complete it, can we assume this is not a valid end-of-sentence period? Is this the best answer to my question?

Thoughts?
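For illustration, the failure mode that motivates the question shows up immediately with a naive split in Ruby (the sample text is adapted from the update above):

```ruby
# A naive split on ". " treats the period after "Mr." as a sentence boundary.
text = "Mr. Jones felt the warm sun on his face. He was happy to be alive."
naive = text.split(". ")
# naive => ["Mr", "Jones felt the warm sun on his face", "He was happy to be alive."]
```

Any real solution has to tell the period in "Mr." apart from the period ending the first sentence.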

Snakemouth answered 13/5, 2009 at 22:49 Comment(2)
Are there any specific rules? If you can tell us the rules in English, I'm sure we (or you) would be able to code the solution. For example: do abbreviations such as 'abbr' have a full stop after them? If you're going to be parsing grammar textbooks you may be fine with simple solutions, but if you're taking arbitrary text then every solution will have shortcomings, like ... you know?Jiggered
POS tagger is overkill. Use an NLP-based tokenizer and your rules will be simpler.Bova

Try looking at the Ruby wrapper around the Stanford Parser. It has a getSentencesFromString() function.

Waterworks answered 14/5, 2009 at 14:35 Comment(3)
I'll continue to play with the Stanford parser - it's in there somewhere! Thanks!Snakemouth
edu.stanford.nlp.process.DocumentPreprocessor, by the wayWaterworks
Yes, either via the Ruby wrapper or directly by calling edu.stanford.nlp.process.DocumentPreprocessor (from code or from the command line: java edu.stanford.nlp.process.DocumentPreprocessor /u/nlp/data/lexparser/textDocument.txt > oneTokenizedSentencePerLine.txt), you can divide text into sentences. (This is done via a (good but heuristic) FSM, so it's fast; you're not running the probabilistic parser.)Crampon

Just to make it clear: there is no simple solution to this. It is a topic of NLP research, as a quick Google search shows.

However, there seem to be some open source NLP projects that support sentence detection; I found the following Java-based toolkit:

openNLP

Additional comment: The problem of deciding where sentences begin and end is also called sentence boundary disambiguation (SBD) in natural language processing.

Sevenfold answered 13/5, 2009 at 23:13 Comment(2)
I wasn't able to find an easy Ruby wrapper for openNLP - have you come across any? They did have a sentence splitter though...Snakemouth
@phillc: Well, so called sentence boundary disambiguation "is the problem in natural language processing of deciding where sentences begin and end". (en.wikipedia.org/wiki/Sentence_boundary_disambiguation)Sevenfold

Looks like this Ruby gem might do the trick.

https://github.com/zencephalon/Tactful_Tokenizer

Turoff answered 6/5, 2010 at 16:3 Comment(0)

Take a look at the Python sentence splitter in NLTK (Natural Language Toolkit):

Punkt sentence tokenizer

It's based on the following paper:

Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32: 485-525.

The approach in the paper is quite interesting. They reduce the problem of sentence splitting to the problem of determining how strongly a word is associated with following punctuation. The overloading of periods after abbreviations is responsible for most of the ambiguous periods, so if you can identify the abbreviations you can identify the sentence boundaries with a high probability.
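The core intuition can be sketched in a few lines of Ruby. This is a toy illustration only, not the actual Punkt algorithm: the tokenization, the threshold of 0.9, and the sample text are all made up for the example.

```ruby
# Toy sketch of the Punkt intuition: a word type that almost always appears
# with a trailing period is probably an abbreviation, so a period after it
# is probably not a sentence boundary. Thresholds are illustrative only.
def likely_abbreviations(tokens)
  with_dot = Hash.new(0)
  total    = Hash.new(0)
  tokens.each do |t|
    bare = t.chomp(".").downcase
    next if bare.empty?          # skip standalone "." tokens
    total[bare] += 1
    with_dot[bare] += 1 if t.end_with?(".")
  end
  # seen at least twice, and almost always with a period attached
  total.keys.select { |w| total[w] >= 2 && with_dot[w].fdiv(total[w]) > 0.9 }
end

tokens = "Mr. Smith met Mr. Jones . They talked .".split
likely_abbreviations(tokens)  # => ["mr"]
```

Once you have the abbreviation list, periods after those words can be excluded as boundary candidates.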

I've tested this tool informally a bit and it seems to give good results for a variety of (human) languages.

Porting it to Ruby would be non-trivial, but it might give you some ideas.

Countrywide answered 25/5, 2009 at 21:22 Comment(1)
both links are broken, is there something you can share that won't be removed?Hydrology

This is a hard problem if you really care about getting it right. You'll find that NLP parser packages probably provide this functionality. If you want something faster, you'll end up duplicating some of that functionality with a trained probabilistic function over a window of tokens (you'd probably want to count a line feed as a token, since a period may be dropped at the end of a paragraph).

Edit: I recommend the Stanford parser if you can use Java. I have no recommendation for other languages, but I'm very interested in hearing what else is out there that is open source.

Bova answered 13/5, 2009 at 23:13 Comment(3)
Yes, I've played with the Stanford NLP parser but didn't find a sentence splitter. If you're interested in using it, there is an rjb (Ruby-to-Java bridge) wrapper someone created on GitHub, which I was able to get working with relative ease. Here is the link for those of you interested: github.com/tiendung/ruby-nlp/tree/master NOTE: on Windows, you must change colons to semicolons when loading Java libraries. Cheers.Snakemouth
You are right, there's no sentence splitter in the parser package, but there is a tokenizer, which gets you part of the way there. It handles things like those mentioned, "Mr." as a token versus "." as an end of sentence.Bova
There is a sentence splitter: edu.stanford.nlp.process.DocumentPreprocessor . Try the command: java edu.stanford.nlp.process.DocumentPreprocessor /u/nlp/data/lexparser/textDocument.txt > oneTokenizedSentencePerLine.txt . (This is done via a (good but heuristic) FSM, so it's fast; you're not running the probabilistic parser.)Crampon

Unfortunately I'm not a Ruby guy, but maybe an example in Perl will get you headed in the right direction: use a lookbehind for the ending punctuation and negative lookbehinds for the special cases, then match any amount of whitespace followed by a lookahead for a capital letter. I'm sure this isn't perfect, but I hope it points you in the right direction. Not sure how you would know if U.S.A. is actually at the end of the sentence...

#!/usr/bin/perl

$string = "Mr. Thompson is from the U.S.A. and is 75 years old. Dr. Bob is a dentist. This is a string that contains several sentences. For example this is one. Followed by another. Can it deal with a question?  It sure can!";

my @sentences = split(/(?:(?<=\.|\!|\?)(?<!Mr\.|Dr\.)(?<!U\.S\.A\.)\s+(?=[A-Z]))/, $string);

for (@sentences) {
    print $_."\n";
}
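Since the question asks for Ruby, here is a near-direct port of the Perl regex above. Ruby (1.9+) supports fixed-length lookbehind, so the same approach carries over; the sample string is shortened from the Perl example.

```ruby
# Ruby port of the Perl approach: split where ending punctuation is followed
# by whitespace and a capital letter, excluding a few known abbreviations.
string = "Mr. Thompson is from the U.S.A. and is 75 years old. Dr. Bob is a dentist."
sentences = string.split(/(?<=[.!?])(?<!Mr\.)(?<!Dr\.)(?<!U\.S\.A\.)\s+(?=[A-Z])/)
# sentences => ["Mr. Thompson is from the U.S.A. and is 75 years old.",
#               "Dr. Bob is a dentist."]
```

The same caveat applies: the abbreviation list has to be maintained by hand, and "U.S.A." at a true sentence end is still ambiguous.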
Arsyvarsy answered 13/5, 2009 at 23:59 Comment(0)

I agree with the accepted answer: using Stanford CoreNLP is a no-brainer.

However, in 2016 there are some incompatibilities between the Stanford Parser and later versions of Stanford CoreNLP (I had issues with Stanford CoreNLP v3.5).

Here is what I did to parse text into sentences using Ruby interfacing with Stanford Core NLP:

  1. Install the Stanford CoreNLP gem:

gem install stanford-core-nlp

  2. Then follow the instructions in the README for using the latest version of the Stanford CoreNLP:

Using the latest version of the Stanford CoreNLP (version 3.5.0 as of 31/10/2014) requires some additional manual steps:

  • Download Stanford CoreNLP version 3.5.0 from http://nlp.stanford.edu/.

  • Place the contents of the extracted archive inside the /bin/ folder of the stanford-core-nlp gem (e.g. [...]/gems/stanford-core-nlp-0.x/bin/) or inside the directory location configured by setting StanfordCoreNLP.jar_path.

  • Download the full Stanford Tagger version 3.5.0 from http://nlp.stanford.edu/.

  • Make a directory named 'taggers' inside the /bin/ folder of the stanford-core-nlp gem (e.g. [...]/gems/stanford-core-nlp-0.x/bin/) or inside the directory configured by setting StanfordCoreNLP.jar_path.

  • Place the contents of the extracted archive inside taggers directory.

  • Download the bridge.jar file from https://github.com/louismullie/stanford-core-nlp.

  • Place the downloaded bridge.jar file inside the /bin/ folder of the stanford-core-nlp gem (e.g. [...]/gems/stanford-core-nlp-0.x/bin/taggers/) or inside the directory configured by setting StanfordCoreNLP.jar_path.

Then the Ruby code to split text into sentences:

require "stanford-core-nlp"

#I downloaded the StanfordCoreNLP to a custom path:
StanfordCoreNLP.jar_path = "/home/josh/stanford-corenlp-full-2014-10-31/"
  
StanfordCoreNLP.use :english
StanfordCoreNLP.model_files = {}
StanfordCoreNLP.default_jars = [
  'joda-time.jar',
  'xom.jar',
  'stanford-corenlp-3.5.0.jar',
  'stanford-corenlp-3.5.0-models.jar',
  'jollyday.jar',
  'bridge.jar'
]

pipeline =  StanfordCoreNLP.load(:tokenize, :ssplit)

text = 'Mr. Josh Weir is writing some code. ' + 
  'I am Josh Weir Sr. my son may be Josh Weir Jr. etc. etc.'
text = StanfordCoreNLP::Annotation.new(text)
pipeline.annotate(text)
text.get(:sentences).each{|s| puts "sentence: " + s.to_s}
  
#output:
#sentence: Mr. Josh Weir is writing some code.
#sentence: I am Josh Weir Sr. my son may be Josh Weir Jr. etc. etc.
Landbert answered 27/12, 2016 at 2:48 Comment(0)

Maybe try splitting it up at a period followed by a space followed by an uppercase letter? I'm not sure how to match uppercase letters, but that's the pattern I'd start looking at.

Edit: Finding uppercase letters with Ruby.

Another Edit:

Check for sentence-ending punctuation that follows words that don't start with uppercase letters.
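A minimal sketch of the first heuristic, with the failure mode the comments point out (the sample text is mine):

```ruby
# Split where a period is followed by whitespace and an uppercase letter.
# As noted, this incorrectly splits after titles like "Mr.".
text = "I went home. It was late. Mr. Brown waved."
parts = text.split(/(?<=\.)\s+(?=[A-Z])/)
# parts => ["I went home.", "It was late.", "Mr.", "Brown waved."]
```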

Succinic answered 13/5, 2009 at 22:53 Comment(3)
What if you split it up at the periods that followed words that don't start with uppercase letters?Succinic
This was exactly what I came up with, but I wanted to know if there were even better solutions. Granted it wouldn't work if the sentence ended with a proper noun such as "I went to Italy."Snakemouth
one very common case that would fail on is names like "Mr. Dibbler"Procrustes

The answer by Dr. Manning is the most appropriate if you are using Java (and, with more effort, Ruby too). Here it is:

There is a sentence splitter: edu.stanford.nlp.process.DocumentPreprocessor. Try the command: java edu.stanford.nlp.process.DocumentPreprocessor /u/nlp/data/lexparser/textDocument.txt > oneTokenizedSentencePerLine.txt. (This is done via a (good but heuristic) FSM, so it's fast; you're not running the probabilistic parser.)

But a small suggestion: if we modify the command from java edu.stanford.nlp.process.DocumentPreprocessor /u/nlp/data/lexparser/textDocument.txt > oneTokenizedSentencePerLine.txt to java edu.stanford.nlp.process.DocumentPreprocessor -file /u/nlp/data/lexparser/textDocument.txt > oneTokenizedSentencePerLine.txt, it works better, because you need to specify what kind of file is being presented as input: -file for a text file, -html for HTML, etc.

Alluvial answered 23/2, 2011 at 11:52 Comment(0)

I've not tried it, but if English is the only language you are concerned with, I'd suggest giving Lingua::EN::Readability a look.

Lingua::EN::Readability is a Ruby module which calculates statistics on English text. It can supply counts of words, sentences and syllables. It can also calculate several readability measures, such as a Fog Index and a Flesch-Kincaid level. The package includes the module Lingua::EN::Sentence, which breaks English text into sentences heeding abbreviations, and Lingua::EN::Syllable, which can guess the number of syllables in a written English word. If a pronouncing dictionary is available it can look up the number of syllables in the dictionary for greater accuracy.

The bit you want is in sentence.rb as follows:

module Lingua
module EN
# The module Lingua::EN::Sentence takes English text, and attempts to split it
# up into sentences, respecting abbreviations.

module Sentence
  EOS = "\001" # temporary end of sentence marker

  Titles   = [ 'jr', 'mr', 'mrs', 'ms', 'dr', 'prof', 'sr', 'sen', 'rep', 
         'rev', 'gov', 'atty', 'supt', 'det', 'rev', 'col','gen', 'lt', 
         'cmdr', 'adm', 'capt', 'sgt', 'cpl', 'maj' ]

  Entities = [ 'dept', 'univ', 'uni', 'assn', 'bros', 'inc', 'ltd', 'co', 
         'corp', 'plc' ]

  Months   = [ 'jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 
         'aug', 'sep', 'oct', 'nov', 'dec', 'sept' ]

  Days     = [ 'mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun' ]

  Misc     = [ 'vs', 'etc', 'no', 'esp', 'cf' ]

  Streets  = [ 'ave', 'bld', 'blvd', 'cl', 'ct', 'cres', 'dr', 'rd', 'st' ]

  @@abbreviations = Titles + Entities + Months + Days + Streets + Misc

  # Split the passed text into individual sentences, trim these and return
  # as an array. A sentence is marked by one of the punctuation marks ".", "?"
  # or "!" followed by whitespace. Sequences of full stops (such as an
  # ellipsis marker "..." and stops after a known abbreviation are ignored.
  def Sentence.sentences(text)

    text = text.dup

    # initial split after punctuation - have to preserve trailing whitespace
    # for the ellipsis correction next
    # would be nicer to use look-behind and look-ahead assertions to skip
    # ellipsis marks, but Ruby doesn't support look-behind
    text.gsub!( /([\.?!](?:\"|\'|\)|\]|\})?)(\s+)/ ) { $1 << EOS << $2 }

    # correct ellipsis marks and rows of stops
    text.gsub!( /(\.\.\.*)#{EOS}/ ) { $1 }

    # correct abbreviations
    # TODO - precompile this regex?
    text.gsub!( /(#{@@abbreviations.join("|")})\.#{EOS}/i ) { $1 << '.' }

    # split on EOS marker, strip gets rid of trailing whitespace
    text.split(EOS).map { | sentence | sentence.strip }
  end

  # add a list of abbreviations to the list that's used to detect false
  # sentence ends. Return the current list of abbreviations in use.
  def Sentence.abbreviation(*abbreviations)
    @@abbreviations += abbreviations
    @@abbreviations
  end
end
end
end
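The essence of the gem's approach can be condensed to a few lines. This is a simplified sketch, not the gem's actual API; the abbreviation list and sample text are truncated for illustration.

```ruby
# The EOS-marker technique used above: mark every candidate boundary,
# un-mark the ones that follow a known abbreviation, then split on the marker.
EOS_MARK = "\001"  # control character unlikely to appear in real text
abbrevs = %w[mr mrs dr prof]

text = "Dr. Watson met Mr. Holmes. They talked."
marked = text.gsub(/([.?!])(\s+)/) { $1 + EOS_MARK + $2 }          # mark candidates
marked = marked.gsub(/(#{abbrevs.join("|")})\.#{EOS_MARK}/i) { $1 + "." }  # un-mark abbreviations
result = marked.split(EOS_MARK).map(&:strip)
# result => ["Dr. Watson met Mr. Holmes.", "They talked."]
```

The two-pass mark/un-mark trick is what lets the gem avoid lookbehind, which older Rubies lacked.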
Cargile answered 12/2, 2013 at 23:48 Comment(1)
Great points in that, though I found that, for speed on large amounts of text, rather than doing so many regex replacements it worked well to cycle through an array of words and compare against the terms you mention above plus other line-ending options. About 1000 times faster in my limited tests on large documents.Haldas

I'm not a Ruby guy, but a RegEx that splits on

 ^(Mr|Mrs|Ms|Mme|Sta|Sr|Sra|Dr|U\.S\.A)[\.\!\?\"] [A-Z]

would be my best bet, once you've got the paragraph (split on \r\n). This assumes that your sentences are properly cased.

Obviously this is a fairly ugly RegEx. What about forcing two spaces between sentences?

Purposeful answered 13/5, 2009 at 22:53 Comment(0)

Breaking on a period followed by a space and a capitalized letter wouldn't fly for titles like "Mr. Brown."

The periods make things difficult, but exclamation points and question marks are an easier case to handle. However, there are cases that would break even those, e.g. the corporate name Yahoo!

Narceine answered 13/5, 2009 at 22:53 Comment(0)

Well, obviously paragraph.split('.') won't cut it.

#split will take a regex as an argument, so you might try using a zero-width lookbehind to check for a word starting with a capital letter. Of course this will split on proper nouns, so you may have to resort to a regex like /(Mr\.|Mrs\.|U\.S\.A ...)/, which would be horrendously ugly unless you built the regex programmatically.
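A sketch of that programmatic construction (the abbreviation list is illustrative; since Ruby's regex engine requires each lookbehind to be fixed-length, one negative lookbehind per abbreviation is the safe way to assemble it):

```ruby
# Build one negative lookbehind per abbreviation, then split on
# period + whitespace + capital letter, excluding those abbreviations.
abbrevs = %w[Mr Mrs Dr Prof]
neg = abbrevs.map { |a| "(?<!#{Regexp.escape(a)}\\.)" }.join
splitter = /(?<=\.)#{neg}\s+(?=[A-Z])/
parts = "Dr. No met Mr. Bond. He left.".split(splitter)
# parts => ["Dr. No met Mr. Bond.", "He left."]
```

Adding a new abbreviation is then a one-element change to the array rather than a hand-edit of the regex.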

Woolley answered 13/5, 2009 at 22:57 Comment(0)

I think this is not always solvable, but you could split on ". " (a period followed by a space) and verify that the word before the period isn't in a list of words like Mr, Dr, etc.

But of course your list may omit some words, and in that case you will get bad results.
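A rough sketch of this idea (the word list and sample text are illustrative; note that String#split discards the ". " delimiter, so interior sentence-final periods are lost):

```ruby
# Split on ". ", then glue back any piece whose preceding piece ends in a
# known abbreviation, since that period was not a real sentence boundary.
def rejoin_abbreviations(parts, abbrevs)
  parts.each_with_object([]) do |p, out|
    last_word = out.empty? ? nil : out.last[/\w+\z/]
    if last_word && abbrevs.include?(last_word)
      out[-1] = "#{out.last}. #{p}"   # false boundary: re-attach
    else
      out << p
    end
  end
end

parts = "Mr. Smith arrived. He sat down.".split(". ")
rejoin_abbreviations(parts, %w[Mr Mrs Dr Prof])
# => ["Mr. Smith arrived", "He sat down."]
```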

Blooper answered 13/5, 2009 at 22:59 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.