How to break up a paragraph by sentences in Python

I need to parse sentences from a paragraph in Python. Is there an existing package to do this, or should I be trying to use regex here?

Hammock answered 28/2, 2012 at 0:6 Comment(5)
Are there double-spaces after the end of each sentence? - Dolorous
Your problem statement doesn't provide sufficient information for us to work with. - Sappy
There are some answers here: #116994 - Snake
"Purely syntactic approaches using regexps sound problematic... just think of the 5.5 ways that Prof. Smith from the U.S. told us periods can be used." - Beograd
These things are usually done by dedicated sentence splitter tools or library modules. Trying to do it with regexes alone is not going to produce good results. The better splitters have been trained. - Interdiction
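
To make the concern in these comments concrete, here is a minimal sketch of a naive regex split. The helper name `naive_split` and the pattern are assumptions for illustration only, not something from the question: it splits on whitespace that follows `.`, `!` or `?`, and it cuts the abbreviation "Dr." off from the name that follows it.

```python
import re

def naive_split(text):
    # Hypothetical helper: split wherever whitespace follows '.', '!' or '?'.
    return re.split(r'(?<=[.!?])\s+', text.strip())

p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."
print(naive_split(p))
# ['Good morning Dr.', 'Adams.', 'The patient is waiting for you in room number 3.']
```

The first sentence comes back as two pieces, which is the kind of error a purely syntactic rule tends to make.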

The nltk.tokenize module is designed for this and handles edge cases such as abbreviations (note "Dr." in the example). For example:

>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."
>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']
Oneeyed answered 28/2, 2012 at 0:34 Comment(0)

Here is how I am getting the first n sentences:

import itertools

def get_first_n_sentence(text, n):
    """Return the first n sentences, splitting naively on '.', '?' and '!'."""
    endsentence = ".?!"
    # groupby yields alternating runs of ordinary text and end punctuation
    groups = itertools.groupby(text, lambda ch: ch in endsentence)
    pieces = []
    count = 0
    for is_punct, chars in groups:
        if count >= n:
            break
        pieces.append(''.join(chars).replace('\n', ' '))
        if is_punct:
            count += 1  # a punctuation run closes one sentence

    return ''.join(pieces)

Reference: http://www.daniweb.com/software-development/python/threads/303844

Hammock answered 28/2, 2012 at 0:23 Comment(1)
It will not work if the text contains URLs, or terms that include a period, such as Ms., Dr., etc. - Chive
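
As a rough illustration of one way around the abbreviation problem, the sketch below splits on whitespace after end punctuation and then re-joins any piece that ends in a known abbreviation. The function name `split_sentences` and the tiny `ABBREVIATIONS` set are hypothetical; a usable list would need to be much longer, and a trained splitter is still likely to do better.

```python
import re

# Hypothetical, deliberately tiny abbreviation list for illustration only.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof."}

def split_sentences(text):
    """Split after '.', '!' or '?' plus whitespace, merging pieces that end in an abbreviation."""
    pieces = re.split(r'(?<=[.!?])\s+', text.strip())
    sentences = []
    buffer = ""
    for piece in pieces:
        if not piece:
            continue  # empty input produces a single empty piece
        buffer = f"{buffer} {piece}" if buffer else piece
        last_word = buffer.rsplit(None, 1)[-1].lower()
        if last_word not in ABBREVIATIONS:
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)  # text that ended on an abbreviation
    return sentences

print(split_sentences("Good morning Dr. Adams. The patient is waiting for you in room number 3."))
# ['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']
```

Because the split requires whitespace after the punctuation, periods inside a URL such as `http://example.com/a.b` are left alone, but any abbreviation missing from the list will still break a sentence.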
