How can I split a text into sentences?

I have a text file. I need to get a list of sentences.

How can this be implemented? There are a lot of subtleties, such as a dot being used in abbreviations.

My old regular expression works badly:

re.compile('(\. |^|!|\?)([A-Z][^;↑\.<>@\^&/\[\]]*(\.|!|\?) )',re.M)
Knot answered 1/1, 2011 at 22:18 Comment(0)
198

The Natural Language Toolkit (nltk.org) has what you need. This group posting indicates this does it:

import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
fp = open("test.txt")
data = fp.read()
print('\n-----\n'.join(tokenizer.tokenize(data)))

(I haven't tried it!)
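
As the comments below note, a shorter route is nltk.sent_tokenize, which may need a one-time nltk.download('punkt') first; a minimal sketch:

import nltk

nltk.download('punkt')  # one-time download of the Punkt tokenizer models
with open("test.txt") as fp:
    data = fp.read()
print('\n-----\n'.join(nltk.sent_tokenize(data)))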

Immensity answered 1/1, 2011 at 22:27 Comment(12)
Thanks, I hope this library works with the Russian language.Knot
@Artyom: It probably can work with Russian -- see can NLTK/pyNLTK work “per language” (i.e. non-english), and how?.Electro
@Artyom: Here's a direct link to the online documentation for nltk.tokenize.punkt.PunktSentenceTokenizer.Electro
You might have to execute nltk.download() first and download models -> punktEnsheathe
to save some typing: import nltk then nltk.sent_tokenize(string)Cinderella
This fails on cases with ending quotation marks. If we have a sentence that ends like "this."Decare
@Decare I think that's not a valid sentence, the quotation mark shall precede the period.Brickbat
Okay, you convinced me. But I just tested and it does not seem to fail. My input is 'This fails on cases with ending quotation marks. If we have a sentence that ends like "this." This is another sentence.' and my output is ['This fails on cases with ending quotation marks.', 'If we have a sentence that ends like "this."', 'This is another sentence.'] Seems correct for me.Brickbat
After compiling comments from all people, it still fails to parse the following sentence: "FIG. 1A is a simplified pin out diagram for an integrated circuit which includes a serial peripheral interface I/O according to an embodiment of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims herein." It separates "FIG" and "1A". I have to add a special "if" statement to handle this. if "FIG." in text: text = text.replace("FIG.","FIG<prd>") This is very ad hoc, but I am not sure if there's a better way to generalize it.Insanitary
How to get an array of sentences from the text?Treed
This fails on the simple example 4. Building, sculpting, moving, and mending things in hard to reach places and at small scales (e.g. dig tunnels, deliver adhesives to cracks).Volitant
@Brickbat en.wikipedia.org/wiki/… Anyway, we can't control the grammar of the text we're analyzingTyrothricin
164

This function can split the entire text of Huckleberry Finn into sentences in about 0.1 seconds and handles many of the more painful edge cases that make sentence parsing non-trivial, e.g. "Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst."

# -*- coding: utf-8 -*-
import re
alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov|edu|me)"
digits = "([0-9])"
multiple_dots = r'\.{2,}'

def split_into_sentences(text: str) -> list[str]:
    """
    Split the text into sentences.

    If the text contains substrings "<prd>" or "<stop>", they would lead 
    to incorrect splitting because they are used as markers for splitting.

    :param text: text to be split into sentences
    :type text: str

    :return: list of sentences
    :rtype: list[str]
    """
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text)
    text = re.sub(multiple_dots, lambda match: "<prd>" * len(match.group(0)) + "<stop>", text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub("\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = [s.strip() for s in sentences]
    if sentences and not sentences[-1]: sentences = sentences[:-1]
    return sentences

Comparison with nltk:

>>> from nltk.tokenize import sent_tokenize

Example 1: split_into_sentences is better here (because it explicitly covers a lot of cases):

>>> text = 'Some sentence. Mr. Holmes...This is a new sentence!And This is another one.. Hi '

>>> split_into_sentences(text)
['Some sentence.',
 'Mr. Holmes...',
 'This is a new sentence!',
 'And This is another one..',
 'Hi']

>>> sent_tokenize(text)
['Some sentence.',
 'Mr.',
 'Holmes...This is a new sentence!And This is another one.. Hi']

Example 2: nltk.tokenize.sent_tokenize is better here (because it uses an ML model):

>>> text = 'The U.S. Drug Enforcement Administration (DEA) says hello. And have a nice day.'

>>> split_into_sentences(text)
['The U.S.',
 'Drug Enforcement Administration (DEA) says hello.',
 'And have a nice day.']

>>> sent_tokenize(text)
['The U.S. Drug Enforcement Administration (DEA) says hello.',
 'And have a nice day.']
Primal answered 19/7, 2015 at 20:50 Comment(14)
This is an awesome solution. However, I added two more lines to it: digits = "([0-9])" in the declaration of regular expressions, and text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text) in the function. Now it does not split the line at decimals such as 5.5. Thank you for this answer.Puduns
How did you parse the entire Huckleberry Finn? Where's that in text format?Rubbish
A great solution. In the function, I added if "e.g." in text: text = text.replace("e.g.","e<prd>g<prd>") if "i.e." in text: text = text.replace("i.e.","i<prd>e<prd>") and it fully solved my problem.Mozellemozes
Great solution with very helpful comments! Just to make it a little more robust though: prefixes = "(Mr|St|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|Mt)[.]", websites = "[.](com|net|org|io|gov|me|edu)", and if "..." in text: text = text.replace("...","<prd><prd><prd>")Grimona
Can this function be made to see sentences like this as one sentence: When a child asks her mother "Where do babies come from?", what should one reply to her?Sneakbox
This is super useful for running jobs with pig, where jython doesn't natively come with nltk. However it seems to completely discard non-ascii sentences.Mummy
what about decimal numbers? text = re.sub(" (\d+)[.](\d+) "," \\1<prd>\\2 ",text)Albemarle
Does this work well in comparison to the big data methods?Coda
How to include this corner case also: Thank you for contacting back. Request you to please help us with the transaction ID for $<***>.92 ? - Charlie.Prosthodontics
Hmm sentence "The U.S. Drug Enforcement Administration (DEA) says hello. And have a nice day." is splitting after "U.S." for some reason.Plotinus
Awesome. One improvement: if the final sentence does not have a dot at the end, it is not included. "A sentence. A second sentence. A third sentence witout a final dot" --> ['A sentence.', 'A second sentence.']Averell
@Averell did you solve the case when the last sentence does not have a dot ?Broncobuster
No, I did not try; I used nltk instead.Averell
@Averell this case (where the last sentence does not end in a dot) is now solvedBasically
96

Instead of using regex for splitting the text into sentences, you can also use the nltk library.

>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."

>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']

ref: https://mcmap.net/q/137284/-how-to-break-up-a-paragraph-by-sentences-in-python

Ex answered 30/10, 2017 at 13:34 Comment(5)
Great, simpler and more reusable example than the accepted answer.Natch
If you remove a space after a dot, tokenize.sent_tokenize() doesn't work, but tokenizer.tokenize() works! Hmm...Lactation
for sentence in tokenize.sent_tokenize(text): print(sentence)Filide
can i limit it to like 2 sentences only?Justle
I found that nltk.tokenize.sent_tokenize splits sentences incorrectly when it encounters i.e., e.g., etc. and other abbreviations.Enchanting
20

You can try using spaCy instead of regex. I use it and it does the job.

import spacy
nlp = spacy.load('en_core_web_sm')

text = '''Your text here'''
tokens = nlp(text)

for sent in tokens.sents:
    print(sent.text.strip())
Gurevich answered 10/1, 2018 at 12:3 Comment(4)
spaCy is mega great. But if you just need to separate text into sentences, passing it through the full spaCy pipeline will take too long if you are dealing with a data pipeline.Carpet
@Berlines I agree but couldn't find any other library that does the job as clean as spaCy. But if you have any suggestion, I can try.Gurevich
Also for the AWS Lambda Serverless users out there, spaCy's support data files are many hundreds of MB (the large English model is > 400 MB), so you can't use things like this out of the box, very sadly (huge fan of spaCy here)Philter
I found spaCy quite bad at splitting my texts into sentences; it produced some phantom sentences containing just a dot.Enchanting
10

Here is a middle-of-the-road approach that doesn't rely on any external libraries. I use list comprehensions to exclude overlaps between abbreviations and terminators, as well as overlaps between variations on terminators, for example: '.' vs. '."'

abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior',
                 'i.e.': 'for example', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
    end = True
    sentences = []
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences


def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end


def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)

I used Karl's find_all function from this entry: Find all occurrences of a substring in Python

Equalizer answered 22/1, 2015 at 15:59 Comment(2)
Perfect approach! The others don't catch ... and ?!.Embosser
Nice work. One note: "i.e." translates as "That is", not "For example".Jerrold
10

I love spaCy to death, but I recently discovered two new approaches for sentence tokenization. One is BlingFire from Microsoft (incredibly fast), and the other is PySBD from AI2 (supremely accurate).

text = ...

from blingfire import text_to_sentences
sents = text_to_sentences(text).split('\n')

from pysbd import Segmenter
segmenter = Segmenter(language='en', clean=False)
sents = segmenter.segment(text)

I separated 20k sentences using five different methods. Here are the elapsed times on an AMD Threadripper Linux machine:

  • spaCy Sentencizer: 1.16934s
  • spaCy Parse: 25.97063s
  • PySBD: 9.03505s
  • NLTK: 0.30512s
  • BlingFire: 0.07933s
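
For context, the two spaCy rows above presumably refer to a blank pipeline with only the rule-based sentencizer versus the full pretrained pipeline; a minimal sketch of both setups (assuming en_core_web_sm is installed):

import spacy

# "spaCy Sentencizer": blank pipeline with only the rule-based sentencizer
nlp_fast = spacy.blank("en")
nlp_fast.add_pipe("sentencizer")

# "spaCy Parse": full pretrained pipeline; sentence boundaries come from the parser
nlp_full = spacy.load("en_core_web_sm")

text = "How are you today? I hope you have a great day."
print([s.text for s in nlp_fast(text).sents])
print([s.text for s in nlp_full(text).sents])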

UPDATE: I tried using BlingFire on all-lowercase text, and it failed miserably. I'm going to use PySBD in my projects for the time being.

Gobioid answered 20/9, 2022 at 18:52 Comment(2)
BlingFire doesn't work on ARM Linux or macOS currently.Volitant
When testing with a subset of the 51 English "golden rules" for sentence segmentation defined here (s3.amazonaws.com/tm-town-nlp-resources/golden_rules.txt, which is from github.com/diasks2/pragmatic_segmenter), BlingFire was the most accurate option and only slightly slower than NLTK with nltk.tokenize.sent_tokenize(text). Said subset was relevant for my purposes and included these 33 rules only: pastebin.com/raw/xqJATfcXRatha
9

You can also use the sentence tokenization function in NLTK:

from nltk.tokenize import sent_tokenize
sentence = "As the most quoted English writer Shakespeare has more than his share of famous quotes.  Some Shakespare famous quotes are known for their beauty, some for their everyday truths and some for their wisdom. We often talk about Shakespeare’s quotes as things the wise Bard is saying to us but, we should remember that some of his wisest words are spoken by his biggest fools. For example, both ‘neither a borrower nor a lender be,’ and ‘to thine own self be true’ are from the foolish, garrulous and quite disreputable Polonius in Hamlet."

sent_tokenize(sentence)
Crowe answered 27/6, 2019 at 9:9 Comment(1)
I tried to use it because nltk is a very, very good library, but it fails on abbreviations, splitting where it should not.Enchanting
7

For simple cases (where sentences are terminated normally), this should work:

import re
text = ''.join(open('somefile.txt').readlines())
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

The regex splits on a sentence terminator (., ? or !), optionally followed by closing quotes or brackets, together with any surrounding spaces. As noted in the comments, ending the pattern with ' +' instead of ' *' (requiring at least one trailing space) helps prevent something like the period in re.split from being counted as a change in sentence.

Obviously, not the most robust solution, but it'll do fine in most cases. The only case this won't cover is abbreviations (perhaps run through the list of sentences and check that each string in sentences starts with a capital letter?)
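
A rough sketch of that capital-letter check (hypothetical post-processing; since re.split above consumed the terminators, merged fragments are simply rejoined with a space):

# Merge fragments that don't start with a capital letter back into the
# previous piece (e.g. splits caused by abbreviations).
merged = []
for piece in sentences:
    if not piece:
        continue  # skip empty strings left over from the split
    if merged and not piece[0].isupper():
        merged[-1] = merged[-1] + " " + piece
    else:
        merged.append(piece)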

Druce answered 1/1, 2011 at 22:34 Comment(5)
You can't think of a situation in English where a sentence doesn't end with a period? Imagine that! My response to that would be, "think again." (See what I did there?)Immensity
@Ned wow, can't believe I was that stupid. I must be drunk or something.Druce
I am using Python 2.7.2 on Win 7 x86, and the regex in the above code gives me this error: SyntaxError: EOL while scanning string literal, pointing to the closing parenthesis (after text). Also, the regex you reference in your text does not exist in your code sample.Dunite
The regex is not completely correct, as it should be r' *[\.\?!][\'"\)\]]* +'Cyanate
It may cause many problems and chunk a sentence into smaller pieces as well. Consider the case where we have "I paid $3.5 for this ice cream"; then the chunks are "I paid $3" and "5 for this ice cream". Using the default nltk sentence tokenizer is safer!Committal
6

Using spaCy:

import spacy

nlp = spacy.load('en_core_web_sm')
text = "How are you today? I hope you have a great day"
tokens = nlp(text)
for sent in tokens.sents:
    print(sent.text.strip())
Hypertension answered 15/12, 2020 at 16:0 Comment(0)
3

Might as well throw this in, since this is the first post that shows up when searching for splitting text into chunks of n sentences.

This works with a variable split length, which indicates how many sentences are joined together in each chunk.

import nltk
# nltk.download('punkt')  # uncomment on first run to fetch the tokenizer models
from more_itertools import windowed

split_length = 3  # e.g. join 3 sentences per chunk

elements = nltk.tokenize.sent_tokenize(text)  # text is the input string
segments = windowed(elements, n=split_length, step=split_length)
text_splits = []
for seg in segments:
    txt = " ".join([t for t in seg if t])
    if len(txt) > 0:
        text_splits.append(txt)
Handmaiden answered 14/3, 2021 at 21:36 Comment(0)
3

If NLTK's sent_tokenize is not an option (e.g. it needs a lot of GPU RAM on long text) and regex doesn't work properly across languages, the sentence-splitter library might be worth a try.
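
A minimal sketch, assuming the sentence-splitter package from PyPI (pip install sentence-splitter):

from sentence_splitter import SentenceSplitter

splitter = SentenceSplitter(language='en')
print(splitter.split('This is a paragraph. It contains several sentences. "But why," you ask?'))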

Yacov answered 30/6, 2021 at 20:13 Comment(0)
3

Using Stanza, a natural language processing library that works for many human languages:

import stanza

stanza.download('en')
nlp = stanza.Pipeline(lang='en', processors='tokenize')

doc = nlp(t_en)  # t_en holds the English input text
for sentence in doc.sentences:
    print(sentence.text)
Carnivorous answered 7/12, 2021 at 8:52 Comment(1)
This is fantastic, though if you are going to use it, please use the multilingual model (stanfordnlp.github.io/stanza/langid.html)Kenleigh
2

Also, be wary of additional top-level domains that aren't included in some of the answers above.

For example, .info, .biz, .ru, and .online will trip up some sentence parsers but aren't included above.

Here's some info on frequency of top level domains: https://www.westhost.com/blog/the-most-popular-top-level-domains-in-2017/

That could be addressed by editing the code above to read:

alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = "(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = r"[.](com|net|org|io|gov|ai|edu|co\.uk|ru|info|biz|online)"
Filiate answered 23/10, 2020 at 8:38 Comment(2)
This is helpful information, but it might be more appropriate to add it as a short comment on the original answer.Thievery
That was my original plan, but I don't have the reputation for that yet apparently. Thought this might help someone so I thought I'd post it as best I could. If there's a way to do it and get around the "you need 50 reputation" first, I'd love to :)Filiate
1

You could make a new tokenizer for Russian (and some other languages) using this function:

def russianTokenizer(text):
    result = text
    result = result.replace('.', ' . ')
    result = result.replace(' .  .  . ', ' ... ')
    result = result.replace(',', ' , ')
    result = result.replace(':', ' : ')
    result = result.replace(';', ' ; ')
    result = result.replace('!', ' ! ')
    result = result.replace('?', ' ? ')
    result = result.replace('\"', ' \" ')
    result = result.replace('\'', ' \' ')
    result = result.replace('(', ' ( ')
    result = result.replace(')', ' ) ') 
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.strip()
    result = result.split(' ')
    return result

and then call it in this way:

text = 'вы выполняете поиск, используя Google SSL;'
tokens = russianTokenizer(text)
Rustic answered 28/1, 2012 at 17:42 Comment(1)
This badly splits text into words, not sentences.Bilow
1

No doubt NLTK is the most suitable for the purpose. But getting started with NLTK is quite painful (though once you install it, you just reap the rewards).

So here is some simple re-based code, available at http://pythonicprose.blogspot.com/2009/09/python-split-paragraph-into-sentences.html

# split up a paragraph into sentences
# using regular expressions


def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    import re
    # to split by multiple characters

    #   regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList


if __name__ == '__main__':
    p = """This is a sentence.  This is an excited sentence! And do you think this is a question?"""

    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print(s.strip())

#output:
#   This is a sentence
#   This is an excited sentence

#   And do you think this is a question 
Pfeiffer answered 14/5, 2012 at 1:59 Comment(1)
Yeah, but this fails so easily, with: "Mr. Smith knows this is a sentence."Roche
1

I hope this will help you with Latin, Chinese, and Arabic text:

import re

punctuation = re.compile(r"([^\d+])(\.|!|\?|;|\n|。|!|?|;|…| |!|؟|؛)+")
lines = []

with open('myData.txt','r',encoding="utf-8") as myFile:
    lines = punctuation.sub(r"\1\2<pad>", myFile.read())
    lines = [line.strip() for line in lines.split("<pad>") if line.strip()]
Tyrant answered 13/5, 2020 at 6:10 Comment(0)
1

I was working on a similar task and came across this question. After following a few links and working through a few NLTK exercises, the code below worked for me like magic.

from nltk.tokenize import sent_tokenize 
  
text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article"
sent_tokenize(text) 

output:

['Hello everyone.',
 'Welcome to GeeksforGeeks.',
 'You are studying NLP article']

Source: https://www.geeksforgeeks.org/nlp-how-tokenizing-text-sentence-words-works/

Cloddish answered 7/7, 2020 at 9:53 Comment(0)
1

Using Spacy v3.5:

import spacy

nlp_sentencizer = spacy.blank("en")
nlp_sentencizer.add_pipe("sentencizer")

text = "How are you today? I hope you have a great day"
tokens = nlp_sentencizer(text)
[str(sent) for sent in tokens.sents]
Volitant answered 17/2, 2023 at 10:27 Comment(0)
0

I had to read subtitle files and split them into sentences. After pre-processing (like removing the timing information in the .srt files), the variable fullFile contained the full text of the subtitle file. The crude way below neatly split them into sentences. Probably I was lucky that the sentences always ended (correctly) with a space. Try this first, and if it has any exceptions, add more checks and balances.

import re

# Very approximate way to split the text into sentences - break after ?, . and !
fullFile = re.sub(r"(\!|\?|\.) ", "\\1<BRK>", fullFile)
sentences = fullFile.split("<BRK>")
sentFile = open("./sentences.out", "w+")
for line in sentences:
    sentFile.write(line)
    sentFile.write("\n")
sentFile.close()

Oh well. I now realize that since my content was Spanish, I did not have the issues of dealing with "Mr. Smith" etc. Still, if someone wants a quick and dirty parser...

Spermatozoid answered 14/3, 2018 at 21:49 Comment(0)
0

Using spaCy:

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'This is first.This is second.This is third ')
for sentence in doc.sents:
  print(sentence)

But if you want to get a sentence by index, for example:

# doesn't work: doc.sents is a generator and cannot be indexed
doc.sents[0]

Use

list( doc.sents)[0]
Astragalus answered 20/2, 2022 at 15:19 Comment(1)
It should be "for sentence in doc.sents:".Dandrea
0

(?<!\w\.\w.)(?<![A-Z]\.)(?<=\.|\?)\s(?=[A-Z])

We could instead use this regex to avoid situations where an abbreviation is treated as the end of a sentence.
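
For illustration, a usage sketch with my own example text: the lookbehinds skip acronyms like U.S.A. and single-letter initials, though multi-letter abbreviations such as Dr. will still trigger a split.

import re

# Split on whitespace that follows . or ? and precedes a capital letter,
# unless it is preceded by an acronym (U.S.A.) or a single-letter initial (J.).
pattern = r"(?<!\w\.\w.)(?<![A-Z]\.)(?<=\.|\?)\s(?=[A-Z])"
text = "The U.S.A. team won. J. K. Rowling wrote it. Was it good?"
print(re.split(pattern, text))
# ['The U.S.A. team won.', 'J. K. Rowling wrote it.', 'Was it good?']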

Ogletree answered 4/2 at 15:43 Comment(1)
Can you show an example in which this regex works better than an alternative?Loth
