How can I split a text into sentences?

Asked 1/1, 2011 at 22:18 Answered 4/2 at 15:43

186

I have a text file. I need to get a list of sentences.

How can this be implemented? There are a lot of subtleties, such as a dot being used in abbreviations.

My old regular expression works badly:

re.compile('(\. |^|!|\?)([A-Z][^;↑\.<>@\^&/\[\]]*(\.|!|\?) )',re.M)

Knot answered 1/1, 2011 at 22:18 Comment(0)

198

The Natural Language Toolkit (nltk.org) has what you need. This group posting indicates this does it:

import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
with open("test.txt") as fp:
    data = fp.read()
print('\n-----\n'.join(tokenizer.tokenize(data)))

(I haven't tried it!)

Immensity answered 1/1, 2011 at 22:27 Comment(12)

Thanks, I hope this library works with Russian. – Knot 1/1, 2011 at 23:10

@Artyom: It probably can work with Russian -- see can NLTK/pyNLTK work “per language” (i.e. non-english), and how?. – Electro 2/1, 2011 at 0:28

@Artyom: Here's direct link to the online documentation for nltk .tokenize.punkt.PunktSentenceTokenizer. – Electro 2/1, 2011 at 0:32

You might have to execute nltk.download() first and download models -> punkt – Ensheathe 12/1, 2015 at 18:36

to save some typing: import nltk then nltk.sent_tokenize(string) – Cinderella 22/3, 2017 at 2:30

This fails on cases with ending quotation marks. If we have a sentence that ends like "this." – Decare 21/2, 2018 at 5:16

@Decare I think that's not a valid sentence, the quotation mark shall precede the period. – Brickbat 31/10, 2019 at 9:55

Okay, you convinced me. But I just tested and it does not seem to fail. My input is

'This fails on cases with ending quotation marks. If we have a sentence that ends like "this." This is another sentence.'

and my output is

['This fails on cases with ending quotation marks.',  'If we have a sentence that ends like "this."',  'This is another sentence.']

Seems correct for me. – Brickbat 31/10, 2019 at 10:37

After compiling comments from all people, it still fails to parse the following sentence: "FIG. 1A is a simplified pin out diagram for an integrated circuit which includes a serial peripheral interface I/O according to an embodiment of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims herein." It separates "FIG" and "1A". I have to add a special "if" statement to handle this. if "FIG." in text: text = text.replace("FIG.","FIG<prd>") This is very ad hoc, but I am not sure if there's a better way to generalize it. – Insanitary 6/5, 2020 at 15:41

How to get an array of sentences from the text? – Treed 21/8, 2021 at 2:30

This fails on the simple example

4. Building, sculpting, moving, and mending things in hard to reach places and at small scales (e.g. dig tunnels, deliver adhesives to cracks)

. – Volitant 17/2, 2023 at 10:8

@Brickbat en.wikipedia.org/wiki/… Anyway, we can't control the grammar of the text we're analyzing – Tyrothricin 3/5, 2023 at 1:3

164

This function can split the entire text of Huckleberry Finn into sentences in about 0.1 seconds and handles many of the more painful edge cases that make sentence parsing non-trivial e.g. "Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst."

# -*- coding: utf-8 -*-
import re
alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov|edu|me)"
digits = "([0-9])"
multiple_dots = r'\.{2,}'

def split_into_sentences(text: str) -> list[str]:
    """
    Split the text into sentences.

    If the text contains substrings "<prd>" or "<stop>", they would lead 
    to incorrect splitting because they are used as markers for splitting.

    :param text: text to be split into sentences
    :type text: str

    :return: list of sentences
    :rtype: list[str]
    """
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text)
    text = re.sub(multiple_dots, lambda match: "<prd>" * len(match.group(0)) + "<stop>", text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = [s.strip() for s in sentences]
    if sentences and not sentences[-1]: sentences = sentences[:-1]
    return sentences

Comparison with `nltk`:

>>> from nltk.tokenize import sent_tokenize

Example 1: split_into_sentences is better here (because it explicitly covers a lot of cases):

>>> text = 'Some sentence. Mr. Holmes...This is a new sentence!And This is another one.. Hi '

>>> split_into_sentences(text)
['Some sentence.',
 'Mr. Holmes...',
 'This is a new sentence!',
 'And This is another one..',
 'Hi']

>>> sent_tokenize(text)
['Some sentence.',
 'Mr.',
 'Holmes...This is a new sentence!And This is another one.. Hi']

Example 2: nltk.tokenize.sent_tokenize is better here (because it uses an ML model):

>>> text = 'The U.S. Drug Enforcement Administration (DEA) says hello. And have a nice day.'

>>> split_into_sentences(text)
['The U.S.',
 'Drug Enforcement Administration (DEA) says hello.',
 'And have a nice day.']

>>> sent_tokenize(text)
['The U.S. Drug Enforcement Administration (DEA) says hello.',
 'And have a nice day.']
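
The core trick in the function above (mask the dots that do not end a sentence, mark the real sentence ends, then restore the masked dots) can be shown in a much smaller form. This is only a minimal sketch, not a replacement for the full function: the names ABBREVIATIONS and protect_and_split are hypothetical, the abbreviation list is deliberately tiny, and ellipses, acronyms and quotation marks are not handled.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (illustration only).
ABBREVIATIONS = ("Mr", "Mrs", "Dr", "e.g", "i.e")


def protect_and_split(text: str) -> list[str]:
    """Split on . ! ? after masking dots that do not end a sentence."""
    for abbr in ABBREVIATIONS:
        # Mask every dot in the abbreviation, e.g. "e.g." -> "e<prd>g<prd>".
        pattern = r"\b" + re.escape(abbr) + r"\."
        masked = abbr.replace(".", "<prd>") + "<prd>"
        text = re.sub(pattern, masked, text)
    # Mask decimal points such as 5.5.
    text = re.sub(r"(\d)\.(\d)", r"\1<prd>\2", text)
    # Mark the remaining terminators as real sentence ends.
    text = re.sub(r"([.!?])", r"\1<stop>", text)
    # Restore the masked dots and split on the markers.
    text = text.replace("<prd>", ".")
    return [s.strip() for s in text.split("<stop>") if s.strip()]


print(protect_and_split("Dr. Adams paid 5.5 dollars, e.g. for tea. Was it good? Yes"))
# ['Dr. Adams paid 5.5 dollars, e.g. for tea.', 'Was it good?', 'Yes']
```

As with the full function, input that already contains the literal strings "<prd>" or "<stop>" would be split incorrectly.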

Primal answered 19/7, 2015 at 20:50 Comment(14)

This is an awesome solution. However I added two more lines to it digits = "([0-9])" in the declaration of regular expressions and text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text) in the function. Now it does not split the line at decimals such as 5.5. Thank you for this answer. – Puduns 17/7, 2016 at 11:12

How did you parse the entire Huckleberry Finn? Where's that in text format? – Rubbish 4/2, 2017 at 10:52

A great solution. In the function, I added if "e.g." in text: text = text.replace("e.g.","e<prd>g<prd>") if "i.e." in text: text = text.replace("i.e.","i<prd>e<prd>") and it fully solved my problem. – Mozellemozes 1/6, 2017 at 8:9

Great solution with very helpful comments! Just to make it a little more robust though: prefixes = "(Mr|St|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|Mt)[.]", websites = "[.](com|net|org|io|gov|me|edu)", and if "..." in text: text = text.replace("...","<prd><prd><prd>") – Grimona 26/1, 2018 at 19:2

Can this function be made to see sentences like this as one sentence: When a child asks her mother "Where do babies come from?", what should one reply to her? – Sneakbox 29/4, 2018 at 6:54

This is super useful for running jobs with pig, where jython doesn't natively come with nltk. However it seems to completely discard non-ascii sentences. – Mummy 5/2, 2020 at 1:41

what about decimal numbers? text = re.sub(" (\d+)[.](\d+) "," \\1<prd>\\2 ",text) – Albemarle 8/8, 2020 at 15:4

Does this work well in comparison to the big data methods? – Coda 20/4, 2021 at 1:42

How to include this corner case also: Thank you for contacting back. Request you to please help us with the transaction ID for $<***>.92 ? - Charlie. – Prosthodontics 6/8, 2021 at 5:20

Hmm sentence "The U.S. Drug Enforcement Administration (DEA) says hello. And have a nice day." is splitting after "U.S." for some reason. – Plotinus 6/10, 2021 at 9:12

Awesome. One possible improvement: if the final sentence does not have a dot at the end, it is not included. "A sentence. A second sentence. A third sentence witout a final dot" --> ['A sentence.', 'A second sentence.'] – Averell 7/12, 2021 at 21:39

@Averell did you solve the case when the last sentence does not have a dot ? – Broncobuster 13/1, 2022 at 8:29

No, I did not try and used nltk instead. – Averell 13/1, 2022 at 13:50

@Averell this case (where the last sentence does not end in a dot) is now solved – Basically 2/5, 2023 at 14:14

Instead of using a regex to split the text into sentences, you can also use the nltk library.

>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."

>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']

ref: https://mcmap.net/q/137284/-how-to-break-up-a-paragraph-by-sentences-in-python

Ex answered 30/10, 2017 at 13:34 Comment(5)

Great, simpler and more reusable example than the accepted answer. – Natch 8/8, 2019 at 14:49

If you remove a space after a dot, tokenize.sent_tokenize() doesn't work, but tokenizer.tokenize() works! Hmm... – Lactation 8/8, 2019 at 21:32

for sentence in tokenize.sent_tokenize(text): print(sentence) – Filide 27/2, 2020 at 19:35

can i limit it to like 2 sentences only? – Justle 11/11, 2021 at 6:49

I found that nltk.tokenize.sent_tokenize gives results with faulty splitting sentences when it finds i.e., e.g. etc. and other abbreviations. – Enchanting 18/3, 2022 at 10:12

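You can try using spaCy instead of regex. I use it and it does the job.

import spacy
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut no longer works in spaCy 3

text = '''Your text here'''
tokens = nlp(text)

for sent in tokens.sents:
    print(sent.text.strip())  # Span.string was removed in spaCy 3; use .text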

Gurevich answered 10/1, 2018 at 12:3 Comment(4)

spaCy is great, but if you just need to split text into sentences, passing the text through spaCy takes too long when you are dealing with a data pipeline. – Carpet 19/6, 2019 at 19:22

@Berlines I agree but couldn't find any other library that does the job as clean as spaCy. But if you have any suggestion, I can try. – Gurevich 16/8, 2019 at 11:19

Also for the AWS Lambda Serverless users out there, spacy's support data files are many 100MB (english large is > 400MB) so you can't use things like this out of the box, very sadly (huge fan of Spacy here) – Philter 16/6, 2020 at 4:12

I found spacy very bad splitting my texts into sentences, giving some phantom sentences containing just a dot. – Enchanting 18/3, 2022 at 10:27

Here is a middle of the road approach that doesn't rely on any external libraries. I use list comprehension to exclude overlaps between abbreviations and terminators as well as to exclude overlaps between variations on terminations, for example: '.' vs. '."'

abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior',
                 'i.e.': 'for example', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
   end = True
   sentences = []
   while end > -1:
       end = find_sentence_end(paragraph)
       if end > -1:
           sentences.append(paragraph[end:].strip())
           paragraph = paragraph[:end]
   sentences.append(paragraph)
   sentences.reverse()
   return sentences


def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end


def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)

I used Karl's find_all function from this entry: Find all occurrences of a substring in Python

Equalizer answered 22/1, 2015 at 15:59 Comment(2)

Perfect approach! The others don't catch ... and ?!. – Embosser 30/7, 2016 at 7:28

Nice work. One note: “i.e.” translates as “That is”, not “For example”. – Jerrold 3/5, 2023 at 6:54

I love spaCy to death, but I recently discovered two new approaches for sentence tokenization. One is BlingFire from Microsoft (incredibly fast), and the other is PySBD from AI2 (supremely accurate).

text = ...

from blingfire import text_to_sentences
sents = text_to_sentences(text).split('\n')

from pysbd import Segmenter
segmenter = Segmenter(language='en', clean=False)
sents = segmenter.segment(text)

I separated 20k sentences using five different methods. Here are the elapsed times on an AMD Threadripper Linux machine:

spaCy Sentencizer: 1.16934s
spaCy Parse: 25.97063s
PySBD: 9.03505s
NLTK: 0.30512s
BlingFire: 0.07933s

UPDATE: I tried using BlingFire on all-lowercase text, and it failed miserably. I'm going to use PySBD in my projects for the time being.
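
The answer does not say how the timings were collected, so the following is only a sketch of one way to measure a splitter, assuming a plain wall-clock comparison with time.perf_counter. The names time_splitter and regex_splitter are hypothetical, and the regex splitter is just a stand-in for whichever library call is being measured.

```python
import re
import time


def time_splitter(split_fn, texts, repeats=3):
    """Return the best wall-clock time (seconds) over `repeats` full passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for t in texts:
            split_fn(t)
        best = min(best, time.perf_counter() - start)
    return best


def regex_splitter(text):
    """Stand-in splitter: break after . ! ? when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)


if __name__ == "__main__":
    sample = ["First sentence. Second one! A third?"] * 20000
    print(f"regex stand-in: {time_splitter(regex_splitter, sample):.5f}s")
```

To compare libraries, pass each library's splitting function as split_fn and use the same list of texts for every run.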

Gobioid answered 20/9, 2022 at 18:52 Comment(2)

BlingFire doesn't work on ARM Linux or macOS currently. – Volitant 17/2, 2023 at 10:36

When testing with a subset of the 51 English "golden rules" for sentence segmentation defined here (s3.amazonaws.com/tm-town-nlp-resources/golden_rules.txt, which is from github.com/diasks2/pragmatic_segmenter), BlingFire was the most accurate option and only slightly slower than NLTK with nltk.tokenize.sent_tokenize(text). Said subset was relevant for my purposes and included these 33 rules only: pastebin.com/raw/xqJATfcX – Ratha 11/4, 2023 at 23:7

You can also use the sentence tokenization function in NLTK:

from nltk.tokenize import sent_tokenize
sentence = "As the most quoted English writer Shakespeare has more than his share of famous quotes.  Some Shakespare famous quotes are known for their beauty, some for their everyday truths and some for their wisdom. We often talk about Shakespeare’s quotes as things the wise Bard is saying to us but, we should remember that some of his wisest words are spoken by his biggest fools. For example, both ‘neither a borrower nor a lender be,’ and ‘to thine own self be true’ are from the foolish, garrulous and quite disreputable Polonius in Hamlet."

sent_tokenize(sentence)

Crowe answered 27/6, 2019 at 9:9 Comment(1)

I tried to use it because nltk is a very, very good library, but it fails on abbreviations where it splits but it should not. – Enchanting 18/3, 2022 at 10:29

For simple cases (where sentences are terminated normally), this should work:

import re
with open('somefile.txt') as f:
    text = f.read()
sentences = re.split(r' *[\.\?!][\'"\)\]]* +', text)

The regex is  *[\.\?!][\'"\)\]]* +, which matches a sentence terminator (., ? or !), optionally followed by closing quotes or brackets, with 0 or more spaces to the left and 1 or more to the right (to prevent something like the period in re.split being counted as a change in sentence).

Obviously, not the most robust solution, but it'll do fine in most cases. The only case this won't cover is abbreviations (perhaps run through the list of sentences and check that each string in sentences starts with a capital letter?)
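
To make the behaviour concrete, here is a small sketch using the variant of the pattern that requires at least one space after the terminator. The helper name naive_split is hypothetical. Note two side effects: the terminators are consumed by the split (except at the very end of the text, where no space follows), and abbreviations such as "Mr." are still treated as sentence ends.

```python
import re

# Terminator, optional closing quote/bracket, then at least one space.
SENTENCE_BREAK = re.compile(r' *[\.\?!][\'"\)\]]* +')


def naive_split(text):
    """Split on the pattern and drop empty pieces."""
    return [part for part in SENTENCE_BREAK.split(text) if part]


print(naive_split("I paid $3.5 for this. It was good! Really?"))
# ['I paid $3.5 for this', 'It was good', 'Really?']

print(naive_split("Mr. Smith left. He was late."))
# ['Mr', 'Smith left', 'He was late.']
```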

Druce answered 1/1, 2011 at 22:34 Comment(5)

You can't think of a situation in English where a sentence doesn't end with a period? Imagine that! My response to that would be, "think again." (See what I did there?) – Immensity 1/1, 2011 at 22:37

@Ned wow, can't believe I was that stupid. I must be drunk or something. – Druce 1/1, 2011 at 22:39

I am using Python 2.7.2 on Win 7 x86, and the regex in the above code gives me this error: SyntaxError: EOL while scanning string literal, pointing to the closing parenthesis (after text). Also, the regex you reference in your text does not exist in your code sample. – Dunite 23/7, 2013 at 18:35

The regex is not completely correct, as it should be r' *[\.\?!][\'"\)\]]* +' – Cyanate 9/9, 2015 at 20:39

It may cause many problems and chunk a sentence into smaller chunks as well. Consider the case where we have " I paid $3.5 for this ice cream"; then the chunks are " I paid $3" and "5 for this ice cream". Using the default nltk sentence tokenizer is safer! – Committal 23/2, 2018 at 19:19

Using spacy:

import spacy

nlp = spacy.load('en_core_web_sm')
text = "How are you today? I hope you have a great day"
tokens = nlp(text)
for sent in tokens.sents:
    print(sent.text.strip())  # Span.string was removed in spaCy 3; use .text

Hypertension answered 15/12, 2020 at 16:0 Comment(0)

Might as well throw this in, since this is the first post that showed up when I searched for splitting text into groups of n sentences.

This works with a variable split length, which sets how many sentences get joined together in each chunk.

import nltk
# nltk.download('punkt')  # uncomment on first run to fetch the tokenizer models
from more_itertools import windowed

split_length = 3  # 3 sentences per chunk, for example

text = '''Your text here'''
elements = nltk.tokenize.sent_tokenize(text)
segments = windowed(elements, n=split_length, step=split_length)
text_splits = []
for seg in segments:
    # drop empty or missing entries before joining
    txt = " ".join([t for t in seg if t])
    if len(txt) > 0:
        text_splits.append(txt)
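
If more_itertools is not available, the same grouping can be sketched with plain slicing. This assumes the sentences are already in a list (for example, the output of any sentence splitter); the function name group_sentences is hypothetical.

```python
def group_sentences(sentences, split_length):
    """Join consecutive sentences into chunks of at most `split_length`."""
    return [
        " ".join(sentences[i:i + split_length])
        for i in range(0, len(sentences), split_length)
    ]


print(group_sentences(["A.", "B.", "C.", "D."], 3))
# ['A. B. C.', 'D.']
```

The last chunk is simply shorter when the number of sentences is not a multiple of split_length.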

Handmaiden answered 14/3, 2021 at 21:36 Comment(0)

If NLTK's sent_tokenize is not an option (e.g. it needs a lot of GPU RAM on long text) and regex doesn't work properly across languages, sentence splitter might be worth a try.

Yacov answered 30/6, 2021 at 20:13 Comment(0)

Using Stanza, a natural language processing library that works for many human languages:

import stanza

stanza.download('en')
nlp = stanza.Pipeline(lang='en', processors='tokenize')

t_en = "This is a sentence. This is another one."  # sample input; replace with your own text
doc = nlp(t_en)
for sentence in doc.sentences:
    print(sentence.text)

Carnivorous answered 7/12, 2021 at 8:52 Comment(1)

This, is fantastic, though if you are going to use this please use the multilingual model (stanfordnlp.github.io/stanza/langid.html) – Kenleigh 8/2, 2022 at 18:40

Also, be wary of additional top level domains that aren't included in some of the answers above.

For example, .info, .biz, .ru and .online will trip up some sentence parsers but aren't included above.

Here's some info on frequency of top level domains: https://www.westhost.com/blog/the-most-popular-top-level-domains-in-2017/

That could be addressed by editing the code above to read:

alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov|ai|edu|co.uk|ru|info|biz|online)"
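
One way to keep such a list maintainable is to build the pattern from a plain list of TLDs. The sketch below is an assumption-laden illustration, not part of the original function: the names TLDS, build_website_pattern and protect_domains are hypothetical. It escapes the dot inside multi-part TLDs such as co.uk, masks that inner dot as well, and adds a word boundary so that ".com" does not match inside ".community".

```python
import re

# TLD list taken from the snippet above; extend as needed.
TLDS = ["com", "net", "org", "io", "gov", "ai", "edu",
        "co.uk", "ru", "info", "biz", "online"]


def build_website_pattern(tlds):
    """Compile a pattern matching a dot followed by any of the given TLDs."""
    # Longest first so multi-part TLDs are tried before shorter ones.
    ordered = sorted(tlds, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"[.](" + alternatives + r")\b")


def protect_domains(text, tlds=TLDS):
    """Replace the dots of recognised domain endings with the <prd> marker."""
    pattern = build_website_pattern(tlds)
    return pattern.sub(
        lambda m: "<prd>" + m.group(1).replace(".", "<prd>"), text
    )


print(protect_domains("Visit example.co.uk or site.info today."))
# Visit example<prd>co<prd>uk or site<prd>info today.
```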

Filiate answered 23/10, 2020 at 8:38 Comment(2)

This is helpful information, but it might be more appropriate to add it as a short comment on the original answer. – Thievery 23/10, 2020 at 13:32

That was my original plan, but I don't have the reputation for that yet apparently. Thought this might help someone so I thought I'd post it as best I could. If there's a way to do it and get around the "you need 50 reputation" first, I'd love to :) – Filiate 26/10, 2020 at 0:30

You could make a new tokenizer for Russian (and some other languages) using this function:

def russianTokenizer(text):
    result = text
    result = result.replace('.', ' . ')
    result = result.replace(' .  .  . ', ' ... ')
    result = result.replace(',', ' , ')
    result = result.replace(':', ' : ')
    result = result.replace(';', ' ; ')
    result = result.replace('!', ' ! ')
    result = result.replace('?', ' ? ')
    result = result.replace('\"', ' \" ')
    result = result.replace('\'', ' \' ')
    result = result.replace('(', ' ( ')
    result = result.replace(')', ' ) ') 
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.strip()
    result = result.split(' ')
    return result

and then call it in this way:

text = 'вы выполняете поиск, используя Google SSL;'
tokens = russianTokenizer(text)

Rustic answered 28/1, 2012 at 17:42 Comment(1)

This badly splits text into words, not sentences. – Bilow 5/4, 2022 at 9:28

No doubt NLTK is the most suitable tool for the purpose. But getting started with NLTK is quite painful (though once you install it, you just reap the rewards).

So here is simple re-based code, available at http://pythonicprose.blogspot.com/2009/09/python-split-paragraph-into-sentences.html

# split up a paragraph into sentences
# using regular expressions


def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    import re
    # to split on multiple characters,
    # regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList


if __name__ == '__main__':
    p = """This is a sentence.  This is an excited sentence! And do you think this is a question?"""

    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print(s.strip())

# output:
#   This is a sentence
#   This is an excited sentence
#   And do you think this is a question

Pfeiffer answered 14/5, 2012 at 1:59 Comment(1)

Yey but this fails so easily, with: "Mr. Smith knows this is a sentence." – Roche 11/2, 2014 at 10:15

I hope this helps with Latin, Chinese, and Arabic text:

import re

punctuation = re.compile(r"([^\d+])(\.|!|\?|;|\n|。|！|？|；|…|　|!|؟|؛)+")
lines = []

with open('myData.txt','r',encoding="utf-8") as myFile:
    lines = punctuation.sub(r"\1\2<pad>", myFile.read())
    lines = [line.strip() for line in lines.split("<pad>") if line.strip()]
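
A variation on the same idea, sketched here with re.findall instead of a placeholder marker: each match is a run of non-terminator characters followed by any terminators that close it. The names TERMINATORS and split_multilingual are hypothetical, the terminator set is a subset of the one in the pattern above, and (unlike the pattern above) this sketch does not try to protect decimal numbers.

```python
import re

# Latin, CJK and Arabic sentence terminators (subset, for illustration).
TERMINATORS = ".!?;。！？；…؟؛"

# A run of non-terminator, non-newline characters plus its closing terminators.
SENTENCE = re.compile(r"[^{0}\n]+[{0}]*".format(re.escape(TERMINATORS)))


def split_multilingual(text):
    """Return non-empty sentences with surrounding whitespace removed."""
    return [m.strip() for m in SENTENCE.findall(text) if m.strip()]


print(split_multilingual("你好。今天天气很好！Is it? Yes"))
# ['你好。', '今天天气很好！', 'Is it?', 'Yes']
```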

Tyrant answered 13/5, 2020 at 6:10 Comment(0)

I was working on a similar task and came across this question. After following a few links and working through a few nltk exercises, the code below worked for me like magic.

from nltk.tokenize import sent_tokenize 
  
text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article"
sent_tokenize(text)

output:

['Hello everyone.',
 'Welcome to GeeksforGeeks.',
 'You are studying NLP article']

Source: https://www.geeksforgeeks.org/nlp-how-tokenizing-text-sentence-words-works/

Cloddish answered 7/7, 2020 at 9:53 Comment(0)

Using spaCy v3.5:

import spacy

nlp_sentencizer = spacy.blank("en")
nlp_sentencizer.add_pipe("sentencizer")

text = "How are you today? I hope you have a great day"
tokens = nlp_sentencizer(text)
[str(sent) for sent in tokens.sents]

Volitant answered 17/2, 2023 at 10:27 Comment(0)

I had to read subtitle files and split them into sentences. After pre-processing (such as removing the time information in the .srt files), the variable fullFile contained the full text of the subtitle file. The crude approach below neatly split it into sentences. I was probably lucky that the sentences always ended (correctly) with a space. Try this first and, if you hit exceptions, add more checks and balances.

import re

# Very approximate way to split the text into sentences: break after ?, . and !
fullFile = re.sub(r"(!|\?|\.) ", r"\1<BRK>", fullFile)
sentences = fullFile.split("<BRK>")
# the with-block closes the file (the original wrote sentFile.close without calling it)
with open("./sentences.out", "w+") as sentFile:
    for line in sentences:
        sentFile.write(line)
        sentFile.write("\n")

Oh! well. I now realize that since my content was Spanish, I did not have the issues of dealing with "Mr. Smith" etc. Still, if someone wants a quick and dirty parser...

Spermatozoid answered 14/3, 2018 at 21:49 Comment(0)

Using spaCy:

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'This is first.This is second.This is Thired ')
for sentence in doc.sents:
  print(sentence)

But if you want to get a sentence by index, for example:

# doesn't work: doc.sents is a generator, not a list
doc.sents[0]

use:

list( doc.sents)[0]
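
Converting to a list is the simplest fix. When only one sentence or the first few are needed, itertools.islice avoids building the whole list. The sketch below uses a stand-in generator rather than spaCy itself, so sentence_stream, nth and first_n are all hypothetical names; with spaCy, doc.sents would take the place of sentence_stream(...).

```python
from itertools import islice


def sentence_stream(sentences):
    """Stand-in for a generator-style API such as doc.sents."""
    for s in sentences:
        yield s


def nth(iterable, index):
    """Return the item at position `index`; raises StopIteration if too short."""
    return next(islice(iterable, index, None))


def first_n(iterable, n):
    """Return at most the first `n` items as a list."""
    return list(islice(iterable, n))


data = ["This is first.", "This is second.", "This is third."]
print(nth(sentence_stream(data), 1))
# This is second.
print(first_n(sentence_stream(data), 2))
# ['This is first.', 'This is second.']
```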

Astragalus answered 20/2, 2022 at 15:19 Comment(1)

It should be "for sentence in doc.sents:". – Dandrea 14/3, 2022 at 15:43

(?<!\w\.\w.)(?<![A-Z]\.)(?<=\.|\?)\s(?=[A-Z])

This regex is a better choice because it avoids treating some abbreviations as the end of a sentence.
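
A short sketch of how the pattern might be used with re.split (the names BOUNDARY and split_on_boundary are hypothetical). It splits on whitespace that follows "." or "?" and precedes a capital letter, unless the preceding text looks like "x.y." or ends in a capital letter plus a dot. As the last example shows, titles such as "Mr." are still split, and "!" is not treated as a terminator.

```python
import re

BOUNDARY = re.compile(r"(?<!\w\.\w.)(?<![A-Z]\.)(?<=\.|\?)\s(?=[A-Z])")


def split_on_boundary(text):
    """Split text at the whitespace positions matched by BOUNDARY."""
    return BOUNDARY.split(text)


print(split_on_boundary("He works for the U.S. Government. She asked why? Nobody knew."))
# ['He works for the U.S. Government.', 'She asked why?', 'Nobody knew.']

print(split_on_boundary("Pick a tool, e.g. Punkt. It works."))
# ['Pick a tool, e.g. Punkt.', 'It works.']

print(split_on_boundary("Mr. Smith left."))
# ['Mr.', 'Smith left.']
```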

Ogletree answered 4/2 at 15:43 Comment(1)

Can you show an example in which this regex works better than an alternative? – Loth 4/2 at 15:49
