Sentence annotation in text without punctuation

I'm having difficulty getting the CoreNLP system to correctly find where one sentence ends and another begins in a corpus of poetry.

The reasons why it's struggling:

  • some poems have no punctuation throughout their entire length (and sometimes no case)
  • some poems have sentences that run from one paragraph into another
  • some poems have capitalization at the beginning of every line

This is a particularly tricky example (the system thought the first sentence ended at the "." at the beginning of the second stanza).

Given the lack of capitals and punctuation to go on, I thought I would try -tokenizeNLs to see if that improved things, but it went overboard and cut off every sentence that ran across a blank line (and there are a few of those).

These sentences often end at the end of a line, but not always. What would be slick is if the system could treat each line ending as a candidate sentence break and weigh the likelihood that it really is one, but I don't know how I would implement that.

Is there an elegant way to do this? Or an alternative?

Thanks in advance!

(expected sentence output here)

Uncinus answered 6/1, 2015 at 21:18 Comment(3)
Could you provide the output you are expecting in your question? (The array of the sentences from that poem.) - Scamander
In the end we need strings of parse trees (the project is a syntactic analysis, looking at the usual patterns like embedded clauses and where line breaks occur). But if you just want the list of sentences, you can now find it at the bottom of my question. - Uncinus
This might be useful / interesting: ptk.jonathanchin.com - Spectral

I built a sentence segmenter that also works well on unpunctuated or partially punctuated text. You can find it at https://github.com/bedapudi6788/deepsegment.

This model is based on the idea that named entity recognition techniques can be used for sentence boundary detection (i.e., marking where each sentence begins and ends). I used data from Tatoeba to generate the training data and trained a BiLSTM+CRF model with GloVe embeddings and character-level features for this task.
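
For illustration only (this is not deepsegment's actual training pipeline): one way to generate such sequence-labelling data is to strip punctuation and case from known sentences and tag the first token of each sentence, so the model learns to recover the boundaries. The label scheme and function name below are my own.

import re

def make_example(sentences):
    # Simulate unpunctuated, lowercased poetry-like input and emit
    # token-level B-SENT/I-SENT labels for a sequence tagger.
    tokens, labels = [], []
    for sent in sentences:
        words = re.sub(r"[^\w\s]", "", sent).lower().split()
        if not words:
            continue
        tokens.extend(words)
        labels.extend(["B-SENT"] + ["I-SENT"] * (len(words) - 1))
    return tokens, labels

tokens, labels = make_example(["He wakes at dawn.", "The birds are loud."])
print(list(zip(tokens, labels)))
# [('he', 'B-SENT'), ('wakes', 'I-SENT'), ..., ('the', 'B-SENT'), ...]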

Although this is built in Python, you can set up a simple REST API with Flask and call it from your Java code.
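
A minimal sketch of such a wrapper, assuming deepsegment exposes a DeepSegment class with a segment() method as described in its README (the endpoint name and port are my own choices):

from flask import Flask, jsonify, request
from deepsegment import DeepSegment

app = Flask(__name__)
segmenter = DeepSegment("en")  # load the pre-trained English model

@app.route("/segment", methods=["POST"])
def segment():
    # expects JSON like {"text": "some unpunctuated poem text"}
    text = request.get_json()["text"]
    # segmenter.segment returns a list of sentence strings
    return jsonify(segmenter.segment(text))

if __name__ == "__main__":
    app.run(port=5000)

Your Java code can then POST the raw poem text to this endpoint and parse the returned JSON array of sentences.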

Bournemouth answered 31/1, 2019 at 10:40 Comment(0)

This would be a neat project! I don't think anyone is working on it in our group at the moment, but I see no reason why we wouldn't incorporate a patch if you make one. The biggest challenge I see is that our sentence splitter is currently entirely rule-based, and therefore these sorts of "soft" decisions are relatively hard to incorporate.

A possible solution for your case would be to use language-model "end of sentence" probabilities (three options, in no particular order: https://kheafield.com/code/kenlm/, https://code.google.com/p/berkeleylm/, http://www.speech.sri.com/projects/srilm/). Line ends with a sufficiently high end-of-sentence probability could then be split as sentence boundaries.
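
As a rough sketch of that idea using KenLM's Python bindings (the model file and threshold below are placeholders you would have to train and tune yourself):

import kenlm

model = kenlm.Model("poetry.arpa")  # hypothetical LM trained on sentence-per-line text

def eos_logprob(text):
    # log10 P(</s> | text): the score with the end-of-sentence token
    # minus the score without it
    return model.score(text, bos=True, eos=True) - model.score(text, bos=True, eos=False)

THRESHOLD = -1.5  # tune on held-out punctuated poems

def split_poem(lines):
    sentences, current = [], []
    for line in lines:
        current.append(line.strip())
        # treat each line ending as a candidate sentence break
        if eos_logprob(" ".join(current)) > THRESHOLD:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences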

Flatfish answered 7/1, 2015 at 1:5 Comment(1)
It seems that the sentence-splitting rules only look at punctuation and don't consider POS tags or their interrelationships at all. - Merrie

I recommend NNSplit for NLP tasks including sentence boundary detection, since it's simple, fast, and easy to use. Its documentation also reports metrics for the following cases:

  • Clean
  • Partial punctuation
  • Partial case
  • Partial punctuation and case
  • No punctuation and case

pip install nnsplit

from nnsplit import NNSplit
splitter = NNSplit.load("en")

# returns `Split` objects
splits = splitter.split(["This is a test This is another test"])[0]

# a `Split` can be iterated over to yield smaller splits or stringified with `str(...)`.
for sentence in splits:
    print(sentence)
Celtic answered 31/7, 2021 at 20:59 Comment(0)
