I'm having difficulty getting the CoreNLP system to correctly find where one sentence ends and another begins in a corpus of poetry.
It seems to be struggling for a few reasons:
- some poems have no punctuation throughout their entire length (and sometimes no case)
- some poems have sentences that run from one paragraph into another
- some poems have capitalization at the beginning of every line
This is a particularly tricky one (the system thought the first sentence ended at the "." at the beginning of the second stanza).
Given the lack of capitals and punctuation to go on, I tried -tokenizeNLs to see whether it would improve things, but it went overboard and split every sentence that runs across a blank line (and there are a few of those).
These sentences often end at the end of a line, but not always. What would be slick is if the system could treat each line ending as a candidate sentence break and weigh how likely each one is to be a real endpoint, but I don't know how I would implement that.
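To make the idea concrete, here is a rough sketch of the kind of scoring I have in mind. It is plain Python and does not use CoreNLP at all. The function names, the three features, the weights and the threshold are all made up for illustration, so treat it as a description of the behaviour I'm after rather than a working solution.

```python
import re

# Hypothetical hand-picked weights, for illustration only.
W_TERMINAL_PUNCT = 0.6   # line ends with . ! or ?
W_BLANK_AFTER = 0.3      # line is followed by a blank line (stanza break)
W_NEXT_CAPITAL = 0.2     # next non-blank line starts with a capital letter


def all_lines_capitalized(lines):
    """True if every non-blank line starts with an uppercase letter."""
    starts = [ln.lstrip()[0] for ln in lines if ln.strip()]
    return bool(starts) and all(ch.isupper() for ch in starts)


def score_line_endings(text):
    """Return (line_index, score) for each non-blank line of the poem."""
    lines = text.split("\n")
    # Line-initial capitals carry no signal if every line is capitalized.
    use_caps = not all_lines_capitalized(lines)
    scores = []
    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped:
            continue
        score = 0.0
        # Terminal punctuation, optionally followed by closing quotes/brackets.
        if re.search(r"[.!?][\"')\]]*$", stripped):
            score += W_TERMINAL_PUNCT
        following = lines[i + 1:]
        if following and not following[0].strip():
            score += W_BLANK_AFTER
        next_text = next((ln.strip() for ln in following if ln.strip()), None)
        if next_text is None:
            score = 1.0  # the last line of the poem always ends a sentence
        elif use_caps and next_text[0].isupper():
            score += W_NEXT_CAPITAL
        scores.append((i, score))
    return scores


def split_sentences(text, threshold=0.5):
    """Join lines into sentences, breaking where the score reaches threshold."""
    lines = text.split("\n")
    sentences, current = [], []
    for i, score in score_line_endings(text):
        current.append(lines[i].strip())
        if score >= threshold:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

With these made-up weights, a blank line on its own is not enough to force a break, so a sentence can run from one stanza into the next, while a line that ends in a period right before a stanza break is split. It does nothing for a "." in the middle of a line, which I would still want the normal sentence splitter to handle.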
Is there an elegant way to do this? Or an alternative?
Thanks in advance!
(expected sentence output here)