I have a large repository of documents in PDF format. The documents come from different sources and do not share a single style. I use Tika to extract the text from the documents, and now I'd like to segment the text into paragraphs.
I can't use regexes, because the documents have no single style:
- The number of \nl between paragraphs varies between 2 and 4.
- In some documents the lines within a single paragraph are separated by 2 \nl, in others by a single \nl.
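To make the variation concrete, here is a minimal sketch that counts how long the runs of consecutive newlines are in one extracted document. The function name `newline_run_histogram` and the two sample strings are made up for illustration, and the regex is used only for counting, not for deciding where paragraphs start.

```python
import re
from collections import Counter


def newline_run_histogram(text):
    """Count how often each length of consecutive-newline run occurs."""
    # Normalize Windows/old-Mac line endings so every break is a single "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return Counter(len(run) for run in re.findall(r"\n+", text))


# Hypothetical samples: two documents that use different conventions.
doc_a = "First paragraph line one.\nStill first paragraph.\n\nSecond paragraph."
doc_b = "First paragraph line one.\n\nStill first paragraph.\n\n\n\nSecond paragraph."

print(newline_run_histogram(doc_a))
print(newline_run_histogram(doc_b))
```

In `doc_a` a run of 2 separates paragraphs, while in `doc_b` a run of 2 only separates lines inside a paragraph, which is why a single fixed pattern does not work across documents.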
So I turned to machine learning. The (great) Python NLTK book has an excellent example of using classification to segment sentences, with features such as the characters before and after a '.' fed to a naive Bayes classifier, but it has nothing on paragraph segmentation.
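By analogy with that sentence example, paragraph segmentation could be framed as classifying every run of newlines as "paragraph break" or "not a break". The sketch below only extracts features for each candidate; the feature names and the helper functions `boundary_features` and `candidate_boundaries` are my own assumptions, not something taken from the NLTK book.

```python
import re


def boundary_features(text, start, end):
    """Features for one candidate boundary: the newline run text[start:end]."""
    before = text[:start].rstrip()
    after = text[end:].lstrip()
    # Last line of text that precedes the candidate boundary.
    prev_line = before.rsplit("\n", 1)[-1]
    return {
        "newline_count": end - start,
        "prev_ends_sentence": before[-1:] in {".", "!", "?", ":"},
        "next_starts_upper": after[:1].isupper(),
        "prev_line_length": len(prev_line),
    }


def candidate_boundaries(text):
    """Yield (start, end, features) for every run of newlines in the text."""
    for match in re.finditer(r"\n+", text):
        start, end = match.start(), match.end()
        yield start, end, boundary_features(text, start, end)


# Hypothetical sample: one in-paragraph line break, then a paragraph break.
sample = "The cat sat on\nthe mat.\n\nA new paragraph starts here."
for start, end, feats in candidate_boundaries(sample):
    print(start, end, feats)
```

Feature dictionaries of this shape, paired with hand-assigned labels, are the kind of input that `nltk.NaiveBayesClassifier.train` accepts as a list of `(featureset, label)` pairs. The labels still have to come from somewhere, which is exactly the tagged-data question below.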
So my questions are:
- Is there another way to do paragraph segmentation?
- If I go with machine learning, is there tagged data with segmented paragraphs that I can use for training?
<p>...</p>? – Trenttrento
\nl with <p> so the problem stays the same. – Discomposure