Paragraph Segmentation using Machine Learning

Asked 23/1, 2017 at 8:16 Answered 15/1, 2023 at 6:52

python machine-learning nlp apache-tika text-segmentation

I have a large repository of documents in PDF format. The documents come from different sources and do not share a single style. I use Tika to extract the text from the documents, and now I'd like to segment the text into paragraphs.

I can't use regexes, because the documents have no single style:

The number of newlines (\nl) between paragraphs varies between 2 and 4.
In some documents the lines within a single paragraph are separated by 2 \nl, in others by a single \nl.

So I turned to machine learning. The (great) Python NLTK book has an excellent example of using classification for sentence segmentation, with features such as the characters before and after a '.' fed to a naive Bayes classifier, but it has nothing on paragraph segmentation.
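To make the idea concrete, here is a minimal sketch of how the same feature-based approach might carry over to paragraph boundaries. Every run of newlines is treated as a candidate boundary and described by a few features. The function names and the feature set are my own assumptions, not taken from the NLTK book, and labeled data would still be needed to train anything on them.

```python
import re


def candidate_boundaries(text):
    """Return a regex match for every run of newlines (hypothetical helper)."""
    # A run may contain spaces or tabs between the newlines.
    return list(re.finditer(r"(?:[ \t]*\n)+[ \t]*", text))


def boundary_features(text, match):
    """Describe one candidate boundary with a few illustrative features."""
    before = text[:match.start()].rstrip()
    after = text[match.end():].lstrip()
    prev_line = before.rsplit("\n", 1)[-1]  # last line before the run
    next_line = after.split("\n", 1)[0]     # first line after the run
    return {
        "newline_count": match.group().count("\n"),
        "prev_ends_sentence": prev_line.endswith((".", "!", "?", ":")),
        "next_starts_upper": next_line[:1].isupper(),
        "prev_line_length": len(prev_line),
    }


sample = "The cat sat on\nthe mat.\n\n\nDogs bark."
for m in candidate_boundaries(sample):
    print(boundary_features(sample, m))
```

Feature dictionaries of this shape, paired with a boundary / no-boundary label, are the kind of input that a classifier such as nltk.NaiveBayesClassifier accepts for training.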

So my questions are:

Is there another way for paragraph segmentation?
If I go with machine learning, is there tagged data of segmented paragraphs I can use for training?

Discomposure answered 23/1, 2017 at 8:16 Comment(6)

Ask Apache Tika for the HTML version of the document, rather than the plain text one, then split on <p>...</p> ? – Trenttrento 24/1, 2017 at 16:53

Already tried that. It just replaces \nl with <p> so the problem stays the same. – Discomposure 24/1, 2017 at 22:13

We are also facing the exact same problem. Do stay in touch. – Ballata 30/9, 2017 at 4:50

@virusrocks, I finally used regexes, and I get about 90% success. How did you solve it? – Discomposure 20/11, 2017 at 7:45

@Gino: I haven't solved the problem yet. We got higher priority issues so it's on hold for the time being. Will keep you posted. – Ballata 22/11, 2017 at 9:11

Any progress with this? – Stefansson 25/7, 2021 at 17:24

The task has several names: document segmentation, paragraph detection {3}, paragraph identification {3}, paragraph segmentation, section segmentation, text segmentation, topic segmentation.

One of the most famous unsupervised algorithms for text segmentation is TextTiling {2}. It's implemented in NLTK in the nltk.tokenize.texttiling module.
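To give a feel for the idea behind TextTiling, here is a simplified, standard-library-only sketch. It is not NLTK's implementation: it scores each gap between text units by the lexical similarity of the blocks on either side, then ranks gaps by how deep the similarity valley is. The toy stopword list, the function names, and the n_boundaries parameter are assumptions made for illustration. Hearst's algorithm instead derives a cutoff from the mean and standard deviation of the depth scores and also smooths the gap scores.

```python
import math
import re
from collections import Counter

# Toy stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "as", "in", "for", "from", "of", "to"}


def bag_of_words(units):
    """Count the non-stopword tokens in a list of text units."""
    words = re.findall(r"[a-z]+", " ".join(units).lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def gap_scores(units, k=2):
    """Similarity of the k units before and the k units after each gap."""
    return [
        cosine(
            bag_of_words(units[max(0, i - k + 1): i + 1]),
            bag_of_words(units[i + 1: i + 1 + k]),
        )
        for i in range(len(units) - 1)
    ]


def depth_scores(scores):
    """Depth of each valley: drop from the nearest peak on both sides."""
    depths = []
    for i, s in enumerate(scores):
        left = s
        for x in reversed(scores[:i]):  # climb left while scores rise
            if x < left:
                break
            left = x
        right = s
        for x in scores[i + 1:]:        # climb right while scores rise
            if x < right:
                break
            right = x
        depths.append((left - s) + (right - s))
    return depths


def deepest_gaps(units, k=2, n_boundaries=1):
    """Indices of the n deepest gaps; gap i lies between unit i and i + 1."""
    depths = depth_scores(gap_scores(units, k))
    ranked = sorted(range(len(depths)), key=lambda i: depths[i], reverse=True)
    return sorted(ranked[:n_boundaries])


units = [
    "the cat chased the mouse",
    "the mouse hid from the cat",
    "the cat waited for the mouse",
    "stocks fell as markets opened",
    "markets recovered and stocks rose",
    "investors bought stocks in rising markets",
]
print(deepest_gaps(units))  # gap index between the two topics
```

For real use, the TextTilingTokenizer class in nltk.tokenize.texttiling is the complete implementation; the sketch above only illustrates the block-comparison and depth-score steps.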

Regarding supervised algorithms: https://github.com/hyunbool/Text-Segmentation has a list of papers published in 2020 and before.

Google published a paper at EMNLP 2020 on text segmentation {1}. No official code has been released.

3 main issues:

Papers often focus on WikiSections, where each segment is a whole section and therefore much longer than a paragraph.
Papers for that task often don't release their code.
Supervised algorithms tend to be specialized to the domain of the training set (e.g., being effective for WikiSections is no guarantee of being effective in open domain).

References:

{1} Lukasik, Michal, Boris Dadachev, Kishore Papineni, and Gonçalo Simões. "Text Segmentation by Cross Segment Attention." In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4707-4716. 2020.
{2} Hearst, Marti A. "TextTiling: Segmenting text into multi-paragraph subtopic passages." Computational linguistics 23, no. 1 (1997): 33-64.
{3} Sporleder, Caroline, and Mirella Lapata. "Automatic paragraph identification: A study across languages and domains." In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 72-79. 2004.

Windhover answered 15/1, 2023 at 6:52 Comment(0)

There is surprisingly little research on this topic of automatic detection of paragraph boundaries. I have found the following, all of which are quite old:

Sporleder and Lapata (2004): Automatic Paragraph Identification: A Study across Languages and Domains

Sporleder and Lapata (2005): Broad coverage paragraph segmentation across languages and domains

Filippova and Strube (2006): Using Linguistically Motivated Features for Paragraph Boundary Identification

Genzel (2005): A Paragraph Boundary Detection System

Henghold answered 7/7, 2021 at 7:29 Comment(0)
