Perform sentence segmentation on paragraphs without punctuation?

I have a bunch of badly formatted text with lots of missing punctuation. I want to know if there is any method to segment text into sentences when periods, semicolons, capitalization, etc. are missing.

For example, consider the paragraph: "the lion is called the king of the forest it has a majestic appearance it eats flesh it can run very fast the roar of the lion is very famous".

This text should be segmented as separate sentences:

  • the lion is called the king of the forest
  • it has a majestic appearance
  • it eats flesh
  • it can run very fast
  • the roar of the lion is very famous

Can this be done or is it impossible? Any suggestion is much appreciated!

Wolter answered 2/6, 2017 at 12:12 Comment(5)
You can train a sequence classifier. It's very easy to get tons of training material: use any corpus containing punctuation, perform sentence splitting, remove punctuation – voilà. – Trunkfish
@Trunkfish Which is the easiest way to create a sequence classifier in Python? Can you do this in NLTK? – Wolter
Yes, NLTK has a classification module. Typically, beginners are introduced to supervised machine learning with a Naive Bayes classifier, which is conceptually pretty straightforward. – Trunkfish
@Trunkfish What would be the input and the output of this classifier? – Wolter
Have a look at this answer I posted recently. – Trunkfish
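
The data preparation described in the comments above could be sketched roughly as follows. This is only an illustration under assumptions: the regex-based sentence splitter, the helper names make_training_data and features, and the choice of context features are all hypothetical and not taken from the linked answer. Each token gets a label saying whether a sentence ends after it, which answers the input/output question: the input is a token with its context, the output is a boundary / no-boundary decision.

```python
import re

def make_training_data(punctuated_text):
    """Turn punctuated text into (token, is_sentence_end) pairs."""
    # Naive split on ., ! or ? followed by whitespace (an assumption;
    # a real corpus would call for a proper sentence splitter).
    sentences = re.split(r"(?<=[.!?])\s+", punctuated_text.strip())
    labeled = []
    for sentence in sentences:
        # Drop punctuation and case to mimic the badly formatted input.
        tokens = re.findall(r"[a-z0-9']+", sentence.lower())
        for i, tok in enumerate(tokens):
            labeled.append((tok, i == len(tokens) - 1))
    return labeled

def features(tokens, i):
    """Hypothetical context features for 'does a sentence end after tokens[i]?'."""
    return {
        "word": tokens[i],
        "prev_word": tokens[i - 1] if i > 0 else "<START>",
        "next_word": tokens[i + 1] if i + 1 < len(tokens) else "<END>",
    }

corpus = ("The lion is called the king of the forest. "
          "It has a majestic appearance. It eats flesh.")
labeled = make_training_data(corpus)
tokens = [tok for tok, _ in labeled]
# List of (feature dict, label) pairs, one per token.
train_set = [(features(tokens, i), label)
             for i, (_, label) in enumerate(labeled)]
```

A list of (feature dict, label) pairs like train_set is the shape that nltk.NaiveBayesClassifier.train accepts, so it could be passed there directly; a useful model would need far more text than this three-sentence toy corpus.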

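You can try the following Python implementation, which loads the silero_te text-enhancement model from the snakers4/silero-models repository. It restores punctuation and capitalization, after which the text can be split into sentences at the restored punctuation.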

import torch

# Load the Silero text-enhancement model (restores punctuation and capitalization).
model, example_texts, languages, punct, apply_te = torch.hub.load(
    repo_or_dir='snakers4/silero-models', model='silero_te')

# Your text goes here; if you have a list of texts, call apply_te on each item.
input_text = input('Enter input text\n')
print(apply_te(input_text, lan='en'))
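
The call above appears to return one punctuated string rather than a list of sentences, so a final splitting step is still needed. A minimal sketch, assuming the restored text ends sentences with ., ! or ?; the split_sentences helper and the sample string are hypothetical, and the model's actual output may differ:

```python
import re

def split_sentences(punctuated_text):
    """Split on whitespace that follows ., ! or ?."""
    parts = re.split(r"(?<=[.!?])\s+", punctuated_text.strip())
    return [p for p in parts if p]

# Hypothetical example of restored text, not real model output.
restored = ("The lion is called the king of the forest. "
            "It has a majestic appearance. It eats flesh.")
for sentence in split_sentences(restored):
    print(sentence)
```

This simple rule will also split after abbreviations such as "Dr.", so a dedicated sentence tokenizer may be a better fit for messier text.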
Critique answered 20/10, 2022 at 9:48 Comment(0)
