Tools for text simplification (Java) [closed]

What is the best tool that can do text simplification using Java?

Here is an example of text simplification:

John, who was the CEO of a company, played golf.
                       ↓
John played golf. John was the CEO of a company.
Seaside answered 7/3, 2012 at 4:47 Comment(5)
I think doing that with any real degree of accuracy probably requires IBM's Watson system. - Bultman
Can you be a bit more specific? Is there any tool that I can use to do this? - Seaside
The short answer is NO, there is no tool that will do what you have shown as an example. - Bultman
The specific example you gave involves two major capabilities of the system you are looking for: (a) Syntax parsing, including the detection of relative clauses, (b) Coreference analysis (specifically, to detect that the relative pronoun 'who' refers to 'John'). If that is all you are looking for it is still not trivial, and it will never work 100% correctly, but to some extent it is solvable. Tell us: Is that all you are looking for? Extrapolation of relative clauses? Or do you have many other kinds of simplification in mind, and if so which ones? - Psalmody
This is not a definite question. "Text simplification" does not necessarily mean "shorter sentences", "semantically simpler structure" or "simpler Chomsky tree": it can actually mean anything. - Shipe

I see your problem as the task of converting complex or compound sentences into simple sentences. Based on the literature (Sentence Types), a simple sentence is built from one independent clause, while a compound or complex sentence is built from at least two clauses. Each clause must also have a subject and a verb.
So your task is to split the sentence into the clauses that form it.

Dependency parsing from Stanford CoreNLP is a well-suited tool for splitting compound and complex sentences into simple sentences. You can try the demo online.
For your sample sentence, the parser produces the following result in Stanford typed dependency (SD) notation:

nsubj(CEO-6, John-1)
nsubj(played-11, John-1)
cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
det(company-9, a-8)
prep_of(CEO-6, company-9)
root(ROOT-0, played-11)
dobj(played-11, golf-12)

A clause can be identified from any relation (in SD) whose category is subject, e.g. nsubj or nsubjpass. See the Stanford Dependency Manual.
A basic clause can be extracted by taking the head as the verb (predicate) part and the dependent as the subject part. From the SD output above, there are two basic clauses:

  • John CEO
  • John played
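A minimal Python sketch of this step, assuming the typed dependencies are available as plain text in the format shown above. The names `parse_dependencies` and `basic_clauses` are illustrative helpers, not part of Stanford CoreNLP.

```python
import re

# Typed dependencies copied from the parse of the sample sentence.
SD_OUTPUT = """
nsubj(CEO-6, John-1)
nsubj(played-11, John-1)
cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
det(company-9, a-8)
prep_of(CEO-6, company-9)
root(ROOT-0, played-11)
dobj(played-11, golf-12)
"""

# One relation per line: relation(headWord-index, dependentWord-index)
DEP_PATTERN = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")

# Relations whose category is "subject".
SUBJECT_RELATIONS = {"nsubj", "nsubjpass"}


def parse_dependencies(text):
    """Return (relation, (head, head_idx), (dependent, dep_idx)) tuples."""
    deps = []
    for rel, head, head_idx, dep, dep_idx in DEP_PATTERN.findall(text):
        deps.append((rel, (head, int(head_idx)), (dep, int(dep_idx))))
    return deps


def basic_clauses(deps):
    """Pair each subject (dependent) with its predicate (head)."""
    return [
        f"{dep[0]} {head[0]}"
        for rel, head, dep in deps
        if rel in SUBJECT_RELATIONS
    ]


clauses = basic_clauses(parse_dependencies(SD_OUTPUT))
print(clauses)  # ['John CEO', 'John played']
```

The pairs come out in the order the relations are listed, which is enough to seed the later step of growing each pair into a full sentence.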

After you get the basic clauses, you can add the other parts to make each clause a complete and meaningful sentence. To do so, please consult the Stanford Dependency Manual.

By the way, your question might be related to "Finding meaningful sub-sentences from a sentence".


Answer to 3rd comment:

Once you have a subject-verb pair, e.g. nsubj(CEO-6, John-1), collect all dependencies linked to that dependency, except any dependency whose category is subject, then extract the unique words from these dependencies.

In the example nsubj(CEO-6, John-1), if you start traversing from John-1 you'll get nsubj(played-11, John-1), but you should ignore it since its category is subject.

The next step is traversing from CEO-6. You'll get

cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
prep_of(CEO-6, company-9)

From the result above, you get new words to traverse (i.e. find other dependencies that have was-4, the-5 or company-9 as either head or dependent).
Now your dependencies are

cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
prep_of(CEO-6, company-9)
det(company-9, a-8)

At this point, you've finished traversing all dependencies linked to nsubj(CEO-6, John-1). Next, extract the words from every head and dependent, then arrange them in ascending order of the number appended to each word. This number indicates the word's position in the original sentence.

John was the CEO a company
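The traversal can be sketched in Python roughly as follows. This is a simplified variant rather than a definitive implementation: it only follows edges from a head to its dependents, and it skips `rcmod` and `root` in addition to the subject relations. Those two extra exclusions are assumptions added here so that the clause built from nsubj(played-11, John-1) does not absorb the relative clause. All function names are hypothetical.

```python
import re

SD_OUTPUT = """
nsubj(CEO-6, John-1)
nsubj(played-11, John-1)
cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
det(company-9, a-8)
prep_of(CEO-6, company-9)
root(ROOT-0, played-11)
dobj(played-11, golf-12)
"""

DEP_PATTERN = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")
SUBJECT_RELATIONS = {"nsubj", "nsubjpass"}
# Assumption: never descend into a relative clause or touch the ROOT node.
SKIPPED_RELATIONS = SUBJECT_RELATIONS | {"rcmod", "root"}


def parse_dependencies(text):
    """Return (relation, (head, head_idx), (dependent, dep_idx)) tuples."""
    return [
        (rel, (head, int(hi)), (dep, int(di)))
        for rel, head, hi, dep, di in DEP_PATTERN.findall(text)
    ]


def clause_words(subject_dep, deps):
    """Collect the indexed words of the clause anchored at one subject relation."""
    _, predicate, subject = subject_dep
    collected = {predicate, subject}
    frontier = [predicate, subject]
    while frontier:
        node = frontier.pop()
        for rel, head, dep in deps:
            if rel in SKIPPED_RELATIONS:
                continue
            # Follow the edge only downwards, from head to dependent.
            if head == node and dep not in collected:
                collected.add(dep)
                frontier.append(dep)
    return collected


def to_sentence(words):
    """Order words by their position in the original sentence."""
    return " ".join(word for word, _ in sorted(words, key=lambda w: w[1]))


deps = parse_dependencies(SD_OUTPUT)
sentences = [
    to_sentence(clause_words(d, deps)) for d in deps if d[0] in SUBJECT_RELATIONS
]
print(sentences)  # ['John was the CEO a company', 'John played golf']
```

With collapsed dependencies the preposition is not a word in its own right, so the first result still lacks "of".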

Our new sentence is missing one part, namely "of". This part is hidden in prep_of(CEO-6, company-9). As the Stanford Dependency Manual explains, there are two kinds of SD, collapsed and non-collapsed. Please read about them to understand why "of" is hidden and how to get the word order of this hidden part.

With the same approach, you'll get the second sentence
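One way to put the hidden preposition back when only collapsed dependencies are at hand is sketched below in Python. It rests on an assumption that does not come from the manual: the preposition named in a `prep_*` relation is placed directly before the first word of the dependent's phrase. The non-collapsed representation, which keeps the preposition as its own numbered token, is the more reliable source. The helper names are hypothetical.

```python
# Indexed words collected for the first clause (the preposition is absent).
clause = [("John", 1), ("was", 4), ("the", 5), ("CEO", 6), ("a", 8), ("company", 9)]

# Relations inside that clause, as (relation, head_index, dependent_index).
relations = [
    ("cop", 6, 4),
    ("det", 6, 5),
    ("det", 9, 8),
    ("prep_of", 6, 9),
]


def subtree_start(index, relations):
    """Smallest word index in the phrase headed by the word at `index`."""
    start = index
    for _, head, dep in relations:
        if head == index:
            start = min(start, subtree_start(dep, relations))
    return start


def restore_prepositions(clause, relations):
    """Re-insert prepositions that collapsed prep_* relations have absorbed."""
    words = list(clause)
    for rel, _, dep in relations:
        if rel.startswith("prep_"):
            preposition = rel[len("prep_"):].replace("_", " ")
            # Heuristic: slot the preposition just before the dependent phrase.
            words.append((preposition, subtree_start(dep, relations) - 0.5))
    return " ".join(word for word, _ in sorted(words, key=lambda w: w[1]))


restored = restore_prepositions(clause, relations)
print(restored)  # John was the CEO of a company
```

The half-step position (7.5 here) simply sorts the preposition between CEO-6 and a-8 without needing to know its original index.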

John played golf

Uncovenanted answered 7/3, 2012 at 17:51 Comment(7)
Is there any paper that applies this approach and gets the same result as shown in the question? - Seaside
Unfortunately no, AFAIK. My answer above is based on my experience extracting use case diagrams from textual user requirements, where I used "Use Case Paths Model Revealing Through Natural Language Requirements Analysis" as my main reference. - Uncovenanted
From the output we got above, how can I construct the two sentences that I want? Does this approach guarantee that I get the output I want? - Seaside
I've edited my answer to show how to construct new, simpler sentences. I cannot guarantee you'll get what you want, but you can implement that algorithm and test it on a number of sentences, let's say 100. Then compare the algorithm's results with your manual results. If the result is, let's say, 90% correct, then go for it. You decide the threshold. I'm not good at statistics. - Uncovenanted
I have to do the same thing right now. @Shams Did you already implement such an algorithm? Otherwise I have to do it :D - Mistrustful
Nice answer. The link to the old demo is deprecated; here is the current demo with the test sentence. corenlp.run/…. - Middling
Thanks @VictoriaStuart, demo link is updated. - Uncovenanted

I think one can design a very simple algorithm for the basic cases of this situation, although real-world cases may be so numerous that such an approach becomes unruly :)

Still, I thought I should think aloud and write down my approach, and maybe add some Python code. My basic idea is to derive a solution from first principles, mostly by explicitly exposing our model of what is really happening, and not to rely on other theories, models or libraries BEFORE we do one by HAND and from SCRATCH.


Goal: given a sentence, extract subsentences from it.

Example: John, who was the CEO of the company, played Golf.

Expected output: John was the CEO of the company. John played Golf.


Here is my model of what is happening here written out in the form of model assumptions: (axioms?)

MA1. Simple sentences can be expanded by inserting subsentences.

MA2. A subsentence is a qualification/modification (additional information) on one or more of the entities.

MA3. To insert a subsentence, we put a comma right next to the entity we want to expand on (provide more information on) and attach the subsentence, which I am going to call an extension, then place another comma where the extension ends.

Given this model, the algorithm can be straightforward, at least for addressing the simple cases first.

  1. DETECT: Given a sentence, detect whether it has an extension clause by looking for a pair of commas in the sentence.
  2. EXTRACT: If you find two commas, generate two sentences:
     2.1 EXTRACT-BASE: delete everything between the two commas. You get the base sentence.
     2.2 EXTRACT-EXTENSION: take everything inside the extension and replace 'who' with the word right before the opening comma. That is your second sentence.
  3. PRINT: In fact you should print the extension sentence first, because the base sentence depends on it.
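The three steps above can be sketched in Python along these lines. This is only an illustration of the basic case: the function name `simplify` is made up here, and the code assumes the extension sits between the first two commas and begins with 'who'.

```python
def simplify(sentence):
    """Split one comma-delimited 'who' extension out of a sentence."""
    text = sentence.strip().rstrip(".")

    # DETECT: an extension is assumed to sit between the first two commas.
    if text.count(",") < 2:
        return [text + "."]
    before, extension, after = (part.strip() for part in text.split(",", 2))

    # EXTRACT-BASE: drop everything between the two commas.
    base = f"{before} {after}."

    # EXTRACT-EXTENSION: replace 'who' with the word right before the comma.
    entity = before.split()[-1]
    words = extension.split()
    if words and words[0].lower() == "who":
        words[0] = entity
    extension_sentence = " ".join(words) + "."

    # PRINT: extension first, then the base sentence.
    return [extension_sentence, base]


result = simplify("John, who was the CEO of the company, played golf.")
print(result)  # ['John was the CEO of the company.', 'John played golf.']
```

A sentence without a pair of commas is returned unchanged, which matches the DETECT step acting as a gate.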

Well, that is our algorithm. Yes, it sounds like a hack. It is. But something I am learning now is that if you use a trick in one program it is a hack; if it can handle more stuff, it is a technique.

So let us expand and complicate the situation a bit.

Compounding cases: Example 2. John, who was the CEO of the company, played Golf with Ram, the CFO.

As I was writing it, I noticed that I had omitted the 'who was' phrase for the CFO! That brings us to a complicating case on which our algorithm will fail. Before going there, let me create a simpler version of Example 2 that WILL work.

Example 3. John, who was the CEO of the company, played Golf with Ram, who was the CFO.

Example 4. John, the CEO of the company, played Golf with Ram, the CFO.

Wait we are not done yet!

Example 5. John, who is the CEO and Ram, who was the CFO at that time, played Golf, which is an engaging game.

To allow for this I need to extend my model assumptions:

MA4. More than one entity may be expanded in the same way, but this should not cause confusion because the extension clause occurs right next to the entity it describes. (accounts for example 3)

MA5. The 'who was' phrase may be omitted since it can be inferred by the listener. (accounts for example 4)

MA6. Some entities are persons and are extended using a 'who'; some entities are things and are extended using a 'which'. Either of these extension heads may be omitted.

Now how do we handle these complications in our algorithm?

Try this:

  1. SPLIT-SENTENCE-INTO-BASE-AND-EXTENSIONS: If the sentence contains a comma, look for the following comma, and extract whatever is in between as an extension sentence. Continue until there are no more opening or closing commas left. At this point you should have a list with the base sentence and one or more extension sentences.

  2. PROCESS_EXTENSIONS: For each extension, if it starts with 'who is' or 'which is', replace the 'who' or 'which' with the name that comes right before the extension. If the extension does not have a 'who is' or 'which is', prefix it with that name and an 'is'.

  3. PRINT: all extension sentences first and then the base sentence.
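A rough Python sketch of the extended steps follows. The function names are invented for illustration, and the code assumes every extension is closed either by a comma or by the end of the sentence, so the parts of a comma split alternate between base text and extension text.

```python
import re

# Extension heads from MA6: 'who' for persons, 'which' for things.
PRONOUN_HEAD = re.compile(r"^(who|which)\s+", re.IGNORECASE)


def split_sentence(sentence):
    """Return (base_sentence, [(entity, extension_text), ...])."""
    parts = [p.strip() for p in sentence.strip().rstrip(".").split(",")]
    base_parts = parts[0::2]  # text outside the commas
    extensions = []
    for i in range(1, len(parts), 2):
        # The entity is the word right before the opening comma.
        entity = parts[i - 1].split()[-1]
        extensions.append((entity, parts[i]))
    return " ".join(base_parts) + ".", extensions


def process_extension(entity, extension):
    """Turn one extension into a standalone sentence about its entity."""
    match = PRONOUN_HEAD.match(extension)
    if match:
        # 'who was the CFO' -> 'Ram was the CFO'
        return f"{entity} {extension[match.end():]}."
    # Head omitted (MA5): 'the CFO' -> 'Ram is the CFO'
    return f"{entity} is {extension}."


def simplify(sentence):
    """Extension sentences first, then the base sentence."""
    base, extensions = split_sentence(sentence)
    return [process_extension(e, x) for e, x in extensions] + [base]


example3 = simplify(
    "John, who was the CEO of the company, played Golf with Ram, who was the CFO."
)
example4 = simplify("John, the CEO of the company, played Golf with Ram, the CFO.")
print(example3)
print(example4)
```

Because it relies on that alternation, a sentence such as Example 5, where the first extension is not closed before "and Ram", falls outside what this sketch handles.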

Not scary.

When I get some time in the next few days, I will add a python implementation.

Thank you

Ravi Annaswamy

Culver answered 11/1, 2013 at 22:34 Comment(1)
Thanks, great illustration. Could you please provide a tutorial for this task, or a site that I can learn from? - Seaside

You are unlikely to solve this problem using any known algorithm in the general case - this is getting into strong AI territory. Even humans can't parse grammar very well!

Note that the problem is quite ambiguous regarding how far you simplify and what assumptions you are willing to make. You could take your example further and say:

John is assumed to be the name of a being. The race of John is unknown. John played golf at some point in the past. Golf is assumed to refer to the ball game called golf, but the variant of golf that John played is unknown. At some point in the past John was the CEO of a company. CEO is assumed to mean "Chief Executive Officer" in the context of a company but this is not specified. The company is unknown.

In case the lesson is not obvious: the more you try to determine the exact meaning of words, the more cans of worms you start to open up... it takes human-like levels of judgement and interpretation to know when to stop.

You may be able to solve some simpler cases using various Java-based NLP tools: see Is there a good natural language processing library

Indoeuropean answered 7/3, 2012 at 6:39 Comment(0)

I believe AlchemyApi is your best option. Still, it will require a lot of work on your side to do exactly what you need and, as most commenters have already told you, you most probably will not get 100% quality results.

Cushitic answered 7/3, 2012 at 13:40 Comment(1)
The link is broken. Please update it. - Yung
