JavaScript (jQuery): remove the last sentence of a long text

I'm looking for a JavaScript function that is smart enough to remove the last sentence of a long chunk of text (one paragraph, actually). Some example text to show the complexity:

<p>Blabla, some more text here. Sometimes <span>basic</span> html code is used but that should not make the "selection" of the sentence any harder! I looked up the window and I saw a plane flying over. I asked the first thing that came to mind: "What is it doing up there?" She did not know, "I think we should move past the fence!", she quickly said. He later described it as: "Something insane."</p>

Now, I could split on . and remove the last entry of the array, but that would not work for sentences ending with ? or !, and some sentences end with quotes, like something: "stuff."

function removeLastSentence(text) {
  sWithoutLastSentence = ...; // ??
  return sWithoutLastSentence;
}

How to do this? What's a proper algorithm?

Edit - By long text I mean all the content in my paragraph, and by sentence I mean an actual sentence (not a line). So in my example the last sentence is: He later described it as: "Something insane." When that one is removed, the next one is: She did not know, "I think we should move past the fence!", she quickly said.

Diantha answered 23/9, 2011 at 15:54 Comment(6)
Define "last sentence" and "long string". If you're looking for a method for limiting the number of lines in a text, see this answer. - Lesalesak
Edited my question, by sentence I mean a real sentence, see above. :) - Diantha
He later described it as: "Something insane." I'm not an English major, but is this correct? Or should it be He later described it as, "Something insane". - Strathspey
I agree with you, I prefer the latter, but the book I'm processing uses both, so... Editing the source is cheating and the source is quite big. - Diantha
It's hard to split up a paragraph by sentence if the sentences are not all structured properly... I wouldn't have much faith that the solution will be consistent. - Strathspey
@Strathspey yes, if the text ignores language rules, you cannot detect individual sentences easily. - Frontier

Define your rules:

// 1. A sentence starts with a capital letter
// 2. A sentence is preceded by nothing or [.!?], but not [,:;]
// 3. A sentence may be preceded by quotes if not formatted properly, such as ["']
// 4. A sentence may be incorrectly identified in this case if the word following a quote is a name

Any additional Rules?

Define your Purpose:

// 1. Remove the last sentence

Assumptions: If you started from the last character in the string of text and worked backwards, then you'd identify the beginning of the last sentence as the point where:

1. The text before it ends in [.?!], OR
2. The text before it ends in ["'] and the sentence itself starts with a capital letter
3. Every [.] that ends a sentence is followed by a space
4. We aren't correcting for html tags
5. These assumptions are not robust and will need to be adapted regularly
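The boundary rules above can be written as a small predicate. This is only a sketch: `isSentenceBoundary` is a hypothetical name that does not appear in the code of this answer, it only recognises A-Z as capital letters, and it inherits the same assumptions (space-separated chunks, no HTML handling).

```javascript
// Hypothetical helper: decides whether `chunk` closes a sentence, given the
// chunk that follows it in the text (`nextChunk`).
function isSentenceBoundary(chunk, nextChunk) {
    if (!chunk || !nextChunk) {
        return false;
    }
    var last = chunk.charAt(chunk.length - 1);
    // Assumption 1: the chunk ends in a sentence terminator.
    if (last === '.' || last === '!' || last === '?') {
        return true;
    }
    // Assumption 2: the chunk ends in a quote and the next chunk starts
    // with a capital letter (A-Z only in this sketch).
    if (last === '"' || last === "'") {
        return /^[A-Z]/.test(nextChunk);
    }
    return false;
}

console.log(isSentenceBoundary('there?"', 'She')); // true
console.log(isSentenceBoundary('fence!",', 'she')); // false
```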

Possible Solution: Read in your string and split it on the space character to give us chunks of strings to review in reverse.

var characterGroups = $('#this-paragraph').html().split(' ').reverse();

If your string is:

Blabla, some more text here. Sometimes basic html code is used but that should not make the "selection" of the sentence any harder! I looked up the window and I saw a plane flying over. I asked the first thing that came to mind: "What is it doing up there?" She did not know, "I think we should move past the fence!", she quickly said. He later described it as: "Something insane."

var originalString = 'Blabla, some more text here. Sometimes <span>basic</span> html code is used but that should not make the "selection" of the sentence any harder! I looked up the window and I saw a plane flying over. I asked the first thing that came to mind: "What is it doing up there?" She did not know, "I think we should move past the fence!", she quickly said. He later described it as: "Something insane."';

Then your array in characterGroups would be:

    ['insane."', '"Something', 'as:', 'it', 'described', 'later', 'He',
     'said.', 'quickly', 'she', 'fence!",', 'the', 'past', 'move', 'should', 'we',
     'think', '"I', 'know,', 'not', 'did', 'She', 'there?"', 'up', 'doing', 'it',
     'is', '"What', 'mind:', 'to', 'came', 'that', 'thing', 'first', 'the', 'asked',
     'I', 'over.', 'flying', 'plane', 'a', 'saw', 'I', 'and', 'window', 'the', 'up',
     'looked', 'I', 'harder!', 'any', 'sentence', 'the', 'of', '"selection"', 'the',
     'make', 'not', 'should', 'that', 'but', 'used', 'is', 'code', 'html', 'basic',
     'Sometimes', 'here.', 'text', 'more', 'some', 'Blabla,']

Note: the <span> tags and others would be removed using the .text() method in jQuery

Each chunk was followed by a space in the original string, so once we have identified the chunk that starts the last sentence (by array index), we know where the space in front of it sits and can cut the original string at that position.
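The mapping from an array index back to a position in the original string can be sketched as follows. `chunkStartOffset` is a hypothetical helper, not part of the code in this answer; it relies only on the fact that split(' ') removes exactly one space between neighbouring chunks.

```javascript
// Hypothetical helper: character offset at which the chunk with index
// `reversedIndex` (counted from the end, as in the reversed array) starts.
function chunkStartOffset(text, reversedIndex) {
    var chunks = text.split(' ');
    var forwardIndex = chunks.length - 1 - reversedIndex;
    var offset = 0;
    for (var i = 0; i < forwardIndex; i++) {
        offset += chunks[i].length + 1; // +1 for the space after each chunk
    }
    return offset;
}

var sample = 'One two. Three four.';
var start = chunkStartOffset(sample, 1); // 'Three' is index 1 from the end
console.log(sample.slice(0, start - 1)); // prints "One two."
```

This is an alternative to searching the original string for the text again; it is shown only to make the index arithmetic concrete.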

Give ourselves a variable to mark if we've found it or not and a variable to hold the index position of the array element we identify as holding the start of the last sentence:

var found = false;
var index = null;

Loop through the array and look for any element ending in [.!?] OR ending in " where the previous element started with a capital letter.

var position     = 1,    // skip the first one since we know that's the end anyway
    elements     = characterGroups.length,
    element      = null,
    prevHadUpper = false,
    last         = null,
    lookFor      = '';   // terminator + space + first word of the last sentence

while(!found && position < elements) {
    element = characterGroups[position].split('');

    if(element.length > 0) {
       last = element[element.length-1];

       // test last character rule
       if(
          last=='.'                      // ends in '.'
          || last=='!'                   // ends in '!'
          || last=='?'                   // ends in '?'
          || (last=='"' && prevHadUpper) // ends in '"' and previous started [A-Z]
       ) {
          found = true;
          index = position-1;
          lookFor = last+' '+characterGroups[position-1];
       } else {
          if(element[0] == element[0].toUpperCase()) {
             prevHadUpper = true;
          } else {
             prevHadUpper = false;
          }
       }
    } else {
       prevHadUpper = false;
    }
    position++;
}

If you run the above script it will correctly identify 'He' as the start of the last sentence.

console.log(characterGroups[index]); // He at index=6

Now you can run through the string you had before:

var trimPosition = originalString.lastIndexOf(lookFor)+1;
var updatedString = originalString.substring(0,trimPosition);
console.log(updatedString);

// Blabla, some more text here. Sometimes <span>basic</span> html code is used but that should not make the "selection" of the sentence any harder! I looked up the window and I saw a plane flying over. I asked the first thing that came to mind: "What is it doing up there?" She did not know, "I think we should move past the fence!", she quickly said.

Run it again and get: Blabla, some more text here. Sometimes basic html code is used but that should not make the "selection" of the sentence any harder! I looked up the window and I saw a plane flying over. I asked the first thing that came to mind: "What is it doing up there?"

Run it again and get: Blabla, some more text here. Sometimes basic html code is used but that should not make the "selection" of the sentence any harder! I looked up the window and I saw a plane flying over.

Run it again and get: Blabla, some more text here. Sometimes basic html code is used but that should not make the "selection" of the sentence any harder!

Run it again and get: Blabla, some more text here.

Run it again and the text no longer changes, because only one sentence is left: Blabla, some more text here.

So, I think this matches what you're looking for?

As a function:

function trimSentence(string){
    var found = false;
    var index = null;

    var characterGroups = string.split(' ').reverse();

    var position     = 1,//skip the first one since we know that's the end anyway
        elements     = characterGroups.length,
        element      = null,
        prevHadUpper = false,
        last         = null,
        lookFor      = '';

    while(!found && position < elements) {
        element = characterGroups[position].split('');

        if(element.length > 0) {
           last = element[element.length-1];

           // test last character rule
           if(
              last=='.' ||                // ends in '.'
              last=='!' ||                // ends in '!'
              last=='?' ||                // ends in '?'
              (last=='"' && prevHadUpper) // ends in '"' and previous started [A-Z]
           ) {
              found = true;
              index = position-1;
              lookFor = last+' '+characterGroups[position-1];
           } else {
              if(element[0] == element[0].toUpperCase()) {
                 prevHadUpper = true;
              } else {
                 prevHadUpper = false;
              }
           }
        } else {
           prevHadUpper = false;
        }
        position++;
    }


    // If no boundary was found, lookFor is '' and the whole string is returned.
    var trimPosition = string.lastIndexOf(lookFor)+1;
    return string.substring(0,trimPosition);
}
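To see where the trimming stops, the function can be applied repeatedly until the text no longer changes. So that the example runs on its own, the sketch below uses a condensed stand-in, `trimLastSentence`, which is a hypothetical restatement of the same rules rather than the exact code above (it only treats A-Z as capitals). `trimSteps` is likewise a made-up name.

```javascript
// Condensed, hypothetical restatement of the rules: cut after the last
// chunk that ends in [.!?] (or in a double quote followed by a capital
// letter), ignoring the final chunk, which always ends the text.
function trimLastSentence(text) {
    var chunks = text.split(' ');
    for (var i = chunks.length - 2; i >= 0; i--) {
        var chunk = chunks[i];
        var last = chunk.charAt(chunk.length - 1);
        var nextStartsUpper = /^[A-Z]/.test(chunks[i + 1]);
        if (last === '.' || last === '!' || last === '?' ||
            (last === '"' && nextStartsUpper)) {
            return chunks.slice(0, i + 1).join(' ');
        }
    }
    return text; // no earlier boundary: only one sentence is left
}

// Apply the trim until nothing changes and collect each intermediate result.
function trimSteps(text) {
    var steps = [];
    var next = trimLastSentence(text);
    while (next !== text) {
        steps.push(next);
        text = next;
        next = trimLastSentence(text);
    }
    return steps;
}

console.log(trimSteps('First one. Second one! He said: "Third."'));
// [ 'First one. Second one!', 'First one.' ]
```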

It's trivial to make a plugin for it, but beware the ASSUMPTIONS! :)

Does this help?

Thanks, AE

Nessie answered 1/10, 2011 at 8:31 Comment(0)

This ought to do it.

/*
Assumptions:
- Sentence separators are a combination of terminators (.!?) + doublequote (optional) + spaces + capital letter. 
- I haven't preserved tags if it gets down to removing the last sentence. 
*/
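function removeLastSentence(text) {
    // Position of the last sentence terminator in the original text.
    var lastSeparator = Math.max(
        text.lastIndexOf("."),
        text.lastIndexOf("!"),
        text.lastIndexOf("?")
    );

    // Search the reversed text so the first match is the last separator:
    // capital letter, spaces, optional double quote, terminator (reversed).
    var revtext = text.split('').reverse().join('');
    var sep = revtext.search(/[A-Z]\s+(\")?[\.\!\?]/);
    // Start index of the last closing tag ("</" reads "/<" when reversed).
    var lastTag = text.length - revtext.search(/\/\</) - 2;

    // Keep the trailing closing tag only if it follows the last terminator.
    var lastPtr = (lastTag > lastSeparator) ? lastTag : text.length;
    var sWithoutLastSentence;

    if (sep > -1) {
        var text1 = revtext.substring(sep + 1, revtext.length).trim().split('').reverse().join('');
        var text2 = text.substring(lastPtr, text.length).replace(/['"]/g, '').trim();

        sWithoutLastSentence = text1 + text2;
    } else {
        sWithoutLastSentence = '';
    }
    return sWithoutLastSentence;
}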

/*
TESTS: 

var text = '<p>Blabla, some more text here. Sometimes <span>basic</span> html code is used but that should not make the "selection" of the text any harder! I looked up the window and I saw a plane flying over. I asked the first thing that came to mind: "What is it doing up there?" She did not know, "I think we should move past the fence!", she quickly said. He later described it as: "Something insane. "</p>';

alert(text + '\n\n' + removeLastSentence(text));
alert(text + '\n\n' + removeLastSentence(removeLastSentence(text)));
alert(text + '\n\n' + removeLastSentence(removeLastSentence(removeLastSentence(text))));
alert(text + '\n\n' + removeLastSentence(removeLastSentence(removeLastSentence(removeLastSentence(text)))));
alert(text + '\n\n' + removeLastSentence(removeLastSentence(removeLastSentence(removeLastSentence(removeLastSentence(text))))));
alert(text + '\n\n' + removeLastSentence(removeLastSentence(removeLastSentence(removeLastSentence(removeLastSentence(removeLastSentence(text)))))));
alert(text + '\n\n' + removeLastSentence('<p>Blabla, some more text here. Sometimes <span>basic</span> html code is used but that should not make the "selection" of the text any harder! I looked up the '));
*/
Pyorrhea answered 3/10, 2011 at 6:48 Comment(0)

This is a good one. Why don't you create a temp variable, convert all '!' and '?' into '.', split that temp variable on '.', remove the last sentence, merge that temp array back into a string and take its length? Then substring the original paragraph up to that length.
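A minimal sketch of this idea, assuming plain text without HTML tags. The name `removeLastSentenceByLength` is made up for illustration, and returning an empty string when there is only one sentence is a choice of this sketch, not something specified above. Because '!' and '?' are replaced one-for-one, positions in the temp string line up with the original.

```javascript
// Hypothetical helper illustrating the temp-variable approach.
function removeLastSentenceByLength(text) {
    // Same length as `text`, so indexes match the original string.
    var temp = text.replace(/[!?]/g, '.');
    // Drop trailing terminators, quotes and whitespace so the final '.' of
    // the text does not produce an empty last array entry.
    var body = temp.replace(/[."'\s]+$/, '');
    var parts = body.split('.');
    if (parts.length < 2) {
        return ''; // only one sentence: nothing remains (sketch's choice)
    }
    parts.pop(); // remove the last sentence
    var keptLength = parts.join('.').length + 1; // +1 keeps the terminator
    return text.substring(0, keptLength);
}

console.log(removeLastSentenceByLength('One here. Is it hard? No!'));
// prints "One here. Is it hard?"
```

End punctuation inside quotes, as in the question's sample, is still treated as a sentence end by this version.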

Bathometer answered 23/9, 2011 at 16:2 Comment(4)
Or hey, just use Regex and it's a ton easier =P - Bathometer
Actually by replacing ." at the end of a sentence I might get away with just /[\.!?]/, the regexp that @omnosis mentioned. - Diantha
You'll still run into a problem with sentences that contain quotes with end punctuation, as in your sample. - Quarantine
You could replace all quotes followed by a space and a capital letter with periods for the temp string, and then axe any empty array members. - Bathometer
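One way to combine the regex suggestion with the quote handling discussed in these comments is a single boundary pattern. This is a sketch under stated assumptions, not the commenters' exact proposal: a sentence ends at [.!?], optionally followed by a closing quote, and the next sentence begins after whitespace with an optional opening quote and a capital letter A-Z. `removeLastSentenceRegex` is a hypothetical name, HTML tags are not handled, and an empty string is returned when no boundary is found.

```javascript
// Hypothetical helper: keep everything up to the last sentence boundary.
function removeLastSentenceRegex(text) {
    // Terminator plus optional closing quote, followed (lookahead only) by
    // whitespace, an optional opening quote and a capital letter.
    var boundary = /[.!?]["']?(?=\s+["']?[A-Z])/g;
    var cut = -1;
    var match;
    while ((match = boundary.exec(text)) !== null) {
        cut = match.index + match[0].length; // end of the latest boundary
    }
    return cut === -1 ? '' : text.substring(0, cut);
}

var quoted = 'I asked: "What is it doing up there?" She did not know, ' +
    '"I think we should move past the fence!", she quickly said. ' +
    'He later described it as: "Something insane."';
var once = removeLastSentenceRegex(quoted);  // ends with: she quickly said.
var twice = removeLastSentenceRegex(once);   // ends with: up there?"
console.log(once);
console.log(twice);
```

The `fence!",` case is not treated as a boundary because the quote is followed by a comma rather than whitespace and a capital letter.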
