Regex to match . (periods marking end of sentences) but not Mr. (as in Mr. Hopkins)
H

4

10

I'm trying to parse a text file into sentences ending in periods, but names like Mr. Hopkins are throwing false alarms on matching for periods.

What regex identifies "." but not "Mr."?

For bonus, I'm also using ! to find end of sentences, so my current Regex is /(!/./ and I'd love an answer that incorporates my !'s too.

Harping answered 31/5, 2010 at 21:31 Comment(2)
What about other abbreviations (e.g., "Ms."), punctuated acronyms ("A.C.M.E."), or ellipses ("...")?Durer
If someone knows how to handle Mr., that would get me leaps ahead.Harping
D
14

Use a negative lookbehind.

(?<!Mr|Mrs|Dr|Ms)\.

This will match a period only if it does not come directly after Mr, Mrs, Dr or Ms.

<?php
   // Replace every period that is not directly preceded by Mr, Mrs, Dr or Ms with a newline.
   $str = "This is Mr. Someone and Mrs. Somebody. They are here to meet Dr. SomeoneElse.";
   $str = preg_replace("/(?<!Mr|Mrs|Dr|Ms)\\./", "\n", $str);
   echo($str);
?>
//outputs:
This is Mr. Someone and Mrs. Somebody
 They are here to meet Dr. SomeoneElse
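
For comparison, here is a minimal sketch of the same idea in Python rather than the PHP used above. Python's re module only accepts fixed-width lookbehinds, so the titles are written as separate chained assertions instead of one alternation. The names TITLES, PERIOD and split_on_periods are made up for this illustration, and the pattern also treats ! as a terminator, as the question asks.

```python
import re

# Hypothetical list of titles to protect; extend it as needed.
TITLES = ("Mr", "Mrs", "Dr", "Ms")

# Python's re module requires fixed-width lookbehinds, so each title gets its
# own (?<!...) assertion instead of one alternation of mixed lengths.
PERIOD = re.compile("".join(f"(?<!{t})" for t in TITLES) + r"[.!]")


def split_on_periods(text):
    """Split text at '.' or '!' unless the mark directly follows a listed title."""
    parts = PERIOD.split(text)
    return [p.strip() for p in parts if p.strip()]


if __name__ == "__main__":
    sample = ("This is Mr. Someone and Mrs. Somebody. "
              "They are here to meet Dr. SomeoneElse.")
    for sentence in split_on_periods(sample):
        print(sentence)
```

As the comments on this answer point out, a sentence that genuinely ends with one of the listed abbreviations (for example a street name ending in Dr.) will not be split by this approach.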
Diagnostic answered 1/6, 2010 at 4:8 Comment(3)
I knew someone who lived on Lincoln Dr. I lived on Albert Rd.Humphreys
OK, I complain too much because this problem is solvable for Mr. It only fails on Dr. Miss has no period and Ms. and Mrs. work.Humphreys
Is this possible without negative lookbehind? My web app doesn't work because iOS Safari doesn't support lookbehind in regexes.Knorr
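
On the question of avoiding lookbehind: one possible workaround is to match whole sentences instead of splitting on the period, consuming each listed title together with its period as a single unit. The sketch below is in Python and is only an illustration; the pattern uses nothing beyond alternation, \b and character classes. The names SENTENCE and find_sentences are hypothetical.

```python
import re

# A sentence is a run of "title + period" units or non-terminator characters,
# followed by an optional terminator. No lookbehind is used.
SENTENCE = re.compile(r"(?:\b(?:Mrs?|Ms|Dr)\.|[^.!?])+[.!?]?")


def find_sentences(text):
    """Return sentences found by matching them whole instead of splitting."""
    return [m.strip() for m in SENTENCE.findall(text) if m.strip()]


if __name__ == "__main__":
    sample = ("This is Mr. Someone and Mrs. Somebody. "
              "They are here to meet Dr. SomeoneElse.")
    for sentence in find_sentences(sample):
        print(sentence)
```

Unlike the replace-based version, each sentence keeps its closing punctuation. The same limitation applies: a sentence that really ends in Dr. is merged with the one after it.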
G
6

This can't be done with any simple mechanism. It's hopelessly ambiguous. Sentences can end with abbreviations, and in those cases they aren't written with two periods.

See Unicode TR29. Also see the ICU open source library, which includes a basic implementation.

Gymnosophist answered 31/5, 2010 at 21:36 Comment(0)
S
1

Are your sentences always followed by two spaces? If so you could just check for that...

/\.\s{2}/

and incorporating other end of sentence punctuation: /[\.\!\?]\s{2}/

You could also check for other indicators of the end of a sentence, such as whether the next word is capitalized or whether the punctuation is followed by a carriage return. But at best you'll be able to make an educated guess; as pointed out above, the period is just too ambiguous.
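
A rough Python sketch of the two-space check, assuming the text really does put at least two whitespace characters after every sentence. It uses \s{2,} rather than \s{2} so that any extra whitespace is consumed too, and a lookbehind so the punctuation stays attached to its sentence. TWO_SPACE_BREAK and split_two_space are made-up names.

```python
import re

# Terminator (., ! or ?) followed by two or more whitespace characters.
# The lookbehind keeps the terminator attached to its sentence.
TWO_SPACE_BREAK = re.compile(r"(?<=[.!?])\s{2,}")


def split_two_space(text):
    """Split only where a terminator is followed by at least two whitespace chars."""
    return [s for s in TWO_SPACE_BREAK.split(text) if s]


if __name__ == "__main__":
    sample = "I met Mr. Hopkins.  He waved!  Did you see?"
    for sentence in split_two_space(sample):
        print(sentence)
```

Because "Mr. Hopkins" has a single space after the period, it is left intact here; text typed with single spaces between sentences would not be split at all.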

Short answered 1/6, 2010 at 2:14 Comment(0)
M
0

The regex (?<=[\.\!\?]\s[A-Z]) almost works in my testing, but it leaves the capital letter of the next sentence at the end of the previous match. A fix is to remove that letter from the previous match and prepend it to the next one.

Example:

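//requires: using System; using System.Text.RegularExpressions;

//the string
string s = "The fox jumps over the dog. The dog jumps over the fox.";

//split right after the capital letter that starts each new sentence
string[] lines = Regex.Split(s, @"(?<=[\.\!\?]\s[A-Z])");

//print the pieces separated by " | "
Console.WriteLine(string.Join(" | ", lines));

The resulting array would be: ["The fox jumps over the dog. T","he dog jumps over the fox."]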

To fix this:

            //make sure there is a split
            if (lines.Length > 1)
            {
                //stop before the last piece: it has no following sentence
                for (int i = 0; i < lines.Length - 1; i++)
                {
                    //store the capital letter left at the end of this piece
                    char misplacedLetter = lines[i][lines[i].Length - 1];

                    //remove letter
                    lines[i] = lines[i].Substring(0, lines[i].Length - 1);

                    //place on front of next sentence.
                    lines[i + 1] = misplacedLetter + lines[i + 1];
                }
            }

This worked well for me. (You may choose to cache lines[i] instead of accessing it over and over.)
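
As a possible alternative to moving the letter afterwards, the capital letter can be checked with a lookahead instead of being part of the lookbehind, which makes the split point zero-width. This is only a sketch, written in Python rather than the C# used above; it assumes Python 3.7 or later, where re.split accepts patterns that match the empty string. BOUNDARY and split_before_capital are hypothetical names.

```python
import re

# Zero-width split point: preceded by a terminator and one whitespace character,
# followed by a capital letter. Nothing is consumed, so no letter is misplaced.
BOUNDARY = re.compile(r"(?<=[.!?]\s)(?=[A-Z])")


def split_before_capital(text):
    """Split in front of a capital letter that follows a terminator and whitespace."""
    return BOUNDARY.split(text)


if __name__ == "__main__":
    sample = "The fox jumps over the dog. The dog jumps over the fox."
    for sentence in split_before_capital(sample):
        print(repr(sentence))
```

Each piece keeps its trailing whitespace, just as in the C# version. Note that this heuristic on its own still splits "Mr. Hopkins", because the period there is also followed by a space and a capital letter.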

Mesomorphic answered 30/8, 2021 at 23:26 Comment(0)
