Does anyone have a PHP snippet of code for grabbing the first "sentence" in a string?

If I have a description like:

"We prefer questions that can be answered, not just discussed. Provide details. Write clearly and simply."

And all I want is:

"We prefer questions that can be answered, not just discussed."

I figure I could search with a regular expression like "[.!\?]", find the position of the first match, and then substr the main string up to that point. But I imagine this is a common thing to do, so I'm hoping someone has a snippet lying around.
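
A minimal sketch of the approach described in the question, assuming that any ".", "!" or "?" ends a sentence. The function name first_sentence_by_offset is hypothetical. Note that strpos() does not accept a pattern, so this sketch uses preg_match() with PREG_OFFSET_CAPTURE to get the position instead:

```php
<?php
// Hypothetical helper name; not part of the original question.
function first_sentence_by_offset($text) {
    // Find the first '.', '!' or '?' and capture its byte offset.
    if (preg_match('/[.!?]/', $text, $m, PREG_OFFSET_CAPTURE)) {
        // $m[0][1] is the offset of the terminator; +1 keeps it in the result.
        return substr($text, 0, $m[0][1] + 1);
    }
    // No terminator found: return the input unchanged.
    return $text;
}

$description = 'We prefer questions that can be answered, not just discussed. Provide details. Write clearly and simply.';
echo first_sentence_by_offset($description), "\n";
```

This naive version cuts at the first terminator it sees, so decimals, abbreviations and domain names will truncate the result early.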

Communize answered 16/7, 2009 at 5:5 Comment(1)
This is a genuinely hard problem. I recommend looking into an NLP package if you require robust results. A tokenizer can identify sentence ending characters (either "?", ".", ";" etc depending on your intended use), and you can split on that.Downy
D
23

This is a slightly more costly expression, but it is more adaptable if you want to treat several punctuation marks as sentence terminators.

$sentence = preg_replace('/([^?!.]*.).*/', '\\1', $string);

To match only termination characters that are followed by whitespace or the end of the string:

$sentence = preg_replace('/(.*?[?!.](?=\s|$)).*/', '\\1', $string);
Deliquesce answered 16/7, 2009 at 5:9 Comment(13)
Thanks for this. I suppose I can accept the cost, as it will be cached.Communize
Actually, just realized, this was missing one piece. Because it grabs everything up to the end, it drops the actual punctuation char. A "." at the end of the search expression within the parens seems to resolve. preg_replace('/([^?!.]*.).*/', '\\1', $str);Communize
You must have grabbed the code before I modified :) If you look again that's what I posted.Deliquesce
yes, i saw that right after I posted my comment. Someone below makes the point that it should be period (or other sentence terminator) followed by at least one blank space (to allow for domain names for example). I took a stab but wasn't able to figure the right expression for that and adding "\s" didn't work.Communize
This regex will fail if the string contains a real number such as 3.14, it will then snip it at the first decimal point.Helminthology
Test string for previous comment: We prefer prices below US$ 7.50. Any higher, we won't buy.Helminthology
That wasn't in the requirements given, but can be easily changed by checking for a whitespace character \sDeliquesce
FWIW, just adding \s didn't work for me (see above). Thanks guys, this is a helpful snippet.Communize
Yeah, I realized afterwards that a simple \s wouldn't suffice, so I included an example using a positive lookahead to find whitespace.Deliquesce
Nice work Ian. Didn't see your improved regex so I provided an alternative below. Yours looks more elegant though. Kudos.Helminthology
Okay, so not to beat a dead horse here, but I ended up trying to use this code recently on results returned from YouTube's API, and strangely when using Playlist Feeds, it did not work as expected. I then used dyve's solution, and it did.. Wonder if Unicode strings are a factor.Communize
This regex fails if the period is followed by a new line instead of a space. You might want to run it through preg_replace( '/\s+/', ' ', $text); first.Redingote
You have to use the s modifier. Eg. '/^(.*?[?!.])(\s|$).*/s'Portuna
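
The two suggestions from these comments (a lookahead for whitespace and the s modifier) could be combined roughly as follows. This is only a sketch: the function name first_sentence_lookahead is hypothetical, and it uses preg_match() rather than the preg_replace() call in the answer.

```php
<?php
// Hypothetical helper; combines the lookahead from the answer with the s modifier.
function first_sentence_lookahead($text) {
    // Lazily match up to the first terminator that is followed by
    // whitespace or the end of the input. The s modifier lets "." match
    // newlines, so a first sentence that wraps across lines is still captured.
    if (preg_match('/^(.*?[.!?])(?=\s|$)/s', $text, $m)) {
        return $m[1];
    }
    // No terminator followed by whitespace: return the input unchanged.
    return $text;
}

echo first_sentence_lookahead("We prefer prices below US\$ 7.50. Any higher, we won't buy."), "\n";
```

Because the terminator must be followed by whitespace, the decimal point in 7.50 is skipped. Abbreviations such as "e.g. " would still end the match early.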
P
8
<?php
$text = "We prefer questions that can be answered, not just discussed. Provide details. Write clearly and simply.";
$array = explode('.', $text);
// explode() drops the delimiter, so put the period back
$text = $array[0] . '.';
?>
Potato answered 16/7, 2009 at 5:8 Comment(4)
+1 to this response. It should be noted though that this will explode on all .'s (i.e. the period character). So if the sentence contains abbreviations such as 'i.e.' or 'e.g.' you will run into problems. Apart from that it's the easiest option.Puddling
However, not all sentences end with "."s. I need something that would deal with "!" and "?" as well I'm pretty sure, so it would have to use regexp I think.Communize
You can further split elements of $array by '!', '?', etc.Potato
But you can't dynamically select which to split by.Deliquesce
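
One way to split on several terminators at once, as the comments above ask for, is preg_split() with a character class. The sketch below assumes a sentence ends at ".", "!" or "?" followed by whitespace; the function name first_sentence_split is hypothetical.

```php
<?php
// Hypothetical helper; a regex-based variant of the explode() idea.
function first_sentence_split($text) {
    // Split at whitespace that directly follows a terminator. The lookbehind
    // keeps the terminator attached to the sentence, and the limit of 2
    // stops after the first split.
    $parts = preg_split('/(?<=[.!?])\s+/', $text, 2);
    return $parts[0];
}

echo first_sentence_split('Really? Yes. Ok!'), "\n";
```

The set of terminators can be changed by editing the character class, which is harder to do with chained explode() calls.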
H
5

My previous regex seemed to work in the tester but not in actual PHP. I have edited this answer to provide full, working PHP code, and an improved regex.

$string = 'A simple test!';
var_dump(get_first_sentence($string));

$string = 'A simple test without a character to end the sentence';
var_dump(get_first_sentence($string));

$string = '... But what about me?';
var_dump(get_first_sentence($string));

$string = 'We at StackOverflow.com prefer prices below US$ 7.50. Really, we do.';
var_dump(get_first_sentence($string));

$string = 'This will probably break after this pause .... or won\'t it?';
var_dump(get_first_sentence($string));

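function get_first_sentence($string) {
    // Lazy quantifiers stop at the first terminator followed by whitespace;
    // greedy ones would run on to the last such terminator.
    $array = preg_split('/(^.*?\w+.*?[.?!]\s)/', $string, -1, PREG_SPLIT_DELIM_CAPTURE);
    // Without a match there is no $array[1], so fall back to an empty string.
    return trim($array[0] . (isset($array[1]) ? $array[1] : ''));
}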
Helminthology answered 16/7, 2009 at 14:8 Comment(3)
This doesn't appear to work actually. Did you change it since you first posted?Communize
so this not only worked now, but in the end, it actually handled my real-world problem, whereas Ian's did not... (though at first it did). As I commented there above, perhaps this is due to the fact that the results are Unicode strings... not sure, but food for thought. Thanks for the function - I'll defin. use it again and again.Communize
Just to point out that you have to add /m to the preg_split pattern to make it work with multiline sentences.Harlotry
C
3

Try this:

$content = "My name is Younas. I live on the pakistan. My email is **[email protected]** and skype name is \"**fromyounas**\". I loved to work in **IOS development** and website development . ";

$dot = ".";

// find the first dot position
$position = stripos($content, $dot);

// if there's a dot in our source text
if ($position !== false) {

    // prepare the offset so the search resumes after the first dot
    $offset = $position + 1;

    // find the second dot using the offset
    $position2 = stripos($content, $dot, $offset);

    // if there is no second dot, keep everything up to the first one
    if ($position2 === false) {
        $position2 = $position;
    }

    $result = substr($content, 0, $position2);

    // add a dot
    echo $result . '.';

}

Output is:

My name is Younas. I live on the pakistan.

Corpulence answered 29/3, 2013 at 20:49 Comment(0)
S
0

Try this:

$parts = explode('.', $s, 2); // reset() takes its argument by reference, so it needs a variable
$first = reset($parts);
Shulins answered 16/7, 2009 at 5:9 Comment(0)
B
0
current(explode(".",$input));
Brotherhood answered 16/7, 2009 at 5:11 Comment(0)
E
0

I'd probably use any of the many substring/string-split functions in PHP (some mentioned here already). But also look for ". " OR ".\n" (and possibly ".\r\n") instead of just ".", in case the sentence contains a period that isn't followed by a space. I think it will improve the likelihood of you getting genuine results.

For example, searching for just "." on:

"I like stackoverflow.com."

Will get you:

"I like stackoverflow."

When really, I'm sure you'd prefer:

"I like stackoverflow.com."

And once you have that basic search, you'll probably come across one or two occasions where it may miss something. Tune as you run with it!

Exobiology answered 16/7, 2009 at 5:19 Comment(2)
Most strings probably won't have newlines inside them.Deliquesce
I do think however that many strings (and some in my project) will have URLs... so it would be good to figure out the solution for that, though the answer accepted above is good for now.Communize
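
A rough sketch of the idea in this answer, without regular expressions: look for a period followed by a space or a line break, and cut at the earliest one. The function name first_sentence_plain is hypothetical, and only "." is treated as a terminator here.

```php
<?php
// Hypothetical helper; searches for ". ", ".\n" and ".\r\n" as the answer suggests.
function first_sentence_plain($text) {
    $cut = false;
    foreach (array(". ", ".\n", ".\r\n") as $needle) {
        $pos = strpos($text, $needle);
        // Remember the earliest position at which any needle occurs.
        if ($pos !== false && ($cut === false || $pos < $cut)) {
            $cut = $pos;
        }
    }
    // +1 keeps the period; with no match the whole string is returned.
    return $cut === false ? $text : substr($text, 0, $cut + 1);
}

echo first_sentence_plain("I like stackoverflow.com. It is useful."), "\n";
```

The period inside a domain name is skipped because it is not followed by whitespace, which covers the URL case raised in the comments.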
