Create array of words from a string of text

Asked 26/4, 2009 at 10:21 Answered 10/10, 2012 at 0:23

I would like to split a text into single words using PHP. Do you have any idea how to achieve this?

My approach:

function tokenizer($text) {
    $text = trim(strtolower($text));
    $punctuation = '/[^a-z0-9äöüß-]/';
    $result = preg_split($punctuation, $text, -1, PREG_SPLIT_NO_EMPTY);
    for ($i = 0; $i < count($result); $i++) {
        $result[$i] = trim($result[$i]);
    }
    return $result; // contains the single words
}
$text = 'This is an example text, it contains commas and full-stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
print_r(tokenizer($text));

Is this a good approach? Do you have any idea for improvement?

Thanks in advance!

Fanny answered 26/4, 2009 at 10:21 Comment(0)

Use the class \p{P}, which matches any Unicode punctuation character, combined with the \s whitespace class. The u modifier makes PCRE treat both the pattern and the subject as UTF-8.

$result = preg_split('/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u', $text, -1, PREG_SPLIT_NO_EMPTY);

This splits on a group of one or more whitespace characters and also absorbs any punctuation characters surrounding it. It also matches punctuation characters at the beginning or end of the string. This distinguishes between cases such as "don't", where the apostrophe stays part of the word, and "he said 'ouch!'", where the quote marks and the exclamation mark are dropped.
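A minimal sketch of how this split behaves on a mixed English/German sentence. The function name split_words and the sample sentence are made up for illustration, and the u modifier is assumed so that the subject is read as UTF-8:

```php
<?php
// Hypothetical wrapper around the pattern from this answer.
// The "u" modifier is an assumption: it makes PCRE read the subject as UTF-8.
function split_words(string $text): array
{
    $pattern = '/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u';

    // PREG_SPLIT_NO_EMPTY drops the empty strings produced at the string edges.
    return preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Punctuation next to whitespace or at the string edges is removed,
// while the apostrophe inside "don't" survives.
print_r(split_words("He said 'ouch!' and I don't like Grüße."));
```

Because punctuation is only consumed next to whitespace or at the ends of the string, hyphenated words such as "full-stops" also stay in one piece.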

Budde answered 26/4, 2009 at 10:24 Comment(10)

+1, not sure, tho, how this will deal with äöüß. Does regex normally classify äöüß as word characters? – Carboy 26/4, 2009 at 10:28

Thank you. This would probably work for English texts, but I also want to extract German umlauts (ä, ö, ü), the "ß" and numbers in a string. The "\W" wouldn't extract "Fri3nd", would it? – Fanny 26/4, 2009 at 10:31

Seems it does not, but updated answer with something similar that works. – Budde 26/4, 2009 at 10:34

Updated answer works with perl (which php regex are based on): $ echo "äöüß, test" | perl -e 'while (<>) { if (/([\p{P}\s]+)/) { print "$1\n"; } }' , – Budde 26/4, 2009 at 10:37

Should one split don't into don and t? – Holleran 26/4, 2009 at 10:59

Updated it to handle such a case :) – Budde 26/4, 2009 at 11:20

Thanks, marcog, it works perfectly! But is it really better than my updated code above? Actually, what is the difference between our approaches? Is one faster than the other one? – Fanny 26/4, 2009 at 11:39

In your approach you're specifying the non-punctuation characters. You will therefore be missing some cases, e.g. á. Why try to specify them manually when the whole set of Unicode punctuation characters has already been defined? And as eed3si9n pointed out with my original answer, yours will break up words such as don't. – Budde 26/4, 2009 at 12:24

@marcog Any idea what would be the Javascript equivalent of this? I tried doing str.split(/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/); but it doesn't work. Please help.. – Expellee 20/4, 2018 at 11:9

What should I do if I want to remove all numbers? – Cowpuncher 31/8, 2018 at 6:52

Tokenize - strtok.

<?php
$text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
$delim = " \n\t,.!?:;"; // double quotes, so that \n and \t are parsed as newline and tab

$tok = strtok($text, $delim);

while ($tok !== false) {
    echo "Word=$tok<br />";
    $tok = strtok($delim);
}
?>

Holleran answered 26/4, 2009 at 10:23 Comment(4)

This won't work if you get a : or ; or any other punctuation character you haven't accounted for. – Budde 26/4, 2009 at 10:41

@marcog, I added : and ;. Doesn't {P} catch apostrophe and hyphen? – Holleran 26/4, 2009 at 10:57

What about cases such as quoting? My updated answer discriminates between these cases. – Budde 26/4, 2009 at 11:23

Excellent idea. Added +1. The only thing is that there should be double quotes around $delim = " \n\t,.!?:;"; With the single quotes it does not work correctly, it splits by the letter n too. – Ledet 2/12, 2017 at 18:33
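The quoting issue raised in this comment can be checked with a short sketch. The variable names $single and $double are only for illustration:

```php
<?php
// In single quotes, \n and \t stay as two literal characters each
// (a backslash followed by a letter); in double quotes they are escapes.
var_dump(strlen('\n')); // int(2)
var_dump(strlen("\n")); // int(1)

$single = ' \n\t,.!?:;'; // delimiters include the letters "n" and "t" and a backslash
$double = " \n\t,.!?:;"; // delimiters include a real newline and a real tab

// With the single-quoted list, strtok() cuts the word at its first "n".
var_dump(strtok('contains', $single)); // string(2) "co"
var_dump(strtok('contains', $double)); // string(8) "contains"
```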

I would first convert the string to lowercase before splitting it up. That would make the i modifier and the array processing afterwards unnecessary. Additionally, I would use the \W shorthand for non-word characters and add a + quantifier.

$text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
$result = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

Edit: Use the Unicode character properties instead of \W, as marcog suggested. Something like [\p{P}\p{Z}] (punctuation and separator characters) would cover the characters more specifically than \W.
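A rough sketch of what the Unicode-property variant could look like. The function name is hypothetical, and two things are assumptions added here: the u modifier, and \s in the character class, since \p{Z} does not cover tabs or newlines:

```php
<?php
// Hypothetical helper: split on runs of punctuation (\p{P}),
// separators (\p{Z}) and whitespace (\s, added as an assumption).
function split_on_unicode_separators(string $text): array
{
    return preg_split('/[\p{P}\p{Z}\s]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Umlauts and digits are kept inside the words.
print_r(split_on_unicode_separators("Äpfel, Öl\tund Fri3nd!"));
```

Note that, unlike a pattern that only strips punctuation next to whitespace, this one treats every punctuation character as a separator, so "don't" is split into "don" and "t".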

Ernaldus answered 26/4, 2009 at 10:35 Comment(2)

Thanks, the idea to perform strtolower() before is very good. I'll use this. – Fanny 26/4, 2009 at 10:40

What purpose does strtolower() serve if you are splitting with \W? Do you want to add a u pattern modifier? A note to researchers... \W will not match an underscore. – Modify 13/9, 2020 at 2:37

Do:

str_word_count($text, 1);

Or if you need unicode support:

function str_word_count_Helper($string, $format = 0, $search = null)
{
    $result = array();
    $matches = array();

    // Cast to string: passing null to preg_quote() is deprecated as of PHP 8.1.
    if (preg_match_all('~[\p{L}\p{Mn}\p{Pd}\'\x{2019}' . preg_quote((string) $search, '~') . ']+~u', $string, $matches) > 0)
    {
        $result = $matches[0];
    }

    if ($format == 0)
    {
        return count($result);
    }

    return $result;
}

Oleviaolfaction answered 26/4, 2009 at 10:24 Comment(4)

Thanks but this wouldn't work. "Fri3nd" wouldn't be extracted but it should. – Fanny 26/4, 2009 at 10:29

I don't understand why "Fri3nd" should be extracted. Removed from the array, broken down into "Fri3" and "nd" (or similar)? O.o – Chokefull 26/4, 2009 at 11:7

If you want to consider numbers as words just do str_word_count_Helper($string, 1, '0123456789'); – Oleviaolfaction 26/4, 2009 at 11:56

Native PHP functions that allow double-dot range syntax demonstrates that str_word_count($string, 1, '0..9') will do. – Modify 15/9, 2023 at 22:52
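As a small illustration of the third argument discussed in these comments, the following sketch compares the built-in function with and without the extra characters. The sample sentence is made up, and it is kept ASCII-only because str_word_count() is not Unicode-aware:

```php
<?php
$text = 'My Fri3nd has 2 cats.';

// Without extra characters, a digit ends the current word and is dropped.
print_r(str_word_count($text, 1));

// '0..9' adds the digits 0 through 9 to the set of word characters.
print_r(str_word_count($text, 1, '0..9'));
```

In the first call "Fri3nd" comes back as "Fri" and "nd"; in the second it stays whole and the standalone "2" is returned as a word too.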

You can also use PHP's strtok() function to fetch tokens from a large string. You can use it like this:

 $result = array();
 // your original string
 $text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
 // you pass strtok() your string and a delimiter that specifies how tokens are separated; here, words are separated by a space.
 $word = strtok($text,' ');
 while ( $word !== false ) {
     $result[] = $word;
     $word = strtok(' ');
 }

See the PHP documentation for strtok() for more details.

Polled answered 26/4, 2009 at 10:29 Comment(2)

what is the difference between this and explode(' ', $text); – Vacation 24/6, 2015 at 11:47

The code sample in the question is a tokenizer, my answer was implying that PHP has a string tokenizer built-in. Also explode() will return all of the words of the text at once, but using strtok() the caller has the choice to stop searching for words in the text, as soon as a desired condition is met. Other than this, I can't think of any other difference. – Polled 25/6, 2015 at 3:46
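The early-exit difference described in this comment could look roughly like the sketch below. The helper name first_long_word and its length condition are invented for illustration:

```php
<?php
// Hypothetical helper: return the first space-separated token that is
// longer than $minLength bytes, or null if there is none.
function first_long_word(string $text, int $minLength): ?string
{
    $word = strtok($text, ' ');
    while ($word !== false) {
        if (strlen($word) > $minLength) {
            return $word; // stop here; the rest of the text is never tokenized
        }
        $word = strtok(' ');
    }

    return null;
}

echo first_long_word('This is an example text', 4), "\n"; // prints "example"
```

With explode(' ', $text) the whole array would be built first, even if only the first matching word is needed.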

You can also use the explode() function: http://php.net/manual/en/function.explode.php

$words = explode(" ", $sentence);

Laubin answered 10/10, 2012 at 0:23 Comment(1)

This does not work with two or more consecutive spaces. You have to loop over explode(" ", $sentence) with a foreach and put if($word == "") continue; inside it, so that empty words are skipped. – Bedpost 6/4, 2017 at 13:55
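The problem with consecutive spaces, and two possible ways around it, can be sketched as follows. The sample sentence and variable names are made up, and preg_split() is an alternative technique, not what the answer above uses:

```php
<?php
$sentence = 'two  spaces   here';

// explode() yields an empty string for every extra space.
$raw = explode(' ', $sentence);

// Option 1: filter the empty strings out afterwards.
// The explicit callback keeps a literal "0", which the default array_filter() would drop.
$filtered = array_values(array_filter($raw, static function ($word) {
    return $word !== '';
}));

// Option 2: split on runs of spaces and discard empty pieces directly.
$words = preg_split('/ +/', $sentence, -1, PREG_SPLIT_NO_EMPTY);

print_r($raw);
print_r($filtered);
print_r($words);
```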
