strpos() with multiple needles? [duplicate]
Asked Answered
G

6

25

I am looking for a function like strpos() with two significant differences:

  1. To be able to accept multiple needles. I mean thousands of needles at ones.
  2. To search for all occurrences of the needles in the haystack and to return an array of starting positions.

Of course it has to be an efficient solution not just a loop through every needle. I have searched through this forum and there were similar questions to this one, like:

but nether of them was what I am looking for. I am using strpos just to illustrate my question better, probably something entirely different has to be used for this purpose.

I am aware of Zend_Search_Lucene and I am interested if it can be used to achieve this and how (just the general idea)?

Thanks a lot for Your help and time!

Graphemics answered 1/8, 2011 at 9:38 Comment(5)
What kind of the data are you processing? What are possible needle values? Are you e.g. looking for whole words or subsequences?Korney
what's the overall objective?Pleader
This is for my PhD thesis. I have to find all Named Entities in a text. For example, let us suppose that I have a dictionary of all countries and another one of many cities. I would like to search given text against those dictionaries.Graphemics
Do you really want to do this with PHP? Python has much more powerful string processing features. Maybe you can use the Natural Language Toolkit: nltk.orgKorney
I thinks php is powerful enough to do this. I know that languages like Perl and Python are capable of doing this but why php has to in the background.Graphemics
E
10

try preg match for multiple

if (preg_match('/word|word2/i', $str))

Checking for multiple strpos values

Encomium answered 14/1, 2016 at 19:32 Comment(0)
C
7

Here's some sample code for my strategy:

function strpos_array($haystack, $needles, $offset=0) {
    $matches = array();

    //Avoid the obvious: when haystack or needles are empty, return no matches
    if(empty($needles) || empty($haystack)) {
        return $matches;
    }

    $haystack = (string)$haystack; //Pre-cast non-string haystacks
    $haylen = strlen($haystack);

    //Allow negative (from end of haystack) offsets
    if($offset < 0) {
        $offset += $heylen;
    }

    //Use strpos if there is no array or only one needle
    if(!is_array($needles)) {
        $needles = array($needles);
    }

    $needles = array_unique($needles); //Not necessary if you are sure all needles are unique

    //Precalculate needle lengths to save time
    foreach($needles as &$origNeedle) {
        $origNeedle = array((string)$origNeedle, strlen($origNeedle));
    }

    //Find matches
    for(; $offset < $haylen; $offset++) {
        foreach($needles as $needle) {
            list($needle, $length) = $needle;
            if($needle == substr($haystack, $offset, $length)) {
                $matches[] = $offset;
                break;
            }
        }
    }

    return($matches);
}

I've implemented a simple brute force method above that will work with any combination of needles and haystacks (not just words). For possibly faster algorithms check out:


Other Solution

function strpos_array($haystack, $needles, $theOffset=0) {
    $matches = array();

    if(empty($haystack) || empty($needles)) {
        return $matches;
    }

    $haylen = strlen($haystack);

    if($theOffset < 0) {  // Support negative offsets
        $theOffest += $haylen;
    }

    foreach($needles as $needle) {
        $needlelen = strlen($needle);
        $offset = $theOffset;

        while(($match = strpos($haystack, $needle, $offset)) !== false) {
            $matches[] = $match;
            $offset = $match + $needlelen;
            if($offset >= $haylen) {
                break;
            }
        }
    }

    return $matches;
}
Comical answered 1/8, 2011 at 9:56 Comment(9)
Good idea! But I am worried about the algorithm's complicity. For 10k needles and 10k chars in the haystack it makes 100M calls of the substr() function. Anyway I will check this out and let you know about the result. Thanks!Graphemics
@Nikola Obreshkov substr() is internal so it's fairly fast. My function does all the optimizations it can (precalculating needle length and pre-casting needles to strings). The other option I contemplated on was much less efficient than this, but I'll keep thinking. If you want to save some time and you know all needles will be unique you can remove the array_unique() near the top.Comical
thanks for your effort in this question. I thinks that ` //Use strpos if there is no array or only one needle if(!is_array($needles) || (count($needles) == 1 && ($needles = $needle[0] || true))) { return strpos($haystack, $needles, $offset); } ` has to be omitted because it will return only the first occurrence.Graphemics
@Nikola Obreshkov Ommitted. I didn't consider that that bit would only return the first occurance. Nice catch! How are the benchmarks going?Comical
It might be worth giving this research paper on multi-pattern matching a look-over: bio.informatics.indiana.edu/sunkim/c/s06/I690/aomtp.pdfComical
@PhoMyCoder I have complied a dictionary of all countries (taken from Wikipedia) which are the needles and using Olympic Games Boycotts article as a haystack. However I am receiving zero matches from your code. Obviously there is something I am doing wrong. All this is done in Bulgarian with is my native language with UTF-8 encoding. I just need some more time.Graphemics
@PhpMyCoder let us continue this discussion in chatGraphemics
I've tested the function on my own data and for some reason it only partially works (it worked for 2 of 3 of my needles). Let me do some testing to see why.Comical
I've discovered the problem. By reusing $needle I had reference issues. I've edited my code to fix this.Comical
G
2

I know this doesn't answer the OP's question but wanted to comment since this page is at the top of Google for strpos with multiple needles. Here's a simple solution to do so (again, this isn't specific to the OP's question - sorry):

    $img_formats = array('.jpg','.png');
    $missing = array();
    foreach ( $img_formats as $format )
        if ( stripos($post['timer_background_image'], $format) === false ) $missing[] = $format;
    if (count($missing) == 2)
        return array("save_data"=>$post,"error"=>array("message"=>"The background image must be in a .jpg or .png format.","field"=>"timer_background_image"));

If 2 items are added to the $missing array that means that the input doesn't satisfy any of the image formats in the $img_formats array. At that point you know that you can return an error, etc. This could easily be turned into a little function:

    function m_stripos( $haystack = null, $needles = array() ){
        //return early if missing arguments 
        if ( !$needles || !$haystack ) return false; 
        // create an array to evaluate at the end
        $missing = array(); 
        //Loop through needles array, and add to $missing array if not satisfied
        foreach ( $needles as $needle )
            if ( stripos($haystack, $needle) === false ) $missing[] = $needle;
        //If the count of $missing and $needles is equal, we know there were no matches, return false..
        if (count($missing) == count($needles)) return false; 
        //If we're here, be happy, return true...
        return true;
    }

Back to our first example using then the function instead:

    $needles = array('.jpg','.png');
    if ( !m_strpos( $post['timer_background_image'], $needles ) )
        return array("save_data"=>$post,"error"=>array("message"=>"The background image must be in a .jpg or .png format.","field"=>"timer_background_image"));

Of course, what you do after the function returns true or false is up to you.

Gyral answered 29/4, 2013 at 9:42 Comment(0)
K
1

It seems you are searching for whole words. In this case, something like this might help. As it uses built-in functions, it should be faster than custom code, but you have to profile it:

$words = str_word_count($str, 2);

$word_position_map = array();

foreach($words as $position => $word) {
    if(!isset($word_position_map[$word])) {
        $word_position_map[$word] = array();
    }
    $word_position_map[$word][] = $position;
}

// assuming $needles is an array of words
$result = array_intersect_key($word_position_map, array_flip($needles));

Storing the information (like the needles) in the right format will improve the runtime ( e.g. as you don't have to call array_flip).

Note from the str_word_count documentation:

For the purpose of this function, 'word' is defined as a locale dependent string containing alphabetic characters, which also may contain, but not start with "'" and "-" characters.

So make sure you set the locale right.

Korney answered 1/8, 2011 at 10:21 Comment(4)
Thanks Felix, I wasn't aware of str_word_count(). I will prepare a time test for the two above mentioned ideas. I thinks that your code will be faster.Graphemics
Do you know some way to make str_word_count() to work with uft-8? I can make a similar function to do this, but then there will be no real comparison.Graphemics
@Nikola: It should work if you set the locale right... have a look at php.net/manual/en/function.setlocale.php.Korney
According to the comments str_word_count() is not compatible with uft-8 and some workarounds are suggested php.net/manual/en/function.str-word-count.php#85579 based on preg_match_all()Graphemics
G
0

You could use a regular expression, they support OR operations. This would however make it fairly slow, compared to strpos.

Grassy answered 1/8, 2011 at 9:42 Comment(3)
Yes, using regexp will make it very slow especially for a big number of needles.Graphemics
What if you just call strpos multiple times?Grassy
You mean thousands of times? And how to find all occurrences not just the first one?Graphemics
U
0

How about a simple solution using array_map()?

$string = 'one two three four';
$needles = array( 'five' , 'three' );
$strpos_arr = array_map( function ( $check ) use ( $string ) {
    return strpos( $string, $check );
}, $needles );

As return, you're going to have an array where the keys are the needles positions and the values are the starting positions, if found.

//print_r( $strpos_arr );
Array
(
    [0] => 
    [1] => 8
)
Unceasing answered 21/4, 2021 at 23:29 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.