Highlight keywords in a paragraph

8

2

I need to highlight a keyword in a paragraph, as Google does in its search results. Let's assume I have a MySQL database of blog posts. When a user searches for a keyword, I want to return the posts that contain it, but show only part of each post (the paragraph that contains the keyword) with the keyword highlighted.

My plan is this:

  • find the id of the post that has the searched keyword in its content;
  • read the content of that post again and put each word in a fixed buffer array (50 words) until I find the keyword.

Can you help me with some logic, or at least tell me if my logic is OK? I'm at the PHP learning stage.

Colston answered 2/11, 2010 at 19:27 Comment(4)
You don't need to use arrays to store each word, and probably shouldn't do that either.Chalky
How is your data stored? In plain text or in HTML?Coridon
Oh, and how do you want to match and highlight the match? Match whole words or subwords? Highlight whole words or subwords?Coridon
@Gumbo: it is stored as plain text. TY for your valuable replies and comments.Colston
9

If it contains html (note that this is a pretty robust solution):

$string = '<p>foo<b>bar</b></p>';
$keyword = 'foo';
$dom = new DomDocument();
$dom->loadHtml($string);
$xpath = new DomXpath($dom);
$elements = $xpath->query('//*[contains(.,"'.$keyword.'")]');
foreach ($elements as $element) {
    foreach ($element->childNodes as $child) {
        if (!$child instanceof DomText) continue;
        $fragment = $dom->createDocumentFragment();
        $text = $child->textContent;
        $stubs = array();
        while (($pos = stripos($text, $keyword)) !== false) {
            $fragment->appendChild(new DomText(substr($text, 0, $pos)));
            $word = substr($text, $pos, strlen($keyword));
            $highlight = $dom->createElement('span');
            $highlight->appendChild(new DomText($word));
            $highlight->setAttribute('class', 'highlight');
            $fragment->appendChild($highlight);
            $text = substr($text, $pos + strlen($keyword));
        }
        // compare against '' because empty() would also drop a trailing "0"
        if ($text !== '') $fragment->appendChild(new DomText($text));
        $element->replaceChild($fragment, $child);
    }
}
$string = $dom->saveXml($dom->getElementsByTagName('body')->item(0)->firstChild);

Results in:

<p><span class="highlight">foo</span><b>bar</b></p>

And with:

$string = '<body><p>foobarbaz<b>bar</b></p></body>';
$keyword = 'bar';

You get (broken onto multiple lines for readability):

<p>foo
    <span class="highlight">bar</span>
    baz
    <b>
        <span class="highlight">bar</span>
    </b>
</p>

Beware of non-dom solutions (like regex or str_replace) since highlighting something like "div" has a tendency of completely destroying your HTML... This will only ever "highlight" strings in the body, never inside of a tag...
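
To make that warning concrete, here is a small sketch (the sample markup is made up for illustration) of what a plain string replacement does when the keyword happens to be a tag name:

```php
<?php
// Made-up sample markup; the keyword "div" is also the tag name.
$html = '<div>one div</div>';

// A plain replacement cannot tell markup from text.
$broken = str_ireplace('div', '<b>div</b>', $html);

echo $broken; // <<b>div</b>>one <b>div</b></<b>div</b>>
```

The tag names get wrapped along with the text, so the result no longer parses as the original element.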


Edit: Since you want Google-style results, here's one way of doing it:

function getKeywordStubs($string, array $keywords, $maxStubSize = 10) {
    $dom = new DomDocument();
    $dom->loadHtml($string);
    $xpath = new DomXpath($dom);
    $results = array();
    $maxStubHalf = ceil($maxStubSize / 2);
    foreach ($keywords as $keyword) {
        $elements = $xpath->query('//*[contains(.,"'.$keyword.'")]');
        $replace = '<span class="highlight">'.$keyword.'</span>';
        foreach ($elements as $element) {
            $stub = $element->textContent;
            $regex = '#^.*?((\w*\W*){'.
                 $maxStubHalf.'})('.
                 preg_quote($keyword, '#').
                 ')((\w*\W*){'.
                 $maxStubHalf.'}).*?$#ims';
            $stub = preg_replace($regex, '\\1\\3\\4', $stub);
            $stub = str_ireplace($keyword, $replace, $stub);
            $results[] = $stub;
        }
    }
    $results = array_unique($results);
    return $results;
}

Ok, so what that does is return an array of matches with $maxStubSize words around it (namely up to half that number before, and half after)...

So, given a string:

<p>a whole 
    <b>bunch of</b> text 
    <a>here for</a> 
    us to foo bar baz replace out from this string
    <b>bar</b>
</p>

Calling getKeywordStubs($string, array('bar', 'bunch')) will result in:

array(4) {
  [0]=>
  string(75) "here for us to foo <span class="highlight">bar</span> baz replace out from "
  [3]=>
  string(34) "<span class="highlight">bar</span>"
  [4]=>
  string(62) "a whole <span class="highlight">bunch</span> of text here for "
  [7]=>
  string(39) "<span class="highlight">bunch</span> of"
}

So, then you could build your result blurb by sorting the list by strlen and then picking the two longest matches... (assuming php 5.3+):

usort($results, function($str1, $str2) { 
    return strlen($str2) - strlen($str1);
});
$description = implode('...', array_slice($results, 0, 2));

Which results in:

here for us to foo <span class="highlight">bar</span> baz replace out...a whole <span class="highlight">bunch</span> of text here for 

I hope that helps... (I do feel this is a bit... bloated... I'm sure there are better ways to do this, but here's one way)...

Lemuroid answered 2/11, 2010 at 19:53 Comment(4)
While a fine solution for highlighting, it does not solve the OP's question (at least if I understand it correctly). The OP wants to highlight and return a portion of the surroundings, like Google Search result excerpts.Feck
@Gordon: Edited in a very non-elegant way of doing just that.Lemuroid
@Lemuroid I get the following warning: Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity, line: 1 in /../ on line 118. Solving it with htmlentities() would usually ruin the benefits of DOM highlighting.Estis
Nice solution, although it's worth noting that it is case-sensitive despite the use of stripos, because the XPath function contains() is case-sensitive. The only way around it that I found was to test for several variants, e.g. '//*[contains(.,"'.$keyword.'") or contains(.,"'.strtolower($keyword).'") or contains(.,"'.strtoupper($keyword).'") or contains(.,"'.ucfirst($keyword).'")]')Experiential
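
Both issues raised in the last two comments can probably be handled without giving up the DOM approach. The sketch below is only an illustration: ciContainsQuery() is a hypothetical helper, it assumes ASCII keywords that contain no double quote, and it relies on libxml_use_internal_errors() to collect parser warnings instead of printing them, and on XPath 1.0's translate() to lower-case the text before comparing.

```php
<?php
// Hypothetical helper: builds a case-insensitive "contains" query.
// Assumes an ASCII keyword without double quotes.
function ciContainsQuery(string $keyword): string {
    // XPath 1.0 has no lower-case(); translate() maps A-Z to a-z instead.
    return '//*[contains(translate(., "ABCDEFGHIJKLMNOPQRSTUVWXYZ", '
         . '"abcdefghijklmnopqrstuvwxyz"), "' . strtolower($keyword) . '")]';
}

// Collect parser warnings (such as the bare "&" below) instead of printing them.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<p>Tom & FOO<b>bar</b></p>');
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
$names = array();
foreach ($xpath->query(ciContainsQuery('Foo')) as $element) {
    $names[] = $element->nodeName;
}
print_r($names); // html, body and p all contain "foo" in some letter case
```

For non-ASCII keywords the translate() alphabet would have to be extended, so treat this as a starting point rather than a complete fix.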
2

Maybe you could do something like this when you're connected to the database:

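$keyword = $_REQUEST["keyword"]; //fetch the keyword from the request
//assumes $mysqli is an open mysqli connection (the mysql_* functions were removed in PHP 7)
$stmt = $mysqli->prepare("SELECT * FROM `posts` WHERE `content` LIKE CONCAT('%', ?, '%')");
$stmt->bind_param("s", $keyword);
$stmt->execute(); //ask the database for the posttexts
$result = $stmt->get_result(); //needs the mysqlnd driver
while ($row = $result->fetch_assoc()) {//do the following for each result:
  $text = $row["content"];//we're only interested in the content at the moment
  $pos = stripos($text, $keyword); //first match, ignoring case
  if ($pos === false) continue; //nothing to cut out
  $text = substr($text, max(0, $pos - 150), 300); //cut out, never starting before the text begins
  $parts = preg_split('/('.preg_quote($keyword, '/').')/i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
  foreach ($parts as $i => $part) { //odd-numbered parts are the matched keywords
    echo $i % 2 ? '<strong>'.htmlentities($part).'</strong>' : htmlentities($part); //escape, highlight, print
  }
  echo "<hr>";//draw a line under it
}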
Haleigh answered 2/11, 2010 at 19:32 Comment(0)
2

If you wish to cut out the relevant paragraphs, then after applying the above-mentioned str_replace function you can use stripos() to find the position of these strong sections, and use an offset from that location with substr() to cut out a section of the paragraph, such as:

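// $searchterms is assumed to be an array of keywords and $paragraph the text to search
foreach ($searchterms as $search)
{
    $paragraph = str_replace($search, "<strong>$search</strong>", $paragraph);
}

$pos = 0;
$section = array();

for ($i = 0; $i < 4; $i++)
{
    $pos = stripos($paragraph, "<strong>", $pos);
    if ($pos === false) break; // no more highlighted terms
    // max() keeps the start from going negative near the beginning of the text
    $section[$i] = substr($paragraph, max(0, $pos - 100), 200);
    $pos += strlen("<strong>"); // step past this match so the next search finds the following one
}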

which will give you an array of short snippets (up to 200 characters each) to use however you wish. It may also be beneficial to search for the nearest space from the cutting locations, and cut from there to prevent half-words. Oh, and you also need to check for errors, but I'll leave that bit up to you.
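
The "nearest space" idea could look roughly like the sketch below. excerptAround() is a hypothetical helper, not part of the code above; it assumes plain text without tags and single-byte characters, and it shrinks the cut inward to the nearest spaces.

```php
<?php
// Hypothetical helper: returns about $radius characters on each side of the
// match at $pos (of length $length), trimmed so that no word is cut in half.
function excerptAround(string $text, int $pos, int $length, int $radius = 100): string {
    $start = max(0, $pos - $radius);
    $end   = min(strlen($text), $pos + $length + $radius);

    // If the cut starts inside a word, move forward to just after the next space.
    if ($start > 0 && $text[$start - 1] !== ' ') {
        $space = strpos($text, ' ', $start);
        if ($space !== false && $space < $pos) {
            $start = $space + 1;
        }
    }
    // If the cut ends inside a word, move back to the last space before it.
    if ($end < strlen($text) && $text[$end] !== ' ') {
        $space = strrpos(substr($text, 0, $end), ' ');
        if ($space !== false && $space >= $pos + $length) {
            $end = $space;
        }
    }
    return substr($text, $start, $end - $start);
}

$text = 'The quick brown fox jumps over the lazy dog';
$snippet = excerptAround($text, strpos($text, 'fox'), strlen('fox'), 8);
echo $snippet; // brown fox jumps
```

Shrinking inward keeps the snippet within the requested size; widening outward to the previous and next space would be an equally reasonable choice.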

Shaff answered 2/11, 2010 at 19:45 Comment(0)
1

You could try exploding your database search result set into an array using explode and then use array_search() on each search result. Set the $distance variable in the example below to how many words you'd like to appear on either side of the first match of the $keyword.

In the example, I've included lorem ipsum text as an example database result paragraph and set the $keyword to 'scelerisque'. You'd obviously replace these in your code.

//example paragraph text
$lorum = 'Nunc nec magna at nibh imperdiet dignissim quis eu velit. 
vel mattis odio rutrum nec. Etiam sit amet tortor nibh, molestie 
vestibulum tortor. Integer condimentum magna dictum purus vehicula 
et scelerisque mauris viverra. Nullam in lorem erat. Ut dolor libero, 
tristique et pellentesque sed, mattis eget dui. Cum sociis natoque 
penatibus et magnis dis parturient montes, nascetur ridiculus mus. 
.';

//turn paragraph into array
$ipsum = explode(' ',$lorum);
//set keyword
$keyword = 'scelerisque';
//set excerpt distance
$distance = 10;

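//look for keyword in paragraph array, return array key of first match
$match_key = array_search($keyword,$ipsum);
//default to an empty excerpt so the echo below is safe when nothing matches
$excerpt = '';

//array_search() returns false when nothing is found; a strict check keeps a match at key 0 valid
if($match_key !== false){
    //collect the words of the excerpt here
    $excerpt = array();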

    foreach($ipsum as $key=>$value){
        //if paragraph array key inside excerpt distance
        if($key > $match_key-$distance and $key< $match_key+$distance){ 
            //if array key matches keyword key, bold the word
            if($key == $match_key){
                $word = '<b>'.$value.'</b>';
                }
            else{
                $word = $value;
                }
            //create excerpt array to hold words within distance
            $excerpt[] = $word;
            }

        }
    //turn excerpt array into a string
    $excerpt = implode(' ',$excerpt);
    }
//print the string
echo $excerpt;

$excerpt returns: "vestibulum tortor. Integer condimentum magna dictum purus vehicula et scelerisque mauris viverra. Nullam in lorem erat. Ut dolor libero,"
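
The same window can also be taken in one step with array_slice(). The following is a sketch under a few assumptions: wordExcerpt() is a hypothetical helper, it splits on any whitespace (so line breaks do not glue words together), and it ignores letter case and surrounding punctuation when comparing.

```php
<?php
// Hypothetical helper: returns up to $distance words on each side of the
// first word equal to $keyword, with that word wrapped in <b> tags.
function wordExcerpt(string $text, string $keyword, int $distance = 10): string {
    // Split on any run of whitespace, including line breaks.
    $words = preg_split('/\s+/', trim($text));
    foreach ($words as $i => $word) {
        // Ignore surrounding punctuation and letter case when comparing.
        if (strcasecmp(trim($word, '.,;:!?"\'()'), $keyword) === 0) {
            $start = max(0, $i - $distance);
            $words[$i] = '<b>' . $word . '</b>';
            return implode(' ', array_slice($words, $start, $i - $start + $distance + 1));
        }
    }
    return ''; // keyword not found
}

echo wordExcerpt('a b c d e f g', 'D', 2); // b c <b>d</b> e f
```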

Tambourin answered 2/11, 2010 at 20:31 Comment(0)
1

Here’s a solution for plain text:

$str = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.';
$keywords = array('co');
$wordspan = 5;
$keywordsPattern = implode('|', array_map(function($val) { return preg_quote($val, '/'); }, $keywords));
$matches = preg_split("/($keywordsPattern)/ui", $str, -1, PREG_SPLIT_DELIM_CAPTURE);
for ($i = 0, $n = count($matches); $i < $n; ++$i) {
    if ($i % 2 == 0) {
        $words = preg_split('/(\s+)/u', $matches[$i], -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($words) > ($wordspan+1)*2) {
            $matches[$i] = '…';
            if ($i > 0) {
                $matches[$i] = implode('', array_slice($words, 0, ($wordspan+1)*2)) . $matches[$i];
            }
            if ($i < $n-1) {
                $matches[$i] .= implode('', array_slice($words, -($wordspan+1)*2));
            }
        }
    } else {
        $matches[$i] = '<b>'.$matches[$i].'</b>';
    }
}
echo implode('', $matches);

With the current pattern "/($keywordsPattern)/ui" subwords are matched and highlighted. But you can change that if you want to:

  • If you want to match only whole words and not just subwords, use word boundaries \b:

    "/\b($keywordsPattern)\b/ui"
    
  • If you want to match subwords but highlight the whole word, put optional word characters \w before and after the keywords:

    "/(\w*?(?:$keywordsPattern)\w*)/ui"
    
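
To compare the three patterns side by side, here is a small sketch on a made-up sample string, using preg_replace() instead of the splitting loop above purely to keep it short:

```php
<?php
$text = 'eco code co';                      // made-up sample text
$keywordsPattern = preg_quote('co', '/');

// Subwords are matched and highlighted.
$subwords   = preg_replace("/($keywordsPattern)/ui", '<b>$1</b>', $text);
// Only whole words are matched.
$wholeWords = preg_replace("/\b($keywordsPattern)\b/ui", '<b>$1</b>', $text);
// Subwords are matched, but the whole surrounding word is highlighted.
$expanded   = preg_replace("/(\w*?(?:$keywordsPattern)\w*)/ui", '<b>$1</b>', $text);

echo $subwords, "\n";   // e<b>co</b> <b>co</b>de <b>co</b>
echo $wholeWords, "\n"; // eco code <b>co</b>
echo $expanded, "\n";   // <b>eco</b> <b>code</b> <b>co</b>
```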
Coridon answered 3/11, 2010 at 9:31 Comment(0)
1

I found this post when doing a search for how to highlight keyword search results. My requirements were:

  • Must be whole words
  • Must work for more than one keyword
  • Must be PHP only

I am fetching my data from a MySQL database, which doesn't contain HTML elements, by design of the form which stores the data.

Here is the code I found most useful:

$keywords = array("fox","jump","quick");
$string = "The quick brown fox jumps over the lazy dog";
$test = "The quick brown fox jumps over the lazy dog"; // used to compare values at the end.

if(isset($keywords)) // For keyword search this will highlight all keywords in the results.
    {
    foreach($keywords as $word)
        {
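        // preg_quote() keeps keywords containing regex characters from breaking the pattern
        $pattern = "/\b".preg_quote($word, "/")."\b/i";
        // $0 re-inserts the matched text, so the original capitalisation is kept
        $string = preg_replace($pattern,"<span class=\"highlight\">$0</span>", $string);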
        }
    }
 // We must compare the original string to the string altered in the loop to avoid having a string printed with no matches.
if($string === $test)
    {
    echo "No match";
    }
else
    {
    echo $string;
    }

Output:

The <span class="highlight">quick</span> brown <span class="highlight">fox</span> jumps over the lazy dog

I hope this helps someone.

Vigilantism answered 30/10, 2014 at 7:12 Comment(1)
To my understanding, this will fail requirement #1. If you happen to include "row" in your list of keywords, partial matches like in "brown" will also be highlighted. It depends on your concept of "whole words". In the keywords? In the text you are looking into? In both places?Engine
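
The concern in the comment can be checked directly. In this small test, using the same sentence as the answer, the \b anchors appear to prevent a match inside a longer word, because there is no word boundary between two word characters:

```php
<?php
$text = 'The quick brown fox jumps over the lazy dog';

// "row" sits between two word characters inside "brown", so \b cannot match there.
$insideWord = preg_match('/\brow\b/i', $text);
// "fox" stands alone, so both boundaries match.
$wholeWord = preg_match('/\bfox\b/i', $text);
// "jump" is followed by "s", so the closing \b fails.
$prefix = preg_match('/\bjump\b/i', $text);

var_dump($insideWord, $wholeWord, $prefix); // int(0) int(1) int(0)
```

Note that \b treats only letters, digits and the underscore as word characters, so keywords containing hyphens or apostrophes may behave differently.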
0

If you're a beginner, this will not be as easy as one might think...

I think you should do the following steps:

  1. build a query based on what the user searched for (beware of SQL injection)
  2. fetch the results and organize them (an array should be fine)
  3. build the HTML code from that array

In the third step you can use some regular expression to replace the user searched keywords with a bolded equivalent. str_replace could work too...
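
As a rough sketch of steps 2 and 3: the $rows array below stands in for whatever the query returns (ideally fetched through a prepared statement, which takes care of the injection concern in step 1), and highlightEscaped() is a hypothetical helper. It splits the text around the matches first, so the keyword can never be found inside an HTML entity created by the escaping.

```php
<?php
// Hypothetical helper: escapes $text for HTML output and wraps every
// case-insensitive occurrence of $keyword in <b> tags.
function highlightEscaped(string $text, string $keyword): string {
    if ($keyword === '') {
        return htmlspecialchars($text);
    }
    // Splitting on a captured group keeps the matches: odd indexes are keywords.
    $parts = preg_split('/(' . preg_quote($keyword, '/') . ')/i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    $html = '';
    foreach ($parts as $i => $part) {
        $part = htmlspecialchars($part);
        $html .= ($i % 2 === 1) ? '<b>' . $part . '</b>' : $part;
    }
    return $html;
}

// Step 2 stand-in: rows as they might come back from the database.
$rows = array(
    array('content' => 'Foo & bar'),
    array('content' => 'no food here'),
);

// Step 3: build the HTML from the array.
$output = '';
foreach ($rows as $row) {
    $output .= '<p>' . highlightEscaped($row['content'], 'foo') . "</p>\n";
}
echo $output;
```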

I hope this helps... If you could provide your database structure maybe I can give you some more precise hints...

Pagan answered 2/11, 2010 at 19:36 Comment(0)
0

Browsers have a native API for doing this client-side: the CSS Custom Highlight API. It requires a bit of JavaScript, but with a third-party library like highlight-search-term, it's a one-liner:

<script type="module">
import { highlightSearchTerm } from "https://cdn.jsdelivr.net/npm/[email protected]/src/index.js";
highlightSearchTerm({ search: 'KEYWORD',  selector: ".content" });
</script>

Put the above snippet at the end of the page body, replacing KEYWORD with your search keyword and .content with the CSS selector of the page element(s) where you want to highlight words.

More examples at https://www.npmjs.com/package/highlight-search-term

Gazelle answered 24/4 at 13:12 Comment(0)
