Full text search in HTML ignoring tags / &

I've recently seen a lot of libraries for searching and highlighting terms within an HTML page. However, every library I saw has the same problem: they can't find text that is partly enclosed in an HTML tag, and/or they fail to find special characters written as character entities (such as &aacute;).


Example a:

<span> This is a test. This is a <b>test</b> too</span>

Searching for "a test" would find the first instance but not the second.


Example b:

<span> Pencils in spanish are called l&aacute;pices</span>

Searching for "lápices" or "lapices" would fail to produce a result.


Is there a way to circumvent these obstacles?

Thanks in Advance!

Monied answered 4/5, 2011 at 16:43 Comment(1)
Try mark.js; it has an acrossElements option.Sirmons

You can use window.find() in non-IE browsers and TextRange's findText() method in IE. Here's an example:

http://jsfiddle.net/xeSQb/6/

Unfortunately, Opera before version 15 (when it switched to the Blink rendering engine) supports neither window.find nor TextRange. If this is a concern for you, a rather heavyweight alternative is to use a combination of the TextRange and CSS class applier modules of my Rangy library, as in the following demo: http://rangy.googlecode.com/svn/trunk/demos/textrange.html

The following code improves on the fiddle above by unhighlighting the previous search results each time a new search is performed:

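function doSearch(text, color = "yellow") {
    // Before a new search, clear the previous highlights by re-running the
    // previous term with a transparent background, then remember the new term.
    if (color !== "transparent") {
        doSearch(document.getElementById('hid_search').value, "transparent");
        document.getElementById('hid_search').value = text;
    }
    if (!text) {
        return; // nothing to search for
    }
    if (window.find && window.getSelection) {
        // Non-IE browsers: switch to design mode so execCommand can
        // change the background colour of the selected match.
        document.designMode = "on";
        var sel = window.getSelection();
        // Start from the top of the document, not the current selection.
        sel.collapse(document.body, 0);

        while (window.find(text)) {
            document.execCommand("HiliteColor", false, color);
            sel.collapseToEnd();
        }
        document.designMode = "off";
    } else if (document.body.createTextRange) {
        // Old IE: walk the matches with a TextRange instead.
        var textRange = document.body.createTextRange();
        while (textRange.findText(text)) {
            textRange.execCommand("BackColor", false, color);
            textRange.collapse(false);
        }
    }
}

<input type="text" id="search">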
<input type="hidden" id="hid_search">
<input type="button" id="button" onmousedown="doSearch(document.getElementById('search').value)" value="Find">

<div id="content">
    <p>Here is some searchable text with some lápices in it, and more lápices, and some <b>for<i>mat</i>t</b>ing</p>
</div> 
Bookcase answered 4/5, 2011 at 17:53 Comment(15)
Worked like a charm. The iteration over ranges can be a bit slow if there are multiple results, but I don't think that'll be a major problem, plus it does EXACTLY what I need. Two thumbs up.Monied
One (very) minor flaw in this code is that it searches from the current cursor position onwards, so if a user highlights a piece of text and clicks the button, the search starts after the user's highlight. Ideally before the find() call there should be some kind of call to take the cursor to the top.Monied
@tim down How to remove undo highlights which are highlighted with the above code..Lugsail
@user1008575: That's harder: there's no built-in command to do that, except "RemoveFormat", which removes all inline formatting, not just the background colour.Bookcase
Opera neither supports window.find nor createTextRange (nor a findText method on document.createRange() objects)Leuctra
@Leuctra So it doesn't. I'm surprised I didn't check that. I've thought for years that it did.Bookcase
@Tim Down: Wow, thanks! I hadn't expected you had something in reserve :-)Leuctra
@TimDown Just FYI, moving the caret left by clicking the button doesn't work from the start of a line.Mosqueda
@ErikE: Thanks. There's a bug I'm working on with that. It's the major obstacle to a full Rangy 1.3 release. I intend to fix it this week.Bookcase
A little update: Opera 15 and newer supports window.find since it is on webkit.Faultless
how to clear the highlighted words though? Clear it so it is back to normal, no highlights...Vladamar
One way to clear the highlighted words is: pass the background color as a parameter. When you would like to highlight the words, pass "yellow" for example; and when you would like to clear the highlights, pass "transparent" instead.Genarogendarme
I edited the answer to propose code to unhighlight the previous search results each time a new search is performedMcdougald
@TimDown If my search appears multiple times in the document, the browser always scrolls to the last one. Is it possible to scroll to the first one instead of the last one? Ideally some implementation with arrow buttons so you can scroll one by one while pressing up/down arrow keys.Montagnard
@TimDown there is a bug in webkit browsers. If you have a contenteditable="false" attribute, some parts of text inside that element are not showing in search results.Montagnard

There are 2 problems here. One is the nested content problem, or search matches that span an element boundary. The other is HTML-escaped characters.

One way to handle the HTML-escaped characters is, if you are using jQuery for example, to use the .text() method, and run the search on that. The text that comes back from that already has the escaped characters "translated" into their real character.
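
For example, without jQuery the same decoded text is available as textContent. Here is a minimal sketch, assuming you also want an unaccented query such as "lapices" to match "lápices". The names foldAccents and textContains are made up for illustration, not part of any library:

```javascript
// Hypothetical helpers, not part of any library.

// Lower-case a string and strip combining accents, so "lápices" becomes "lapices".
function foldAccents(s) {
    return s.normalize("NFD").replace(/[\u0300-\u036f]/g, "").toLowerCase();
}

// True if `term` occurs in `text`, ignoring case and accents.
function textContains(text, term) {
    return foldAccents(text).includes(foldAccents(term));
}

// In a browser, entities are already decoded in textContent
// (the plain-DOM counterpart of jQuery's .text()):
//   var text = document.getElementById("content").textContent;
//   textContains(text, "lapices");
```

This only tells you whether the term is present; it does not say where in the DOM the match is.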

Another way to handle those special characters would be to replace the actual character (in the search string) with the escaped version. Since there are a wide variety of possibilities there, however, that could be a lengthy search depending on the implementation.
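
A rough sketch of that second approach, assuming you search the raw HTML source as a string. The function name and the entity table are hypothetical, and the table is deliberately incomplete:

```javascript
// Hypothetical, deliberately incomplete table of named entities.
const NAMED_ENTITIES = {
    "&": "&amp;", "<": "&lt;", ">": "&gt;",
    "á": "&aacute;", "é": "&eacute;", "í": "&iacute;",
    "ó": "&oacute;", "ú": "&uacute;", "ñ": "&ntilde;"
};

// Escape a search term the way the HTML source might have written it.
// Characters missing from the table fall back to a numeric reference.
function escapeForHtmlSearch(term) {
    return term.replace(/[&<>]|[^\x00-\x7f]/gu, function (ch) {
        return NAMED_ENTITIES[ch] || "&#" + ch.codePointAt(0) + ";";
    });
}

var source = "<span> Pencils in spanish are called l&aacute;pices</span>";
console.log(source.includes(escapeForHtmlSearch("lápices")));
```

This only matches when the page uses the same spelling of the entity. A page that wrote &#225; or the literal character instead would be missed, which is why you would need to try several variants.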

The same sort of "text" method can be used to find content matches that span element boundaries. It gets trickier because the extracted text doesn't have any notion of where the actual parts of the content come from, but it gives you a smaller domain to search over if you drill in. Once you are close, you can switch to a more "series of characters" sort of search rather than a word-based search.
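
One way to keep track of where the text came from is to remember where each text node starts in the joined text. This is only a sketch of that idea, with function names I made up. The first function is plain string handling; the other two assume a browser DOM:

```javascript
// Hypothetical helper: nodeTexts holds the text of each text node, in
// document order. Returns every occurrence of term in the joined text,
// mapped back to { node: index, offset } pairs.
function findAcrossNodes(nodeTexts, term) {
    var matches = [];
    if (!term) {
        return matches;
    }
    var starts = [];   // starts[i] = where node i begins in the joined text
    var joined = "";
    nodeTexts.forEach(function (t) {
        starts.push(joined.length);
        joined += t;
    });

    // A start position belongs to the node holding the character at pos;
    // an end position belongs to the node holding the character before it.
    function locate(pos, isEnd) {
        for (var i = 0; i < nodeTexts.length; i++) {
            var begin = starts[i];
            var end = begin + nodeTexts[i].length;
            if (isEnd ? (pos > begin && pos <= end) : (pos >= begin && pos < end)) {
                return { node: i, offset: pos - begin };
            }
        }
        return null;
    }

    var from = 0, hit;
    while ((hit = joined.indexOf(term, from)) !== -1) {
        matches.push({
            start: locate(hit, false),
            end: locate(hit + term.length, true)
        });
        from = hit + term.length;
    }
    return matches;
}

// Browser-only: collect the text nodes under root, in document order.
function collectTextNodes(root) {
    var walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
    var nodes = [];
    while (walker.nextNode()) {
        nodes.push(walker.currentNode);
    }
    return nodes;
}

// Browser-only: turn one match into a DOM Range that can be highlighted.
function matchToRange(nodes, match) {
    var range = document.createRange();
    range.setStart(nodes[match.start.node], match.start.offset);
    range.setEnd(nodes[match.end.node], match.end.offset);
    return range;
}

// Example a from the question, written out as its three text nodes:
var found = findAcrossNodes([" This is a test. This is a ", "test", " too"], "a test");
console.log(found.length); // logs 2
```

In a page you would pass in the nodeValue of each node returned by collectTextNodes. Note that the text of neighbouring block elements is joined with no separator here, so a match could run from the end of one paragraph into the next; that case is not handled.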

I don't know of any libraries that do this however.

Partitive answered 4/5, 2011 at 17:3 Comment(0)

To highlight search keywords on a web page and remove the highlighting again using JavaScript:

    <script>


    function highlightAll(keyWords) { 
        document.getElementById('hid_search_text').value = keyWords; 
        document.designMode = "on"; 
        var sel = window.getSelection(); 
        sel.collapse(document.body, 0);
        while (window.find(keyWords)) { 
            document.execCommand("HiliteColor", false, "yellow"); 
            sel.collapseToEnd(); 
        }
        document.designMode = "off";
        goTop(keyWords,1); 
    }

    function removeHighLight() { 
        var keyWords = document.getElementById('hid_search_text').value; 
        document.designMode = "on"; 
        var sel = window.getSelection(); 
        sel.collapse(document.body, 0);
        while (window.find(keyWords)) { 
            document.execCommand("HiliteColor", false, "transparent"); 
            sel.collapseToEnd(); 
        }
        document.designMode = "off"; 
        goTop(keyWords,0); 
    }

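    function goTop(keyWords,findFirst) { 
        // Jump back to the top of the page. The original did this as a side
        // effect of an assignment inside the if condition, which was always true.
        window.location.href = '#';
        if(findFirst) { 
            // Arguments: case-insensitive, search forwards, wrap around to the first match.
            window.find(keyWords, false, false, true);
        }
    }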
    </script>

    <style>
    #search_para {
     color:grey;
    }
    .highlight {
     background-color: #FF6; 
    }
    </style>

    <div id="wrapper">
        <input type="text" id="search_text" name="search_text"> &nbsp; 
        <input type="hidden" id="hid_search_text" name="hid_search_text"> 
        <input type="button" value="search" id="search" onclick="highlightAll(document.getElementById('search_text').value)" >  &nbsp; 
        <input type="button" value="remove" id="remove" onclick="removeHighLight()" >  &nbsp; 
        <div>
            <p id="search_para">The European languages are members of the same family. Their separate existence is a myth. For science, music, sport, etc, Europe uses the same vocabulary. The languages only differ in their grammar, their pronunciation and their most common words. Everyone realizes why a new common language would be desirable: one could refuse to pay expensive translators. To achieve this, it would be necessary to have uniform grammar, pronunciation and more common words. If several languages coalesce, the grammar of the resulting language is more simple and regular than that of the individual languages. The new common language will be more simple and regular than the existing European languages.</p>
        </div>
    </div>
Alexaalexander answered 11/6, 2019 at 8:52 Comment(0)
